
Tokenizer tests and TokenizeLine updates #11133

Merged: 16 commits from paul1r/bloom_updates into main, Nov 8, 2023
Conversation

@paul1r (Collaborator) commented Nov 3, 2023

What this PR does / why we need it:
The thrust of this PR is to ensure we have tests for each major function of the Bloom Tokenizer. In addition, some cleanup was done: constants are now used to set common parameters.

Lastly, the TokenizeLine() call was updated to correctly tokenize a line when a "skip tokenizer" is utilized.
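For context, a minimal sketch of what skip-based n-gram tokenization looks like. The function and parameter names here are illustrative, not Loki's actual API:

```go
package main

import "fmt"

// skipNGrams is a hypothetical sketch of a skip tokenizer: it emits
// n-grams of length n, advancing the window start by skip+1 runes
// each step. It is not Loki's implementation.
func skipNGrams(line string, n, skip int) []string {
	runes := []rune(line) // index by runes, not bytes
	var out []string
	for i := 0; i+n <= len(runes); i += skip + 1 {
		out = append(out, string(runes[i:i+n]))
	}
	return out
}

func main() {
	fmt.Println(skipNGrams("abcdef", 3, 1)) // [abc cde]
}
```

With skip=0 this degenerates to ordinary sliding n-grams; a larger skip trades recall for fewer tokens per line.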

Which issue(s) this PR fixes:
Fixes #

Special notes for your reviewer:

Checklist

  • Reviewed the CONTRIBUTING.md guide (required)
  • Documentation added
  • Tests updated
  • CHANGELOG.md updated
    • If the change is worth mentioning in the release notes, add the add-to-release-notes label
  • Changes that require user attention or interaction to upgrade are documented in docs/sources/setup/upgrade/_index.md
  • For Helm chart changes bump the Helm chart version in production/helm/loki/Chart.yaml and update production/helm/loki/CHANGELOG.md and production/helm/loki/README.md. Example PR
  • If the change is deprecating or removing a configuration option, update the deprecated-config.yaml and deleted-config.yaml files respectively in the tools/deprecated-config-checker directory. Example PR

@paul1r paul1r requested a review from a team as a code owner November 3, 2023 18:05
@poyzannur (Contributor) left a comment

LGTM with a minor question on the use of the two methods you added.
Am I correct to assume TokenizeLineWithChunkPrefix should be used when writing blooms, whereas TokenizeLine should be used to tokenize search queries before querying the blooms?

@paul1r (Collaborator, Author) commented Nov 8, 2023

> LGTM with a minor question on the use of the two methods you added. Am I correct to assume TokenizeLineWithChunkPrefix should be used when writing blooms, whereas TokenizeLine should be used to tokenize search queries before querying the blooms?

PopulateSeriesWithBloom would be used when writing blooms. TokenizeLine is just to get a quick set of tokens, for the quick sniff test of whether we need to dig into a chunk or not. TokenizeLineWithChunkPrefix is to validate that those tokens exist for a specific chunk. This should be more obvious once we start wiring this together.

@owen-d previously requested changes Nov 8, 2023
Review threads (outdated, resolved) on pkg/storage/bloom/v1/bloom_tokenizer.go and pkg/storage/bloom/v1/bloom_tokenizer_test.go.
@owen-d dismissed their stale review November 8, 2023 19:23

Realized this is for reads, not writes.

// If the tokenizer has a skip value, then the line will be tokenized multiple times,
// starting at the beginning of the line, with "skip" number of iterations, offset by one each time
// Each offset is kept as a separate slice of tokens, and all are returned in a slice of slices
func (bt *BloomTokenizer) TokenizeLineWithChunkPrefix(line string, chk logproto.ChunkRef) [][]Token {
@owen-d (Member) commented Nov 8, 2023
There's no need for two functions here -- you can just use something like the following and apply it to any tokenizer (chunk_prefix_tokenizer or regular)

func SearchesForTokenizerAndLine(t Tokenizer, line string) (res [][]Token) {
  for i := 0; i < t.Skip()+1; i++ {
    res = append(res, t.Tokens(line[i:])) // this needs to account for runes vs bytes, but you get the idea 
  }
  return
}
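A rune-aware variant of the suggestion above could look like the following sketch. The Tokenizer interface and fixedTokenizer type here are stand-ins for illustration, not Loki's actual types:

```go
package main

import "fmt"

// Tokenizer is a stand-in interface mirroring the shape implied by the
// review comment; Loki's real interface may differ.
type Tokenizer interface {
	Skip() int
	Tokens(line string) []string
}

// fixedTokenizer emits n-grams of length n with the given skip.
type fixedTokenizer struct{ n, skip int }

func (t fixedTokenizer) Skip() int { return t.skip }

func (t fixedTokenizer) Tokens(line string) []string {
	runes := []rune(line)
	var out []string
	for i := 0; i+t.n <= len(runes); i += t.skip + 1 {
		out = append(out, string(runes[i:i+t.n]))
	}
	return out
}

// SearchesForTokenizerAndLine builds one token slice per offset
// 0..skip, slicing the line by runes rather than bytes.
func SearchesForTokenizerAndLine(t Tokenizer, line string) (res [][]string) {
	runes := []rune(line)
	for i := 0; i < t.Skip()+1 && i < len(runes); i++ {
		res = append(res, t.Tokens(string(runes[i:])))
	}
	return
}

func main() {
	fmt.Println(SearchesForTokenizerAndLine(fixedTokenizer{n: 3, skip: 1}, "abcdef"))
}
```

Converting to []rune once per call keeps the offset arithmetic correct for multi-byte characters, at the cost of one allocation per line.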

func (bt *BloomTokenizer) TokenizeLineWithChunkPrefix(line string, chk logproto.ChunkRef) [][]Token {
allTokens := make([][]Token, 0, 10)

if len(line) >= bt.chunkIDTokenizer.GetMin() && len(line) >= bt.chunkIDTokenizer.GetSkip() {
Two things:

  • This actually needs to ensure the length is >= min + skip.
  • len(str) doesn't return the number of runes, but the number of bytes. We need to account for runes since that's how we index. See this for more detail.

Comment on lines 124 to 125
// This is a multi-dimensional slice where the first slice is the offset into the line, and the
// second slice is the tokens for that offset.
This is only true if all of the skip offsets return at least one token. Otherwise, the length of the result will be less than the number of skips+1.
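A small example of this edge case, using an illustrative tokenizer rather than Loki's implementation: if offsets that yield no tokens are dropped, a short line produces fewer than skip+1 entries:

```go
package main

import "fmt"

// ngrams emits n-grams of length n with the given skip (illustrative).
func ngrams(line string, n, skip int) []string {
	runes := []rune(line)
	var out []string
	for i := 0; i+n <= len(runes); i += skip + 1 {
		out = append(out, string(runes[i:i+n]))
	}
	return out
}

// tokenizeOffsets drops offsets that produced no tokens, so the result
// can be shorter than skip+1 entries.
func tokenizeOffsets(line string, n, skip int) [][]string {
	runes := []rune(line)
	var res [][]string
	for i := 0; i < skip+1 && i < len(runes); i++ {
		if toks := ngrams(string(runes[i:]), n, skip); len(toks) > 0 {
			res = append(res, toks)
		}
	}
	return res
}

func main() {
	// Offset 1 of "abc" is "bc", too short for a trigram, so only
	// 1 entry comes back instead of skip+1 == 2.
	fmt.Println(len(tokenizeOffsets("abc", 3, 1))) // 1
}
```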

@paul1r (Collaborator, Author) replied:
yep, I've updated the doc accordingly, good catch

// The offset is used if the Tokenizer has a skip value being utilized.
func SearchesForTokenizerAndLine(t Tokenizer, line string) (res [][]Token) {
res = make([][]Token, 0, 10)
for i := range line { // iterate by runes
This unnecessarily iterates all runes in the line, including offsets beyond skip+1.

@paul1r (Collaborator, Author) replied:
ack, added a break clause
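A sketch of what the fixed loop might look like, with illustrative names: `for i := range line` walks byte indices at rune boundaries, and the break stops iteration once skip+1 offsets have been collected:

```go
package main

import "fmt"

// offsets collects the line suffix for each of the skip+1 starting
// offsets, breaking early instead of walking every rune in the line.
// This is a sketch, not Loki's actual code.
func offsets(line string, skip int) []string {
	var res []string
	for i := range line { // i is the byte index where each rune starts
		if len(res) >= skip+1 {
			break // offsets beyond skip+1 are never needed
		}
		res = append(res, line[i:])
	}
	return res
}

func main() {
	fmt.Println(offsets("abcdef", 1)) // [abcdef bcdef]
}
```

Because `range` over a string yields rune start positions, `line[i:]` is always a valid suffix even for multi-byte characters.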

@owen-d owen-d merged commit c4ed0d0 into main Nov 8, 2023
7 checks passed
@owen-d owen-d deleted the paul1r/bloom_updates branch November 8, 2023 22:03
rhnasc pushed a commit to inloco/loki that referenced this pull request Apr 12, 2024