Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reducing size #8

Merged
merged 25 commits into from
May 26, 2022
Merged

Reducing size #8

merged 25 commits into from
May 26, 2022

Conversation

hengfeiyang
Copy link
Contributor

@hengfeiyang hengfeiyang commented May 3, 2022

  • compress storedFields with chunk
  • use zstd replace snappy
  • packed int for docNum, offset of docValue
  • packed int for TF, Loc of posting list
  • compress chunk of posting list
  • RunOptimize for roaring.Bitmap

After that,
we can reduce index size from 28 MB -> 9 MB with the same test data.
on a bigger data set,
we can reduce index size from 900 MB -> 160 MB with the same test data.

test data:

https://github.com/zinclabs/zinc/releases/download/v0.1.1/olympics.ndjson.gz

@hengfeiyang
Copy link
Contributor Author

I fixed CI

@hengfeiyang
Copy link
Contributor Author

hengfeiyang commented May 6, 2022

Lint Error is not my fault, The golang-ci version is too old, not compatible with go 1.18.1.

please upgrade Github action.

i run golang-ci in my local is ok.

@mschoch
Copy link
Member

mschoch commented May 6, 2022

Lint Error is not my fault, The golang-ci version is too old, not compatible with go 1.18.1.

please upgrade Github action.

i run golang-ci in my local is ok.

Don't worry, we are aware of the issues, have them on the main Bluge project as well: blugelabs/bluge#112

Hopefully I can get that sorted out, and apply it here as well.

I plan to start reviewing this weekend.

Copy link

@voldyman voldyman left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good! Is this going to be backwards compatible with existing indexes? or would users need to rebuild their index?

documentcoder.go Outdated Show resolved Hide resolved
new.go Outdated Show resolved Hide resolved
documentcoder.go Outdated Show resolved Hide resolved
AUTHORS Show resolved Hide resolved
intcoder.go Show resolved Hide resolved
@hengfeiyang
Copy link
Contributor Author

Looks good! Is this going to be backwards compatible with existing indexes? or would users need to rebuild their index?

Yes, the MR not compatible old version.

@mschoch
Copy link
Member

mschoch commented May 14, 2022

OK, this is going to take longer than I originally thought, but I did an overview today and it all seems reasonable. A few initial questions/concerns, meanwhile I'll start trying to go feature by feature and be more thorough.

  • reducing the index size is great, but I'm wondering if on the zinc side you have any performance tests? It's reasonable to wonder if there are some higher latency or reduced throughput that come with the trade-offs here.
  • you say "compress storedFields with trunk" and "compress trunk of posting list" and also the code/comments make multiple references to 'trunk'. what does this word mean in this context? I am not familiar with how you are using it.

Thanks again for the contribution, I will dig deeper into the individual features next.

@hengfeiyang
Copy link
Contributor Author

hengfeiyang commented May 15, 2022

  1. Sorry, trunk -> chunk, i update my description.
  2. As we test, after compressed, the performance has be a bit better, i think because the size of file much reduced, so the read time reduced to.

@mschoch
Copy link
Member

mschoch commented May 15, 2022

Ah, it is chunk, thanks, that makes sense.

@voldyman
Copy link

looks good to me, thanks for the great work @hengfeiyang !

@hengfeiyang
Copy link
Contributor Author

hengfeiyang commented May 24, 2022

Copy link
Member

@mschoch mschoch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks great. I have done as much review as possible, thank you @voldyman for assisting. I think there is more value in getting this merged, so more people can use it, than there is value in waiting longer for me to review completely.

I will starting fixing the CI so we can proceed.

read.go Outdated

metaLenData, err := s.data.Read(int(storedOffset), int(storedOffset+binary.MaxVarintLen64))
// document chunk coder
thunkI := docNum / uint64(defaultDocumentChunkSize)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we rename to chunkI

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

renamed

@mschoch
Copy link
Member

mschoch commented May 24, 2022

OK, so my plan is to do the following tomorrow (please comment if you think there are other steps):

  • Tag the existing master branch as v1.0.0 (this is more a formality to have a clean semantic version number to refer to what bluge has been using for the past year)
  • Merge this PR
  • Tag master as v2.0.0
  • Update bluge config.go to support both ice v1 and v2
  • Release new version of Bluge, with default to new format

@mschoch
Copy link
Member

mschoch commented May 26, 2022

Alright I don't have perms to fix the lint issue on this branch, so I'll clean it up after merging

@hengfeiyang
Copy link
Contributor Author

Alright I don't have perms to fix the lint issue on this branch, so I'll clean it up after merging

i merged ci udpate from master

@mschoch mschoch merged commit 8b0e8c2 into blugelabs:master May 26, 2022
@FZambia
Copy link

FZambia commented Jul 4, 2022

@hengfeiyang hello, seems like this PR introduced several critical issues in Bluge, see discussion in blugelabs/bluge#123, so the latest version is broken.

@hengfeiyang
Copy link
Contributor Author

@FZambia i create a new pr sync our version in zinc.

@FZambia
Copy link

FZambia commented Jul 4, 2022

Thanks! Do you mean that in zinc you already have a version of Bluge which already solves those panics/errors?

@hengfeiyang
Copy link
Contributor Author

hengfeiyang commented Jul 5, 2022

I'm not sure fixed your case, we just fixed some panic. Also i can do more test with your given test code. blugelabs/bluge#123 (comment)

@hengfeiyang
Copy link
Contributor Author

@FZambia I tested you test code, it work well with my PR #18

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants