Suggestion: use pickle instead of json for better performance in initialization #36

Open
m-gerald opened this issue Dec 25, 2023 · 2 comments

Comments

@m-gerald

m-gerald commented Dec 25, 2023

Hi, I saw your project and it is very useful. However, I ran into a performance issue when using it: loading a large amount of data from JSON is inefficient and takes a long time. So I loaded your JSON files into dicts and dumped them into pickle files (using the pickle library), and I got around 40% better performance when loading the data from pickle files instead of JSON.

I put the pickle files inside the package (I didn't zip them, but they should work fine when zipped as well) and changed your method as shown below. I also changed the paths you declared so that they point to these files.

import pickle

@staticmethod
def _read_json_from_zip(zip_file):
    # Method name kept as-is; zip_file now points to a plain pickle file.
    with open(zip_file, 'rb') as f:
        return pickle.load(f)
@philipperemy
Owner

Thanks @m-gerald, I'll see what I can do here!

@philipperemy
Owner

philipperemy commented Jan 4, 2024

One solution is to pickle the files and compress them with gzip. The file size will remain unchanged. Applying pickle without compression will increase the size of the files and is not practical.
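To get a feel for the size trade-off described above, here is a minimal sketch that compares the byte sizes of the same dict serialized as JSON, as raw pickle, and as gzip-compressed pickle. The `sample` dict is a hypothetical stand-in, not the project's actual data, so the printed numbers will not match the real files.

```python
import gzip
import json
import pickle

# Hypothetical stand-in data; the project's real JSON files are far larger.
sample = {
    f"name{i}": {"country": {"US": 0.5, "FR": 0.5}, "gender": {"M": 1.0}}
    for i in range(5000)
}

json_bytes = json.dumps(sample).encode("utf-8")
pickle_bytes = pickle.dumps(sample, protocol=pickle.HIGHEST_PROTOCOL)
pickle_gz_bytes = gzip.compress(pickle_bytes)

sizes = {
    "json": len(json_bytes),
    "pickle": len(pickle_bytes),
    "pickle+gzip": len(pickle_gz_bytes),
}

# Print a small size table for comparison.
for label, size in sizes.items():
    print(f"{label:12s} {size:>10,d} bytes")
```

Running the same comparison on the real `first_names.json` and `last_names.json` would show how the formats compare for this package.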

  • It takes 7.1s to init with the current ZIP implementation and 5.3s from PKL.GZ, so I measure closer to a 25% speed gain.
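For anyone who wants to reproduce this kind of measurement, here is a rough timing sketch. It is not the benchmark used for the numbers above: the data is synthetic, everything is kept in memory, and `time_call`, `load_zip_json`, and `load_pkl_gz` are hypothetical helper names.

```python
import gzip
import io
import json
import pickle
import time
import zipfile


def time_call(fn, repeats=3):
    """Return the best wall-clock time (seconds) of fn over several runs."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best


# Synthetic data; the real name files are much larger.
data = {f"name{i}": {"rank": i, "score": i / 7} for i in range(20000)}

# Build a zipped JSON payload and a gzipped pickle payload in memory.
zip_buf = io.BytesIO()
with zipfile.ZipFile(zip_buf, "w", zipfile.ZIP_DEFLATED) as zf:
    zf.writestr("data.json", json.dumps(data))
zip_bytes = zip_buf.getvalue()
pkl_gz_bytes = gzip.compress(pickle.dumps(data))


def load_zip_json():
    # Read the JSON member out of the ZIP archive and parse it.
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as zf:
        return json.loads(zf.read("data.json"))


def load_pkl_gz():
    # Decompress the gzip payload and unpickle it.
    return pickle.loads(gzip.decompress(pkl_gz_bytes))


print(f"zip+json : {time_call(load_zip_json):.4f}s")
print(f"pkl.gz   : {time_call(load_pkl_gz):.4f}s")
```

Timings depend heavily on the machine and the data shape, so the ratio seen here may differ from the one measured on the real files.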

Encoding

import gzip, json, pickle
for name in ('first_names', 'last_names'):
    # Context managers make sure both files are flushed and closed.
    with open(f'{name}.json') as src, gzip.open(f'{name}.pkl.gz', 'wb') as dst:
        pickle.dump(json.load(src), dst)

Decoding (just a quick impl, final impl will differ)

@staticmethod
def _read_json_from_zip(zip_file):
    # Close the gzip handle once the pickle has been loaded.
    with gzip.open(zip_file, 'rb') as f:
        return pickle.load(f)

With LZ4, it's possible to reach 4.7s but the file sizes are 50% larger.
