Speedup and reduce size of storage [DNM] #704
Conversation
Changed system serialization write and read methods
System read method to compensate for sequential writes by making 2 passes over the data to ensure order does not matter
actually deserialize the system (derp)
All variables are initialized to their initial chunksize when the file is first created. Could that explain the 32MB?
Maybe. But each of the iterable dimensions auto-sizes as we add iterations, so the only thing that …
And if it is the reference system, we can just force that into a netCDF fixed-length char dim and let netCDF compress that.
One chunk's worth of every variable is allocated initially. Are you saying no variables but the energies are created when the file is first created?
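For concreteness, a minimal sketch of the two netCDF ideas discussed above, using the netCDF4-python API. All file, dimension, and variable names here are hypothetical, not the PR's actual schema:

```python
import netCDF4
import numpy as np

nc = netCDF4.Dataset('example.nc', 'w')

# One chunk per variable is allocated as soon as the variable is created,
# so oversized default chunksizes inflate a freshly initialized file.
nc.createDimension('iteration', None)  # unlimited; auto-sizes as iterations are added
energies = nc.createVariable('energies', 'f8', ('iteration',),
                             chunksizes=(1024,))  # explicit, modest chunk size

# A fixed-length char dimension lets netCDF's own zlib filter compress a
# stored string, e.g. the serialized reference system.
serialized = '<System ...>'  # placeholder for a serialized system string
nc.createDimension('system_chars', len(serialized))
sys_var = nc.createVariable('reference_system', 'S1', ('system_chars',),
                            zlib=True, complevel=6)
sys_var[:] = np.frombuffer(serialized.encode('ascii'), dtype='S1')

nc.close()
```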
Just reporting this here as well for the records. I ran three variants:

1) res = pickle.dumps({'system': alch_ser})
2) res = yaml.dump({'system': alch_ser})
3)
    byte_alch_ser = alch_ser.encode('utf-8')
    compressed = zlib.compress(byte_alch_ser, 4)
    res = yaml.dump({'system': compressed})

Here is the result and the dimension of the …
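A self-contained version of that comparison might look like the sketch below; `alch_ser` here is an illustrative stand-in for the serialized-system XML string used in the real measurement:

```python
import pickle
import zlib

import yaml

# Stand-in for the serialized alchemical system; in the real test this is
# the multi-megabyte XML string produced by OpenMM's XmlSerializer.
alch_ser = '<Force type="NonbondedForce"/>' * 100000

# 1) pickle the raw string
res_pickle = pickle.dumps({'system': alch_ser})

# 2) YAML-dump the raw string (slow: the dumper scans the entire string)
res_yaml = yaml.dump({'system': alch_ser})

# 3) zlib-compress first, then YAML-dump the compressed bytes
byte_alch_ser = alch_ser.encode('utf-8')
compressed = zlib.compress(byte_alch_ser, 4)
res_zlib_yaml = yaml.dump({'system': compressed})

for name, res in [('pickle', res_pickle), ('yaml', res_yaml),
                  ('zlib+yaml', res_zlib_yaml)]:
    print(name, len(res))
```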
Here … For the 32MB, I guess both the …

For the state serialization of the states, we could add the first two lines of solution 3) directly into …

For the topology, I don't have an idea off the top of my head, since neither MDtraj nor OpenMM support string serialization and we currently have to go through a dataframe representation. It might be good to check how much space it takes on disk and how much …
Worth mentioning also that …
Although now that I remember, with a fixed-length string and the same level of …
@andrrizzi okay, so what do you recommend as far as an approach here? I get the various timings and whatnot, but I'm not following what your suggestion is.
Okay. So we cannot store the pre-compressed string directly (improper characters when netCDF tries to read it), but we can do the following, which is really dumb but lets us keep the framework with very minimal edits:

…

On load:

…
If that looks like a solution (since it would not break anything), let me know and I'll put it in. It takes a 64MB string down to 4MB, so it's not as high a compression as the gzip/pickle-in-another-file approach, but it's better.
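If I read the proposal right, the roundtrip would look roughly like this sketch. Base64 is my assumption for the text-safe encoding step, since the comment above only says the raw zlib output contains characters netCDF rejects:

```python
import base64
import zlib

def compress_for_netcdf(serialized_system):
    """Compress a serialized system and encode it as netCDF-safe ASCII text."""
    raw = zlib.compress(serialized_system.encode('utf-8'), 6)
    # base64 (an assumption here) keeps the stored string ASCII-only
    return base64.b64encode(raw).decode('ascii')

def decompress_from_netcdf(stored_string):
    """Invert compress_for_netcdf on load."""
    raw = base64.b64decode(stored_string.encode('ascii'))
    return zlib.decompress(raw).decode('utf-8')

xml = '<System ...>'  # placeholder serialized system
assert decompress_from_netcdf(compress_for_netcdf(xml)) == xml
```

If base64 is indeed the encoding, its ~33% size overhead on the compressed bytes may partly explain why this lands at 4MB rather than matching the gzip/pickle file's ratio.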
To keep the thread synced, I've updated #702 with a snippet that should use both Python and netCDF compression.
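I haven't reproduced the #702 snippet itself; as a sketch, stacking the two compression layers could look like this (the variable names and byte-array layout are my assumptions):

```python
import zlib

import netCDF4
import numpy as np

xml = '<System ...>'  # placeholder serialized system

# Layer 1: Python-side zlib compression of the string itself
payload = zlib.compress(xml.encode('utf-8'), 6)

# Layer 2: netCDF's own per-variable zlib filter on the stored bytes
nc = netCDF4.Dataset('compressed.nc', 'w')
nc.createDimension('system_bytes', len(payload))
var = nc.createVariable('system', 'u1', ('system_bytes',),
                        zlib=True, complevel=4)
var[:] = np.frombuffer(payload, dtype=np.uint8)
nc.close()

# On load: read the byte array back, then zlib.decompress(...).decode('utf-8')
```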
That's fine, I may just close this down depending on the choices in that thread.
Closing this down and going with a cleaner approach as per #702 |
This PR tries to accomplish 2 things:
1. Speed up the writing and reading of serialized systems.
2. Reduce the size of the storage files on disk.
This PR does that, but currently has some problems (Do Not Merge yet!)
The serialized systems are not processed through NetCDF anymore, and instead are stored as GZIP'd pickled objects in an extra storage file. I manage to get 120MB of strings down to 5.6MB in another file. Decompression is lossless.
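For reference, the sidecar approach amounts to something like the sketch below; the filename and dict layout are illustrative, not the PR's exact format:

```python
import gzip
import pickle

# Write: dump the serialized systems into a gzip'd pickle next to the .nc file
with gzip.open('analysis_systems.pkl.gz', 'wb') as f:
    pickle.dump({'reference_system': '<System ...>'}, f)

# Read: decompression is lossless, so the original strings come back verbatim
with gzip.open('analysis_systems.pkl.gz', 'rb') as f:
    systems = pickle.load(f)
```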
There is a speedup, since the serialized systems are no longer run through the YAML dumper, which appears to recursively search through the string; that is a huge time save (no quantitative measurement, but it is very noticeable for me).
There is still something being saved to the main analysis file which is occupying a large amount of space. For instance, for a freshly initialized explicit solvent simulation, the analysis file is still 32MB. At least it's not 200MB, but that number still isn't as small as it should be. Some other object is taking up a large amount of space.
This is not ready to merge as the tests are not passing (due to a change in how the file is saved from the test). And I don't know why the file is still so large.
@andrrizzi or @jchodera: could you look over this before I spend too much time fixing the rest of the problems and debugging what else is taking up so much space?
Fixes #702
Progress towards #582 and #41