
Speedup and reduce size of storage [DNM] #704

Closed · wants to merge 7 commits
Conversation

@Lnaden (Contributor) commented Jun 27, 2017

This PR tries to accomplish two things:

  • Reduce the size of the storage files by compressing the serialized system strings.
  • Reduce YANK's boot-up time by cutting down on how much processing of the system strings is done.

The PR does both, but it currently has some problems (do not merge yet!).

The serialized systems are no longer processed through NetCDF; instead they are stored as gzipped pickled objects in an extra storage file. I managed to get 120 MB of strings down to 5.6 MB in the separate file, and decompression is lossless.
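The gzip-plus-pickle side-file idea can be sketched roughly like this (a minimal stand-in; the dictionary contents and sizes here are illustrative, not the real 120 MB of system strings):

```python
import gzip
import pickle

# Stand-in for the serialized system strings stored in the extra file.
systems = {"state_0": "<System><Particle/></System>" * 500}

# What would be written to the side storage file: pickle, then gzip.
blob = gzip.compress(pickle.dumps(systems))

# On load: decompress, then unpickle -- a lossless round trip.
restored = pickle.loads(gzip.decompress(blob))

assert restored == systems
assert len(blob) < len(pickle.dumps(systems))  # repetitive XML compresses well
```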

There is also a speedup, since the serialized systems no longer pass through the YAML dumper, which appears to recursively search through the string; avoiding that is a huge time saver (no quantitative measurement, but it is very noticeable for me).

Something is still being saved to the main analysis file that occupies a large amount of space. For instance, for a freshly initialized explicit-solvent simulation, the analysis file is still 32 MB. At least it's not 200 MB, but that is still larger than it should be; some other object is taking up a large amount of space.

This is not ready to merge: the tests are not passing (due to a change in how the file is saved by the test), and I don't know why the file is still so large.

@andrrizzi or @jchodera, could you look this over before I spend too much time fixing the remaining problems and debugging whatever else is taking up so much space?

Fixes #702
Progress towards #582 and #41

@jchodera (Member)

All variables are allocated at their initial chunk size when the file is first created. Could that explain the 32 MB?

@Lnaden (Author) commented Jun 27, 2017

Maybe, but each of the iterable dimensions auto-sizes as we add iterations, so the only thing that should occupy initial space is the energies, and even those aren't that large. I'd bet it's another uncompressed string somewhere (like the repex options/metadata or something), possibly the reference system?

@Lnaden (Author) commented Jun 27, 2017

And if it is the reference system, we can just force that into a NetCDF fixed-length char dimension and let NetCDF compress it.

@jchodera (Member)

> Maybe, but each of the iterable dimensions auto-sizes as we add iterations, so the only thing that should occupy initial space is the energies, and even those aren't that large.

One chunk's worth of every variable is allocated initially. Are you saying no variables but the energies are created when the file is first written?

@andrrizzi (Contributor) commented Jun 29, 2017

Just reporting this here as well for the record. I ran some timeit benchmarks on my laptop of:

  1. res = pickle.dumps({'system': alch_ser})
  2. res = yaml.dump({'system': alch_ser})
  3. byte_alch_ser = alch_ser.encode('utf-8')
     compressed = zlib.compress(byte_alch_ser, 4)
     res = yaml.dump({'system': compressed})

Here are the timings and the size of the resulting res string:

  1. 0.016 s, 38.4 MB
  2. 65.13 s, 50.5 MB
  3. 4.759 s, 3.1 MB

Here alch_ser is the serialized alchemical system of testsystems.SrcExplicit. Note that 1) uses dumps, not dump. 3) will likely end up as fast as 1) once we account for having to write 3 MB to disk instead of 38 MB.
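The compression step in 3) can be illustrated on a repetitive stand-in string (the real measurement used the ~38 MB serialized SrcExplicit system, so the absolute sizes here are not meaningful):

```python
import zlib

# Stand-in for the serialized alchemical system XML.
data = ("<Particle mass='1.007947'/>" * 20000).encode("utf-8")

# Level 4, as in the snippet above: a speed/size compromise.
compressed = zlib.compress(data, 4)

assert len(compressed) < len(data)          # repetitive XML compresses well
assert zlib.decompress(compressed) == data  # lossless round trip
```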

For the 32 MB, I'd guess that both the reference_state and the topography (which contains the Topology object), which are saved as metadata, are pretty big.

For the serialization of the states, we could add the first two lines of solution 3) directly into ThermodynamicState.__getstate__(..., zlib=False) through an extra parameter (similar to what we did with skip_system) to make it available to everyone in the group. Since it's so fast, we could even raise the compression level (4 in this example) to shrink the compressed string further, or make the zlib argument an integer compression level instead of a boolean. This also has the advantage of being already compatible with the storage layer in openmmtools.
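A minimal sketch of that idea, using a hypothetical stand-in class rather than the real openmmtools ThermodynamicState (the class, attribute, and parameter names here are illustrative only):

```python
import zlib

class FakeState:
    """Hypothetical stand-in for a state that serializes a big XML string."""

    def __init__(self, system_xml):
        self._system_xml = system_xml

    def __getstate__(self, compression_level=0):
        # An integer level instead of a boolean zlib flag, as proposed above:
        # 0 disables compression, 1-9 selects the zlib level.
        serialized = {"system": self._system_xml}
        if compression_level:
            serialized["system"] = zlib.compress(
                serialized["system"].encode("utf-8"), compression_level)
        return serialized

state = FakeState("<System><Force/></System>" * 200)
raw = state.__getstate__()
packed = state.__getstate__(compression_level=4)

assert len(packed["system"]) < len(raw["system"])
assert zlib.decompress(packed["system"]).decode("utf-8") == raw["system"]
```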

For the topology, I don't have an idea off the top of my head, since neither MDTraj nor OpenMM supports string serialization, and we currently have to go through a DataFrame representation. It might be good to check how much space it takes on disk and how long yaml.dump(topography_serialization) takes.

@andrrizzi (Contributor) commented Jun 29, 2017

Also worth mentioning that NetCDF probably adds a few MB of overhead. During my experiments with compression, I noticed that the .nc file was always a few MB bigger than the string I was saving.

@andrrizzi (Contributor)

Although, now that I remember, with a fixed-length string and the same zlib compression level, the final size on disk was much smaller, so that overhead may affect only variable-length strings.

@Lnaden (Author) commented Jun 30, 2017

@andrrizzi Okay, so what do you recommend as an approach here? I follow the various timings, but I'm not clear on what your suggestion is.

@Lnaden (Author) commented Jun 30, 2017

Okay. So we cannot store the pre-compressed string directly (NetCDF hits improper characters when it tries to read it), but we can do the following, which is really dumb but lets us keep the framework with very minimal edits:

  1. Encode the strings into byte types (Python 3 only)
  2. Compress the byte strings with zlib
  3. Run the compressed data through yaml.dump to yield a string wrapping the compressed bytes
  4. Save the resulting "string" amalgamation to the NetCDF variable

On load:

  1. Read variable, yaml.load it, decompress, decode.

If that looks like a solution (since it would not break anything), let me know and I'll put it in. It takes a 64 MB string down to 4 MB, so the compression isn't as high as the gzip/pickle-in-a-separate-file approach, but it's an improvement.
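The save/load path above can be sketched as follows (assumes PyYAML, which serializes bytes as a base64 !!binary node; the variable names and the stand-in string are illustrative):

```python
import zlib
import yaml  # PyYAML

serialized_system = "<System>" + "<Force/>" * 2000 + "</System>"

# Save: encode -> zlib-compress -> yaml.dump, giving a plain-text string
# that a NetCDF string variable can hold without improper characters.
compressed = zlib.compress(serialized_system.encode("utf-8"))
netcdf_string = yaml.dump(compressed)  # bytes become a base64 !!binary node

# Load: yaml.load back to bytes -> decompress -> decode.
roundtrip = zlib.decompress(yaml.safe_load(netcdf_string)).decode("utf-8")

assert isinstance(netcdf_string, str)
assert roundtrip == serialized_system
```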

@andrrizzi (Contributor)

To keep the thread synced, I've updated #702 with a snippet that should use both Python-level and NetCDF-level compression.

@Lnaden (Author) commented Jun 30, 2017

That's fine. I may just close this down depending on the choices in that thread.

@Lnaden (Author) commented Jun 30, 2017

Closing this down and going with a cleaner approach, as per #702.

@Lnaden Lnaden closed this Jun 30, 2017
@Lnaden Lnaden deleted the speed_storage branch June 30, 2017 17:45