
Speedup and reduce size of storage [DNM] #704

Closed · wants to merge 7 commits
Conversation

@Lnaden (Contributor) commented Jun 27, 2017

This PR tries to accomplish two things:

  • Reduce the size of the storage files by compressing the serialized system strings.
  • Reduce YANK's boot-up time by cutting down on how much processing of the system strings is done.

The PR does both, but it currently has some problems (do not merge yet!).

The serialized systems are no longer processed through NetCDF; instead they are stored as gzipped pickled objects in an extra storage file. I managed to get 120 MB of strings down to 5.6 MB in the separate file, and decompression is lossless.
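The gzip-plus-pickle side-file idea can be sketched roughly like this (a minimal stand-in; the dictionary contents and sizes here are illustrative, not the real 120 MB of system strings):

```python
import gzip
import pickle

# Stand-in for the serialized system strings stored in the extra file.
systems = {"state_0": "<System><Particle/></System>" * 500}

# What would be written to the side storage file: pickle, then gzip.
blob = gzip.compress(pickle.dumps(systems))

# On load: decompress, then unpickle -- a lossless round trip.
restored = pickle.loads(gzip.decompress(blob))

assert restored == systems
assert len(blob) < len(pickle.dumps(systems))  # repetitive XML compresses well
```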

There is also a speedup, since the serialized systems no longer pass through the YAML dumper, which appears to recursively search through the string; avoiding that is a huge time saver (no quantitative measurement, but it is very noticeable for me).

Something is still being saved to the main analysis file that occupies a large amount of space. For instance, for a freshly initialized explicit-solvent simulation, the analysis file is still 32 MB. At least it's not 200 MB, but that is still larger than it should be; some other object is taking up a large amount of space.

This is not ready to merge: the tests are not passing (due to a change in how the file is saved by the test), and I don't know why the file is still so large.

@andrrizzi or @jchodera, could you look this over before I spend too much time fixing the remaining problems and debugging whatever else is taking up so much space?

Fixes #702
Progress towards #582 and #41

@jchodera (Member)

All variables are allocated at their initial chunk size when the file is first created. Could that explain the 32 MB?

@Lnaden (Author) commented Jun 27, 2017

Maybe, but each of the iterable dimensions auto-sizes as we add iterations, so the only thing that should occupy initial space is the energies, and even those aren't that large. I'd bet it's another uncompressed string somewhere (like the repex options/metadata or something), possibly the reference system?

@Lnaden (Author) commented Jun 27, 2017

And if it is the reference system, we can just force that into a NetCDF fixed-length char dimension and let NetCDF compress it.

@jchodera (Member)

> Maybe, but each of the iterable dimensions auto-sizes as we add iterations, so the only thing that should occupy initial space is the energies, and even those aren't that large.

One chunk's worth of every variable is allocated initially. Are you saying no variables but the energies are created when the file is first written?

@andrrizzi (Contributor) commented Jun 29, 2017

Just reporting this here as well for the record. I ran some timeit benchmarks on my laptop of:

  1. res = pickle.dumps({'system': alch_ser})
  2. res = yaml.dump({'system': alch_ser})
  3. byte_alch_ser = alch_ser.encode('utf-8')
     compressed = zlib.compress(byte_alch_ser, 4)
     res = yaml.dump({'system': compressed})

Here are the timings and the size of the resulting res string:

  1. 0.016 s, 38.4 MB
  2. 65.13 s, 50.5 MB
  3. 4.759 s, 3.1 MB

Here alch_ser is the serialized alchemical system of testsystems.SrcExplicit. Note that 1) uses dumps, not dump. 3) will likely end up as fast as 1) once we account for having to write 3 MB to disk instead of 38 MB.
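The compression step in 3) can be illustrated on a repetitive stand-in string (the real measurement used the ~38 MB serialized SrcExplicit system, so the absolute sizes here are not meaningful):

```python
import zlib

# Stand-in for the serialized alchemical system XML.
data = ("<Particle mass='1.007947'/>" * 20000).encode("utf-8")

# Level 4, as in the snippet above: a speed/size compromise.
compressed = zlib.compress(data, 4)

assert len(compressed) < len(data)          # repetitive XML compresses well
assert zlib.decompress(compressed) == data  # lossless round trip
```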

For the 32 MB, I'd guess that both the reference_state and the topography (which contains the Topology object), which are saved as metadata, are pretty big.

For the serialization of the states, we could add the first two lines of solution 3) directly into ThermodynamicState.__getstate__(..., zlib=False) through an extra parameter (similar to what we did with skip_system) to make it available to everyone in the group. Since it's so fast, we could even raise the compression level (4 in this example) to shrink the compressed string further, or make the zlib argument an integer compression level instead of a boolean. This also has the advantage of being already compatible with the storage layer in openmmtools.
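A minimal sketch of that idea, using a hypothetical stand-in class rather than the real openmmtools ThermodynamicState (the class, attribute, and parameter names here are illustrative only):

```python
import zlib

class FakeState:
    """Hypothetical stand-in for a state that serializes a big XML string."""

    def __init__(self, system_xml):
        self._system_xml = system_xml

    def __getstate__(self, compression_level=0):
        # An integer level instead of a boolean zlib flag, as proposed above:
        # 0 disables compression, 1-9 selects the zlib level.
        serialized = {"system": self._system_xml}
        if compression_level:
            serialized["system"] = zlib.compress(
                serialized["system"].encode("utf-8"), compression_level)
        return serialized

state = FakeState("<System><Force/></System>" * 200)
raw = state.__getstate__()
packed = state.__getstate__(compression_level=4)

assert len(packed["system"]) < len(raw["system"])
assert zlib.decompress(packed["system"]).decode("utf-8") == raw["system"]
```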

For the topology, I don't have an idea off the top of my head, since neither MDTraj nor OpenMM supports string serialization, and we currently have to go through a DataFrame representation. It might be good to check how much space it takes on disk and how long yaml.dump(topography_serialization) takes.

@andrrizzi (Contributor) commented Jun 29, 2017

Also worth mentioning that NetCDF probably adds a few MB of overhead. During my experiments with compression, I noticed that the .nc file was always a few MB bigger than the string I was saving.

@andrrizzi (Contributor)

Although, now that I remember, with a fixed-length string and the same zlib compression level, the final size on disk was much smaller, so that overhead may affect only variable-length strings.

@Lnaden (Author) commented Jun 30, 2017

@andrrizzi Okay, so what do you recommend as an approach here? I follow the various timings, but I'm not clear on what your suggestion is.

@Lnaden (Author) commented Jun 30, 2017

Okay. So we cannot store the pre-compressed string directly (NetCDF hits improper characters when it tries to read it), but we can do the following, which is really dumb but lets us keep the framework with very minimal edits:

  1. Encode the strings into byte types (Python 3 only)
  2. Compress the byte strings with zlib
  3. Run the compressed data through yaml.dump to yield a string wrapping the compressed bytes
  4. Save the resulting "string" amalgamation to the NetCDF variable

On load:

  1. Read variable, yaml.load it, decompress, decode.

If that looks like a solution (since it would not break anything), let me know and I'll put it in. It takes a 64 MB string down to 4 MB, so the compression isn't as high as the gzip/pickle-in-a-separate-file approach, but it's an improvement.
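The save/load path above can be sketched as follows (assumes PyYAML, which serializes bytes as a base64 !!binary node; the variable names and the stand-in string are illustrative):

```python
import zlib
import yaml  # PyYAML

serialized_system = "<System>" + "<Force/>" * 2000 + "</System>"

# Save: encode -> zlib-compress -> yaml.dump, giving a plain-text string
# that a NetCDF string variable can hold without improper characters.
compressed = zlib.compress(serialized_system.encode("utf-8"))
netcdf_string = yaml.dump(compressed)  # bytes become a base64 !!binary node

# Load: yaml.load back to bytes -> decompress -> decode.
roundtrip = zlib.decompress(yaml.safe_load(netcdf_string)).decode("utf-8")

assert isinstance(netcdf_string, str)
assert roundtrip == serialized_system
```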

@andrrizzi (Contributor)

To keep the thread synced, I've updated #702 with a snippet that should use both Python-level and NetCDF-level compression.

@Lnaden (Author) commented Jun 30, 2017

That's fine. I may just close this down depending on the choices in that thread.

@Lnaden (Author) commented Jun 30, 2017

Closing this down and going with a cleaner approach, as per #702.

@Lnaden Lnaden closed this Jun 30, 2017
@Lnaden Lnaden deleted the speed_storage branch June 30, 2017 17:45