Experiment with data sampling methods #9

tskluzac · 2019-07-01T02:13:36Z

One way to speed up the extractor is to implement optional sampling; the best way to do this requires further exploration. The rationale is that the metadata don't necessarily need to be perfect to tell us what we need to know about the files. We should explore the following:

Check the existing Pandas methods for sampling data (should they exist).
Explore processing only a subset of the chunked dataframes. (fastest?)
In each dataframe, consider reducing the dataframe and only reading part of it. (least biased?)
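
As a starting point for comparing the three options above, here is a minimal sketch. It assumes the extractor reads CSV input through `pd.read_csv(..., chunksize=...)`; the function names, the `keep_every` and `frac` parameters, and the toy data are all hypothetical and not taken from the extractor's code. Note that in option 2 pandas still parses the skipped chunks, so any speed-up would come from skipping the downstream metadata processing, not the read itself.

```python
import io

import pandas as pd


def sample_rows(df, frac=0.1, seed=0):
    """Option 1: pandas' built-in random row sampling (DataFrame.sample)."""
    return df.sample(frac=frac, random_state=seed)


def sample_chunks(source, chunksize=1000, keep_every=4):
    """Option 2: process only every `keep_every`-th chunk of a chunked read."""
    reader = pd.read_csv(source, chunksize=chunksize)
    kept = [chunk for i, chunk in enumerate(reader) if i % keep_every == 0]
    return pd.concat(kept, ignore_index=True)


def sample_within_chunks(source, chunksize=1000, frac=0.1, seed=0):
    """Option 3: visit every chunk but keep a random fraction of each one."""
    reader = pd.read_csv(source, chunksize=chunksize)
    parts = [
        chunk.sample(frac=frac, random_state=seed + i)  # vary the seed per chunk
        for i, chunk in enumerate(reader)
    ]
    return pd.concat(parts, ignore_index=True)


# Toy data standing in for a real file (hypothetical).
df = pd.DataFrame({"x": range(10_000)})
csv_text = df.to_csv(index=False)

rows_sampled = sample_rows(df)
chunks_sampled = sample_chunks(io.StringIO(csv_text))
within_sampled = sample_within_chunks(io.StringIO(csv_text))
```

Option 2 keeps whole contiguous blocks, so it may be biased if the file is sorted; option 3 spreads the sample across the whole file at the cost of touching every chunk.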

I think we can call this low priority for now, but it could be interesting for a future paper.

tskluzac added the enhancement label Jul 1, 2019
