One thing we can do to speed up the extractor is to implement optional sampling -- the best approach requires further exploration. The rationale is that the metadata don't need to be perfect to tell us useful information about the files. We should explore the following:
- Check the existing pandas methods for sampling data (should they exist).
- Explore processing only a subset of the chunked dataframes. (fastest?)
- Within each chunked dataframe, sample and process only part of it. (least biased?)
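As a rough sketch of the last two options, something like the following could work (the function name, the CSV layout, and the sampling fractions here are hypothetical, just for illustration): chunk-level sampling skips whole chunks for speed, while row-level sampling uses pandas' built-in `DataFrame.sample` inside each retained chunk to reduce bias.

```python
import io
import random

import pandas as pd


def sample_while_reading(csv_buffer, chunksize=100, chunk_frac=0.5,
                         row_frac=0.5, seed=0):
    """Hypothetical sketch combining both sampling strategies.

    chunk_frac: probability of processing a given chunk at all
                (fast, but biased toward contiguous regions of the file).
    row_frac:   fraction of rows sampled *within* each retained chunk
                via DataFrame.sample (slower, but less biased).
    """
    rng = random.Random(seed)
    sampled_parts = []
    for chunk in pd.read_csv(csv_buffer, chunksize=chunksize):
        # Strategy: skip some chunks entirely (fastest option).
        if rng.random() > chunk_frac:
            continue
        # Strategy: read only part of each retained chunk (least biased).
        sampled_parts.append(chunk.sample(frac=row_frac, random_state=seed))
    return pd.concat(sampled_parts, ignore_index=True)


# Toy data: a single-column CSV with 1,000 rows.
csv = "value\n" + "\n".join(str(i) for i in range(1000))
sample = sample_while_reading(io.StringIO(csv))
print(len(sample))  # a fraction of the 1,000 rows
```

Whether chunk-level skipping introduces unacceptable bias likely depends on how the underlying files are ordered, which is part of what would need exploring.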
I think we can call this low priority for now, but could be interesting for a future paper.