Python library for working with OpenData #2

Open · AartGoossens opened this issue May 18, 2018 · 18 comments

@AartGoossens
Contributor

Continuing this discussion here.

I am working on some Python code to make working with OpenData easier. It's far from finished (it only sort of works for my use case right now), but I would like to share it, and putting it in this repository makes sense. Before I spend more time polishing it, I'd like some input on what the library should look like.

Features I would like to have in the library:

  1. View metadata of all athletes: currently the metadata lives in the blob for each athlete, so you need to download all the data just to view it. I propose creating a metadata file in the root of this repo that is updated every once in a while to reflect new or changed files in the OSF directory.
  2. Tool to selectively download data: only download a specific athlete, or only athletes with specific data types, date ranges, amounts of data, etc., based on the metadata.
  3. Return the activities in a general-purpose data format. I propose to use a pandas.DataFrame for this (a rough interface sketch follows this list).
  4. Tool to make running computations on large numbers of activities easier: I am not sure how to do this yet, but with the amount of data that's already in OpenData it's impossible to hold it all in memory, so some clever batch processing is needed. I think some tooling might help there and has its place in this library.
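
A minimal sketch of what such an interface could look like. Every name here (OpenData, metadata(), fetch(), activities()) is hypothetical and only meant to make the ideas above concrete:

    import pandas as pd

    from opendata import OpenData  # hypothetical entry point

    od = OpenData()

    # (1) inspect athlete metadata without downloading the raw blobs
    meta = od.metadata()  # -> pandas.DataFrame, one row per athlete

    # (2) selectively download, e.g. only athletes with power data
    athletes = od.fetch(data_types=["power"])

    # (3) every activity comes back as a general-purpose DataFrame
    for activity in athletes[0].activities():
        assert isinstance(activity.data, pd.DataFrame)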

Any input is welcome!

@liversedge
Member

I think we should separate out augmenting the raw data files with an index, since we will likely publish the index rather than ask users to generate it.

Then we have separate tooling for retrieving and formatting data that uses the index to support filtering and so on.

As soon as the tooling starts to post-process the data, I think it belongs in scikit-sports?

@AartGoossens
Contributor Author

I'm not sure I understand what you mean, but assuming you're talking about my point (1): I'm not proposing to change the data in OSF; I want to create a file in this repository with a 'summary' of the data that is available.
This library should not do anything with the data, but should make accessing it as easy as possible while returning the data as raw as possible. Everything else should indeed live in scikit-sports.

@glemaitre

Actually, it was the IO lib that I wanted to keep separate from scikit-sports. Regarding the issue with the amount of data to be loaded, I see two solutions: numpy.memmap, or Dask DataFrames and arrays. In the second case, Dask-ML would be a good ML solution as well.

@glemaitre

For the IO, I really think that the design of imageio could help: basically a single wrapper function, with a set of requirements for each data type, and each plugin implementing those parts.
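
A minimal sketch of that imageio-style design: one public read() entry point that dispatches to registered per-format plugins. All names here are made up for illustration:

    import os

    _READERS = {}

    def register_reader(extension):
        """Register a reader plugin for a file extension."""
        def decorator(func):
            _READERS[extension] = func
            return func
        return decorator

    @register_reader(".csv")
    def _read_csv(path):
        import pandas as pd
        return pd.read_csv(path)

    def read(path):
        """Single wrapper function: dispatch on the file extension."""
        ext = os.path.splitext(path)[1].lower()
        if ext not in _READERS:
            raise ValueError("no reader registered for %r" % ext)
        return _READERS[ext](path)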

@AartGoossens
Contributor Author

I don't agree with you there, @glemaitre. In my opinion this library should be a light wrapper around osfclient that makes it possible for anyone with a little coding experience to start loading activities from OpenData. From that point of view, solutions like Dask and numpy.memmap are overly complicated and overkill for this purpose.

This indeed means that there will probably be a separate, more imageio-like library that is targeted at ML/AI and usage from within scikit-sports. That library could use the code in this repo, but not necessarily. Another argument for this is that much of the lots-of-activities-in-memory challenge is more generic (it also applies to loading e.g. FIT files) and therefore should also live outside this library.

@glemaitre

I might have gotten confused, actually.

A wrapper around osfclient would be a sort of dataset fetcher, wouldn't it? In that case, I agree that having a wrapper which allows fetching specific data (user, sensors, ...) would be super useful.

Where I am getting confused is on reading those data. That is where I would expect an IO layer which can return a specific format. Basically, once the data is downloaded, I would expect to use the cycling IO library.

Regarding memmap or dask.dataframe, it would be transparent to the user. A numpy array read in memmap mode looks exactly like a regular numpy array. A dask.dataframe or dask.array follows the same API as pandas and numpy (apart from the constructor, where you give the number of chunks). However, I agree that it is a bit silly to use those when the data fits in memory. So it might be an option offered when reading the data, returning those types on demand.
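
A minimal sketch of that "on demand" option: one read function that returns either an in-memory pandas DataFrame or a lazy dask.dataframe, depending on a flag. The function name and flag are illustrative:

    import glob

    import dask.dataframe as dd
    import pandas as pd

    def read_activities(csv_pattern, lazy=False):
        """Read activity CSVs, optionally as a lazy dask.dataframe."""
        if lazy:
            # dask mirrors the pandas API but only reads chunks once a
            # computation is triggered with .compute()
            return dd.read_csv(csv_pattern)
        # pandas: load everything into memory up front
        return pd.concat(pd.read_csv(p) for p in glob.glob(csv_pattern))

    # df = read_activities("data/*.csv", lazy=True)
    # df["power"].mean().compute()  # the actual chunked read happens here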

@mpuchowicz

I don't really have the background to know what features would be best for the library. Instead, I will share a couple of projects that I would like to attempt with this data, and what I would need to know about the data set in order to include or exclude it.

Potential projects:

  1. Effect of the MMP time range (min, max, mean, median) used for model fitting on CP-model and 3-parameter-model parameter estimates.

For this project, I would want to be able to pull data sets by season or year. The data set requirement would be at least 50 (or some other arbitrarily high number) power files that are at least 1 hour long. Demographic information such as age, sex, height, weight, competitive category, etc. would also be helpful, but not a requirement.

  2. Effect of the data inclusion window (30 days, 60 days, 90 days, 120 days, etc.) on MMP and model parameter estimates.

Again, I would want to be able to pull data sets by season or year. The data set requirement would be 180 days of power files with a rolling 14-day average of at least 3 power files (i.e. a week off wouldn't be an exclusion, but several weeks off would).

  3. Effect of prior work (stress score, heart rate, W'bal, etc.) on MMP and model parameter estimates.

Here the data set would need to be pulled in 60-day blocks. The data set requirement would be power files with a rolling 7-day average of at least 3 power files (i.e. a couple of days off wouldn't be an exclusion, but a week off would). Obviously, to include heart rate, each power file would have to have a matching heart rate file.

So in general, what would be helpful is some way to filter based on time blocks or seasons, and by the length, consistency, and density of the power files over the block or season (a sketch of this kind of filtering is below).
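
A minimal sketch of what that filtering could look like against a per-activity metadata table. All column names (athlete_id, date, duration_s, has_power) are assumptions made for illustration:

    import pandas as pd

    # assumed layout: one row per activity
    meta = pd.read_csv("metadata.csv", parse_dates=["date"])

    # e.g. project 1: athletes with at least 50 power files of one hour
    # or longer within a given year
    season = meta[
        (meta["date"].dt.year == 2017)
        & meta["has_power"]
        & (meta["duration_s"] >= 3600)
    ]
    counts = season.groupby("athlete_id").size()
    eligible = counts[counts >= 50].index.tolist()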

Thanks in advance for all the work that is going into the open data project; it is very much appreciated. I anticipate that it will be a great resource.

mp
twitter(@dpveloclinic)

@AartGoossens
Contributor Author

@glemaitre Ah, now I get your point. I think the discussion then is whether this library is specifically meant to be used from, or in combination with, scikit-sports, or whether its use case is more generic. I was thinking of making it more generic.

@AartGoossens
Contributor Author

@mpuchowicz Thanks a lot for your input. This helps a lot in thinking about how the interface should be and which features are needed.

@AartGoossens
Contributor Author

AartGoossens commented May 29, 2018

I created a WIP PR here. It's far from polished but it shows the direction I'm heading.

Some of the features:

  • Download some or all of the activities. Data is stored in a .opendata directory.
  • List and load the downloaded data
  • Generate one big metadata file from the downloaded data for distribution in this repo (enabling you to selectively download athletes based on the metadata)

Any feedback would be appreciated.

To figure out:

  • The metadata file should be easy to handle, so I'll probably change it from the nested JSON it is now to a CSV that is easy to load into e.g. a pandas DataFrame (a sketch of that flattening step follows this list).
  • Store the data as the original CSV or as e.g. a Parquet file (smaller size and faster loading)?
  • Where should the data be stored? Cwd? Home directory? Ask the user to specify the location?
  • ...
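
A sketch of the JSON-to-CSV flattening step, assuming the nested JSON is a list of athlete records that each contain an "activities" list (the field names are illustrative):

    import json

    import pandas as pd

    with open("metadata.json") as fh:
        nested = json.load(fh)

    # flatten one level of nesting: one row per activity, with the
    # athlete id carried along as a plain column
    df = pd.json_normalize(nested, record_path="activities", meta="athlete_id")
    df.to_csv("metadata.csv", index=False)

    # consumers then only need:
    # metadata = pd.read_csv("metadata.csv")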

@liversedge
Member

Just push it and we can play and update?

I have some views on what should be put into the one big metadata file:

  • athlete info (age, gender, years of training, career bests)
  • training history info year over year (bests per year, volume, files with power, HR, etc.)

I think I need to look at this stuff now, as we have over 250k workouts and nearly 400 athletes' data!

@AartGoossens
Contributor Author

I'm fine with merging my PR now, but I suspect some rewriting will happen, so do not rely on the stability of the interface for now...

I think in the end there will be two metadata files: one with general data about athletes, and a more extensive one with summary statistics for all activities.
The metadata CSV in the PR is of the second kind. It contains all metadata from all activities, but for just 3 athletes it is already 1.6 MB, so to limit the file size we probably need to prune most of the columns (which is fine, I think). For local usage the generate_metadata() method might already be useful and sufficiently good as is.

@liversedge
Member

That's a big metadata file :)

I'm cool with things changing rapidly; anything is better than nothing!

@glemaitre

Where should the data be stored? Cwd? Home directory? Ask the user to specify the location?

You could make something similar to this. That way, the user can set it and you have a default location. I think the default is fine, but I would probably not hide it (i.e. drop the leading dot from .opendata).
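
A sketch of that pattern: a default location that the user can override, either explicitly or via an environment variable. The function and the OPENDATA_HOME variable name are made up for illustration:

    import os
    from pathlib import Path

    def get_data_home(data_home=None):
        """Return the directory used to cache OpenData files."""
        if data_home is None:
            # default location, overridable via an environment variable
            data_home = os.environ.get("OPENDATA_HOME",
                                       Path.home() / "opendata")
        data_home = Path(data_home)
        data_home.mkdir(parents=True, exist_ok=True)
        return data_home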

@glemaitre

Store the data as the original CSV or as e.g. a Parquet file (smaller size and faster loading)?

Parquet is nice. I would go for it if we are not going to do anything with the metadata that needs Excel-friendly IO and visualization.
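
For reference, a minimal sketch of the round trip. pandas delegates Parquet IO to pyarrow or fastparquet, so one of those needs to be installed; the file names are illustrative:

    import pandas as pd

    df = pd.read_csv("metadata.csv")
    df.to_parquet("metadata.parquet")  # typically much smaller than the CSV
    df2 = pd.read_parquet("metadata.parquet")  # and faster to load back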

@liversedge
Member

I also vote for something like parquet -- the data is likely to grow to millions of workout files over the next 2-3 years.

@AartGoossens
Contributor Author

You could make something similar to this. That way, the user can set it and you have a default location. I think the default is fine, but I would probably not hide it (i.e. drop the leading dot from .opendata).

Good idea, I like that approach. I'm also fine with not hiding the directory. I'll tackle this in another PR.

The change to parquet is quite easy and can be done in a later PR.

@AartGoossens
Contributor Author

Since I did a complete rewrite of the Python library, I am tempted to close this issue, even though some of the discussion here (e.g. about Parquet files) has not been resolved (although I do not think it is entirely relevant anymore).

@liversedge @glemaitre are you ok with closing?
