Switch to a better dataset #2

Open
colinsauze opened this issue Jul 16, 2019 · 6 comments
Labels
help wanted Extra attention is needed

Comments

@colinsauze
Member

colinsauze commented Jul 16, 2019

The lesson ideally needs to use one dataset throughout. It's currently a mixture of Gapminder, World Bank, handwritten digits and randomly generated data.

Suggestions from Carpentry Connect Manchester include:

@vinisalazar
Collaborator

I visited some of these links, here are some quick impressions:

Looks interesting, but it's mainly time- and location-based. I would probably favor a dataset with mostly count data, and maybe some categorical variables.

These seem to be available in netCDF only. Although Python has excellent tools for dealing with the format, it seems like an unnecessary cognitive load.

Same as the Edinburgh data. The URL does not seem very maintainable. Also, I'd avoid presenting a third-party analysis (although the post is a really good one) at the start / setup of the lesson, as it may distract learners. It could perhaps be presented afterwards.

I quite like this one, especially the fact that it is deposited in the UCI MLR, which is very well known and seems very stable. The only downsides are the lack of categorical variables and the lack of a header with column names in the raw data file.
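For what it's worth, a missing header is usually easy to work around by supplying the column names at load time. A minimal sketch using only the standard library; the sample rows and the names `feature_a`, `feature_b`, `feature_c` are hypothetical placeholders, not taken from the dataset discussed here:

```python
import csv
import io

# Hypothetical stand-in for a headerless raw data file; the values and
# column names are illustrative only, not from any real dataset.
RAW = "5.1,3.5,1.4\n4.9,3.0,1.3\n6.2,2.9,4.3\n"
COLUMN_NAMES = ["feature_a", "feature_b", "feature_c"]


def read_headerless_csv(text, column_names):
    """Return one dict per row, keyed by the supplied column names."""
    # Passing fieldnames makes DictReader treat the first line as data.
    reader = csv.DictReader(io.StringIO(text), fieldnames=column_names)
    return [{name: float(value) for name, value in row.items()} for row in reader]


rows = read_headerless_csv(RAW, COLUMN_NAMES)
```

With pandas, the equivalent would be `pandas.read_csv(path, header=None, names=COLUMN_NAMES)`.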

This dataset seems very appropriate, but I dislike having to accept the Kaggle Terms of Service in order to download it. It would be much nicer to simply have a URL or repository that can be downloaded with wget or some other tool. A second disadvantage is that it is already split into training and test datasets. I think it would be nicer to have a 'full' dataset and introduce the concept of splitting it later in the lesson.
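To illustrate what splitting a 'full' dataset could look like in the lesson, here is a minimal standard-library sketch. The function name `split_dataset` and the toy data are assumptions for illustration; in the lesson itself, scikit-learn's `train_test_split` would be the more likely choice.

```python
import random


def split_dataset(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of rows and return (train, test) lists."""
    if not 0.0 < test_fraction < 1.0:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(rows)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]


# Toy stand-in for a full dataset of 20 samples.
full_dataset = list(range(20))
train, test = split_dataset(full_dataset, test_fraction=0.25, seed=42)
```

Every sample ends up in exactly one of the two lists, so nothing is lost or duplicated by the split.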

  • Breast cancer data from sklearn

This is the dataset I like the most from that list. Being a biologist, I am inevitably biased towards using it :) . I also really like the fact that it is already built into Scikit Learn.
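As a rough sketch of what "built in" means in practice, the data can be loaded with a single scikit-learn call and no download step (this assumes scikit-learn is installed):

```python
from sklearn.datasets import load_breast_cancer

# The loader returns a Bunch holding the feature matrix, the target vector
# and descriptive metadata; no network access is needed.
dataset = load_breast_cancer()
features = dataset.data
target = dataset.target
class_names = list(dataset.target_names)
```

The returned object also carries `feature_names`, so the column names come with the data.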

  • Kaggle competition datasets

Same comment as the Titanic dataset. One that I really like is Palmer Penguins!

@colinsauze
Member Author

Thinking about the requirements for a dataset, it ideally needs to work with all of the following:

linear regression
logarithmic regression
clustering
(non deep learning) neural networks
unsupervised dimensionality reduction such as PCA or t-SNE
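One rough way to check a candidate against this list is to run each technique once on it. The sketch below does this on synthetic data, which is an assumption for illustration only; a real check would load the candidate dataset instead. It covers linear regression, clustering, a small neural network and PCA, and leaves out logarithmic regression and t-SNE.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in for a candidate dataset (illustrative assumption):
# 100 samples, 4 numeric features, 2 well-separated groups.
rng = np.random.default_rng(0)
labels = np.repeat([0, 1], 50)
X = rng.normal(size=(100, 4)) + labels[:, None] * 5.0

# Linear regression: predict the second feature from the first.
regression = LinearRegression().fit(X[:, [0]], X[:, 1])

# Clustering: group the samples without using the labels.
clusters = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Small (non-deep) neural network: classify samples into the known groups.
network = MLPClassifier(hidden_layer_sizes=(8,), max_iter=2000, random_state=0)
network.fit(X, labels)

# Unsupervised dimensionality reduction: project onto two principal components.
projected = PCA(n_components=2).fit_transform(X)
```

A dataset that suits the lesson would need numeric features for the regression and PCA steps and at least one categorical column to act as the classification target.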

Assuming the licensing permits we can always redistribute the dataset along with this lesson (as is currently being done). This still lets us use the wget/curl method to download while having a stable URL.

I also like the idea of the Palmer Penguins. It's being used in the introduction to deep learning lesson too, and I envisage that these two lessons should be complementary.

@bkmgit
Collaborator

bkmgit commented Jun 25, 2021

An interesting data set:

@vinisalazar
Collaborator

That MedMNIST dataset is quite interesting indeed. However, after reflecting and some conversations with other members of the community, I would tend to avoid medical datasets (including the Breast Cancer data that I endorsed in a previous comment), as people can be sensitive to them.

@colinsauze
Member Author

In the long term I do wonder if there is a way we could have custom versions of this lesson using different datasets. Then a medical group could use a version with medical data and another group could use their own dataset. But this would add a lot of complexity and I think we've got a lot of much more basic problems to solve first.

Just to add another dataset into the list, there is a weather prediction dataset (https://github.com/florian-huber/weather_prediction_dataset) which is being used by the Deep Learning incubator lesson.

@bkmgit
Collaborator

bkmgit commented Jun 28, 2021

There are a number of example datasets used for educational purposes. Assuming the lesson will become part of Data Carpentry, one should expect at least a social science track, an ecology track, a genomics track and possibly a geospatial track. Astronomy, economics and image processing tracks are also in development.

Minor changes can be accommodated by selecting options when forking the repository to prepare a lesson, in the same way options are chosen to create a workshop website.

colinsauze pushed a commit that referenced this issue Sep 25, 2024
Merge new changes from Mikes repo

3 participants