Switch to a better dataset #2
Comments
I visited some of these links, here are some quick impressions:
Looks interesting but it's mainly time and location based. I would probably favor a dataset with mostly counts data, and maybe some categorical variables.
These seem to be available in netCDF only. Although Python has excellent tools to deal with the format, it seems like an unnecessary cognitive load.
Same as the Edinburgh data.
The URL does not seem very maintainable. Also, I'd avoid presenting a third-party analysis (although the post is a really good one) at the start / setup of the lesson, as it may distract learners. It could perhaps be presented afterwards.
I quite like this one, especially the fact that it is deposited in the UCI MLR, which is very well known and seems very stable. The only downsides are the lack of categorical variables and the lack of a header with column names in the raw data file.
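On the missing header: it is a small hurdle, since the column names can be supplied when the file is read. A minimal sketch using only the standard library; the column names and the two rows below are made up for illustration and do not come from any of the datasets discussed here.

```python
import csv
import io

# Hypothetical column names and made-up rows, for illustration only.
COLUMN_NAMES = ["sample_id", "feature_a", "feature_b", "label"]
RAW_TEXT = "1001,5,1,2\n1002,3,4,4\n"


def read_headerless(text, column_names):
    """Parse header-less CSV text into a list of dicts keyed by column name."""
    # Passing fieldnames tells DictReader that the first row is data, not a header.
    reader = csv.DictReader(io.StringIO(text), fieldnames=column_names)
    return [dict(row) for row in reader]


rows = read_headerless(RAW_TEXT, COLUMN_NAMES)
print(rows[0]["feature_a"])  # values are read as strings
```

Learners would still need to be told where the column names come from, so this is an extra step in the setup rather than a blocker.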
This dataset seems very appropriate, but I dislike having to accept the Kaggle Terms of Service in order to download it. It would be much nicer to simply have a URL or repository that can be downloaded with
This is the dataset I like the most from that list. Being a biologist, I am inevitably biased towards using it :). I also really like the fact that it is already built into scikit-learn.
Same comment as the Titanic dataset. One that I really like is Palmer Penguins!
Thinking about the requirements for a dataset, it ideally needs to work with all of the following: linear regression
Assuming the licensing permits, we can always redistribute the dataset along with this lesson (as is currently being done). This still lets us use the wget/curl method to download while having a stable URL. I also like the idea of the Palmer Penguins; it's being used in the introduction to deep learning lesson too, and I envisage that these two lessons should be complementary.
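On redistributing the dataset with the lesson: if learners prefer to stay in Python rather than use wget/curl, the download step could look roughly like the sketch below. The function name `fetch_dataset` and the URL in the usage comment are hypothetical; the real address would be wherever the lesson ends up hosting its copy of the data.

```python
from pathlib import Path
from urllib.request import urlretrieve


def fetch_dataset(url, destination):
    """Download url to destination unless the file is already present."""
    destination = Path(destination)
    if not destination.exists():
        # Create the target folder (e.g. data/) on first use.
        destination.parent.mkdir(parents=True, exist_ok=True)
        urlretrieve(url, str(destination))
    return destination


# Hypothetical usage; replace the URL with the lesson's own stable address:
# fetch_dataset("https://example.org/lesson/data/dataset.csv", "data/dataset.csv")
```

Skipping the download when the file already exists means the setup cell can be re-run safely during a workshop.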
An interesting data set:
That MedMNIST dataset is quite interesting indeed. However, after reflecting and some conversations with other members of the community, I would tend to avoid medical datasets (including the Breast Cancer data that I endorsed in a previous comment), as people can be sensitive to them.
In the long term I do wonder if there is a way we could have custom versions of this lesson using different datasets. Then a medical group could use a version with medical data and another group could use their own dataset. But this would add a lot of complexity and I think we've got a lot of much more basic problems to solve first. Just to add another dataset into the list, there is a weather prediction dataset (https://github.com/florian-huber/weather_prediction_dataset) which is being used by the Deep Learning incubator lesson.
There are a number of example datasets used for educational purposes. Assuming the lesson will become part of Data Carpentry, one should expect at least a social science track, an ecology track, a genomics track and possibly a geospatial track. Astronomy, economics and image processing tracks are also in development. Minor changes can be accommodated by selecting options when forking the repository to prepare a lesson, in the same way options are chosen to create a workshop website.
The lesson ideally needs to use one dataset throughout. It's currently a bit of a mixture with Gapminder, World Bank, handwritten digits and randomly generated data.
Suggestions from Carpentry Connect Manchester include: