Home
This project is a collection of scripts to download, process, and load microarray data sets. Most of them are small-sample, high-dimensional data sets (i.e., the small n, large p problem). Additionally, most of the data sets presented here have been widely studied in the genetics/microarray, bioinformatics, statistics, and computer science literature.
For each data set, I have included a small set of scripts in the main project folder that automatically download, clean, and save the data as necessary. Each script's filename begins with a number that indicates the order in which the files should be sourced.
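The numbered-filename convention can be sketched as follows. This is only an illustration in Python, not part of the project's R code: the function names `numeric_prefix` and `ordered_scripts`, the folder name, and the example filenames are all hypothetical, and the sketch assumes the scripts end in `.R` or `.r` and start with an integer.

```python
import re
from pathlib import Path


def numeric_prefix(path):
    """Sort key: the integer a filename starts with (hypothetical helper)."""
    match = re.match(r"(\d+)", path.name)
    # Files without a leading number sort after all numbered files.
    number = int(match.group(1)) if match else float("inf")
    return (number, path.name)


def ordered_scripts(folder):
    """Return the R scripts in `folder` in the order they should be sourced."""
    scripts = [p for p in Path(folder).iterdir() if p.suffix.lower() == ".r"]
    return sorted(scripts, key=numeric_prefix)


# Example usage (the folder name is an assumption, not a path from this project):
# for script in ordered_scripts("some-data-set-folder"):
#     print(f"Rscript {script}")
```

Sorting on the parsed integer rather than on the raw filename keeps a script prefixed with `10` after one prefixed with `2`, which a plain alphabetical sort would not.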
Currently, my code to download, clean, and save the microarray data sets is in R. In the future, I intend to package these data sets into a single R package. Furthermore, I would like to make the data easily accessible from Python.
The data sets that I have collected so far are listed below (more coming soon):
- Chiaretti et al. (2004) - Acute Lymphoblastic Leukemia (ALL)
- Golub et al. (1999) - Leukemia
- Khan et al. (2001) - SRBCT
- Cho et al. (1998) - Yeast Cell Cycle
- Bhattacharjee et al. (2001) - Lung Cancer
- Wen et al. (1998) - Rat CNS
- Yeoh et al. (2002) - St. Jude Leukemia