Home

Collection of Microarray Data Sets

This project is a collection of scripts to download, process, and load microarray data sets. Most of them are small-sample, high-dimensional data sets (i.e., the "small n, large p" problem, where the number of samples n is much smaller than the number of measured genes p). Additionally, most of the data sets presented here have been widely studied in the genetics/microarray, bioinformatics, statistics, and computer science literature.

For each data set, I have included a small set of scripts in the main project folder that automatically download, clean, and save the data set as needed. Each script's filename begins with a number that indicates the order in which the files should be sourced.
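The numbered-filename convention can be followed by hand, but a small helper makes the ordering explicit. The sketch below is not part of the project: the function name `source_in_order`, the filename pattern, and the example filenames in the comments are all assumptions made for illustration. It sorts scripts by their leading integer, so that a script numbered 10 runs after one numbered 2, and then sources each one in turn.

```r
# Hypothetical helper (not part of the project): source every R script in
# `dir` whose filename starts with a number, in ascending numeric order.
# The pattern assumes names such as "1-download.R" or "2-clean.R"; the
# project's actual naming may differ.
source_in_order <- function(dir = ".", pattern = "^[0-9]+.*\\.[Rr]$") {
  scripts <- list.files(dir, pattern = pattern)

  # Extract the leading integer so that "10-..." sorts after "2-...",
  # which plain alphabetical sorting would get wrong.
  prefix <- as.integer(sub("^([0-9]+).*$", "\\1", scripts))
  scripts <- scripts[order(prefix, scripts)]

  for (script in scripts) {
    message("Sourcing ", script)
    # Evaluate in the global environment so later scripts can see objects
    # created by earlier ones.
    source(file.path(dir, script), local = FALSE)
  }

  # Return the scripts in the order they were run, invisibly.
  invisible(scripts)
}
```

Calling `source_in_order(".")` from the folder that holds the scripts would run all matching files in order. To run only the scripts for one data set, a narrower `pattern` could be passed, assuming the filenames make that distinction possible.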

Currently, my code to download, clean, and save the microarray data sets is in R. In the future, I intend to package these data sets into a single R package. Furthermore, I would like to make the data easily accessible from Python.

The data sets that I have collected so far are listed below (more coming soon):

Chiaretti et al. (2004) - Acute Lymphoblastic Leukemia (ALL)
Golub et al. (1999) - Leukemia
Khan et al. (2001) - SRBCT
Cho et al. (1998) - Yeast Cell Cycle
Bhattacharjee et al. (2001) - Lung Cancer
Wen et al. (1998) - Rat CNS
Yeoh et al. (2002) - St. Jude Leukemia
