Analysis of GDSC Data

A brief video summary of this project can be found here on YouTube.

/assets

Contains relevant csv files

Raw data

GDSC2_fitted_dose_response_25Feb20.csv - contains the raw IC₅₀ data for 135,242 drug / cell line combinations; data can be downloaded here
cell_list.csv - contains GDSC-generated information about the included cell lines, including tissue and TCGA classification; data can be downloaded here
drug_data.csv - contains GDSC-generated information about the included drugs, including drug pathways and targets; data can be found here
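
As a rough illustration of how a long table of drug / cell line measurements can be reshaped for matrix analysis, the sketch below pivots a toy table into a cell-by-drug matrix and records which entries were observed. It assumes the raw file stores one row per drug / cell line combination. The column names `CELL_LINE_NAME`, `DRUG_NAME`, and `LN_IC50` are assumptions about the GDSC export and should be checked against the downloaded file; the values are made up.

```python
import numpy as np
import pandas as pd

# Toy long-format table. Column names are assumed, values are invented.
long_df = pd.DataFrame({
    "CELL_LINE_NAME": ["A549", "A549", "MCF7"],
    "DRUG_NAME": ["Erlotinib", "Cisplatin", "Erlotinib"],
    "LN_IC50": [1.2, 3.4, 2.1],
})

# One row per cell line, one column per drug; untested pairs become NaN.
wide = long_df.pivot_table(
    index="CELL_LINE_NAME",
    columns="DRUG_NAME",
    values="LN_IC50",
    aggfunc="mean",
)

# Boolean mask: True where a measurement exists.
mask = wide.notna().to_numpy()

print(wide)
print(mask)
```

Not every drug is necessarily tested on every cell line, so a mask of this kind lets later steps distinguish measured values from gaps.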

Processed data

cell.csv - results of the analysis from GDSC_Project.ipynb for the cell data, including two dimensions each from PCA and t-SNE for plotting, the cluster identities from k-nearest neighbors using the full, PCA-transformed, and low-rank approximations of the data, and the mean lnIC₅₀ per cell line
cell_lrm.csv - low-rank approximation of the cell matrix

![formula](https://render.githubusercontent.com/render/math?math=\color{white}\large\U \Sigma^{\frac{1}{2}})

drug.csv - results of the analysis from GDSC_Project.ipynb for the drug data, including two dimensions each from PCA and t-SNE for plotting, the cluster identities from k-nearest neighbors using the full, PCA-transformed, and low-rank approximations of the data, and the mean lnIC₅₀ per compound
drug_lrm.csv - low-rank approximation of the drug matrix

![formula](https://render.githubusercontent.com/render/math?math=\color{white}\large\Sigma^{\frac{1}{2}} V^T)
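
The two formulas above split a truncated SVD symmetrically between the cell and drug factors. The following is a minimal sketch of that split on a random, fully observed matrix, assuming rows index cell lines and columns index drugs. The matrix `D`, the rank `r`, and the variable names are hypothetical and are not taken from the repository.

```python
import numpy as np

rng = np.random.default_rng(0)
D = rng.normal(size=(6, 4))  # toy stand-in for a cell-by-drug matrix
r = 2                        # hypothetical rank

# Thin SVD, then keep the leading r singular triplets.
U, s, Vt = np.linalg.svd(D, full_matrices=False)
U_r, s_r, Vt_r = U[:, :r], s[:r], Vt[:r, :]

sqrt_s = np.sqrt(s_r)
cell_factors = U_r * sqrt_s            # U Sigma^(1/2), shape (cells, r)
drug_factors = sqrt_s[:, None] * Vt_r  # Sigma^(1/2) V^T, shape (r, drugs)

# The product of the two factors is the rank-r approximation U Sigma V^T.
D_r = cell_factors @ drug_factors
print(np.allclose(D_r, (U_r * s_r) @ Vt_r))
```

Giving each side the square root of the singular values keeps the two factor matrices on a comparable scale.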

/src

Contains relevant scripts and notebooks

GDSC_Project.R - contains
GDSC_Project.ipynb - contains
kmeans.py
- find_kmeans: find an optimal number of clusters via the elbow method and fit k-means with that many clusters
- plot_kmeans: plot the SSE vs. number of clusters and the elbow point for cell and drug data
lowrank.py
- fit_svd: fit SVD model iteratively for a given rank r
pca.py
- find_pc: returns the eigenvalues and eigenvectors of the covariance matrix
- project_pca: transforms the original matrix via projection using a specified number of principal components
- plot_pca: plot the variance by principal component and the cumulative variance by principal component
utils.py
- import_data: loads GDSC data and pre-processes it into a wide matrix
- process_data: produces mean-centered data and masks
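
The PCA steps listed under pca.py (eigendecomposition of the covariance matrix, projection onto a chosen number of components, variance per component) can be sketched as follows. This is an illustrative sketch only: the function names carry a `_sketch` suffix because the actual signatures in pca.py are not shown here, and the input is random toy data.

```python
import numpy as np

def find_pc_sketch(X):
    """Eigenvalues and eigenvectors of the covariance of X (rows are samples),
    sorted from largest to smallest eigenvalue."""
    Xc = X - X.mean(axis=0)
    cov = np.cov(Xc, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]
    return eigvals[order], eigvecs[:, order]

def project_pca_sketch(X, eigvecs, k):
    """Project centered X onto the first k principal components."""
    Xc = X - X.mean(axis=0)
    return Xc @ eigvecs[:, :k]

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 5))  # toy data: 50 samples, 5 features

eigvals, eigvecs = find_pc_sketch(X)
scores = project_pca_sketch(X, eigvecs, k=2)

# Fraction of variance per component and its running total,
# the two quantities a variance plot would show.
explained = eigvals / eigvals.sum()
cumulative = np.cumsum(explained)
print(scores.shape)
print(cumulative)
```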

Summary

![formula](https://render.githubusercontent.com/render/math?math=\color{white}\large\D_{i, j} \approx \sum_{l=1}^ra_l[i]b_l[j] = U \Sigma V^T) ![formula](https://render.githubusercontent.com/render/math?math=\color{white}\large\U \Sigma^{\frac{1}{2}}) ![formula](https://render.githubusercontent.com/render/math?math=\color{white}\large\Sigma^{\frac{1}{2}} V^T)
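
One common way to fit a rank-r model of this form when some entries of D are missing is to alternate between a truncated SVD and re-imputing the missing entries. The sketch below shows that scheme on toy data. It is an assumption about what an iterative SVD fit could look like, not a copy of `fit_svd` in lowrank.py; the function name, parameters, and stopping rule are hypothetical.

```python
import numpy as np

def iterative_svd_sketch(D, mask, r, n_iter=200, tol=1e-8):
    """Rank-r approximation fitted to the observed entries of D.

    mask is True where D is observed. Missing entries start at 0,
    which assumes the data has been mean-centered.
    """
    filled = np.where(mask, D, 0.0)
    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(filled, full_matrices=False)
        approx = (U[:, :r] * s[:r]) @ Vt[:r, :]
        # Keep observed values, overwrite missing ones with the current fit.
        new_filled = np.where(mask, D, approx)
        converged = np.linalg.norm(new_filled - filled) < tol
        filled = new_filled
        if converged:
            break
    return approx

def observed_rmse(D, approx, mask):
    """Root-mean-square error over observed entries only."""
    return np.sqrt(np.mean((D[mask] - approx[mask]) ** 2))

rng = np.random.default_rng(0)
truth = rng.normal(size=(8, 2)) @ rng.normal(size=(2, 6))  # exactly rank 2
mask = rng.random(truth.shape) > 0.15                      # hide roughly 15%
D = np.where(mask, truth, np.nan)

one_pass = iterative_svd_sketch(D, mask, r=2, n_iter=1)
fitted = iterative_svd_sketch(D, mask, r=2)
print(observed_rmse(D, one_pass, mask), observed_rmse(D, fitted, mask))
```

Each pass can only keep or reduce the error on the observed entries, so the iterated fit is at least as good there as a single truncated SVD of the zero-filled matrix.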

jessicaw9910/dsmath_project1

Folders and files

Latest commit

History

Repository files navigation

Analysis of GDSC Data

/assets

Raw data

Processed data

/src

Summary

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages