A brief video summary of this project can be found here on YouTube.
Contains relevant csv files
- GDSC2_fitted_dose_response_25Feb20.csv - contains the raw IC50 data for 135,242 drug / cell line combinations; data can be downloaded here
- cell_list.csv - contains GDSC-generated information about the included cell lines, including tissue and TCGA classification; data can be downloaded here
- drug_data.csv - contains GDSC-generated information about the included drugs, including drug pathways and targets; data can be found here
- cell.csv - results of the analysis in `GDSC_Project.ipynb` for the cell data, including two dimensions each from PCA and t-SNE for plotting, the cluster identities from k-means using the full, PCA-transformed, and low-rank approximations of the data, and the mean lnIC50 per cell line
- cell_lrm.csv - low-rank approximation of the cell matrix ![formula](https://render.githubusercontent.com/render/math?math=\color{white}\large\U \Sigma^{\frac{1}{2}})
- drug.csv - results of the analysis in `GDSC_Project.ipynb` for the drug data, including two dimensions each from PCA and t-SNE for plotting, the cluster identities from k-means using the full, PCA-transformed, and low-rank approximations of the data, and the mean lnIC50 per compound
- drug_lrm.csv - low-rank approximation of the drug matrix ![formula](https://render.githubusercontent.com/render/math?math=\color{white}\large\Sigma^{\frac{1}{2}} V^T)
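The dose-response file is in long format (one row per drug / cell line combination), while the analysis works on a cell-line-by-drug matrix. A minimal sketch of that reshaping, using NumPy only and a made-up toy table, might look as follows. The function name `to_wide` and the toy identifiers are hypothetical and are not taken from the repository; duplicate (cell, drug) pairs are assumed not to occur.

```python
import numpy as np

def to_wide(cells, drugs, values):
    """Pivot long-format (cell, drug, lnIC50) triples into a wide matrix.

    Untested combinations are left as NaN. If a (cell, drug) pair appears
    more than once, the last value wins (an assumption of this sketch).
    """
    cell_ids, row = np.unique(cells, return_inverse=True)
    drug_ids, col = np.unique(drugs, return_inverse=True)
    wide = np.full((cell_ids.size, drug_ids.size), np.nan)
    wide[row, col] = values
    return cell_ids, drug_ids, wide

# Toy, made-up records standing in for rows of the long-format file.
cells = ["cell_1", "cell_1", "cell_2", "cell_3"]
drugs = ["drug_A", "drug_B", "drug_A", "drug_B"]
ln_ic50 = [2.0, 4.0, 1.0, 3.0]

cell_ids, drug_ids, wide = to_wide(cells, drugs, ln_ic50)

# Boolean mask of observed entries (True where a measurement exists).
mask = ~np.isnan(wide)

# Mean lnIC50 per cell line and per compound, ignoring missing entries.
mean_per_cell = np.nanmean(wide, axis=1)
mean_per_drug = np.nanmean(wide, axis=0)
```

The mask is kept separately so that later steps can distinguish measured values from missing ones instead of treating NaN as zero.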
Contains relevant scripts and notebooks
- GDSC_Project.R - contains
- GDSC_Project.ipynb - contains
- kmeans.py
  - `find_kmeans`: finds an optimal number of clusters via the elbow method and fits k-means with that many clusters
  - `plot_kmeans`: plots the SSE vs. number of clusters and the elbow point for the cell and drug data
- lowrank.py
  - `fit_svd`: fits the SVD model iteratively for a given rank r
- pca.py
  - `find_pc`: returns the eigenvalues and eigenvectors of the covariance matrix
  - `project_pca`: transforms the original matrix via projection onto a specified number of principal components
  - `plot_pca`: plots the variance by principal component and the cumulative variance by principal component
- utils.py
  - `import_data`: loads the GDSC data and pre-processes it into a wide matrix
  - `process_data`: produces mean-centered data and masks
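
As an illustration of the PCA steps described for pca.py (eigen-decomposition of the covariance matrix, then projection onto the leading components), here is a minimal NumPy sketch. It is not the repository's code: the names `find_pc_sketch` and `project_pca_sketch` are hypothetical, the input is random toy data, and the matrix is assumed to have no missing values.

```python
import numpy as np

def find_pc_sketch(X):
    """Eigenvalues and eigenvectors of the covariance matrix of X,
    sorted from largest to smallest eigenvalue."""
    Xc = X - X.mean(axis=0)                 # center each column
    cov = np.cov(Xc, rowvar=False)          # columns are the variables
    eigvals, eigvecs = np.linalg.eigh(cov)  # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]
    return eigvals[order], eigvecs[:, order]

def project_pca_sketch(X, eigvecs, n_components):
    """Project centered X onto the first n_components eigenvectors."""
    Xc = X - X.mean(axis=0)
    return Xc @ eigvecs[:, :n_components]

# Toy data: 50 samples, 4 features with different scales.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4)) * np.array([5.0, 2.0, 1.0, 0.5])

eigvals, eigvecs = find_pc_sketch(X)
Z = project_pca_sketch(X, eigvecs, n_components=2)

# Quantities one would plot: variance share per component and its running sum.
explained = eigvals / eigvals.sum()
cumulative = np.cumsum(explained)
```

`np.linalg.eigh` is used rather than `np.linalg.eig` because the covariance matrix is symmetric, which gives real eigenvalues and orthonormal eigenvectors.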
![formula](https://render.githubusercontent.com/render/math?math=\color{white}\large\D_{i, j} \approx \sum_{l=1}^ra_l[i]b_l[j] = U \Sigma V^T) ![formula](https://render.githubusercontent.com/render/math?math=\color{white}\large\U \Sigma^{\frac{1}{2}}) ![formula](https://render.githubusercontent.com/render/math?math=\color{white}\large\Sigma^{\frac{1}{2}} V^T)
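
The formulas above describe a rank-r model of the cell-by-drug matrix and its split into a cell factor matrix and a drug factor matrix. One common way to fit such a model when entries are missing is iterative SVD imputation: fill the gaps, take a truncated SVD, refill the gaps with the reconstruction, and repeat. The sketch below assumes that this is roughly what "iteratively" means here; the function name `fit_svd_sketch`, the iteration count, and the toy data are all hypothetical rather than taken from lowrank.py.

```python
import numpy as np

def fit_svd_sketch(D, mask, r, n_iter=50):
    """Rank-r fit to the observed entries of D (mask == True).

    Returns (U Sigma^{1/2}, Sigma^{1/2} V^T), whose product is the
    rank-r approximation of the imputed matrix.
    """
    filled = np.where(mask, D, 0.0)  # start with zeros in the gaps
    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(filled, full_matrices=False)
        approx = (U[:, :r] * s[:r]) @ Vt[:r]
        filled = np.where(mask, D, approx)  # keep observed, refill missing
    U, s, Vt = np.linalg.svd(filled, full_matrices=False)
    root = np.sqrt(s[:r])
    cell_factors = U[:, :r] * root          # U Sigma^{1/2}
    drug_factors = root[:, None] * Vt[:r]   # Sigma^{1/2} V^T
    return cell_factors, drug_factors

# Toy rank-2 matrix (30 "cell lines" x 20 "drugs") with about 20% hidden.
rng = np.random.default_rng(0)
D_true = rng.normal(size=(30, 2)) @ rng.normal(size=(2, 20))
mask = rng.random(D_true.shape) > 0.2
D = np.where(mask, D_true, np.nan)

# Center on the mean of the observed entries before fitting.
mu = D[mask].mean()
D_centered = D - mu

cell_factors, drug_factors = fit_svd_sketch(D_centered, mask, r=2)
D_hat = cell_factors @ drug_factors + mu

# Error on observed entries, compared with predicting the mean everywhere.
rmse_fit = np.sqrt(np.mean((D_hat[mask] - D[mask]) ** 2))
rmse_mean = np.sqrt(np.mean((mu - D[mask]) ** 2))
```

Splitting the singular values evenly between the two factors keeps the cell and drug embeddings on comparable scales, which is convenient when both are clustered or plotted.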