updated README

infocusp · Sep 12, 2024 · e290da6
1 parent 8a19e0e
commit e290da6
Showing 2 changed files with 35 additions and 36 deletions.
diff --git a/README.md b/README.md
@@ -23,79 +23,77 @@ The following flowchart explains the major steps of the scaLR platform.
 
 ```
 conda create -n scaLR_env python=3.9
-
 ```
 
-- Clone the git repository and install the required packages by activating the conda env
+- Clone the git repository and install the required packages by activating the conda environment.
 
 ```
 conda activate scaLR_env
 
 pip install -r requirements.txt
-
 ```
 
 ## Input Data
-- Currently the pipeline expects all datasets in [anndata](https://anndata.readthedocs.io/en/latest/tutorials/notebooks/getting-started.html) formats (`.h5ad` files only)
+- Currently the pipeline expects all datasets in [anndata](https://anndata.readthedocs.io/en/latest/tutorials/notebooks/getting-started.html) format (`.h5ad` files only).
 - The anndata object should contain cell samples as `obs` and genes as `var`.
-- `adata.X` contains all gene counts/expression values.
-- `adata.obs` contains any metadata regarding cells, including a column for `target` which will be used for classification. The index of `adata.obs` is cell_barcodes.
-- `adata.var` contains all gene_names as Index.
+- `adata.X`: contains all gene counts/expression values.
+- `adata.obs`: contains any metadata regarding cells, including a column for `target` which will be used for classification. The index of `adata.obs` is cell_barcodes.
+- `adata.var`: contains all gene_names as Index.
 
 
-## Platform Scripts (Output Structure)
-**pipeline.py**:
-Main script to run the entire pipeline.
-    - `exp_dir`: root experiment directory for storage of all phases of the pipeline. Specified from the config.
-    - `config.yml`: copy of config file to reproduce the experiment
+## Output Structure
+- **pipeline.py**:
+The main script that performs an end-to-end run.
+    - `exp_dir`: root experiment directory, specified in the config, where the outputs of every platform step are stored.
+    - `config.yml`: copy of the config file, kept to reproduce the user-defined experiment.
 
 - **data_ingestion**:
-Reads the data, and splits it into Train/Validation/Test sets for the pipeline. Then performs sample-wise normalization on the data
+Reads the data, and splits it into Train/Validation/Test sets for the pipeline. Then performs sample-wise normalization on the data.
     - `exp_dir`
         - `data`
-            - `train_val_test_split.json`: contains sample indices for train/validation/test splits
-            - `label_mappings.json`: contains mappings of all metadata columns between labels and IDs
+            - `train_val_test_split.json`: contains sample indices for train/validation/test splits.
+            - `label_mappings.json`: contains mappings of all metadata columns between labels and IDs.
             - `train_val_test_split`: directory containing the train, validation, and test samples and data files.
 
 - **feature_extraction**:
-Performs feature selection and extraction of new datasets containing subset features
+Performs feature selection and extraction of new datasets containing a subset of features.
     - `exp_dir`
         - `feature_extraction`
-            - `chunked_models`: contains weights of each model trained on feature subset data (refer to feature subsetting algorithm)
-            - `feature_subset_data`: directory containing the new feature-subsetted train, val, and test samples anndatas
-            - `score_matrix.csv`: combined scores of all individual models, for each feature and class. shape: n_classes X n_features
+            - `chunked_models`: contains weights of each model trained on feature subset data (refer to feature subsetting algorithm).
+            - `feature_subset_data`: directory containing the new feature-subsetted train, val, and test samples anndatas.
+            - `score_matrix.csv`: combined scores of all individual models, for each feature and class. shape: n_classes X n_features.
             - `top_features.json`: a file containing a list of top features selected / to be subsetted from total features.
 
 - **final_model_training**:
 Trains a final model based on `train_datapath` and `val_datapath` in config.
     - `exp_dir`
         - `model`
-            - `logs`: directory containing Tensorboard Logs for the training of the model
+            - `logs`: directory containing Tensorboard Logs for the training of the model.
             - `checkpoints`: directory containing model weights checkpointed at every interval specified in config.
-            - `best_model`: The best model checkpoint contains information to use model for inference/resume training.
-                - `config.yml`: config file containing model parameters
-                - `label_mappings.json`: contains mapping of class_names to class_ids used by model during training
-                - `model.pt`: contains model weights
+            - `best_model`: the best model checkpoint, containing the information needed to use the model for inference or to resume training.
+                - `model_config.yaml`: config file containing model parameters.
+                - `mappings.json`: contains mapping of class_names to class_ids used by model during training.
+                - `model.pt`: contains model weights.
 
 - **eval_and_analysis**:
-Performs evaluation of best model trained on user-defined metrics on the test set. Also performs various downstream tasks
+Evaluates the best trained model on the test set using user-defined metrics. Also performs various downstream tasks.
    - `exp_dir`
         - `analysis`
-            - `classification_report.csv`: Contains classification report showing Precision, Recall, F1, and accuracy metrics for each class, on the test set.
-            - `gene_recall_curve.svg`: Contains gene recall curve plots.
-            - `gene_recall_curve_info.json`: Contains reference genes list which are present in top_K ranked genes per class for each model.
+            - `classification_report.csv`: contains classification report showing Precision, Recall, F1, and accuracy metrics for each class, on the test set.
+            - `gene_recall_curve.svg`: contains gene recall curve plots.
+            - `gene_recall_curve_info.json`: contains the list of reference genes that are present in the top_K ranked genes per class for each model.
             - `gene_analysis`
-                - `score_matrix.csv`: score of the final model, for each feature and class. shape: n_classes X n_features
-                - `top_features.json`: a file containing a list of selected top features/biomarkers
+                - `score_matrix.csv`: score of the final model, for each feature and class. shape: n_classes X n_features.
+                - `top_features.json`: a file containing a list of selected top features/biomarkers.
             -  `heatmaps`
-                - `class_name.svg`: Heatmap for top genes of particular class w.r.t those genes association in other classes. E.g. B.svg, C.svg etc.
-            - `roc_auc.svg`: Contains ROC-AUC plot for all classes.
+                - `class_name.svg`: heatmap of the top genes of a particular class against the association of those genes with the other classes. E.g. B.svg, C.svg etc.
+            - `roc_auc.svg`: contains ROC-AUC plot for all classes.
             - `pseudobulk_dge_result`
-                - `pbkDGE_celltype_factor_categories_0_vs_factor_categories_1.csv`
-                - `pbkDGE_celltype_factor_categories_0_vs_factor_categories_1.svg`
+                - `pbkDGE_celltype_factor_categories_0_vs_factor_categories_1.csv`: contains Pseudobulk DGE results between selected factor categories for a celltype.
+                - `pbkDGE_celltype_factor_categories_0_vs_factor_categories_1.svg`: volcano plot of Log2Foldchange vs -log10(p-value) of genes.
             - `lmem_dge_result`
-                - `lmem_DGE_celltype.csv`
-                - `lmem_DGE_fixed_effect_factor_X.svg`
+                - `lmem_DGE_celltype.csv`: contains LMEM DGE results between selected factor categories for a celltype.
+                - `lmem_DGE_fixed_effect_factor_X.svg`: volcano plot of coefficient vs -log10(p-value) of genes.
 
 ## How to run
 

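The output layout documented in the README hunk above can be checked with a short script. This is a minimal sketch and not part of scaLR: `EXPECTED_OUTPUTS` and `missing_outputs` are hypothetical names, the relative paths are copied from the updated README, and the default call assumes that every step was run in the experiment.

```python
from pathlib import Path

# Relative paths under exp_dir, copied from the README's "Output Structure"
# section. The grouping keys are illustrative labels, not scaLR identifiers.
EXPECTED_OUTPUTS = {
    "pipeline": ["config.yml"],
    "data_ingestion": [
        "data/train_val_test_split.json",
        "data/label_mappings.json",
        "data/train_val_test_split",
    ],
    "feature_extraction": [
        "feature_extraction/chunked_models",
        "feature_extraction/feature_subset_data",
        "feature_extraction/score_matrix.csv",
        "feature_extraction/top_features.json",
    ],
    "final_model_training": [
        "model/logs",
        "model/checkpoints",
        "model/best_model/model_config.yaml",
        "model/best_model/mappings.json",
        "model/best_model/model.pt",
    ],
    "eval_and_analysis": [
        "analysis/classification_report.csv",
        "analysis/gene_recall_curve.svg",
        "analysis/gene_recall_curve_info.json",
        "analysis/gene_analysis/score_matrix.csv",
        "analysis/gene_analysis/top_features.json",
        "analysis/heatmaps",
        "analysis/roc_auc.svg",
        "analysis/pseudobulk_dge_result",
        "analysis/lmem_dge_result",
    ],
}


def missing_outputs(exp_dir, steps=None):
    """Return {step: [relative paths]} for documented outputs that are absent.

    `steps` limits the check to the named groups; by default all are checked.
    """
    root = Path(exp_dir)
    selected = list(EXPECTED_OUTPUTS) if steps is None else steps
    missing = {}
    for step in selected:
        absent = [rel for rel in EXPECTED_OUTPUTS[step]
                  if not (root / rel).exists()]
        if absent:
            missing[step] = absent
    return missing
```

Calling `missing_outputs("path/to/exp_dir", steps=["final_model_training"])` returns an empty dict when the renamed `model_config.yaml` and `mappings.json` files are present alongside `model.pt`, which makes it a quick way to spot experiment directories written with the older file names.
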
diff --git a/requirements.txt b/requirements.txt
@@ -33,6 +33,7 @@ Markdown==3.6
 MarkupSafe==2.1.5
 matplotlib==3.8.4
 matplotlib-inline==0.1.7
+matplotlib_venn==1.1.1
 memory-profiler==0.61.0
 mpmath==1.3.0
 munkres==1.1.4
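
Because this commit adds a new pin (`matplotlib_venn==1.1.1`), an existing `scaLR_env` may be out of date until `pip install -r requirements.txt` is rerun. The sketch below is one way to compare exact pins against the active environment using only the standard library. `parse_pins` and `check_pins` are hypothetical helper names, and the parser is a simplification that only understands `name==version` lines (no extras, environment markers, or other specifiers).

```python
from importlib import metadata


def parse_pins(lines):
    """Return {package: version} for every `name==version` line."""
    pins = {}
    for raw in lines:
        # Drop trailing comments and surrounding whitespace.
        line = raw.split("#", 1)[0].strip()
        if "==" not in line:
            continue
        name, version = line.split("==", 1)
        pins[name.strip()] = version.strip()
    return pins


def check_pins(pins):
    """Return {package: (wanted, installed)} for pins that do not match.

    `installed` is None when the package is not installed at all.
    """
    problems = {}
    for name, wanted in pins.items():
        try:
            installed = metadata.version(name)
        except metadata.PackageNotFoundError:
            installed = None
        if installed != wanted:
            problems[name] = (wanted, installed)
    return problems
```

A typical call would be `check_pins(parse_pins(open("requirements.txt")))`, run inside the activated conda environment; an empty dict means every exact pin is satisfied.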