From c9ffe0ac9f6dce7bbebfd5992abe5b5ed4895921 Mon Sep 17 00:00:00 2001
From: macelik
Date: Mon, 13 Nov 2023 16:03:13 +0100
Subject: [PATCH] pre-processing README

---
 src/data_preprocessing/README.md | 26 +++++++++++++++-----------
 1 file changed, 15 insertions(+), 11 deletions(-)

diff --git a/src/data_preprocessing/README.md b/src/data_preprocessing/README.md
index e1c1885..7d67907 100644
--- a/src/data_preprocessing/README.md
+++ b/src/data_preprocessing/README.md
@@ -11,29 +11,33 @@ This section provides an overview of the preprocessing steps applied to each dat
 - **Notebook:** `./$dataset/2.filtering.ipynb`
 - **Language:** R
 - **Tasks:** Further filtering of the processed data. Filtering includes:
-  - **Cell Filtering:** CELLS THAT ARE PRESENT INTHE COUNT MATRIX, BUT NOT IN THE CELL ANNOTATION FILE ARE FILTERED OUT
-  - **Cell Type Filtering:** Excludes cell types with fewer than 5 cells per sample or present in fewer than 30 samples. THESE NUMBERS ARE DIFFERENT FOR DIFFERENT DATASETS, AREN'T THEY? I WOULD THUS NOT MENTION EXACT NUMBERS BUT JUST DESCRIBE WHAT IS DONE HERE.
-  - **Gene Filtering:** Filters genes with low cumulative expression in constructed pseudo-bulks. AS FAR AS I REMEMBER, THE GENES ARE NOT FILTERED USING PSEUDOBULKS BUT ACTUAL CELLS -> PLEASE DOUBLE CHECK
-  - **Sample Filtering:** Excludes samples with fewer than 5 cell types. AGAIN, DON'T USE THE EXCACT NUMBER SINCE IT MAY BE DIFFERENT FOR DIFFERENT DATASET, BUT EXPLAIN WHAT IS DONE IN GENERAL.
-- **Output:** Processed data is saved in `.RData` format under `/results/data_preprocessing/$dataset/`.
+  - **Cell Filtering:** Removes cells that appear in the count matrix but are absent from the cell annotation file.
+  - **Cell Type Filtering:** Excludes cell types that do not meet the minimum prevalence criteria across samples, which may vary depending on the dataset characteristics.
+  - **Gene Filtering:** Filters out genes based on their cumulative expression across cells, ensuring only genes with sufficient overall expression are retained for analysis.
+  - **Sample Filtering:** Removes samples that fall below a certain threshold of cell type diversity, with specific criteria adjusted according to each dataset.
+- **Output:** Processed data is saved in `.RData` format under `/results/data_preprocessing/$dataset/`. The Lasry dataset comes already normalized; therefore, the filtered count matrix from this step is used. This pre-processed count matrix, along with the ColData, can be downloaded directly from [here](https://zenodo.org/records/7962808).
 
 ### 3. Data Normalization
 - **Notebook:** `./$dataset/3.normalization.ipynb`
-- **Language:** R, using the Scran package LINK
+- **Language:** R, using the [Scran package](https://bioconductor.org/packages/release/bioc/html/scran.html)
 - **Input:** Filtered data from the previous step.
-- ANY OUTPUT YOU CREATE HERE? SAY THAT FOR SMILLIE, THIS IS THE OUTPUT THAT YOU USE FOR THE COMMUNITY. MENTION THAT LASRY DATA WAS ALREADY NORMALIZED, SO NO NORMALIZATION STEP WAS NEEDED. GIVE THE LINK TO THE ZENODO AGAIN, WHERE THE USE CAN DOWNLOAD THE FILES
+- **Output:** Normalized expression matrices. The Smillie dataset requires no further preprocessing, so the output of this step is used directly as input for the community analysis. Downloadable files can be found on [Zenodo](https://zenodo.org/records/7962808).
+
 ### 4. Batch Correction
 - **Notebook:** `./$dataset/4.1.batch_correction.ipynb`
-- **Language:** Python, using the scgen library LINK
+- **Language:** Python, using the [scgen library](https://github.com/theislab/scgen)
 - **Input:** Normalized data from the previous step.
 - **Warning:** This step may take a significant amount of time. It required nearly 14 hours on a system with 128GB RAM and 30 CPUs. For faster processing, consider using a GPU node.
-- ANY OUTPUT YOU CREATE HERE? SAY THAT FOR CANGALEN-OETJEN, THIS IS THE OUTPUT THAT YOU USE FOR THE COMMUNITY. GIVE THE LINK TO THE ZENODO AGAIN, WHERE THE USE CAN DOWNLOAD THE FILES. PLEASE DOHBLE CHECK IF THE BATCH CORRECTION WAS ALSO DONE FOR SMILLIE AND LASRY, IF YES, SAY THAT IT WAS ONLY DONE FOR THE VISUALIZATION PURPOSES IN THE STEP 5, BUT NOT FOR THE COMMUNITY ANALYSIS.
+- **Output:** For the VanGalen-Oetjen dataset, this step is crucial, and the batch-corrected output is then used as input for the community tool. The pre-processed files are also stored on [Zenodo](https://zenodo.org/records/10013368).
+
 ### 5. Data Visualization
 - **Notebook:** `./$dataset/4.2.visualization.ipynb`
 - **Language:** R
 - **Tasks:** This notebook visualizes the processed data, offering insights into the results of the preprocessing pipeline.
-
-ADD A BRIEF SUMMARY OF WHAT FILES ARE PASSED TO COMMUNITY, WHAT IS THE STRUCTURE OF THESE FILES AND WHAT COLUMN NAMES ARE ESSENTIAL.
+### Brief Summary of Structure of Input Files
+**counts file:** normalized expression data frame containing gene symbols in the rows and cells in the columns.
+**anno samples:** data frame of the sample annotation from all samples (rows are sample IDs; columns must contain "sample_ID" and "case_or_control").
+**anno cells:** data frame of the cell annotation from all samples (rows are cell IDs; columns must contain "cell_ID", "cell_type" and "sample_ID").
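The input-file structure that the patch documents at the end of the README can be sketched with toy data. This is a hypothetical illustration, not part of the pipeline: only the column names ("sample_ID", "case_or_control", "cell_ID", "cell_type") come from the README, the values are made up, and pandas stands in for the R data frames the actual notebooks produce.

```python
import pandas as pd

# counts file: normalized expression, gene symbols in rows, cells in columns
counts = pd.DataFrame(
    [[0.0, 1.2, 0.3],
     [2.1, 0.0, 0.8]],
    index=["GENE1", "GENE2"],               # gene symbols
    columns=["cell_1", "cell_2", "cell_3"],  # cell IDs
)

# anno samples: one row per sample; must contain "sample_ID" and "case_or_control"
anno_samples = pd.DataFrame(
    {"sample_ID": ["s1", "s2"],
     "case_or_control": ["case", "control"]}
).set_index("sample_ID", drop=False)

# anno cells: one row per cell; must contain "cell_ID", "cell_type", "sample_ID"
anno_cells = pd.DataFrame(
    {"cell_ID": ["cell_1", "cell_2", "cell_3"],
     "cell_type": ["T", "B", "T"],
     "sample_ID": ["s1", "s1", "s2"]}
).set_index("cell_ID", drop=False)

# sanity checks mirroring the column requirements stated in the README
assert {"sample_ID", "case_or_control"} <= set(anno_samples.columns)
assert {"cell_ID", "cell_type", "sample_ID"} <= set(anno_cells.columns)
# the cells in the count matrix should match the cell annotation
assert list(counts.columns) == list(anno_cells["cell_ID"])
```

The last assertion encodes the cell-filtering step described in the patch: cells present in the count matrix but missing from the cell annotation are removed, so the two objects end up aligned.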