This project simulates RNA-seq data with specific differential expression patterns using the polyester R package. The workflow extracts gene sets from a JSON file, maps each gene to its index in a reference genome FASTA file, and then uses those indices to simulate RNA-seq data with defined fold changes.
.
├── data
│   ├── c2.cp.kegg_medicus.v2023.2.Hs.json
│   └── genome
│       └── gencode.v38.transcripts.fa
├── doc
│   └── environment.yml
├── indexed_gset.ipynb
├── License
├── output
│   └── gene_sets_with_indices.tsv
├── README.md
├── scripts
│   ├── rnaseq_simulation.R
│   ├── submit_simulation_EXAMPLE.sh
│   └── submit_simulation.sh
└── simulate_RNAseq.Rmd
- Python 3.x
- Conda
- R with the following packages:
  - dplyr
  - polyester
  - Biostrings
Create the Conda environment from the environment.yml file in the doc directory (run from the repository root):
conda env create -f doc/environment.yml
In an R session, install polyester from Bioconductor:
BiocManager::install("polyester")
Use the Jupyter notebook indexed_gset.ipynb to:
- Define file paths and target gene sets.
- Extract gene names from the JSON file.
- Map gene names to their first occurrence index in the FASTA file.
- Output a TSV file with the gene set, target gene, and index.
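The notebook steps above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code. It assumes the JSON stores each gene set's members under a "geneSymbols" key, that FASTA headers are GENCODE-style with the gene symbol in the sixth pipe-separated field, and that indices are 1-based record positions (convenient for R). The function names, TSV column names, and demo identifiers are all hypothetical.

```python
import csv
import io
import json


def first_occurrence_index(fasta_lines):
    """Return {gene symbol: 1-based position of its first FASTA record}."""
    first_seen = {}
    record_number = 0
    for line in fasta_lines:
        if not line.startswith(">"):
            continue  # sequence line, not a header
        record_number += 1
        fields = line[1:].strip().split("|")
        # Assumption: GENCODE-style header, gene symbol in the sixth field.
        if len(fields) > 5:
            first_seen.setdefault(fields[5], record_number)
    return first_seen


def index_gene_sets(gene_sets, target_sets, gene_index):
    """Build (gene_set, target_gene, index) rows; genes absent from the FASTA are skipped."""
    rows = []
    for set_name in target_sets:
        for gene in gene_sets[set_name]["geneSymbols"]:
            if gene in gene_index:
                rows.append((set_name, gene, gene_index[gene]))
    return rows


def write_tsv(rows, handle):
    """Write the rows as tab-separated text with a header line."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["gene_set", "target_gene", "index"])  # hypothetical column names
    writer.writerows(rows)


# Tiny in-memory stand-ins for the JSON and FASTA files (made-up identifiers).
demo_sets = json.loads('{"SET_ONE": {"geneSymbols": ["GENEB", "GENEA", "GENEC"]}}')
demo_fasta = [
    ">ENST0001.1|ENSG0001.1|-|-|GENEA-201|GENEA|100|protein_coding|",
    "ACGT",
    ">ENST0002.1|ENSG0001.1|-|-|GENEA-202|GENEA|90|protein_coding|",
    "ACGT",
    ">ENST0003.1|ENSG0002.1|-|-|GENEB-201|GENEB|80|protein_coding|",
    "ACGT",
]

gene_index = first_occurrence_index(demo_fasta)
rows = index_gene_sets(demo_sets, ["SET_ONE"], gene_index)
buffer = io.StringIO()
write_tsv(rows, buffer)
print(buffer.getvalue())
```

With real files, the same functions would take an open FASTA file handle (iterated line by line) and the dictionary returned by json.load.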
Use the R Markdown file simulate_RNAseq.Rmd or the R script rnaseq_simulation.R in the scripts directory to:
- Read the TSV file with gene sets and indices.
- Define fold change values for each gene set.
- Create a table with fold change values for each gene.
- Load the FASTA sequences.
- Create a fold changes matrix.
- Simulate the RNA-seq experiment with specified replicates and fold changes.
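To make the shape of the fold changes matrix concrete, here is a small sketch of the idea in Python rather than R, so it illustrates the data layout only and is not the project's script. polyester's simulate_experiment accepts a fold_changes matrix with one row per transcript and one column per group; the sketch assumes two groups, 1-based indices from the TSV, and that the fold change is applied to the last group. The function name and the example values are hypothetical.

```python
def build_fold_change_matrix(n_transcripts, gene_rows, set_fold_changes, n_groups=2):
    """Return an n_transcripts x n_groups matrix of multiplicative fold changes."""
    # Start at 1.0 everywhere, meaning no differential expression.
    matrix = [[1.0] * n_groups for _ in range(n_transcripts)]
    for gene_set, _gene, index in gene_rows:
        # Assumption: index is 1-based and the change applies to the last group.
        matrix[index - 1][n_groups - 1] = set_fold_changes[gene_set]
    return matrix


# Made-up rows in the (gene_set, target_gene, index) layout of the TSV.
example_rows = [("SET_ONE", "GENEB", 3), ("SET_TWO", "GENEA", 1)]
example_fold_changes = {"SET_ONE": 4.0, "SET_TWO": 0.5}

matrix = build_fold_change_matrix(4, example_rows, example_fold_changes)
for row in matrix:
    print(row)
```

In R, the equivalent matrix would have as many rows as there are sequences in the FASTA file, which is why the indices must refer to positions in that same file.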
Use the submit_simulation.sh script in the scripts directory to submit the job to the server. From the scripts directory, run:
sbatch submit_simulation.sh