3.5. Data Preparation


This page enumerates the recipes for preparing the data (i.e., populating the static folder).

Basic System Requirements

  • Storage: ~30.2 GB (~6.2 GB for the Docker image and ~24 GB for the accompanying dataset)
  • Memory: ≥ 8 GB
  • Operating System: Linux, macOS, or Windows

Downloading the (Data Preparation) Workflow Dataset

  1. Start by downloading the latest version of the dataset needed for the data preparation workflow from here.

    This dataset is different from the one on the Installation page. This dataset includes all the data (i.e., even the raw dataset files), whereas the one on the Installation page includes only those necessary for the app to run.

    💡 If you want to verify the integrity of the downloaded dataset, compute the SHA-512 checksum of the tgz archive using a hashing utility like certutil on Windows, shasum on macOS, or sha512sum on Linux. You should obtain the following checksum:

    4242a9eb61338a48a6a8176c5d8add08f8febcecd1e31ce5a0ea4aff562bed83961009621d96205984ec0fe9fb2a699af812986e337ea62877eb5dfff6591d79
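
    For instance, on Linux, assuming static.tgz is in the current directory:

      sha512sum static.tgz

    On macOS, the equivalent command is shasum -a 512 static.tgz; on Windows, it is certutil -hashfile static.tgz SHA512.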
    
  2. Extract the contents of the data folder:

    • [For Linux, macOS, and Windows 10 onwards] Launch a terminal from the download location, and run the following command:

      tar -xvzf static.tgz
      
    • [For Windows versions older than Windows 10] Use an unpacking tool like WinRAR or 7-Zip.

  3. The extraction process should result in a folder named static. Inside this should be two folders named app_data and raw_data.
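
    As a quick sanity check, you can list the contents of the extracted folder:

      ls static

    This should print app_data and raw_data.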

Setting up the Environment

  1. Download and install Docker, a platform for building and running containerized apps:

    • [For macOS and Windows] Install Docker Desktop.
    • [For Linux] For easier installation, we recommend installing Docker Engine instead of Docker Desktop. Instructions for different Linux distributions can be found here.
  2. Start the Docker daemon:

    • [For macOS and Windows] Open Docker Desktop to start the daemon.
    • [For Linux] Follow the instructions here.
  3. Launch a terminal (from anywhere), and pull the latest Docker image for the workflow by running:

    docker pull ghcr.io/bioinfodlsu/rice-pilaf/workflow:latest
    
  4. Spin up a container from the image by running:

    docker create --name rice-pilaf-workflow -v path/to/static/in/local:/app/static ghcr.io/bioinfodlsu/rice-pilaf/workflow:latest
    

    Replace path/to/static/in/local with the path to the static folder generated following the steps in the previous section. It may be more convenient to use the absolute path. If you are using Windows, replace the backward slashes (\) in the path with forward slashes (/).
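    For example, if the static folder was extracted to /home/user/rice-pilaf-data/static (a hypothetical path), the command would be:

      docker create --name rice-pilaf-workflow -v /home/user/rice-pilaf-data/static:/app/static ghcr.io/bioinfodlsu/rice-pilaf/workflow:latest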

  5. Launch a terminal (from anywhere), and start the RicePilaf workflow container by running:

    docker start rice-pilaf-workflow
    
  6. Open a shell that will execute commands in the container by running:

    docker exec -it rice-pilaf-workflow bash
    

    Doing so should open a shell whose prompt looks like root@<container_id>:/app/prepare_data/workflow/scripts, i.e., the working directory inside the container is /app/prepare_data/workflow/scripts.

    ⚠️ IMPORTANT: All the commands in the data preparation recipes listed on this page should be run on this shell (i.e., they should be executed in the container).

  7. Once you are done using the RicePilaf workflow container, stop the container by running:

    docker stop rice-pilaf-workflow
    
  8. If you want to use the RicePilaf workflow container again, follow Steps 5 and 6.

Some Useful Tips

  1. You can use the -h or --help flag to display more information about a data processing script (e.g., its arguments, output files, and their descriptions), like so:

    For Python scripts (replace <FILENAME> with the filename of the script):

    python3 <FILENAME> --help
    

    For R scripts (replace <FILENAME> with the filename of the script):

    Rscript --vanilla <FILENAME> --help
    
  2. Several output files are pickled files. You can use the Visual Studio Code extension vscode-pydata-viewer to display their contents without needing to write a Python script.
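
    If you prefer the command line over Visual Studio Code, here is a minimal one-liner for peeking into a pickled file (replace path/to/file.pickle with the path to an actual output file):

      python3 -c "import pickle; print(pickle.load(open('path/to/file.pickle', 'rb')))"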

1️⃣ Gene List and Lift-Over

Mapping Accessions

Click here to show/hide the recipes

a. Mapping Cultivar-Specific Accessions to OGIs

python3 ogi_mapping/generate-ogi-dicts.py ../../../static/raw_data/gene_ID_mapping_fromRGI ../../../static/app_data/gene_id_mapping/ogi_mapping

b. Mapping Cultivar-Specific Genes to Nipponbare Orthologs (MSU Accessions)

python3 ogi_mapping/generate-nb-dicts.py ../../../static/app_data/gene_id_mapping/ogi_mapping ../../../static/app_data/gene_id_mapping/nb_mapping

c. Mapping MSU Accessions to RAP-DB Accessions

python3 gene_id_mapping/msu-to-rapdb-id.py ../../../static/raw_data/enrichment_analysis/rap_db/RAP-MSU_2023-03-15.txt ../../../static/app_data/gene_id_mapping/msu_mapping

Gene Descriptions and Info

Click here to show/hide the recipes

a. Getting the Protein Domain and Family Info (InterPro and Pfam) Related to Each Gene

python3 ogi_mapping/generate-nb-to-iric-dicts.py ../../../static/raw_data/gene_ID_mapping_fromRGI ../../../static/app_data/gene_id_mapping/iric_mapping
python3 iric_description/map-gene-to-interpro.py ../../../static/raw_data/iric_data/iric_data_original.pkl ../../../static/raw_data/iric_data/interpro2name.txt ../../../static/app_data/iric_data
python3 iric_description/map-gene-to-pfam.py ../../../static/raw_data/iric_data/iric_data_original.pkl ../../../static/raw_data/iric_data/pfam2name.json ../../../static/app_data/iric_data

b. Getting the Quantitative Trait Loci from Published Literature (QTARO) Related to Each Gene

python3 qtaro/prepare-qtaro.py ../../../static/raw_data/qtaro/Qtaro_Mar2016_convMSU_1849.csv ../../../static/app_data/qtaro

c. Getting the Description of Each Gene

python3 gene_description/prepare_desc_uniprot_dict.py ../../../static/app_data/gene_descriptions/Nb/Nb_gene_descriptions.csv ../../../static/app_data/gene_descriptions/Nb

2️⃣ Gene Retrieval by Text Mining

PubMed Articles

Click here to show/hide the recipes

a. Getting the PubMed Articles Related to Each Gene

python3 text_mining/get-pubmed-per-gene.py ../../../static/raw_data/text_mining/gene_index_table.csv ../../../static/app_data/text_mining/annotated_abstracts.tsv ../../../static/raw_data/text_mining/match_filtering/symbol_replacement.tsv ../../../static/raw_data/text_mining/match_filtering/symbol_exclusion.tsv ../../../static/raw_data/text_mining/pubmed_per_gene
python3 text_mining/consolidate-pubmed-dictionaries.py ../../../static/raw_data/text_mining/pubmed_per_gene ../../../static/app_data/text_mining

Note that text_mining/get-pubmed-per-gene.py may take several days to run. Hence, we provide the option to start and end the script's execution at user-specified genes (<START_GENE> and <END_GENE>, respectively), as in the recipe below:

python3 text_mining/get-pubmed-per-gene.py ../../../static/raw_data/text_mining/gene_index_table.csv ../../../static/app_data/text_mining/annotated_abstracts.tsv ../../../static/raw_data/text_mining/match_filtering/symbol_replacement.tsv ../../../static/raw_data/text_mining/match_filtering/symbol_exclusion.tsv ../../../static/raw_data/text_mining/pubmed_per_gene --continue_from <START_GENE> --end_at <END_GENE>
python3 text_mining/consolidate-pubmed-dictionaries.py ../../../static/raw_data/text_mining/pubmed_per_gene ../../../static/app_data/text_mining
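
For example, using the hypothetical accessions LOC_Os01g01010 and LOC_Os01g10000 as the start and end genes (check gene_index_table.csv for the identifiers actually used), the first command becomes:

python3 text_mining/get-pubmed-per-gene.py ../../../static/raw_data/text_mining/gene_index_table.csv ../../../static/app_data/text_mining/annotated_abstracts.tsv ../../../static/raw_data/text_mining/match_filtering/symbol_replacement.tsv ../../../static/raw_data/text_mining/match_filtering/symbol_exclusion.tsv ../../../static/raw_data/text_mining/pubmed_per_gene --continue_from LOC_Os01g01010 --end_at LOC_Os01g10000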

b. Mapping Gene Symbols to MSU Accessions

python3 text_mining/generate-symbol-to-msu.py ../../../static/raw_data/text_mining/gene_index_table.csv ../../../static/app_data/gene_id_mapping/msu_mapping

3️⃣ Co-Expression Network Analysis

Note that:

  • <NETWORK> can be either OS-CX (RiceNet v2) or RCRN (Rice Combined Mutual Ranked Network).
  • <ALGO> can be fox, demon, coach, or clusterone.
  • <PARAM> is the name of the directory containing the module list after running the algorithm with the specified parameter (i.e., after running the module detection recipes here).
    • For example, if <ALGO> is clusterone and the parameter (minimum density) is 0.3, then <PARAM> is 30.
  • <MODULE_NUM> refers to the number of the module on which the enrichment analysis will be performed.
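
As a worked example of these conventions, if <NETWORK> is OS-CX, <ALGO> is clusterone, and <PARAM> is 30 (i.e., ClusterONE was run with a minimum density of 0.3), then the module list referenced in the recipes below is located at:

../../../static/app_data/network_modules/OS-CX/MSU/clusterone/30/clusterone-module-list.tsv

Passing -i 1 as <MODULE_NUM> would then run the enrichment analysis on module 1 of that list.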

Module Detection

Click here to show/hide the recipes

a. Data Preparation

This recipe converts the co-expression network into the input formats required by the module detection algorithms and generates the mapping dictionaries needed to convert across the different network representations:

python3 network_util/convert-to-int-edge-list.py ../../../static/app_data/networks/<NETWORK>.txt ../../../static/raw_data/network_modules/<NETWORK>/mapping
python3 module_util/generate-mapping-from-networkx-int-edge-graph.py ../../../static/raw_data/network_modules/<NETWORK>/mapping/int-edge-list.txt ../../../static/raw_data/network_modules/<NETWORK>/mapping/int-edge-list-node-mapping.pickle ../../../static/raw_data/network_modules/<NETWORK>/mapping
mkdir -p ../../../static/raw_data/network_modules/<NETWORK>/temp/fox
mkdir -p ../../../static/raw_data/network_modules/<NETWORK>/temp/clusterone

b. Detecting Modules via ClusterONE

Publication: Nature Methods

Dependency: ClusterONE (Java)

java -jar module_detection/cluster_one-1.0.jar --output-format csv --min-density <MIN_DENSITY> ../../../static/app_data/networks/<NETWORK>.txt > ../../../static/raw_data/network_modules/<NETWORK>/temp/clusterone/clusterone-results-<MIN_DENSITY * 100>.csv
python3 module_util/get-modules-from-clusterone-results.py ../../../static/raw_data/network_modules/<NETWORK>/temp/clusterone/clusterone-results-<MIN_DENSITY * 100>.csv ../../../static/app_data/network_modules/<NETWORK>/MSU/clusterone/<MIN_DENSITY * 100>

Replace <MIN_DENSITY> with the minimum density:

  • If <MIN_DENSITY> is 0.3, then <MIN_DENSITY * 100> is 30. This is just a convention in the app to avoid having decimal points in the directory and file names.
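
As a sketch, here is the recipe above with <NETWORK> set to OS-CX and <MIN_DENSITY> set to 0.3 (so <MIN_DENSITY * 100> is 30):

java -jar module_detection/cluster_one-1.0.jar --output-format csv --min-density 0.3 ../../../static/app_data/networks/OS-CX.txt > ../../../static/raw_data/network_modules/OS-CX/temp/clusterone/clusterone-results-30.csv
python3 module_util/get-modules-from-clusterone-results.py ../../../static/raw_data/network_modules/OS-CX/temp/clusterone/clusterone-results-30.csv ../../../static/app_data/network_modules/OS-CX/MSU/clusterone/30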

c. Detecting Modules via COACH

Publication: BMC Bioinformatics

Dependency: CDlib (Python)

python3 module_detection/detect-modules-via-coach.py --affinity_threshold <AFFINITY_THRESHOLD> ../../../static/raw_data/network_modules/<NETWORK>/mapping/int-edge-list.txt ../../../static/raw_data/network_modules/<NETWORK>/temp/coach
python3 module_util/restore-node-labels-in-modules.py ../../../static/raw_data/network_modules/<NETWORK>/temp/coach/coach-int-module-list-<AFFINITY_THRESHOLD * 1000>.csv ../../../static/raw_data/network_modules/<NETWORK>/mapping/networkx-node-mapping.pickle ../../../static/app_data/network_modules/<NETWORK>/MSU/coach/<AFFINITY_THRESHOLD * 1000> coach

Replace <AFFINITY_THRESHOLD> with the affinity threshold:

  • If <AFFINITY_THRESHOLD> is 0.125, then <AFFINITY_THRESHOLD * 1000> is 125. This is just a convention in the app to avoid having decimal points in the directory and file names.

d. Detecting Modules via DEMON

Publication: KDD '12: Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining

Dependency: CDlib (Python)

python3 module_detection/detect-modules-via-demon.py --epsilon <EPSILON> ../../../static/raw_data/network_modules/<NETWORK>/mapping/int-edge-list.txt ../../../static/raw_data/network_modules/<NETWORK>/temp/demon
python3 module_util/restore-node-labels-in-modules.py ../../../static/raw_data/network_modules/<NETWORK>/temp/demon/demon-int-module-list-<EPSILON * 100>.csv ../../../static/raw_data/network_modules/<NETWORK>/mapping/networkx-node-mapping.pickle ../../../static/app_data/network_modules/<NETWORK>/MSU/demon/<EPSILON * 100> demon

Replace <EPSILON> with the merging threshold (epsilon):

  • If <EPSILON> is 0.25, then <EPSILON * 100> is 25. This is just a convention in the app to avoid having decimal points in the directory and file names.

e. Detecting Modules via FOX

Publication: ACM Transactions on Social Computing (FOX), PeerJ Computer Science (LazyFox, a parallelized implementation of FOX)

Dependency: LazyFox (C++)

module_detection/LazyFox --input-graph ../../../static/raw_data/network_modules/<NETWORK>/mapping/int-edge-list.txt --output-dir temp --queue-size 20 --thread-count 20 --disable-dumping --wcc-threshold <WCC_THRESHOLD>
mv temp/CPP*/iterations/*.txt ../../../static/raw_data/network_modules/<NETWORK>/temp/fox/fox-int-module-list-<WCC_THRESHOLD * 100>.txt
rm -r temp
python3 module_util/restore-node-labels-in-modules.py ../../../static/raw_data/network_modules/<NETWORK>/temp/fox/fox-int-module-list-<WCC_THRESHOLD * 100>.txt ../../../static/raw_data/network_modules/<NETWORK>/mapping/int-edge-list-node-mapping.pickle ../../../static/app_data/network_modules/<NETWORK>/MSU/fox/<WCC_THRESHOLD * 100> fox

Replace <WCC_THRESHOLD> with the weighted community clustering (WCC) threshold:

  • If <WCC_THRESHOLD> is 0.01, then <WCC_THRESHOLD * 100> is 1. This is just a convention in the app to avoid having decimal points in the directory and file names.
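
For instance, with <NETWORK> set to OS-CX and <WCC_THRESHOLD> set to 0.01 (so <WCC_THRESHOLD * 100> is 1), the recipe above becomes:

module_detection/LazyFox --input-graph ../../../static/raw_data/network_modules/OS-CX/mapping/int-edge-list.txt --output-dir temp --queue-size 20 --thread-count 20 --disable-dumping --wcc-threshold 0.01
mv temp/CPP*/iterations/*.txt ../../../static/raw_data/network_modules/OS-CX/temp/fox/fox-int-module-list-1.txt
rm -r temp
python3 module_util/restore-node-labels-in-modules.py ../../../static/raw_data/network_modules/OS-CX/temp/fox/fox-int-module-list-1.txt ../../../static/raw_data/network_modules/OS-CX/mapping/int-edge-list-node-mapping.pickle ../../../static/app_data/network_modules/OS-CX/MSU/fox/1 fox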

Ontology and Pathway Enrichment Analysis

Click here to show/hide the recipes

a. Data Preparation

Dependency: riceidconverter (R)

This recipe extracts the nodes (genes) from the co-expression network:

python3 network_util/get-nodes-from-network.py ../../../static/app_data/networks/<NETWORK>.txt ../../../static/raw_data/enrichment_analysis/all_genes/<NETWORK>/MSU

This recipe maps the MSU accessions used in the app to the target IDs required by the pathway enrichment analysis tools:

Rscript --vanilla enrichment_analysis/util/ricegeneid-msu-to-transcript-id.r -g ../../../static/raw_data/enrichment_analysis/all_genes/<NETWORK>/MSU/all-genes.txt -o ../../../static/raw_data/enrichment_analysis/temp/<NETWORK>
python3 enrichment_analysis/util/msu-to-transcript-id.py ../../../static/raw_data/enrichment_analysis/temp/<NETWORK>/all-transcript-id.txt ../../../static/raw_data/enrichment_analysis/temp/<NETWORK>/all-na-transcript-id.txt ../../../static/raw_data/enrichment_analysis/rap_db/RAP-MSU_2023-03-15.txt ../../../static/raw_data/enrichment_analysis/rap_db/IRGSP-1.0_representative_annotation_2023-03-15.tsv ../../../static/raw_data/enrichment_analysis/mapping/<NETWORK>
python3 enrichment_analysis/util/transcript-to-msu-id.py ../../../static/raw_data/enrichment_analysis/mapping/<NETWORK>/msu-to-transcript-id.pickle ../../../static/app_data/gene_id_mapping/msu_mapping/<NETWORK>
python3 enrichment_analysis/util/file-convert-msu.py ../../../static/raw_data/enrichment_analysis/all_genes/<NETWORK>/MSU/all-genes.txt ../../../static/raw_data/enrichment_analysis/mapping/<NETWORK>/msu-to-transcript-id.pickle ../../../static/raw_data/enrichment_analysis/all_genes/<NETWORK> transcript --skip_no_matches
python3 enrichment_analysis/util/file-convert-msu.py ../../../static/app_data/network_modules/<NETWORK>/MSU/<ALGO>/<PARAM>/<ALGO>-module-list.tsv ../../../static/raw_data/enrichment_analysis/mapping/<NETWORK>/msu-to-transcript-id.pickle ../../../static/app_data/enrichment_analysis/<NETWORK>/modules/<ALGO>/<PARAM> transcript

This recipe prepares the data needed for ontology enrichment analysis:

python3 enrichment_analysis/util/aggregate-go-annotations.py ../../../static/raw_data/enrichment_analysis/go/agrigo.tsv ../../../static/raw_data/enrichment_analysis/go/OryzabaseGeneListAll_20230322010000.txt ../../../static/raw_data/enrichment_analysis/rap_db/IRGSP-1.0_representative_annotation_2023-03-15.tsv ../../../static/raw_data/enrichment_analysis/all_genes/<NETWORK>/transcript/all-genes.tsv ../../../static/raw_data/enrichment_analysis/mapping/<NETWORK>/msu-to-transcript-id.pickle ../../../static/raw_data/enrichment_analysis/go/<NETWORK>
python3 enrichment_analysis/util/aggregate-to-annotations.py ../../../static/raw_data/enrichment_analysis/go/OryzabaseGeneListAll_20230322010000.txt ../../../static/raw_data/enrichment_analysis/to/<NETWORK>
python3 enrichment_analysis/util/aggregate-po-annotations.py ../../../static/raw_data/enrichment_analysis/go/OryzabaseGeneListAll_20230322010000.txt ../../../static/raw_data/enrichment_analysis/po/<NETWORK>

b. Gene Ontology Enrichment Analysis

Dependencies: GO.db (R), clusterProfiler (R)

Rscript --vanilla enrichment_analysis/ontology_enrichment/go-enrichment.r -g ../../../static/app_data/network_modules/<NETWORK>/MSU/<ALGO>/<PARAM>/<ALGO>-module-list.tsv -i <MODULE_NUM> -b ../../../static/raw_data/enrichment_analysis/all_genes/<NETWORK>/MSU/all-genes.txt -m ../../../static/raw_data/enrichment_analysis/go/<NETWORK>/go-annotations.tsv -o ../../../static/app_data/enrichment_analysis/<NETWORK>/output/<ALGO>/<PARAM>/ontology_enrichment/go

c. Trait Ontology Enrichment Analysis

Dependency: clusterProfiler (R)

Rscript --vanilla enrichment_analysis/ontology_enrichment/to-enrichment.r -g ../../../static/app_data/network_modules/<NETWORK>/MSU/<ALGO>/<PARAM>/<ALGO>-module-list.tsv -i <MODULE_NUM> -b ../../../static/raw_data/enrichment_analysis/all_genes/<NETWORK>/MSU/all-genes.txt -m ../../../static/raw_data/enrichment_analysis/to/<NETWORK>/to-annotations.tsv -t ../../../static/raw_data/enrichment_analysis/to/<NETWORK>/to-id-to-name.tsv -o ../../../static/app_data/enrichment_analysis/<NETWORK>/output/<ALGO>/<PARAM>/ontology_enrichment/to

d. Plant Ontology Enrichment Analysis

Dependency: clusterProfiler (R)

Rscript --vanilla enrichment_analysis/ontology_enrichment/po-enrichment.r -g ../../../static/app_data/network_modules/<NETWORK>/MSU/<ALGO>/<PARAM>/<ALGO>-module-list.tsv -i <MODULE_NUM> -b ../../../static/raw_data/enrichment_analysis/all_genes/<NETWORK>/MSU/all-genes.txt -m ../../../static/raw_data/enrichment_analysis/po/<NETWORK>/po-annotations.tsv -t ../../../static/raw_data/enrichment_analysis/po/<NETWORK>/po-id-to-name.tsv -o ../../../static/app_data/enrichment_analysis/<NETWORK>/output/<ALGO>/<PARAM>/ontology_enrichment/po

e. Overrepresentation (Pathway Enrichment) Analysis via clusterProfiler

Dependency: clusterProfiler (R)

Rscript --vanilla enrichment_analysis/pathway_enrichment/ora-enrichment.r -g ../../../static/app_data/network_modules/<NETWORK>/transcript/<ALGO>/<PARAM>/transcript/<ALGO>-module-list.tsv -i <MODULE_NUM> -b ../../../static/raw_data/enrichment_analysis/all_genes/<NETWORK>/transcript/all-genes.tsv -o ../../../static/app_data/enrichment_analysis/<NETWORK>/output/<ALGO>/<PARAM>/pathway_enrichment/ora

f. Topology-Based (Pathway Enrichment) Analysis via Pathway-Express

Publication: Genome Research

Dependency: ROntoTools (R)

Rscript --vanilla enrichment_analysis/pathway_enrichment/pe-enrichment.r -g ../../../static/app_data/enrichment_analysis/<NETWORK>/modules/<ALGO>/<PARAM>/transcript/<ALGO>-module-list.tsv -i <MODULE_NUM> -b ../../../static/raw_data/enrichment_analysis/all_genes/<NETWORK>/transcript/all-genes.tsv -o ../../../static/app_data/enrichment_analysis/<NETWORK>/output/<ALGO>/<PARAM>/pathway_enrichment/pe

This recipe generates additional files needed for the user-facing display of the results on the app (e.g., list of genes in the dosa pathways and names of the pathways):

Rscript enrichment_analysis/util/get-genes-in-pathway.r -o ../../../static/raw_data/enrichment_analysis/kegg_dosa/geneset
python3 enrichment_analysis/util/get-genes-in-pathway-dict.py ../../../static/raw_data/enrichment_analysis/kegg_dosa/geneset/kegg-dosa-geneset.tsv ../../../static/app_data/enrichment_analysis/mapping
wget -O ../../../static/app_data/enrichment_analysis/mapping/kegg-dosa-pathway-names.tsv https://rest.kegg.jp/list/pathway/dosa

g. Topology-Based (Pathway Enrichment) Analysis via SPIA

Publication: Bioinformatics

Dependency: SPIA (R)

The recipe below uses the dosaSPIA.RData file generated by SPIA from the KGML (KEGG pathway data) files for dosa, i.e., Oryza sativa japonica (Japanese rice), with gene models taken from RAP-DB. The KGML files were downloaded on May 11, 2023.

Rscript --vanilla enrichment_analysis/pathway_enrichment/spia-enrichment.r -g ../../../static/app_data/network_modules/<NETWORK>/transcript/<ALGO>/<PARAM>/transcript/<ALGO>-module-list.tsv -i <MODULE_NUM> -b ../../../static/raw_data/enrichment_analysis/all_genes/<NETWORK>/transcript/all-genes.tsv -s ../../../static/raw_data/enrichment_analysis/kegg_dosa/SPIA -o ../../../static/app_data/enrichment_analysis/<NETWORK>/output/<ALGO>/<PARAM>/pathway_enrichment/spia

If you would like to generate dosaSPIA.RData yourself, the recipe is given below. Note, however, that you have to supply the KGML files for dosa (save them in ../../../static/raw_data/enrichment_analysis/kegg_dosa/XML). We do not distribute them in compliance with KEGG's licensing restrictions.
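
If you download the KGML files manually, create the expected folder first before saving them there:

mkdir -p ../../../static/raw_data/enrichment_analysis/kegg_dosa/XML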

Rscript --vanilla enrichment_analysis/pathway_enrichment/spia-enrichment.r -g ../../../static/app_data/network_modules/<NETWORK>/transcript/<ALGO>/<PARAM>/transcript/<ALGO>-module-list.tsv -i <MODULE_NUM> -b ../../../static/raw_data/enrichment_analysis/all_genes/<NETWORK>/transcript/all-genes.tsv -p ../../../static/raw_data/enrichment_analysis/kegg_dosa/XML -s ../../../static/raw_data/enrichment_analysis/kegg_dosa/SPIA -o ../../../static/app_data/enrichment_analysis/<NETWORK>/output/<ALGO>/<PARAM>/pathway_enrichment/spia

Summary Table

Click here to show/hide the recipes

a. Getting the Modules to Which Each Gene Belongs

python3 network_util/map-genes-to-modules.py ../../../static/app_data/network_modules/<NETWORK>/MSU/<ALGO>/<PARAM>/<ALGO>-module-list.tsv ../../../static/app_data/network_modules/<NETWORK>/MSU_to_modules/<ALGO>/<PARAM>

b. Getting the Ontology Terms Associated with Each Gene

python3 enrichment_analysis/util/map-genes-to-ontology.py ../../../static/app_data/enrichment_analysis/genes_to_ontology_pathway go ../../../static/raw_data/enrichment_analysis/go/OS-CX/go-annotations.tsv ../../../static/raw_data/enrichment_analysis/go/RCRN/go-annotations.tsv
python3 enrichment_analysis/util/map-genes-to-ontology.py ../../../static/app_data/enrichment_analysis/genes_to_ontology_pathway to ../../../static/raw_data/enrichment_analysis/to/OS-CX/to-annotations.tsv ../../../static/raw_data/enrichment_analysis/to/RCRN/to-annotations.tsv
python3 enrichment_analysis/util/map-genes-to-ontology.py ../../../static/app_data/enrichment_analysis/genes_to_ontology_pathway po ../../../static/raw_data/enrichment_analysis/po/OS-CX/po-annotations.tsv ../../../static/raw_data/enrichment_analysis/po/RCRN/po-annotations.tsv

Note that the last argument of enrichment_analysis/util/map-genes-to-ontology.py is variadic, i.e., you can add as many annotation files as needed.

c. Getting the Pathways in Which Each Gene is Involved

python3 enrichment_analysis/util/map-genes-to-pathway.py ../../../static/app_data/enrichment_analysis/mapping/kegg-dosa-geneset.pickle ../../../static/app_data/enrichment_analysis/genes_to_ontology_pathway ../../../static/app_data/gene_id_mapping/msu_mapping/OS-CX/transcript-to-msu-id.pickle ../../../static/app_data/gene_id_mapping/msu_mapping/RCRN/transcript-to-msu-id.pickle

Note that the last argument of enrichment_analysis/util/map-genes-to-pathway.py is variadic, i.e., you can add as many transcript ID-to-MSU accession pickled dictionaries as needed.

4️⃣ Regulatory Feature Enrichment

Transcription Factor Info

Click here to show/hide the recipes

a. Getting the Family of Each Transcription Factor

python3 tfbs/get_fam.py ../../../static/raw_data/tf_enrichment/tf_list/Osj_TF_list.txt ../../../static/app_data/tf_enrichment/annotation

❓ Frequently Asked Questions

How can I build the workflow image locally?

Click here to show/hide the steps
  1. Download Docker, and start the Docker daemon (as in Steps 1 and 2 here).

  2. Clone the RicePilaf repository by running:

    git clone https://github.com/bioinfodlsu/rice-pilaf
    
  3. Launch a terminal from the root of the cloned repository, and build the Docker image for the workflow by running the following:

    docker build -t rice-pilaf-workflow -f Dockerfile-workflow .
    
  4. Spin up a container from the Docker image by running:

    docker create --name rice-pilaf-workflow -v path/to/static/in/local:/app/static rice-pilaf-workflow
    

    Replace path/to/static/in/local with the path to the static folder in your local machine. It may be more convenient to use the absolute path. If you are using Windows, replace the backward slashes (\) in the path with forward slashes (/).

  5. Use the RicePilaf workflow container as in Steps 5 to 8 here.

How can I set up the workflow without Docker?

Click here to show/hide the steps (best of luck!)

⚠️ Important: Windows users have to run the commands on Windows Subsystem for Linux (WSL). WSL is necessary because compiling one of the dependencies (LazyFox) requires Unix or Unix-like utilities.

  1. Install the following first:

  2. Clone the RicePilaf repository by running:

    git clone https://github.com/bioinfodlsu/rice-pilaf
    
  3. Transfer your static folder to the root of the cloned repository. It should be at the same level as callbacks, pages, etc.

  4. Launch a terminal from the root of the cloned repository, and install the required Python libraries by running:

    python3 -m pip install -r dependencies/requirements-workflow.txt
    
  5. Install the required R packages by running:

    bash dependencies/r-packages-workflow.sh
    
  6. Download ClusterONE (a module detection tool) from here, and save the JAR file to prepare_data/workflow/scripts/module_detection.

  7. Launch a terminal from prepare_data/workflow/scripts/module_detection, and run the following commands to compile LazyFox (another module detection tool):

    git clone https://github.com/TimGarrels/LazyFox
    mv LazyFox lazyfoxdir
    cd lazyfoxdir
    git reset --hard d08f3c084df19bd2a1726159f181bbe3ad6f5bf4
    mkdir build
    cd build
    cmake ..
    make
    mv LazyFox ../../LazyFox
    cd ../../
    rm -r lazyfoxdir
    chmod +x LazyFox
    
  8. Launch a terminal from prepare_data/workflow/scripts, and run the recipes on this terminal:

    • Note that, if you are using Windows' native terminal (i.e., not WSL), you may have to change python3 to python or py (depending on your Python installation).

How can I set up older versions of the workflow?

Click here to show/hide the steps

⚠️ Important: Make sure that the dataset version matches the release version of the code that you want to run. The dataset version consists of the first two numbers in the release version of the code. For example, if the release version of the code is 0.1.x, then the dataset version should be 0.1.

  1. Refer to this spreadsheet for the link to the dataset and its SHA-512 checksum. Note that this link is different from the one on the Installation page.

  2. Extract the contents of the downloaded dataset. Doing so should result in a folder named static. Inside this should be two folders named app_data and raw_data.

  3. Download Docker, and start the Docker daemon (as in Steps 1 and 2 here).

  4. Launch a terminal (from anywhere), and pull the Docker image for the workflow by running:

    docker pull ghcr.io/bioinfodlsu/rice-pilaf/workflow:v<RELEASE_VERSION>
    

    Replace <RELEASE_VERSION> with the release version of the code. A complete list of all the release versions can be found here.
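
    For example, to pull the image for release 0.1.1 (one of the versions mentioned in the note below):

      docker pull ghcr.io/bioinfodlsu/rice-pilaf/workflow:v0.1.1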

  5. Spin up a container from the image by running:

    docker create --name rice-pilaf-workflow -v path/to/static/in/local:/app/static ghcr.io/bioinfodlsu/rice-pilaf/workflow:v<RELEASE_VERSION>
    

    Replace path/to/static/in/local with the path to the static folder. It may be more convenient to use the absolute path. If you are using Windows, replace the backward slashes (\) in the path with forward slashes (/).

    Replace <RELEASE_VERSION> with the release version of the code.

    Note: If you intend to run version ≤ 0.1.1, include -p 8050:80 in the docker create command.
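
    For example, a full command for release 0.1.1 would be:

      docker create --name rice-pilaf-workflow -p 8050:80 -v path/to/static/in/local:/app/static ghcr.io/bioinfodlsu/rice-pilaf/workflow:v0.1.1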

  6. Use the RicePilaf workflow container as in Steps 5 to 8 here.
