Instructions for Matt

0. Setup

First clone the repository:

git clone [email protected]:zacheliason/hla-em.git

Next move your copies of hla_gen.fasta (link to current IMGT version) and EnsembleGenome_STAR_without_scaffolds inside the root directory of the repository as shown here:

hla-em/
├── src/
│   ├── EMStep.py
│   └── ...
├── HLA_EM.py
├── README.md
├── ...
├── ...
├── hla_gen.fasta
└── EnsembleGenome_STAR_without_scaffolds/
    ├── Genome
    └── ...

They have to be inside the hla-em directory because they will need to be accessible from inside our mounted volume inside the Docker container later.

1. Simulated Test Case Generation

The Python script src/simulate_reads.py is used to generate simulated test cases for HLA-EM.py

Usage

To generate the test cases we talked about previously (5 test cases of each of the following treatments: 100,000, 50,000, 10,000, 5,000, & 1,500 HLA reads out of 1,000,000 reads total) use the following command:

python3 src/simulate_reads.py -r /PATH/TO/hla-em/hla_gen.fasta -m /PATH/TO/hla-em/EnsembleGenome_STAR_without_scaffolds

To increase the number of samples per treatment to 10, add -n 10.

The script will first recreate a fasta file from the masked STAR index directory. Then it will begin generating test cases within a new reference directory. It uses TCGA_HLA_alleles.tsv as a reference for HLA alleles. Half of the test cases at each treatment use the most common genotype matching the TCGA data (best case scenario), and the other half use a random genotype matching the TCGA data (harder scenario). The script will also generate a allele_record.csv file in the reference directory that contains the paths to the test cases, their respective treatments, and expected genotypes.

2. Running HLA-EM on Samples

I wrote the bash script run_hla_em_pipeline.sh to automate the process of running HLA-EM on each of the test cases generated by simulate_reads.py. I don't have a copy of the job scripts we used while I was at WUSTL so I can't automate submitting the jobs to the cluster, but hopefully this is close enough to modify. If you have a job script template you wanted to send me I'm sure I could modify this script to automate sending jobs to the cluster as well.

The script loops through each of the samples in the reference directory and runs HLA-EM on each one. (the number of samples to loop through will have to be adjusted inside the script based on the number of samples inside reference)

The script HLA_EM.py will index a filtered reference genome so there is no need to supply a STAR-indexed HLA reference directory beforehand.

Usage

To run the script, you can use the following command-line arguments:

bash run_hla_em_pipeline.sh

3. Running Optitype on Samples

Because Optitype requires first masking non-HLA reads from the fastq files, I streamlined the process by retaining the unmasked BAM files from the HLA-EM pipeline for later use in the Optitype step. Thus the HLA-EM pipeline must be run before the Optitype pipeline.

I wrote the bash script run_optitype.sh to automate the process of running Optitype on each of the test cases generated by simulate_reads.py. (the number of samples to loop through will have to be adjusted inside the script based on the number of samples inside reference)

Usage

To run the script, you can use the following command-line arguments:

bash run_optitype.sh

4. Evaluating Results

I wrote the Python script evaluate_results.py to evaluate the results of the HLA-EM and Optitype pipelines. The script compares the predicted genotypes to the expected genotypes and calculates the accuracy of each pipeline. It also saves a csv file of the results and scores. This file contains the expected and predicted genotypes for each sample. It also records the mean accuracy of each trial per typing level (two, four, six, and eight-digit).

Usage

To run the script, you can use the following command-line arguments:

bash evaluate_results.sh

Rest of HLA-EM README

HLA-EM is an HLA genotyping tool that utilizes an expectation maximization algorithm to identify a patient's HLA genotype in a sample from RNA-seq data.

HLA-EM Prerequisites

python (v 3+)
STAR
- to compile STAR, you may need to install CMAKE or gcc
SAMtools

Rather than installing each prerequisite manually, it is recommended to use the Docker image. This image includes all necessary dependencies and eliminates the need for manual installations. To use the Docker image, follow the steps below:

Using Docker Image

Install Docker on your machine.
Pull the Docker image:
```
docker pull zeliason/hla_em:latest
```

Run your application within the Docker container:

docker run -it --rm -v /path/to/hla-em/src:/src zeliason/hla_em:latest [OPTIONS]

Input

Running HLA-EM with the -h option (or --help) will print a desciption of its optional and required input arguments. A description of each follows.

docker run zeliason/hla-em:latest [-h] [-t THREADS] [-r REFERENCE] [--starHLA STARHLA] [-o OUTNAME]
          [-d] [--tpm TPM] [-p] [-k] [-v] -s STARGENOME reads1 [reads2]

or

HLA-EM.py [-h] [-t THREADS] [-r REFERENCE] [--starHLA STARHLA] [-o OUTNAME]
          [-d] [--tpm TPM] [-p] [-k] [-v] -s STARGENOME reads1 [reads2]

Optional arguments

-h, --help
Prints the help menu, which contains an example of the function usage and abbreviated explanations of each of the options.
-t, --threads THREADS
Some portions of the HLA-EM pipeline support multithreading; if you wish to use this feature, set this to the number of cores available (default: 1)
-r, --reference REFERENCE
HLA-EM includes a comprehensive set of HLA reference genes maintained by IMGT, the international ImMunoGeneTics Database, the use of which is recommended. However, if you wish to supply your own HLA gene reference file in FASTA format, use this option with the path to the file
-a, --annotation ANNOTATION
A file of HLA gene annotations in TSV format to be used in place of the default HLA gene annotations [$INSTALLDIR/reference/hla_gene_annot.tsv]
--starHLA STARHLA path to a directory containing STAR-generated HLA genome indexes based on the reference FASTA
-o, --outname OUTNAME
Prefix attached to output file names (default: ./hlaEM)
-d, --disabledust
By default, HLA-EM filters low-complexity reads using the DUST algorithm. Specify this option to disable filtering
--tpm TPM
TPM threshold for identifying a true positive HLA genotype (default: 1.48)
-p, --printem
Causes HLA-EM to print its results to STDOUT
-k, --keepint
By default, HLA-EM removes files generated by intermediate steps of the pipeline. Specify this option to keep these files
-v, --version
Print program's version number and exit

Required arguments

-s, --stargenome STARGENOME
Path to the directory containing STAR-generated human genome index files

Positional arguments

reads1 : A FASTQ file of single-end RNA-seq reads or the first of two paired-end FASTQs
reads2 : (optional) Second paired-end FASTQ file

Output

The final genotype prediction is written to OUTNAME.final_predictions.csv.
The results of the EM algorithm are written to OUTNAME.emresults.tsv. Its rows follow the format:
```
HLAtype   MappedReads   MappedProportion   MLE_Reads   MLE_Probability
```
The results of CreateMappedReadTable (a table of filtered HLA reads and references) are written to OUTNAME.mappedReads.tsv.
The read coverage maps for each predicted HLA type are written to OUTNAME.*.cov.pdf.
Pie charts comparing MappedProportion and MLE_Probability for the predicted genotypes are written to OUTNAME.props.pdf.
BAM files containing reads aligned to the HLA reference genomes are written to OUTNAME.1.Aligned.*.bam.
A log file for the EM algorithm is written to OUTNAME.em_algorithm.Log. The first line indicates how may steps the algorithm took to converge and second line shows the maximum likelihood estimate (MLE) of the sequencing error rate.

Name		Name	Last commit message	Last commit date
Latest commit History 55 Commits
Work_Sans		Work_Sans
src		src
.gitignore		.gitignore
Dockerfile		Dockerfile
HLA_EM.py		HLA_EM.py
README.md		README.md
TCGA_HLA_alleles.tsv		TCGA_HLA_alleles.tsv
evaluate_trials.sh		evaluate_trials.sh
inter.sh		inter.sh
new.py		new.py
new.sh		new.sh
requirements.txt		requirements.txt
run_hla_em_pipeline.sh		run_hla_em_pipeline.sh
run_optitype.sh		run_optitype.sh

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Instructions for Matt

0. Setup

1. Simulated Test Case Generation

Usage

2. Running HLA-EM on Samples

Usage

3. Running Optitype on Samples

Usage

4. Evaluating Results

Usage

Rest of HLA-EM README

HLA-EM Prerequisites

Using Docker Image

Input

Optional arguments

Required arguments

Positional arguments

Output

About

Releases

Packages

Languages

zacheliason/hla-em

Folders and files

Latest commit

History

Repository files navigation

Instructions for Matt

0. Setup

1. Simulated Test Case Generation

Usage

2. Running HLA-EM on Samples

Usage

3. Running Optitype on Samples

Usage

4. Evaluating Results

Usage

Rest of HLA-EM README

HLA-EM Prerequisites

Using Docker Image

Input

Optional arguments

Required arguments

Positional arguments

Output

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages