First clone the repository:
git clone [email protected]:zacheliason/hla-em.git
Next move your copies of hla_gen.fasta
(link to current IMGT version) and EnsembleGenome_STAR_without_scaffolds
inside the root directory of the repository as shown here:
hla-em/
├── src/
│ ├── EMStep.py
│ └── ...
├── HLA_EM.py
├── README.md
├── ...
├── ...
├── hla_gen.fasta
└── EnsembleGenome_STAR_without_scaffolds/
├── Genome
└── ...
They have to be inside the hla-em
directory because they will need to be accessible from inside our mounted volume inside the Docker container later.
The Python script src/simulate_reads.py
is used to generate simulated test cases for HLA-EM.py
To generate the test cases we talked about previously (5 test cases of each of the following treatments: 100,000
, 50,000
, 10,000
, 5,000
, & 1,500
HLA reads out of 1,000,000
reads total) use the following command:
python3 src/simulate_reads.py -r /PATH/TO/hla-em/hla_gen.fasta -m /PATH/TO/hla-em/EnsembleGenome_STAR_without_scaffolds
To increase the number of samples per treatment to 10, add -n 10
.
The script will first recreate a fasta file from the masked STAR index directory. Then it will begin generating test cases within a new reference
directory. It uses TCGA_HLA_alleles.tsv
as a reference for HLA alleles. Half of the test cases at each treatment use the most common genotype matching the TCGA data (best case scenario), and the other half use a random genotype matching the TCGA data (harder scenario). The script will also generate a allele_record.csv
file in the reference
directory that contains the paths to the test cases, their respective treatments, and expected genotypes.
I wrote the bash script run_hla_em_pipeline.sh
to automate the process of running HLA-EM on each of the test cases generated by simulate_reads.py
. I don't have a copy of the job scripts we used while I was at WUSTL so I can't automate submitting the jobs to the cluster, but hopefully this is close enough to modify. If you have a job script template you wanted to send me I'm sure I could modify this script to automate sending jobs to the cluster as well.
The script loops through each of the samples in the reference
directory and runs HLA-EM on each one. (the number of samples to loop through will have to be adjusted inside the script based on the number of samples inside reference
)
The script HLA_EM.py
will index a filtered reference genome so there is no need to supply a STAR-indexed HLA reference directory beforehand.
To run the script, you can use the following command-line arguments:
bash run_hla_em_pipeline.sh
Because Optitype requires first masking non-HLA reads from the fastq files, I streamlined the process by retaining the unmasked BAM files from the HLA-EM pipeline for later use in the Optitype step. Thus the HLA-EM pipeline must be run before the Optitype pipeline.
I wrote the bash script run_optitype.sh
to automate the process of running Optitype on each of the test cases generated by simulate_reads.py
. (the number of samples to loop through will have to be adjusted inside the script based on the number of samples inside reference
)
To run the script, you can use the following command-line arguments:
bash run_optitype.sh
I wrote the Python script evaluate_results.py
to evaluate the results of the HLA-EM and Optitype pipelines. The script compares the predicted genotypes to the expected genotypes and calculates the accuracy of each pipeline.
It also saves a csv file of the results and scores. This file contains the expected and predicted genotypes for each sample. It also records the mean accuracy of each trial per typing level (two, four, six, and eight-digit).
To run the script, you can use the following command-line arguments:
bash evaluate_results.sh
HLA-EM is an HLA genotyping tool that utilizes an expectation maximization algorithm to identify a patient's HLA genotype in a sample from RNA-seq data.
Rather than installing each prerequisite manually, it is recommended to use the Docker image. This image includes all necessary dependencies and eliminates the need for manual installations. To use the Docker image, follow the steps below:
-
Install Docker on your machine.
-
Pull the Docker image:
docker pull zeliason/hla_em:latest
-
Run your application within the Docker container:
docker run -it --rm -v /path/to/hla-em/src:/src zeliason/hla_em:latest [OPTIONS]
Running HLA-EM with the -h option (or --help) will print a desciption of its optional and required input arguments. A description of each follows.
docker run zeliason/hla-em:latest [-h] [-t THREADS] [-r REFERENCE] [--starHLA STARHLA] [-o OUTNAME]
[-d] [--tpm TPM] [-p] [-k] [-v] -s STARGENOME reads1 [reads2]
or
HLA-EM.py [-h] [-t THREADS] [-r REFERENCE] [--starHLA STARHLA] [-o OUTNAME]
[-d] [--tpm TPM] [-p] [-k] [-v] -s STARGENOME reads1 [reads2]
-h
,--help
Prints the help menu, which contains an example of the function usage and abbreviated explanations of each of the options.-t
,--threads
THREADS
Some portions of the HLA-EM pipeline support multithreading; if you wish to use this feature, set this to the number of cores available (default: 1)-r
,--reference
REFERENCE
HLA-EM includes a comprehensive set of HLA reference genes maintained by IMGT, the international ImMunoGeneTics Database, the use of which is recommended. However, if you wish to supply your own HLA gene reference file in FASTA format, use this option with the path to the file-a
,--annotation
ANNOTATION
A file of HLA gene annotations in TSV format to be used in place of the default HLA gene annotations [$INSTALLDIR/reference/hla_gene_annot.tsv]--starHLA
STARHLA path to a directory containing STAR-generated HLA genome indexes based on the reference FASTA-o
,--outname
OUTNAME
Prefix attached to output file names (default: ./hlaEM)-d
,--disabledust
By default, HLA-EM filters low-complexity reads using the DUST algorithm. Specify this option to disable filtering--tpm
TPM
TPM threshold for identifying a true positive HLA genotype (default: 1.48)-p
,--printem
Causes HLA-EM to print its results to STDOUT-k
,--keepint
By default, HLA-EM removes files generated by intermediate steps of the pipeline. Specify this option to keep these files-v,
--version
Print program's version number and exit
-s
,--stargenome
STARGENOME
Path to the directory containing STAR-generated human genome index files
reads1
: A FASTQ file of single-end RNA-seq reads or the first of two paired-end FASTQsreads2
: (optional) Second paired-end FASTQ file
-
The final genotype prediction is written to
OUTNAME.final_predictions.csv
. -
The results of the EM algorithm are written to
OUTNAME.emresults.tsv
. Its rows follow the format:HLAtype MappedReads MappedProportion MLE_Reads MLE_Probability
-
The results of CreateMappedReadTable (a table of filtered HLA reads and references) are written to
OUTNAME.mappedReads.tsv
. -
The read coverage maps for each predicted HLA type are written to
OUTNAME.*.cov.pdf
. -
Pie charts comparing MappedProportion and MLE_Probability for the predicted genotypes are written to
OUTNAME.props.pdf
. -
BAM files containing reads aligned to the HLA reference genomes are written to
OUTNAME.1.Aligned.*.bam
. -
A log file for the EM algorithm is written to
OUTNAME.em_algorithm.Log
. The first line indicates how may steps the algorithm took to converge and second line shows the maximum likelihood estimate (MLE) of the sequencing error rate.