Skip to content

zacheliason/hla-em

Repository files navigation

Instructions for Matt

0. Setup

First clone the repository:

git clone [email protected]:zacheliason/hla-em.git

Next move your copies of hla_gen.fasta (link to current IMGT version) and EnsembleGenome_STAR_without_scaffolds inside the root directory of the repository as shown here:

hla-em/
├── src/
│   ├── EMStep.py
│   └── ...
├── HLA_EM.py
├── README.md
├── ...
├── ...
├── hla_gen.fasta
└── EnsembleGenome_STAR_without_scaffolds/
    ├── Genome
    └── ...

They have to be inside the hla-em directory because they will need to be accessible from inside our mounted volume inside the Docker container later.

1. Simulated Test Case Generation

The Python script src/simulate_reads.py is used to generate simulated test cases for HLA-EM.py

Usage

To generate the test cases we talked about previously (5 test cases of each of the following treatments: 100,000, 50,000, 10,000, 5,000, & 1,500 HLA reads out of 1,000,000 reads total) use the following command:

python3 src/simulate_reads.py -r /PATH/TO/hla-em/hla_gen.fasta -m /PATH/TO/hla-em/EnsembleGenome_STAR_without_scaffolds

To increase the number of samples per treatment to 10, add -n 10.

The script will first recreate a fasta file from the masked STAR index directory. Then it will begin generating test cases within a new reference directory. It uses TCGA_HLA_alleles.tsv as a reference for HLA alleles. Half of the test cases at each treatment use the most common genotype matching the TCGA data (best case scenario), and the other half use a random genotype matching the TCGA data (harder scenario). The script will also generate a allele_record.csv file in the reference directory that contains the paths to the test cases, their respective treatments, and expected genotypes.

2. Running HLA-EM on Samples

I wrote the bash script run_hla_em_pipeline.sh to automate the process of running HLA-EM on each of the test cases generated by simulate_reads.py. I don't have a copy of the job scripts we used while I was at WUSTL so I can't automate submitting the jobs to the cluster, but hopefully this is close enough to modify. If you have a job script template you wanted to send me I'm sure I could modify this script to automate sending jobs to the cluster as well.

The script loops through each of the samples in the reference directory and runs HLA-EM on each one. (the number of samples to loop through will have to be adjusted inside the script based on the number of samples inside reference)

The script HLA_EM.py will index a filtered reference genome so there is no need to supply a STAR-indexed HLA reference directory beforehand.

Usage

To run the script, you can use the following command-line arguments:

bash run_hla_em_pipeline.sh

3. Running Optitype on Samples

Because Optitype requires first masking non-HLA reads from the fastq files, I streamlined the process by retaining the unmasked BAM files from the HLA-EM pipeline for later use in the Optitype step. Thus the HLA-EM pipeline must be run before the Optitype pipeline.

I wrote the bash script run_optitype.sh to automate the process of running Optitype on each of the test cases generated by simulate_reads.py. (the number of samples to loop through will have to be adjusted inside the script based on the number of samples inside reference)

Usage

To run the script, you can use the following command-line arguments:

bash run_optitype.sh

4. Evaluating Results

I wrote the Python script evaluate_results.py to evaluate the results of the HLA-EM and Optitype pipelines. The script compares the predicted genotypes to the expected genotypes and calculates the accuracy of each pipeline. It also saves a csv file of the results and scores. This file contains the expected and predicted genotypes for each sample. It also records the mean accuracy of each trial per typing level (two, four, six, and eight-digit).

Usage

To run the script, you can use the following command-line arguments:

bash evaluate_results.sh

Rest of HLA-EM README

HLA-EM is an HLA genotyping tool that utilizes an expectation maximization algorithm to identify a patient's HLA genotype in a sample from RNA-seq data.

HLA-EM Prerequisites

Rather than installing each prerequisite manually, it is recommended to use the Docker image. This image includes all necessary dependencies and eliminates the need for manual installations. To use the Docker image, follow the steps below:

Using Docker Image

  1. Install Docker on your machine.

  2. Pull the Docker image:

    docker pull zeliason/hla_em:latest
  3. Run your application within the Docker container:

    docker run -it --rm -v /path/to/hla-em/src:/src zeliason/hla_em:latest [OPTIONS]

Input

Running HLA-EM with the -h option (or --help) will print a desciption of its optional and required input arguments. A description of each follows.

docker run zeliason/hla-em:latest [-h] [-t THREADS] [-r REFERENCE] [--starHLA STARHLA] [-o OUTNAME]
          [-d] [--tpm TPM] [-p] [-k] [-v] -s STARGENOME reads1 [reads2]

or

HLA-EM.py [-h] [-t THREADS] [-r REFERENCE] [--starHLA STARHLA] [-o OUTNAME]
          [-d] [--tpm TPM] [-p] [-k] [-v] -s STARGENOME reads1 [reads2]

Optional arguments

  • -h, --help
    Prints the help menu, which contains an example of the function usage and abbreviated explanations of each of the options.
  • -t, --threads THREADS
    Some portions of the HLA-EM pipeline support multithreading; if you wish to use this feature, set this to the number of cores available (default: 1)
  • -r, --reference REFERENCE
    HLA-EM includes a comprehensive set of HLA reference genes maintained by IMGT, the international ImMunoGeneTics Database, the use of which is recommended. However, if you wish to supply your own HLA gene reference file in FASTA format, use this option with the path to the file
  • -a, --annotation ANNOTATION
    A file of HLA gene annotations in TSV format to be used in place of the default HLA gene annotations [$INSTALLDIR/reference/hla_gene_annot.tsv]
  • --starHLA STARHLA path to a directory containing STAR-generated HLA genome indexes based on the reference FASTA
  • -o, --outname OUTNAME
    Prefix attached to output file names (default: ./hlaEM)
  • -d, --disabledust
    By default, HLA-EM filters low-complexity reads using the DUST algorithm. Specify this option to disable filtering
  • --tpm TPM
    TPM threshold for identifying a true positive HLA genotype (default: 1.48)
  • -p, --printem
    Causes HLA-EM to print its results to STDOUT
  • -k, --keepint
    By default, HLA-EM removes files generated by intermediate steps of the pipeline. Specify this option to keep these files
  • -v, --version
    Print program's version number and exit

Required arguments

  • -s, --stargenome STARGENOME
    Path to the directory containing STAR-generated human genome index files

Positional arguments

  • reads1 : A FASTQ file of single-end RNA-seq reads or the first of two paired-end FASTQs
  • reads2 : (optional) Second paired-end FASTQ file

Output

  • The final genotype prediction is written to OUTNAME.final_predictions.csv.

  • The results of the EM algorithm are written to OUTNAME.emresults.tsv. Its rows follow the format:

    HLAtype   MappedReads   MappedProportion   MLE_Reads   MLE_Probability
    
  • The results of CreateMappedReadTable (a table of filtered HLA reads and references) are written to OUTNAME.mappedReads.tsv.

  • The read coverage maps for each predicted HLA type are written to OUTNAME.*.cov.pdf.

  • Pie charts comparing MappedProportion and MLE_Probability for the predicted genotypes are written to OUTNAME.props.pdf.

  • BAM files containing reads aligned to the HLA reference genomes are written to OUTNAME.1.Aligned.*.bam.

  • A log file for the EM algorithm is written to OUTNAME.em_algorithm.Log. The first line indicates how may steps the algorithm took to converge and second line shows the maximum likelihood estimate (MLE) of the sequencing error rate.

About

HLA genotyping tool using an EM algorithm

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published