microhomology

The goal of this project is to use model-based statistical inference to identify the extent of microhomology involvement in nucleotide trimming and ligation processes during V(D)J recomination of adaptive immune receptor loci.

Install

Everything R 4.1.3 and/or Python 3.11.5 based. Python and R packages that are required can be installed via miniconda:

conda env create -f environment.yml
conda activate microhomology_jax

You will also need to install IGoR if you wish to annotate sequences using IGoR (Marcou et.al Nature Communications 2018)

Requirements:

Most of these analyses can be run on any machine. However, some of the data preparation steps, such as sequence annotation using IGoR (Marcou et.al Nature Communications 2018), are computationally intensive and require a cluster to run efficiently. This sequence annotation script is written specifically for a cluster set up to use the Slurm job scheduler. Some minor modifications could allow this step to be run locally or using a different cluster workload manager.

Analysis outline:

Table of Contents:

Model training
- Summary of parameter groups and model types
Quantifying coefficient significance
Model comparison
Model validation
- Summary of model validation options
Plot results

Model training

Download your training dataset of interest; We are using a dataset consisting of sequences published in these two publications: here and here
1. We downloaded processed repertoire data from these publications from the Adaptive ImmuneACCESS portal: here and here
Annotate these sequences using IGoR. You can run the annotation script to do this. As described above, this script is written specifically for a cluster set up to use the Slurm job scheduler. This script takes five arguments:
1. the raw file path (wherever you downloaded the files in step 0 above)
2. a temporary directory location (wherever you want the intermediate annotation files to be stored)
3. a boolean value representing whether the sequences were downloaded from the Adaptive ImmuneACCESS portal (if TRUE, the sequences require some extra processing)
4. the locus of the sequences (e.g. alpha or beta)
5. an output directory location (wherever you want the final annotation files to be stored--you will eventually save this location as TCR_REPERTOIRE_DATA_igor_alpha within the config file in step 3)
6. the number of possible rearrangement scenarios you want to sample from for each sequence (we used 10 scenarios)
7. the model parameters you would like to use (e.g. we use the default argument)
8. the model marginals you would like to use (e.g. we use the default argument)
9. the number of CPUs you want to use
10. the cluster partition that you want to submit to
Download TRA gene name (or whatever locus you'd like to use) and germline sequences from IMGT (we downloaded these data in September 2023); save these data to a file--you will eventually save the location of this file as WHOLE_NUCSEQS_alpha within the config file in step 3
Edit config file to be project and/or computer specific. See the README for more details.
Pre-process data for model fitting using the data processing script. This script takes 6 arguments, and can be run locally:
1. the annotation type (e.g. igor_alpha for IGoR-annotated TRA sequences)
2. the parameter group which specifies sequence productivity and the prediction task; the options are described in the parameter group options section
3. the number of CPUs you want to use
4. left motif count (an integer between 0 and 6)--specifies the number of nucleotides 5' of the trimming site to be included in the motif parameter (for models without motif parameters, this should be 0)
5. right motif count (an integer between 0 and 6)--specifies the number of nucleotides 3' of the trimming site to be included in the motif parameter (for models without motif parameters, this should be 0)
6. the desired model type; the options are described in the model options section
Train model using the script suggested in the model options section (e.g. the model training script will depend on the model type you are wanting to train). All of the modeling scripts take 7 arguments, and can be run locally or on a cluster:
1. the annotation type (e.g. igor_alpha for IGoR-annotated TRA sequences)
2. the parameter group which specifies sequence productivity and the prediction task; the options are described in the parameter group options section
3. the number of CPUs you want to use
4. left motif count (an integer between 0 and 6)--specifies the number of nucleotides 5' of the trimming site to be included in the motif parameter (for models without motif parameters, this should be 0)
5. right motif count (an integer between 0 and 6)--specifies the number of nucleotides 3' of the trimming site to be included in the motif parameter (for models without motif parameters, this should be 0)
6. the desired model type; the options are described in the model options section
7. boolean value indicating whether you want to use L2 regularization for the MH-related terms in your model (e.g. choices are True or False)

This trained model will be stored in the trained_models directory.

Parameter group and model type options

Here is a summary of some of the parameter group options:

Parameter group argument	Description	Model training script	Model type option (long name, same as manuscript)	Trimming-related features parameterized	Ligation-related features parameterized	Data processing model argument	Model training model argument
`motif_two-side-base-count-beyond`	Nonproductive sequences used to infer joint V-J trimming probabilities, independent of ligation	fit_trim_model.sh	null	None	NA	`null`	`null`
			motif + two-side-base-count-beyond	1x2 trimming motif and the count of AT and GC nucleotides on either side of the trimming site, for both the V-gene trimming site and J-gene trimming site	NA	`motif_two-side-base-count-beyond`	`motif_two-side-base-count-beyond`
			motif + two-side-base-count-beyond + interior-mh-count	1x2 trimming motif and the count of AT and GC nucleotides on either side of the trimming site, for both the V-gene trimming site and J-gene trimming site, as well as the number of non-contiguous MH nucleotides within overlapping regions between sequences	NA	`motif_two-side-base-count-beyond_interior-mh-count`	`motif_two-side-base-count-beyond_interior-mh-count`
`nonproductive_v-j_trim_ligation-mh`	Nonproductive sequences used to infer joint V-J trimming and ligation probabilities	fit_twostep_model_EM.sh	null	None	None	`null`	`twostep_null`
			motif + two-side-base-count-beyond	1x2 trimming motif and the count of AT and GC nucleotides on either side of the trimming site, for both the V-gene trimming site and J-gene trimming site	None	`motif_two-side-base-count-beyond`	`twostep_motif_two-side-base-count-beyond`
			motif + two-side-base-count-beyond + average-mh	1x2 trimming motif, the count of AT and GC nucleotides on either side of the trimming site, for both the V-gene trimming site and J-gene trimming site, and the average number of contiguous MH nucleotides across all possible ligation configurations	None	`motif_two-side-base-count-beyond_average-mh`	`twostep_motif_two-side-base-count-beyond_average-mh`
			motif + two-side-base-count-beyond + ligation-mh	1x2 trimming motif and the count of AT and GC nucleotides on either side of the trimming site, for both the V-gene trimming site and J-gene trimming site	Number of MH nucleotides within ligation configuration	`motif_two-side-base-count-beyond_ligation-mh`	`twostep_motif_two-side-base-count-beyond_ligation-mh`
			motif + two-side-base-count-beyond + average-mh + ligation-mh	1x2 trimming motif, the count of AT and GC nucleotides on either side of the trimming site, for both the V-gene trimming site and J-gene trimming site, and the average number of contiguous MH nucleotides across all possible ligation configurations	Number of MH nucleotides within ligation configuration	`motif_two-side-base-count-beyond_average-mh_ligation-mh`	`twostep_motif_two-side-base-count-beyond_average-mh_ligation-mh`

Quantifying coefficient significance

If you would like to measure the significance of each model coefficient, you can follow the following steps:

Train the model of interest
Determine the relevant bootstrapping file which will depend on the parameter group and model type you are using. If you are using the fit_trim_model.sh model training script, you will want to use the bootstrap_trim_models.sh script; Likewise, if you are using the fit_twostep_model_EM.sh model training script, you will want to use the bootstrap_twostep_models_EM.sh script.
Run the bootstrapping script. These scripts take 7 arguments, and can be run on a cluster:
1. the annotation type (e.g. igor_alpha for IGoR-annotated TRA sequences)
2. the parameter group which specifies sequence productivity and the prediction task; the options are described in the parameter group options section
3. the number of CPUs you want to use
4. left motif count (an integer between 0 and 6)--specifies the number of nucleotides 5' of the trimming site to be included in the motif parameter (for models without motif parameters, this should be 0)
5. right motif count (an integer between 0 and 6)--specifies the number of nucleotides 3' of the trimming site to be included in the motif parameter (for models without motif parameters, this should be 0)
6. the desired model type; the options are described in the model options section
7. boolean value indicating whether you want to use L2 regularization for the MH-related terms in your model (e.g. choices are True or False)
Once the bootstrapped models have finished being trained, you can run the significance testing script to evaluate the significance of the model coefficients. This script takes 7 arguments and can be run locally:
1. the parameter group which specifies sequence productivity and the prediction task; the options are described in the parameter group options section
2. the number of CPUs you want to use
3. left motif count (an integer between 0 and 6)--specifies the number of nucleotides 5' of the trimming site to be included in the motif parameter (for models without motif parameters, this should be 0)
4. right motif count (an integer between 0 and 6)--specifies the number of nucleotides 3' of the trimming site to be included in the motif parameter (for models without motif parameters, this should be 0)
5. the model type of the model of interest; the options are described in the model options section
6. boolean value indicating whether you want to use L2 regularization for the MH-related terms in your model (e.g. choices are True or False)
7. the annotation type (e.g. igor_alpha for IGoR-annotated TRA sequences)

This analysis will save a file containing the results in a directory called bootstraps located in the indicated OUTPUT_PATH as specified in the config files

Model comparison

If you would like to use a likelihood ratio test to compare two nested models, you can follow these steps:

Train the two models of interest using the same dataset (e.g. same input arguments to model training script, except for model type); Make sure the two models are nested!
Run the LRT script which takes 9 arguments and can be ran locally:
1. the parameter group which specifies sequence productivity and the prediction task; the options are described in the parameter group options section
2. the number of CPUs you want to use
3. left motif count (an integer between 0 and 6)--specifies the number of nucleotides 5' of the trimming site to be included in the motif parameter (for models without motif parameters, this should be 0)
4. right motif count (an integer between 0 and 6)--specifies the number of nucleotides 3' of the trimming site to be included in the motif parameter (for models without motif parameters, this should be 0)
5. the model type of the full model; the options are described in the model options section
6. the model type of the alternative model (which should contain a subset of the parameters that the full model contains)
7. boolean value indicating whether you want to use L2 regularization for the MH-related terms in your model (e.g. choices are True or False)
8. the annotation type (e.g. igor_alpha for IGoR-annotated TRA sequences)
9. the number of degrees of freedom (e.g. should be 1 if only one parameter is different between the two models being analyzed)

The resulting output file will be located in a directory within the OUTPUT_PATH as specified in the config file

Model validation

If you would like to use the model to make predictions on a new data set and/or validate the model on a testing data set, you can follow these steps:

Download the processed testing data set, and make sure it contains the same column names as the training data set
Depending on the locus of the testing data, download the appropriate gene name and germline sequences from IMGT (we downloaded these data in April 2023); save these data to a file--you will save the location of this file in the next step
Edit config file to be project and/or computer specific (be sure to add the path to the file from the last step)
Run the model validation script or the twostep model validation script, depending on the parameter group and model type you are using (e.g. see parameter group and model type options section for more details). This script takes 8 arguments:
1. the annotation type of the data used for model training (e.g. igor_alpha for IGoR-annotated TRA sequences)
2. the parameter group which specifies sequence productivity and the prediction task; the options are described in the parameter group options section
3. the number of CPUs you want to use
4. left motif count (an integer between 0 and 6)--specifies the number of nucleotides 5' of the trimming site to be included in the motif parameter (for models without motif parameters, this should be 0)
5. right motif count (an integer between 0 and 6)--specifies the number of nucleotides 3' of the trimming site to be included in the motif parameter (for models without motif parameters, this should be 0)
6. the desired model type; the options are described in the model options section
7. boolean value indicating whether you want to use L2 regularization for the MH-related terms in your model (e.g. choices are True or False)
8. the annotation type of the data you want to use for validation (e.g. see the model validation options summary for details)

All output files will be located in a directory within the OUTPUT_PATH as specified in the config file

Model validation options

Summary of the model validation data sets used in our analysis:

Locus/Junction name	Code argument	Data used in the manuscript
TRA locus/VJ junction	`validation_data_alpha`	here
TRG locus/VJ junction	`validation_data_gamma`	here

Plot results

Plot figures from the manuscript.

Also, have a look at the plotting README for more details.

About the analysis

With this analysis, we want to quantify the extent of microhomology involvement for nucleotide trimming and ligation during V(D)J recombination. See the manuscript for specific model and methods details:

TODO: add manuscript details

The following packages were especially helpful in our analyses:

Python:

jax
jaxopt
pandas

R:

data.table (Dowle and Srinivasan, 2021)
tidyverse (Wickham et. al, 2019)
doParallel (Corporation and Steve Weston, 2020)
cowplot (Wilke, 2020)
Biostrings (Pagès et. al, 2021)

Name		Name	Last commit message	Last commit date
Latest commit History 74 Commits
analysis_scripts		analysis_scripts
config		config
igor_annotation_scripts		igor_annotation_scripts
jax_scripts		jax_scripts
mh_simulation_scripts		mh_simulation_scripts
plotting_scripts		plotting_scripts
scripts		scripts
trained_models		trained_models
.gitignore		.gitignore
.gitmodules		.gitmodules
LICENSE		LICENSE
README.md		README.md
bootstrap_trim_models.sh		bootstrap_trim_models.sh
bootstrap_twostep_models.sh		bootstrap_twostep_models.sh
bootstrap_twostep_models_EM.sh		bootstrap_twostep_models_EM.sh
environment.yml		environment.yml
fit_trim_model.sh		fit_trim_model.sh
fit_twostep_model.sh		fit_twostep_model.sh
fit_twostep_model_EM.sh		fit_twostep_model_EM.sh
process_model_data.sh		process_model_data.sh
validate_model.sh		validate_model.sh
validate_twostep_model.sh		validate_twostep_model.sh
validate_twostep_model_EM.sh		validate_twostep_model_EM.sh

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

microhomology

Install

Requirements:

Analysis outline:

Model training

Parameter group and model type options

Quantifying coefficient significance

Model comparison

Model validation

Model validation options

Plot results

About the analysis

About

Releases

Packages

Contributors 2

Languages

License

magdalenarussell/microhomology

Folders and files

Latest commit

History

Repository files navigation

microhomology

Install

Requirements:

Analysis outline:

Model training

Parameter group and model type options

Quantifying coefficient significance

Model comparison

Model validation

Model validation options

Plot results

About the analysis

About

Resources

License

Stars

Watchers

Forks

Releases

Packages 0

Contributors 2

Languages

Packages