'regpot_single_gene_hg38_mm10.py' calculates the regulatory potential (RP) score (https://www.nature.com/articles/s41467-020-16106-x) of a transcription factor for a given gene in either the human (hg38) or mouse (mm10) genome. The RP score is a measure of the likelihood that a transcription factor (TF) regulates a gene, based on the distances and the number of TF binding sites, as evidenced by TF ChIP-Seq binding peaks (https://remap2022.univ-amu.fr/), to the gene's transcription start site (TSS).
Here's a step-by-step breakdown of what the script does:
-
It takes three command-line arguments: the gene symbol, the species (either 'hg38' or 'mm10'), and the half-decay distance (either '1kb', '10kb', or '100kb').
-
It connects to a SQLite database that contains information about the genes and their transcription start sites (TSS).
-
It retrieves the gene's information from the database, including its chromosomal location, strand, transcription start and end sites, and species.
-
It creates a new table in the database to store the gene's information.
-
It calculates a region around the gene's TSS, extending to the left and right by the half-decay distance.
-
It determines which bed file to use based on the species, and creates a BedTool object for the region and the bed file.
-
It finds the regions in the bed file that overlap with the gene's region.
-
It creates a new table in the database to store the overlapping regions.
-
It retrieves a list of distinct transcription factors (TFs) from the overlapping regions.
-
For each TF, it calculates the distances between the TF's peak binding sites and the gene's TSS.
-
It calculates the fraction of distances that are less than or equal to 1000 base pairs.
-
It sets a parameter
d0
based on the fraction, and calculates the RP score for each TF. -
It creates a DataFrame that contains the RP scores for all TFs, and appends it to a CSV file.
-
Finally, it closes the connection to the database.