Synthesize exons when they are not included in the input GFF3 #491

Open
garrettjstevens opened this issue Dec 4, 2024 · 1 comment

Comments

@garrettjstevens
Contributor

Sometimes we see a GFF3 that does not explicitly state where the exons are, e.g.

ctgA	example	gene	1050	9000	.	+	.	ID=EDEN;Name=EDEN;Note=protein kinase
ctgA	example	mRNA	1050	9000	.	+	.	ID=EDEN.1;Parent=EDEN;Name=EDEN.1;Note=Eden splice form 1;Index=1
ctgA	example	mRNA	1050	9000	.	+	.	ID=EDEN.2;Parent=EDEN;Name=EDEN.2;Note=Eden splice form 2;Index=1
ctgA	example	five_prime_UTR	1050	1200	.	+	.	Parent=EDEN.1
ctgA	example	five_prime_UTR	1050	1200	.	+	.	Parent=EDEN.2
ctgA	example	CDS	1201	1500	.	+	0	Parent=EDEN.1
ctgA	example	CDS	1201	1500	.	+	0	Parent=EDEN.2
ctgA	example	mRNA	1300	9000	.	+	.	ID=EDEN.3;Parent=EDEN;Name=EDEN.3;Note=Eden splice form 3;Index=1
ctgA	example	five_prime_UTR	1300	1500	.	+	.	Parent=EDEN.3
ctgA	example	CDS	3000	3902	.	+	0	Parent=EDEN.1
ctgA	example	five_prime_UTR	3000	3300	.	+	.	Parent=EDEN.3
ctgA	example	CDS	3301	3902	.	+	0	Parent=EDEN.3
ctgA	example	CDS	5000	5500	.	+	0	Parent=EDEN.1
ctgA	example	CDS	5000	5500	.	+	0	Parent=EDEN.2
ctgA	example	CDS	5000	5500	.	+	1	Parent=EDEN.3
ctgA	example	CDS	7000	7600	.	+	1	Parent=EDEN.3
ctgA	example	CDS	7000	7608	.	+	0	Parent=EDEN.1
ctgA	example	CDS	7000	7608	.	+	0	Parent=EDEN.2
ctgA	example	three_prime_UTR	7601	9000	.	+	.	Parent=EDEN.3
ctgA	example	three_prime_UTR	7609	9000	.	+	.	Parent=EDEN.1
ctgA	example	three_prime_UTR	7609	9000	.	+	.	Parent=EDEN.2

In this case we need to synthesize the exons for our internal representations.

We can use the five_prime_UTR, three_prime_UTR, and CDS lines to figure out where the exons are. If a UTR and a CDS are adjacent, they should be combined into a single exon. Otherwise, each unique CDS location should get an exon with the same location.

This needs to be handled in packages/apollo-shared/src/GFF3/gff3ToAnnotationFeature.ts. We'll probably want to check, after processedCDS are determined in that file, whether there are any exons, and synthesize them at that point if not.
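As a rough illustration of the merging rule described above, here is a minimal sketch. It is not the code in gff3ToAnnotationFeature.ts: the `Interval` shape and the `synthesizeExons` name are hypothetical, and it assumes the UTR and CDS children have already been grouped per transcript. It also assumes that a UTR with no adjacent CDS (like the first five_prime_UTR of EDEN.3) should become an exon of its own.

```typescript
// Hypothetical minimal feature shape for this sketch; not Apollo's internal type.
interface Interval {
  type: string
  start: number // 1-based, inclusive, as in GFF3
  end: number // 1-based, inclusive
}

// Feature types that are assumed to always lie inside an exon.
const EXON_PARTS = new Set(['five_prime_UTR', 'CDS', 'three_prime_UTR'])

/**
 * Given the children of ONE transcript, merge adjacent or overlapping
 * UTR/CDS intervals into synthesized exons.
 */
function synthesizeExons(children: Interval[]): Interval[] {
  const parts = children
    .filter((child) => EXON_PARTS.has(child.type))
    .sort((a, b) => a.start - b.start || a.end - b.end)

  const exons: Interval[] = []
  for (const part of parts) {
    const last: Interval | undefined = exons[exons.length - 1]
    // With inclusive coordinates, "adjacent" means the next start is last.end + 1.
    if (last !== undefined && part.start <= last.end + 1) {
      last.end = Math.max(last.end, part.end)
    } else {
      exons.push({ type: 'exon', start: part.start, end: part.end })
    }
  }
  return exons
}

// Children of EDEN.3 from the example above.
const eden3: Interval[] = [
  { type: 'five_prime_UTR', start: 1300, end: 1500 },
  { type: 'five_prime_UTR', start: 3000, end: 3300 },
  { type: 'CDS', start: 3301, end: 3902 },
  { type: 'CDS', start: 5000, end: 5500 },
  { type: 'CDS', start: 7000, end: 7600 },
  { type: 'three_prime_UTR', start: 7601, end: 9000 },
]

// Synthesized exons for EDEN.3: 1300-1500, 3000-3902, 5000-5500, 7000-9000
console.log(synthesizeExons(eden3).map((e) => `${e.start}-${e.end}`).join(', '))
```

Sorting first keeps the merge to a single pass per transcript, so the extra cost should scale with the number of UTR and CDS lines rather than with the whole file.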

@dariober
Contributor

This should be fixed in PR #492

The example in the issue looks like:

image

The code may be inefficient, but the slowdown doesn't seem noticeable. These are two replicates loading a small GFF3 file with 283 mRNAs:

# Before (just before this branch started):
git checkout 143354232ff73b042aa0f55996b6b94068eeb748 
time yarn dev feature import --profile testAdmin ~/Downloads/TGGT1_chrII.gff -d -a ToxoDB-67_TgondiiGT1_Genome.fasta.gz
progress [========================================] 100% | ETA: 0s | 605768/605768

real	0m20.265s
user	0m3.263s
sys	0m0.403s

time yarn dev feature import --profile testAdmin ~/Downloads/TGGT1_chrII.gff -d -a ToxoDB-67_TgondiiGT1_Genome.fasta.gz
progress [========================================] 100% | ETA: 0s | 605768/605768

real	0m18.901s
user	0m2.900s
sys	0m0.371s

Current:

git switch -
Previous HEAD position was 14335423 Make feature type ontology configurable (#472)
Switched to branch 'import_gff3_wo_exons_issue491'
Your branch is ahead of 'origin/import_gff3_wo_exons_issue491' by 2 commits.
  (use "git push" to publish your local commits)
time yarn dev feature import --profile testAdmin ~/Downloads/TGGT1_chrII.gff -d -a ToxoDB-67_TgondiiGT1_Genome.fasta.gz
progress [========================================] 100% | ETA: 0s | 605768/605768

real	0m19.276s
user	0m2.675s
sys	0m0.345s

time yarn dev feature import --profile testAdmin ~/Downloads/TGGT1_chrII.gff -d -a ToxoDB-67_TgondiiGT1_Genome.fasta.gz
progress [========================================] 100% | ETA: 0s | 605768/605768

real	0m19.263s
user	0m2.885s
sys	0m0.347s
