HTP paper data #3465

Closed
manulera opened this issue Feb 24, 2023 · 32 comments

@manulera
Contributor

@ValWood @kimrutherford

Here is the file with the HTP data that I think is ready for PomBase.

https://github.com/manulera/phenomics_paper_HTP/blob/master/results/pombase_dataset.tsv

I have added some extra columns that we can remove, but I think they could be interesting to request going forward:

  • Temperature:
  • Chemical or agent: Chemicals or agents used in the experiment, pipe-separated (|). Chemicals are given as CHEBI IDs; other agents get a brief description, e.g. UV.
  • Chemical or agent dose: Dose of the chemical or agent, with units where relevant.
  • Phenotype score: There is typically some measure of how strong the phenotype is. In Jurg's paper they used the fold change of some measurement (cell length or percentage of binucleated cells), or the log2 of the median fitness.
  • Phenotype score units: The units of the score. For Jurg's paper I used median_fitness_log2 and fold_change.
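
As a rough illustration of how the pipe-separated "Chemical or agent" column could be consumed, here is a minimal sketch. The function name `split_agents` is hypothetical, and treating every entry that starts with `CHEBI:` as a chemical is an assumption based on the column description above, not part of any PomBase loader.

```python
def split_agents(cell: str) -> dict:
    """Split a 'Chemical or agent' cell into ChEBI IDs and free-text agents."""
    chemicals, other_agents = [], []
    for item in cell.split("|"):
        item = item.strip()
        if not item:
            continue  # tolerate empty cells and stray separators
        if item.startswith("CHEBI:"):
            chemicals.append(item)
        else:
            other_agents.append(item)  # e.g. "UV"
    return {"chemicals": chemicals, "other_agents": other_agents}


print(split_agents("CHEBI:48080|UV"))
```

Keeping the two kinds of agent apart would make it easier to validate the ChEBI IDs separately from the free-text descriptions.
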
@kimrutherford kimrutherford self-assigned this Feb 25, 2023
@kimrutherford
Member

Thanks @manulera

I'll try to load this soon. I noticed that there are some rows where the "Gene systematic ID" is just "972"?

 Gene systematic ID │ FYPO ID      │ Allele description │ Expression │ Parental strain │ Background strain name │ Background genotype description │ Gene name │ Allele name │ Allele synonym │ Allele type │ Evidence    │ Condition                                 │ Penetrance │ Severity │ Extension │ Reference     │ taxon │ Date       │ Ploidy  │ Temperature │ Chemical or agent │ Chemical or agent dose │ Phenotype score │ Phenotype score units
 972                │ FYPO:0000067 │ deletion           │ null       │ 972 h-          │                        │                                 │           │             │                │ deletion    │ ECO:0001563 │ FYECO:0000137,FYECO:0000319               │            │          │           │ PMID:34984977 │ 4896  │ 2023-02-23 │ haploid │ 32          │ CHEBI:48080       │ 60uM                   │ 0.213           │ median_fitness_log2
 972                │ FYPO:0005266 │ deletion           │ null       │ 972 h-          │                        │                                 │           │             │                │ deletion    │ ECO:0001563 │ FYECO:0000137,FYECO:0000167               │            │          │           │ PMID:34984977 │ 4896  │ 2023-02-23 │ haploid │ 32          │ CHEBI:8984        │ 0.02% (w/v)            │ 0.077           │ median_fitness_log2
 972                │ FYPO:0000797 │ deletion           │ null       │ 972 h-          │                        │                                 │           │             │                │ deletion    │ ECO:0001563 │ FYECO:0000137,FYECO:0000304               │            │          │           │ PMID:34984977 │ 4896  │ 2023-02-23 │ haploid │ 32          │ CHEBI:64090       │ 15mM                   │ -0.247          │ median_fitness_log2
 972                │ FYPO:0000089 │ deletion           │ null       │ 972 h-          │                        │                                 │           │             │                │ deletion    │ ECO:0001563 │ FYECO:0000334,FYECO:0000138,FYECO:0000211 │            │          │           │ PMID:34984977 │ 4896  │ 2023-02-23 │ haploid │ 32          │ CHEBI:25255       │                        │ -0.239          │ median_fitness_log2
 972                │ FYPO:0000087 │ deletion           │ null       │ 972 h-          │                        │                                 │           │             │                │ deletion    │ ECO:0001563 │ FYECO:0000137,FYECO:0000334,FYECO:0000078 │            │          │           │ PMID:34984977 │ 4896  │ 2023-02-23 │ haploid │ 32          │ CHEBI:16240       │ 10mM                   │ -0.064          │ median_fitness_log2
 972                │ FYPO:0000089 │ deletion           │ null       │ 972 h-          │                        │                                 │           │             │                │ deletion    │ ECO:0001563 │ FYECO:0000137,FYECO:0000334,FYECO:0000211 │            │          │           │ PMID:34984977 │ 4896  │ 2023-02-23 │ haploid │ 32          │ CHEBI:25255       │ 0.075% (v/v)           │ -0.199          │ median_fitness_log2
 972                │ FYPO:0000797 │ deletion           │ null       │ 972 h-          │                        │                                 │           │             │                │ deletion    │ ECO:0001563 │ FYECO:0000137,FYECO:0000334,FYECO:0000304 │            │          │           │ PMID:34984977 │ 4896  │ 2023-02-23 │ haploid │ 32          │ CHEBI:64090       │ 1.5mM                  │ -0.323          │ median_fitness_log2

@manulera
Contributor Author

Hi Kim, yes, those are definitely wrong: they are controls. I will have to double-check why they are there in the first place. I will sort it out on Monday.

@kimrutherford
Member

I will sort it out on Monday.

Thanks!

There are some other IDs that aren't in Chado. Most of the IDs are synonyms of current genes. Here's the list, with the current ID in brackets:

SPNCRNA.01
SPNCRNA.1341 (SPNCRNA.103)
SPNCRNA.144 (SPNCRNA.627)
SPNCRNA.145 (SPNCRNA.628)
SPNCRNA.236 (SPNCRNA.941)
SPNCRNA.248
SPNCRNA.289 (SPNCRNA.1303)
SPNCRNA.410 (SPNCRNA.1559)
SPNCRNA.426 (SPNCRNA.1624)
SPNCRNA.46 (SPAC144.19)
SPNCRNA.507 (SPNCRNA.808)
SPNCRNA.515 (SPNCRNA.1651)
SPNCRNA.538 (SPNCRNA.1501)
SPNCRNA.539 (SPNCRNA.1533)
SPNCRNA.59 (SPNCRNA.817)
SPNCRNA.7
SPNCRNA.87

kimrutherford added a commit to pombase/pombase-chado that referenced this issue Feb 27, 2023
@kimrutherford
Member

I'll try to load this soon.

I've done that now. It all looks good. The only warnings were for the unknown gene IDs and for some FYPO terms that aren't in Chado yet (IDs from FYPO:0009007 to FYPO:0009054). All the other columns and the formatting are correct.

I have added some extra columns that we can remove, but I think they could be interesting to request going forward:

I've changed the PHAF file loader to accept the extra columns. Currently they aren't stored in Chado, but they could be if needed.

Do we want to show any of the values from the extra columns on the website?

@kimrutherford
Member

I've loaded the data file locally as a test. 3608 annotations loaded successfully (of 5073):
https://desktop.kmr.nz/reference/PMID:34984977

@ValWood
Member

ValWood commented Feb 27, 2023 via email

@ValWood
Member

ValWood commented Feb 27, 2023

I've loaded the data file locally as a test. 3608 annotations loaded successfully (of 5073)

I thought there were ~50,000 annotations?

@kimrutherford
Member

I thought there were ~50,000 annotations?

Definitely 5073 in this data set.

@kimrutherford
Member

I thought there were ~50,000 annotations?

That's quite a lot more than our current biggest publication (Marguerat et al. with 19906). It's going to be slow to load the new publication page so I might have to do some work on that.

@manulera
Contributor Author

I am doing some extra checks. I will comment in a sec

@ValWood
Member

ValWood commented Feb 27, 2023

yes, the cell cycle and deletions publication pages have always been a bit of a problem loading too.

@ValWood
Member

ValWood commented Feb 27, 2023

Do we want to show any of the values from the extra columns on the website?
I think it would be useful. We can discuss where and how.

@manulera
Contributor Author

manulera commented Feb 27, 2023

Hi @kimrutherford and @ValWood

I have removed the 972 ones. They appeared as hits even though they are controls, I guess to check whether the condition affects growth in the wild type itself.
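
For illustration, dropping the control rows could look like the minimal sketch below. It is not the script actually used for the dataset; the function name `drop_control_rows` and the two-column example input are hypothetical, and it assumes "972" in the "Gene systematic ID" column always marks a wild-type control.

```python
import csv
import io

# Assumption: "972" is the wild-type control label that ended up in the gene ID column.
CONTROL_IDS = {"972"}


def drop_control_rows(tsv_text: str) -> str:
    """Return the TSV text without rows whose gene ID is a control label."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    out = io.StringIO()
    writer = csv.DictWriter(
        out, fieldnames=reader.fieldnames, delimiter="\t", lineterminator="\n"
    )
    writer.writeheader()
    for row in reader:
        if row["Gene systematic ID"] not in CONTROL_IDS:
            writer.writerow(row)
    return out.getvalue()


# Tiny made-up input with only two of the PHAF columns.
example = (
    "Gene systematic ID\tFYPO ID\n"
    "972\tFYPO:0000067\n"
    "SPNCRNA.31\tFYPO:0001355\n"
)
print(drop_control_rows(example))
```
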

I had a look at the missing systematic ids and made this table:

https://github.com/manulera/phenomics_paper_HTP/blob/master/results/ncRNA_table_missing.tsv

  • current_synonym: Synonym of the obsoleted systematic id.
  • coordinates: The coordinates that the ncRNA had at the time, and therefore the genome region that was deleted in the screen.
  • synonym_already_present: True if the synonym already exists in the screen as a separate entry. E.g. SPNCRNA.1341 is an obsoleted systematic id and currently SPNCRNA.103 is a synonym. However, SPNCRNA.1341 has different coordinates and was also deleted in the screen.
  • synonym_coordinates: Empty if no synonym exists, or if the coordinates of the synonym match the coordinates that were deleted in the screen. In other words, it has a value if what was deleted does not match the current coordinates, even if the obsoleted id is a synonym of an existing id.

Similar case for SPNCRNA.388, where the coordinates have been updated and what they deleted (coordinates) does not match what we currently list as SPNCRNA.388 (current_coordinates).

https://github.com/manulera/phenomics_paper_HTP/blob/master/results/ncRNA_table_differing_coordinates.tsv

What to do

We can load the synonyms where coordinates match and where the synonym was not present (SPNCRNA.817, SPAC144.19, SPNCRNA.316).

Not sure what to do with:

  • Those where the synonym id has different coordinates and was separately deleted.
  • Those where the synonym id was not listed, but has different coordinates.
  • Those without a current synonym (entirely obsoleted)
  • SPNCRNA.388, for which current coordinates do not match what was deleted.

Once we decide what to do I can update the file.
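
The cases listed above can be sketched as a small classifier over the rows of ncRNA_table_missing.tsv. This is only an illustration under stated assumptions: the function name `classify` and the category labels are hypothetical, booleans are assumed to be written as the string "True", and the "same coordinates, synonym deleted separately" branch covers a combination that is not discussed in this thread.

```python
def classify(row: dict) -> str:
    """Assign a row of the missing-ID table to one of the cases discussed above."""
    has_synonym = bool(row["current_synonym"].strip())
    # Assumption: the TSV stores booleans as the literal strings "True" / "False".
    already_present = row["synonym_already_present"].strip() == "True"
    # A non-empty value means the deleted region differs from the current coordinates.
    coords_differ = bool(row["synonym_coordinates"].strip())

    if not has_synonym:
        return "no current synonym"
    if coords_differ and already_present:
        return "different coordinates, synonym deleted separately"
    if coords_differ:
        return "different coordinates, synonym not in screen"
    if already_present:
        return "same coordinates, synonym deleted separately"
    return "load under current ID"


# Placeholder rows; the IDs and coordinates are made up for the example.
rows = [
    {"current_synonym": "GENE.A", "synonym_already_present": "False", "synonym_coordinates": ""},
    {"current_synonym": "GENE.B", "synonym_already_present": "True", "synonym_coordinates": "I:100..200"},
    {"current_synonym": "", "synonym_already_present": "False", "synonym_coordinates": ""},
]
for row in rows:
    print(classify(row))
```
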

kimrutherford added a commit to pombase/pombase-legacy that referenced this issue Feb 27, 2023
@kimrutherford
Member

I've loaded the data file locally as a test. 3608 annotations loaded successfully (of 5073)

Now that all the new FYPO terms are available, 4929 out of 5073 now load OK:
https://desktop.kmr.nz/reference/PMID:34984977

The remaining annotations aren't loading because of the gene ID issues that we're working on.

@ValWood
Member

ValWood commented Feb 28, 2023

For the untraceables:

SPNCRNA.01 was the last 1101 bases of the eta2 transcription unit, entirely contained within the UTR.
The 3’UTR feature
3003767..3004916
has this note
/note="SPNCRNA.01/prl101 part of the UTR of eta2"
So, this should be described as a nucleotide mutation of SPAC31G5.10 (allele synonym can be SPNCRNA.01delta).
How can we make "SPNCRNA.01/prl101 part of the UTR of eta2" show up on the history page at some point? This is the info I would have put in the comment here https://www.pombase.org/status/gene-coordinate-changes if we still maintained it.

SPNCRNA.248 was merged into SPNCRNA.996.
Although SPNCRNA.248 was 78 bases longer at the N-term (SPNCRNA.996 is 1611 nt in total), it's clearly the same feature; there is no evidence in the transcriptome viewer for a separate feature.
SPNCRNA.248 was probably one of the features that could be on either strand that I was trying to resolve, and was probably originally derived from the NCRNA.995 on the opposite strand, so it's additionally complicated.
Anyway, I added SPNCRNA.248 as an alternative ID of SPNCRNA.996 because this is the strand it was annotated on.
This allele could be described as a partial deletion of SPNCRNA.996.

SPNCRNA.87 is the same as SPNCRNA.893 (SPNCRNA.87 has the same C term but was 17 bp longer at the N-term). I added SPNCRNA.87 back as the systematic ID and made SPNCRNA.893 a synonym (the earlier name should take precedence; I don't know why I did not do that).

@ValWood
Member

ValWood commented Feb 28, 2023

For the ones with slightly different coordinates like

SPNCRNA.1341 has different coordinates and was also deleted in the screen.

could we describe these as alleles of the current gene, but representing the deleted region? (So they would have deleted slightly more, or slightly less, than the current region.) I usually took the later coordinates because they would likely be more accurate than the earlier coordinates (which used microarrays, greedy algorithms to assemble short reads, etc.).
The allele name could be synonym-delta.
Would that work?
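
To make "slightly more, or slightly less" concrete, the deleted region can be compared with the current feature coordinates. The sketch below is only an illustration: `compare_regions` is a hypothetical helper, coordinates are assumed to be 1-based and inclusive on the same chromosome, and the numbers in the example are made up.

```python
def compare_regions(deleted: tuple, current: tuple) -> dict:
    """Compare a deleted region with the current feature, both as (start, end)."""
    del_start, del_end = deleted
    cur_start, cur_end = current
    return {
        # Positive: the deletion extends beyond the current feature on that side.
        # Negative: part of the current feature was left undeleted on that side.
        "extra_at_start": cur_start - del_start,
        "extra_at_end": del_end - cur_end,
    }


# Hypothetical coordinates: the deletion starts 17 bp before the current feature.
print(compare_regions(deleted=(1000, 2000), current=(1017, 2000)))
```
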

@manulera
Contributor Author

manulera commented Mar 1, 2023

This is the info I would have put in the comment here https://www.pombase.org/status/gene-coordinate-changes if we still maintained it.

You can add these kinds of comments in the equivalent file of the new repo and they will be visible in the website.
https://github.com/pombase/genome_changelog/blob/master/gene_changes_comments_and_pmids/gene-coordinate-change-data.tsv

@manulera
Contributor Author

manulera commented Mar 1, 2023

@ValWood I already added it, you can see the line in this commit:
pombase/genome_changelog@136bfda#diff-3c2606074897cee718b203b19ba7ed05475bcbdc2b1af9b3d5c9959c02c59b90

@manulera
Contributor Author

manulera commented Mar 1, 2023

Hi @ValWood, I have now realised that things are even more complicated for the overexpression alleles, where they introduced the promoters at the "wrong place", so they have expressed some chunk of the RNA, or something that is in principle outside of the RNA... Should we just drop those phenotypes? This concerns 72 annotations.

@manulera
Contributor Author

manulera commented Mar 1, 2023

Hi @ValWood , I decided to record such cases like this (example for SPNCRNA.628). The allele variant and the description are just a feature location of the fragment that was cloned into the plasmid. I think it's probably the simplest / safest solution.

Gene systematic ID FYPO ID Allele description Expression Parental strain Background strain name Background genotype description Gene name Allele name Allele synonym Allele type Evidence Condition Penetrance Severity Extension Reference taxon Date Ploidy Allele variant
SPNCRNA.628 FYPO:0009051 I:239730..240571 overexpression 972 h- or 968 h90 SPNCRNA.145OE other ECO:0001563 FYECO:0000254(5 g/L),FYECO:0000005(32) fitness_log2(0.0931) PMID:34984977 4896 23/02/2023 haploid I:239730..240571
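
A feature location such as I:239730..240571 is easy to turn into structured data. The sketch below is an assumption-laden illustration rather than PomBase code: `parse_location` is a hypothetical name, and the `complement(...)` wrapper is assumed to mark the reverse strand, as in standard feature-table notation.

```python
import re

# chromosome:start..end on the forward strand, e.g. "I:239730..240571"
PLAIN = re.compile(r"^([^:]+):(\d+)\.\.(\d+)$")
# chromosome:complement(start..end) on the reverse strand
COMPLEMENT = re.compile(r"^([^:]+):complement\((\d+)\.\.(\d+)\)$")


def parse_location(text: str) -> dict:
    """Parse a 'chromosome:start..end' style location into its parts."""
    for pattern, strand in ((PLAIN, "+"), (COMPLEMENT, "-")):
        match = pattern.match(text.strip())
        if match:
            chromosome, start, end = match.groups()
            return {
                "chromosome": chromosome,
                "start": int(start),
                "end": int(end),
                "strand": strand,
            }
    raise ValueError(f"unrecognised location: {text!r}")


print(parse_location("I:239730..240571"))
```
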

@ValWood
Member

ValWood commented Mar 1, 2023

I would drop them if they expressed some region outside of an annotated feature.

@kimrutherford
Member

Thanks for all the fixes. I ran another test load. There are only 2 missing gene IDs now, SPNCRNA.7 and SPNCRNA.87.

There are 8 lines that don't load because of that:

 Gene systematic ID │ FYPO ID      │ Allele description │ Expression │ Parental strain   │ Allele type │ Evidence    │ Condition          │ Severity           │ Reference     │ taxon │ Date       │ Ploidy  │ Allele variant                
 SPNCRNA.7          │ FYPO:0000095 │ wild type          │ overexpres…│ 972 h- or 968 h90 │ wild type   │ ECO:0001563 │ FYECO:0000277(20 µ…│ fitness_log2(-0.18…│ PMID:34984977 │ 4896  │ 2023-02-23 │ haploid │                               
 SPNCRNA.87         │ FYPO:0000799 │ deletion           │ null       │ 972 h- or 968 h90 │ deletion    │ ECO:0001563 │ FYECO:0000137,FYEC…│ fitness_log2(-0.10…│ PMID:34984977 │ 4896  │ 2023-02-23 │ haploid │ I:g.3177948_3178309del        
 SPNCRNA.87         │ FYPO:0001034 │ deletion           │ null       │ 972 h- or 968 h90 │ deletion    │ ECO:0001563 │ FYECO:0000137,FYEC…│ fitness_log2(0.098…│ PMID:34984977 │ 4896  │ 2023-02-23 │ haploid │ I:g.3177948_3178309del        
 SPNCRNA.87         │ FYPO:0000115 │ deletion           │ null       │ 972 h- or 968 h90 │ deletion    │ ECO:0001563 │ FYECO:0000137,FYEC…│ fitness_log2(-0.08…│ PMID:34984977 │ 4896  │ 2023-02-23 │ haploid │ I:g.3177948_3178309del        
 SPNCRNA.87         │ FYPO:0009027 │ wild type          │ overexpres…│ 972 h- or 968 h90 │ wild type   │ ECO:0001563 │ FYECO:0000296(25 m…│ fitness_log2(0.183)│ PMID:34984977 │ 4896  │ 2023-02-23 │ haploid │ I:complement(3177948..3178309)
 SPNCRNA.87         │ FYPO:0009028 │ wild type          │ overexpres…│ 972 h- or 968 h90 │ wild type   │ ECO:0001563 │ FYECO:0000246(26 m…│ fitness_log2(0.12) │ PMID:34984977 │ 4896  │ 2023-02-23 │ haploid │ I:complement(3177948..3178309)
 SPNCRNA.87         │ FYPO:0007808 │ wild type          │ overexpres…│ 972 h- or 968 h90 │ wild type   │ ECO:0001563 │ FYECO:0000413(3 mM…│ fitness_log2(0.172)│ PMID:34984977 │ 4896  │ 2023-02-23 │ haploid │ I:complement(3177948..3178309)
 SPNCRNA.87         │ FYPO:0007808 │ wild type          │ overexpres…│ 972 h- or 968 h90 │ wild type   │ ECO:0001563 │ FYECO:0000413(5 mM…│ fitness_log2(0.226)│ PMID:34984977 │ 4896  │ 2023-02-23 │ haploid │ I:complement(3177948..3178309)

@manulera
Contributor Author

manulera commented Mar 2, 2023

Hi @kimrutherford I will fix SPNCRNA.7 (most likely means SPNCRNA.07 since it only appears as such in one of the datasets). The other one should be the systematic id with last changes made by Val. Perhaps it did not update yet?

@kimrutherford
Member

Perhaps it did not update yet?

Sorry, my fault. I wasn't using the latest Chado. Now that I am there is just the one warning about SPNCRNA.7

I will fix SPNCRNA.7

Thanks. Once that's done the file is probably ready for loading?

To make that happen we'll need to copy the file to this directory in Subversion:
pombe-embl/external_data/phaf_files/chado_load/htp_phafs/ with the file name PMID_34984977_phaf.tsv
(All files in that directory are loaded into Chado)

We'll need three extra lines at the top of the file, in this format:

#Submitter_name: Val Wood
#Submitter_ORCID: 0000-0001-6330-7526
#Submitter_status: PomBase

but with you as the Submitter. The issue about that header is here: pombase/pombase-chado#886
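
Adding the three header lines can be scripted. The following is a minimal sketch, not an existing PomBase tool: `add_submitter_header` is a hypothetical helper, and the name and ORCID in the example are placeholders to be replaced with the real submitter details.

```python
def add_submitter_header(phaf_text: str, name: str, orcid: str, status: str) -> str:
    """Prepend the three '#Submitter_*' lines to the contents of a PHAF file."""
    header = (
        f"#Submitter_name: {name}\n"
        f"#Submitter_ORCID: {orcid}\n"
        f"#Submitter_status: {status}\n"
    )
    return header + phaf_text


# Placeholder submitter details and a made-up one-line file body.
body = "Gene systematic ID\tFYPO ID\n"
print(add_submitter_header(body, "Example Submitter", "0000-0000-0000-0000", "PomBase"))
```
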

@manulera
Contributor Author

manulera commented Mar 2, 2023

Ok, all done! I added it to the svn, closing.

@kimrutherford
Member

Ok, all done! I added it to the svn, closing.

Excellent. Thanks.

The nightly load worked: https://www.pombase.org/reference/PMID:34984977

@ValWood
Member

ValWood commented Mar 3, 2023

fab! @manulera can you do an announcement and news item. I'll tweet it later.

We can also put the session through as "curated"

@kimrutherford, these annotations will display in Canto. We will probably never visit this session in Canto, but for sessions like this, where there is a mixture of HTP and LTP annotations, we should make viewing the HTP annotations optional.

@kimrutherford
Member

we should make viewing the HTP annotations optional.

Would a message like this be OK:

This publication has NNNN existing annotations from high throughput experiments.  These are
not displayed in Canto.
Please visit the publication page for PMID:1234567 at pombase.org to view them.

and make "publication page for PMID:1234567" link to the publication page?

(Also with better wording)

@ValWood
Member

ValWood commented Mar 3, 2023

That would be a good solution.

@kimrutherford
Member

OK, I've made an issue: pombase/canto#2702

@kimrutherford
Member

Hi @manulera

Sorry to come back to this issue, but I noticed a small problem with the conditions field on two lines. The condition column starts with "," so I'm wondering: is this a typo, or did a condition go missing? Could you check? Thanks!

This doesn't cause a loading problem. I only noticed because of two warnings here:
https://curation.pombase.org/dumps/builds/pombase-build-2023-03-04/logs/log.2023-03-03-21-31-41.web-json-write

ignoring condition that isn't a term ID "" (from annotation of PomBase-genotype-15469 with FYPO:0001355)
ignoring condition that isn't a term ID "" (from annotation of PomBase-genotype-15412 with FYPO:0004557)

Here are the two lines that cause the warnings:

Gene systematic ID      FYPO ID Allele description      Expression      Parental strain Background strain name  Background genotype description Gene name       Allele name     Allele synonym  Allele type     Evidence        Condition       Penetrance      Severity        Extension       Reference       taxon   Date    Ploidy  Allele variant
SPNCRNA.31      FYPO:0001355    wild type       overexpression  972 h-                                          wild type       ECO:0001563     ,FYECO:0000005(32)              fitness_log2(-0.11)           PMID:34984977   4896    2023-02-23      haploid I:2975608..2976007
SPNCRNA.382     FYPO:0004557    wild type       overexpression  972 h-                                          wild type       ECO:0001563     ,FYECO:0000005(25)              fitness_log2(0.066)           PMID:34984977   4896    2023-02-23      haploid II:1963882..1964560
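
A check like the one below could catch this kind of problem before loading. It is a sketch under assumptions, not the loader's actual validation: `bad_conditions` is a hypothetical name, condition IDs are assumed to look like `FYECO:` followed by seven digits with an optional value in parentheses, and values are assumed not to contain commas. An entirely empty cell is also reported.

```python
import re

# Assumed shape of one condition entry, e.g. "FYECO:0000005(32)" or "FYECO:0000137".
CONDITION_RE = re.compile(r"^FYECO:\d{7}(\(.*\))?$")


def bad_conditions(cell: str) -> list:
    """Return the comma-separated entries of a Condition cell that are not valid."""
    return [
        entry
        for entry in cell.split(",")
        if not CONDITION_RE.match(entry.strip())
    ]


print(bad_conditions(",FYECO:0000005(32)"))
print(bad_conditions("FYECO:0000254(5 g/L),FYECO:0000005(32)"))
```

For the first cell, which starts with a comma, the function reports one empty entry; the second cell has nothing to report.
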

@manulera
Contributor Author

manulera commented Mar 7, 2023

Should be fixed now
