HTP paper data #3465

manulera · 2023-02-24T16:37:09Z

@ValWood @kimrutherford

Here is the file with the HTP data that I think is ready for PomBase.

https://github.com/manulera/phenomics_paper_HTP/blob/master/results/pombase_dataset.tsv

I have added some extra columns that we can remove, but I think they could be interesting to request going forward:

Temperature:
Chemical or agent: Chemicals or agents used in the experiment, pipe | separated. For chemicals, CHEBI ids, for other agents some brief description, e.g. UV
Chemical or agent dose: Dose of the chemical / agent with units when relevant
Phenotype score: Typically there is some measure of how strong the phenotype is. In Jurg's paper they used the fold change of a measured quantity (cell length / percentage of binucleated cells), or the log2 of the median fitness.
Phenotype score units: the units of the score, for Jurg's paper I used median_fitness_log2 and fold_change.


kimrutherford · 2023-02-25T10:34:10Z

Thanks @manulera

I'll try to load this soon. I noticed that there are some rows where the "Gene systematic ID" is just "972"?

Gene systematic ID	FYPO ID	Allele description	Expression	Parental strain	Allele type	Evidence	Condition	Reference	taxon	Date	Ploidy	Temperature	Chemical or agent	Chemical or agent dose	Phenotype score	Phenotype score units
972	FYPO:0000067	deletion	null	972 h-	deletion	ECO:0001563	FYECO:0000137,FYECO:0000319	PMID:34984977	4896	2023-02-23	haploid	32	CHEBI:48080	60uM	0.213	median_fitness_log2
972	FYPO:0005266	deletion	null	972 h-	deletion	ECO:0001563	FYECO:0000137,FYECO:0000167	PMID:34984977	4896	2023-02-23	haploid	32	CHEBI:8984	0.02% (w/v)	0.077	median_fitness_log2
972	FYPO:0000797	deletion	null	972 h-	deletion	ECO:0001563	FYECO:0000137,FYECO:0000304	PMID:34984977	4896	2023-02-23	haploid	32	CHEBI:64090	15mM	-0.247	median_fitness_log2
972	FYPO:0000089	deletion	null	972 h-	deletion	ECO:0001563	FYECO:0000334,FYECO:0000138,FYECO:0000211	PMID:34984977	4896	2023-02-23	haploid	32	CHEBI:25255		-0.239	median_fitness_log2
972	FYPO:0000087	deletion	null	972 h-	deletion	ECO:0001563	FYECO:0000137,FYECO:0000334,FYECO:0000078	PMID:34984977	4896	2023-02-23	haploid	32	CHEBI:16240	10mM	-0.064	median_fitness_log2
972	FYPO:0000089	deletion	null	972 h-	deletion	ECO:0001563	FYECO:0000137,FYECO:0000334,FYECO:0000211	PMID:34984977	4896	2023-02-23	haploid	32	CHEBI:25255	0.075% (v/v)	-0.199	median_fitness_log2
972	FYPO:0000797	deletion	null	972 h-	deletion	ECO:0001563	FYECO:0000137,FYECO:0000334,FYECO:0000304	PMID:34984977	4896	2023-02-23	haploid	32	CHEBI:64090	1.5mM	-0.323	median_fitness_log2
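
A minimal sketch of how such rows could be separated from the real gene rows before loading. It assumes the file is plain tab-separated text with the header shown above; the function name `split_control_rows` and the three-column sample are hypothetical and only for illustration, not part of any PomBase tool.

```python
import csv
import io


def split_control_rows(tsv_text, control_ids=("972",)):
    """Split PHAF-style rows into (gene rows, control rows).

    A row counts as a control when its "Gene systematic ID" column holds
    a strain name such as "972" instead of a gene ID.
    """
    reader = csv.DictReader(
        io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE
    )
    genes, controls = [], []
    for row in reader:
        if row["Gene systematic ID"].strip() in control_ids:
            controls.append(row)
        else:
            genes.append(row)
    return genes, controls


# Illustrative input with only three of the columns.
sample = (
    "Gene systematic ID\tFYPO ID\tPhenotype score\n"
    "972\tFYPO:0000067\t0.213\n"
    "SPNCRNA.628\tFYPO:0009051\t0.0931\n"
)
genes, controls = split_control_rows(sample)
print(len(genes), "gene rows,", len(controls), "control rows")
```

Keeping the control rows in a separate list, rather than dropping them silently, makes it easier to check why they were in the file in the first place.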

manulera · 2023-02-25T11:21:34Z

Hi Kim, yes, those are definitely wrong: they are controls. I will have to double-check why they are there in the first place. I will sort it out on Monday.

kimrutherford · 2023-02-27T01:30:49Z

I will sort it out on Monday.

Thanks!

There are some other IDs that aren't in Chado. Most of the IDs are synonyms of current genes. Here's the list, with the current ID in brackets:

SPNCRNA.01
SPNCRNA.1341 (SPNCRNA.103)
SPNCRNA.144 (SPNCRNA.627)
SPNCRNA.145 (SPNCRNA.628)
SPNCRNA.236 (SPNCRNA.941)
SPNCRNA.248
SPNCRNA.289 (SPNCRNA.1303)
SPNCRNA.410 (SPNCRNA.1559)
SPNCRNA.426 (SPNCRNA.1624)
SPNCRNA.46 (SPAC144.19)
SPNCRNA.507 (SPNCRNA.808)
SPNCRNA.515 (SPNCRNA.1651)
SPNCRNA.538 (SPNCRNA.1501)
SPNCRNA.539 (SPNCRNA.1533)
SPNCRNA.59 (SPNCRNA.817)
SPNCRNA.7
SPNCRNA.87
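
As a rough sketch, the list above could be turned into a lookup table. The names `OBSOLETE_TO_CURRENT`, `NO_CURRENT_ID` and `resolve_gene_id` are hypothetical. The sketch only substitutes IDs; it does not check whether the region deleted in the screen matches the current feature, so a direct substitution may not be appropriate for every entry.

```python
# Obsolete ID -> current ID, copied from the list above.
OBSOLETE_TO_CURRENT = {
    "SPNCRNA.1341": "SPNCRNA.103",
    "SPNCRNA.144": "SPNCRNA.627",
    "SPNCRNA.145": "SPNCRNA.628",
    "SPNCRNA.236": "SPNCRNA.941",
    "SPNCRNA.289": "SPNCRNA.1303",
    "SPNCRNA.410": "SPNCRNA.1559",
    "SPNCRNA.426": "SPNCRNA.1624",
    "SPNCRNA.46": "SPAC144.19",
    "SPNCRNA.507": "SPNCRNA.808",
    "SPNCRNA.515": "SPNCRNA.1651",
    "SPNCRNA.538": "SPNCRNA.1501",
    "SPNCRNA.539": "SPNCRNA.1533",
    "SPNCRNA.59": "SPNCRNA.817",
}

# IDs listed above without a current ID in brackets.
NO_CURRENT_ID = {"SPNCRNA.01", "SPNCRNA.248", "SPNCRNA.7", "SPNCRNA.87"}


def resolve_gene_id(gene_id):
    """Return the current ID, or None when the list gives no current ID.

    IDs that are not in either collection are returned unchanged.
    """
    if gene_id in NO_CURRENT_ID:
        return None
    return OBSOLETE_TO_CURRENT.get(gene_id, gene_id)


print(resolve_gene_id("SPNCRNA.46"))
print(resolve_gene_id("SPNCRNA.7"))
```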


kimrutherford · 2023-02-27T03:20:04Z

I'll try to load this soon.

I've done that now. It all looks good. The only warnings were for the unknown gene IDs and there are some FYPO terms which aren't in Chado yet (IDs from FYPO:0009007 to FYPO:0009054). All the other columns and formatting are correct.

I have added some extra columns that we can remove, but I think they could be interesting to request going forward:

I've changed the PHAF file loader to accept the extra columns. Currently they aren't stored in Chado, but they could be if needed.

Do we want to show any of the values from the extra columns on the website?

kimrutherford · 2023-02-27T07:37:55Z

I've loaded the data file locally as a test. 3608 annotations loaded successfully (of 5073):
https://desktop.kmr.nz/reference/PMID:34984977

ValWood · 2023-02-27T08:29:37Z

SPNCRNA.01 (3'UTR of eta2) #3333
SPNCRNA.248 #3410: all I recorded was "warning, this transcript could be on either strand", so I need to dig out the coordinates to see if there is a feature on the other strand. I didn't map the IDs over because it would be confusing; it isn't the same feature.
SPNCRNA.7: I can't locate this one on the tracker, will need the coordinates to find the history.
SPNCRNA.87 #3410: transcript is on the opposite strand (but I did not record the ID, need to find coordinates).

ValWood · 2023-02-27T08:30:52Z

I've loaded the data file locally as a test. 3608 annotations loaded successfully (of 5073)

I thought there were ~50,000 annotations?

kimrutherford · 2023-02-27T10:25:42Z

I thought there were ~50,000 annotations?

Definitely 5073 in this data set.

kimrutherford · 2023-02-27T10:36:29Z

I thought there were ~50,000 annotations?

That's quite a lot more than our current biggest publication (Marguerat et al. with 19906). It's going to be slow to load the new publication page so I might have to do some work on that.

manulera · 2023-02-27T10:38:33Z

I am doing some extra checks. I will comment in a sec

ValWood · 2023-02-27T10:57:23Z

yes, the cell cycle and deletions publication pages have always been a bit of a problem loading too.

ValWood · 2023-02-27T10:59:08Z

Do we want to show any of the values from the extra columns on the website?
I think it would be useful. We can discuss where and how.

manulera · 2023-02-27T12:15:08Z

Hi @kimrutherford and @ValWood

I have removed the 972 rows. They appeared as hits even though they were controls, I guess to check whether the condition affects growth in the wild type itself.

I had a look at the missing systematic ids and made this table:

https://github.com/manulera/phenomics_paper_HTP/blob/master/results/ncRNA_table_missing.tsv

current_synonym: Synonym of the obsoleted systematic id.
coordinates: The coordinates that the ncRNA had at the time, and therefore the genome region that was deleted in the screen.
synonym_already_present: True if the synonym already exists in the screen as a separate entry. E.g. SPNCRNA.1341 is an obsoleted systematic id and currently SPNCRNA.103 is a synonym. However, SPNCRNA.1341 has different coordinates and was also deleted in the screen.
synonym_coordinates: Empty if no synonym exists, or if the coordinates of the synonym match the coordinates that were deleted in the screen. In other words, it has a value if what was deleted does not match the current coordinates, even if the obsoleted id is a synonym of an existing id.

Similar case for SPNCRNA.388, where the coordinates have been updated and what they deleted (coordinates) does not match what we currently list as SPNCRNA.388 (current_coordinates).

https://github.com/manulera/phenomics_paper_HTP/blob/master/results/ncRNA_table_differing_coordinates.tsv

What to do

We can load the synonyms where coordinates match and where the synonym was not present (SPNCRNA.817, SPAC144.19, SPNCRNA.316).

Not sure what to do with:

Those where the synonym id has different coordinates and was separately deleted.
Those where the synonym id was not listed, but has different coordinates.
Those without a current synonym (entirely obsoleted)
SPNCRNA.388, for which current coordinates do not match what was deleted.
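
The first three cases above, plus the loadable one, can be sketched as a small classifier over the rows of ncRNA_table_missing.tsv (the SPNCRNA.388 case comes from the other table and is not covered). This is only a sketch: the function name and category labels are hypothetical, and it assumes that `synonym_already_present` is stored as the text `True` / `False` and that missing values are empty strings.

```python
def classify_missing_id(row):
    """Assign a row of the missing-ID table to one of the cases above."""
    if not row["current_synonym"].strip():
        # Entirely obsoleted: no current gene carries this ID as a synonym.
        return "no_current_synonym"
    if row["synonym_already_present"].strip().lower() == "true":
        # The current gene was also deleted separately in the screen.
        return "synonym_deleted_separately"
    if row["synonym_coordinates"].strip():
        # Synonym not in the screen, but the deleted region differs.
        return "coordinates_differ"
    # Coordinates match and the synonym is not otherwise present.
    return "load_as_synonym"


# Example row; the coordinates value is a placeholder, not real data.
example = {
    "current_synonym": "SPNCRNA.103",
    "synonym_already_present": "True",
    "synonym_coordinates": "PLACEHOLDER",
}
print(classify_missing_id(example))
```

Only rows classified as `load_as_synonym` would be safe to remap automatically; the others need a curation decision.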

Once we decide what to do, I can update the file for loading.


kimrutherford · 2023-02-28T04:22:45Z

I've loaded the data file locally as a test. 3608 annotations loaded successfully (of 5073)

Now that all the new FYPO terms are available, 4929 out of 5073 now load OK:
https://desktop.kmr.nz/reference/PMID:34984977

The remaining annotations aren't loading because of the gene ID issues that we're working on.

ValWood · 2023-02-28T09:00:48Z

For the untraceables:

SPNCRNA.01 was the last 1101 bases of the eta2 transcription unit, entirely contained within the UTR.
The 3’UTR feature
3003767..3004916
has this note
/note="SPNCRNA.01/prl101 part of the UTR of eta2"
So, this should be described as a nucleotide mutation of SPAC31G5.10 (allele synonym can be SPNCRNA.01delta).
How can we make "SPNCRNA.01/prl101 part of the UTR of eta2" show up on the history page at some point? This is the info I would have put in the comment here https://www.pombase.org/status/gene-coordinate-changes if we still maintained it.

SPNCRNA.248 was merged into SPNCRNA.996
Although SPNCRNA.248 was 78 bases longer at the N-term (SPNCRNA.996 is 1611 nt in total), it's clearly the same feature; there is no evidence in the transcriptome viewer for a separate feature.
SPNCRNA.248 was probably one of the features that could be on either strand that I was trying to resolve and was probably originally derived from the NCRNA.995 on the opposite strand, so it's additionally complicated.
Anyway, I added SPNCRNA.248 as an alternative ID of SPNCRNA.996 because this is the strand it was annotated on.
This allele could be described as a partial deletion of SPNCRNA.996.

SPNCRNA.87 is the same as SPNCRNA.893 (SPNCRNA.87 has the same C term but was 17 bp longer at the N-term). I added SPNCRNA.87 back as the systematic ID and made SPNCRNA.893 a synonym (the earlier name should take precedence; I don't know why I did not do that).

ValWood · 2023-02-28T09:05:05Z

For the ones with slightly different coordinates like

SPNCRNA.1341 has different coordinates and was also deleted in the screen.

Could we describe these as alleles of the current gene, but representing the deleted region (so they would have deleted slightly more, or slightly less, than the current region)? I usually took the later coordinates because they would likely be more accurate than the earlier coordinates, which came from microarrays, greedy algorithms to assemble short reads, etc.
The allele name could be synonym-delta.
Would that work?


manulera · 2023-03-01T12:14:18Z

This is the info I would have put in the comment here https://www.pombase.org/status/gene-coordinate-changes if we still maintained it.

You can add these kinds of comments in the equivalent file of the new repo and they will be visible in the website.
https://github.com/pombase/genome_changelog/blob/master/gene_changes_comments_and_pmids/gene-coordinate-change-data.tsv

manulera · 2023-03-01T13:08:07Z

@ValWood I already added it, you can see the line in this commit:
pombase/genome_changelog@136bfda#diff-3c2606074897cee718b203b19ba7ed05475bcbdc2b1af9b3d5c9959c02c59b90

manulera · 2023-03-01T15:00:20Z

Hi @ValWood I realised now that the thing is even more complicated for the overexpression alleles, where they introduced the promoters at the "wrong place", so they have expressed some chunk of the RNA, or something that is in principle outside of the RNA... Should we just drop those phenotypes? 72 annotations concerned

manulera · 2023-03-01T17:10:09Z

Hi @ValWood , I decided to record such cases like this (example for SPNCRNA.628). The allele variant and the description are just a feature location of the fragment that was cloned into the plasmid. I think it's probably the simplest / safest solution.

Gene systematic ID	FYPO ID	Allele description	Expression	Parental strain	Background strain name	Background genotype description	Gene name	Allele name	Allele synonym	Allele type	Evidence	Condition	Penetrance	Severity	Extension	Reference	taxon	Date	Ploidy	Allele variant
SPNCRNA.628	FYPO:0009051	I:239730..240571	overexpression	972 h- or 968 h90					SPNCRNA.145OE	other	ECO:0001563	FYECO:0000254(5 g/L),FYECO:0000005(32)		fitness_log2(0.0931)		PMID:34984977	4896	23/02/2023	haploid	I:239730..240571
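
For readers who want to pull the values back out of this layout, here is a hedged sketch of parsing the `term(value)` pattern used in the Condition and Severity columns of the row above. The function names are hypothetical, and the sketch assumes that the text inside the parentheses never contains a comma.

```python
import re

# A term, optionally followed by a bracketed detail,
# e.g. "FYECO:0000005(32)" or "fitness_log2(0.0931)".
_TERM_WITH_DETAIL = re.compile(r"^([^()]+?)(?:\((.*)\))?$")


def parse_term(text):
    """Split 'term(detail)' into (term, detail); detail is None if absent."""
    match = _TERM_WITH_DETAIL.match(text.strip())
    if match is None:
        raise ValueError(f"not a term: {text!r}")
    return match.group(1), match.group(2)


def parse_conditions(field):
    """Parse a comma-separated Condition field into (term, detail) pairs."""
    return [parse_term(part) for part in field.split(",") if part.strip()]


print(parse_conditions("FYECO:0000254(5 g/L),FYECO:0000005(32)"))
print(parse_term("fitness_log2(0.0931)"))
```

Because the detail group is greedy, a nested bracket such as `0.02% (w/v)` inside the outer parentheses is kept whole.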

ValWood · 2023-03-01T18:21:41Z

I would drop them if they expressed some region outside of an annotated feature.

kimrutherford · 2023-03-02T06:38:45Z

Thanks for all the fixes. I ran another test load. There are only 2 missing gene IDs now, SPNCRNA.7 and SPNCRNA.87.

There are 8 lines that don't load because of that:

 Gene systematic ID │ FYPO ID      │ Allele description │ Expression │ Parental strain   │ Allele type │ Evidence    │ Condition          │ Severity           │ Reference     │ taxon │ Date       │ Ploidy  │ Allele variant                
 SPNCRNA.7          │ FYPO:0000095 │ wild type          │ overexpres…│ 972 h- or 968 h90 │ wild type   │ ECO:0001563 │ FYECO:0000277(20 µ…│ fitness_log2(-0.18…│ PMID:34984977 │ 4896  │ 2023-02-23 │ haploid │                               
 SPNCRNA.87         │ FYPO:0000799 │ deletion           │ null       │ 972 h- or 968 h90 │ deletion    │ ECO:0001563 │ FYECO:0000137,FYEC…│ fitness_log2(-0.10…│ PMID:34984977 │ 4896  │ 2023-02-23 │ haploid │ I:g.3177948_3178309del        
 SPNCRNA.87         │ FYPO:0001034 │ deletion           │ null       │ 972 h- or 968 h90 │ deletion    │ ECO:0001563 │ FYECO:0000137,FYEC…│ fitness_log2(0.098…│ PMID:34984977 │ 4896  │ 2023-02-23 │ haploid │ I:g.3177948_3178309del        
 SPNCRNA.87         │ FYPO:0000115 │ deletion           │ null       │ 972 h- or 968 h90 │ deletion    │ ECO:0001563 │ FYECO:0000137,FYEC…│ fitness_log2(-0.08…│ PMID:34984977 │ 4896  │ 2023-02-23 │ haploid │ I:g.3177948_3178309del        
 SPNCRNA.87         │ FYPO:0009027 │ wild type          │ overexpres…│ 972 h- or 968 h90 │ wild type   │ ECO:0001563 │ FYECO:0000296(25 m…│ fitness_log2(0.183)│ PMID:34984977 │ 4896  │ 2023-02-23 │ haploid │ I:complement(3177948..3178309)
 SPNCRNA.87         │ FYPO:0009028 │ wild type          │ overexpres…│ 972 h- or 968 h90 │ wild type   │ ECO:0001563 │ FYECO:0000246(26 m…│ fitness_log2(0.12) │ PMID:34984977 │ 4896  │ 2023-02-23 │ haploid │ I:complement(3177948..3178309)
 SPNCRNA.87         │ FYPO:0007808 │ wild type          │ overexpres…│ 972 h- or 968 h90 │ wild type   │ ECO:0001563 │ FYECO:0000413(3 mM…│ fitness_log2(0.172)│ PMID:34984977 │ 4896  │ 2023-02-23 │ haploid │ I:complement(3177948..3178309)
 SPNCRNA.87         │ FYPO:0007808 │ wild type          │ overexpres…│ 972 h- or 968 h90 │ wild type   │ ECO:0001563 │ FYECO:0000413(5 mM…│ fitness_log2(0.226)│ PMID:34984977 │ 4896  │ 2023-02-23 │ haploid │ I:complement(3177948..3178309)

manulera · 2023-03-02T09:44:11Z

Hi @kimrutherford I will fix SPNCRNA.7 (most likely means SPNCRNA.07 since it only appears as such in one of the datasets). The other one should be the systematic id with last changes made by Val. Perhaps it did not update yet?

kimrutherford · 2023-03-02T10:24:46Z

Perhaps it did not update yet?

Sorry, my fault. I wasn't using the latest Chado. Now that I am there is just the one warning about SPNCRNA.7

I will fix SPNCRNA.7

Thanks. Once that's done the file is probably ready for loading?

To make that happen we'll need to copy the file to this directory in Subversion:
pombe-embl/external_data/phaf_files/chado_load/htp_phafs/ with the file name PMID_34984977_phaf.tsv
(All files in that directory are loaded into Chado)

We'll need three extra lines at the top of the file, in this format:

#Submitter_name: Val Wood
#Submitter_ORCID: 0000-0001-6330-7526
#Submitter_status: PomBase

but with you as the Submitter. The issue about that header is here: pombase/pombase-chado#886
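
A small sketch of adding those three lines in front of the existing file contents. The helper name `add_submitter_header` is hypothetical; the values shown are the ones from the example above and would be replaced with the actual submitter's details.

```python
def add_submitter_header(phaf_text, name, orcid, status):
    """Return the PHAF text with the three submitter lines prepended."""
    header = [
        f"#Submitter_name: {name}",
        f"#Submitter_ORCID: {orcid}",
        f"#Submitter_status: {status}",
    ]
    return "\n".join(header) + "\n" + phaf_text


# Illustrative body: just the first two column names of the file.
body = "Gene systematic ID\tFYPO ID\n"
with_header = add_submitter_header(
    body, "Val Wood", "0000-0001-6330-7526", "PomBase"
)
print(with_header)
```

The result would then be saved as PMID_34984977_phaf.tsv in the directory named above.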

manulera · 2023-03-02T11:19:14Z

Ok, all done! I added it to the svn, closing.

kimrutherford · 2023-03-03T05:20:59Z

Ok, all done! I added it to the svn, closing.

Excellent. Thanks.

The nightly load worked: https://www.pombase.org/reference/PMID:34984977

ValWood · 2023-03-03T06:44:54Z

fab! @manulera can you do an announcement and news item. I'll tweet it later.

We can also put the session through as "curated"

@kimrutherford, these annotations will display in Canto. We will probably never visit this session in Canto, but for sessions like this where there is a mixture of HTP and LTP we should make viewing the HTP annotations optional.

kimrutherford · 2023-03-03T06:52:53Z

we should make viewing the HTP annotations optional.

Would a message like this be OK:

This publication has NNNN existing annotations from high throughput experiments.  These are
not displayed in Canto.
Please visit the publication page for PMID:1234567 at pombase.org to view them.

and make "publication page for PMID:1234567" a link to the publication page?

(Also with better wording)

ValWood · 2023-03-03T08:27:56Z

That would be a good solution.

kimrutherford · 2023-03-03T08:37:17Z

OK, I've made an issue: pombase/canto#2702

kimrutherford · 2023-03-04T11:17:22Z

Hi @manulera

Sorry to come back to this issue but I noticed a small problem with the conditions field on two lines. The condition column starts with "," so I'm wondering: is this a typo or did a condition go missing? Could you check? Thanks!

This doesn't cause a loading problem. I only noticed because of two warnings here:
https://curation.pombase.org/dumps/builds/pombase-build-2023-03-04/logs/log.2023-03-03-21-31-41.web-json-write

ignoring condition that isn't a term ID "" (from annotation of PomBase-genotype-15469 with FYPO:0001355)
ignoring condition that isn't a term ID "" (from annotation of PomBase-genotype-15412 with FYPO:0004557)

Here are the two lines that cause the warnings:

Gene systematic ID      FYPO ID Allele description      Expression      Parental strain Background strain name  Background genotype description Gene name       Allele name     Allele synonym  Allele type     Evidence        Condition       Penetrance      Severity        Extension       Reference       taxon   Date    Ploidy  Allele variant
SPNCRNA.31      FYPO:0001355    wild type       overexpression  972 h-                                          wild type       ECO:0001563     ,FYECO:0000005(32)              fitness_log2(-0.11)           PMID:34984977   4896    2023-02-23      haploid I:2975608..2976007
SPNCRNA.382     FYPO:0004557    wild type       overexpression  972 h-                                          wild type       ECO:0001563     ,FYECO:0000005(25)              fitness_log2(0.066)           PMID:34984977   4896    2023-02-23      haploid II:1963882..1964560
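
A check along these lines could catch such rows before the file is committed. This is a sketch under assumptions: rows are already parsed into dictionaries keyed by column name, the function name is hypothetical, and a completely empty Condition field is treated as acceptable and not reported.

```python
def rows_with_empty_condition(rows):
    """Return (row number, gene ID) for rows with a blank condition entry.

    Row numbers start at 2 because line 1 of the file is the header.
    """
    flagged = []
    for number, row in enumerate(rows, start=2):
        field = row["Condition"].strip()
        # A leading, trailing or doubled comma leaves an empty entry.
        if field and any(not part.strip() for part in field.split(",")):
            flagged.append((number, row["Gene systematic ID"]))
    return flagged


rows = [
    {"Gene systematic ID": "SPNCRNA.31", "Condition": ",FYECO:0000005(32)"},
    {"Gene systematic ID": "SPNCRNA.628",
     "Condition": "FYECO:0000254(5 g/L),FYECO:0000005(32)"},
]
print(rows_with_empty_condition(rows))
```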

manulera · 2023-03-07T13:53:32Z

Should be fixed now

kimrutherford self-assigned this Feb 25, 2023

kimrutherford added a commit to pombase/pombase-chado that referenced this issue Feb 27, 2023

Accept extra PHAF columns for chemical and temp

f506aed

And for phenotype score Refs pombase/curation#3465

manulera mentioned this issue Feb 27, 2023

Deleted genes still present in exon dataset #3467

Closed

kimrutherford added a commit to pombase/pombase-legacy that referenced this issue Feb 27, 2023

Add ECO codes need for new HTP phenotype dataset

1d478b6

Refs pombase/curation#3465

manulera mentioned this issue Feb 28, 2023

Add new PHAF columns pombase/pombase-chado#1066

Closed

kimrutherford added a commit to pombase/pombase-chado that referenced this issue Feb 28, 2023

Handle PHAF files with severity and cond. changes

32db3d3

Refs #1066 Refs pombase/curation#3465

manulera closed this as completed Mar 2, 2023

manulera mentioned this issue Mar 2, 2023

Should we allow specific temperatures in Canto? #3362

Closed

kimrutherford mentioned this issue Mar 3, 2023

Don't try to show HTP annotations in Canto pombase/canto#2702

Open

kimrutherford reopened this Mar 4, 2023

manulera closed this as completed in manulera/phenomics_paper_HTP@b299139 Mar 7, 2023
