HTP paper data #3465
Thanks @manulera I'll try to load this soon. I noticed that there are some rows where the "Gene systematic ID" is just "972"?
Hi Kim, yes, those are definitely wrong; they are controls. I will have to double-check why they are there in the first place. I will sort it out on Monday.
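For reference, here is a minimal sketch of how the control rows could be set aside before loading. It assumes the file is tab-separated and that the column is headed "Gene systematic ID" as quoted above; the function name, the second column and the example rows are made up for illustration and are not taken from the real dataset.

```python
import csv
import io

CONTROL_ID = "972"  # value seen in the "Gene systematic ID" column for control rows


def split_controls(rows, id_column="Gene systematic ID", control_id=CONTROL_ID):
    """Return (kept, controls): ordinary rows and rows labelled as the control."""
    kept, controls = [], []
    for row in rows:
        if row[id_column].strip() == control_id:
            controls.append(row)
        else:
            kept.append(row)
    return kept, controls


# Tiny made-up table standing in for the real TSV file (assumption).
example_tsv = (
    "Gene systematic ID\tFYPO ID\n"
    "SPNCRNA.87\tFYPO:0009007\n"
    "972\tFYPO:0009007\n"
)
rows = list(csv.DictReader(io.StringIO(example_tsv), delimiter="\t"))
kept, controls = split_controls(rows)
print(len(kept), "rows kept;", len(controls), "control rows set aside")
```

Keeping the control rows in a separate list, rather than discarding them silently, makes it easier to check afterwards why they were in the file.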
Thanks! There are some other IDs that aren't in Chado. Most of the IDs are synonyms of current genes. Here's the list, with the current ID in brackets: SPNCRNA.01
And for phenotype score Refs pombase/curation#3465
I've done that now. It all looks good. The only warnings were for the unknown gene IDs, and there are some FYPO terms which aren't in Chado yet (IDs from FYPO:0009007 to FYPO:0009054). All the other columns and formatting are correct.
I've changed the PHAF file loader to accept the extra columns. Currently they aren't stored in Chado, but they could be if needed. Do we want to show any of the values from the extra columns on the website?
I've loaded the data file locally as a test. 3608 annotations loaded successfully (of 5073):
SPNCRNA.01 (3'UTR of eta2) #3333
SPNCRNA.248 #3410 (all I recorded was "warning, this transcript could be on either strand", so I need to dig out the coordinates to see if there is a feature on the other strand). I didn't map the IDs over because it would be confusing; it isn't the same feature.
SPNCRNA.7 I can't locate this one on the tracker, will need the coordinates to find the history.
SPNCRNA.87 #3410 transcript is on opposite strand (but I did not record the ID, need to find coordinates)
I thought there were ~50,000 annotations?
Definitely 5073 in this data set.
That's quite a lot more than our current biggest publication (Marguerat et al. with 19906). It's going to be slow to load the new publication page so I might have to do some work on that.
I am doing some extra checks. I will comment in a sec.
Yes, the cell cycle and deletions publication pages have always been a bit of a problem to load too.
Hi @kimrutherford and @ValWood, I have removed the 972 ones. They appeared as hits even though they were controls, I guess to check whether the condition affects growth in the wild type itself. I had a look at the missing systematic IDs and made this table: https://github.com/manulera/phenomics_paper_HTP/blob/master/results/ncRNA_table_missing.tsv
Similar case for SPNCRNA.388, where the coordinates have been updated and what they deleted (
What to do
We can load the synonyms where coordinates match and where the synonym was not present (SPNCRNA.817, SPAC144.19, SPNCRNA.316). Not sure what to do with:
Once we decide what to do, I can update the file.
Now that all the new FYPO terms are available, 4929 out of 5073 now load OK:
The remaining annotations aren't loading because of the gene ID issues that we're working on.
For the untraceables:
SPNCRNA.01 was the last 1101 bases of the eta2 transcription unit, entirely contained within the UTR.
SPNCRNA.248 was merged into SPNCRNA.996.
SPNCRNA.87 is the same as SPNCRNA.893 (SPNCRNA.87 has the same C-term but was 17 bp longer at the N-term). I added SPNCRNA.87 back as the systematic ID and made SPNCRNA.893 a synonym (the earlier name should take precedence; I don’t know why I did not do that).
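To make the ID cases above concrete, here is a rough sketch of how a file's gene IDs could be sorted into current IDs, known synonyms and unknowns. This is not how the Chado loader does it; `resolve_gene_id` and the two small lookup tables are hypothetical and only contain cases mentioned in this thread. Merged features such as SPNCRNA.248 are deliberately left out of the synonym table, since they are not the same feature.

```python
def resolve_gene_id(gene_id, current_ids, synonyms):
    """Return (resolved_id, status); status is 'current', 'synonym' or 'unknown'."""
    if gene_id in current_ids:
        return gene_id, "current"
    if gene_id in synonyms:
        return synonyms[gene_id], "synonym"
    return None, "unknown"


# Illustrative lookup tables only (assumption), built from cases in this thread.
current_ids = {"SPNCRNA.87", "SPNCRNA.996"}
synonyms = {"SPNCRNA.893": "SPNCRNA.87"}

for gene_id in ["SPNCRNA.87", "SPNCRNA.893", "SPNCRNA.7"]:
    print(gene_id, resolve_gene_id(gene_id, current_ids, synonyms))
```

Returning a status alongside the ID means the unknown cases can be listed for manual review instead of being dropped.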
For the ones with slightly different coordinates like
could we describe these as alleles of the current gene, but representing the deleted region? (So they would have deleted slightly more, or slightly less, than the current region.) I usually took the later coordinates because they would likely be more accurate than the earlier coordinates, which were derived using microarrays and greedy assembly algorithms.
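As a sketch of what "slightly more, or slightly less" could mean in practice, the deleted region can be compared with the current feature coordinates. The function name and the coordinates below are made up; it assumes 1-based inclusive coordinates on the same chromosome and ignores strand, so "start" and "end" are genomic positions and not N-term/C-term.

```python
def compare_regions(deleted, current):
    """Compare two (start, end) regions given as 1-based inclusive coordinates.

    Returns how many bases the deleted region extends beyond the current
    feature on each side (a negative value means it falls short).
    """
    del_start, del_end = deleted
    cur_start, cur_end = current
    return {
        "extra_at_start": cur_start - del_start,
        "extra_at_end": del_end - cur_end,
    }


# Made-up coordinates: the deleted region starts 17 bases earlier, same end.
print(compare_regions(deleted=(1000, 2100), current=(1017, 2100)))
```

Both values being zero would mean the deleted region matches the current feature exactly, which is the case where loading the old ID as a plain synonym is safe.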
You can add these kinds of comments in the equivalent file of the new repo and they will be visible in the website.
@ValWood I already added it, you can see the line in this commit:
Hi @ValWood, I've now realised that things are even more complicated for the overexpression alleles, where they introduced the promoters at the "wrong place", so they have expressed some chunk of the RNA, or something that is in principle outside of the RNA... Should we just drop those phenotypes? 72 annotations are affected.
Hi @ValWood, I decided to record such cases like this (example for SPNCRNA.628). The allele variant and the description are just a feature location of the fragment that was cloned into the plasmid. I think it's probably the simplest / safest solution.
I would drop them if they expressed some region outside of an annotated feature.
Thanks for all the fixes. I ran another test load. There are only 2 missing gene IDs now. There are 8 lines that don't load because of that:
Hi @kimrutherford, I will fix SPNCRNA.7 (it most likely means SPNCRNA.07, since it only appears as such in one of the datasets). The other one should be the systematic ID with the last changes made by Val. Perhaps it has not updated yet?
Sorry, my fault. I wasn't using the latest Chado. Now that I am, there is just the one warning about
Thanks. Once that's done the file is probably ready for loading? To make that happen we'll need to copy it to this directory in Subversion:
We'll need three extra lines at the top of the file, in this format:
but with you as the Submitter. The issue about that header is here: pombase/pombase-chado#886
Ok, all done! I added it to the svn, closing.
Excellent. Thanks. The nightly load worked: https://www.pombase.org/reference/PMID:34984977
Fab! @manulera, can you do an announcement and news item? I'll tweet it later. We can also put the session through as "curated".
@kimrutherford, these annotations will display in Canto. We will probably never visit this session in Canto, but for sessions like this, where there is a mixture of HTP and LTP, we should make viewing the HTP annotations optional.
Would a message like this be OK:
and make (Also with better wording)
That would be a good solution.
OK, I've made an issue: pombase/canto#2702
Hi @manulera, sorry to come back to this issue, but I noticed a small problem with the conditions field on two lines. The condition column starts with ",", so I'm wondering: is this a typo, or did a condition go missing? Could you check? Thanks! This doesn't cause a loading problem. I only noticed because of two warnings here:
Here are the two lines that cause the warnings:
Should be fixed now.
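A quick check along these lines could catch the same problem before loading. It is only a sketch: `empty_condition_items` is a hypothetical helper, the condition values are placeholders, and it assumes the condition field is a comma-separated string.

```python
def empty_condition_items(conditions):
    """Return the positions of empty items in a comma-separated condition string."""
    items = [item.strip() for item in conditions.split(",")]
    return [i for i, item in enumerate(items) if item == ""]


# Placeholder values (assumption); real rows would carry condition term IDs.
print(empty_condition_items(",condition_a"))             # leading comma
print(empty_condition_items("condition_a,condition_b"))  # well-formed
```

Note that a completely empty field is also reported as one empty item, so blank condition cells would need to be handled separately if they are allowed.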
@ValWood @kimrutherford
Here is the file with the HTP data that I think is ready for PomBase.
https://github.com/manulera/phenomics_paper_HTP/blob/master/results/pombase_dataset.tsv
I have added some extra columns that we can remove, but I think they could be interesting to request going forward:
separated. For chemicals, CHEBI IDs; for other agents, some brief description, e.g. UV
median_fitness_log2 and fold_change.