xml parsing #1

Open
chrisprobert opened this issue May 9, 2015 · 5 comments

@chrisprobert
Owner

@aaroncp1an0 I've started some code for XML parsing, check the xml_parser directory. I'll aim to finish this up later today, and we should have a TSV file for sequence feature attributes soon.

@aaroncp1an0
Collaborator

@chrisprobert OK, cool. I'm finishing up my code to convert the XML file into several arrays and dictionaries of relevant information; it should be done by tomorrow.

@chrisprobert
Owner Author

@aaroncp1an0 take a look at /share/PI/pfordyce/uniprot_data/feature.description_counts_first_10k.txt and /share/PI/pfordyce/uniprot_data/feature.type_counts.txt. These are the counts of different feature-description terms and feature-type terms across the uniprot.xml file.

Also, /share/PI/pfordyce/uniprot_data/uniprot.xml.tsv is a tsv based on the uniprot xml file.


I've also posted those here:
https://github.com/chrisprobert/deep-psp/blob/master/data/feature.description_counts_first_10k.txt
https://github.com/chrisprobert/deep-psp/blob/master/data/feature.type_counts.txt
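
For reference, a minimal sketch of how a TSV like uniprot.xml.tsv could be derived from UniProt XML using only the standard library. This is not the code in xml_parser: the column layout, the function name `iter_feature_rows`, and the sample entry (including the accession `P00001`) are assumptions made for illustration. It relies on the UniProt XML namespace `http://uniprot.org/uniprot` and on `<feature>` elements carrying `type` and `description` attributes plus a `<location>` child.

```python
import io
import xml.etree.ElementTree as ET

NS = "{http://uniprot.org/uniprot}"  # namespace used by UniProt XML


def iter_feature_rows(source):
    """Yield (accession, type, description, begin, end) for each feature.

    `source` is a filename or a binary file object. Entries are streamed
    with iterparse so a large uniprot.xml need not fit in memory.
    """
    for _event, entry in ET.iterparse(source, events=("end",)):
        if entry.tag != NS + "entry":
            continue
        accession = entry.findtext(NS + "accession", default="")
        for feat in entry.iter(NS + "feature"):
            begin = end = ""
            loc = feat.find(NS + "location")
            if loc is not None:
                single = loc.find(NS + "position")
                if single is not None:
                    # Single-residue feature: begin and end coincide.
                    begin = end = single.get("position", "")
                else:
                    b = loc.find(NS + "begin")
                    e = loc.find(NS + "end")
                    begin = b.get("position", "") if b is not None else ""
                    end = e.get("position", "") if e is not None else ""
            yield (accession, feat.get("type", ""),
                   feat.get("description", ""), begin, end)
        entry.clear()  # drop the children of the entry just processed


# Made-up entry for illustration only.
SAMPLE = b"""<uniprot xmlns="http://uniprot.org/uniprot">
  <entry>
    <accession>P00001</accession>
    <feature type="chain" description="Example protein">
      <location><begin position="1"/><end position="502"/></location>
    </feature>
    <feature type="active site" description="Proton donor">
      <location><position position="425"/></location>
    </feature>
  </entry>
</uniprot>"""

rows = list(iter_feature_rows(io.BytesIO(SAMPLE)))
tsv = "\n".join("\t".join(row) for row in rows)
```

Streaming one entry at a time is a design choice for file size; for a small file, `ET.parse` followed by `findall` would work just as well.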


Some initial thoughts:

  • There are probably too many categories in the feature-description field for a tractable multi-class classification problem.
    • I wonder if there are ways to extract info from the descriptions? For example, combine all description fields that contain the word 'metal'?
  • The feature-type entries seem much better, but there aren't as many of them (there are only 33 unique ones).
  • As a first pass, maybe we should start with feature-type categories, since they are a much more tractable size and won't require further processing before we can use them. Later, we probably want a way to extract more categories/labels.
    • Within the feature DOM, I haven't looked anywhere outside the description or type attributes. Maybe you could check to see if there are other attributes we should be looking at?
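
The keyword idea above (combining all descriptions that contain a word such as 'metal') could be sketched roughly as follows. The helper names `count_values` and `merge_by_keyword` are hypothetical, the example descriptions are made up, and substring matching is only one possible grouping rule.

```python
from collections import Counter


def count_values(values):
    """Count occurrences of each distinct string."""
    return Counter(values)


def merge_by_keyword(description_counts, keywords):
    """Collapse descriptions containing a keyword into one category.

    Matching is case-insensitive and the first matching keyword wins.
    Descriptions matching no keyword keep their own category.
    """
    merged = Counter()
    for description, n in description_counts.items():
        lowered = description.lower()
        label = next((k for k in keywords if k in lowered), description)
        merged[label] += n
    return merged


# Made-up descriptions for illustration only.
descriptions = ["Metal-binding", "Divalent metal cation", "Proton donor",
                "Proton donor", "Metal cation-binding"]
counts = count_values(descriptions)
merged = merge_by_keyword(counts, ["metal"])
```

Here the three 'metal' descriptions collapse into a single `metal` category while `Proton donor` stays separate, which shrinks the label set without touching the feature-type field.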

@aaroncp1an0
Collaborator

I checked out your commits on GitHub. Nice work. It will take me a little time to get up and running with GitHub. In terms of the files you've generated, my data agrees with yours, although after processing 100k proteins I had 39 feature types rather than 33. Some of them were trivial, though, such as 'initiator methionine'.

Would you like me to upload my files and code to your GitHub repo, given that they are somewhat redundant with yours? I am storing the dictionaries and arrays using numpy dump.

My sample output is as follows:
Arrays: seq, feat, des, pos
Dictionaries: fdic, ddic

Sample output of array elements:
seq[0]
METMSDYSKEVSEALSALRGELSALSAAISNTVRAGSYSAPVAKDCKAGHCDSKAVLKSLSRSARDLDSAVEAVSSNCEWASSGYGKQIARALRDDAVRVKREVESTRDAVDVVTPSCCVQGLAEEAGKLSEMAAVYRCMATVFETADSHGVREMLAKVDGLKQTMSGFKRLLGKTAEIDGLSDSVIRLGRSIGEVLPATEGKAMRDLVKQCERLNGLVVDGSRKVEEQCSKLRDMASQSYVVADLASQYDVLGGKAQEALSASDALEQAAAVALRAKAAADAVAKSLDSLDVKKLDRLLEQASAVSGLLAKKNDLDAVVTSLAGLEALVAKKDELYKICAAVNSVDKSKLELLNVKPDRLKSLTEQTVVVSQMTTALATFNEDKLDSVLGKYMQMHRFLGMATQLKLMSDSLAEFQPAKMAQMAAAASQLKDFLTDQTVSRLEKVSAAVDATDVTKYASAFSDGGMVSDMTKAYETVKAFAAVVNSLDSKKLKLVAECAKK

feat[0] -> corresponding features in feature dictionary
[0, 2]

des[0] -> corresponding descriptions in description dictionary
[2, 3]

pos[0] -> start/stop positions for each feat+description pair
[['1', '502'], ['425', '428']]

Feature Dictionary
{'topological domain': 19, 'domain': 20, 'active site': 18, 'modified residue': 10, 'glycosylation site': 5, 'site': 9, 'propeptide': 31, 'non-consecutive residues': 11, 'turn': 27, 'chain': 0, 'sequence conflict': 7, 'signal peptide': 3, 'sequence variant': 13, 'nucleotide phosphate-binding region': 25, 'non-terminal residue': 12, 'helix': 14, 'disulfide bond': 22, 'initiator methionine': 8, 'transit peptide': 30}

Description Dictionary:
{'Napin-1A large chain': 127, 'NAD': 138, 'Loss of catalytic activity': 200, "5'-deoxynucleotidase yfbR": 212, '3-hydroxyanthranilate 3,4-dioxygenase': 161, 'Loss of activity': 44, 'Proton donor': 156, 'In dbSNP:rs1131215.': 71, 'In dbSNP:rs9269744.': 121, 'Muscarinic toxin 2': 181, 'In dbSNP:rs6211.': 144, 'Ig-like C1-type': 49, 'In isoform Short.': 29, 'Spectrin--actin-binding': 187, 'Probable nitronate monooxygenase': 122, 'Uncharacterized protein 123L': 13, 'Uncharacterized protein 234R': 82}
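
A layout like the one above (parallel arrays plus two lookup dictionaries) could be built along these lines. This is a sketch under assumptions, not the actual code: the function names `encode_proteins` and `decode_features` are hypothetical, the toy proteins are made up, and IDs are assigned in order of first appearance, which need not match the IDs shown in the sample output.

```python
def encode_proteins(proteins):
    """Encode proteins into parallel lists plus two lookup dictionaries.

    `proteins` is an iterable of (sequence, features), where each feature
    is (type, description, start, stop).
    Returns seq, feat, des, pos, fdic, ddic.
    """
    seq, feat, des, pos = [], [], [], []
    fdic, ddic = {}, {}
    for sequence, features in proteins:
        seq.append(sequence)
        f_ids, d_ids, spans = [], [], []
        for ftype, description, start, stop in features:
            # setdefault assigns the next unused integer on first sight.
            f_ids.append(fdic.setdefault(ftype, len(fdic)))
            d_ids.append(ddic.setdefault(description, len(ddic)))
            spans.append([start, stop])
        feat.append(f_ids)
        des.append(d_ids)
        pos.append(spans)
    return seq, feat, des, pos, fdic, ddic


def decode_features(i, feat, des, pos, fdic, ddic):
    """Recover (type, description, start, stop) tuples for protein i."""
    f_names = {v: k for k, v in fdic.items()}
    d_names = {v: k for k, v in ddic.items()}
    return [(f_names[f], d_names[d], s, e)
            for f, d, (s, e) in zip(feat[i], des[i], pos[i])]


# Toy input for illustration only; positions are kept as strings,
# as in the sample output above.
proteins = [
    ("MKTAYIAK", [("chain", "Example chain", "1", "8"),
                  ("helix", "", "2", "5")]),
    ("MSE", [("chain", "Other chain", "1", "3")]),
]
seq, feat, des, pos, fdic, ddic = encode_proteins(proteins)
first = decode_features(0, feat, des, pos, fdic, ddic)
```

Keeping `feat[i]`, `des[i]` and `pos[i]` index-aligned means any feature of protein `i` can be reconstructed from the three lists and the two dictionaries.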

@chrisprobert
Owner Author

This page documents the various feature annotations: http://www.uniprot.org/help/sequence_annotation

Ideally I think we'd like to download the 'structure' section, but I'm not sure how: http://www.uniprot.org/help/structure_section


Actually, it turns out that the feature section does contain secondary structure information. The 'helix', 'turn' and 'strand' values we found here are the same as those in the secondary structure section of the UniProtKB entries.
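
Since these features come with start/stop positions, they could be turned into a per-residue label string. The sketch below assumes 1-based inclusive positions; the letter codes (H, E, T, and '-' for unlabeled) and the function name `secondary_structure_labels` are choices made for illustration, not something taken from the data.

```python
# Letter codes are an assumption made for this example.
SS_CODES = {"helix": "H", "strand": "E", "turn": "T"}


def secondary_structure_labels(length, features):
    """Return one label character per residue.

    `features` is an iterable of (type, start, stop) with 1-based
    inclusive positions given as strings or ints. Feature types other
    than helix/strand/turn are ignored; later features overwrite
    earlier ones where they overlap.
    """
    labels = ["-"] * length
    for ftype, start, stop in features:
        code = SS_CODES.get(ftype)
        if code is None:
            continue
        first = max(int(start) - 1, 0)   # convert to 0-based index
        last = min(int(stop), length)    # exclusive upper bound
        for i in range(first, last):
            labels[i] = code
    return "".join(labels)


# Made-up features for a hypothetical 10-residue protein.
example = secondary_structure_labels(
    10, [("helix", "2", "5"), ("strand", "7", "8"), ("chain", "1", "10")]
)
```

A string like this lines up one-to-one with the sequence, which is the shape a tagging-style model would need.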

@aaroncp1an0
Collaborator

I checked that out, thank you. Interesting, but unfortunately they don't
provide much, haha.

I have a flexible outline for implementing the 'part of speech tagging' -
if you have time to meet briefly tomorrow let me know and I can
formalize/make depictions to put it into our document.
Alternatively I can do the same if you have outlined your ideas for
implementing a model.

I'm free most of the day.
-Aaron


