xml parsing #1

chrisprobert · 2015-05-09T17:38:19Z

@aaroncp1an0 I've started some code for XML parsing, check the xml_parser directory. I'll aim to finish this up later today, and we should have a TSV file for sequence feature attributes soon.

aaroncp1an0 · 2015-05-09T20:05:24Z

@chrisprobert Ok cool, I'm finishing up my code to convert the XML file into several arrays and dictionaries of relevant information; it should be done by tomorrow.

chrisprobert · 2015-05-09T23:36:43Z

@aaroncp1an0 take a look at /share/PI/pfordyce/uniprot_data/feature.description_counts_first_10k.txt and /share/PI/pfordyce/uniprot_data/feature.type_counts.txt. These are the counts of different feature-description terms and feature-type terms across the uniprot.xml file.

Also, /share/PI/pfordyce/uniprot_data/uniprot.xml.tsv is a tsv based on the uniprot xml file.

I've also posted those here:
https://github.com/chrisprobert/deep-psp/blob/master/data/feature.description_counts_first_10k.txt
https://github.com/chrisprobert/deep-psp/blob/master/data/feature.type_counts.txt
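
For reference, counts like these can be produced by streaming the XML. This is only a minimal sketch, not the code in the xml_parser directory: it assumes the standard UniProt XML namespace and `<feature type="..." description="...">` elements, and `count_feature_attrs` and the inline sample are hypothetical names and data made up for illustration.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter

# Assumption: the file uses the standard UniProt XML namespace.
NS = "{http://uniprot.org/uniprot}"

def count_feature_attrs(source):
    """Stream a UniProt-style XML source and tally feature type/description values."""
    type_counts, desc_counts = Counter(), Counter()
    for _, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == NS + "feature":
            type_counts[elem.get("type")] += 1
            desc = elem.get("description")
            if desc is not None:  # not every feature carries a description
                desc_counts[desc] += 1
        elif elem.tag == NS + "entry":
            elem.clear()  # drop the finished entry's children to limit memory use
    return type_counts, desc_counts

# Tiny made-up document, only to show the call; real input would be uniprot.xml.
SAMPLE = b"""<uniprot xmlns="http://uniprot.org/uniprot">
  <entry>
    <feature type="chain" description="Example protein">
      <location><begin position="1"/><end position="10"/></location>
    </feature>
    <feature type="helix">
      <location><begin position="2"/><end position="5"/></location>
    </feature>
  </entry>
</uniprot>"""

type_counts, desc_counts = count_feature_attrs(io.BytesIO(SAMPLE))
```

`ET.iterparse` also accepts a file path, so the same function should work on the full file without loading it all at once.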

Some initial thoughts:

- There are probably too many categories in the feature-description field to make a good multi-class classification problem.
  - I wonder if there are ways to extract info from the descriptions? For example, combine all description fields that contain the word 'metal'?
- The feature-type entries seem much better, but there aren't as many of them (there are only 33 unique ones).
- As a first pass, maybe we should start with feature-type categories, since they are a much more tractable size and won't require further processing before we can use them. Later, we probably want a way to extract more categories/labels.
  - Within the feature DOM, I haven't looked anywhere outside the description or type attributes. Maybe you could check to see if there are other attributes we should be looking at?
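
The 'metal' idea above could look roughly like the following. It is a sketch under assumptions: `collapse_descriptions` is a hypothetical helper, the matching is a plain case-insensitive substring test, and the example counts are invented rather than taken from the posted files.

```python
from collections import Counter

def collapse_descriptions(counts, keywords):
    """Merge description counts whose text contains a keyword (case-insensitive).

    The first matching keyword becomes the label; descriptions that match
    no keyword keep their original text as the label.
    """
    merged = Counter()
    for description, n in counts.items():
        lowered = description.lower()
        label = next((k for k in keywords if k in lowered), description)
        merged[label] += n
    return merged

# Invented counts for illustration only.
example = {"Metal-binding": 3, "Divalent metal cation": 2, "Proton donor": 5}
collapsed = collapse_descriptions(example, ["metal"])
```

Substring matching is crude (it would also catch unrelated words that happen to contain a keyword), so the keyword list would need checking against the real description counts.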

aaroncp1an0 · 2015-05-11T01:12:16Z

I checked out your commits on GitHub. Nice work. It will take me a little time to get up and running with GitHub. In terms of the files you've generated, my data agrees with yours, although after processing 100k proteins I had 39 feature types rather than 33, and some of them were trivial, such as 'initiator methionine'.

Would you prefer that I upload my files and code to your GitHub repo, given that they are somewhat redundant with yours? I am storing the dictionary/array descriptions using numpy dump.

My sample output is as follows:
seq, feat, des, pos (arrays)
fdic, ddic (dictionaries)

######SAMPLE OUTPUT of array elements
seq[0]
METMSDYSKEVSEALSALRGELSALSAAISNTVRAGSYSAPVAKDCKAGHCDSKAVLKSLSRSARDLDSAVEAVSSNCEWASSGYGKQIARALRDDAVRVKREVESTRDAVDVVTPSCCVQGLAEEAGKLSEMAAVYRCMATVFETADSHGVREMLAKVDGLKQTMSGFKRLLGKTAEIDGLSDSVIRLGRSIGEVLPATEGKAMRDLVKQCERLNGLVVDGSRKVEEQCSKLRDMASQSYVVADLASQYDVLGGKAQEALSASDALEQAAAVALRAKAAADAVAKSLDSLDVKKLDRLLEQASAVSGLLAKKNDLDAVVTSLAGLEALVAKKDELYKICAAVNSVDKSKLELLNVKPDRLKSLTEQTVVVSQMTTALATFNEDKLDSVLGKYMQMHRFLGMATQLKLMSDSLAEFQPAKMAQMAAAASQLKDFLTDQTVSRLEKVSAAVDATDVTKYASAFSDGGMVSDMTKAYETVKAFAAVVNSLDSKKLKLVAECAKK

feat[0] -> corresponding features in feature dictionary
[0, 2]

des[0] -> corresponding descriptions in description dictionary
[2, 3]

pos[0] -> corresponding to start/stop position for each feat+description pair
[['1', '502'], ['425', '428']]

Feature Dictionary
{'topological domain': 19, 'domain': 20, 'active site': 18, 'modified residue': 10, 'glycosylation site': 5, 'site': 9, 'propeptide': 31, 'non-consecutive residues': 11, 'turn': 27, 'chain': 0, 'sequence conflict': 7, 'signal peptide': 3, 'sequence variant': 13, 'nucleotide phosphate-binding region': 25, 'non-terminal residue': 12, 'helix': 14, 'disulfide bond': 22, 'initiator methionine': 8, 'transit peptide': 30}

Description Dictionary:
{'Napin-1A large chain': 127, 'NAD': 138, 'Loss of catalytic activity': 200, "5'-deoxynucleotidase yfbR": 212, '3-hydroxyanthranilate 3,4-dioxygenase': 161, 'Loss of activity': 44, 'Proton donor': 156, 'In dbSNP:rs1131215.': 71, 'In dbSNP:rs9269744.': 121, 'Muscarinic toxin 2': 181, 'In dbSNP:rs6211.': 144, 'Ig-like C1-type': 49, 'In isoform Short.': 29, 'Spectrin--actin-binding': 187, 'Probable nitronate monooxygenase': 122, 'Uncharacterized protein 123L': 13, 'Uncharacterized protein 234R': 82}
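
To read one entry back into names, the dictionaries need to be inverted. A minimal sketch, assuming fdic and ddic map names to the integer ids stored in feat and des (as the printed dictionaries suggest) and that pos holds start/stop as strings. `decode_entry` is a hypothetical helper, and the small dictionaries and id rows below are made up for illustration, not the real arrays.

```python
def decode_entry(feat_row, des_row, pos_row, fdic, ddic):
    """Turn one entry's id lists into (feature, description, start, stop) tuples."""
    # The dictionaries map name -> id, so invert them for lookup by id.
    id_to_feat = {i: name for name, i in fdic.items()}
    id_to_des = {i: name for name, i in ddic.items()}
    decoded = []
    for f_id, d_id, (start, stop) in zip(feat_row, des_row, pos_row):
        decoded.append((
            id_to_feat.get(f_id, "<unknown>"),
            id_to_des.get(d_id, "<unknown>"),
            int(start),  # positions are stored as strings in pos
            int(stop),
        ))
    return decoded

# Illustrative subsets only; the ids in the rows are hypothetical.
fdic = {"chain": 0, "helix": 14}
ddic = {"Loss of activity": 44, "Proton donor": 156}
rows = decode_entry([0, 14], [44, 156], [["1", "502"], ["425", "428"]], fdic, ddic)
```

Ids missing from either dictionary come back as `"<unknown>"` instead of raising, which makes mismatches between the arrays and dictionaries easy to spot.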

chrisprobert · 2015-05-12T16:05:49Z

This page documents the various feature annotations: http://www.uniprot.org/help/sequence_annotation

Ideally I think we'd like to download the 'structure' section, but I'm not sure how: http://www.uniprot.org/help/structure_section

Actually, it turns out that the feature section does contain secondary structure information. The 'helix', 'turn' and 'strand' values we found here are the same as those in the secondary structure section on the uniprotKB entries.
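
If we want per-residue labels from those features, one option is sketched below. The assumptions: positions are 1-based and inclusive, features arrive as (type, start, stop) tuples, and the H/E/T letter codes and the `secondary_structure_string` helper are my own choices for illustration, not something UniProt provides.

```python
# Hypothetical one-letter codes for the three secondary-structure feature types.
SS_CODES = {"helix": "H", "strand": "E", "turn": "T"}

def secondary_structure_string(seq_len, features):
    """Build a per-residue label string; '-' marks residues with no annotation."""
    labels = ["-"] * seq_len
    for ftype, start, stop in features:
        code = SS_CODES.get(ftype)
        if code is None:
            continue  # skip non-structural features such as 'chain'
        # Convert 1-based inclusive positions to 0-based indices, clamped to the sequence.
        for i in range(max(start - 1, 0), min(stop, seq_len)):
            labels[i] = code
    return "".join(labels)

ss = secondary_structure_string(
    10, [("helix", 2, 5), ("turn", 6, 7), ("chain", 1, 10)]
)
# ss == "-HHHHTT---"
```

Later features overwrite earlier ones where ranges overlap, which is a simplification we may want to revisit.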

aaroncp1an0 · 2015-05-13T06:29:27Z

I checked that out, thank you. Interesting, but unfortunately they don't
provide much, haha.

I have a flexible outline for implementing the 'part of speech tagging'
approach. If you have time to meet briefly tomorrow, let me know and I can
formalize it and make diagrams to put into our document.
Alternatively, I can do the same for your ideas if you have outlined how
you would implement a model.

I'm free most of the day.
-Aaron

