GPCR-mining

Functions to scrape data about G protein-coupled receptors (GPCRs) from the web.

The GPCRdb provides a comprehensive overview of sequence information for GPCRs, including definitions of transmembrane helices and generic residue numbering. Looking up a large number of residues by hand, or building the conversion for a specific receptor into an automated workflow, quickly becomes tedious. This package provides code to download and display this data.

Installation

You can install the latest release of the Python package via pip

pip install gpcrmining

or an editable installation from this repository

git clone https://github.com/drorlab/GPCR-mining
cd GPCR-mining
pip install -e .

Run within Python code

To include the functions in your Python workflow, import the library via

import gpcrmining.gpcrdb as db

Sequence information

You can download all information into a list of residues

gpcr_name = 'acm2_human'
res_info = db.get_residue_info(gpcr_name)

... convert sequential numbers to the generic GPCRdb numbers

db_num = db.sequential_to_gpcrdb('acm2_human', [393, 194, 151, 154, 190, 68, 108])
print(db_num)

... or the other way round

seq_num = db.gpcrdb_to_sequential('acm2_human', ['6.41x41', '5.46x461', '4.56x56', '5.42x43'])    
print(seq_num)

These conversion functions also work with the generic numbering schemes for signaling proteins (arrestins and G proteins).
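As a rough illustration of what such a conversion amounts to, the sketch below builds the two lookup tables by hand from a few adrb1_human residues that appear in the command-line output later in this README. The row layout (segment, sequential number, amino acid, GPCRdb label) and the helper names `build_lookups`, `to_generic`, and `to_sequential` are assumptions made for this example; they are not the documented return format or API of the package.

```python
# Hypothetical rows: (segment, sequential number, amino acid, GPCRdb label).
# The values come from the adrb1_human example output in this README
# (the segment of residue 316 is inferred from its 6.x label).
# The tuple layout itself is an assumption, not the library's format.
SAMPLE_ROWS = [
    ("TM5", 231, "V", "5.45x46"),
    ("TM5", 232, "S", "5.46x461"),
    ("TM6", 313, "R", "6.24x24"),
    ("TM6", 316, "A", "6.27x27"),
    ("TM6", 339, "P", "6.50x50"),
]


def build_lookups(rows):
    """Build dictionaries for both conversion directions."""
    seq_to_label = {seq: label for _, seq, _, label in rows}
    label_to_seq = {label: seq for _, seq, _, label in rows}
    return seq_to_label, label_to_seq


seq_to_label, label_to_seq = build_lookups(SAMPLE_ROWS)


def to_generic(seq_numbers):
    """Map sequential numbers to labels; unknown numbers map to None."""
    return [seq_to_label.get(number) for number in seq_numbers]


def to_sequential(labels):
    """Map combined labels to sequential numbers; unknown labels map to None."""
    return [label_to_seq.get(label) for label in labels]


print(to_generic([313, 339, 999]))
print(to_sequential(["5.45x46", "6.27x27"]))
```

Returning None for residues without an entry keeps the output aligned with the input list, which is convenient when the numbers come from another tool.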

Structures

You can obtain information about any GPCR structure from the PDB such as the entry names of receptor and signaling proteins, chain names, or details on the experimental method. Run the following to obtain a dictionary with the corresponding information:

info = db.get_structure_info('6u1m')
print(info)

To download the PDB file to a new directory, run

db.download_pdb_structure('6u1m', directory='structures')

Run from the command line

Obtain the entire sequence

To obtain the entire sequence of a receptor and save it as an easy-to-use CSV file, run

python -m gpcrmining.gpcrdb -n GPCR_NAME -d DIR

with "GPCR_NAME" being the name of the GPCR as used in the corresponding GPCRdb URL. "DIR" is the directory in which the data is saved (default: data-gpcrdb); it is created if it does not exist. For example,

python -m gpcrmining.gpcrdb -n adrb1_human -d my-data-from-gpcrmd

writes the file gpcrdb-residues_adrb1_human.csv into the directory my-data-from-gpcrmd.

Select and print residues

To select residues by their sequential number, use the option -rn. To select several residues, separate their numbers with whitespace and enclose the whole list in quotation marks.

python -m gpcrmining.gpcrdb -n adrb1_human -rn "230 231 232 233 313 339" 

To select residues by a generic residue numbering scheme, use the option -id. GPCRdb uses two similar numbering systems: a sequence-based one (following Ballesteros-Weinstein, Wootten, ...) and one corrected for helix bulges. By default, the code returns the combined format. For input, either format can be used (Ballesteros-Weinstein etc. with a dot as separator, the GPCRdb format with x), as well as the combined one. Numbering schemes can be mixed, e.g.,

python -m gpcrmining.gpcrdb -n adrb1_human -id "5.45 5x461 6.24 6.27 6.50x50"
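To make the three input notations concrete, here is a minimal sketch of how a mixed list of labels could be matched against combined labels such as 5.46x461. The function `matches` is hypothetical and written only for this illustration; it is not the parsing code of the package.

```python
def matches(query, combined):
    """Return True if a user-supplied label refers to the combined label.

    A combined label such as "5.46x461" carries the sequence-based
    number ("5.46") before the x and the bulge-corrected position
    ("461") after it.
    """
    seq_part, _, bulge_part = combined.partition("x")  # "5.46", "461"
    helix = seq_part.split(".")[0]                      # "5"
    if "." in query and "x" in query:   # combined format, e.g. "6.50x50"
        return query == combined
    if "." in query:                    # sequence-based, e.g. "5.45"
        return query == seq_part
    if "x" in query:                    # GPCRdb format, e.g. "5x461"
        return query == f"{helix}x{bulge_part}"
    return False


# Combined labels taken from the adrb1_human examples in this README.
COMBINED = ["5.45x46", "5.46x461", "6.24x24", "6.27x27", "6.50x50"]
queries = "5.45 5x461 6.24 6.27 6.50x50".split()

selected = [c for c in COMBINED if any(matches(q, c) for q in queries)]
print(selected)
```

Note that 5x461 selects 5.46x461 and not 5.45x46: in the GPCRdb format the digits after the x are compared, not the sequence-based number.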

To select defined parts of the receptor, use the option -p.

python -m gpcrmining.gpcrdb -n adrb1_human -p "N-term TM7 ICL2"

If several selection flags are provided, only residues that fulfill all conditions will be printed. For example,

python -m gpcrmining.gpcrdb -n adrb1_human -id "5.45 5x461 6.24 6.27 6.50x50" -rn "230 231 232 233 313 339"    

prints the following:

Residue mapping for adrb1_human, using directory ./data-gpcrdb.
   TM5  231 V 5.45x46
   TM5  232 S 5.46x461
   TM6  313 R 6.24x24
   TM6  339 P 6.50x50
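The "all conditions" rule can be sketched as a simple filter in which every flag that is given must accept a residue. The rows below are copied from the example output in this README (residue 316 comes from the DrorMD example, with its segment inferred from the label); the tuple layout and the function `select` are assumptions for illustration, and the labels are given in combined format to keep the sketch short.

```python
# Hypothetical rows: (segment, sequential number, amino acid, GPCRdb label).
ROWS = [
    ("TM5", 231, "V", "5.45x46"),
    ("TM5", 232, "S", "5.46x461"),
    ("TM6", 313, "R", "6.24x24"),
    ("TM6", 316, "A", "6.27x27"),
    ("TM6", 339, "P", "6.50x50"),
]


def select(rows, seq_numbers=None, labels=None, parts=None):
    """Keep only rows that satisfy every condition that was provided.

    A condition left as None is treated as "not requested" and
    therefore does not restrict the selection.
    """
    selected = []
    for part, seq, aa, label in rows:
        if seq_numbers is not None and seq not in seq_numbers:
            continue
        if labels is not None and label not in labels:
            continue
        if parts is not None and part not in parts:
            continue
        selected.append((part, seq, aa, label))
    return selected


result = select(
    ROWS,
    seq_numbers={230, 231, 232, 233, 313, 339},
    labels={"5.45x46", "5.46x461", "6.24x24", "6.27x27", "6.50x50"},
)
for row in result:
    print(*row)
```

Residue 316 (6.27x27) matches the label condition but is dropped because it is not in the list of sequential numbers, which is why it is absent from the output above.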

To obtain analogous residues across receptors, use a multiple-entry string, just as for the residues:

python -m gpcrmining.gpcrdb -n "adrb1_human adrb2_human" -id "5.45 5x461 6.24 6.27 6.50x50"
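If you want to drive the command-line interface from a script, one possible approach is to assemble the argument list in Python and pass it to subprocess. The helper `build_command` below is hypothetical and not part of the package; it uses only the flags described in this README (-n, -d, -rn, -id, -p) and joins multiple entries with whitespace, which is what the quotation marks achieve on the shell.

```python
import sys


def build_command(names, directory=None, seq_numbers=None,
                  generic_ids=None, parts=None):
    """Assemble the argument list for `python -m gpcrmining.gpcrdb`.

    Multi-entry values are joined into one whitespace-separated
    argument each, mirroring the quoted strings used on the shell.
    """
    cmd = [sys.executable, "-m", "gpcrmining.gpcrdb", "-n", " ".join(names)]
    if directory is not None:
        cmd += ["-d", directory]
    if seq_numbers:
        cmd += ["-rn", " ".join(str(n) for n in seq_numbers)]
    if generic_ids:
        cmd += ["-id", " ".join(generic_ids)]
    if parts:
        cmd += ["-p", " ".join(parts)]
    return cmd


cmd = build_command(
    ["adrb1_human", "adrb2_human"],
    generic_ids=["5.45", "5x461", "6.24", "6.27", "6.50x50"],
)
print(cmd[1:])

# To actually run it (requires gpcrmining to be installed and network access):
# import subprocess
# subprocess.run(cmd, check=True)
```

Passing a list to subprocess avoids shell quoting issues, because each element reaches the program as exactly one argument.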

Output formats

Available output formats are 'plain' and 'drormd', with 'plain' (as above) being the default.

If you would like to have another format added, you have two options:

  • open an issue with a description of what you have in mind or
  • fork the repo, implement your favorite format as an additional option, and open a pull request.

The specific DrorMD format has an option to define one or multiple segment IDs. For example,

python -m gpcrmining.gpcrdb -n adrb1_human -id "6.24 6.27 6.50" -f drormd -s 'P0 P1'

prints the numbers in a format that can be directly copied into a DrorMD conditions file:

Residue mapping for adrb1_human, using directory ./data-gpcrdb.
'R6.24x24': 'segid P0 P1 and resid 313'
'A6.27x27': 'segid P0 P1 and resid 316'
'P6.50x50': 'segid P0 P1 and resid 339'
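For anyone planning to add a further output format, the sketch below reproduces the lines shown above from a list of residue rows. It is an illustration only: the tuple layout and the function `format_drormd` are assumptions and do not reflect how the package implements the format internally.

```python
def format_drormd(rows, segids):
    """Format rows as lines like 'R6.24x24': 'segid P0 P1 and resid 313'.

    rows:   iterable of (segment, sequential number, amino acid, label)
    segids: list of segment IDs, e.g. ["P0", "P1"]
    """
    seg = " ".join(segids)
    return [
        f"'{aa}{label}': 'segid {seg} and resid {seq}'"
        for _, seq, aa, label in rows
    ]


# Rows taken from the adrb1_human example above (hypothetical layout).
rows = [
    ("TM6", 313, "R", "6.24x24"),
    ("TM6", 316, "A", "6.27x27"),
    ("TM6", 339, "P", "6.50x50"),
]

lines = format_drormd(rows, ["P0", "P1"])
for line in lines:
    print(line)
```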