vdftools/gencode

gencode

GENCODE parser for vdftools. Current version: v47 (GRCh38). It produces the following files:

data/transcripts.parquet

  • transcript_chrom (string)
  • transcript_start (integer)
  • transcript_end (integer)
  • transcript_id (string) # ENST
  • gene_id (string) # ENSG
  • gene_symbol (string)
  • is_coding (bool)
  • is_pos_strand (bool)
  • is_seleno (bool) # is selenoprotein
  • priority_level (integer)
  • exon_total (integer) # exon count
  • is_excludable (bool) # transcripts with problems (i.e. readthrough, no-start, no-end)

Priority levels:

  1. APPRIS alternative 2
  2. APPRIS alternative 1
  3. APPRIS principal 5
  4. APPRIS principal 4
  5. APPRIS principal 3
  6. APPRIS principal 2
  7. APPRIS principal 1
  8. MANE Clinical
  9. MANE Select
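
As an illustration of how priority_level and is_excludable might be used together, the sketch below picks one transcript per gene. It assumes that a larger priority_level is preferred (MANE Select = 9 being the strongest tag), which the list above suggests but does not state. The function name and the demo IDs are hypothetical; in practice the frame would come from something like pd.read_parquet("data/transcripts.parquet").

```python
import pandas as pd


def pick_transcript_per_gene(transcripts: pd.DataFrame) -> pd.DataFrame:
    """Return one row per gene_id: the non-excludable transcript
    with the largest priority_level (assumption: larger is better)."""
    usable = transcripts[~transcripts["is_excludable"]]
    ranked = usable.sort_values(
        ["gene_id", "priority_level"], ascending=[True, False], kind="stable"
    )
    # After sorting, the first row of each gene is its top-priority transcript.
    return ranked.drop_duplicates("gene_id", keep="first").reset_index(drop=True)


# Tiny in-memory stand-in for data/transcripts.parquet (made-up IDs).
demo = pd.DataFrame(
    {
        "transcript_id": ["ENST_A1", "ENST_A2", "ENST_B1", "ENST_B2"],
        "gene_id": ["ENSG_A", "ENSG_A", "ENSG_B", "ENSG_B"],
        "priority_level": [7, 9, 2, 8],
        "is_excludable": [False, False, False, True],
    }
)
best = pick_transcript_per_gene(demo)
print(best[["gene_id", "transcript_id", "priority_level"]])
```

ENST_B2 is dropped despite its higher level because it is flagged as excludable; ties within a gene are broken by input order.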

data/exons.parquet

  • exon_chrom (string)
  • exon_start (integer)
  • exon_end (integer)
  • transcript_id (string) # ENST
  • exon_number (integer)
  • bases_preceding_exon (integer) # total bases in preceding exons
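
To make the meaning of bases_preceding_exon concrete, here is a minimal sketch of how such a column could be derived from exon coordinates. It rests on two assumptions not confirmed by this README: that intervals are 0-based half-open (so length = exon_end - exon_start), and that exon_number follows transcription order. The function name and demo values are hypothetical.

```python
import pandas as pd


def add_bases_preceding(exons: pd.DataFrame) -> pd.DataFrame:
    """Recompute bases_preceding_exon as the summed length of all
    earlier exons (by exon_number) within the same transcript."""
    out = exons.sort_values(["transcript_id", "exon_number"]).copy()
    # Assumption: half-open intervals, so the length is end - start.
    out["exon_length"] = out["exon_end"] - out["exon_start"]
    # Running total including this exon, minus this exon = bases before it.
    running = out.groupby("transcript_id")["exon_length"].cumsum()
    out["bases_preceding_exon"] = running - out["exon_length"]
    return out


# Made-up exons for one transcript, deliberately given out of order.
demo_exons = pd.DataFrame(
    {
        "exon_chrom": ["chr1", "chr1", "chr1"],
        "exon_start": [300, 100, 200],
        "exon_end": [310, 150, 230],
        "transcript_id": ["ENST_X", "ENST_X", "ENST_X"],
        "exon_number": [3, 1, 2],
    }
)
result = add_bases_preceding(demo_exons)
print(result[["exon_number", "exon_length", "bases_preceding_exon"]])
```

If the files use closed intervals instead, the length would be exon_end - exon_start + 1 and the totals would shift accordingly.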

data/cds.parquet

  • cds_chrom (string)
  • cds_start (integer)
  • cds_end (integer)
  • transcript_id (string) # ENST

data/branch_sites.parquet

  • branch_chrom (string)
  • branch_start (integer)
  • branch_end (integer)
  • transcript_id (string) # ENST

data/transcript_sequences.parquet

  • transcript_id (string) # ENST
  • orf_start_pos_in_transcript (integer) # 0-based position of A in start codon
  • orf_end_pos_in_transcript (integer) # closed (inclusive) position of the last base of the stop codon
  • transcript_sequence (string) # ACGT, coding transcripts
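
The two ORF columns can be combined to slice the open reading frame out of transcript_sequence. The sketch below assumes orf_end_pos_in_transcript is 0-based like the start and inclusive of the final stop-codon base, hence the + 1 in the slice. The function name and the toy sequence are hypothetical.

```python
def extract_orf(transcript_sequence: str, orf_start: int, orf_end: int) -> str:
    """Slice the ORF from a transcript sequence.

    orf_start: 0-based position of the A in the start codon.
    orf_end:   assumed 0-based, inclusive position of the last stop-codon base.
    """
    orf = transcript_sequence[orf_start : orf_end + 1]
    if len(orf) % 3 != 0:
        raise ValueError("ORF length is not a multiple of 3; check coordinates")
    return orf


# Toy transcript: 2-base 5' UTR, ATG AAA TAG, 2-base 3' UTR.
toy_sequence = "GGATGAAATAGCC"
toy_orf = extract_orf(toy_sequence, orf_start=2, orf_end=10)
print(toy_orf)
```

The length check is only a sanity guard for the sketch; transcripts flagged is_excludable or is_seleno may need different handling.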

About

Gencode and GRCh38 interval annotation data
