Bio::Chromo - List of genes in a chromosome
version 0.004
use Bio::Chromo;
my $chromo = Bio::Chromo->new('NC_000002');
my @aminos = $chromo->aminos(272244); # D, T
my @codons = $chromo->codons(272244); # gac, acc
This class represents the list of CDS
features in a chromosome.
The list is kept sorted (by inheriting from List::Sorted) for fast lookup.
The main purpose of this class is currently to retrieve the amino acids
coded at a particular location in the chromosome. This is achieved through
the "aminos()" method. The other notable feature of this class is that it
automatically caches the chromosome data it needs, which drastically
reduces the time, memory and bandwidth consumed.
This section contains a short (and imprecise) summary of the biological background required to understand the problem solved by this module.
The DNA consists of a fixed (per species) number of chromosomes. Each chromosome can be dealt with independently, and so we always consider a fixed chromosome. In humans, the chromosomes are enumerated 1-22, X, Y. The chromosome is a sequence of the letters A,T,C,G. Each such letter is called a base or a nucleotide. The letters A and T and the letters C and G are dual to each other. The process of turning each letter to its dual in a given sequence is termed complementing the sequence.
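Complementing is a plain character substitution. A minimal Perl sketch follows; the name complement is illustrative and is not part of Bio::Chromo.

```perl
use strict;
use warnings;

# Swap each base with its dual: A <-> T and C <-> G (case is preserved).
sub complement {
    my ($seq) = @_;
    (my $dual = $seq) =~ tr/ACGTacgt/TGCAtgca/;
    return $dual;
}

print complement('GATTACA'), "\n";    # CTAATGT
```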
The purpose of the DNA is to code (and produce) proteins. A protein is
coded by a segment of the chromosome called a gene. The protein itself is
a sequence of amino acids, and the protein is coded by coding this
sequence. Each kind of amino acid is also symbolised by a fixed Latin letter.
An amino acid is coded by a sequence of 3 bases in the gene. Such a sequence
is called a codon. Thus, the main goal of this module is to compute in
which codon a particular base lies. We further have the following
terminology:
- Feature
Generic name for an interesting part of the chromosome. Types of features are identified by their primary tag. For us, it seems that the only one of interest is CDS.
- CDS
Essentially, the same as a gene. However, every gene may have more than one interpretation, in the sense that the same (approximate) region in the chromosome might code several different proteins. These different variants correspond to different CDS features, though they are considered to be the same gene. It is impossible to determine, from the chromosome data alone, which interpretation is actually used: This depends on biological input external to the DNA (and might change, e.g., from one cell to another).
Therefore, we consider all of them. In this module, a CDS feature is represented by a helper class, Bio::Chromo::CDS. Each Bio::Chromo object is essentially a sorted array of Bio::Chromo::CDS objects.
- Gene
Largest meaningful unit. Each gene codes a protein, though there might be several variants, as explained above. Each gene has a name, like 'FAM110C', and these names are different for different genes on the same chromosome (though not for different CDSs that correspond to the same gene).
The part of the gene that actually corresponds to the protein need not be contiguous: each contiguous part is called an exon. The part that actually codes the protein is obtained by concatenating the exons in one gene, and then possibly reversing (the order) and complementing the resulting sequence.
The list and boundaries of the exons might differ in different CDSs corresponding to the same gene. For this reason, we are not directly interested in the genes, and they have no counterpart in the code.
- Exon
As explained above, this is a connected component of the part of a gene that actually codes a protein. Each exon is represented in the code by a Bio::Location::Simple object.
Summarising the information above, finding the codons of the base at position N is logically equivalent to:
- Find all CDS features containing position N
- For each such feature, concatenate the exons to obtain a string S
- Compute the location M of position N within the string S
- If the direction (strand) of the gene is reversed, reverse and complement S, updating M.
- Compute L=M/3, rounding down. The codon consists of the three bases within S starting at offset 3*L (with M and the offset both counted from 0).
In reality, we perform a slightly different computation, and also compute directly the amino acid, rather than the codon (the translation from codons to amino acids is completely determined). This is done for reasons of efficiency, and should be equivalent.
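The logical steps listed above can be sketched literally for a single CDS. This is not the module's actual (optimised) computation. The function name codon_at, the representation of exons as [start, end] pairs (1-based, inclusive, in ascending order) and the strand values 1 and -1 are all assumptions made for illustration.

```perl
use strict;
use warnings;

# $chrom: chromosome as a string; $exons: reference to a list of
# [start, end] pairs; $strand: 1 or -1; $n: 1-based chromosome position.
sub codon_at {
    my ($chrom, $exons, $strand, $n) = @_;
    my ($s, $m) = ('', undef);
    for my $exon (@$exons) {
        my ($start, $end) = @$exon;
        # 0-based offset of position $n inside the concatenated string
        $m = length($s) + $n - $start if $n >= $start && $n <= $end;
        # Concatenate the exons
        $s .= substr($chrom, $start - 1, $end - $start + 1);
    }
    return unless defined $m;    # $n lies in no exon of this CDS
    if ($strand < 0) {
        # Reverse and complement S, updating M accordingly
        $s = scalar reverse $s;
        $s =~ tr/ACGTacgt/TGCAtgca/;
        $m = length($s) - 1 - $m;
    }
    my $l = int($m / 3);
    return substr($s, 3 * $l, 3);
}

my $chrom = 'GGATGGAGTTCACCTGACC';
my $exons = [ [3, 7], [11, 14] ];                # S = ATGGACACC
print codon_at($chrom, $exons,  1, 12), "\n";    # ACC
print codon_at($chrom, $exons, -1, 12), "\n";    # GGT
```

To follow the first step of the list, the same function would be called once for each CDS feature containing the position, which is why the result of codons() is a list.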
my $chromo = new Bio::Chromo [$seq]
Create a new Bio::Chromo object. The $seq argument can be either a
Bio::SeqI type object (or anything else that has a get_SeqFeatures
method), or an id, such as NC_000002. The object is initialised using
either init_from_seq() or init_from_id(), as appropriate.
Compare two CDSs, for sorting using List::Sorted. We order the CDSs first by starting point and, among CDSs with the same starting point, by end point.
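A two-key comparison of this kind can be sketched as follows. The real Bio::Chromo::CDS accessors are not shown in this document, so plain hashes with start and end keys stand in for the objects, and compare_cds is a hypothetical name.

```perl
use strict;
use warnings;

# Hypothetical stand-ins for Bio::Chromo::CDS objects: hashes with
# 'start' and 'end' keys. Ties on 'start' are broken by 'end'.
sub compare_cds {
    my ($x, $y) = @_;
    return $x->{start} <=> $y->{start}
        || $x->{end}   <=> $y->{end};
}

my @cds = (
    { name => 'b', start => 10, end => 90 },
    { name => 'c', start => 40, end => 60 },
    { name => 'a', start => 10, end => 50 },
);
my @sorted = sort { compare_cds($a, $b) } @cds;
print join(' ', map { $_->{name} } @sorted), "\n";    # a b c
```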
@aminos = $self->aminos($i)
Return the list of possible amino acids coded by a possible codon to which
the given base $i belongs. Thus, each element in the result is one upper case
letter, or * for the end of the sequence.
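The translation from a codon to its one-letter amino acid can be sketched with a lookup table. The sketch assumes the standard genetic code, which this document does not state explicitly. The names %amino_of and translate_codon are illustrative, and the module itself computes the amino acid directly rather than going through such a helper.

```perl
use strict;
use warnings;

# Standard genetic code, with codons enumerated in the order T, C, A, G at
# each of the three positions (TTT, TTC, TTA, TTG, TCT, ...).
my @bases  = qw(T C A G);
my $aminos = 'FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG';
my %amino_of;
my $k = 0;
for my $b1 (@bases) {
    for my $b2 (@bases) {
        for my $b3 (@bases) {
            $amino_of{"$b1$b2$b3"} = substr($aminos, $k++, 1);
        }
    }
}

# Translate one codon (either case). Returns undef for anything that is
# not three letters over ACGT.
sub translate_codon {
    my ($codon) = @_;
    return $amino_of{ uc($codon) };
}

print join(', ', map { translate_codon($_) } qw(gac acc)), "\n";    # D, T
```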
@codons = $self->codons($i)
Return the list of possible codons to which the given base $i belongs. Thus,
each element in the result is a string of length 3 over the alphabet ACGT.
@genes = $self->genes($i);
Returns the indices of all genes containing the given position. More
precisely, each element returned is an index into $self of a feature (CDS)
containing absolute position $i on the sequence. Note that the direction
(strand) plays no role here. The features themselves, as Bio::Chromo::CDS
objects, can be accessed by usual array notation into $self, as in
    for my $g ( @genes ) {
        my $feat = $self->[$g];
        # $feat is a Bio::Chromo::CDS object now
        # ...
    }
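How a list sorted by starting point helps here can be seen in a small sketch. It is only an illustration of the idea: the actual lookup goes through List::Sorted, whose interface is not described in this document. The function indices_containing and the hash stand-ins for CDS objects are hypothetical.

```perl
use strict;
use warnings;

# $list: reference to an array of hypothetical CDS stand-ins (hashes with
# 'start' and 'end'), sorted by start. Returns the indices of all entries
# whose span contains position $i. Strand is ignored.
sub indices_containing {
    my ($list, $i) = @_;
    my @hits;
    for my $idx (0 .. $#$list) {
        my $cds = $list->[$idx];
        last if $cds->{start} > $i;    # sorted by start: no later entry can match
        push @hits, $idx if $cds->{end} >= $i;
    }
    return @hits;
}

my $list = [
    { start => 10,  end => 50 },
    { start => 10,  end => 90 },
    { start => 40,  end => 60 },
    { start => 100, end => 120 },
];
print join(',', indices_containing($list, 45)), "\n";    # 0,1,2
```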
$self->init_from_seq($seq)
Initialise the object from the given Bio::SeqI object $seq. In fact, $seq
is only required to have a get_SeqFeatures method, which returns a
list of objects that look like Bio::SeqFeatureI. The resulting object will
hold a list of items corresponding to the CDS items in $seq.
Returns true iff the initialisation was successful.
$self->init_from_id($id)
Initialise the object from the sequence with id $id, a string that looks
something like NC_000002. This is logically equivalent to fetching the
sequence given by $id from GenBank, and then calling init_from_seq()
with it, except we store a cache of the actual structures, and so this is
potentially much faster and less memory and network consuming. Both the cache
file and the GenBank file will be stored on the local machine, in the
directory given by DataDir(). These files need to be erased if the
GenBank file should be re-downloaded from the site (for example, if there is
a new version on the web).
Returns true iff the initialisation was successful.
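The general "use the cache if it is there, otherwise build and store" pattern can be sketched as below. This is not the module's code: the core Storable module stands in for the YAML cache so that the sketch needs no extra dependencies, and load_or_build, $CACHE_VERSION and the version/data layout of the stored hash are all hypothetical.

```perl
use strict;
use warnings;
use Storable qw(store retrieve);

our $CACHE_VERSION = 1;    # hypothetical; bump when the stored layout changes

# Return the structure cached in $file if the file exists and carries the
# current version; otherwise call $build->() and store its result.
sub load_or_build {
    my ($file, $build) = @_;
    if (-e $file) {
        my $cached = eval { retrieve($file) };    # unreadable cache => rebuild
        return $cached->{data}
            if $cached && ($cached->{version} // 0) == $CACHE_VERSION;
    }
    my $data = $build->();
    store({ version => $CACHE_VERSION, data => $data }, $file);
    return $data;
}
```

Here $build would be the expensive step, such as downloading and parsing the GenBank record. Deleting the file forces it to run again, in the same way as erasing the cache files described above.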
DataDir() returns the directory where the GenBank files and the cache files
are stored. It is taken from the environment variable $GENBANK_DIR, or
$HOME/genbank by default.
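The lookup order can be sketched in one line. The function name data_dir is illustrative, and whether the module treats an empty $GENBANK_DIR as unset is not stated here; this sketch falls back only when the variable is undefined.

```perl
use strict;
use warnings;

# Illustrative only: the environment variable wins; otherwise fall back to
# a 'genbank' directory under the user's home.
sub data_dir {
    return $ENV{GENBANK_DIR} // "$ENV{HOME}/genbank";
}
```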
Version of the data stored in the cache. Should be increased whenever the
structure of the Bio::Chromo::CDS package is changed, or the format of the
data stored in the YAML cache files is otherwise modified.
$seq = $self->_fetch_gb($id[, $gb[, $dir]])
Fetch the sequence with id $id from GenBank. If $gb is also given, it is the full path of the file to which the result is written, for reuse. $dir is the directory portion of $gb, which may be provided for efficiency.
Returns the resulting Bio::Seq object, or an empty list if fetching failed.
Bio::Chromo::CDS, List::Sorted
Moshe Kamensky
This software is copyright (c) 2013 by Moshe Kamensky.
This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself.