CPEN 221 / Fall 2021 / Example
In this exercise you will implement a datatype on your own. You will have to define and implement all the methods for a datatype that represents a DNA sequence. You will also use the datatype `Set` in this exercise.
This section provides background information related to this exercise. It is for your information only; you do not need to fully understand it to complete the tasks.
Deoxyribonucleic acid (DNA) is a complex biochemical macromolecule that carries genetic information for cellular life forms and some viruses. DNA is also the mechanism through which genetic information from parents is passed on during reproduction. DNA consists of long chains of chemical compounds called nucleotides. Four nucleotides are present in DNA: Adenine (A), Cytosine (C), Guanine (G), and Thymine (T). DNA has a double helix structure containing complementary chains of these four nucleotides connected by hydrogen bonds.
Certain regions of the DNA are called genes. Most genes encode instructions for building proteins (they are called "protein-coding" genes). These proteins are responsible for carrying out most of the life processes of the organism. Nucleotides in a gene are organized into codons. Codons are groups of three nucleotides and are written as the first letters of their nucleotides (e.g., TAC or GGA). Each codon uniquely encodes a single amino acid, a building block of proteins. The process of building proteins from DNA has two major phases called transcription and translation: a gene is first transcribed into an intermediate form called mRNA, which is then processed by a structure called a ribosome to build the chain of amino acids encoded by the codons of the gene.
The sequences of DNA that encode proteins occur between a start codon (which we will assume to be ATG) and a stop codon (which is any of TAA, TAG, or TGA). Not all regions of DNA are genes; large portions that do not lie between a valid start and stop codon are called intergenic DNA and have other (possibly unknown) functions. Computational biologists examine large DNA data files to find patterns and important information, such as which regions are genes. Sometimes they are interested in the percentages of mass accounted for by each of the four nucleotide types. Often high percentages of Cytosine (C) and Guanine (G) are indicators of important genetic data.
More about DNA: DNA - Wikipedia
You will implement a datatype called `DNA` that will represent a DNA sequence. This datatype should support the following operations:
- creation/constructor: using a `String` of nucleotide sequences;
- `isProtein()`: returns `true` if the DNA sequence is a protein and `false` otherwise;
- `totalMass()`: returns a `double` value that represents the mass of the DNA sequence (this value should be rounded to one digit past the decimal point).
- `nucleotideCount(char c)`: given a nucleotide (one of `A`, `C`, `G` or `T`), returns the count of that nucleotide in the DNA sequence (and this method returns 0 for all invalid/junk nucleotides).
- `codonSet()`: returns a `Set` that contains all the distinct codons in the DNA sequence;
- `mutateCodon(String originalCodon, String newCodon)`: alters the DNA sequence by replacing all occurrences of `originalCodon` with `newCodon`, and eliminates all junk regions (see details below).
- `sequence()`: returns the nucleotide sequence.
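To illustrate the `nucleotideCount` contract described above, here is a minimal sketch. It is not a reference solution: the class name `NucleotideCountSketch` is hypothetical, and the method is written as a static helper that takes the sequence as a parameter, whereas in your `DNA` datatype it would be an instance method.

```java
// Illustrative sketch only; the class name and the static-helper shape are assumptions.
final class NucleotideCountSketch {

    /** Counts occurrences of c in sequence; returns 0 if c is not one of A, C, G, T. */
    static int nucleotideCount(String sequence, char c) {
        // Junk or invalid characters are never counted, even if they appear in the sequence.
        if (c != 'A' && c != 'C' && c != 'G' && c != 'T') {
            return 0;
        }
        int count = 0;
        for (int i = 0; i < sequence.length(); i++) {
            if (sequence.charAt(i) == c) {
                count++;
            }
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(nucleotideCount("ATCGAA", 'A')); // prints 3
        System.out.println(nucleotideCount("ACGT-!BAG", 'B')); // prints 0, since B is junk
    }
}
```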
For the programming exercise, here are the details that you will need:
- A nucleotide sequence consists of the characters `A`, `C`, `T` and `G`. Other characters can be part of the sequence, but we will consider them as representing junk regions. You may assume that all nucleotides will be represented using uppercase characters. We will not use `a`, `c`, `g` or `t` as part of the test cases. This is not a major assumption/restriction.
- A codon is a sequence of three nucleotides. You can ignore junk regions when determining the codons. For example, `ACGT-!BAG` contains two codons: `ACG` and `TAG`.
- The sequence `ATCGAA` represents the codons `ATC` and `GAA`. Further, this sequence has the following nucleotide counts: `A: 3`, `C: 1`, `G: 1`, and `T: 1`.
- To determine the mass of a sequence, we will use the following list for the mass of a nucleotide (in g/mol).
  - Adenine (A): 135.128
  - Cytosine (C): 111.103
  - Guanine (G): 151.128
  - Thymine (T): 125.107
  - Junk: 100.000
- Proteins: we will assume that a DNA sequence is a protein-coding gene if the following criteria are met.
  - The sequence starts with the codon `ATG`;
  - The sequence ends with one of these codons: `TAA`, `TAG`, `TGA`;
  - The sequence contains at least five codons, including the start and stop codons;
  - Cytosine and Guanine account for at least 30% of the total mass of the DNA sequence.
  - For the purpose of determining the starting and ending codon, ignore all junk regions.
- When you perform a mutation, you should also remove all the junk regions. You should verify that the old codon and the new codon are valid codons; otherwise, the mutation should have no effect.
- If you want to round a number to one decimal place, you may want to try something like this:

      double y = 3.1412;
      double x = Math.round(y * 10.0) / 10.0;
- Every DNA sequence should have complete codons. Your constructor should reject invalid sequences such as `ATCGAxbTTz`, which has a few complete codons (`ATC`, `GAT`) but also one incomplete codon. When a string passed to the constructor fails to meet this requirement, an `IllegalArgumentException` must be thrown. You can do this rather easily. An example is here:

      // assuming there is a function/method to validate a String
      if (!valid(dnaSequence)) {
          throw new IllegalArgumentException("Invalid DNA sequence");
      }
- We will discuss exceptions, but you can read ahead by consulting the official Java tutorial.
- You will have to work with sets in this exercise. In particular, you will have to return a `Set<String>` for the operation `codonSet`. You can read more about this datatype and a concrete implementation, `HashSet`, as part of the official Java documentation.
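To make the details above more concrete, here is one possible sketch of junk filtering, codon splitting, the complete-codon check and the mass computation. It is only an illustration under stated assumptions, not a reference solution: the class name `DnaHelpersSketch` and its static methods are hypothetical, and the sketch treats an empty string as having zero complete codons, which this handout does not specify.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustrative sketch only; all names here are hypothetical helpers.
final class DnaHelpersSketch {

    private static boolean isNucleotide(char c) {
        return c == 'A' || c == 'C' || c == 'G' || c == 'T';
    }

    /** Returns the sequence with every junk character removed. */
    static String stripJunk(String sequence) {
        StringBuilder sb = new StringBuilder();
        for (char c : sequence.toCharArray()) {
            if (isNucleotide(c)) {
                sb.append(c);
            }
        }
        return sb.toString();
    }

    /** Splits the sequence into codons, ignoring junk; rejects an incomplete trailing codon. */
    static List<String> codons(String sequence) {
        String clean = stripJunk(sequence);
        if (clean.length() % 3 != 0) {
            throw new IllegalArgumentException("Invalid DNA sequence");
        }
        List<String> result = new ArrayList<>();
        for (int i = 0; i < clean.length(); i += 3) {
            result.add(clean.substring(i, i + 3));
        }
        return result;
    }

    /** Returns the distinct codons; a HashSet drops duplicates automatically. */
    static Set<String> codonSet(String sequence) {
        return new HashSet<>(codons(sequence));
    }

    /** Sums the per-character masses (junk counts as 100.000) and rounds to one decimal. */
    static double totalMass(String sequence) {
        double mass = 0.0;
        for (char c : sequence.toCharArray()) {
            switch (c) {
                case 'A': mass += 135.128; break;
                case 'C': mass += 111.103; break;
                case 'G': mass += 151.128; break;
                case 'T': mass += 125.107; break;
                default:  mass += 100.000; break;
            }
        }
        return Math.round(mass * 10.0) / 10.0;
    }

    public static void main(String[] args) {
        System.out.println(codons("ACGT-!BAG")); // prints [ACG, TAG]
        System.out.println(totalMass("ATCGAA")); // prints 792.7
    }
}
```

With this sketch, `ACGT-!BAG` yields the codons `ACG` and `TAG`, `ATCGAA` has a mass of 792.7, and `ATCGAxbTTz` is rejected with an `IllegalArgumentException` because its seven nucleotides leave one codon incomplete.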
- You should implement the `DNA` datatype in the file `DNA.java`, which is in the directory `src/main/java/dna`.
- You should not change the directory structure.
- The `DNA` datatype is part of the `dna` package. Remember to start the `DNA.java` file with the statement `package dna;`
Also read:
Creating and Using Packages (The Java™ Tutorials > Learning the Java Language > Packages)