Megabyte edited this page Aug 21, 2023 · 10 revisions

Welcome to the Phylogenetic-tree-study wiki!

Introduction

We are trying to figure out the similarities and differences between microbial sequences, specifically the 16S ribosomal ribonucleic acid (rRNA) region. This region is very important in bacteria and archaea because it contains both conserved and variable regions. It is approximately 1500 nucleotides long, but the length can vary.

Findings

In the past, @Shuyib, who was doing the study, was mostly interested in comparing sequences using conventional methods: character-based methods and distance-based methods. These are commonly used in bioinformatics / genomic data science. However, @Shuyib noticed that machine-learning-based methods, for example hierarchical clustering, can be substituted to help with the estimation of the phylogenetic tree. In addition, we noticed that the resulting tree came close to matching the tree of life, though only for the few sequences we had.

At the moment, we have updated the sequences to meet the constraint, that is, only drug-resistant sequences, preferably obtained from human beings. We computed the most common metrics, e.g. GC content, GA content and motifs. Surprisingly, the motifs annotated by MEME were of plant ancestors (paralogous gene [needs confirmation]), which is quite intriguing. On the other hand, UniProt gave us more clues that the genes related to the sequences were not annotated yet. These sections are important for gene therapy, where we can use CRISPR to edit these regions and observe the expression.

Other objectives we'd like to achieve include finding motifs with natural language processing (NLP) techniques, for example DNABERT, embeddings with visualization, and similarity matching algorithms, e.g. cosine similarity after NLP processing pipelines.

So far we have used cosine similarity as a similarity metric. We scored our sequences after padding them (adding zeros to shorter sequences so that the lengths match). Similar sequences will have a score closer to one. Those that are not will have values less than 0.5. This makes the results quite easy to interpret. We found that the sequences were really similar, with some exceptions, which we intend to investigate by visualizing the results in a Cartesian plane or heatmap.
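The padding and scoring steps described above can be sketched in a few lines of Python. This is only a minimal illustration, not the code used in the study: the integer encoding in `ENCODING` and the helper names `encode`, `cosine_similarity` and `padded_cosine` are assumptions made for the example. Identical sequences score about 1, and the score drops as the encoded vectors diverge.

```python
import math

# Hypothetical integer encoding of nucleotides (an assumption for this sketch);
# 0 is reserved for padding and for any unrecognized character.
ENCODING = {"A": 1, "C": 2, "G": 3, "T": 4}


def encode(seq, length):
    """Map bases to integers and right-pad with zeros up to `length`."""
    codes = [ENCODING.get(base, 0) for base in seq.upper()]
    return codes + [0] * (length - len(codes))


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # undefined for all-zero vectors; treated as "no similarity"
    return dot / (norm_u * norm_v)


def padded_cosine(seq_a, seq_b):
    """Pad the shorter sequence with zeros, then score the pair."""
    length = max(len(seq_a), len(seq_b))
    return cosine_similarity(encode(seq_a, length), encode(seq_b, length))


if __name__ == "__main__":
    print(padded_cosine("ACGT", "ACGT"))
    print(padded_cosine("ACGT", "AC"))
```

Note that an integer encoding imposes an arbitrary ordering on the bases; a k-mer count vector or a learned embedding would be a reasonable alternative input to the same `cosine_similarity` function.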

FAQ

Q: What is a Phylogenetic tree (aka phylogeny)? A: According to Baum in Nature, this is a diagram that shows lines of evolutionary descent of different species, organisms or genes from a common ancestor. It is useful for organizing knowledge of biological diversity, for structuring classifications, or understanding evolutionary events.

Q: What is antimicrobial resistance? A: This is when a microorganism is no longer sensitive to the action of antibiotics. Antibiotics are compounds produced by the natural metabolic processes of microbes, for instance fungi, that kill or inhibit other microbes. Key figures are Alexander Fleming, who discovered penicillin, and Selman Waksman, who coined the term "antibiotic". Resistant microbes are no longer affected by these antibiotics: they can lose or gain function through mutations ("misspellings") in their genetic material and so bypass the mechanism of action of antibiotics. Some microbes can also share genes and may acquire resistance via horizontal gene transfer.

Q: What are nitrogenous bases? A: Adenine, Guanine, Cytosine and Thymine (Uracil replaces Thymine in ribonucleic acid, including messenger RNA). These are the building blocks of DNA (deoxyribonucleic acid) or its variant RNA (ribonucleic acid). Notice that the sugar is what distinguishes the two: deoxyribose in DNA and ribose in RNA.

Q: What is 16S rRNA (ribosomal ribonucleic acid)? A: It is a component of the 30S subunit of the prokaryotic ribosome. It consists of about 1500 nucleotides. It possesses conserved and variable regions, which makes it important in phylogenetic studies because the conserved regions evolve slowly. Similar regions of interest in other organisms are the internal transcribed spacer (ITS) for fungi, 18S for microbial eukaryotes, and, for you and me, the 28S and 18S fragments.

Q: What are GC and GA content? A: These are common bioinformatics metrics that can help us find out the origin of a DNA sample whose source we don't know. A higher GC content could mean contamination of a sample with a microbe. For most organisms it is about 50 percent, with some exceptions in some regions. Read more.
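As a rough illustration of how these metrics can be computed, the sketch below counts bases directly. It assumes that GC content means the fraction of bases that are G or C, and that GA content means the fraction that are G or A; the function names are hypothetical and not taken from the project's code.

```python
def base_fraction(seq, bases):
    """Fraction of characters in `seq` that belong to `bases` (case-insensitive)."""
    seq = seq.upper()
    if not seq:
        return 0.0  # avoid dividing by zero on an empty sequence
    return sum(1 for base in seq if base in bases) / len(seq)


def gc_content(seq):
    """Fraction of G and C bases in the sequence."""
    return base_fraction(seq, "GC")


def ga_content(seq):
    """Fraction of G and A bases (assumed meaning of "GA content" here)."""
    return base_fraction(seq, "GA")


if __name__ == "__main__":
    example = "ATGCGC"  # made-up sequence, for illustration only
    print(gc_content(example), ga_content(example))
```

Multiplying the result by 100 gives the percentage form usually quoted for GC content.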

Q: What are k-mers? A: Nucleotide subsequences of a fixed length k. Read more here & here. They are important for the read assembly problem, that is, starting from small reads and bringing them together using graphs.
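A small sketch of k-mer extraction, assuming overlapping windows that slide one base at a time; the function names `kmers` and `kmer_counts` are made up for this example.

```python
from collections import Counter


def kmers(seq, k):
    """Return every overlapping substring of length k, in order of appearance."""
    if k <= 0:
        raise ValueError("k must be a positive integer")
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def kmer_counts(seq, k):
    """Count how many times each k-mer occurs in the sequence."""
    return Counter(kmers(seq, k))


if __name__ == "__main__":
    print(kmers("ATGCA", 3))
    print(kmer_counts("AAAA", 2))
```

A sequence shorter than k yields an empty list. In graph-based assembly, k-mers that overlap by k-1 bases are linked to form the graph that the reads are stitched along.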

Q: What are motifs? A: Patterns or regions that are conserved over evolutionary time and are presumed to be important in function, i.e. a biologically important region of a protein. N.B. the sequences we are using are especially useful for finding these regions, which we can then study further.

Q: What is gene therapy? A: Modifying a cell for therapeutic effect. For example, editing a bacterial plasmid to produce insulin.

Q: What is CRISPR? A: It stands for Clustered Regularly Interspaced Short Palindromic Repeats (CRISPR). The repeats are about 23 to 47 base pairs long. It can help us cut and paste regions of the genetic material.

Q: What is a paralogous gene? A: Paralogous genes are genes related via a gene duplication event [they diverge and do not necessarily share function]. Credit: N.J. Provart, Bioinformatic Methods I and II, Coursera.
