Gene regulatory sites, such as transcription factor binding sites (TFBS) and promoters, are extremely important regions within both eukaryotic and prokaryotic genomes. Predicting whether a site acts as a regulatory element is an important yet surprisingly difficult task. In recent years, considerable effort has gone into building machine learning (ML) approaches that detect these genomic regions automatically. In this hackathon, we hope to experiment with some of these tools.
Our goals during hackseq19 are to:
- a) Build an accurate classifier for a given gene regulation dataset.
- b) Build an interpretable classifier that outputs useful rules, describing each dataset.
We will experiment with many different classifiers, including decision trees, random forests, support vector machines, and neural networks. Accuracy is measured using F1 score, which we can visualize on our leaderboard (see below). Interpretability is measured by how clearly we can deduce rules from our dataset. An example rule:
IF Position[2] == "G" AND Position[3] == "C" THEN Class == "TFBS"
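As a rough illustration of how such a rule and the F1 metric fit together, here is a minimal Python sketch. It is not the project's actual code: the 0-based reading of `Position[i]`, the `"non-TFBS"` label, the function names, and the toy sequences are all assumptions made for the example.

```python
from typing import Sequence


def example_rule(seq: str) -> str:
    """Apply: IF Position[2] == "G" AND Position[3] == "C" THEN Class == "TFBS".

    Assumption: positions are 0-based string indices, and anything that
    does not match the rule gets the hypothetical label "non-TFBS".
    """
    if len(seq) > 3 and seq[2] == "G" and seq[3] == "C":
        return "TFBS"
    return "non-TFBS"


def f1_score(y_true: Sequence[str], y_pred: Sequence[str], positive: str = "TFBS") -> float:
    """Compute F1 = 2 * precision * recall / (precision + recall) for one class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    if tp == 0:
        # No true positives: precision and recall are zero or undefined.
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy data, invented purely for illustration (not from the hackathon datasets).
sequences = ["ATGCAA", "CCGCTT", "ATTTAA", "GGGCAT", "TTGATC"]
labels = ["TFBS", "TFBS", "non-TFBS", "non-TFBS", "TFBS"]

predictions = [example_rule(s) for s in sequences]
score = f1_score(labels, predictions)
print(predictions)
print(f"F1 = {score:.3f}")
```

A single hand-written rule like this is easy to read but will usually score lower than a full classifier, which is the trade-off between goals (a) and (b).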
Our leaderboard page is available here. You must sign in with your Google account. Once signed in, you can choose your username and submit files to the leaderboard. The leaderboard is a modified version of the website used by my Natural Language Processing course professor.
Datasets:
- 1 Human Chromosome #1 TFBS
- 1 E. coli K-12 TFBS
- 2 E. coli K-12 Promoter Region
- 1 Pokemon
These come from a variety of sources, including gene regulation databases and previous Kaggle competitions.
The following graph shows our progress in improving classifier accuracy over the course of hackseq19. The x-axis is measured in hours since the start of our hackathon, and the y-axis is measured in F1 score. We annotated the times when we noticeably improved our position on the leaderboard. Dashed lines mark our "oracle", the highest recorded accuracy in the literature. As you can see, we beat the oracle score for Human SP1 TFBS!