Rule Based Learning for Transcriptional Regulation

This is the GitHub repo for the hackseq19 project: Rule Based Learning for Transcriptional Regulation!

Rationale

Gene regulatory sites, such as transcription factor binding sites (TFBS) and promoters, are important regions within both eukaryotic and prokaryotic genomes. Predicting whether a site acts as a regulatory element is an important yet surprisingly difficult task. In recent years, much work has focused on building machine learning (ML) approaches for automatically detecting these genomic regions. In this hackathon, we hope to experiment with some of these tools.

Goals

Our goals during hackseq19 are to:

a) Build an accurate classifier for a given gene regulation dataset.
b) Build an interpretable classifier that outputs useful rules describing each dataset.

We will experiment with many different classifiers, including decision trees, random forests, support vector machines, and neural networks. Accuracy is measured using F1 score, which we can visualize on our leaderboard (see below). Interpretability is measured by how clearly we can deduce rules from our dataset. An example rule:

IF Position[2] == "G" AND Position[3] == "C" THEN Class == "TFBS"
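
As a rough illustration of how such a rule could be applied and scored, here is a minimal Python sketch. It is not the project's actual code: the function names, the 0-based reading of `Position[...]`, the `"non-TFBS"` negative label, and the toy sequences are all assumptions made for this example.

```python
from typing import Sequence


def example_rule(seq: str) -> str:
    """Apply: IF Position[2] == "G" AND Position[3] == "C" THEN Class == "TFBS"."""
    # Assumption: positions are 0-based indices into the sequence string.
    if len(seq) > 3 and seq[2] == "G" and seq[3] == "C":
        return "TFBS"
    # Assumption: anything the rule does not match gets a hypothetical negative label.
    return "non-TFBS"


def f1_score(y_true: Sequence[str], y_pred: Sequence[str], positive: str = "TFBS") -> float:
    """F1 = harmonic mean of precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0  # no true positives: precision and recall are both zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Toy data, invented purely for illustration (not taken from the project datasets).
sequences = ["ATGCA", "TTGCG", "ACGTA", "GGGCT", "AAGCT"]
labels = ["TFBS", "TFBS", "TFBS", "non-TFBS", "non-TFBS"]

predictions = [example_rule(s) for s in sequences]
f1 = f1_score(labels, predictions)
print(f"F1 = {f1:.3f}")  # prints "F1 = 0.571"
```

On this toy data the single rule finds two of the three positives but also flags both negatives, which shows why a readable rule still needs to be checked against F1.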

Data

Our leaderboard page is available here. You are required to sign in using your Google account. Once signed in, you can choose your username and submit files to the leaderboard. The leaderboard is based on a hacked version of my Natural Language Processing course professor's website.

Datasets:

1 Human Chromosome #1 TFBS
1 E. coli K12 TFBS
2 E. coli K12 Promoter Region
1 Pokemon

These come from a variety of sources, including gene regulation databases and previous Kaggle competitions.

Results

The following graph shows our progress in improving classifier accuracy over the course of hackseq19. The x-axis measures time in hours since the start of the hackathon, and the y-axis measures F1 score. We annotated the times when we noticeably improved our position on the leaderboard. Dashed lines represent our "oracle", the highest recorded accuracy in the literature. As you can see, we beat the oracle score for Human SP1 TFBS!

Team Members

Team Lead:
Alex Sweeten

Participants:
Aris Grout
Chahat Upreti
Jade Chen
Kate Gibson
Oriol Fornes
Priyanka Mishra
Shawn Hsueh
Zakhar Krekhno
