This project demonstrates the application of Machine Learning, specifically deep neural networks, in identifying transcription-factor binding sites in DNA sequences. Utilizing a synthetic dataset of DNA sequences, the goal is to predict the presence of a specific transcription factor indicated by labels (1 or 0) using a developed machine learning model.
The dataset consists of DNA sequences that are either regulated by a specific transcription factor (label 1) or not (label 0).
Data Loading and Preprocessing: The data is loaded and preprocessed using one-hot encoding to transform the DNA sequences into a matrix format, suitable for machine learning algorithms.
The project involves building and training a deep neural network, with convolutional layers to detect motifs indicative of binding sites.
The model is evaluated on a test dataset, with performance metrics such as accuracy and a confusion matrix being computed.
To interpret the model, saliency maps are computed to identify which parts of the DNA sequences most influence the model's predictions.
The repository contains a Jupyter Notebook with detailed steps and code for data preprocessing, model building, and evaluation. Users can experiment with different model architectures and parameters.
Python Libraries: TensorFlow, Keras, Pandas, Scikit-Learn, Matplotlib, Seaborn
Clone the repository. Install required libraries. Run the Jupyter Notebook, following the instructions and completing the incomplete cells. Experiment with modifying the model and preprocessing steps to explore different outcomes.