Source data

This directory contains the raw source data for each dataset, acquired from supplemental information sections or directly from authors. It also contains the source AAindex database.

| Dataset | Reference | First Author | Year | Acquired From | Link |
| --- | --- | --- | --- | --- | --- |
| avgfp | Local fitness landscape of the green fluorescent protein | Sarkisyan | 2016 | Associated data on figshare, amino_acid_genotypes_to_brightness.tsv | Link |
| bgl3 | Dissecting enzyme function with microfluidic-based deep mutational scanning | Romero | 2015 | Sequence read archive | Link |
| gb1 | A comprehensive biophysical description of pairwise epistasis throughout an entire protein domain | Olson | 2014 | Supplemental information, Table S2 | Link |
| pab1 | Deep mutational scanning of an RRM domain of the Saccharomyces cerevisiae poly(A)-binding protein | Melamed | 2013 | Supplemental material, Supp Table 2 and Supp Table 5 | Link |
| ube4b | Activity-enhancing mutations in an E3 ubiquitin ligase identified by high-throughput mutagenesis | Starita | 2013 | Supporting Information, Dataset_S01, nscor_log2_ratio | Link |

AAindex database: https://www.genome.jp/aaindex/

We processed this raw data into a uniform format that can be used to train models with our codebase. The processed data is in the data directory. The scripts we used to process the raw data and the AAindex database are provided for reference.

For Bgl3, we used bowtie2 to align the raw sequencing reads (linked above) and computed variant counts from the resulting sequence alignment maps. The Bgl3 source data directory contains text files listing all the variant reads in both the "unlabeled" set (the initial library) and the "positive" set (after selection for function).
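
As a rough illustration of how per-variant counts could be tallied from the Bgl3 variant read files, here is a minimal Python sketch. It is not the processing script used for this repository. It assumes each text file holds one variant read per line, and the file names (`unlabeled.txt`, `positive.txt`) and the variant notation (`A12G`) are hypothetical placeholders; check the actual files before relying on it.

```python
from collections import Counter
from pathlib import Path
import tempfile


def count_variants(path):
    """Count how often each variant appears in a text file.

    Assumption: one variant read per line; blank lines are skipped.
    """
    counts = Counter()
    with open(path) as handle:
        for line in handle:
            variant = line.strip()
            if variant:
                counts[variant] += 1
    return counts


def merge_counts(unlabeled, positive):
    """Pair up counts as {variant: (unlabeled_count, positive_count)}.

    Variants missing from one set get a count of 0 in that set.
    """
    variants = sorted(set(unlabeled) | set(positive))
    return {v: (unlabeled.get(v, 0), positive.get(v, 0)) for v in variants}


if __name__ == "__main__":
    # Toy files standing in for the real variant read files (hypothetical names).
    with tempfile.TemporaryDirectory() as tmp:
        unlabeled_path = Path(tmp) / "unlabeled.txt"
        positive_path = Path(tmp) / "positive.txt"
        unlabeled_path.write_text("A12G\nA12G\nT5S\n")
        positive_path.write_text("A12G\nK7R\n")

        table = merge_counts(
            count_variants(unlabeled_path),
            count_variants(positive_path),
        )
        for variant, (n_unlabeled, n_positive) in table.items():
            print(variant, n_unlabeled, n_positive)
```

Keeping the two sets as separate counters and merging them afterwards means a variant that appears in only one set still gets an explicit zero in the other.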