This directory contains the raw source data for each dataset, acquired from supplemental information sections or directly from authors. It also contains the source AAindex database.
Dataset | Reference | First Author | Year | Acquired From | Link |
---|---|---|---|---|---|
avgfp | Local fitness landscape of the green fluorescent protein | Sarkisyan | 2016 | Associated data on figshare, amino_acid_genotypes_to_brightness.tsv | Link |
bgl3 | Dissecting enzyme function with microfluidic-based deep mutational scanning | Romero | 2015 | Sequence read archive | Link |
gb1 | A comprehensive biophysical description of pairwise epistasis throughout an entire protein domain | Olson | 2014 | Supplemental information, Table S2 | Link |
pab1 | Deep mutational scanning of an RRM domain of the Saccharomyces cerevisiae poly(A)-binding protein | Melamed | 2013 | Supplemental material, Supp Table 2 and Supp Table 5 | Link |
ube4b | Activity-enhancing mutations in an E3 ubiquitin ligase identified by high-throughput mutagenesis | Starita | 2013 | Supporting Information, Dataset_S01, nscor_log2_ratio | Link |
AAindex database: https://www.genome.jp/aaindex/
We processed this raw data into a uniform format that can be used to train models with our codebase. The processed data is contained in the data directory. The scripts we used to process the raw datasets and the AAindex database are provided for reference.
For Bgl3, we used bowtie2 to align the raw sequencing reads (linked above) and computed variant counts from the resulting sequence alignment maps. The Bgl3 source data directory contains text files with all the variant reads in both the "unlabeled" set (the initial library) and the "positive" set (the library after selection for function).
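
As a rough illustration of how per-variant counts could be tallied from the two read sets, here is a minimal Python sketch. It is not the script used to produce the processed data. It assumes each read is given as one variant string per line; the helper names `count_variants` and `merge_counts` and the mutation notation in the toy data are hypothetical, and the actual layout of the Bgl3 text files may differ.

```python
from collections import Counter
from typing import Dict, Iterable, Tuple


def count_variants(lines: Iterable[str]) -> Counter:
    """Count how many reads were observed for each variant string.

    Assumes one variant per line (hypothetical layout); blank lines are skipped.
    """
    counts: Counter = Counter()
    for line in lines:
        variant = line.strip()
        if variant:
            counts[variant] += 1
    return counts


def merge_counts(unlabeled: Counter, positive: Counter) -> Dict[str, Tuple[int, int]]:
    """Pair up counts as {variant: (unlabeled_count, positive_count)}.

    A variant missing from one set gets a count of 0 for that set.
    """
    variants = set(unlabeled) | set(positive)
    return {v: (unlabeled[v], positive[v]) for v in variants}


# Toy reads standing in for the contents of the two text files.
unlabeled_reads = ["A12G", "A12G", "T45S", "", "A12G,T45S"]
positive_reads = ["A12G", "T45S", "T45S"]

unlabeled = count_variants(unlabeled_reads)
positive = count_variants(positive_reads)
table = merge_counts(unlabeled, positive)

for variant in sorted(table):
    print(variant, *table[variant])
```

To run this on real files, pass an open file handle to `count_variants` in place of the toy lists, since a file object iterates line by line.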