
## Level 1: Personal Research

The computational biology journey begins with you and the set of skills, tools, and practices you have in place to conduct your research. Taking the time to establish these building blocks well will pay off later, when you find yourself returning to previous analyses. Consider that your most important collaborator is your future self, be it tomorrow or several years from now. We devised a framework of four sequential steps to kickstart any computational biology project (Table @tbl:personal-tools).

Table: Steps involved in starting a computational biology project. {#tbl:personal-tools}

| Step | Use case | Common tools |
| --- | --- | --- |
| Step 1: Choose your programming languages | Interacting with a Unix/Linux HPC | Shell/Bash [@https://www.gnu.org/software/bash/] |
| | Data analysis | Python [@https://python.org], R [@https://r-project.org] |
| | Scripts and programs | Interpreted: Python [@https://python.org], R [@https://r-project.org], Perl [@https://perl.org], MATLAB [@https://www.mathworks.com/], Julia [@https://julialang.org/]; Compiled: C/C++ [@https://www.cplusplus.com/], Rust [@https://www.rust-lang.org/] |
| | Workflows | Linux-based: shell script, GNU Make [@https://www.gnu.org/software/make/]; Workflow management systems: Snakemake (Python) [@https://snakemake.github.io/], Nextflow (Groovy) [@https://www.nextflow.io/]; Workflow specifications: CWL [@https://www.commonwl.org/], WDL [@https://openwdl.org/] |
| Step 2: Define your project structure | Project structure | Templates: Cookiecutter Data Science [@https://drivendata.github.io/cookiecutter-data-science], rr-init [@https://github.com/Reproducible-Science-Curriculum/rr-init]; Workflows: Snakemake workflow template [@https://github.com/snakemake-workflows/snakemake-workflow-template] |
| | Virtual environment managers | Language-specific: virtualenv (Python) [@https://virtualenv.pypa.io/], renv (R) [@https://rstudio.github.io/renv/index.html]; Language-agnostic: Conda [@https://docs.conda.io/] |
| | Package managers | Language-specific: pip (Python) [@https://pip.pypa.io/], Bioconductor (R) [@https://www.bioconductor.org/], RStudio Package Manager (R) [@https://www.rstudio.com/products/package-manager/]; Language-agnostic: Conda [@https://docs.conda.io/] |
| Step 3: Choose your working set-up | Text editors | Desktop applications: Atom [@https://atom.io/], Sublime [@https://www.sublimetext.com/], Visual Studio Code [@https://code.visualstudio.com/], Notepad++ [@https://notepad-plus-plus.org/]; Command line: Vim [@https://www.vim.org/], GNU Emacs [@https://www.gnu.org/software/emacs] |
| | IDEs | For Python: JupyterLab [@https://jupyter.org/], JetBrains/PyCharm [@https://www.jetbrains.com/pycharm/], Spyder [@https://www.spyder-ide.org/]; For R: RStudio [@https://www.rstudio.com/] |
| | Notebooks | Jupyter (Python, R) [@https://jupyter.org/], R Markdown (R) [@https://rmarkdown.rstudio.com] |
| Step 4: Follow good coding practices | Coding style | Style guides: PEP 8 (Python) [@https://www.python.org/dev/peps/pep-0008/], Google (Python, R) [@https://github.com/google/styleguide]; Automatic code formatting: Black (Python) [@https://black.readthedocs.io/en/stable/], Snakefmt (Snakemake) [@https://github.com/snakemake/snakefmt] |
| | Literate programming | Markdown [@https://www.markdownguide.org/], R Markdown [@https://rmarkdown.rstudio.com] |
| | Version control | Version control system: Git [@https://git-scm.com]; Code repositories: GitHub [@https://github.com], GitLab [@https://gitlab.com], Bitbucket [@https://bitbucket.org]; Git GUIs: GitHub Desktop [@https://desktop.github.com/], GitKraken [@https://www.gitkraken.com/] |

### Step 1: Choose your programming languages

Different programming languages serve distinct purposes and have unique idiosyncrasies. As such, choosing a programming language for a specific project depends on your research goals, personal preferences, and skill set. Additionally, communities usually favor the use and teaching of some programming languages over others; adopting such languages may make it easier to integrate your work into the existing ecosystem.

Interacting with high-performance computing (HPC) clusters has become a hallmark of the data-intensive discipline of computational biology. HPC infrastructures commonly run Unix/Linux distributions as their operating system. To interact with these platforms, a command-line interpreter known as the shell must be used. There are multiple shells, with Bash [@https://www.gnu.org/software/bash/] among the most widely adopted. In addition to providing an interface, the shell is also a scripting language for manipulating files and executing programs through shell scripts. Unix/Linux operating systems have other useful perks, such as powerful, fast commands for searching and manipulating files (e.g., sed, grep, or join), as well as the AWK language, which can perform quick text processing and arithmetic operations.
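
As a taste of these tools in combination, the short sketch below counts genes per chromosome above a length threshold; the input file genes.tsv and its column layout are hypothetical.

```bash
#!/usr/bin/env bash
# Count genes per chromosome longer than 1 kb, assuming a hypothetical
# tab-separated file genes.tsv with columns: gene_id, chromosome, length.
grep -v '^#' genes.tsv \
    | awk -F '\t' '$3 > 1000 {print $2}' \
    | sort \
    | uniq -c
```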

One of the most common tasks of any computational biologist is data analysis, which usually involves data cleaning, exploration, manipulation, and visualization. Currently, Python [@https://python.org] is the most widely used programming language for data analysis [@https://insights.stackoverflow.com/survey/2021;@https://www.kaggle.com/kaggle/kaggle-survey-2021]. Python is also a popular language among computational biologists, a trend that will likely continue as machine learning and deep learning are more widely adopted in biological research. Python usage has been facilitated by the availability of packages for biological data analysis, accessible through package managers such as pip [@https://pip.pypa.io/] or Conda [@https://docs.conda.io/]. Likewise, R [@https://r-project.org] is another prominent language in the field. Arguably, one of R's main strengths is its wide array of tools for statistical analysis. Of particular interest is the Bioconductor repository [@https://www.bioconductor.org/], where many gold-standard tools for biological data analysis have been published and can be installed using BiocManager [@https://github.com/Bioconductor/BiocManager]. R usage in data science has benefited greatly from the Tidyverse packages [@doi:10.21105/joss.01686] and the surrounding community, which improve the readability of R syntax for both data manipulation (dplyr) and visualization (ggplot2).
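
As a minimal sketch of such an analysis in Python with pandas (the file expression.csv and its columns are hypothetical; the plotting call additionally requires matplotlib):

```python
import pandas as pd

# Load a hypothetical table of gene expression measurements
# with columns: gene, condition, tpm.
expression = pd.read_csv("expression.csv")

# Clean: drop rows with missing values.
expression = expression.dropna()

# Explore: summarize expression per experimental condition.
print(expression.groupby("condition")["tpm"].describe())

# Visualize: distribution of expression values per condition.
expression.boxplot(column="tpm", by="condition")
```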

Computational biologists often must write their own sets of instructions for processing data, in the form of scripts or tools. In computational biology, a script often refers to a lightweight single-file program written in an interpreted programming language and developed to perform a specific task. Scripts are quick to edit and can be run interactively, but at the expense of computational performance. To automate instructions on HPC clusters, shell scripts are commonly used. For other purposes, the most widely used scripting languages are Python [@https://python.org] and R [@https://r-project.org], though some researchers prefer Perl [@https://perl.org], MATLAB [@https://www.mathworks.com/], or Julia [@https://julialang.org/] for bioinformatics, systems biology, and statistics, respectively. A computational biology tool, on the other hand, is a more complex program designed to tackle computationally intensive problems, such as implementing new algorithms. Several tools devised for data-intensive biology have been written in compiled languages such as C/C++ [@https://www.cplusplus.com/]. In recent years, however, scientists have been turning to Rust [@https://www.rust-lang.org/] for its speed, memory safety, and active community [@doi:10.1038/d41586-020-03382-2]. When computational performance is less of a concern, Python and R are suitable alternatives for computational biology tool development.
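
For illustration, a self-contained Python script of this kind (a hypothetical example, not tied to any particular project) could look like:

```python
#!/usr/bin/env python3
"""Print the length of each sequence in a FASTA file."""
import sys


def sequence_lengths(path):
    """Yield (header, length) pairs from a FASTA file."""
    header, length = None, 0
    with open(path) as handle:
        for line in handle:
            line = line.rstrip()
            if line.startswith(">"):
                if header is not None:
                    yield header, length
                header, length = line[1:], 0
            else:
                length += len(line)
    if header is not None:
        yield header, length


if __name__ == "__main__":
    for header, length in sequence_lengths(sys.argv[1]):
        print(f"{header}\t{length}")
```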

Biological data processing is rarely a one-step process. To go from raw data to useful insights, several steps need to be taken in a specific order, accompanied by a plethora of decisions regarding parameters. Computational biologists have addressed this need by embracing workflow management systems to automate data analysis pipelines. A pipeline can be a shell script where commands are written sequentially, using shell variables and scripting syntax when needed. Although effective, this approach provides little control over the workflow and lacks features to run isolated parts of the pipeline or track changes in input and output files. To overcome these limitations, a shell script can be upgraded using the GNU Make [@https://www.gnu.org/software/make/] program, which was originally designed to automate the compilation and installation of software but is flexible enough to build workflows. More sophisticated bioinformatics workflow managers have also been developed, such as Snakemake [@https://snakemake.github.io/], based on Python, and Nextflow [@https://www.nextflow.io/], based on Groovy (a programming language for the Java virtual machine). These tools offer support for environment managers and software containers (discussed in Level 3) and allow pipelines to scale easily to both traditional HPC and modern cloud environments. Alternatively, declarative standards are available to define workflows in a portable, human-readable manner, such as the Common Workflow Language (CWL) [@https://www.commonwl.org/] and the Workflow Description Language (WDL, pronounced "widdle") [@https://openwdl.org/], used by the cloud computing platform AnVIL [@https://anvilproject.org/;@doi:10.1016/j.xgen.2021.100085]. Although these specifications are not executable on their own, they can be run in CWL- or WDL-enabled engines such as Cromwell [@https://cromwell.readthedocs.io/en/stable/].
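
To give a flavor of this declarative style, here is a minimal Snakemake rule (file names hypothetical); Snakemake infers that count_reads must run before all, and reruns it only when the input changes:

```python
# Snakefile — a minimal sketch with hypothetical file names.
rule all:
    input:
        "results/sample1.count.txt"

# Count the reads in a FASTQ file (four lines per read).
rule count_reads:
    input:
        "data/raw/sample1.fastq"
    output:
        "results/sample1.count.txt"
    shell:
        "echo $(( $(wc -l < {input}) / 4 )) > {output}"
```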

### Step 2: Define your project structure

The next step after choosing your programming languages—but before you start coding—is to develop an organized project structure. The project design should be intentional and tailored to the present and future needs of your project—remember to be kind to your future self! A computational biology project requires, at the very least, a folder structure that supports code, data, and documentation. Although tempting, cramming various file types into one single folder is unsustainable. Instead, separate files into different folders and subfolders, if needed. To simplify this process, base your project structure on research templates available off the shelf. For data science projects, the Python package Cookiecutter Data Science [@https://drivendata.github.io/cookiecutter-data-science] reduces the effort to a minimum. Running the tool launches a questionnaire in the terminal where you can input the project name, authors, and other basic information. The program then generates a folder structure that stores data—raw and processed—separately from notebooks and source code, as well as pre-made files for documentation such as a README, a docs folder, and a license. Similarly, the Reproducible Research Project Initialization (rr-init) offers a template folder structure that can be cloned from a GitHub repository and modified by the user [@https://github.com/Reproducible-Science-Curriculum/rr-init]. Although rr-init is slightly simpler, both follow a similar philosophy aimed at research correctness and reproducibility [@doi:10.1371/journal.pcbi.1000424]. For workflow automation projects, we advise following the Snakemake workflow template [@https://snakemake.readthedocs.io/en/stable/snakefiles/deployment.html;@https://github.com/snakemake-workflows/snakemake-workflow-template], storing each workflow in a dedicated folder divided into subfolders for workflow-related files, results, and configuration. In all cases, the folder must be initialized as a Git repository for version control (see Step 4).
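
For instance, scaffolding a project with Cookiecutter Data Science takes two commands; the tool then asks for the project name, author, and other details before generating the folder structure:

```bash
# Install the templating tool, then generate the project skeleton
# interactively from the Cookiecutter Data Science template.
pip install cookiecutter
cookiecutter https://github.com/drivendata/cookiecutter-data-science
```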

The software and dependencies needed to execute a tool or workflow are also part of the project structure. The intricacies of software installation and dependency management should not be underestimated. Fortunately, package and virtual environment managers significantly reduce this burden. A package manager is a system that automates the installation, upgrading, configuration, and removal of community-developed programs. A virtual environment manager is a tool that generates isolated environments where programs and dependencies are installed independently from other environments or the default operating system. Once a virtual environment is activated, a package manager can be used to install third-party programs. We believe that every computational biology project must start with its own virtual environment to boost reproducibility: environments record the project's dependencies and can restore them at will so the code can be run on any other computer. There are multiple options for both package and virtual environment management—some language-specific and some language-agnostic. If you are working with Python, you can initialize a Python environment using virtualenv [@https://virtualenv.pypa.io/] (where different Python versions can be installed). Inside the environment, you can use the Python package manager pip [@https://pip.pypa.io/] to install Python code from the Python Package Index (PyPI), from GitHub, or from local sources. For the R language, R-specific environments can be created using renv [@https://rstudio.github.io/renv/index.html], where packages can be installed via the install.packages function from the Comprehensive R Archive Network (CRAN) and CRAN-like repositories. R also has BiocManager to install packages from the Bioconductor repository, which contains relevant software for high-throughput genomic sequencing analysis. Additionally, RStudio Package Manager [@https://www.rstudio.com/products/package-manager/] works with third-party code available on CRAN, Bioconductor, GitHub, or locally. Conda [@https://docs.conda.io/]—a language-agnostic alternative—supports program installation from the Anaconda repository, which contains the Bioconda channel [@https://bioconda.github.io/] specifically tailored to bioinformatics software. Python dependencies can also be installed via pip inside a Conda environment. Conda is particularly helpful when working with third-party code in various languages—a common predicament in computational biology. The Conda package and environment manager is included in both the Anaconda and Miniconda distributions; the latter is a minimal version of Anaconda containing only Conda, Python, and a few useful packages.
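
As a sketch, a typical Conda session for a new project (the environment name and package choice below are hypothetical) looks like this:

```bash
# Create an isolated environment with a pinned Python version.
conda create --name rnaseq-project python=3.10

# Activate it; packages installed from now on stay inside it.
conda activate rnaseq-project

# Install bioinformatics software from the Bioconda channel.
conda install --channel bioconda samtools

# Export the environment so others (or future you) can recreate it.
conda env export > environment.yml
```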

### Step 3: Choose your working set-up

Before coding, a more practical question needs answering: where to code? The simplest tools for this purpose are text editors. Since writing code is ultimately writing text, any tool where characters can be typed fulfills this purpose. However, coding is streamlined by additional features—including syntax highlighting, indentation, and auto-completion—available in code editors such as Atom [@https://atom.io/], Sublime [@https://www.sublimetext.com/], Visual Studio Code [@https://code.visualstudio.com/], and Notepad++ [@https://notepad-plus-plus.org/] (Windows only). Command-line text editors such as Vim [@https://www.vim.org/] and Emacs [@https://www.gnu.org/software/emacs] are also suitable options for coding. These tools share the advantage of being language-agnostic, which is handy for the polyglot computational biologist.

In addition to text editors, integrated development environments (IDEs) are popular options for coding. In essence, IDEs are supercharged text editors comprising a code editor (with syntax highlighting, indentation, and suggestions), a debugger, a project file browser, and a way to execute your code (a compiler or interpreter). Many IDEs are language-specific, supporting only one language. The array of features also comes at a cost—IDEs typically use more memory. For Python, JupyterLab [@https://jupyter.org/], Spyder [@https://www.spyder-ide.org/], and PyCharm [@https://www.jetbrains.com/pycharm/] are popular options, while for R, RStudio [@https://www.rstudio.com/] is the gold standard. Notably, the line between an IDE and a code editor is somewhat blurry, particularly when a code editor is extended with plugins.

In recent years, notebooks have acquired relevance in computational biology research. A notebook is an interactive application that combines live code (executed in a read-eval-print loop, or REPL), narrative, equations, and visualizations, internally stored in JavaScript Object Notation (JSON) format. Common notebooks use interpreted languages such as Python or R, with the narrative usually written in Markdown—a lightweight markup language. Data analysis greatly benefits from using notebooks instead of plain text editors or even IDEs. The combination of visuals and text allows researchers to tell compelling stories about their data, and the interactivity of the code enables quick testing of different strategies. Jupyter [@https://jupyter.org/] is a popular web-based interactive notebook developed originally for Python but which also accepts R and other programming languages upon installation of their kernels—the computing engines that execute the notebook's live code under the hood. Jupyter notebooks can also be executed in the cloud using platforms such as Google Colaboratory (Colab) [@https://colab.research.google.com] and Amazon Web Services, taking advantage of the current trend toward cloud computing. In addition, RStudio can generate R-based notebooks known as R Markdown [@https://rmarkdown.rstudio.com], which are especially well suited for data analysis reports.

### Step 4: Follow good coding practices

With the foundation in place, the next step is to start writing code. Coding, however, requires good practices to ensure correctness, sustainability, and reproducibility for you, your future self, your collaborators, and the whole community. First and foremost, you need to make sure your code works correctly. In computational biology, correctness implies biological and statistical soundness. Although both topics are beyond the scope of this manuscript, a useful approach to evaluating biological correctness is to design positive and negative controls for your program, analysis, or workflow. In scientific experimentation, a positive control is a condition expected to produce a known effect, whereas a negative control is expected to produce none. The same approach can be applied to computation by using input data whose output is known in advance. Biological soundness can also be tested by quickly checking that intermediate and final files show the expected orders of magnitude. These checks can be packaged as unit tests (discussed in Level 2).
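
A minimal sketch of computational controls in Python (the function and sequences below are hypothetical):

```python
def gc_content(sequence):
    """Return the fraction of G and C bases in a DNA sequence."""
    sequence = sequence.upper()
    return (sequence.count("G") + sequence.count("C")) / len(sequence)


# Positive control: input whose expected output is known in advance.
assert gc_content("GGCC") == 1.0

# Negative control: input that should produce no signal.
assert gc_content("AATT") == 0.0

# Order-of-magnitude check: GC content must be a sensible fraction.
assert 0.0 <= gc_content("ACGTACGGTC") <= 1.0
```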

In addition to correctly functioning code, code appearance—also known as coding style—is important. Code style encompasses a series of small, ubiquitous decisions: where and how to add comments; indentation and white-space usage; variable, function, and class naming; and overall code organization. Although personality and preference shape how you code, just as they shape how you write, coding-style rules facilitate collaboration with your future self and others. Indeed, just as we sometimes struggle to read our own handwriting, we can struggle to read our own code if we disregard guidelines. At the very least, aim for internal consistency in your code. Even better, consider following one of the many published coding-style guides, such as those from software development teams. Google, for example, has guidelines for Python, R, Shell, C++, and HTML/CSS [@https://github.com/google/styleguide]. Guidelines for Python are available as part of the Python Enhancement Proposals (PEPs), known as PEP 8 [@https://www.python.org/dev/peps/pep-0008/]. To facilitate compliance, tools called linters can be incorporated into most code editors and IDEs to flag stylistic errors in your code based on a given style guide. Furthermore, many editors and tools perform automatic code formatting (e.g., Black [@https://black.readthedocs.io/en/stable/], which formats Python code to be PEP 8 compliant), which can greatly facilitate stylistic coherence in a collaborative project. In the case of Snakemake files, stylistic errors can be flagged using the Snakemake linter, invoked with the command snakemake --lint [@https://snakemake.readthedocs.io/en/stable/snakefiles/best_practices.html], or corrected automatically with the tool Snakefmt [@https://github.com/snakemake/snakefmt], which is based on Black.
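
For example, these tools can be invoked from the command line as follows (file names hypothetical; consult each tool's documentation for the exact options):

```bash
# Reformat a Python script in place to be PEP 8 compliant.
black analysis.py

# Flag stylistic issues in a Snakemake workflow.
snakemake --lint

# Automatically format a Snakefile with Snakefmt.
snakefmt Snakefile
```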

On the matter of code styling, two topics merit additional attention: variable naming and comments. Variable names should be descriptive enough to convey the content and use of the variable, function, or class. The goal is to produce self-documenting code that reads close to plain English. To do so, use multi-word names where necessary. The most common conventions include Camel Case, where the second and subsequent words are capitalized (camelCase); Pascal Case, where all words are capitalized (PascalCase); and Snake Case, where words are separated by underscores (snake_case). Notably, these conventions can coexist within the same coding style to differentiate variables, functions, and classes. For example, PEP 8 recommends Snake Case for functions and variables and Pascal Case for class names. As most modern code editors and IDEs autocomplete variable, function, and class names, there is no longer a valid excuse for cryptic one-character names (e.g., x, y, z) that save a few keystrokes.
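
Following the PEP 8 recommendations above, a short Python sketch of these conventions (all names hypothetical):

```python
# Constants: uppercase snake case.
MAX_READ_LENGTH = 150


# Classes: Pascal case.
class SequenceRecord:
    """A named DNA sequence."""

    def __init__(self, gene_name, sequence):
        # Variables and attributes: snake case, descriptive.
        self.gene_name = gene_name
        self.sequence = sequence


# Functions: snake case; the name states what the function does.
def filter_short_reads(reads, min_length):
    return [read for read in reads if len(read) >= min_length]
```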

In addition to careful variable naming, code comments—explanatory, human-readable statements not evaluated by the program—are necessary to enhance the code's readability. No matter how beautiful and well organized your code is, high-level code decisions will not be obvious unless stated. As a corollary, code explanations that can be deduced from the syntax itself should be omitted. Comments can span one or several lines and appear in three strategic places: at the top of the program file (header comment), describing what the code accomplishes and sometimes its author and date; above every function (function header), stating the purpose and behavior of the function; and in line, next to difficult code whose behavior is not obvious or warrants a remark.
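
The three kinds of comments might look as follows in Python (a hypothetical example):

```python
# normalize_counts.py
# Header comment: normalize raw read counts to counts per million (CPM)
# so that samples sequenced at different depths are comparable.


def counts_per_million(counts):
    """Function header: scale a sample's raw counts to counts per million.

    `counts` is a list of non-negative integers, one per gene.
    """
    total = sum(counts)
    # Inline comment: an empty sample would trigger a division by zero.
    if total == 0:
        return [0.0 for _ in counts]
    return [count / total * 1_000_000 for count in counts]
```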

Code-styling rules also apply to data science notebooks. When writing notebooks, however, you must also engage in literate programming—a programming paradigm in which code is accompanied by a human-readable explanation of its logic and purpose. In other words, notebooks must tell a story about the analysis, connecting the dots between the code, the results, and the figures. The human-readable narrative is usually written in Markdown [@https://www.markdownguide.org/] when working in Jupyter, or in R Markdown [@https://rmarkdown.rstudio.com] when working in R. Little has been written about good practices for literate programming, but we suggest stating the purpose of each section of code along with an interpretation of its results.

When working with a sizable codebase, we advise modular programming—the practice of subdividing a computer program into independent, interchangeable sub-programs, each tackling a specific piece of functionality. Modularity enhances code readability and reusability, and expedites testing and maintenance. In practice, modularity can be implemented at different levels, from using functions within a single-file program to separating functionality into different files in a more complex tool. In Python, the subdivisions are defined as follows: modules are collections of functions and global variables, packages are collections of modules, libraries are collections of packages, and frameworks are collections of libraries. Modules are files with a .py extension, while packages are folders containing several .py files, including one called `__init__.py` (which may be empty) that allows the Python interpreter to recognize the folder as a package.
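
A minimal sketch of this layout (package and module names hypothetical):

```python
# Layout on disk:
#   seqtools/            <- package folder
#       __init__.py      <- marks the folder as a package (may be empty)
#       io.py            <- module for reading and writing sequence files
#       stats.py         <- module for summary statistics

# seqtools/stats.py defines, for example:
def mean_length(sequences):
    """Return the mean length of a collection of sequences."""
    return sum(len(seq) for seq in sequences) / len(sequences)

# Any other script can then reuse it:
#   from seqtools.stats import mean_length
```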

Finally, there is version control, one of the most important personal practices. Version control entails tracking and managing changes to your code. A popular version-control system is Git [@https://git-scm.com], which requires a folder to be initialized as a Git repository, after which changes to any file inside are tracked. File modifications must be staged (using git add) and then committed (using git commit). Each commit serves as a snapshot of your project at that point, which you can review or recover later (using git checkout). Additionally, version control allows you to safely try new features in branches (using git branch and git checkout)—independent copies of the original branch (known as main) that you can optionally merge back into it. Currently, multiple hosting services provide online storage of Git repositories, such as GitHub [@https://github.com], GitLab [@https://gitlab.com], and Bitbucket [@https://bitbucket.org], which users can navigate in a web browser or via a graphical user interface (GUI) such as GitHub Desktop [@https://desktop.github.com/] or GitKraken [@https://www.gitkraken.com/]. These platforms have the additional benefit of backing up your code in the cloud, keeping your work safe and shareable—especially relevant for collaboration.
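
A typical first session with Git might look like this (file and branch names hypothetical):

```bash
# Turn the project folder into a Git repository.
git init

# Stage and commit a snapshot of the current state.
git add analysis.py
git commit -m "Add initial analysis script"

# Safely try a new idea on a separate branch.
git checkout -b try-new-normalization

# Merge the branch back into main once it works.
git checkout main
git merge try-new-normalization
```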