
Contributing

Bogdan Kirilenko edited this page Oct 1, 2023 · 4 revisions

We're delighted that you're considering contributing to TOGA! This project is the culmination of significant effort and dedication, and your interest in aiding its further development is greatly appreciated.

The main contributor, Bogdan Kirilenko, no longer works in Michael Hiller's lab but continues maintaining the project as an open-source initiative. Bogdan would be happy to review and pull changes from your forks.

Discussion Before Major or Scientific Changes: We welcome contributions from experienced Python developers. However, any major or scientific change that could affect the results, such as an alteration to the logic of the orthology graph, should first be discussed with the scientific advisor, Michael. This ensures that the scientific integrity of the project remains intact.

Areas for Contribution

  1. Enhancement of Testing and Quality Control: Our goal is to refine our testing procedures and introduce robust quality control measures. For instance, the system should be able to flag potential issues, such as an unusually low number of orthologs for a large set of genes. For now, there is only a simple procedure that checks the number of crashed CESAR jobs; a module that verifies every step would be a significant improvement.

  2. Improvement of Logging: With TOGA operating numerous parallel processes, enhancing our logging system, especially for parallel steps, is crucial. If you have expertise in Python logging in a multi-process environment, your insights would be extremely beneficial.

  3. Code Refactoring and Bug Fixes: The current codebase could benefit from better organization and bug fixes. If you notice duplicated code (such as two almost identical functions), unused functions, or branches that are never executed, feel free to suggest refactorings. Contributions like this one that fixed the U12 argument correctness check, or this one that added an argument to control the prefix of TOGA-predicted genes, are highly valued.

  4. Documentation: While Bogdan and Michael are primarily responsible for this, if you have a comprehensive understanding of the project and can contribute to the documentation, your input would be highly valuable.

  5. Parallelization Module Improvements: TOGA executes many jobs in parallel, and enhancing the stability or flexibility of our parallelization module is highly appreciated. For instance, making the Nextflow workflow more stable or introducing additional parallelization strategies could significantly improve the performance and reliability of TOGA.

  6. Containerization: Containerization could be immensely helpful, but it is challenging. The goal is to ensure that TOGA, when executed on a cluster, can run parallel jobs efficiently on systems like Slurm despite potential permission hurdles. Given these constraints, Singularity might be a more fitting containerization solution, as it is designed for HPC environments and can operate across various cluster configurations.

  7. Code Improvements: There's an ongoing effort to enhance the code structure, for instance, by collecting all project-related constants and moving them into a constants.py module. Bogdan has initiated this process, and if you could continue this effort, it would be fantastic. Streamlining the code in such a way aids in maintaining a clean and organized codebase, which is crucial for the project’s sustainability and ease of understanding for all contributors.
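As an illustration of the quality-control idea in item 1, the sketch below flags a run in which too few genes received an ortholog. It is a minimal sketch under stated assumptions: the function name `check_ortholog_fraction`, the input shape (a mapping from gene ID to the number of orthologs found), and the 50% default threshold are all hypothetical. They do not reflect TOGA's actual output format or any recommended cutoff.

```python
from typing import Dict, List


def check_ortholog_fraction(
    orthologs_per_gene: Dict[str, int],
    min_fraction: float = 0.5,  # hypothetical threshold, not a TOGA default
) -> List[str]:
    """Return warning messages; an empty list means the check passed."""
    total = len(orthologs_per_gene)
    if total == 0:
        return ["No genes were provided, nothing to check."]

    # Count genes for which at least one ortholog was reported.
    with_ortholog = sum(1 for n in orthologs_per_gene.values() if n > 0)
    fraction = with_ortholog / total

    warnings = []
    if fraction < min_fraction:
        warnings.append(
            f"Only {with_ortholog}/{total} genes ({fraction:.0%}) have an "
            f"ortholog; expected at least {min_fraction:.0%}."
        )
    return warnings


if __name__ == "__main__":
    # Made-up example data, for illustration only.
    counts = {"GENE_A": 1, "GENE_B": 0, "GENE_C": 0, "GENE_D": 0}
    for message in check_ortholog_fraction(counts):
        print("WARNING:", message)
```

Returning a list of messages rather than raising an exception would let a verification module run several such checks and report all of them at the end of a run.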
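For the logging improvement in item 2, one common standard-library pattern is to have every worker process send its log records to a shared queue, while a single listener in the parent process writes them out. The sketch below shows that pattern with `logging.handlers.QueueHandler` and `QueueListener`. The function names `worker` and `run_jobs` and the logger name `job.<id>` are hypothetical, and this is not how TOGA currently configures logging.

```python
import logging
import logging.handlers
import multiprocessing as mp


def worker(queue, job_id):
    """Body of one parallel job: log only through the shared queue."""
    logger = logging.getLogger(f"job.{job_id}")  # hypothetical logger name
    logger.setLevel(logging.INFO)
    logger.handlers.clear()  # drop handlers inherited from the parent
    logger.addHandler(logging.handlers.QueueHandler(queue))
    logger.propagate = False  # avoid duplicates via the root logger
    logger.info("job %d started", job_id)
    # ... the real work of the job would go here ...
    logger.info("job %d finished", job_id)


def run_jobs(num_jobs, handler):
    """Start num_jobs processes and funnel their records into handler."""
    queue = mp.Queue()
    # The listener runs in a background thread of the parent process and
    # is the only place where records are actually written.
    listener = logging.handlers.QueueListener(queue, handler)
    listener.start()
    try:
        processes = [
            mp.Process(target=worker, args=(queue, job_id))
            for job_id in range(num_jobs)
        ]
        for process in processes:
            process.start()
        for process in processes:
            process.join()
    finally:
        listener.stop()  # flushes the records still waiting in the queue
```

From a script, this could be called as `run_jobs(3, logging.StreamHandler())` inside an `if __name__ == "__main__":` guard. Because only the listener writes to the handler, lines from different processes do not interleave within a single record.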

Code Organization

Below is a brief outline of the project's structure and the purpose of key files and directories:

  • toga.py: The main entry point of the application. Contains the Toga class responsible for executing the pipeline steps via the run function.
  • configure.sh: A script that installs dependencies, downloads CESAR2.0 from a submodule, builds the C code into shared libraries, and runs the scripts that train the XGBoost model.
  • parallel_jobs_manager.py: Holds classes for executing jobs in parallel.
  • modules/: Directory housing modules necessary for running the pipeline.
  • supply/: Contains post-processing scripts and resources.
  • nextflow_config_files/: Holds templates for Nextflow config.
  • ucsc_browser_visualisation/: Code for the UCSC genome browser plugin.

If you have any questions regarding the purpose of a specific file or need further clarification on code organization, feel free to reach out and we'll update this page accordingly.

Suggestions for organizational improvements are always welcome!
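To make the idea of an additional parallelization strategy more concrete, the sketch below defines a small abstract interface and one local implementation built only on the standard library. The names `JobStrategy`, `LocalStrategy`, `run_all`, and `count_failed` are hypothetical. They are not the classes defined in parallel_jobs_manager.py and only illustrate one possible shape for such an extension.

```python
import subprocess
import sys
from abc import ABC, abstractmethod
from concurrent.futures import ThreadPoolExecutor
from typing import List, Sequence


class JobStrategy(ABC):
    """Hypothetical interface: one subclass per way of running jobs."""

    @abstractmethod
    def run_all(self, commands: Sequence[Sequence[str]]) -> List[int]:
        """Run every command and return exit codes in input order."""


class LocalStrategy(JobStrategy):
    """Runs commands on the local machine, a few at a time."""

    def __init__(self, max_workers: int = 4) -> None:
        self.max_workers = max_workers

    def run_all(self, commands: Sequence[Sequence[str]]) -> List[int]:
        def run_one(command: Sequence[str]) -> int:
            # check=False: a failing job is reported, not raised.
            return subprocess.run(list(command), check=False).returncode

        # Threads suffice here because the work happens in subprocesses.
        with ThreadPoolExecutor(max_workers=self.max_workers) as pool:
            return list(pool.map(run_one, commands))


def count_failed(exit_codes: Sequence[int]) -> int:
    """Number of jobs that ended with a non-zero exit code."""
    return sum(1 for code in exit_codes if code != 0)


if __name__ == "__main__":
    # Two toy jobs: one succeeds, one exits with code 3.
    jobs = [
        [sys.executable, "-c", "import sys; sys.exit(0)"],
        [sys.executable, "-c", "import sys; sys.exit(3)"],
    ]
    codes = LocalStrategy(max_workers=2).run_all(jobs)
    print("failed jobs:", count_failed(codes))
```

With an interface of this kind, a cluster-specific strategy could be added as another subclass without touching the code that builds the job list.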

Note on External Dependencies

TOGA has been designed to minimize dependency issues often encountered in bioinformatics projects. We've strived to keep the number of dependencies as low as possible to facilitate easy installation. Therefore, we prefer contributions that do not necessitate additional packages. If adding a new dependency is unavoidable, we favor more stable and widely used packages, such as numpy or pandas.