Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: support for samples w/ and w/o primers #344

Merged
merged 6 commits into from
Dec 3, 2024

Conversation

FelixMoelder
Copy link
Contributor

@FelixMoelder FelixMoelder commented Dec 3, 2024

In some experiments, there are samples with and without primers. However, this case was not previously supported because the workflow always expects a panel containing primers to be provided in the samples.tsv file. To address this, the workflow now dynamically checks if a panel is provided for primer trimming. Nevertheless, primer trimming must still be activated in the configuration file.

We should consider whether the primer trimming section in the configuration file is still necessary, or if trimming can be inferred from the presence of a panel in the samplesheet. However, we should retain the primer section for setting a custom library length and specifying the paths to the primer files.

Summary by CodeRabbit

  • New Features

    • Introduced a function to check for primer presence in sample data.
    • Enhanced logic for handling sample data and file outputs.
    • Added new requirements for sample and unit sheets in configuration.
    • Enabled primer trimming and mutational burden estimation in configuration files.
    • Activated both delly and freebayes for variant calling in multiple configurations.
  • Bug Fixes

    • Improved error handling for sample and unit combinations.
  • Chores

    • Removed environment specification from the merge_untrimmed_fastqs rule.
    • Added execution order dependency between apply_bqsr and bam_index rules.
    • Updated comments for clarity throughout configuration files.

Copy link

coderabbitai bot commented Dec 3, 2024

Walkthrough

The changes in this pull request involve significant modifications to the Snakemake workflow, particularly in the workflow/rules/common.smk and workflow/rules/mapping.smk files. Key updates include the introduction of new functions for handling sample data, restructuring existing functions for improved clarity, and refining error handling. Additionally, the conda directive was removed from the merge_untrimmed_fastqs rule, and a rule order was established between apply_bqsr and bam_index, ensuring proper execution sequencing.

Changes

File Path Change Summary
workflow/rules/common.smk - Added sample_has_primers function.
- Updated get_recalibrate_quality_input, get_trimming_input, get_aligner, and get_consensus_input function signatures.
- Enhanced error handling in get_raw_reads function.
- Streamlined output file path generation in get_final_output.
workflow/rules/mapping.smk - Removed conda directive from merge_untrimmed_fastqs rule.
- Added ruleorder: apply_bqsr > bam_index to enforce execution order.
config/README.md - Added new requirements for panel, mutational_burden_events, umi_read, and umi_read_structure columns in sample and unit sheets.
.test/config-chm-eval/config.yaml - Removed activate key in primers section.
- Activated delly and freebayes in calling section.
.test/config-giab/config.yaml - Removed activate key in primers section; paths for primer files set to empty.
- No other changes.
.test/config-no-candidate-filtering/config.yaml - Removed activate key in primers section.
- Added with_aux key in filter subsection.
- Updated final_calls in vep subsection.
.test/config-simple/config.yaml - Set primers.trimming.activate to false.
- Activated mutational_burden and report sections.
.test/config-sra/config.yaml - Removed activate key in primers section.
- Activated mutational_burden.
.test/config-target-regions/config.yaml - Changed primers.trimming.activate to true.
- Activated annotations.vcfs.
.test/config-target-regions/config_multiple_beds.yaml - Removed activate key in primers section.
- Activated mutational_burden.
.test/config_primers/config.yaml - Removed activate key in primers.trimming.
config/config.yaml - Set primers.trimming.activate and mutational_burden.activate to true.
- Updated paths for primers_fa1 and primers_fa2 to empty strings.

Possibly related PRs

  • fix: fix excluding events from the report #332: The changes in this PR involve substantial modifications to the workflow/rules/common.smk file, particularly in the get_final_output function, which is related to the restructuring of functions and enhancements to data handling logic in the main PR. Both PRs focus on improving output generation and managing sample data effectively.

Suggested reviewers

  • johanneskoester
  • dlaehnemann

Poem

In the workflow's dance, we refine and align,
Functions and rules, all crafted in time.
With samples and paths, we tidy the way,
A hop and a skip, new logic at play!
Errors now caught, like carrots in sight,
Our Snakemake's ready, all set for the flight! 🐇✨


Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Outside diff range and nitpick comments (1)
workflow/rules/common.smk (1)

263-267: Ensure get_sample_datatype returns expected values

The get_aligner function defaults to 'bwa' when the sample datatype is not 'rna'. If get_sample_datatype can return unexpected or invalid datatypes, consider handling such cases explicitly or raising an error to prevent unintended behavior.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 185b2fe and 710b001.

📒 Files selected for processing (2)
  • workflow/rules/common.smk (4 hunks)
  • workflow/rules/mapping.smk (0 hunks)
💤 Files with no reviewable changes (1)
  • workflow/rules/mapping.smk
🔇 Additional comments (4)
workflow/rules/common.smk (4)

Line range hint 422-426: Function get_markduplicates_input logic is appropriate

The function correctly determines the input path for duplicate marking based on whether the sample has UMIs and the aligner used.


431-441: Logic in get_recalibrate_quality_input function is acceptable

The function appropriately selects the input file for quality recalibration based on the sample datatype and whether certain features are activated.


443-448: Function get_consensus_input correctly handles primer trimming

The function properly checks if primer trimming is activated and if the sample has primers, to determine the appropriate consensus input.


451-459: Function get_trimming_input logic is sound

The function correctly determines the trimming input based on whether duplicate removal is activated and uses the appropriate aligner.

workflow/rules/common.smk Outdated Show resolved Hide resolved
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (1)
config/README.md (1)

49-49: Consider enhancing the primer trimming documentation.

While the new behavior is clearly documented, consider adding these clarifications to help users:

  1. Explicitly state that primer trimming must still be enabled in config.yaml even when panels are specified
  2. Add a note about the impact on existing workflows (e.g., "Note: Existing workflows that rely on primer trimming should ensure their samples have panels specified")
  3. Consider adding an example showing the relationship between panel specification and primer trimming behavior

Here's a suggested expansion of the documentation:

 If a panel is not provided for a sample, trimming will not be performed on that sample. 
+Note: Primer trimming must still be explicitly enabled in config.yaml, even when panels are specified.
+
+Example:
+| sample_name | panel      | Trimming Behavior                |
+|-------------|------------|----------------------------------|
+| sample1     | panel1.bed | Trimming performed if enabled    |
+| sample2     | NA         | No trimming, regardless of config|
+
+Note: If you're updating an existing workflow that relies on primer trimming,
+ensure that all relevant samples have panels specified in samples.tsv.
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 710b001 and 8bccfa4.

📒 Files selected for processing (1)
  • config/README.md (1 hunks)
🔇 Additional comments (1)
config/README.md (1)

49-49: LGTM! Clear documentation of the new behavior.

The added line clearly explains that primer trimming will be skipped for samples without a panel, which aligns with the PR's objective of supporting samples with and without primers.

Copy link
Contributor

@johanneskoester johanneskoester left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor things, nice work! I vote for removing the activate mechanism for primer trimming in the config then.

workflow/rules/common.smk Outdated Show resolved Hide resolved
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (1)
workflow/rules/common.smk (1)

1376-1376: Remove extra whitespace.

There's unnecessary whitespace on this line.

-   
+
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 8bccfa4 and 55f616c.

📒 Files selected for processing (10)
  • .test/config-chm-eval/config.yaml (0 hunks)
  • .test/config-giab/config.yaml (0 hunks)
  • .test/config-no-candidate-filtering/config.yaml (0 hunks)
  • .test/config-simple/config.yaml (0 hunks)
  • .test/config-sra/config.yaml (0 hunks)
  • .test/config-target-regions/config.yaml (0 hunks)
  • .test/config-target-regions/config_multiple_beds.yaml (0 hunks)
  • .test/config_primers/config.yaml (0 hunks)
  • config/config.yaml (0 hunks)
  • workflow/rules/common.smk (3 hunks)
💤 Files with no reviewable changes (9)
  • .test/config_primers/config.yaml
  • config/config.yaml
  • .test/config-giab/config.yaml
  • .test/config-target-regions/config.yaml
  • .test/config-target-regions/config_multiple_beds.yaml
  • .test/config-chm-eval/config.yaml
  • .test/config-sra/config.yaml
  • .test/config-no-candidate-filtering/config.yaml
  • .test/config-simple/config.yaml
🔇 Additional comments (4)
workflow/rules/common.smk (4)

263-267: LGTM: Clear and correct aligner selection logic.

The function correctly determines the appropriate aligner based on the sample datatype, following standard bioinformatics practices.


417-422: LGTM: Proper handling of UMI and non-UMI cases.

The function correctly determines the input BAM file path based on UMI presence and aligner type.


424-452: LGTM: Well-structured input determination chain.

The chain of functions properly handles the determination of input files based on various conditions (RNA/DNA, primers, duplicates, etc.). The implementation is clean and follows a clear responsibility chain.


1371-1390: LGTM: Comprehensive primer detection logic.

The function correctly implements the PR's objective of supporting samples with and without primers by:

  1. Checking global primer configuration
  2. Checking sample-specific panel configuration
  3. Validating paired-end requirement

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Outside diff range and nitpick comments (3)
config/config.yaml (1)

33-39: LGTM! Consider enhancing the documentation.

The primer configuration changes look good. The empty default values for primers_fa1 and primers_fa2 align well with the PR's objective to support samples with and without primers.

Consider adding a comment explaining that these paths can be:

  1. Left empty for samples without primers
  2. Set globally for all samples
  3. Overridden per sample via the optional primer TSV file
workflow/rules/common.smk (2)

424-452: LGTM! Consider adding error handling for invalid file extensions.

The chain of functions is well-structured and correctly handles the various input scenarios. The addition of the bai parameter improves flexibility.

Consider adding validation for the ext parameter in get_trimming_input:

 def get_trimming_input(wildcards, bai=False):
     ext = "bai" if bai else "bam"
+    if ext not in ["bam", "bai"]:
+        raise ValueError(f"Invalid extension: {ext}. Must be 'bam' or 'bai'.")
     if is_activated("remove_duplicates"):
         return "results/dedup/{{sample}}.{ext}".format(ext=ext)
     else:
         aligner = get_aligner(wildcards)
         return "results/mapped/{aligner}/{{sample}}.{ext}".format(
             aligner=aligner, ext=ext
         )

424-452: Consider using f-strings for string formatting.

For better readability and maintainability, consider using f-strings instead of .format().

Example changes:

-    return "results/dedup/{{sample}}.{ext}".format(ext=ext)
+    return f"results/dedup/{{sample}}.{ext}"

-    return "results/mapped/{aligner}/{{sample}}.{ext}".format(
-        aligner=aligner, ext=ext
-    )
+    return f"results/mapped/{aligner}/{{sample}}.{ext}"

Also applies to: 1371-1384

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 55f616c and 893e0e1.

📒 Files selected for processing (3)
  • .test/config-giab/config.yaml (1 hunks)
  • config/config.yaml (1 hunks)
  • workflow/rules/common.smk (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • .test/config-giab/config.yaml
🔇 Additional comments (3)
workflow/rules/common.smk (3)

263-267: LGTM! Good extraction of aligner selection logic.

Clean implementation that improves maintainability by centralizing the aligner selection logic.


417-422: LGTM! Good use of the new aligner selection function.

The function correctly uses the new get_aligner function while maintaining the existing UMI annotation logic.


1371-1384: ⚠️ Potential issue

Add safety check for 'panel' column access.

The function implementation looks good overall, but there's a potential issue with accessing the 'panel' column.

Apply this diff to handle the missing 'panel' column:

 def sample_has_primers(wildcards):
     sample_name = wildcards.sample
 
     if config["primers"]["trimming"].get("primers_fa1") or (
-        "panel" in samples.columns
-        and samples.loc[samples["sample_name"] == sample_name, "panel"].notna().any()
+        samples.get("panel") is not None
+        and samples.loc[samples["sample_name"] == sample_name, "panel"].notna().any()
     ):
         if not is_paired_end(sample_name):
             raise WorkflowError(
                 f"Primer trimming is only available for paired-end data. Sample '{sample_name}' is not paired-end."
             )
         return True
     return False

@FelixMoelder
Copy link
Contributor Author

FelixMoelder commented Dec 3, 2024

Just updated the workflow to not rely on primer trimming activation in the config anymore.
The updated config now does not allow placeholder strings for primers_fa1/fa2 anymore as those would be interpreted as global primers without explicit activation. This is a breaking change which is why we should treat this PR as a major release.

@FelixMoelder FelixMoelder merged commit 7c24fb8 into master Dec 3, 2024
9 checks passed
@FelixMoelder FelixMoelder deleted the feat/support_primerless_samples branch December 3, 2024 12:38
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants