goodbye Rmarkdown, hello Quarto #180

Merged · 62 commits · Jan 8, 2025

Conversation

@pdimens (Owner) commented Jan 3, 2025

This PR replaces all the R Markdown reports with Quarto reports. They show the same information but look a little different. It's for long-term support, because R Markdown is being phased out and flexdashboard has no active development.

Also a few wording tweaks here and there, NX plots for phasing, etc.

Internal stuff too, like a SQLite-based molecular coverage script, which uses MUCH less RAM than before.

Also introduces scripts and documentation for Pixi-based workflows.

Summary by CodeRabbit

Based on the comprehensive summary of changes, here are the updated release notes:

  • New Features

    • Added support for Quarto report generation across multiple workflows.
    • Enhanced report configuration with dynamic parameters.
    • Introduced new scripts for development installation using Conda and Pixi.
  • Improvements

    • Streamlined Snakemake command execution.
    • Updated report file extensions from .Rmd to .qmd.
    • Refined parameter handling in workflows.
    • Improved logging and error handling in various modules.
  • Bug Fixes

    • Enhanced error checking in preflight and variant calling workflows.
    • Improved file path handling in report generation.
  • Chores

    • Refactored multiple Snakemake workflow files.
    • Updated documentation and report templates.
    • Simplified command construction in various Python modules.


coderabbitai bot commented Jan 3, 2025

Warning

Rate limit exceeded

@pdimens has exceeded the limit for the number of commits or files that can be reviewed per hour. Please wait 20 minutes and 32 seconds before requesting another review.

⌛ How to resolve this issue?

After the wait time has elapsed, a review can be triggered using the @coderabbitai review command as a PR comment. Alternatively, push new commits to this PR.

We recommend that you space out your commits to avoid hitting the rate limit.

🚦 How do rate limits work?

CodeRabbit enforces hourly rate limits for each developer per organization.

Our paid plans have higher rate limits than the trial, open-source and free plans. In all cases, we re-allow further reviews after a brief timeout.

Please see our FAQ for further information.

📥 Commits

Reviewing files that changed from the base of the PR and between bdda55e and 94f1b97.

📒 Files selected for processing (1)
  • harpy/_printing.py (2 hunks)
📝 Walkthrough

This pull request introduces a comprehensive update to the Harpy project, focusing on transitioning report generation from R Markdown (.Rmd) to Quarto Markdown (.qmd) files. The changes span multiple components of the project, including Snakemake workflows, Python scripts, and report templates. The primary goal is to modernize the reporting infrastructure, improve flexibility, and enhance the overall reporting capabilities using Quarto, a next-generation scientific and technical publishing system. Additionally, improvements to installation scripts and the management of disk space in CI/CD workflows are included.

Changes

| File/Directory | Change Summary |
| --- | --- |
| .github/filters.yml | Updated report file references from .Rmd to .qmd across various workflow sections |
| .gitignore | Added entries for .conda, *.bai, .pixi, pixi*, and *.egg-info |
| README.md | Enhanced installation instructions for Conda and Pixi package managers |
| harpy/_conda.py | Removed r-flexdashboard and r-ggplot2, added r-quarto |
| harpy/_launch.py | Added timing functionality to track workflow execution duration |
| harpy/_misc.py | Added logic to handle Quarto configuration file creation |
| harpy/reports/ | Converted all .Rmd reports to .qmd with enhanced parameterization and layout |
| harpy/snakefiles/ | Updated report generation rules to support Quarto rendering |
| resources/ | Added new installation scripts for Conda and Pixi development environments |
| .github/workflows/tests.yml | Replaced shell command for clearing disk space with a dedicated action |

Sequence Diagram

```mermaid
sequenceDiagram
    participant User
    participant Harpy
    participant Quarto
    participant Reports

    User->>Harpy: Initiate Workflow
    Harpy->>Quarto: Prepare Report Configuration
    Quarto->>Reports: Generate Dynamic Report
    Reports-->>User: Display Comprehensive Analysis
```


@coderabbitai bot left a comment

Actionable comments posted: 11

🔭 Outside diff range comments (1)
resources/buildlocal.sh (1)

Line range hint 19-20: Add error handling for script operations

The script permission and copy operations need proper error handling.

Apply this diff:

```diff
-chmod +x harpy/bin/* 
-cp -f harpy/bin/* ${CONDA_PREFIX}/bin/
+if ! chmod +x harpy/bin/*; then
+    echo "Error: Failed to set execute permissions on scripts"
+    exit 1
+fi
+
+if ! cp -f harpy/bin/* ${CONDA_PREFIX}/bin/; then
+    echo "Error: Failed to copy scripts to conda environment"
+    exit 1
+fi
```
🧹 Nitpick comments (93)
harpy/snakefiles/sv_leviathan.smk (1)

200-221: Add error handling for the Quarto rendering step.
While the Quarto-based approach is correct, it might be beneficial to capture errors or logs in case quarto render fails (e.g., Quarto not installed, version mismatch, etc.). This can improve debugging and reliability in automated workflows.
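
A hedged sketch of one way to do this: move the render into a Python helper (callable, e.g., from a Snakemake `run:` block) that surfaces Quarto's stderr and points at the log on failure. The flags mirror the `quarto render` invocation already used in these workflows; the helper name and parameter plumbing are hypothetical.

```python
import subprocess
import sys

def render_report(qmd: str, html: str, logfile: str, params: dict) -> None:
    """Run `quarto render` and fail loudly, pointing at the log on error."""
    cmd = ["quarto", "render", qmd, "--output", html, "--log", logfile, "--quiet"]
    for key, value in params.items():
        cmd += ["-P", f"{key}:{value}"]
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        # Echo Quarto's stderr so the workflow log captures the real cause.
        sys.stderr.write(result.stderr)
        raise RuntimeError(f"quarto render failed for {qmd}; see {logfile}")
```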

resources/dev_install_pixi.sh (1)

3-3: Consider making ENV_PREFIX configurable

The hardcoded path might not work in all environments.

Consider making it configurable:

```diff
-ENV_PREFIX="harpy/.pixi/envs/default/bin/"
+ENV_PREFIX="${PIXI_ENV_PREFIX:-harpy/.pixi/envs/default/bin}"
```
resources/buildlocal.sh (1)

Line range hint 1-20: Consider unifying installation scripts

There's significant code duplication across the three installation scripts (dev_install_pixi.sh, dev_install_conda.sh, and buildlocal.sh). Consider creating a shared library of functions.

Create a new file resources/install_common.sh with shared functions:

```bash
#!/usr/bin/env bash

compile_extract_reads() {
    local bin_dir="$1"
    if ! g++ harpy/bin/extractReads.cpp -O3 -o "${bin_dir}/extractReads"; then
        echo "Error: Compilation failed"
        return 1
    fi

    if [ ! -x "${bin_dir}/extractReads" ]; then
        echo "Error: Compilation succeeded but executable not found or not executable"
        return 1
    fi
}

install_harpy() {
    if ! pip install --no-deps --disable-pip-version-check -e .; then
        echo "Error: Failed to install harpy package"
        return 1
    fi
    [ -d build ] && rm -rf build
}

install_scripts() {
    local bin_dir="$1"
    if ! chmod +x harpy/bin/*; then
        echo "Error: Failed to set execute permissions on scripts"
        return 1
    fi

    if ! cp -f harpy/bin/* "${bin_dir}"; then
        echo "Error: Failed to copy scripts"
        return 1
    fi
}
```

Then source this file in each installation script and use the common functions.

harpy/snakefiles/sv_naibr_pop_phase.smk (2)

251-252: Consider unifying log outputs.

In this rule, you specify a log artifact in both the output section (line 252) and the log: directive (line 259). This duplication can be confusing. Consider relying solely on the Snakemake log: directive for clarity, or if you intend to produce a separate log artifact, rename it to avoid confusion.


347-368: DRY suggestion for shared logic between sample and aggregate reports.

The new aggregate_report rule replicates the pattern of the sample_reports rule (copying QMD, capturing FAIDX, and rendering). Consider centralizing any shared rendering logic into a helper function or consolidated snippet to reduce duplication, improve maintainability, and keep the workflow DRY.
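
As a sketch of what that centralization could look like, a small helper that assembles the shared `quarto render` command for both rules (function name and parameter plumbing are hypothetical; the flags are the ones these workflows already pass):

```python
def quarto_render_cmd(qmd: str, html: str, logfile: str, **params: str) -> str:
    """Assemble the `quarto render` invocation shared by both report rules."""
    flags = " ".join(f"-P {key}:{value}" for key, value in params.items())
    return f"quarto render {qmd} --output {html} --log {logfile} --quiet {flags}".strip()
```

Each rule's `shell:` directive could then format this string instead of repeating the full command.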

harpy/bin/molecule_coverage.py (2)

57-63: Consider potential performance bottleneck for large contigs.
Inserting up to length + 1 rows could be slow for very large contigs. If contigs or resources are large, you may want to allocate in batches or check feasibility (e.g. skip pre-populating for extremely large contigs).
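
For example, a minimal sketch of batched pre-population with `executemany()`; the table and column names are hypothetical, not the script's actual schema:

```python
import sqlite3

def populate_positions(conn: sqlite3.Connection, length: int, batch_size: int = 100_000) -> None:
    """Insert one zeroed row per position, flushing in large batches."""
    cur = conn.cursor()
    cur.execute("CREATE TABLE IF NOT EXISTS coverage (position INTEGER PRIMARY KEY, depth INTEGER)")
    batch = []
    for pos in range(length + 1):
        batch.append((pos, 0))
        if len(batch) >= batch_size:
            cur.executemany("INSERT INTO coverage VALUES (?, ?)", batch)
            batch.clear()
    if batch:
        cur.executemany("INSERT INTO coverage VALUES (?, ?)", batch)
    conn.commit()
```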


132-146: Avoid repeated in-memory database re-initialization if needed.
Re-initializing for each contig may be acceptable for moderately sized reference sequences but might be too frequent for large or numerous contigs. If performance becomes an issue, consider a single database approach with a dedicated table per contig.

harpy/reports/stitch_collate.qmd (2)

26-26: Efficient library-loading approach.
Your custom using(...) function ensures missing packages are installed on-demand, which is convenient for an all-in-one environment. Just be mindful that installing packages each run can slow processes if repeated frequently.


37-44: Keep an eye on large inline text blocks.
Large help or explanation texts are helpful, but may clutter the final document. Consider collapsible sections if it grows significantly.

harpy/reports/hapcut.qmd (1)

19-19: Broad library coverage.
Loading multiple libraries (dplyr, DT, scales, plotly, highcharter) can handle large transformations and interactive charts. Confirm that you truly need all to avoid overhead.

harpy/reports/naibr.qmd (7)

9-9: Consider adding timezone or locale context.
The time formatting is functional. If you plan to distribute this report in different locales, consider allowing an explicit timezone or locale parameter for reproducible timestamps.


44-44: Consider explicit versioning of installed libraries.
Using an ad-hoc installation approach is convenient, but pinning versions (e.g., specifying a CRAN snapshot) can help ensure reproducibility.


147-172: Tables for count and length stats.
The tables effectively pivot the data for structural variant counts and lengths. Ensure large data sets are handled efficiently; for extremely large files, you might investigate chunked reading or a database solution.


180-195: Non-chimeric variants table.
The table layout is clear for smaller data sets. If the number of rows is substantial, consider using server-side processing or pagination for better performance.


242-242: Documentation improvements.
The lines after this provide clarity about the circular plot usage. Keep it consistent with user expectations.


364-364: Background track border color.
This is purely aesthetic. Ensure it remains consistent with the rest of your UI design.


377-377: Genome labeling text size.
A size of 20 might be large for smaller screens. Consider a parameter for adjustable text size if user feedback indicates scaling issues.

harpy/reports/naibr_pop.qmd (7)

8-8: Timestamp usage.
Same suggestion: consider adding locale/timezone options if needed.


21-21: Custom library usage.
All libraries appear relevant. Confirm pinned versions for reproducibility.


81-87: Aggregate and summary stats for structural variants.
The approach is consistent. Pay attention to potential memory usage with large population sets.


134-139: Valuebox for groups without variants.
Color logic is sensible, though consider distinguishing no groups vs. emptyfiles visually.


Line range hint 158-177: Variants-by-count table.
The pivoted summary is helpful. For large data, server-side rendering or chunked reading might speed up results in a real environment.


Line range hint 184-205: Variants-by-length table.
Similar observation about performance with large data sets.


283-316: Dynamic text color function repeated.
Well-coded. If repeated across files, consider extracting into a shared utility for DRY principle.

harpy/reports/align_stats.qmd (3)

25-25: Library imports.
Check pinned versions or a lock file for maximum reproducibility.


113-119: Unique Molecules count.
Should remain valid, but be mindful of potential integer overflow if extremely large.


172-193: Summary table with N50, N75, and N90.
Neat approach for quick Nx stats. Might consider using a specialized package for Nx calculations if available.

harpy/snakefiles/preflight_bam.smk (3)

60-60: Consider variable-based output path for future maintainability
Using a hardcoded outdir + "/filecheck.bam.tsv" path is acceptable. However, placing it in a variable or config option (e.g., output_filecheck) could simplify future maintenance if the filename or location needs to change again.


79-81: Decrease duplication of path references
The input references for data, qmd, and yml consistently use f-strings pointing to the same base path. Factor out base_report_dir = f"{outdir}/workflow/report" or similar to reduce repetition and potential for mismatched paths.


83-84: Use temporary QMD file
The approach of rendering from a temporary QMD is clear. However, if debugging is needed, consider allowing an option to keep the file for inspection.

harpy/snakefiles/preflight_fastq.smk (3)

58-58: Consider storing the combined TSV path in a variable
Similar to the other workflows, you might create a variable or config entry for filecheck.fastq.tsv to increase maintainability.


77-79: Reduce repeated code with consistent path references
Reusing the outdir references multiple times is fine. For clarity, consider a helper variable (e.g., report_path = f"{outdir}/workflow/report") to avoid duplication.


81-82: Temporary QMD
Similar to preflight_bam.smk, using a temporary QMD file is reasonable. If debugging is needed, consider making it optional to preserve the intermediate file.

harpy/deconvolve.py (1)

58-58: Avoid stray trailing whitespace around the HPC profile flag
Appending --workflow-profile {hpc} is correct, but double-check that if HPC is absent, it doesn’t accidentally introduce trailing spaces in the command string.

harpy/_misc.py (1)

88-95: Remove the extraneous “f” prefix and confirm file overwriting intentions.

  1. According to the static analysis hint (F541), there are no placeholders in the string on line 92, so the f prefix is unnecessary.
  2. The new logic for creating and overwriting _quarto.yml looks sound, but ensure this overwriting behavior is intentional.

A minimal fix for the static analysis hint is:

```diff
- print_error("report configuration missing", f"The required quarto configuration file [blue bold]_quarto.yml[/blue bold] was not found within the Harpy installation.")
+ print_error("report configuration missing", "The required quarto configuration file [blue bold]_quarto.yml[/blue bold] was not found within the Harpy installation.")
```
🧰 Tools
🪛 Ruff (0.8.2)

92-92: f-string without any placeholders

Remove extraneous f prefix

(F541)

harpy/downsample.py (1)

70-74: Quote paths for safer command construction.

While this dynamic command-building logic is valid, consider quoting or escaping the path arguments to handle potential spaces or special characters in file paths.

For example, using Python’s shlex.quote:

```diff
+import shlex
+
 command = f'{SNAKEMAKE_CMD} --cores {threads}'
-command += f" --snakefile {workflowdir}/{workflow}.smk"
-command += f" --configfile {workflowdir}/config.yaml"
+command += " --snakefile " + shlex.quote(f"{workflowdir}/{workflow}.smk")
+command += " --configfile " + shlex.quote(f"{workflowdir}/config.yaml")
 if snakemake:
-    command += f" {snakemake}"
+    command += " " + shlex.quote(snakemake)
```
harpy/reports/preflight_fastq.qmd (1)

8-19: Runtime library installation could introduce reproducibility concerns.
While the using() function ensures required packages are installed on-the-fly, consider documenting or locking version constraints for reproducible results.

harpy/reports/preflight_bam.qmd (4)

8-19: Consider specifying R version or checking for Quarto environment.
The function using(...) installs packages on-the-fly. This might lead to issues if run in an environment without sufficient permissions or locked library paths. Consider specifying an R version or using a robust environment management solution (e.g., renv) to avoid potential conflicts.


89-96: Potential confusion in table columns vs. row definitions.
Lines 93–94 mention BX:Z tag presence and format, but lines 95–96 also discuss an AxxCxxBxxDxx format. Ensure consistent references to the presence or absence of the BX:Z tag as well as the final tag condition.


99-126: Performance consideration for large data sets.
Rendering large tables with DataTables can be memory-intensive. For extremely large logs or sets of alignments, consider server-side processing of data or partitioned views.


128-165: Clarify references to LEVIATHAN usage.
Line 162 explains that the BX:Z tag last position is only relevant for LEVIATHAN. Provide details or references on how users can confirm LEVIATHAN usage.

harpy/snakefiles/qc.smk (1)

149-156: Consider gating the report_config rule with skip logic.
If skip_reports is set, confirm whether you want to copy _quarto.yml. Potentially skip copying to reduce overhead.

harpy/assembly.py (1)

74-80: Watch out for HPC plus user params.
Lines 74–80 append HPC and user-supplied Snakemake parameters. Consider validating that HPC flags and user flags do not collide.

harpy/snakefiles/sv_naibr.smk (1)

206-228: Leverage parameters for extended customization.

In sample_reports, the params are used to inject sample-specific data into the Quarto command. Consider also passing additional parameters to handle experimental settings or page formatting if needed. This might reduce duplication across multiple workflow files.

harpy/snakefiles/snp_mpileup.smk (1)

238-257: Handle log variations gracefully.

In variant_report, the --log {log} flag captures Quarto’s logs, which is helpful for troubleshooting. If logs might contain sensitive or extensive data, consider rotating or compressing them to reduce disk usage.

harpy/_launch.py (1)

8-8: Efficient import usage.

Importing datetime is fine here, but if performance is critical and only datetime.now() is used, you might consider from datetime import datetime as dt or a narrower import (though this is a minor concern).

harpy/snakefiles/sv_naibr_phase.smk (1)

264-286: Parameter-based Quarto rendering is flexible.

Using dynamic Quarto parameters for contigs and sample names is beneficial. If more user-driven commands or advanced rendering options are required, consider providing an optional arguments field.

harpy/reports/leviathan.qmd (4)

11-21: Avoid on-the-fly package installations for reproducibility

Installing packages within a report can result in irreproducible behavior, especially on systems without write access. Consider using a dedicated environment management solution or pinned versions to avoid unexpected conflicts.


36-37: Consider providing richer feedback when no variants are found

Currently, the report simply exits upon detecting no variants. Including an explanatory message or generating a minimal report can be more user-friendly.


90-98: Ensure consistent color usage in value boxes

All variant types except for "Variants" are displayed using #dfdfdf. Using distinct colors for each type can help the user quickly differentiate them.


105-116: Extend variant definitions with references where appropriate

These variant definitions are helpful, but referencing official documentation or credible resources can further aid comprehension, especially for less familiar variant types like breakends.

harpy/reports/impute.qmd (5)

30-30: Check package list for potential version conflicts

While installing packages, consider constraining versions to avoid future compatibility issues. Unrestricted updates can break the reproducibility of the pipeline.


34-40: Provide a more descriptive error message

When the stats file is empty, it may help users to see a more detailed explanation of potential causes (e.g., no biallelic SNPs).


53-60: Confirm that ensuring empty infofile is handled well

Similar to the stats file check, consider logging or explaining why infofile might be empty to help diagnose pipeline or data issues.


139-144: Consider clarifying what "Samples" represents

This box displays the number of samples in gcts. Labeling it as, for example, "Total Samples Processed" might make it more intuitive for end users.


Line range hint 183-203: Consider adjusting histogram bin sizes for better distribution insights

If your data includes many near-perfect or near-zero INFO scores, a dynamic bin size might reveal more insights into outliers or skewed distributions.

harpy/snakefiles/sv_naibr_pop.smk (1)

239-261: Rename group_reports to report_config or unify naming for clarity

You introduced report_config in this file, but the next rule is group_reports. To reduce confusion, unify or consistently name these rules across the workflow for a clearer mental model.

harpy/snakefiles/sv_leviathan_pop.smk (1)

294-294: Confirm coverage of newly named reports in the final workflow summary.
You updated references to .leviathan.html files. Make sure the final summary or any external tools that read these HTML names are also updated accordingly.

harpy/snakefiles/align_strobealign.smk (1)

95-95: Use a dynamic thread count if possible.
Threads are fixed to 4 here, which may be insufficient or excessive under different usage scenarios. If resources permit, consider dynamically tying it to workflow.cores for better scaling.

harpy/snakefiles/impute.smk (1)

25-25: Consider more descriptive default for stitch_extra.
You set stitch_extra = config.get("stitch_extra", "None"). If “None” is passed to commands, it might cause confusion. Possibly use an empty string or handle it gracefully if not specified.
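
A small sketch of the graceful handling, assuming the value is eventually spliced into a command string (the base command here is illustrative):

```python
def stitch_command(config: dict, contig: str) -> str:
    """Build the STITCH invocation, omitting extras when none were given."""
    # Default to an empty string so the literal text "None" never reaches the shell.
    stitch_extra = config.get("stitch_extra", "") or ""
    parts = ["STITCH", "--chromosome", contig]  # illustrative base command
    if stitch_extra:
        parts.append(stitch_extra)
    return " ".join(parts)
```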

harpy/reports/align_bxstats.qmd (7)

2-4: Check the parameter handling for indir.
Switching from implicit Snakemake input to params$indir is a good improvement. Make sure other .qmd files follow similar patterns for consistency, particularly if referencing the same data directory.


18-18: R package management block is simplified.
The using() function approach for on-the-fly installation is acceptable for local usage, but in production or HPC settings, consider managing dependencies via tested Docker/Conda images to ensure reproducibility.


101-109: Clarify the inline help text.
These bullet points do a nice job explaining each metric. The text is helpful, but consider adding a short note about usage or interpretation in other scripts, particularly if new team members are reading the code.


131-131: Minor note on figure height calculation.
figheight is set dynamically based on sample count. This is user-friendly. Just ensure it’s large enough for large datasets.

Would you like me to suggest a script block that automatically sets a max limit to avoid extremely tall plots for very large sample sets?


157-161: Valid barcode distribution.
The histogram approach is straightforward. Consider adding a warning or note if the majority of barcodes are invalid, indicating potential issues in sample prep or alignment.


205-206: Limit concurrency for large molecules.
The memory usage for generating these ranges can grow quickly. If you have many big samples, it might be beneficial to run them sequentially or with memory constraints specified in the HPC environment.


220-221: Add an outlier summary for reads per molecule.
If the distribution is heavy-tailed, you may want to note outliers or extremely high coverage. This might help users troubleshoot library prep issues.

harpy/reports/bcftools_stats.qmd (15)

7-7: Avoid mixing raw text with R expressions for date/time.

Embedding code inline with a backtick might confuse some markdown renderers if the rest of the file uses a chunk-based approach. Make sure this usage is intentional.


19-20: Confirm that installing packages during runtime won't mask environment issues.

Inside the using(...) function, packages are installed on-the-fly if missing. In a more controlled environment (e.g., containers, conda), this might be unnecessary or could cause reproducibility concerns.


28-29: Enhance naming consistency in code comments.

The comment “# SNPs” at line 28 conflicts with the code at line 29, which says “General Stats.” Consider clarifying the purpose with more descriptive comments or headings.


46-52: Use a consistent color scheme or theming approach.

Hardcoding "#dfdfdf" might conflict with other theme colors. Consider using a theming variable or parameter for the color.


54-60: Ensure repeated code across multiple valueboxes is DRY.

Valueboxes for “SNPs,” “MNPs,” etc. have similar code. If you have many such metrics, consider creating a function or using templating.


70-76: Further unify style for these valueboxes.

Again, repeated logic with color and “Multiallelic Sites” or “Multiallelic SNPs.” Standardizing your approach can reduce maintenance effort.


86-92: Check for consistent naming of “Others.”

It might be unclear to the user what “Others” means. A short note or documentation could clarify.


94-100: Encourage a generic approach to yield multiple metrics.

Similar feedback as above—consider a generic function or a loop to create these stats boxes.


166-172: Grammar/spelling fix: “Each quality score is has…”

Consider rephrasing line 169 – the text reads awkwardly with “Each quality score is has a semi-open bin...”. Possibly “Each quality score uses a semi-open bin...”.


209-215: Expand window labeling in Depth per Genotype.

It looks like you label the x-axis as a “depth bin.” Clarify if these are log scales or integer bins in your final output for user clarity.


245-246: Improve the heading for (Table) references.

Instead of “#| title: Table,” maybe specify “Depth Table” for clarity.


267-270: Add context about “no indels present.”

Line 269 mentions “suggesting no indels are present.” Possibly clarify if that means truly no indels or an unrecognized format.


300-300: Check trailing colon.

Line 300 has ::: with no immediate content after it. Validate if that extra block opening is needed.


330-333: Avoid mixing chunk titles with markdown headers.

You have a “# Per-Sample” heading and a chunk with name="read_persample". Consider ensuring the chunk name is consistent with the markdown heading.


365-382: Use consistent naming conventions between “Individual Stats (Plot)” and “Individual Stats (Table).”

Your code references both “Individual Stats” in different views. Confirm that they are discoverable in the final HTML (matching headings, etc.).

harpy/snp.py (1)

217-217: Ensure consistent naming for “bcftools_stats.qmd.”

In the fetch_report, you call the same .qmd from mpileup and freebayes. That’s good for standardization. Just confirm you don’t need separate freebayes reporting.

harpy/snakefiles/align_ema.smk (2)

316-323: Improve parameter injection for Quarto.

You pass parameters with -P bxstats:$BXSTATS and so on. This is correct. If the parameter list grows, consider moving it into an external config for better maintainability.


356-374: Avoid duplication in “barcode_report.”

This rule is similar to “sample_reports,” again copying a QMD and rendering. If you keep creating many similarly-structured rules, factor out the logic.

README.md (5)

25-34: Minor markdown-lint nitpick on code fences.

Specify the language (e.g., ```bash) to satisfy Markdownlint rules:

````diff
-```
+```bash
 # for local project directory
 conda create -p path/to/workdir/harpy -c bioconda -c conda-forge harpy
 ```
````
🧰 Tools
🪛 Markdownlint (0.37.0)

29-29: null
Fenced code blocks should have a language specified

(MD040, fenced-code-language)


62-65: Pixi global install instructions.

One more nitpick from Markdownlint: add language to code fences.

````diff
-```
+```bash
 # ~/.zshrc or ~/.bashrc (or equivalent)
 export PATH=~/.pixi/bin:$PATH
 ```
````
🧰 Tools
🪛 Markdownlint (0.37.0)

65-65: null
Fenced code blocks should have a language specified

(MD040, fenced-code-language)


66-68: Ensuring PATH is updated for pixi.

The steps are straightforward. Just add the language spec for code fences if possible.


73-90: Local install with Pixi instructions.

Similar minor note: specifying the language for code blocks ensures lint compliance.


92-106: Pixi update steps are comprehensive.

Again, specify code fence languages to pass lint checks.

harpy/reports/_quarto.yml (3)

1-3: Consider more reliable logo hosting

The logo URL points to raw GitHub content, which may face rate limiting in production. Consider:

  1. Using a relative path if the logo is part of the project
  2. Hosting the logo on a CDN for better reliability
🧰 Tools
🪛 yamllint (1.35.1)

[error] 2-2: trailing spaces

(trailing-spaces)


4-7: Review warning suppression and fix indentation

  1. Suppressing warnings (warning: false) might hide important issues during report generation. Consider logging warnings to a separate file instead of completely suppressing them.
  2. Fix the indentation to use 2 spaces instead of 4 to comply with YAML standards.
```diff
 execute:
-    echo: false
-    warning: false
+  echo: false
+  warning: false
 embed-resources: true
```
🧰 Tools
🪛 yamllint (1.35.1)

[warning] 5-5: wrong indentation: expected 2 but found 4

(indentation)


14-20: Enhance navigation buttons and fix file formatting

  1. Consider adding tooltips for accessibility:

```diff
 nav-buttons:
   - icon: book-half
+    title: Documentation
     href: https://pdimens.github.io/harpy
   - icon: bug
+    title: Report Issue
     href: https://github.com/pdimens/harpy/issues/new/choose
   - icon: code-slash
+    title: Source Code
     href: https://github.com/pdimens/harpy
```

  2. Add a newline at the end of the file to comply with YAML standards.
🧰 Tools
🪛 yamllint (1.35.1)

[error] 20-20: no new line character at the end of file

(new-line-at-end-of-file)

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 2b2470f and f09d066.

📒 Files selected for processing (60)
  • .github/filters.yml (9 hunks)
  • .gitignore (2 hunks)
  • README.md (2 hunks)
  • harpy/_conda.py (1 hunks)
  • harpy/_launch.py (4 hunks)
  • harpy/_misc.py (1 hunks)
  • harpy/_printing.py (2 hunks)
  • harpy/align.py (5 hunks)
  • harpy/assembly.py (2 hunks)
  • harpy/bin/check_bam.py (1 hunks)
  • harpy/bin/molecule_coverage.py (5 hunks)
  • harpy/container.py (1 hunks)
  • harpy/deconvolve.py (2 hunks)
  • harpy/demultiplex.py (2 hunks)
  • harpy/downsample.py (2 hunks)
  • harpy/impute.py (3 hunks)
  • harpy/metassembly.py (2 hunks)
  • harpy/phase.py (3 hunks)
  • harpy/preflight.py (4 hunks)
  • harpy/qc.py (3 hunks)
  • harpy/reports/_quarto.yml (1 hunks)
  • harpy/reports/align_bxstats.qmd (10 hunks)
  • harpy/reports/align_stats.qmd (13 hunks)
  • harpy/reports/bcftools_stats.qmd (10 hunks)
  • harpy/reports/bx_count.qmd (10 hunks)
  • harpy/reports/hapcut.qmd (8 hunks)
  • harpy/reports/impute.qmd (8 hunks)
  • harpy/reports/leviathan.Rmd (0 hunks)
  • harpy/reports/leviathan.qmd (1 hunks)
  • harpy/reports/leviathan_pop.qmd (15 hunks)
  • harpy/reports/naibr.qmd (8 hunks)
  • harpy/reports/naibr_pop.qmd (14 hunks)
  • harpy/reports/preflight_bam.Rmd (0 hunks)
  • harpy/reports/preflight_bam.qmd (1 hunks)
  • harpy/reports/preflight_fastq.Rmd (0 hunks)
  • harpy/reports/preflight_fastq.qmd (1 hunks)
  • harpy/reports/stitch_collate.qmd (8 hunks)
  • harpy/simulate.py (6 hunks)
  • harpy/snakefiles/align_bwa.smk (4 hunks)
  • harpy/snakefiles/align_ema.smk (2 hunks)
  • harpy/snakefiles/align_strobealign.smk (4 hunks)
  • harpy/snakefiles/impute.smk (3 hunks)
  • harpy/snakefiles/phase.smk (1 hunks)
  • harpy/snakefiles/preflight_bam.smk (1 hunks)
  • harpy/snakefiles/preflight_fastq.smk (1 hunks)
  • harpy/snakefiles/qc.smk (2 hunks)
  • harpy/snakefiles/snp_freebayes.smk (1 hunks)
  • harpy/snakefiles/snp_mpileup.smk (1 hunks)
  • harpy/snakefiles/sv_leviathan.smk (2 hunks)
  • harpy/snakefiles/sv_leviathan_pop.smk (1 hunks)
  • harpy/snakefiles/sv_naibr.smk (2 hunks)
  • harpy/snakefiles/sv_naibr_phase.smk (2 hunks)
  • harpy/snakefiles/sv_naibr_pop.smk (3 hunks)
  • harpy/snakefiles/sv_naibr_pop_phase.smk (3 hunks)
  • harpy/snp.py (5 hunks)
  • harpy/sv.py (5 hunks)
  • resources/buildlocal.sh (1 hunks)
  • resources/createharpyenv.sh (0 hunks)
  • resources/dev_install_conda.sh (1 hunks)
  • resources/dev_install_pixi.sh (1 hunks)
💤 Files with no reviewable changes (4)
  • resources/createharpyenv.sh
  • harpy/reports/leviathan.Rmd
  • harpy/reports/preflight_fastq.Rmd
  • harpy/reports/preflight_bam.Rmd
✅ Files skipped from review due to trivial changes (1)
  • .gitignore
👮 Files not reviewed due to content moderation or server errors (3)
  • harpy/_printing.py
  • harpy/snakefiles/phase.smk
  • harpy/reports/leviathan_pop.qmd
🧰 Additional context used
🪛 Ruff (0.8.2)
harpy/_misc.py

92-92: f-string without any placeholders

Remove extraneous f prefix

(F541)

🪛 Markdownlint (0.37.0)
README.md

29-29: null
Fenced code blocks should have a language specified

(MD040, fenced-code-language)


65-65: null
Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🪛 Shellcheck (0.10.0)
resources/dev_install_conda.sh

[warning] 9-9: Use 'cd ... || exit' or 'cd ... || return' in case cd fails.

(SC2164)

🪛 yamllint (1.35.1)
harpy/reports/_quarto.yml

[error] 2-2: trailing spaces

(trailing-spaces)


[warning] 5-5: wrong indentation: expected 2 but found 4

(indentation)


[error] 8-8: trailing spaces

(trailing-spaces)


[error] 20-20: no new line character at the end of file

(new-line-at-end-of-file)

🔇 Additional comments (217)
harpy/snakefiles/sv_leviathan.smk (3)

162-167: Good use of multiline context management.
This approach simplifies the code by reducing deeply nested "with" blocks. No issues detected.


191-198: Verify the Quarto configuration usage.
Ensure that _quarto.yml is properly recognized and utilized when rendering the Quarto reports. If Quarto fails to pick up this file, you might need to pass an explicit parameter pointing to the config.


228-228: Approach to conditionally collect reports looks good.
Using a ternary expression to skip report generation when requested is clear and consistent with the rest of the workflow logic.

harpy/snakefiles/sv_naibr_pop_phase.smk (3)

314-321: Renamed rule for Quarto config looks good.

Renaming the rule to report_config and copying the Quarto YAML file straightforwardly ensures clarity in your Quarto-based reporting flow. No issues found.


322-344: Sample-based Quarto reports rule is well-structured.

This new rule cleanly references all necessary inputs—FAI index, BEDPE, and Quarto project files—and outputs a population-specific HTML report. Make sure that concurrency with other population runs is considered, but overall, this is a neat approach.


377-377: Conditional skipping of the aggregate summary is consistent.

Using a conditional expression to include or exclude summary outputs based on skip_reports is a clear and concise way to handle optional reporting. Looks good.

harpy/bin/molecule_coverage.py (3)

7-7: Seamless integration of sqlite3.
The import of sqlite3 is appropriate and well-placed. No issues found here.


42-43: Handle non-positive window edge cases.
Currently, the code only checks for == 0. If a user provides a negative value, it won't be caught. Consider checking <= 0 here to avoid strange behavior or potential errors.


71-98: Validate partial-window division logic.
Reducing coverage by (contig_end - actual_window) in the last interval is logically correct for a floating average, but ensure this division reflects the desired coverage metric (e.g. floor vs. round) and consistently handles scenarios where actual_window == contig_end.

harpy/preflight.py (4)

43-43: Distinct output directory defaults.
Switching the default output directory to "Preflight/bam" for the BAM command is clear and helps keep the workflows separate. Ensure end users realize that running both fastq and bam checks will produce separate directories.


64-66: Good use of dedicated Snakefile references.
Pointing to preflight_bam.smk clarifies the separate workflow logic for BAM vs. FASTQ. The references are consistent.


113-121: Consistent approach for FASTQ checks.
Using “Preflight/fastq” aligns well with “Preflight/bam.” This symmetry is good for maintainability.


100-100: Confirm HPC logs vs. local logs collision.
You are writing logs to logs/snakemake in output_dir. If HPC or local runs occur successively, ensure that logs from HPC and local runs don’t overwrite each other inadvertently.
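
If collisions are a realistic risk, a per-run log directory is a cheap guard. A hypothetical sketch:

```python
import os
from datetime import datetime

def make_run_log_dir(output_dir: str) -> str:
    """Create a timestamped log directory so successive runs never clobber logs."""
    run_id = datetime.now().strftime("%Y%m%d_%H%M%S")
    log_dir = os.path.join(output_dir, "logs", "snakemake", run_id)
    os.makedirs(log_dir, exist_ok=True)
    return log_dir
```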

harpy/reports/stitch_collate.qmd (5)

2-3: Dynamic title and parameter block.
The r params$id usage for the report title is a neat approach to keep everything parameterized. The new params block is clearly defined and easy to maintain.


57-65: Model stats: consistent use of value boxes.
Using the Quarto .valuebox style with a list(...) is a modern approach to display info inline. All the new parameters (like k, usebx, etc.) are well documented.


71-129: Neat grouping of param display.
Organizing each parameter in its own chunk is well-structured. The consistent color usage helps visually differentiate them.


Line range hint 133-186: Histogram logic looks correct.
The approach to log-scale the X-axis for right-tailed data is a sensible choice for haplotype-length distributions.


Line range hint 261-275: Datatable columns are well-labeled.
Using user-friendly column titles like “Mean SNPs,” “Median SNPs,” etc. improves readability. Everything looks consistent.

harpy/reports/hapcut.qmd (1)

2-5: Parameter-based file references.
Adopting params for fields like blockfile and contigs is a solid step. This also aligns with the standard of removing direct references to snakemake@input.

harpy/reports/bx_count.qmd (5)

4-4: Parameter-based directory references.
Using params$indir for logs/bxcount makes the location flexible. Great approach for advanced usage or changing default paths in the future.


66-66: Check path existence for custom directory.
Currently, it’s assumed indir definitely exists. Consider verifying existence or providing a fallback if the directory isn’t found.


108-109: Auto-scaling figure height.
Setting figheight based on the number of samples is a neat approach to keep charts visually pleasing.


209-209: Pivot logic for invalid percentages.
Using pivot_longer() helps keep segment invalidation data tidy. This is a straightforward approach.


Line range hint 324-329: Plotly scatter approach is sensible.
The separate segments for A, B, C, and D are color-coded well. This effectively communicates which segment is invalid for each sample.

harpy/reports/naibr.qmd (16)

2-7: Use of Quarto parameters seems appropriate.
The new parameter block effectively replaces hardcoded paths and promotes maintainability. No immediate concerns.


12-14: Validate file parameters before reading.
While assigning infile, fai, and samplename from user params is a good approach, consider validating that these files exist or are not empty before proceeding to read, to handle potential errors more gracefully.


17-17: No concerns: heading update.
This simple heading for the "General Stats" section is clear and purposeful.


Line range hint 23-23: Error message clarity.
The message about missing variants is straightforward. Ensure that downstream processes or calls won't break if the execution halts here.


66-66: No concerns: simple section divider.
This line simply acts as a short heading or divider.


80-86: Appropriate usage of Quarto valuebox.
The dynamic color logic is straightforward and should make it visually clear when variants are present vs absent.


88-95: Valuebox for Deletions looks correct.
No performance or correctness issues identified.


99-105: Valuebox for Duplications is consistent.
Implementation remains uniform with the other variant counters.


109-115: Valuebox for Inversions is aligned with prior boxes.
No changes needed.


118-136: Introductory text and variant definitions.
The descriptive text provides good clarity for end-users. Content is well-structured and references are documented.


203-216: Chimeric variants table flows logically.
Good separation of chimeric vs non-chimeric content. No immediate concerns.


223-223: No concerns: R chunk delimiter.
Likely just a chunk boundary or closure.


225-225: No concerns: R chunk delimiter.
Same as above—closing a chunk.


229-229: General logic for advanced plotting.
No direct concerns, likely just an anchor line or chunk name for the subsequent code.


257-292: Dynamic legend text color function looks solid.
The function adjusting text color based on background luminance is well-implemented. No issues.


384-384: Close chunk delimiter.
No issues.

harpy/reports/naibr_pop.qmd (15)

2-6: New parameter block for population-level NAIBR analyses.
Transitioning to parameters for flexible input sources is a good approach.


31-33: Empty-file strategy improved.
Returning an empty data frame for missing rows is better than returning zero—clarifies handling. Good practice.


55-61: Variable naming and meaning.
indir, fai, and infiles are straightforward. Consider verifying each file’s existence before reading to alert the user if parameters are incorrect.


67-77: Handling of emptyfiles array.
The logic to skip further processing if all are empty is helpful for preventing downstream errors.


92-97: Valuebox for populations.
Works as intended. No major concerns.


101-108: Valuebox for deletions.
Displays average deletions across populations. Looks correct.


112-119: Valuebox for duplications.
Parallel logic to the deletion block—maintains consistency.


123-131: Valuebox for inversions.
Again, consistent with the other variant counters.


210-224: All Non-Chimeric Variants listing.
Straightforward table. Confirm performance for big files.


230-248: Chimeric variants listing.
Well-labeled for user clarity. No major concerns.


Line range hint 256-280: Circular plot setup.
The usage of scoping the top 30 contigs is good to avoid overly busy plots. Confirm that ignoring smaller contigs does not mislead some analyses.


Line range hint 337-363: BioCircos logic to handle inversions.
Implementation parallels the approach in naibr.qmd. This ensures uniform user experience.


Line range hint 403-431: Duplications logic reuses the same structure.
No further concerns.


Line range hint 470-499: Deletions logic with SNP track.
Consistent usage of BioCircosSNPTrack for small variant visualization.


531-550: Explanation table for the columns.
Clear descriptive metadata for columns fosters understanding.

harpy/reports/align_stats.qmd (17)

2-10: Parameterizing alignment stats.
Excellent transition from snakemake references to Quarto’s params. This fosters reusability.


32-34: Parameter assignment for input files.
Consider basic file checks to gracefully handle missing or corrupted files.


80-87: Valuebox: contigs, barcodes, and threshold.
Displays critical quick stats. The logic is consistent with typical Quarto valuebox usage.


99-110: Molecule threshold logic.
If sm_moldist is non-positive, it’s reported as NA. Verified no potential break in the code path.


122-128: Valid vs Invalid BX records.
Properly displayed. No concerns.


131-137: Singleton percentage calculation.
Ensure nrow(valids) is never zero when VALID_PRESENT is true—currently safe.


140-151: Average molecules per contig.
Comprehensive display. No performance warnings.


229-230: Histogram usage.
The code confirms distribution of reads/molecules. Summaries are intuitive.


Line range hint 263-299: Inferred molecule length distribution.
The use of log scale is helpful. Confirm the chosen binning for large data sets.


376-379: Reading coverage files from params.
Ensure the user-supplied coverage files exist or handle the error.


424-431: Second set of valueboxes for coverage metrics.
Consistent styling for contigs, intervals, average depth, etc.


444-449: Valuebox: Average depth.
Round to 2 decimals is typical; no issues.


453-458: Valuebox: Stdev Depth.
Works similarly. All good.


462-467: Valuebox: Mol. Average Depth.
Implementation parallel to alignment-based average; no concerns.


506-547: Depth outlier table and distribution.
The approach of ignoring intervals above the 99th percentile for histogram is standard. Good logical structure.


555-560: Dist text block.
Properly clarifies how distance threshold is used if mol_dist > 0.


Line range hint 607-667: Circular coverage plot.
Uses BioCircos to display coverage. The approach is consistent with other circular plots. No performance concerns beyond typical large-data issues.

harpy/container.py (1)

23-23: Extended Snakemake command for container-building.
Adding --conda-prefix .conda and --conda-cleanup-pkgs cache improves environment management. No immediate issues.

harpy/bin/check_bam.py (1)

76-76: Output field order changed.
Swapping NAME_MISMATCH and BAD_BX positions modifies the final TSV columns. Ensure downstream parsers are updated to reflect this new column order.
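
Downstream parsers can be insulated from column-order changes by indexing on header names rather than positions. A hedged sketch (NAME_MISMATCH is the column discussed above; the "file" column name is an assumption):

```python
import csv

def files_with_name_mismatches(tsv_path: str) -> list[str]:
    """Select rows by header name so reordered columns cannot break parsing."""
    flagged = []
    with open(tsv_path, newline="") as handle:
        for row in csv.DictReader(handle, delimiter="\t"):
            if int(row["NAME_MISMATCH"]) > 0:
                flagged.append(row["file"])  # "file" column name is assumed
    return flagged
```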

harpy/snakefiles/preflight_bam.smk (4)

65-66: Clarify the sort key for reliability
sort -k1 assumes the important data is in the first column, but if the log files contain different structures or the first column can have special characters, consider whether a numeric sort (-n) or lexical sort is more appropriate.


69-76: Validate copying of _quarto.yml
The report_config rule copies _quarto.yml into the output directory. This is straightforward, but ensure that any existing _quarto.yml is overwritten intentionally. Consider adding a check to confirm whether overwriting is acceptable.


86-86: Simplified log reference
Referencing the logging path directly is a good improvement, removing the concatenation. This enhances clarity and reduces the chance of path inaccuracies.


89-94: Ensure Quarto rendering is robust
You are using quarto render … --quiet -P infile:$INFILE; consider validating the result or capturing errors from Quarto for improved debugging, particularly if --quiet suppresses essential CLI messages.

harpy/snakefiles/preflight_fastq.smk (4)

63-64: Validate the sorting approach
Using sort -k1 can be sufficient if the first column is guaranteed to be consistent. Otherwise, confirm whether numeric or multidimensional sorting is needed.


67-76: Check overwriting _quarto.yml
Copying _quarto.yml over an existing file may discard manual user edits. Clarify or confirm that this overwrite is intentional.


84-84: Direct log reference
Directly using f"{outdir}/logs/report.log" is simpler. Good improvement in readability.


87-92: Ensure Quarto CLI feedback
If a build failure occurs, --quiet might hide necessary details. Consider capturing logs or using a --log file for more in-depth debugging.

harpy/deconvolve.py (4)

11-11: Centralized Snakemake command usage
Great use of SNAKEMAKE_CMD. It promotes consistent, maintainable command construction across modules.


54-55: Confirm deployment method logic
The combined conda apptainer method is appended only if conda is false, which might not be obvious at first glance. Make sure to verify the logic thoroughly and ensure it doesn’t conflict with HPC usage.


56-56: Ensure correct config file path
--snakefile {workflowdir}/deconvolve.smk depends on the assumption that fetch_rule has placed deconvolve.smk in workflowdir. Confirm that workflowdir is correct and that deconvolve.smk is actually there at runtime.


60-60: Maintain clarity of appended Snakemake arguments
When appending {snakemake}, ensure that no user-supplied flags inadvertently override your previous arguments. Consider checking for dangerous overlaps.
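
A minimal sketch of such a check, assuming the wrapper knows which flags it owns (the reserved set below is illustrative):

```python
RESERVED_FLAGS = {"--cores", "--snakefile", "--configfile", "--software-deployment-method"}

def check_flag_overlap(user_args: str) -> None:
    """Refuse user-supplied snakemake args that would override managed flags."""
    supplied = {token for token in user_args.split() if token.startswith("--")}
    clashes = supplied & RESERVED_FLAGS
    if clashes:
        raise SystemExit(
            "These flags are managed by the wrapper and cannot be overridden: "
            + ", ".join(sorted(clashes))
        )
```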

harpy/_conda.py (1)

56-56: Transition from R Markdown to Quarto
Replacing r-flexdashboard and r-ggplot2 with r-quarto is aligned with the broader move away from R Markdown. Confirm all the new Quarto-based reports still function as intended without r-ggplot2, or ensure it is installed elsewhere if required.

harpy/demultiplex.py (1)

66-72: Same suggestion regarding safe command construction.

The same command building approach appears here. Quoting file paths can help avoid unexpected failures if the file paths contain spaces.

harpy/metassembly.py (1)

59-65: Same suggestion regarding safe command construction.

As before, consider quoting or escaping shell arguments to handle whitespace or special characters in paths.

harpy/reports/preflight_fastq.qmd (7)

1-5: Header metadata looks good.
No issues found in the Quarto front matter specifying the title and input parameters.


6-6: Nice timestamp inclusion.
Calling format(Sys.time(), ...) directly in the document helps track report generation time.


21-33: Data loading and summarization logic appear correct.
Ensure that the user-supplied TSV always includes the columns referenced (e.g., reads, noBX, badBX, badSamSpec, bxNotLast).


73-79: Pass/fail conditions are clearly defined.
The use of a concise table for summarizing conditions is helpful.


81-109: DataTable construction is well-configured.
Conditional formatting properly highlights problem files. The logic is straightforward and user-friendly.


112-146: Comprehensive explanation of columns.
Excellent documentation and clarity. Users will likely appreciate the detailed definitions for each field.


148-169: Read header specification is well-documented.
The demonstration of a properly formatted read header clarifies requirements for SAM compliance.

harpy/impute.py (4)

13-13: Consistent usage of SNAKEMAKE_CMD.
Importing and using the constant aligns well with other modules, improving maintainability.


64-65: Streamlined command construction.
Switching from a hardcoded string to SNAKEMAKE_CMD and specifying --software-deployment-method helps unify the Snakemake invocation approach.


68-70: Conditional HPC and Snakemake parameters.
Appending HPC profile and optional snakemake parameters is well-handled. This pattern is consistent with the rest of the codebase.


81-82: Reports switched from .Rmd to .qmd.
Ensuring all references align with Quarto-based reporting is correct.

harpy/phase.py (4)

10-10: Importing SNAKEMAKE_CMD improves consistency.
Following the same pattern as other workflows makes the code more uniform and maintainable.


63-65: Parameterizing Snakemake command.
Defining the base command using SNAKEMAKE_CMD and --cores looks good. Ensures easier maintenance in the long term.


67-69: Appending HPC and Snakemake options conditionally.
Helps avoid clutter in the command. This is a clean approach.


80-80: Transition to Quarto reports.
Replacing .Rmd with .qmd references is consistent with the PR objective.

harpy/qc.py (4)

10-10: SNAKEMAKE_CMD import usage.
Refactoring out the hardcoded command string is consistent with the rest of the codebase.


68-70: Command construction refactored for consistency.
Similar approach to the other workflows, ensuring fewer hard-coded strings.


72-74: Flow control for HPC and custom Snakemake parameters.
The logic properly appends HPC and Snakemake values. Good job avoiding trailing whitespace.


87-87: Quarto report fetch is inline with the rest of the changes.
Fetching bx_count.qmd maintains uniformity and clearly fits the new reporting standard.

harpy/reports/preflight_bam.qmd (4)

1-5: Ensure parameter references are consistent throughout.
The front matter declares a parameter infile set to 'test.tsv'. Verify all references in the code consistently use params$infile.


21-29: Validate the effect of row-level operations.
The code uses rowwise() followed by a summarise across three columns. Ensure that rowwise operations and subsequent summarise calls are correct and do not unexpectedly collapse data. Test with both small and large datasets.


36-85: Check for potential naming mismatch in valuebox labels.
Each value box references a condition (e.g., ifelse(attention > 0, ...)). Confirm that these icons and color statuses are properly aligned—particularly lines 61–63 referencing noMItag instead of noBXtag.


167-187: Confirm code fences and output rendering.
Lines 183–185 show an example alignment record. Confirm that the fenced code block outputs as expected in Quarto.

harpy/snakefiles/qc.smk (2)

139-139: Validate the correctness of the new barcode_stats rule name.
Renaming from check_barcodes to barcode_stats is clear, but verify all upstream references are updated accordingly in the rest of the codebase.


159-176: Confirm the shell environment for Quarto.
You render barcode.summary.qmd with Quarto. Verify that the environment indeed contains a Quarto installation or a relevant Conda environment. Otherwise, the build might fail unexpectedly.

harpy/snakefiles/snp_freebayes.smk (3)

187-187: Revised stats output directory.
Writing variant stats to reports/data/variants.{type}.stats organizes data well. Ensure that any references to the old path are updated (e.g., downstream consumption or documentation).


195-202: Ensure consistent naming of report_config rule across files.
You introduced report_config for variants. Verify that these references match usage in parallel Snakefiles (e.g., qc.smk).


203-222: Check mandatory Quarto parameters.
The Quarto render parameters (--log, --quiet, -P infile:$INPATH) look correct. Ensure that all needed user inputs or environment variables are properly provided (-P vcf:{params} is present, but confirm no missing placeholders).
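As a rough illustration of exercising these render parameters outside the Snakefile, a hedged Python sketch (the file names are placeholders, and it assumes a Quarto installation on PATH):

import subprocess

# mirror the rule's render call: log file, quiet mode, and two -P parameters
subprocess.run(
    ["quarto", "render", "variants.qmd",
     "--log", "render.log", "--quiet",
     "-P", "infile:reports/data/variants.snp.stats",
     "-P", "vcf:variants.snp.vcf.gz"],
    check=True,
)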

harpy/assembly.py (1)

12-12: Centralizing Snakemake command is a good practice.
Great job moving the snakemake command into SNAKEMAKE_CMD. This simplifies maintenance across different modules.

harpy/snakefiles/sv_naibr.smk (2)

97-98: Consider clarifying whether logs should be temporary or permanent.

Storing the vcf and log outputs in temporary locations (via temp(...)) can be convenient for cleanup, but it may make post-run debugging difficult if logs are removed prematurely.


198-205: Ensure correct handling of missing/quarto configuration files.

The new report_config rule copies _quarto.yml to the reports directory. Confirm that f"{outdir}/workflow/report/_quarto.yml" always exists (especially in distributed or containerized environments) to avoid runtime failures.

harpy/snakefiles/snp_mpileup.smk (1)

230-237: Validate presence of Quarto config file.

Within report_config, be sure that "{outdir}/workflow/report/_quarto.yml" is populated and not skipped through conditions (like skip_reports). Missing files will break the copy command.

harpy/_launch.py (3)

18-18: Confirm conda prefix usage in all environments.

The updated SNAKEMAKE_CMD includes a --conda-prefix .conda parameter. Verify that all Snakemake subcommands respect this location to maintain consistent environments.


36-36: Good approach for tracking workflow start time.

Storing start_time = datetime.now() is a straightforward way to measure the runtime. No issues here.


160-162: Excellent user-facing runtime info.

Passing elapsed_time to print_onsuccess helps highlight workflow duration. This is a nice improvement in user feedback.
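A self-contained sketch of this timing flow, with print_onsuccess stubbed so the snippet runs on its own (the real helper lives in harpy/_printing.py):

from datetime import datetime

def print_onsuccess(outdir, summary=None, time=None):
    # stand-in for harpy's helper
    print(f"finished in {time}; results in {outdir}")

start_time = datetime.now()
# ... run the workflow here ...
elapsed_time = datetime.now() - start_time
print_onsuccess("Align/bwa", summary="align.summary", time=elapsed_time)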

harpy/snakefiles/sv_naibr_phase.smk (2)

193-194: Check log completeness for phased variant-calling.

Including a log file under call_variants is great for debugging. Ensure that critical errors are surfaced if the process terminates unexpectedly and not merely written to a temp file.


256-263: Same caution about missing _quarto.yml.

As in other workflows, ensure _quarto.yml is prepared before copying. This is crucial for reproducible Quarto-based reporting.

harpy/reports/leviathan.qmd (1)

246-260: Verify correct ring radius overlaps for visually distinct tracks

In multi-track circles, ensure each track is visually separated. Overlapping or minimal spacing may cause confusion in large or complex datasets. Confirm the chosen radius increments are appropriate for your use cases.

harpy/snakefiles/align_bwa.smk (4)

104-104: Confirm that a fixed 4-thread limit suffices for large-scale alignments

Previously, threads might have been dynamically assigned. Keeping it to 4 might slow down workflows with more available resources. Validate performance constraints or consider making it configurable.


188-188: Good approach for specifying the window size

Providing -w {params} for molecule coverage ensures a clear and flexible approach to coverage windows. This looks consistent with broader usage patterns.


203-210: New rule report_config added

The creation of an _quarto.yml configuration file makes the reporting logic more modular. Great step toward standardized Quarto-based workflows.


211-237: Check correctness when copying QMD files to final reports

The logic properly copies the QMD template and uses realpath for path resolution. Ensure relative paths in the QMD remain valid after copying.

harpy/snakefiles/sv_naibr_pop.smk (3)

177-177: Aggregating variants is a well-structured approach

Great job on systematically writing each structural variant classification to separate files. This separation can streamline further analyses.


231-238: New report_config rule added

Copying _quarto.yml into the final reports directory is consistent with the approach across other workflows. This fosters consistency in project-wide report generation.


264-285: Validate multi-population inputs in the aggregate_report rule

When collecting multiple bedpe files, confirm that no population is overlooked or duplicated. Also validate how partial or empty sets are handled (i.e., if a new population is introduced but lacks variants).

harpy/snakefiles/sv_leviathan_pop.smk (3)

233-240: Ensure consistency for newly introduced rule report_config.
Great idea to copy the Quarto _quarto.yml config at the start of the workflow. Please confirm that all downstream rules referencing f"{outdir}/reports/_quarto.yml" rely on this rule or have it in their dependencies.


242-263: Check for potential concurrency conflicts.
Both cp {input.qmd} {output.qmd} and the call to quarto render rely on consistent file naming. If parallel runs produce the same QMD with different partial fill parameters (e.g., multiple population wildcards), ensure that potential concurrency issues are avoided by either using distinct intermediate filenames or restricting concurrency.

Suggestion: Incorporate additional logic (like a rule-specific temp directory) to avoid accidental collisions when the target directory is shared by multiple parallel tasks.
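One hedged way to realize that suggestion, assuming Quarto writes the rendered HTML next to the QMD; render_isolated is a hypothetical helper, and a real rule would also need _quarto.yml copied into the temp dir:

import shutil
import subprocess
import tempfile
from pathlib import Path

def render_isolated(template: str, final_html: str, params: list[str]) -> None:
    # render in a private directory so parallel jobs never share filenames
    with tempfile.TemporaryDirectory() as tmp:
        qmd = Path(tmp) / Path(final_html).with_suffix(".qmd").name
        shutil.copy(template, qmd)
        subprocess.run(["quarto", "render", str(qmd), "--quiet", *params], check=True)
        shutil.copy(qmd.with_suffix(".html"), final_html)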


266-287: Validate the aggregated stats coverage.
This rule aggregates stats from multiple populations. Consider verifying that the relevant stats files exist for all populations before proceeding. Also, ensure the Quarto rendering gracefully handles edge cases with empty data (e.g., no structural variants found).

harpy/snakefiles/align_strobealign.smk (5)

179-179: Additional checks for coverage script.
Ensure molecule_coverage.py accepts and properly interprets the numeric window size (-w {params}). Check for edge cases when windowsize is large or zero.


194-201: Report configuration rule looks good!
Copying _quarto.yml into reports/ aligns well with the new Quarto-based workflow. No issues found.


202-228: Confirm logic in rule sample_reports.
The new parameters for Quarto, such as -P mol_dist:{molecule_distance}, are consistent with the rest of your approach. Ensure all references to molecule_distance and windowsize are correct in the QMD. Also, confirm that the old sample_reports references are fully replaced or removed in the rest of the codebase.


261-279: Validate aggregated barcode stats.
Check that the aggregated data from all samples can be loaded correctly in align_bxstats.qmd. If some samples produce empty .bxstats.gz files, ensure the Quarto script handles them gracefully.


287-287: Confirm final reports in the workflow summary.
Your last input references bx_report = outdir + "/reports/barcode.summary.html". If the user sets skip_reports to True, confirm that the variable is properly disabled downstream to avoid referencing missing outputs.

harpy/snakefiles/impute.smk (3)

123-130: Validate rule contig_report_config.
Similar to other config copies, ensure that any references to _quarto.yml are consistent across all rules (particularly in cases with multiple paramsets or contigs).


134-159: Handle partial or missing stats.
For newly introduced contigs that might not have data in the statsfile or plotdir, consider verifying they exist before proceeding. Quarto might fail if inputs are missing.


223-250: Confirm consistent usage of stitch_extra in Quarto.
Passing command-line parameters with -P extra:{stitch_extra} is flexible. Should stitch_extra be a single string or key-value pairs for Quarto’s parameter passing? Confirm usage across all paramsets.

harpy/reports/align_bxstats.qmd (6)

6-6: Nitpick: Minimal effect.
Adding a timestamp here is purely cosmetic but is a nice way to keep track of generation times. No concerns.


82-83: Graceful handling of empty infiles.
You’ve now checked for an empty data frame and ended the report with knitr::knit_exit(). This is a good user experience improvement.


119-119: Good use of 'dom = Brtp'.
Keeps the table interface clean and easy to download. No issues found.


135-140: NX distribution usage.
The concept is well described. If extremely large molecule lengths are expected, ensure the density plot does not cause performance issues in extreme scenarios.
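For reference, one common definition of the NX statistic over molecule lengths L_1, ..., L_n (the report may compute it via a sorted cumulative sum, which is equivalent):

$$
N_X = \max\Bigl\{\, \ell : \sum_{i \,:\, L_i \ge \ell} L_i \;\ge\; \frac{X}{100} \sum_{i=1}^{n} L_i \Bigr\}
$$

So N50, for example, is the largest length such that molecules at least that long account for half of all sequenced length.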


182-187: Per-sample metrics approach.
Using distinct sections for “Per-Sample Metrics” is a helpful structure. Keep it consistent with the rest of your QMD-based reports.


236-237: Dual coverage metrics are a fantastic addition.
You’re showing both fragment-based coverage and aligned base pairs. This should give a thorough perspective. Nicely done.

harpy/reports/bcftools_stats.qmd (11)

2-5: Use consistent YAML frontmatter structure.

Here, you've added parameters and updated the document title to reflect the Quarto style. Everything looks good; just ensure other .qmd files follow a similar pattern for consistency.


22-23: Check for read-only environments.

Reading lines from params$infile indicates direct file usage. If the environment is read-only or ephemeral, confirm that the file is properly available at runtime.


38-44: Check valuebox usage.

These lines produce a single value box for “Samples.” The logic appears correct, but ensure that the styling (#| content: valuebox) works as intended beyond your local environment.


62-68: Check for out-of-bounds indexing.

You reference sn[1,5] and sn[1,6] for data. Make sure sn has enough columns and that the indexing is valid even if the stats file is malformed or missing data.


78-84: Validate row indexing for “No-Alts.”

We see sn[1,3] for “No-Alts.” If the input CSV is incomplete, referencing an index can throw an error. Ensure robust error handling.


102-108: Ensure the “No QUAL section found” branch is tested.

Your code logs a message if no QUAL section is found. Confirm that your pipeline has test data that triggers this edge case.


195-195: Be mindful of leftover debugging or test code.

Double-check whether the function definition or calls at line 195 are necessary. If not used, remove them.


243-243: Double-check the triple-colon (:::) usage.

The ::: syntax is a Pandoc fenced-div extension that Quarto uses for layout blocks. Confirm that your rendering engine supports it and that it does not collide with standard R Markdown features.


256-260: Check alignment of table export buttons.

The code references 'Buttons' extension in DataTable. Ensure that the final rendered DT shows those buttons as expected.

Also applies to: 262-262


302-308: Ensure reusing “mergebins” logic for Indel Genotype Quality.

At lines 310–311, you reuse mergebins(qual[,-2]). Looks correct, but confirm that the logic is identical to the SNP branch.


341-341: Double-check default figure height.

Line 341 sets figheight dynamically. Ensure that large sample sets produce a readable plot.

harpy/snp.py (4)

13-13: Great use of SNAKEMAKE_CMD.

This import unifies your command execution. Implementation is consistent with the rest of the file.


87-93: Inconsistent HPC and snakemake usage.

You’re conditionally adding HPC and snakemake flags. Confirm that concatenating strings in this manner always yields the intended command and doesn’t introduce spacing errors.
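A hedged alternative that sidesteps spacing bugs entirely: build the argument list with shlex and hand it to subprocess (names and flags here are illustrative, not harpy's actual CLI):

import shlex
import subprocess

def build_cmd(base: str, threads: int, hpc: str | None, extra: str | None) -> list[str]:
    cmd = [*shlex.split(base), "--cores", str(threads)]
    if hpc:
        cmd += ["--workflow-profile", hpc]
    if extra:
        cmd += shlex.split(extra)  # user-supplied flags keep their quoting
    return cmd

# subprocess.run(build_cmd("snakemake --conda-prefix .conda", 8, None, "--dry-run"), check=True)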


111-111: Confirmed: fetch_report now references .qmd instead of .Rmd.

This aligns with the new Quarto-based approach. Ensure there are no leftover .Rmd references in non-committed files.


193-199: Check thread usage beyond default.

Same logic as above—concatenate HPC and snakemake. It’s easy to lose spacing or flags if not carefully tested. Suggest verifying with complex HPC setups.

harpy/sv.py (5)

13-13: Consistent approach with SNAKEMAKE_CMD.

You’re aligning this file’s approach to match harpy/snp.py. Good step for maintainability.


87-93: Check for HPC profile.

Same caution about HPC logic. If hpc is specified, do you need additional flags or environment setups? Validate HPC usage.


103-104: Review whether leviathan_pop.qmd and leviathan.qmd overlap.

You fetch both reports if populations is present. Confirm that the correct one is being rendered or both are needed.


187-193: Parallel HPC arguments with NAIBR.

Same HPC + snakemake concatenation issues as in other scripts—validate spacing and flags.


203-204: naibr_pop.qmd and naibr.qmd usage.

Similar to leviathan, you fetch both if populations is not None. Ensure the correct final report is invoked.

harpy/snakefiles/align_ema.smk (4)

287-287: Check piping into gzip.

molecule_coverage.py output is piped then gzipped. This looks straightforward, but watch out for situations with partial writes or concurrency.


289-296: report_config rule ensures _quarto.yml is referenced.

You’re copying _quarto.yml from workflow/report/ to the final location. This is good for templating. Confirm that you don’t need environment-specific paths.


297-308: Consider verifying that all required inputs exist for the sample reports.

Line 302 references align_stats.qmd. Ensure that file is properly populated or templated. Missing files can break the report generation.


382-382: Confirm single-sample vs. multi-sample scope.

Line 382 conditionally references barcode.summary.html if we’re not skipping or if there’s only one sample. Double-check logic for edge cases (e.g., multiple samples but still want a combined barcode summary).

harpy/align.py (13)

14-14: Good addition of SNAKEMAKE_CMD.

This improves maintainability by centralizing the Snakemake command.


98-100: Refactored command construction appears solid.

Using SNAKEMAKE_CMD alongside --snakefile and --configfile is clean and maintainable.


102-102: HPC profile integration looks correct.

Appending --workflow-profile {hpc} at runtime is consistent with the rest of the workflow.


104-104: Support for custom Snakemake params is well-structured.

This conditional append keeps the base command clean.


112-113: Report references switched to .qmd.

These updates align with the new Quarto-based reporting framework. No issues.


189-191: EMA Snakemake command logic follows the established pattern.

The approach is consistent and maintains readability.


193-193: Appending HPC profile parameter for EMA.

Continues the consistent approach used across aligners.


195-195: Additional Snakemake parameters appended.

Identical pattern to bwa, which keeps the code consistent.


217-218: Quarto report references for EMA.

Again, switching from .Rmd to .qmd is consistent with the project’s new reporting standard.


299-301: Command construction for Strobe aligns well with pattern.

Reusing SNAKEMAKE_CMD ensures consistency and easy updates.


303-303: HPC profile parameter appended for Strobe.

Good, matches the logic from bwa and ema.


305-305: Appending Snakemake params for Strobe.

Keeps usage consistent with other aligners.


313-314: Quarto-based reports for Strobe workflow.

Good to see the .qmd update here, too.

harpy/simulate.py (11)

11-11: Importing SNAKEMAKE_CMD for consistent command usage.

Centralized command references help avoid duplication.


158-160: Clear Snakemake command construction for linkedreads.

Using SNAKEMAKE_CMD is a good standardization effort.


162-164: Conditional HPC & Snakemake params appended properly.

Good approach, consistent with other workflows.


266-268: Using two cores for the SNP/indel simulator.

Explicitly specifying cores is a good practice to keep resource usage in check.


270-272: HPC profile & additional Snakemake param logic.

Follows the same design, no issues spotted.


382-384: Same pattern for inversion workflow.

Consistency with preceding commands is commendable.


386-388: Appending HPC and Snakemake params for inversion.

Matches the rest of the file’s structure.


498-500: Command creation for CNV simulation looks correct.

No errors or redundancies observed.


502-504: HPC & Snakemake integration for CNV.

Adheres to the same logic found in other commands.


604-606: Command string for translocation simulation well-formed.

Consistent usage of SNAKEMAKE_CMD and --cores 2.


608-610: Conditional HPC & Snakemake parameters appended.

No issues found, remains easy to maintain.

.github/filters.yml (10)

28-29: Updating preflight report references to .qmd.

Nicely aligns with the Quarto-based approach across the codebase.


54-55: .qmd references for align stats.

Consistently migrated from .Rmd — looks good.


67-68: EMA align stats references updated to .qmd.

All set for Quarto reporting.


79-80: Switch to .qmd for strobealign stats.

Maintains uniformity with other modules.


92-92: Bcftools stats now .qmd in mpileup.

One-step Quarto transition.


100-100: Bcftools stats .qmd for freebayes.

Continues the planned migration.


110-111: Impute & Stitch Collate Reports updated to .qmd.

Fully aligned with the PR objective of removing .Rmd.


120-120: Leviathan reports use wildcard .qmd.

Migration is thorough.


130-130: NAIBR report references changed to .qmd.

Consistent with the new Quarto standard.


139-139: HapCut report updated to .qmd.

No problems identified.

README.md (6)

12-13: Enhanced clarity for Conda installation steps.

The addition of a dedicated heading is helpful for new users.


18-23: Clear explanation on activating the environment.

Straightforward instruction that users can easily follow.


36-38: Installing into an existing environment.

Instructions are organized and aligned with standard Conda usage.


50-55: Update instructions for Harpy in Conda.

Presents a typical conda update approach. No issues.


60-61: Heading for Pixi installation included.

Segregating different package managers is user-friendly.


70-70: Pixi global install usage looks standard.

No concerns with the code snippet.

harpy/reports/_quarto.yml (2)

8-13: Remove duplicated logo URL and verify theme compatibility

  1. The logo URL is duplicated from the brand section. Consider using a YAML anchor and alias to maintain DRY principles:
 brand:
   logo: 
-    medium: https://raw.githubusercontent.com/pdimens/harpy/docs/static/logo_report.png
+    medium: &logo_url https://raw.githubusercontent.com/pdimens/harpy/docs/static/logo_report.png
 # ... other configurations ...
   dashboard:
     theme: materia
     scrolling: true
     expandable: false
-    logo: https://raw.githubusercontent.com/pdimens/harpy/docs/static/logo_report.png
+    logo: *logo_url
  2. Verify that the 'materia' theme is compatible with Quarto dashboard format.
🧰 Tools
🪛 yamllint (1.35.1)

[error] 8-8: trailing spaces

(trailing-spaces)


1-20: Verify Quarto configuration compatibility

Since this PR transitions from RMarkdown to Quarto, please verify:

  1. All Quarto-specific features are properly configured
  2. The dashboard format settings are compatible with the latest Quarto version
  3. The configuration works with the new Snakemake workflows mentioned in the PR
✅ Verification successful

Let me gather more information about the Quarto configuration.


Quarto configuration appears compatible and properly integrated

Based on the verification results:

  1. The Quarto configuration is properly integrated into the Snakemake workflows, with consistent usage across multiple workflow files (sv_naibr.smk, sv_leviathan_pop.smk, etc.).
  2. The dashboard format settings in _quarto.yml use standard Quarto features:
    • The theme "materia" is a supported Quarto theme
    • Common dashboard options like scrolling and expandable are correctly configured
    • The embed-resources option is properly set for standalone HTML output
  3. The Quarto render commands in the Snakemake files correctly use parameter passing (-P) and logging, indicating proper integration with the workflow system.

The configuration appears to follow Quarto best practices and is consistently used across the project's workflows.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Check for Quarto version specification in project
rg -l "quarto.*version" 

# Check for any Quarto-specific configuration in Snakemake files
rg "quarto" -g "*.smk"

# Look for other Quarto YAML examples in the project
fd -e yml -e yaml | xargs rg "format.*dashboard" -l

Length of output: 7360


Script:

#!/bin/bash
# Check for Quarto dashboard format examples in .qmd files
fd -e qmd | xargs rg "format.*dashboard" -l

# Check for theme-related configurations
fd -e yml -e yaml | xargs rg "theme.*materia" -l

# Look for any Quarto-specific errors or warnings in log files
fd -e log | xargs rg -i "quarto.*error|quarto.*warning" -l

# Check if there are any other _quarto.yml files in the project
fd "_quarto.yml"

Length of output: 226

🧰 Tools
🪛 yamllint (1.35.1)

[error] 2-2: trailing spaces

(trailing-spaces)


[warning] 5-5: wrong indentation: expected 2 but found 4

(indentation)


[error] 8-8: trailing spaces

(trailing-spaces)


[error] 20-20: no new line character at the end of file

(new-line-at-end-of-file)

Repository owner deleted a comment from coderabbitai bot Jan 6, 2025
Repository owner deleted a comment from coderabbitai bot Jan 6, 2025
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 4

♻️ Duplicate comments (1)
resources/dev_install_pixi.sh (1)

24-25: ⚠️ Potential issue

Add error handling for compilation step.

The compilation of extractReads.cpp lacks error handling and output verification.

Apply this diff:

-# compilation
-g++ harpy/bin/extractReads.cpp -O3 -o ${ENV_PREFIX}/bin/extractReads
+# compilation
+if ! g++ harpy/bin/extractReads.cpp -O3 -o ${ENV_PREFIX}/bin/extractReads; then
+    echo "Error: Compilation failed"
+    exit 1
+fi
+
+# Verify executable was created
+if [ ! -x "${ENV_PREFIX}/bin/extractReads" ]; then
+    echo "Error: Compilation succeeded but executable not found or not executable"
+    exit 1
+fi
🧹 Nitpick comments (7)
harpy/bin/molecule_coverage.py (2)

134-146: Enhance database performance and safety.

While the in-memory database per contig is a good approach, consider these improvements:

  1. Use transaction management to batch updates
  2. Make database connection and cursor local to avoid global state

Here's how to implement these improvements:

-            conn = sqlite3.connect(':memory:')
-            cursor = conn.cursor()
+            conn = sqlite3.connect(':memory:', isolation_level='DEFERRED')
+            cursor = conn.cursor()
+            cursor.execute('BEGIN TRANSACTION')
             cursor.execute('''
                 CREATE TABLE number_counts (
                     number INTEGER PRIMARY KEY,
                     count INTEGER
                 )
             ''')
             initialize_contig(contigs[contig])
+            conn.commit()

And before calling print_windowed_depth:

+            conn.commit()  # Commit any pending changes
             print_windowed_depth(LASTCONTIG, args.window)

Line range hint 1-147: Great refactoring to improve memory efficiency!

The transition to a SQLite-based approach is well-implemented and aligns perfectly with the PR's objective of reducing RAM usage. The use of in-memory databases per contig and windowed aggregation provides an efficient solution for molecular coverage calculation.

Consider adding a comment in the docstring mentioning the memory optimization strategy using SQLite, as this would be valuable information for future maintainers.
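For future maintainers, a self-contained sketch of the per-contig in-memory SQLite pattern described above (the 100 bp contig and the positions 10 to 20 are made-up test values):

import sqlite3

conn = sqlite3.connect(":memory:")
cursor = conn.cursor()
cursor.execute("CREATE TABLE number_counts (number INTEGER PRIMARY KEY, count INTEGER)")
# initialize one row per base of a 100 bp contig
cursor.executemany(
    "INSERT INTO number_counts (number, count) VALUES (?, 0)",
    [(i,) for i in range(1, 101)],
)
# one alignment spanning positions 10 to 20 increments every base in that interval
cursor.execute(
    "UPDATE number_counts SET count = count + 1 WHERE number BETWEEN ? AND ?",
    (10, 20),
)
conn.commit()
covered, = cursor.execute("SELECT SUM(count) FROM number_counts").fetchone()
print(covered)  # 11
conn.close()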

harpy/reports/align_stats.qmd (5)

2-10: Leverage parameter validation for improved reliability
The front matter and parameter definitions look good. However, consider validating parameter inputs (e.g., ensuring file paths exist, numeric inputs are within expected ranges) to provide clear error messages.


25-25: External package installation
Calling install.packages directly from within the report may cause reproducibility issues in certain pipeline or cluster environments. It’s often preferable to handle package installations outside the reporting script.


99-110: Clarify “Molecule Threshold” terminology
Current wording “Molecule Threshold” might be slightly confusing since it represents the distance threshold. Consider renaming it to something like “Linking Distance” to better reflect its purpose.

-#| title: "Molecule Threshold"
+#| title: "Linking Distance"

168-193: Performance considerations for large datasets
The group_by and left_join operations here are solid for typical datasets. If the dataset grows very large, consider using more memory-efficient packages (e.g., data.table).


Line range hint 607-667: Circular coverage plot
This BioCircos-based approach is robust and visually engaging. Consider color legends or tooltips to clarify which ring represents coverage vs. molecule coverage, especially for large contig sets.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between f09d066 and 5e850bc.

📒 Files selected for processing (6)
  • .github/workflows/tests.yml (1 hunks)
  • harpy/bin/molecule_coverage.py (5 hunks)
  • harpy/reports/align_stats.qmd (13 hunks)
  • harpy/reports/preflight_fastq.qmd (1 hunks)
  • resources/dev_install_conda.sh (1 hunks)
  • resources/dev_install_pixi.sh (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • harpy/reports/preflight_fastq.qmd
⏰ Context from checks skipped due to timeout of 90000ms (1)
  • GitHub Check: Build/Cache Container
🔇 Additional comments (31)
harpy/bin/molecule_coverage.py (3)

7-7: LGTM! Well-structured argument handling.

The addition of SQLite support and the window parameter is well implemented with proper validation and error handling.

Also applies to: 28-28, 42-43


71-98: LGTM! Well-implemented windowing logic.

The windowing implementation correctly handles:

  • Single position case (window=1)
  • Window grouping and aggregation
  • Edge case for the last window
  • Proper connection cleanup

64-69: Optimize range updates for better performance.

Using executemany for each position in the range can be inefficient for large ranges.

Replace with a single UPDATE statement:

 def process_alignment(start, end):
     """Function to insert or update counts for numbers in a given range"""
-    cursor.executemany(
-        'UPDATE number_counts SET count = count + 1 WHERE number BETWEEN ? AND ?',
-        [(num,) for num in range(start, end + 1)]
-    )
+    cursor.execute(
+        'UPDATE number_counts SET count = count + 1 WHERE number BETWEEN ? AND ?',
+        (start, end)
+    )
resources/dev_install_conda.sh (2)

14-18: LGTM! Proper error handling for pip install.

The implementation includes appropriate error handling and flags for pip installation.


24-26: ⚠️ Potential issue

Remove duplicate pip install command

These lines duplicate the pip install command from lines 14-18 that already has proper error handling.

Remove these duplicate lines as they're redundant and lack error handling:

-# install harpy proper
-pip install --no-deps --disable-pip-version-check -e . && \
-    rm -rf build

Likely invalid or redundant comment.

resources/dev_install_pixi.sh (3)

1-4: LGTM! Environment setup is correct.

The shebang and environment prefix are properly configured.


5-13: LGTM! Proper error handling for Pixi initialization.

The initialization includes appropriate error checks and editable install configuration.


15-18: LGTM! Directory change includes error handling.

The cd command is properly wrapped with error checking.

.github/workflows/tests.yml (1)

82-82: LGTM! Good use of a reusable action.

Replacing shell commands with the jlumbroso/free-disk-space@main action improves maintainability and reliability of disk space management across jobs.

harpy/reports/align_stats.qmd (22)

80-96: Value boxes for Contigs and Unique Barcodes
These value boxes effectively utilize Quarto’s list-based approach (replacing old valueBox calls). Looks consistent and easy to maintain.


113-137: Grouped value boxes
These three value boxes (unique molecules, valid BX records, invalid BX records) are logically consistent and straightforward.


140-151: Confirm singletons calculation
Summation of (valids$reads < 2) yields the number of singletons. Dividing by nrow(valids) to form a percentage is correct, but double-check that this matches the intended definition of “singletons”.


154-165: Average molecules per contig
This section accurately computes and displays the average number of molecules per contig.


197-200: Reads per molecule chart
This chunk has clear logic and easy-to-read variable names. Looks good.


231-238: Total Bases Aligned distribution
This histogram-based approach for visualizing total bases aligned is standard and effective.


Line range hint 265-299: Inferred Molecule Length distributions
The logic for splitting singletons versus non-singletons and computing frequencies is consistent with the prior sections.


Line range hint 304-338: Molecule coverage calculations
No issues found with coverage calculations or the histogram approach for coverage distribution. The code is clear.


376-379: Reading coverage and molecule coverage inputs
Correctly retrieves coverage and molecule coverage file paths from params. No issues detected.


424-431: Value box: Contigs
This straightforward usage of scales::comma for highlighting the number of contigs is good for readability.


435-440: Value box: Intervals
Displaying the window size in kilobases is an intuitive approach.


445-449: Value box: Average Depth
Rounding the average depth to two decimals is reasonable for typical coverage detail levels.


453-458: Value box: Stdev Depth
Stdev Depth is displayed consistently with the average depth. No concerns.


462-467: Value box: Mol. Average Depth
This mirrors the standard depth approach. Looks consistent.


471-476: Value box: Mol. Stdev Depth
The code now references mol_global_sd directly, which correctly addresses any prior confusion.


480-483: Truncation info for depth distribution
These lines only contain text describing the histogram. No additional review needed here.


520-526: Configuration for DT::datatable
The datatable configuration (buttons, paging, CSV export) is standard and functional.


542-551: Depth outliers table
Appropriately displays high-depth intervals using the same datatable approach.


555-556: Combined text logic
Constructing descriptive text with paste0 is clear and ensures parameter-based clarifications.


573-580: Card usage for “Navigating the Plot”
This block is purely explanatory text. The usage of Quarto’s card structure is consistent with the rest of the report.


582-584: Filtering contigs
Logic for limiting to 30 largest contigs or user-specified ones is straightforward and flexible.


32-34: Verify parameter existence throughout the codebase
Using params$bxstats, params$sample, params$mol_dist is correct. Make sure these parameters are always defined externally.

✅ Verification successful

Parameters are properly defined and used correctly
The parameters params$bxstats, params$sample, and params$mol_dist are well-defined in the YAML header with appropriate default values and are used consistently throughout the document. Their usage is properly documented in the supporting information section.


🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

```shell
#!/bin/bash
# Verify that references to snakemake@ are fully removed (since we migrated to params).
rg "snakemake@" -A 3

Length of output: 1631


Script:

#!/bin/bash
# Check the YAML header and first few lines of align_stats.qmd
head -n 20 harpy/reports/align_stats.qmd

# Search for usage of these specific parameters
rg "params\$bxstats|params\$sample|params\$mol_dist" -A 2

Length of output: 591


Script:

#!/bin/bash
# Find all Quarto documents
echo "=== Quarto documents ==="
fd -e qmd

# Search with more lenient patterns
echo -e "\n=== Parameter usage ==="
rg -t r "params\\\$(bxstats|sample|mol_dist)" -A 2

Length of output: 567


Script:

#!/bin/bash
# Check the YAML header of align_bxstats.qmd
echo "=== align_bxstats.qmd header ==="
head -n 20 harpy/reports/align_bxstats.qmd

# Search with simpler patterns
echo -e "\n=== Parameter usage ==="
rg "params\." -A 2 harpy/reports/

Length of output: 716


Script:

#!/bin/bash
# View the full content of align_stats.qmd
echo "=== Full content of align_stats.qmd ==="
cat harpy/reports/align_stats.qmd

# Search for R code blocks using these parameters
echo -e "\n=== R code blocks with parameters ==="
rg -U "```{r[^\}]}.?```" -A 5 harpy/reports/align_stats.qmd

Length of output: 24652

resources/dev_install_conda.sh Show resolved Hide resolved
resources/dev_install_conda.sh Show resolved Hide resolved
resources/dev_install_pixi.sh Show resolved Hide resolved
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 3

🧹 Nitpick comments (9)
harpy/_printing.py (2)

87-88: Consider adding a newline after the rule.

The implementation looks good and provides better visual separation. Consider adding console.print("") after the rule for better spacing between sections.

 console.rule(f"[bold]harpy {title}", style = "light_steel_blue"),
+console.print("")
 console.print(text)

90-91: Update docstring to document all parameters.

The docstring should be updated to include information about all parameters and their types.

 def print_onsuccess(outdir, summary = None, time = None):
-    """Print a green panel with success text. To be used in place of onsuccess: inside a snakefile"""
+    """Print a green panel with success text and runtime information.
+
+    Args:
+        outdir: Output directory path
+        summary: Optional summary file name
+        time: Optional timedelta object containing workflow runtime
+
+    To be used in place of onsuccess: inside a snakefile.
+    """
harpy/snakefiles/snp_mpileup.smk (1)

230-237: Add error handling for configuration file copy.

The rule copies the Quarto configuration file but doesn't handle potential errors if the input file is missing.

Consider using this more robust implementation:

 rule report_config:
     input:
         f"{outdir}/workflow/report/_quarto.yml"
     output:
         f"{outdir}/reports/_quarto.yml"
     shell:
-        "cp {input} {output}"
+        """
+        if [ ! -f {input} ]; then
+            echo "Error: Quarto configuration file not found at {input}" >&2
+            exit 1
+        fi
+        cp {input} {output}
+        """
harpy/bin/molecule_coverage.py (5)

28-28: Enhance window parameter validation and documentation.

While the basic validation is good, consider these improvements:

  1. Validate that the window size is positive (not just non-zero)
  2. Add more descriptive help text explaining the window parameter's purpose and impact
  3. Update the usage string to reflect the new required parameter
-parser.add_argument('-w', '--window', required = True, type = int, help = "Window size (in bp) to sum depths over")
+parser.add_argument('-w', '--window', required = True, type = int, 
+    help = "Window size (in bp) to sum depths over. Controls the resolution of the coverage analysis.")

-if args.window == 0:
-    parser.error("--window must be greater than 0")
+if args.window <= 0:
+    parser.error("--window must be a positive integer")

Also update the usage string to:

-usage = "molecule_coverage.py -f genome.fasta.fai statsfile > output.cov",
+usage = "molecule_coverage.py -f genome.fasta.fai -w window_size statsfile > output.cov",

Also applies to: 42-43
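An alternative, hedged sketch: push the positivity check into an argparse type so validation happens during parsing (the helper name positive_int is illustrative):

import argparse

def positive_int(value: str) -> int:
    ivalue = int(value)  # argparse reports a ValueError here as a usage error
    if ivalue <= 0:
        raise argparse.ArgumentTypeError(f"window must be a positive integer, got {value!r}")
    return ivalue

parser = argparse.ArgumentParser()
parser.add_argument("-w", "--window", required=True, type=positive_int,
                    help="Window size (in bp) to sum depths over")
print(parser.parse_args(["-w", "250"]).window)  # 250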


57-63: Add error handling for database operations.

The database initialization lacks error handling for potential SQLite errors. Consider wrapping the operations in try-except blocks.

 def initialize_contig(length):
     """Initialize the database table to feature all positions of the contig"""
-    cursor.executemany(
-        'INSERT INTO number_counts (number, count) VALUES (?, ?)',
-        [(i, 0) for i in range(1, length + 1)]
-    )
+    try:
+        cursor.executemany(
+            'INSERT INTO number_counts (number, count) VALUES (?, ?)',
+            [(i, 0) for i in range(1, length + 1)]
+        )
+    except sqlite3.Error as e:
+        parser.error(f"Database error during initialization: {e}")

71-98: Optimize window calculations for large datasets.

The current implementation of windowed depth calculation could be optimized:

  1. For large windows, the current approach might be memory-intensive due to fetching all rows at once
  2. The division operation for the last window could be simplified
  3. Consider using a cursor iterator for large result sets
 def print_windowed_depth(contig, window):
     """Query all rows in the table to get counts for all position. If window > 1, will sum across intervals"""
     if args.window == 1:
-        cursor.execute('SELECT number, count FROM number_counts')
-        for position,count in cursor.fetchall():
+        # Use iterator for memory efficiency
+        for position, count in cursor.execute('SELECT number, count FROM number_counts'):
             sys.stdout.write(f"{contig}\t{position}\t{count}\n")
     else:
         # build the query, creating windows of a specific size and grouping by said windows
         query = f'''
             SELECT
                 (ROWID - 1) / {window} AS window,  -- Create a window group
-                SUM(count) AS total_count    -- Sum the counts in each window
+                SUM(count) AS total_count,   -- Sum the counts in each window
+                COUNT(*) AS window_size      -- Get actual window size
             FROM number_counts
             GROUP BY window
             ORDER BY window;
         '''
         cursor.execute(query)
         contig_end = contigs[contig]
-        # Fetch and print the results
-        for win,depth in cursor.fetchall():
+        # Process results in chunks for memory efficiency
+        while True:
+            rows = cursor.fetchmany(1000)
+            if not rows:
+                break
+            for win, depth, size in rows:
                 actual_window = win * window
-                # correction for the last window, which is usually shorter than the full size
-                if (win+1) * window > contig_end:
-                    depth /= (contig_end - actual_window)
-                else:
-                    depth /= window
+                depth /= size  # Use actual window size for division
                 sys.stdout.write(f"{contig}\t{actual_window}\t{depth}\n")

124-124: Add error handling for file operations.

The script should handle potential I/O errors when writing output.

-                print_windowed_depth(LASTCONTIG, args.window)
+                try:
+                    print_windowed_depth(LASTCONTIG, args.window)
+                except IOError as e:
+                    parser.error(f"Error writing output: {e}")

Also applies to: 132-133


Line range hint 1-146: Overall implementation review: Good memory optimization with room for performance improvements.

The transition to SQLite-based implementation aligns well with the PR's objective of reducing RAM usage. The code is well-structured and maintainable. The suggested optimizations above would further enhance performance for large datasets while maintaining the memory efficiency benefits.

Consider adding logging to track processing progress and performance metrics, especially useful for large datasets.
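A minimal sketch of what such logging could look like (contig names and sizes are placeholders):

import logging
import time

logging.basicConfig(level=logging.INFO, format="%(asctime)s %(levelname)s %(message)s")
for contig, length in [("contig_1", 1_000_000), ("contig_2", 750_000)]:
    t0 = time.monotonic()
    # ... initialize the table, process alignments, print windowed depth ...
    logging.info("%s (%d bp) processed in %.2fs", contig, length, time.monotonic() - t0)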

harpy/snakefiles/sv_naibr_phase.smk (1)

282-287: Add error handling for Quarto rendering

Consider adding error handling for Quarto rendering failures. The current implementation might not clearly indicate rendering failures in the Snakemake workflow.

Here's a suggested improvement:

     shell:
         """
         cp {input.qmd} {output.qmd}
         FAIDX=$(realpath {input.faidx})
         BEDPE=$(realpath {input.bedpe})
-        quarto render {output.qmd} --log {log} --quiet -P faidx:$FAIDX -P bedpe:$BEDPE {params}
+        if ! quarto render {output.qmd} --log {log} --quiet -P faidx:$FAIDX -P bedpe:$BEDPE {params}; then
+            echo "Error: Quarto rendering failed. Check the log file at {log}" >&2
+            exit 1
+        fi
         """
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 5e850bc and 035197f.

📒 Files selected for processing (7)
  • harpy/_launch.py (4 hunks)
  • harpy/_printing.py (2 hunks)
  • harpy/bin/molecule_coverage.py (5 hunks)
  • harpy/container.py (1 hunks)
  • harpy/snakefiles/snp_freebayes.smk (1 hunks)
  • harpy/snakefiles/snp_mpileup.smk (1 hunks)
  • harpy/snakefiles/sv_naibr_phase.smk (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (3)
  • harpy/container.py
  • harpy/_launch.py
  • harpy/snakefiles/snp_freebayes.smk
⏰ Context from checks skipped due to timeout of 90000ms (1)
  • GitHub Check: Build/Cache Container
🔇 Additional comments (7)
harpy/_printing.py (2)

5-7: LGTM! Well-organized imports and console initialization.

The new imports from the rich library and the console initialization are properly structured for the enhanced formatting features.

Also applies to: 10-10


98-105: LGTM! Well-structured table implementation.

The table implementation provides a clean and organized way to display the workflow results. Good use of styling and conditional handling of the summary information.

harpy/snakefiles/snp_mpileup.smk (1)

230-257: The Quarto integration looks good overall!

The changes successfully implement the transition from RMarkdown to Quarto as intended in the PR objectives. The workflow maintains its functionality while modernizing the reporting system.

Let's verify the Quarto template and configuration files are present:

✅ Verification successful

Quarto integration successfully verified ✅

The implementation includes a comprehensive Quarto template with modern dashboard features and proper error handling, along with a complete configuration file that correctly sets up the dashboard format, branding, and navigation.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify Quarto files exist and have correct content

# Check for Quarto template
echo "Checking for Quarto template..."
fd -t f "bcftools_stats.qmd" -X cat

# Check for Quarto config
echo "Checking for Quarto config..."
fd -t f "_quarto.yml" -X cat

Length of output: 12926

harpy/snakefiles/sv_naibr_phase.smk (4)

Line range hint 79-89: Well-structured genome indexing rule!

The rule follows Snakemake best practices with proper input/output/log specifications and uses the correct samtools command for genome indexing.


194-195: Good logging implementation!

Proper use of temporary files for intermediate outputs and consistent logging pattern.


257-264: Clean implementation of Quarto configuration!

The rule properly handles the Quarto configuration file copying, aligning with the PR's objective to transition from RMarkdown to Quarto.


265-287: Verify Quarto dependencies in conda environment

While the rule implementation looks good, ensure that the conda environment r.yaml includes all necessary Quarto dependencies, not just R packages.

Run this script to check the conda environment specification:

Repository owner deleted a comment from coderabbitai bot Jan 7, 2025
Repository owner deleted a comment from coderabbitai bot Jan 7, 2025
Repository owner deleted a comment from coderabbitai bot Jan 7, 2025
Repository owner deleted a comment from coderabbitai bot Jan 7, 2025
@coderabbitai coderabbitai bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (1)
harpy/_printing.py (1)

81-96: Consider simplifying time formatting

The time formatting logic could be simplified using divmod for better readability.

Consider this alternative implementation:

-    hours = seconds // 3600
-    minutes = (seconds % 3600) // 60
-    seconds = seconds % 60
+    hours, remainder = divmod(seconds, 3600)
+    minutes, seconds = divmod(remainder, 60)
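A quick sanity check of the divmod approach with a concrete timedelta (self-contained Python):

from datetime import timedelta

elapsed = timedelta(hours=1, minutes=23, seconds=45)
seconds = int(elapsed.total_seconds())
hours, remainder = divmod(seconds, 3600)
minutes, seconds = divmod(remainder, 60)
print(f"{hours:02d}:{minutes:02d}:{seconds:02d}")  # 01:23:45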
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 035197f and bdda55e.

📒 Files selected for processing (2)
  • harpy/_launch.py (5 hunks)
  • harpy/_printing.py (2 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • harpy/_launch.py
🧰 Additional context used
🪛 Ruff (0.8.2)
harpy/_printing.py

79-79: f-string without any placeholders

Remove extraneous f prefix

(F541)


100-100: f-string without any placeholders

Remove extraneous f prefix

(F541)


102-102: f-string without any placeholders

Remove extraneous f prefix

(F541)

🔇 Additional comments (2)
harpy/_printing.py (2)

5-10: LGTM! Good use of rich library components

The new imports and console setup follow best practices for rich library usage.


66-67: LGTM! Enhanced visual presentation

Good improvement using console.rule for better visual hierarchy and consistency.
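A minimal, self-contained sketch of that usage; the rule text and style string are taken from the review excerpt, while the title is a placeholder:

from rich.console import Console

console = Console()
console.rule("[bold]harpy align", style="light_steel_blue")
console.print("Workflow finished successfully")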

@pdimens pdimens merged commit 66c7f77 into main Jan 8, 2025
18 of 20 checks passed
@pdimens pdimens deleted the quarto branch January 8, 2025 05:32