Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix param call #164

Merged
merged 37 commits into from
Nov 19, 2024
Merged

fix param call #164

merged 37 commits into from
Nov 19, 2024

Conversation

pdimens
Copy link
Owner

@pdimens pdimens commented Nov 15, 2024

Summary by CodeRabbit

  • New Features

    • Enhanced argument parsing for barcode length to ensure it is treated as an integer.
    • Streamlined the HaploSim.pl script to focus solely on creating linked reads from provided haplotypes.
    • Introduced a quiet mode option across various simulation commands to suppress output messages.
    • Added a new parameter to haplotag_barcodes.py for specifying the number of barcodes to print.
    • Updated Snakemake workflow for simulating linked reads, enhancing clarity and organization.
    • Implemented a new script, HaploSim.pl, for generating linked reads from haplotypes.
  • Bug Fixes

    • Adjusted error checks for output prefix and barcodes file to prevent potential issues.
    • Improved error handling for invalid barcode entries and duplicate barcodes.
  • Documentation

    • Added detailed logging throughout the read simulation process for better insights.
    • Updated usage instructions for various scripts to reflect new parameters and functionality.

Copy link

coderabbitai bot commented Nov 15, 2024

📝 Walkthrough

Walkthrough

The pull request introduces several modifications across multiple scripts within the Harpy framework. Key changes include updating the --length argument in inline_to_haplotag.py from str to int, enhancing type safety for barcode length. The barcode reading logic is also modified to handle two entries per line in the barcode file. In LRSIM_harpy.pl, the script is restructured to focus on creating linked reads from two haplotypes, with command-line options adjusted and error handling improved. Additional updates include modifications to validation functions, workflow configurations, and environment setup scripts.

Changes

File Path Change Summary
harpy/bin/inline_to_haplotag.py Changed --length argument type from str to int; updated barcode reading logic and error handling.
harpy/scripts/LRSIM_harpy.pl Removed previous functionalities; streamlined to create linked reads from haplotypes; adjusted command-line options; improved error handling.
.github/workflows/tests.yml Updated barcode generation command in the workflow; replaced file decompression with direct generation.
.gitignore Added entry to ignore haplotag.bc files.
harpy/_validations.py Removed check_demux_fastq; enhanced validate_barcodefile and validate_fastq_bx functions; simplified error handling.
harpy/align.py Updated ema function to pass an additional argument for barcode validation.
harpy/bin/haplotag_barcodes.py Modified argument parser to include -n for specifying the number of barcodes to print; updated output logic.
harpy/simulate.py Added quiet parameter to several functions to control output verbosity during simulations.
harpy/snakefiles/simulate_linkedreads.smk Renamed several rules for clarity; updated input/output specifications and logging paths.
resources/createharpyenv.sh Simplified environment creation command to consistently use conda.

Possibly related PRs

  • haplotagging barcodes as default #162: The changes in this PR involve modifications to inline_to_haplotag.py, which is directly related to the main PR's updates to the same file, enhancing barcode processing and argument handling.
  • Leverage heterozygosity to create diploid genomes #140: This PR also modifies harpy/simulate.py, which is relevant as the main PR updates the barcode processing logic in inline_to_haplotag.py, potentially affecting how barcodes are handled in simulations.
  • better sim demux support #163: This PR includes changes to the simulate_linkedreads workflow, which is relevant as it may interact with the barcode processing logic introduced in the main PR.

Thank you for using CodeRabbit. We offer it for free to the OSS community and would appreciate your support in helping us grow. If you find it useful, would you consider giving us a shout-out on your favorite social media?

❤️ Share
🪧 Tips

Chat

There are 3 ways to chat with CodeRabbit:

  • Review comments: Directly reply to a review comment made by CodeRabbit. Example:
    • I pushed a fix in commit <commit_id>, please review it.
    • Generate unit testing code for this file.
    • Open a follow-up GitHub issue for this discussion.
  • Files and specific lines of code (under the "Files changed" tab): Tag @coderabbitai in a new review comment at the desired location with your query. Examples:
    • @coderabbitai generate unit testing code for this file.
    • @coderabbitai modularize this function.
  • PR comments: Tag @coderabbitai in a new PR comment to ask questions about the PR branch. For the best results, please provide a very specific query, as very limited context is provided in this mode. Examples:
    • @coderabbitai gather interesting stats about this repository and render them as a table. Additionally, render a pie chart showing the language distribution in the codebase.
    • @coderabbitai read src/utils.ts and generate unit testing code.
    • @coderabbitai read the files in the src/scheduler package and generate a class diagram using mermaid and a README in the markdown format.
    • @coderabbitai help me debug CodeRabbit configuration file.

Note: Be mindful of the bot's finite context window. It's strongly recommended to break down tasks such as reading entire modules into smaller chunks. For a focused discussion, use review comments to chat about specific files and their changes, instead of using the PR comments.

CodeRabbit Commands (Invoked using PR comments)

  • @coderabbitai pause to pause the reviews on a PR.
  • @coderabbitai resume to resume the paused reviews.
  • @coderabbitai review to trigger an incremental review. This is useful when automatic reviews are disabled for the repository.
  • @coderabbitai full review to do a full review from scratch and review all the files again.
  • @coderabbitai summary to regenerate the summary of the PR.
  • @coderabbitai resolve resolve all the CodeRabbit review comments.
  • @coderabbitai configuration to show the current CodeRabbit configuration for the repository.
  • @coderabbitai help to get help.

Other keywords and placeholders

  • Add @coderabbitai ignore anywhere in the PR description to prevent this PR from being reviewed.
  • Add @coderabbitai summary to generate the high-level summary at a specific location in the PR description.
  • Add @coderabbitai anywhere in the PR title to generate the title automatically.

CodeRabbit Configuration File (.coderabbit.yaml)

  • You can programmatically configure CodeRabbit by adding a .coderabbit.yaml file to the root of your repository.
  • Please see the configuration documentation for more information.
  • If your editor has YAML language server enabled, you can add the path at the top of this file to enable auto-completion and validation: # yaml-language-server: $schema=https://coderabbit.ai/integrations/schema.v2.json

Documentation and Community

  • Visit our Documentation for detailed information on how to use CodeRabbit.
  • Join our Discord Community to get help, request features, and share feedback.
  • Follow us on X/Twitter for updates and announcements.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Outside diff range and nitpick comments (2)
harpy/bin/inline_to_haplotag.py (2)

18-18: Add validation for positive barcode length.

The type change from str to int improves type safety. However, consider adding validation to ensure the barcode length is positive to prevent potential issues with negative string slicing.

Consider adding a custom type validator:

def positive_int(value):
    ivalue = int(value)
    if ivalue <= 0:
        raise argparse.ArgumentTypeError(f"Barcode length must be positive, got {value}")
    return ivalue

# Use in argument parser
parser.add_argument("-l", "--length", required=True, type=positive_int, help="Length of the barcodes (all must be one length)")

Line range hint 63-65: Add bounds checking for sequence length.

The code assumes the sequence is at least as long as the barcode length. Consider adding validation to prevent index errors.

Add length validation in the process_record function:

 if fw_entry:
+    if len(fw_entry[1]) < bc_len:
+        raise ValueError(f"Forward read sequence length ({len(fw_entry[1])}) is shorter than barcode length ({bc_len})")
     bc_inline = fw_entry[1][:bc_len]
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between a86e607 and e96e3f4.

📒 Files selected for processing (2)
  • harpy/bin/inline_to_haplotag.py (1 hunks)
  • harpy/scripts/LRSIM_harpy.pl (1 hunks)
🔇 Additional comments (1)
harpy/bin/inline_to_haplotag.py (1)

18-18: Verify compatibility with existing usage.

The type change might affect how this script is called from other tools or scripts.

✅ Verification successful

The previous search might have been too specific. Let's broaden our search to find any references to the script and examine the repository structure better to make an informed decision.


We found references to the script in a Snakefile and README.md. Let's examine their contents to understand how the script is being used and verify if the type change impacts existing usage.


No compatibility issues with type change

The type change for the --length parameter is safe because:

  1. The script is only called from the Snakefile simulate_linkedreads.smk where it's used with {params.bc_len} which is derived from barcode_len config parameter.
  2. The parameter is used internally to slice strings and validate barcodes, where integer type is appropriate.
  3. No documentation or test files show examples of usage that would be affected by this type enforcement.
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Check for any documentation or test files that might need updating
echo "Checking for documentation and test files..."
rg -l "inline_to_haplotag.py.*--length" 

# Check for any shell scripts or other files that might be calling this script
echo "Checking for script usage..."
rg -l "inline_to_haplotag.py.*-l ['\"].*['\"]"

Length of output: 260


Script:

#!/bin/bash
# Let's try a broader search to find any references to this script
echo "Checking for any references to the script..."
rg -l "inline_to_haplotag\.py"

# Also check for any test files or examples
echo "Checking for test files..."
fd -e py "test.*\.py$"

# Look for documentation files
echo "Checking documentation..."
fd -e md -e rst

Length of output: 382


Script:

#!/bin/bash
# Let's check the content of the Snakefile that references our script
echo "Checking simulate_linkedreads.smk content..."
cat harpy/snakefiles/simulate_linkedreads.smk

# Check README.md for any usage examples
echo "Checking README content..."
cat README.md

# Let's also look at the script itself to understand its usage context
echo "Checking script content..."
cat harpy/bin/inline_to_haplotag.py

Length of output: 16033

harpy/scripts/LRSIM_harpy.pl Outdated Show resolved Hide resolved
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

🧹 Outside diff range and nitpick comments (13)
resources/createharpyenv.sh (1)

Line range hint 1-10: Add input validation and error handling

The script should validate its inputs and handle potential errors.

Consider adding these safety checks:

 #! /usr/bin/env bash
+set -euo pipefail
 
 ## Use the first positional argument to set a name, usually `harpy` or `harpytest`
+if [ $# -eq 0 ]; then
+    echo "Error: Environment name not provided" >&2
+    echo "Usage: $0 <env_name>" >&2
+    exit 1
+fi
+
+YAML_FILE="resources/harpy.yaml"
+if [ ! -f "$YAML_FILE" ]; then
+    echo "Error: Configuration file $YAML_FILE not found" >&2
+    exit 1
+fi
 
 if ! command -v conda &> /dev/null; then
     echo "Error: conda is not installed or not in PATH" >&2
     exit 1
 fi
 
-conda env create -n $1 -f resources/harpy.yaml
+conda env create -n "$1" -f "$YAML_FILE"

These changes:

  1. Enable strict error handling with set -euo pipefail
  2. Validate that an environment name is provided
  3. Check if the yaml file exists
  4. Quote variables to handle spaces in paths
harpy/bin/haplotag_barcodes.py (2)

9-12: Consider enhancing the usage and help text.

The argument parser changes look good, but consider making the usage and help text more descriptive:

  • Usage could specify that it outputs one barcode per line
  • Help text could mention that barcodes are concatenated ACBD sequences
-    usage = "haplotag_barcodes.py [-n] > barcodes.txt"
+    usage = "haplotag_barcodes.py [-n NUMBER] > barcodes.txt # outputs one barcode per line"
-    help = "How many barcodes to print (min: 1, max: 84934656)")
+    help = "Number of concatenated ACBD barcodes to print (min: 1, max: 84934656)")

15-19: Enhance error handling and documentation.

The validation logic is correct, but consider these improvements:

  1. Add more context to the error message
  2. Better document the zero-indexing adjustment
 if args.number < 1 or args.number > 96**4:
-    parser.error("--number must be between 1 and 96^4 (84934656)")
+    parser.error(f"--number must be between 1 and 96^4 (84934656), got {args.number}")
 
-# subtract 1 b/c of python 0-indexing
+# Subtract 1 from args.number because enumerate in the loop below is zero-indexed
+# This ensures we generate exactly the number of barcodes requested
 args.number -= 1
harpy/bin/inline_to_haplotag.py (1)

78-78: Remove unnecessary f-string

The f-string prefix is not needed as there are no placeholders in the string.

-        rv_entry[0] = rv_entry[0].split()[0] + f"\tBX:Z:A00C00B00D00"
+        rv_entry[0] = rv_entry[0].split()[0] + "\tBX:Z:A00C00B00D00"
🧰 Tools
🪛 Ruff

78-78: f-string without any placeholders

Remove extraneous f prefix

(F541)

harpy/snakefiles/simulate_linkedreads.smk (2)

Line range hint 56-75: Consider making the read pairs multiplier configurable.

The hardcoded multiplier 500000 in the read pairs calculation should be moved to the configuration for better flexibility:

readpairs = int(config["read_pairs"] * config.get("read_pairs_multiplier", 500000))

Line range hint 86-115: Consider consolidating LRSIM parameters.

The large number of individual parameters in make_linked_reads could be consolidated into a configuration object for better maintainability:

params:
    lrsim_config = config.get("lrsim", {}),
    cmd = lambda w, p: (
        f"perl {p.lrsim} "
        f"-g {p.lrsim_config.get('input_files')} "
        f"-b {p.lrsim_config.get('barcodes')} "
        # ... other parameters
    )
harpy/align.py (1)

213-213: LGTM! Consider using named parameters for better readability.

The parameter fix looks good. However, consider using named parameters to make the code more explicit and maintainable:

-        validate_barcodefile(barcode_list, False, quiet)
+        validate_barcodefile(barcode_list, length_limit=False, quiet=quiet)
harpy/scripts/LRSIM_harpy.pl (3)

Line range hint 37-74: Enhance input validation for command-line options

The current validation checks are incomplete. Consider adding:

  1. Validation for negative values in numeric parameters (e.g., -d, -l, -x, -f, -t, -m)
  2. File existence checks for input haplotypes specified via -g
  3. More descriptive error messages including valid value ranges
 die "Please provide a output prefix with -p\n" if ( not defined $opts{p} );
 die "Output prefix (-p) cannot end with a /\n" if ( $opts{p} =~ /\/$/ );
 die "Please provide a barcodes file with -b\n" if ( not defined $opts{b} );
 die "Barcodes file $opts{b} not found\n"       if ( !-s "$opts{b}" );
+die "Please provide input haplotypes with -g\n" if ( not defined $opts{g} );
+foreach my $hap (split /,/, $opts{g}) {
+    die "Haplotype file $hap not found or empty\n" if ( !-s $hap );
+}
+die "Number of molecules (-m) must be positive\n" if ( $opts{m} <= 0 );
+die "Number of partitions (-t) must be positive\n" if ( $opts{t} <= 0 );
+die "Fragment size (-f) must be positive\n" if ( $opts{f} <= 0 );
+die "Number of read pairs (-x) must be positive\n" if ( $opts{x} <= 0 );
+die "Barcode length (-l) must be at least 6\n" if ( $opts{l} < 6 );

79-158: Optimize barcode error rate lookup tables

The current implementation has several inefficiencies:

  1. Hardcoded error rate tables take up significant space
  2. Entries 16-24 contain duplicate data
  3. Tables extend beyond typical barcode length

Consider:

  1. Moving the error rate tables to a separate configuration file
  2. Implementing a function to generate error rates for positions beyond 15
  3. Adding validation to ensure barcode length doesn't exceed maximum supported length
# Example implementation for a separate config file
sub load_error_rates {
    my $config_file = shift;
    my %rates;
    # Load base rates (0-15) from config file
    # Generate rates for positions > 15 using the pattern from position 15
    return %rates;
}

Line range hint 1-1: Consider performance optimizations for large-scale operations

The script processes large amounts of data and performs repeated file operations. Consider the following optimizations:

  1. Implement memory-mapped files for large data structures
  2. Add caching for frequently accessed barcode patterns
  3. Consider parallel processing for read simulation across multiple haplotypes

Example implementation for memory mapping:

use Sys::Mmap;

# Map the barcode file into memory
my $mapped_data;
mmap($mapped_data, 0, PROT_READ, MAP_SHARED, $fh) 
    or die "Unable to mmap: $!";
harpy/simulate.py (1)

169-169: Document the purpose of the boolean parameter.

The validate_barcodefile function is called with a boolean parameter True, but its purpose is not immediately clear. Consider adding a comment or using a named parameter to improve code readability.

harpy/_validations.py (2)

354-355: Simplify loop indexing using enumerate()

Consider using enumerate() to simplify the loop and make the code cleaner.

Apply this diff to refactor the loop:

-            idx = 0
-            for line in fin:
-                idx += 1
+            for idx, line in enumerate(fin, start=1):
🧰 Tools
🪛 Ruff

355-355: Use enumerate() for index variable idx in for loop

(SIM113)


475-476: Rename unused loop variable to _

The loop variable future is not used within the loop body. Renaming it to _ indicates that it is intentionally unused.

Apply this diff:

-        for future in as_completed(futures):
+        for _ in as_completed(futures):
🧰 Tools
🪛 Ruff

475-475: Loop control variable future not used within loop body

Rename unused future to _future

(B007)

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between e96e3f4 and d6d10de.

📒 Files selected for processing (10)
  • .github/workflows/tests.yml (1 hunks)
  • .gitignore (1 hunks)
  • harpy/_validations.py (6 hunks)
  • harpy/align.py (1 hunks)
  • harpy/bin/haplotag_barcodes.py (1 hunks)
  • harpy/bin/inline_to_haplotag.py (2 hunks)
  • harpy/scripts/LRSIM_harpy.pl (7 hunks)
  • harpy/simulate.py (1 hunks)
  • harpy/snakefiles/simulate_linkedreads.smk (8 hunks)
  • resources/createharpyenv.sh (1 hunks)
✅ Files skipped from review due to trivial changes (1)
  • .gitignore
🧰 Additional context used
🪛 Ruff
harpy/_validations.py

355-355: Use enumerate() for index variable idx in for loop

(SIM113)


475-475: Loop control variable future not used within loop body

Rename unused future to _future

(B007)

harpy/bin/inline_to_haplotag.py

78-78: f-string without any placeholders

Remove extraneous f prefix

(F541)

🔇 Additional comments (15)
resources/createharpyenv.sh (1)

7-7: Verify yaml file location and environment creation

Let's verify the yaml file location and check for existing environments.

✅ Verification successful

The yaml file path and conda environment creation are correctly configured

The verification confirms:

  • harpy.yaml exists in the correct location (resources/harpy.yaml)
  • The yaml file is in the same directory as the script (createharpyenv.sh)
  • The yaml file contains valid conda environment configuration with appropriate dependencies
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify yaml file location and conda environment setup

# Check if harpy.yaml exists relative to script location
echo "Checking yaml file location..."
fd -t f "harpy.yaml$"

# Check for references to conda environment creation in other files
echo -e "\nChecking conda environment references..."
rg -l "conda.*env.*create" --type-add 'yaml:*.{yml,yaml}' -t yaml

Length of output: 280


Script:

#!/bin/bash
# Let's verify the content of harpy.yaml and check its structure
echo "Content of resources/harpy.yaml:"
cat resources/harpy.yaml

# Check if the script and yaml are in the same directory
echo -e "\nChecking relative locations:"
ls -la resources/

Length of output: 831

harpy/bin/haplotag_barcodes.py (1)

21-24: Verify completeness of barcode sequences.

Let's verify that all barcode sequences in the BX dictionary are valid and complete.

✅ Verification successful

Let me try a different approach to verify the sequences since the previous script had issues with the regex pattern.


Let me try one more verification script with a simpler approach to parse the sequences correctly:


Barcode sequences verified successfully

The verification confirms that all barcode sequences are valid and complete:

  • All sequences are exactly 6 nucleotides long and contain only valid nucleotides (ACGT)
  • Each key (A, B, C, D) has 96 unique sequences with no duplicates
  • No invalid characters or malformed sequences were found
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify barcode sequences

# Check if all sequences have the same length (should be 6)
echo "Checking sequence lengths..."
for key in A C B D; do
    echo "Key: $key"
    rg -o '\["([^"]+)"' "harpy/bin/haplotag_barcodes.py" | \
    rg -o '"([^"]+)"' | \
    rg -o '[^"]+' | \
    awk -v key="$key" 'BEGIN{print "Sequence lengths for " key ":"}length($0)!=6{print "Invalid length: " length($0) " for sequence: " $0}'
done

# Check for duplicate sequences within each key
echo -e "\nChecking for duplicates..."
for key in A C B D; do
    echo "Key: $key"
    rg -o '\["([^"]+)"' "harpy/bin/haplotag_barcodes.py" | \
    rg -o '"([^"]+)"' | \
    rg -o '[^"]+' | \
    sort | uniq -d
done

Length of output: 2439


Script:

#!/bin/bash
# Description: Verify barcode sequences with improved pattern matching

# Extract and analyze sequences for each key
for key in A C B D; do
    echo "=== Analysis for key $key ==="
    # Extract sequences for the specific key
    rg "\"$key\": \[(.*?)\]" -U harpy/bin/haplotag_barcodes.py -o --json | \
    jq -r '.data.submatches[0].match' | \
    tr ',' '\n' | \
    tr -d ' "' | \
    grep -v '^$' | \
    while read seq; do
        # Check length
        if [ ${#seq} != 6 ]; then
            echo "Invalid length (${#seq}): $seq"
        fi
        # Check if sequence contains only valid nucleotides
        if [[ ! $seq =~ ^[ACGT]+$ ]]; then
            echo "Invalid nucleotides: $seq"
        fi
    done
    
    # Count total sequences
    echo "Total sequences:"
    rg "\"$key\": \[(.*?)\]" -U harpy/bin/haplotag_barcodes.py -o --json | \
    jq -r '.data.submatches[0].match' | \
    tr ',' '\n' | \
    tr -d ' "' | \
    grep -v '^$' | wc -l
    
    # Check for duplicates
    echo "Duplicates:"
    rg "\"$key\": \[(.*?)\]" -U harpy/bin/haplotag_barcodes.py -o --json | \
    jq -r '.data.submatches[0].match' | \
    tr ',' '\n' | \
    tr -d ' "' | \
    grep -v '^$' | \
    sort | uniq -d
done

Length of output: 18311


Script:

#!/bin/bash
# Description: Verify barcode sequences with simplified parsing

for key in A C B D; do
    echo "=== Analysis for key $key ==="
    
    # Extract sequences more reliably using a simpler pattern
    rg "\"$key\":\s*\[(.*?)\]" harpy/bin/haplotag_barcodes.py -o | \
    grep -o '"[A-Z]\{6\}"' | \
    tr -d '"' | \
    while read -r seq; do
        # Verify sequence is exactly 6 nucleotides and contains only ACGT
        if [[ ! $seq =~ ^[ACGT]{6}$ ]]; then
            echo "Invalid sequence: $seq"
        fi
    done
    
    # Count unique sequences
    echo "Total unique sequences:"
    rg "\"$key\":\s*\[(.*?)\]" harpy/bin/haplotag_barcodes.py -o | \
    grep -o '"[A-Z]\{6\}"' | \
    tr -d '"' | \
    sort -u | wc -l
    
    # Find duplicates
    echo "Duplicates:"
    rg "\"$key\":\s*\[(.*?)\]" harpy/bin/haplotag_barcodes.py -o | \
    grep -o '"[A-Z]\{6\}"' | \
    tr -d '"' | \
    sort | uniq -d
done

Length of output: 19699

harpy/bin/inline_to_haplotag.py (3)

66-72: Improved string handling implementation

The changes to use join() for constructing the entries are a good improvement:

  • More efficient than string concatenation
  • Improved readability
  • Consistent approach across both forward and reverse read handling

Also applies to: 78-79


Line range hint 1-112: Well-structured implementation with good practices

The overall implementation demonstrates several good practices:

  • Efficient handling of large files using iterators
  • Proper error handling and input validation
  • Good separation of concerns between functions
  • Memory-efficient processing of gzipped files
🧰 Tools
🪛 Ruff

78-78: f-string without any placeholders

Remove extraneous f prefix

(F541)


18-18: Verify impact of type change from str to int

The change of --length argument type from str to int is logical but represents a breaking change that could affect existing scripts.

✅ Verification successful

Type change for --length argument is safe

The change from str to int for the --length argument is safe because in the only usage found (harpy/snakefiles/simulate_linkedreads.smk), the parameter is passed as {params.bc_len} which is derived from barcode_len config variable. This variable is already handled as a numeric value in the workflow, as evidenced by its usage with -l parameter in the LRSIM command. No string literals are passed to this argument.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for any scripts in the codebase that might be calling this script
# to verify the impact of the breaking change
rg -l "inline_to_haplotag\.py.*--length.*['\"]" 

# Also check for any workflow or configuration files
rg -l "inline_to_haplotag\.py.*-l.*['\"]"

Length of output: 139


Script:

#!/bin/bash
# Check the content of the Snakemake file to see how inline_to_haplotag.py is called
cat harpy/snakefiles/simulate_linkedreads.smk

Length of output: 7982

harpy/snakefiles/simulate_linkedreads.smk (3)

5-5: LGTM! Good improvements to the setup section.

The additions improve code organization and readability:

  • shutil import supports proper file handling
  • barcode_len configuration centralizes barcode length management
  • genodict mapping improves code clarity for genome file selection

Also applies to: 20-20, 22-22


24-35: Well-implemented file handling with proper compression support!

Good improvements:

  • Efficient handling of both gzipped and uncompressed files
  • Proper use of shutil.copyfileobj for streaming large files
  • Correct symlink handling with path resolution

129-141: Verify impact of output path changes.

The change from '10X' to 'multiplex' in output paths might affect downstream tools expecting the old format.

✅ Verification successful

The previous search was too specific. Let's broaden our search to find any references to these output files and potential downstream dependencies.


Let's examine the actual content of these files to understand if there are any dependencies on these output files and how they're being used.


Path changes from '10X' to 'multiplex' are safe to proceed with

After examining the codebase, the change from '10X' to 'multiplex' in the output paths is safe because:

  1. The output files from extract_linked_reads rule are immediately consumed by the demultiplex_barcodes rule in the same workflow
  2. No other rules or files in the codebase reference the old '10X' path format
  3. The final outputs of the workflow are in a different location (/sim_hap{hap}.R{1,2}.fq.gz), so downstream dependencies are not affected
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for potential dependencies on the old '10X' path format
rg -l '10X.*multiplex.*fastq'

Length of output: 31


Script:

#!/bin/bash
# Search for any references to the output files or paths containing 'multiplex'
rg -l "multiplex.*fastq"

# Search for rules that might consume these output files
ast-grep --pattern 'rule $_ {
  input:
    $$$multiplex$$$
  $$$
}'

# Search for potential configuration or parameter files
fd -e yaml -e config -e json

Length of output: 210


Script:

#!/bin/bash
# Let's examine the content of the file to see if there are any downstream rules using these outputs
cat harpy/snakefiles/simulate_linkedreads.smk

# Also check the config files for any relevant paths
cat resources/harpy.yaml
cat resources/meta.yaml

Length of output: 10228

.github/workflows/tests.yml (1)

749-749: LGTM! Dynamic barcode generation improves flexibility.

The change from using a static compressed file to dynamically generating barcodes is a good improvement. It reduces repository size and allows for easier adjustment of barcode counts.

Let's verify the barcode generation script's output format and count:

harpy/scripts/LRSIM_harpy.pl (1)

368-368: LGTM: Barcode length handling

The barcode length is now properly handled using the command-line option.

Note: A previous review already highlighted the need to handle potential negative values in barcode length calculations.

harpy/simulate.py (2)

Line range hint 1-700: Well-structured command implementations!

The code demonstrates consistent patterns across all commands for:

  • Configuration management
  • Error handling
  • Parameter validation
  • Workflow setup

Line range hint 1-700: Verify quiet parameter handling across the codebase.

The quiet parameter has been added consistently to all simulation commands. Let's verify its usage in related functions.

✅ Verification successful

The quiet parameter is properly implemented and consistent across the codebase

The verification shows that:

  • The quiet parameter is consistently defined across all commands with the same default value (False) and help text
  • The launch_snakemake function properly handles the quiet parameter to control output visibility
  • The validate_barcodefile function accepts and uses the quiet parameter
  • The implementation follows a consistent pattern where "if not quiet:" is used to conditionally show output
  • The quiet parameter is integrated uniformly across the entire harpy codebase, not just in the simulate commands

The implementation is robust and follows best practices for controlling verbosity.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check the implementation of quiet parameter in related functions

# Check validate_barcodefile implementation
ast-grep --pattern 'def validate_barcodefile($$$quiet$$$)'

# Check launch_snakemake implementation
ast-grep --pattern 'def launch_snakemake($$$quiet$$$)'

# Check for any TODOs related to quiet parameter
rg -i "todo.*quiet"

Length of output: 139


Script:

#!/bin/bash
# Let's try a broader search for the functions and their implementations

# Search for validate_barcodefile function
rg -A 10 "def validate_barcodefile"

# Search for launch_snakemake function
rg -A 10 "def launch_snakemake"

# Search for any quiet parameter usage
rg "quiet.*=.*False"

# Search for print statements that might be controlled by quiet
rg "if not quiet:"

Length of output: 5173

harpy/_validations.py (3)

172-181: LGTM

The updated code correctly implements multithreading with progress updates.


379-410: FASTA validation enhancements are correct

The improved validation checks for FASTA files are properly implemented and enhance error handling.


478-510: Improvements to barcode file validation are appropriate

The modifications to the validate_barcodefile function improve barcode validation and are correctly implemented.

resources/createharpyenv.sh Outdated Show resolved Hide resolved
harpy/bin/haplotag_barcodes.py Show resolved Hide resolved
harpy/snakefiles/simulate_linkedreads.smk Outdated Show resolved Hide resolved
harpy/scripts/LRSIM_harpy.pl Outdated Show resolved Hide resolved
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 4

🧹 Outside diff range and nitpick comments (14)
harpy/bin/inline_to_haplotag.py (5)

12-12: Remove unnecessary f-string prefix

The usage string doesn't contain any placeholders, so the f-string prefix is not needed.

-    usage = f"inline_to_haplotag.py -f <forward.fq.gz> -r <reverse.fq.gz> -b <barcodes.txt> -p <prefix>",
+    usage = "inline_to_haplotag.py -f <forward.fq.gz> -r <reverse.fq.gz> -b <barcodes.txt> -p <prefix>",
🧰 Tools
🪛 Ruff

12-12: f-string without any placeholders

Remove extraneous f prefix

(F541)


67-67: Remove unnecessary f-string prefix

The string doesn't contain any placeholders, so the f-string prefix is not needed.

-        rv_entry[0] = rv_entry[0].split()[0] + f"\tBX:Z:A00C00B00D00"
+        rv_entry[0] = rv_entry[0].split()[0] + "\tBX:Z:A00C00B00D00"
🧰 Tools
🪛 Ruff

67-67: f-string without any placeholders

Remove extraneous f prefix

(F541)


49-69: Consider extracting the default barcode as a constant

The default barcode "A00C00B00D00" appears multiple times in the code. Consider defining it as a constant at the module level to improve maintainability and reduce the risk of inconsistencies.

+DEFAULT_BARCODE = "A00C00B00D00"
+
 def process_record(fw_entry, rv_entry, barcode_dict, bc_len):
     """convert the barcode to haplotag"""
     # [0] = header, [1] = seq, [2] = +, [3] = qual
     if fw_entry:
         bc_inline = fw_entry[1][:bc_len]
-        bc_hap = barcode_dict.get(bc_inline, "A00C00B00D00")
+        bc_hap = barcode_dict.get(bc_inline, DEFAULT_BARCODE)
         ...
     else:
         # no forward read, therefor no barcode to search for
-        rv_entry[0] = rv_entry[0].split()[0] + "\tBX:Z:A00C00B00D00"
+        rv_entry[0] = rv_entry[0].split()[0] + f"\tBX:Z:{DEFAULT_BARCODE}"
🧰 Tools
🪛 Ruff

67-67: f-string without any placeholders

Remove extraneous f prefix

(F541)


90-90: Remove unnecessary f-string prefix

The string concatenation doesn't need an f-string prefix as it uses string concatenation operator.

-        sys.stderr.write(f"Can only search sequences for barcodes of a single length, but multiple barcode legnths detected: " + ",".join([str(i) for i in lengths]))
+        sys.stderr.write("Can only search sequences for barcodes of a single length, but multiple barcode legnths detected: " + ",".join([str(i) for i in lengths]))
🧰 Tools
🪛 Ruff

90-90: f-string without any placeholders

Remove extraneous f prefix

(F541)


79-86: Enhance error handling for barcode validation

The error messages are clear, but the script exits immediately on the first error. Consider collecting all validation errors and reporting them together.

+validation_errors = []
 with opener(args.barcodes, mode) as bc_file:
     for line in bc_file:
         try:
             ATCG,ACBD = line.rstrip().split()
         except ValueError:
-            sys.stderr.write(f"Invalid barcode entry: {line.rstrip()}\nExpected two entries: a nucleotide barcode and ACBD format barcode with a space/tab separatng them, e.g. ATATCAGA A01C22B13D93")
-            sys.exit(1)
+            validation_errors.append(f"Invalid barcode entry: {line.rstrip()}\nExpected two entries: a nucleotide barcode and ACBD format barcode with a space/tab separating them, e.g. ATATCAGA A01C22B13D93")
+            continue
         if not set(ATCG).issubset(nucleotides):
-            sys.stderr.write(f"Invalid barcode format: {ATCG}. Barcodes must be captial letters and only contain standard nucleotide values ATCG.\n")
-            sys.exit(1)
+            validation_errors.append(f"Invalid barcode format: {ATCG}. Barcodes must be capital letters and only contain standard nucleotide values ATCG.")
+            continue
+if validation_errors:
+    sys.stderr.write("\n".join(validation_errors) + "\n")
+    sys.exit(1)
harpy/snakefiles/simulate_linkedreads.smk (3)

72-90: Consider parameterizing read lengths.

The read lengths (135 and 151) are hard-coded in the dwgsim command. Consider moving these to the configuration for better flexibility.

-        dwgsim -N {params.readpairs} -e 0.0001,0.0016 -E 0.0001,0.0016 -d {params.outerdist} -s {params.distsd} -1 135 -2 151 -H -y 0 -S 0 -c 0 -R 0 -o 1 -r {params.mutationrate} -F 0 -m /dev/null {input} {params.prefix} 2> {log}
+        dwgsim -N {params.readpairs} -e 0.0001,0.0016 -E 0.0001,0.0016 -d {params.outerdist} -s {params.distsd} -1 {config[read1_length]} -2 {config[read2_length]} -H -y 0 -S 0 -c 0 -R 0 -o 1 -r {params.mutationrate} -F 0 -m /dev/null {input} {params.prefix} 2> {log}

Line range hint 102-133: Add input validation for critical files.

The rule handles multiple input files but doesn't validate their existence before running LRSIM. Consider adding validation to fail fast if any required files are missing.

 rule make_molecules:
     input:
         hap_reads = collect(outdir + "/dwgsim/sim_reads.{hap}.12.fastq", hap = [0,1]),
         fasta_fai = collect(outdir + "/workflow/input/hap.{hap}.fasta.fai", hap = [0,1]),
         barcodes = barcode_file
     output:
         collect(outdir + "/lrsim/sim.{hap}.manifest", hap = [0,1]),
         collect(outdir + "/lrsim/sim.{hap}.fp", hap = [0,1]),
         temp(f"{outdir}/lrsim/.status")
     run:
+        # Validate input files
+        for file_path in input.hap_reads + input.fasta_fai + [input.barcodes]:
+            if not os.path.exists(file_path):
+                raise FileNotFoundError(f"Required input file not found: {file_path}")
         shell("perl {params} -z {threads} 2> {log}")

Line range hint 177-215: Consider adding version information to the workflow summary.

The summary is comprehensive but lacks version information for key tools (dwgsim, LRSIM). This would be valuable for reproducibility.

     run:
         summary = ["The harpy simulate linkedreas workflow ran using these parameters:"]
+        # Add version information
+        summary.append("\nTool versions:")
+        summary.append(f"dwgsim: $(dwgsim --version 2>&1)")
+        summary.append(f"LRSIM: $(perl {params.lrsim} --version 2>&1)")
         summary.append(f"Genome haplotype 1: {gen_hap1}")
harpy/scripts/LRSIM_harpy.pl (2)

429-430: Update usage documentation

The usage documentation has been updated for the new options but could be more descriptive.

Enhance the documentation for the new options:

-    -g STRING   Haploid FASTA .FAI files, separated by comma
-    -a STRING   DWGSIM sequences, interleaved and separated by comma
+    -g STRING   Haploid FASTA .FAI files for input haplotypes, comma-separated
+    -a STRING   DWGSIM sequences for read simulation, interleaved and comma-separated
+                Format: sequence1,sequence2 for two haplotypes

398-398: Fix typo in log message

There's a typo in the log message.

-            &Log("Reads remaning: $readsCountDown");
+            &Log("Reads remaining: $readsCountDown");
harpy/_validations.py (4)

172-181: Consider renaming unused loop variable

The loop variable future in the as_completed loop is only used to get its result. Consider renaming it to _future to indicate it's intentionally unused.

- for future in as_completed(futures):
-     result = future.result()
+ for _future in as_completed(futures):
+     result = _future.result()

354-355: Simplify loop using enumerate

Replace manual index increment with Python's enumerate function for better readability and maintainability.

- idx = 0
- for line in fin:
-     idx += 1
+ for idx, line in enumerate(fin, start=1):
🧰 Tools
🪛 Ruff

355-355: Use enumerate() for index variable idx in for loop

(SIM113)


472-476: Consider renaming unused loop variable

Similar to the earlier instance, the loop variable in the as_completed loop is unused. Consider renaming it to _future.

- for future in as_completed(futures):
+ for _future in as_completed(futures):
🧰 Tools
🪛 Ruff

475-475: Loop control variable future not used within loop body

Rename unused future to _future

(B007)


478-480: Address the TODO comment for duplicate barcode check

The TODO comment indicates that duplicate barcode checking functionality is missing. This is important to prevent potential issues with duplicate barcodes in the analysis pipeline.

Would you like me to help implement the duplicate barcode check functionality or create a GitHub issue to track this task?

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between d6d10de and 1ce1aa9.

📒 Files selected for processing (4)
  • harpy/_validations.py (6 hunks)
  • harpy/bin/inline_to_haplotag.py (2 hunks)
  • harpy/scripts/LRSIM_harpy.pl (10 hunks)
  • harpy/snakefiles/simulate_linkedreads.smk (5 hunks)
🧰 Additional context used
🪛 Ruff
harpy/_validations.py

355-355: Use enumerate() for index variable idx in for loop

(SIM113)


475-475: Loop control variable future not used within loop body

Rename unused future to _future

(B007)

harpy/bin/inline_to_haplotag.py

12-12: f-string without any placeholders

Remove extraneous f prefix

(F541)


67-67: f-string without any placeholders

Remove extraneous f prefix

(F541)


90-90: f-string without any placeholders

Remove extraneous f prefix

(F541)

🔇 Additional comments (7)
harpy/bin/inline_to_haplotag.py (1)

95-99: LGTM! Well-structured file handling and read pairing.

The code demonstrates good practices:

  • Uses context managers for proper file handling
  • Specifies compression level for output files
  • Correctly handles uneven read pairs using zip_longest
harpy/snakefiles/simulate_linkedreads.smk (3)

5-8: LGTM! Clean setup with necessary imports and constraints.

The added imports and wildcard constraints are well-justified by their usage in the workflow. The configuration setup is clean and organized.

Also applies to: 16-17, 25-25


40-51: LGTM! Robust file handling implementation.

The rule handles both gzipped and uncompressed files appropriately, with proper symlink handling for uncompressed files.


Line range hint 1-215: LGTM! Well-structured workflow with clear dependencies.

The overall workflow is well-organized with proper rule dependencies and consistent output paths. The renaming of rules has improved clarity and maintainability.

harpy/scripts/LRSIM_harpy.pl (2)

Line range hint 37-75: Verify command-line option validation

The script has been modified to handle new command-line options, but some validations could be improved:

  1. The -g and -a options are required but not validated for existence
  2. The -l (barcode length) option should be validated against minimum requirements

Add these validations after the existing checks:

 die "Barcodes file $opts{b} not found\n"       if ( !-s "$opts{b}" );
+die "Please provide haplotype FAI files with -g\n" if ( not defined $opts{g} );
+die "Please provide DWGSIM sequences with -a\n" if ( not defined $opts{a} );
+die "Barcode length (-l) must be at least 6\n" if ( $opts{l} < 6 );

367-389: ⚠️ Potential issue

Commented out barcode mismatch simulation

A significant block of code for simulating barcode mismatches has been commented out. This could affect the quality and realism of the simulated data.

Either:

  1. Remove the commented code if it's no longer needed
  2. Restore the functionality if barcode mismatches should be simulated
  3. Add a comment explaining why this functionality is disabled

If this is intentional, consider adding a comment explaining the decision:

+# Barcode mismatch simulation disabled in Harpy fork as mismatches are handled elsewhere
-#for ( my $k = 0 ; $k < $barcodeLength ; ++$k ) {
-#    my $isErr =
...
harpy/_validations.py (1)

379-410: LGTM! Thorough FASTA validation implementation

The FASTA validation implementation is comprehensive, covering all essential format requirements with clear error messages and solutions.

harpy/bin/inline_to_haplotag.py Outdated Show resolved Hide resolved
harpy/snakefiles/simulate_linkedreads.smk Outdated Show resolved Hide resolved
harpy/scripts/LRSIM_harpy.pl Outdated Show resolved Hide resolved
harpy/_validations.py Outdated Show resolved Hide resolved
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 8

🧹 Outside diff range and nitpick comments (12)
harpy/bin/inline_to_haplotag.py (3)

11-12: Remove unnecessary f-string prefix

The f-string prefix is not needed since there are no variables being interpolated in the usage string.

-    usage = f"inline_to_haplotag.py -f <forward.fq.gz> -r <reverse.fq.gz> -b <barcodes.txt> -p <prefix>",
+    usage = "inline_to_haplotag.py -f <forward.fq.gz> -r <reverse.fq.gz> -b <barcodes.txt> -p <prefix>",
🧰 Tools
🪛 Ruff

12-12: f-string without any placeholders

Remove extraneous f prefix

(F541)


49-68: Extract default barcode as a constant and remove unnecessary f-string

The function logic is correct, but there are two improvements to consider:

  1. The default barcode "A00C00B00D00" is hardcoded in multiple places
  2. There's an unnecessary f-string prefix when no variables are being interpolated
+DEFAULT_BARCODE = "A00C00B00D00"
+
 def process_record(fw_entry, rv_entry, barcode_dict, bc_len):
     """convert the barcode to haplotag"""
     # [0] = header, [1] = seq, [2] = +, [3] = qual
     if fw_entry:
         bc_inline = fw_entry[1][:bc_len]
-        bc_hap = barcode_dict.get(bc_inline, "A00C00B00D00")
+        bc_hap = barcode_dict.get(bc_inline, DEFAULT_BARCODE)
         fw_entry[0] = fw_entry[0].split()[0] + f"\tOX:Z:{bc_inline}\tBX:Z:{bc_hap}"
         fw_entry[1] = fw_entry[1][bc_len:]
         fw_entry[3] = fw_entry[3][bc_len:]
         _new_fw = "\n".join(fw_entry) + "\n"
         if rv_entry:
             rv_entry[0]  = rv_entry[0].split()[0] + f"\tOX:Z:{bc_inline}\tBX:Z:{bc_hap}"
             _new_rv = "\n".join(rv_entry) + "\n"
         else:
             _new_rv = None
         return _new_fw, _new_rv
     else:
         # no forward read, therefor no barcode to search for
-        rv_entry[0] = rv_entry[0].split()[0] + f"\tBX:Z:A00C00B00D00"
+        rv_entry[0] = rv_entry[0].split()[0] + "\tBX:Z:" + DEFAULT_BARCODE
         _new_rv = "\n".join(rv_entry) + "\n"
         return None, _new_rv
🧰 Tools
🪛 Ruff

67-67: f-string without any placeholders

Remove extraneous f prefix

(F541)


72-92: Fix typo in error message and remove unnecessary f-string

The barcode validation logic is thorough, but there are two minor issues:

  1. There's a typo in "legnths" in the error message
  2. There's an unnecessary f-string prefix when concatenating strings
     if len(lengths) > 1:
-        sys.stderr.write(f"Can only search sequences for barcodes of a single length, but multiple barcode legnths detected: " + ",".join([str(i) for i in lengths]))
+        sys.stderr.write("Can only search sequences for barcodes of a single length, but multiple barcode lengths detected: " + ",".join([str(i) for i in lengths]))
🧰 Tools
🪛 Ruff

90-90: f-string without any placeholders

Remove extraneous f prefix

(F541)

harpy/snakefiles/simulate_linkedreads.smk (2)

72-90: Consider adding parameter validation for dwgsim

While the implementation is solid, consider adding validation for critical parameters:

  • Mutation rate should be between 0 and 1
  • Read pairs should be positive
  • Distance parameters should be reasonable for the sequencing technology

Path to HaploSim.pl script needs correction

The script exists at harpy/scripts/HaploSim.pl, but the rule is looking for it at {outdir}/workflow/scripts/HaploSim.pl. Update the lrsim parameter in the rule to use the correct path:

lrsim = "harpy/scripts/HaploSim.pl"
🔗 Analysis chain

Line range hint 102-133: Verify HaploSim.pl script location

Ensure that the HaploSim.pl script is properly installed in the expected location:

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Verify HaploSim.pl script exists and is executable
fd -t f "HaploSim.pl$" -x test -x {} \;

Length of output: 80


Script:

#!/bin/bash
# Let's first check if the script exists without testing execution permissions
fd "HaploSim.pl$"

# Also search with ripgrep to see any references to this script
rg "HaploSim.pl" -l

Length of output: 151

harpy/_validations.py (5)

172-181: Consider simplifying the futures loop.

The implementation is good, but we can make it slightly more efficient by avoiding the unnecessary result variable assignment.

 with harpy_progressbar(quiet) as progress, ThreadPoolExecutor(max_workers=threads) as executor:
     task_progress = progress.add_task("[green]Checking RG tags", total=len(bamlist))
     futures = [executor.submit(check_RG, i) for i in bamlist]
     for future in as_completed(futures):
-        result = future.result()
-        progress.update(task_progress, advance=1)
-        if result:
-            culpritfiles.append(result[0])
-            culpritIDs.append(result[1])
+        if (result := future.result()):
+            culpritfiles.append(result[0])
+            culpritIDs.append(result[1])
+        progress.update(task_progress, advance=1)

325-336: Simplify file opening logic.

The file opening logic could be more concise by using a context manager directly.

-        opener = gzip.open if is_gzip(genome) else open
-        mode = "rt" if is_gzip(genome) else "r"
-        with opener(genome, mode) as fopen:
+        with (gzip.open(genome, 'rt') if is_gzip(genome) else open(genome, 'r')) as fopen:

382-389: Simplify file opening logic for consistency.

Similar to the previous suggestion, the file opening logic could be more concise.

-    opener = gzip.open if is_gzip(genofile) else open
-    mode = "rt" if is_gzip(genofile) else "r"
-    with opener(genofile, mode) as fasta:
+    with (gzip.open(genofile, 'rt') if is_gzip(genofile) else open(genofile, 'r')) as fasta:

475-479: Rename unused loop variable.

The loop variable future is not used within the loop body. Consider renaming it to _future to indicate it's intentionally unused.

-        for future in as_completed(futures):
+        for _future in as_completed(futures):
             progress.update(task_progress, advance=1)
🧰 Tools
🪛 Ruff

478-478: Loop control variable future not used within loop body

Rename unused future to _future

(B007)


482-483: Address the TODO comment for duplicate check.

The TODO comment indicates that duplicate barcode checking functionality needs to be implemented.

Would you like me to help implement the duplicate barcode check functionality or create a GitHub issue to track this task?

harpy/scripts/HaploSim.pl (2)

383-386: Potential performance improvement with threading

The simReads function is called sequentially for each haplotype. Utilizing threads could improve performance by running simulations concurrently.

Example implementation:

use threads;
...
my @threads;
for ( my $i = 0 ; $i < $opts{d} ; ++$i ) {
    my $thr = threads->create( \&simReads, $i );
    push @threads, $thr;
    sleep( 2 + int( rand(3) ) );
}
# Wait for all threads to complete
$_->join() for @threads;

45-45: Unclear default value for option -u

The default value of $opts{u} is set to 99, which does not correspond to any listed steps in the usage message. This might confuse users about how to continue from specific steps.

Consider setting a default value that aligns with documented steps or provide more clarity in the usage message.

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 1ce1aa9 and a17c527.

📒 Files selected for processing (7)
  • .github/filters.yml (1 hunks)
  • harpy/_validations.py (6 hunks)
  • harpy/bin/inline_to_haplotag.py (2 hunks)
  • harpy/scripts/HaploSim.pl (1 hunks)
  • harpy/simulate.py (1 hunks)
  • harpy/snakefiles/simulate_linkedreads.smk (5 hunks)
  • resources/createharpyenv.sh (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (2)
  • harpy/simulate.py
  • resources/createharpyenv.sh
🧰 Additional context used
🪛 Ruff
harpy/_validations.py

358-358: Use enumerate() for index variable idx in for loop

(SIM113)


478-478: Loop control variable future not used within loop body

Rename unused future to _future

(B007)

harpy/bin/inline_to_haplotag.py

12-12: f-string without any placeholders

Remove extraneous f prefix

(F541)


67-67: f-string without any placeholders

Remove extraneous f prefix

(F541)


90-90: f-string without any placeholders

Remove extraneous f prefix

(F541)

🔇 Additional comments (10)
.github/filters.yml (1)

158-158: Verify dependencies for the new simulation script

The addition of HaploSim.pl to the simreads module may require updates to related files.

✅ Verification successful

Let me gather more information about the script dependencies and integration.


Based on the results so far, I can see that HaploSim.pl is properly integrated in the workflow and there are no references to any old script (LRSIM_harpy). However, let me check one more aspect to ensure there are no missing dependencies or configuration requirements.


Dependencies are properly integrated and documented

The verification shows that:

  1. HaploSim.pl is a fork of LRSIM specifically modified for Harpy, with clear documentation of changes
  2. The script is properly integrated into the workflow via simulate.py and simulate_linkedreads.smk
  3. Required dependencies (Perl modules) are standard and documented in the script header
  4. No old script references exist in the codebase
  5. The workflow configuration correctly uses the new script with appropriate parameters
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check for references to both the old and new script names
# and verify the integration of HaploSim.pl in the workflow

# Check for references to both scripts
echo "Checking references to both scripts:"
rg -l "LRSIM_harpy\.pl|HaploSim\.pl" --type yaml --type python --type perl

# Check simulate.py for workflow integration
echo -e "\nChecking simulate.py integration:"
rg "HaploSim\.pl" "harpy/simulate.py"

# Check snakemake workflow files
echo -e "\nChecking workflow files:"
rg "HaploSim\.pl" "harpy/snakefiles/simulate_linkedreads.smk"

Length of output: 857


Script:

#!/bin/bash
# Check for any references to the old script name and related configurations
echo "Checking for old script references:"
rg -l "LRSIM_harpy"

# Check the scripts directory for both old and new scripts
echo -e "\nListing simulation scripts:"
fd -t f . "harpy/scripts"

# Check for any configuration or parameter files that might need updates
echo -e "\nChecking for config files:"
fd "config" -t f

# Check the fetch_script function implementation
echo -e "\nChecking fetch_script implementation:"
rg -A 5 "def fetch_script" "harpy/simulate.py"

Length of output: 526


Script:

#!/bin/bash
# Check the content of HaploSim.pl for any dependencies
echo "Checking HaploSim.pl dependencies:"
head -n 50 "harpy/scripts/HaploSim.pl"

# Check for any related test files
echo -e "\nChecking for related test files:"
fd -t f "test" -e py -e pl

# Check workflow configuration
echo -e "\nChecking workflow configuration:"
rg "simreads" -A 5 "harpy/snakefiles/simulate_linkedreads.smk"

Length of output: 2438

harpy/bin/inline_to_haplotag.py (2)

7-7: LGTM!

Good choice using zip_longest for handling potentially uneven FASTQ files.


95-99: LGTM!

The file handling implementation is robust:

  • Proper use of context managers ensures files are closed correctly
  • Good choice of compression level (6) balances compression ratio and speed
  • Correct handling of encoding when writing to gzipped files
harpy/snakefiles/simulate_linkedreads.smk (5)

5-5: LGTM: Import and constraint additions are appropriate

The added shutil import and haplotype wildcard constraint enhance the workflow's robustness:

  • shutil is used correctly for file operations
  • The haplotype constraint properly restricts values to 0 or 1

Also applies to: 16-18


27-38: Add input validation for barcode file format

The previous review comment about adding validation for barcode format is still applicable. Consider implementing the suggested checks for:

  • Non-empty lines
  • Valid barcode characters
  • Consistent barcode lengths

40-51: LGTM: Robust file handling implementation

The rule effectively handles both gzipped and regular files with proper error handling and efficient file operations.


144-158: LGTM: Rule updates are consistent with workflow changes

The renamed rule and updated paths maintain consistency with the overall workflow structure.


208-208: Fix string interpolation in summary command

The previous review comment about string interpolation is still valid. The single quotes prevent variable interpolation in the bxconvert command.

harpy/scripts/HaploSim.pl (2)

340-361: Commented-out code for barcode error simulation

The code responsible for introducing barcode mismatches is entirely commented out. If simulating barcode errors is still required, consider uncommenting and reviewing this section.

Please verify if barcode error simulation is necessary. If so, uncomment the code and ensure it functions as intended.


81-82: Possible mismatch between the number of haplotypes and FAI files

Ensure that the number of haplotypes (@haplotypes) matches the number of FASTA index files (@fastafai). A mismatch can lead to indexing errors during simulation.

Run the following script to verify the counts:

harpy/snakefiles/simulate_linkedreads.smk Outdated Show resolved Hide resolved
harpy/_validations.py Outdated Show resolved Hide resolved
harpy/_validations.py Outdated Show resolved Hide resolved
harpy/scripts/HaploSim.pl Outdated Show resolved Hide resolved
harpy/scripts/HaploSim.pl Outdated Show resolved Hide resolved
harpy/scripts/HaploSim.pl Outdated Show resolved Hide resolved
harpy/scripts/HaploSim.pl Show resolved Hide resolved
harpy/scripts/HaploSim.pl Outdated Show resolved Hide resolved
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 6

🧹 Outside diff range and nitpick comments (5)
harpy/snakefiles/simulate_linkedreads.smk (3)

40-51: Enhance error handling in genome file processing.

The file handling logic should include explicit error handling for common failure scenarios.

Consider this improvement:

 rule link_genome:
     input:
         lambda wc: genodict[wc.get("hap")]
     output: 
         outdir + "/workflow/input/hap.{hap}.fasta"
+    log:
+        outdir + "/logs/link_genome.{hap}.log"
     run:
-        if input[0].lower().endswith("gz"):
-            with gzip.open(input[0], 'rb') as gzip_file, open(output[0], 'wb') as output_file:
-                shutil.copyfileobj(gzip_file, output_file)
-        else:
-            if not (Path(output[0]).is_symlink() or Path(output[0]).exists()):
-                Path(output[0]).symlink_to(Path(input[0]).resolve()) 
+        try:
+            if input[0].lower().endswith("gz"):
+                with gzip.open(input[0], 'rb') as gzip_file, open(output[0], 'wb') as output_file:
+                    shutil.copyfileobj(gzip_file, output_file)
+            else:
+                if not (Path(output[0]).is_symlink() or Path(output[0]).exists()):
+                    Path(output[0]).symlink_to(Path(input[0]).resolve())
+        except (OSError, IOError) as e:
+            with open(log[0], 'w') as log_file:
+                log_file.write(f"Error processing genome file: {str(e)}\n")
+            raise

Line range hint 102-133: Simplify parameter handling in make_molecules rule.

The current parameter handling is verbose and could be simplified using a dictionary or list comprehension.

Consider this improvement:

 rule make_molecules:
     params:
-        haplosim = f"{outdir}/workflow/scripts/HaploSim.pl",
-        reads_in = f"-a {outdir}/dwgsim/sim_reads.0.12.fastq,{outdir}/dwgsim/sim_reads.1.12.fastq",
-        fai_in = f"-g {outdir}/workflow/input/hap.0.fasta.fai,{outdir}/workflow/input/hap.1.fasta.fai",
-        inbarcodes = f"-b {barcode_file}",
-        proj_dir = f"-p {outdir}/linked_molecules/sim",
-        outdist  = f"-i {config['outer_distance']}",
-        dist_sd  = f"-s {config['distance_sd']}",
-        n_pairs  = f"-x {config['read_pairs']}",
-        mol_len  = f"-f {config['molecule_length']}",
-        parts    = f"-t {config['partitions']}",
-        mols_per = f"-m {config['molecules_per_partition']}",
-        bc_len   = f"-l {barcode_len}",
-        static   = "-o 1 -d 2 -u 4"
+        script = f"{outdir}/workflow/scripts/HaploSim.pl",
+        options = {
+            "a": f"{outdir}/dwgsim/sim_reads.0.12.fastq,{outdir}/dwgsim/sim_reads.1.12.fastq",
+            "g": f"{outdir}/workflow/input/hap.0.fasta.fai,{outdir}/workflow/input/hap.1.fasta.fai",
+            "b": barcode_file,
+            "p": f"{outdir}/linked_molecules/sim",
+            "i": config['outer_distance'],
+            "s": config['distance_sd'],
+            "x": config['read_pairs'],
+            "f": config['molecule_length'],
+            "t": config['partitions'],
+            "m": config['molecules_per_partition'],
+            "l": barcode_len,
+            "o": 1,
+            "d": 2,
+            "u": 4
+        }
     shell:
-        "perl {params} -z {threads} 2> {log}"
+        "perl {params.script} {' '.join([f'-{k} {v}' for k, v in params.options.items()])} -z {threads} 2> {log}"

Line range hint 177-215: Improve summary generation readability.

The current string concatenation approach could be improved using a more structured format.

Consider this improvement:

     run:
-        summary = ["The harpy simulate linkedreas workflow ran using these parameters:"]
-        summary.append(f"Genome haplotype 1: {gen_hap1}")
-        summary.append(f"Genome haplotype 2: {gen_hap2}")
-        summary.append(f"Barcode file: {barcode_file}")
+        summary_sections = {
+            "Input Files": {
+                "Genome haplotype 1": gen_hap1,
+                "Genome haplotype 2": gen_hap2,
+                "Barcode file": barcode_file
+            },
+            "DWGSIM Parameters": {
+                "Command": f"dwgsim -N {params.dwgreadpairs} -e 0.0001,0.0016 -E 0.0001,0.0016 " +
+                          f"-d {params.dwgouterdist} -s {params.dwgdistsd} -1 135 -2 151 -H -y 0 " +
+                          f"-S 0 -c 0 -R 0 -r {params.dwgmutationrate} -F 0 -o 1 -m /dev/null GENO PREFIX"
+            },
+            "HaploSim Parameters": {
+                "Command": f"HaploSim.pl -g genome1,genome2 -a dwgsimreads1,dwgsimreads2 " +
+                          f"-l {params.lrbc_len} -p {params.lrsproj_dir}/linked_molecules/lrsim " +
+                          f"-b BARCODES -i {params.lrsoutdist} -s {params.lrsdist_sd} " +
+                          f"-x {params.lrsn_pairs} -f {params.lrsmol_len} -t {params.lrsparts} " +
+                          f"-m {params.lrsmols_per} -z THREADS {params.lrsstatic}"
+            }
+        }
+        
+        summary = ["The harpy simulate linkedreads workflow ran using these parameters:"]
+        for section, params in summary_sections.items():
+            summary.append(f"\n{section}:")
+            for param, value in params.items():
+                summary.append(f"  {param}: {value}")
harpy/scripts/HaploSim.pl (2)

1-13: Enhance documentation with version and usage details.

While the header provides good context about the fork and modifications, consider adding:

  • Version information
  • Input/output format specifications
  • Usage examples with common scenarios

15-33: Add dependency version requirements and checks.

Consider adding:

  • Version requirements for critical modules
  • Checks for required modules at startup
  • Error messages for missing dependencies
+use v5.10;  # Specify minimum Perl version
+
+# Check for required modules
+BEGIN {
+    my @modules = qw(
+        threads
+        threads::shared
+        Inline::C
+        Math::Random
+    );
+    for my $module (@modules) {
+        eval "use $module";
+        die "Required module $module not found: $@" if $@;
+    }
+}
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between a17c527 and 2e24b90.

📒 Files selected for processing (2)
  • harpy/scripts/HaploSim.pl (1 hunks)
  • harpy/snakefiles/simulate_linkedreads.smk (4 hunks)
🔇 Additional comments (5)
harpy/snakefiles/simulate_linkedreads.smk (4)

5-17: LGTM: Import and constraint additions are appropriate.

The added shutil import and haplotype wildcard constraint enhance the workflow's robustness.


27-38: Previous review comment about barcode validation still applies.


144-158: LGTM: Rule changes are well-structured.

The renamed rule and updated paths improve clarity and organization.


161-175: Previous review comment about missing barcode length parameter still applies.

harpy/scripts/HaploSim.pl (1)

427-435: ⚠️ Potential issue

Fix parameter handling in Log function.

The Log function uses $_[0] both as file path and message, which is confusing and error-prone.

 sub Log {
     state $statusFH;
-    if ( not defined $statusFH ) {
-        open $statusFH, ">>$_[0]" or die "Error opening $_[0].\n";
-    }
     my $time = localtime;
-    print $statusFH "$time: $_[0]\n";
-    print STDERR "$time: $_[0]\n";
+    my ($message) = @_;
+    if ( not defined $statusFH ) {
+        my $logfile = "$opts{p}.status";
+        open $statusFH, ">>", $logfile or die "Error opening $logfile: $!\n";
+    }
+    print $statusFH "$time: $message\n";
+    print STDERR "$time: $message\n";
 }

Likely invalid or redundant comment.

harpy/snakefiles/simulate_linkedreads.smk Show resolved Hide resolved
harpy/scripts/HaploSim.pl Outdated Show resolved Hide resolved
harpy/scripts/HaploSim.pl Outdated Show resolved Hide resolved
harpy/scripts/HaploSim.pl Outdated Show resolved Hide resolved
harpy/scripts/HaploSim.pl Outdated Show resolved Hide resolved
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 5

🧹 Outside diff range and nitpick comments (9)
harpy/bin/inline_to_haplotag.py (4)

7-7: LGTM! Minor optimization for f-string usage.

The import of zip_longest and updated description are good improvements. However, there's an unnecessary f-string usage.

-    usage = f"inline_to_haplotag.py -f <forward.fq.gz> -r <reverse.fq.gz> -b <barcodes.txt> -p <prefix>",
+    usage = "inline_to_haplotag.py -f <forward.fq.gz> -r <reverse.fq.gz> -b <barcodes.txt> -p <prefix>",

Also applies to: 11-12


30-35: Remove debug print statements.

The function contains commented-out debug print statements that should be removed.

def valid_record(fq_rec, FR):
    """fastq format sanity check"""
-    #print(fq_rec)
-    #sys.exit(0)
    if not (fq_rec[0].startswith("@") and fq_rec[2] == "+"):
        raise ValueError(f"Invalid FASTQ format for {FR} reads")

66-86: LGTM! Good validation with minor optimization.

The barcode processing logic has excellent validation and error handling. However, there's an unnecessary f-string usage in the error message.

-        sys.stderr.write(f"Can only search sequences for barcodes of a single length, but multiple barcode legnths detected: " + ",".join([str(i) for i in lengths]))
+        sys.stderr.write("Can only search sequences for barcodes of a single length, but multiple barcode legnths detected: " + ",".join([str(i) for i in lengths]))

Also, there's a typo in "legnths" that should be fixed.

🧰 Tools
🪛 Ruff

84-84: f-string without any placeholders

Remove extraneous f prefix

(F541)


89-113: Optimize error handling and remove commented code.

The file processing logic is good, but there are some opportunities for improvement:

  1. Remove the commented-out code
  2. Simplify error handling using contextlib.suppress
+from contextlib import suppress
 with gzip.open(args.forward, "r") as fw_i, gzip.open(args.reverse, "r") as rv_i,\
     gzip.open(f"{args.prefix}.R1.fq.gz", "wb", 6) as fw_out,\
     gzip.open(f"{args.prefix}.R2.fq.gz", "wb", 6) as rv_out:
-    #for fw_record, rv_record in zip_longest(iter_fastq_records(fw_i), iter_fastq_records(rv_i)),:
     record_F = []
     record_R = []
     for fw_record, rv_record in zip_longest(fw_i, rv_i):
-        try:
+        with suppress(AttributeError):
             record_F.append(fw_record.decode().rstrip("\n"))
-        except AttributeError:
-            # if the file ends before the other one
-            pass
-        try:
+        with suppress(AttributeError):
             record_R.append(rv_record.decode().rstrip("\n"))
-        except AttributeError:
-            pass
🧰 Tools
🪛 Ruff

96-100: Use contextlib.suppress(AttributeError) instead of try-except-pass

(SIM105)


101-104: Use contextlib.suppress(AttributeError) instead of try-except-pass

Replace with contextlib.suppress(AttributeError)

(SIM105)

harpy/snakefiles/simulate_linkedreads.smk (1)

102-131: Improve parameter handling in create_molecules rule.

Consider refactoring the parameter handling for better maintainability:

 rule create_molecules:
     params:
-        haplosim = f"{outdir}/workflow/scripts/HaploSim.pl",
-        reads_in = f"-a {outdir}/dwgsim/sim_reads.0.12.fastq,{outdir}/dwgsim/sim_reads.1.12.fastq",
-        fai_in   = f"-g {outdir}/workflow/input/hap.0.fasta.fai,{outdir}/workflow/input/hap.1.fasta.fai",
-        bccodes  = f"-b {barcode_file}",
-        proj_dir = f"-p {outdir}/linked_molecules/lrsim",
-        outdist  = f"-i {config['outer_distance']}",
-        dist_sd  = f"-s {config['distance_sd']}",
-        n_pairs  = f"-x {config['read_pairs']}",
-        mol_len  = f"-f {config['molecule_length']}",
-        parts    = f"-t {config['partitions']}",
-        mols_per = f"-m {config['molecules_per_partition']}",
-        bc_len   = f"-l {barcode_len}"
+        script = f"{outdir}/workflow/scripts/HaploSim.pl",
+        options = {
+            "a": lambda w: f"{outdir}/dwgsim/sim_reads.0.12.fastq,{outdir}/dwgsim/sim_reads.1.12.fastq",
+            "g": lambda w: f"{outdir}/workflow/input/hap.0.fasta.fai,{outdir}/workflow/input/hap.1.fasta.fai",
+            "b": lambda w: barcode_file,
+            "p": lambda w: f"{outdir}/linked_molecules/lrsim",
+            "i": lambda w: config['outer_distance'],
+            "s": lambda w: config['distance_sd'],
+            "x": lambda w: config['read_pairs'],
+            "f": lambda w: config['molecule_length'],
+            "t": lambda w: config['partitions'],
+            "m": lambda w: config['molecules_per_partition'],
+            "l": lambda w: barcode_len
+        }
     shell:
-        "perl {params} -z {threads} -o 1 -d 2 2> {log}"
+        "perl {params.script} "
+        "{params.options:map('-' + key + ' ' + str(value) for key, value in sorted(params.options.items()))} "
+        "-z {threads} -o 1 -d 2 2> {log}"
harpy/scripts/HaploSim.pl (1)

1-18: Documentation needs improvement

The header comments provide basic context but could be enhanced:

  • Add usage examples
  • Document the expected format of input files
  • Explain the significance of the -g and -a options more clearly
harpy/_validations.py (3)

475-479: Rename unused variable for better code clarity

The loop control variable future is not used within the loop body.

Apply this change:

- for future in as_completed(futures):
+ for _future in as_completed(futures):
🧰 Tools
🪛 Ruff

478-478: Loop control variable future not used within loop body

Rename unused future to _future

(B007)


521-522: Improve string formatting in error message

The error message has a syntax error in the string formatting for lengths.

Apply this fix:

- print_error("Incorrect format", f"Barcodes in [blue]{infile}[/blue] must all be a single length, but multiple lengths were detected: [yellow]" + ", ".join(lengths))
+ print_error("Incorrect format", f"Barcodes in [blue]{infile}[/blue] must all be a single length, but multiple lengths were detected: [yellow]{', '.join(map(str, lengths))}[/yellow]")

357-358: Use enumerate for better readability

Replace manual line counting with Python's built-in enumerate function.

Apply this change:

- idx = 0
- for line in fin:
-     idx += 1
+ for idx, line in enumerate(fin, 1):
🧰 Tools
🪛 Ruff

358-358: Use enumerate() for index variable idx in for loop

(SIM113)

📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between 2e24b90 and fe1e773.

📒 Files selected for processing (5)
  • .github/workflows/tests.yml (1 hunks)
  • harpy/_validations.py (6 hunks)
  • harpy/bin/inline_to_haplotag.py (2 hunks)
  • harpy/scripts/HaploSim.pl (1 hunks)
  • harpy/snakefiles/simulate_linkedreads.smk (3 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
  • .github/workflows/tests.yml
🧰 Additional context used
🪛 Ruff
harpy/_validations.py

358-358: Use enumerate() for index variable idx in for loop

(SIM113)


478-478: Loop control variable future not used within loop body

Rename unused future to _future

(B007)

harpy/bin/inline_to_haplotag.py

12-12: f-string without any placeholders

Remove extraneous f prefix

(F541)


84-84: f-string without any placeholders

Remove extraneous f prefix

(F541)


96-100: Use contextlib.suppress(AttributeError) instead of try-except-pass

(SIM105)


101-104: Use contextlib.suppress(AttributeError) instead of try-except-pass

Replace with contextlib.suppress(AttributeError)

(SIM105)

🔇 Additional comments (13)
harpy/bin/inline_to_haplotag.py (1)

37-63: LGTM! Well-structured error handling.

The process_record function has good error handling and properly manages both forward and reverse reads with their barcodes.

harpy/snakefiles/simulate_linkedreads.smk (6)

5-5: LGTM! Well-structured setup improvements.

The additions improve code organization and type safety:

  • shutil import supports efficient file operations
  • Wildcard constraint ensures haplotype values are valid (0 or 1)
  • genodict provides clean mapping between haplotype IDs and genome files

Also applies to: 16-17, 25-26


27-38: Add input validation for barcode file.

This comment was previously made and is still valid. The barcode mapping logic needs validation for:

  • Non-empty lines
  • Valid barcode characters
  • Consistent barcode lengths

72-90: Parameterize hardcoded values in dwgsim command.

This comment was previously made and is still valid. The error rates and read lengths should be configurable parameters.


159-173: Add missing barcode length parameter.

This comment was previously made and is still valid. The inline_to_haplotag.py command is missing the barcode length parameter.


206-206: Fix string interpolation in summary command.

This comment was previously made and is still valid. The single quotes prevent variable interpolation.


Line range hint 1-213: Verify file paths and dependencies.

Let's verify the consistency of file paths and dependencies across the workflow:

✅ Verification successful

Based on the output from the shell scripts, I can now generate the final response:

File paths and dependencies are properly structured

The verification of file paths and dependencies in the simulate_linkedreads.smk workflow shows:

  1. All output paths are consistently constructed using the outdir variable
  2. Directory structure follows a clear hierarchy with dedicated subdirectories for each step (workflow/input, dwgsim, linked_molecules, multiplex, etc.)
  3. Dependencies between rules are properly managed through input/output declarations
  4. Temporary files are appropriately marked with temp() directive
  5. File extensions and naming conventions are consistent throughout the workflow

The workflow demonstrates good practices in terms of file organization and dependency management, with no issues found in the path structure or rule dependencies.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify file paths and dependencies across the workflow

# Check for potential path inconsistencies
echo "Checking for potential path inconsistencies..."
rg -g '*.smk' -g '*.py' "outdir \+" | grep -v "workflow/input\|workflow/scripts\|logs\|dwgsim\|linked_molecules\|multiplex"

# Check for potential missing dependencies
echo "Checking for potential missing dependencies..."
rg -g '*.smk' "input:|output:" -A 1 | grep -v "outdir\|input:\|output:"

Length of output: 54234

harpy/scripts/HaploSim.pl (3)

71-76: LGTM! Input validation is thorough

The input validation checks are comprehensive and provide clear error messages.


303-311: 🛠️ Refactor suggestion

Log subroutine needs improvement

The Log subroutine uses $_[0] both as the log file path and message, which is confusing and error-prone.

 sub Log {
     state $statusFH;
+    my ($message) = @_;
     if ( not defined $statusFH ) {
-        open $statusFH, ">>$_[0]" or die "Error opening $_[0].\n";
+        my $log_file = "$opts{p}.status.log";
+        open $statusFH, ">>", $log_file or die "Error opening $log_file: $!\n";
     }
     my $time = localtime;
-    print $statusFH "$time: $_[0]\n";
-    print STDERR "$time: $_[0]\n";
+    print $statusFH "$time: $message\n";
+    print STDERR "$time: $message\n";
 }

Likely invalid or redundant comment.


260-264: Verify thread safety of the simulation loop

The sleep between simulations might not be sufficient to prevent resource contention.

✅ Verification successful

Let me gather more information about the simulation loop and its thread safety implications.


Based on the gathered information, I can now provide the final response:

Thread safety concerns in simulation loop are properly handled

The code is actually thread-safe with appropriate synchronization mechanisms in place:

  • Uses Perl's threading module (use threads) with proper shared variables
  • Implements mutex locking for barcode access (lock($barcodesMutexLock))
  • Each simulation runs in its own thread context with proper signal handlers
  • File operations use thread-local file handles
  • The sleep between iterations provides additional safety margin

The random sleep (2-5 seconds) between simulations is not actually for thread safety but appears to be for resource management, which is appropriate given the implementation.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Check for potential thread safety issues in file operations

# Look for file operations that might cause race conditions
rg -A 2 'open.*[>"]' harpy/scripts/HaploSim.pl

# Check for shared resource access patterns
ast-grep --pattern 'lock($barcodesMutexLock)'

Length of output: 909


Script:

#!/bin/bash
# Check for any threading/parallel processing mechanisms
rg -A 5 'fork|thread|parallel|Process' harpy/scripts/HaploSim.pl

# Look for the simReads function implementation to understand resource handling
rg -A 10 '^sub\s+simReads' harpy/scripts/HaploSim.pl

# Check for any global variables or shared resources
rg -B 2 'our\s+\$|my\s+\$\w+\s*=\s*' harpy/scripts/HaploSim.pl

Length of output: 4771

harpy/_validations.py (3)

172-181: LGTM: Well-structured parallel execution with proper resource management

The implementation correctly uses context managers to handle both progress bar and thread pool executor, ensuring proper resource cleanup.


382-413: LGTM: Comprehensive FASTA validation with clear error handling

The implementation thoroughly validates FASTA format requirements with descriptive error messages and solutions.


481-497: LGTM: Comprehensive barcode validation with proper error handling

The implementation thoroughly validates barcodes with proper checks for format, duplicates, and length limits.

harpy/snakefiles/simulate_linkedreads.smk Show resolved Hide resolved
harpy/scripts/HaploSim.pl Outdated Show resolved Hide resolved
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

🧹 Outside diff range and nitpick comments (6)
harpy/bin/inline_to_haplotag.py (3)

12-12: Remove unnecessary f-string prefix

The usage string doesn't contain any variables, so the f prefix is not needed.

-    usage = f"inline_to_haplotag.py -f <forward.fq.gz> -r <reverse.fq.gz> -b <barcodes.txt> -p <prefix>",
+    usage = "inline_to_haplotag.py -f <forward.fq.gz> -r <reverse.fq.gz> -b <barcodes.txt> -p <prefix>",
🧰 Tools
🪛 Ruff

12-12: f-string without any placeholders

Remove extraneous f prefix

(F541)


75-79: Consider using a context manager for error handling

The error handling could be improved by moving the validation logic inside the context manager to ensure proper file cleanup on errors.

with opener(args.barcodes, mode) as bc_file:
+    def validate_and_process_line(line):
+        try:
+            ATCG, ACBD = line.rstrip().split()
+            if not set(ATCG).issubset(nucleotides):
+                raise ValueError(f"Invalid barcode format: {ATCG}")
+            return ATCG, ACBD
+        except ValueError as e:
+            return None, str(e)

    for line in bc_file:
-        try:
-            ATCG,ACBD = line.rstrip().split()
-        except ValueError:
-            sys.stderr.write(f"Invalid barcode entry: {line.rstrip()}\nExpected two entries: a nucleotide barcode and ACBD format barcode with a space/tab separatng them, e.g. ATATCAGA A01C22B13D93")
-            sys.exit(1)
-        if not set(ATCG).issubset(nucleotides):
-            sys.stderr.write(f"Invalid barcode format: {ATCG}. Barcodes must be captial letters and only contain standard nucleotide values ATCG.\n")
-            sys.exit(1)
+        ATCG, result = validate_and_process_line(line)
+        if not ATCG:
+            sys.stderr.write(f"{result}\n")
+            return

94-102: Simplify exception handling with contextlib.suppress

The current try-except blocks with pass statements can be simplified using contextlib.suppress.

+from contextlib import suppress

 for fw_record, rv_record in zip_longest(fw_i, rv_i):
-    try:
+    with suppress(AttributeError):
         record_F.append(fw_record.decode().rstrip("\n"))
-    except AttributeError:
-        # if the file ends before the other one
-        pass
-    try:
+    with suppress(AttributeError):
         record_R.append(rv_record.decode().rstrip("\n"))
-    except AttributeError:
-        pass
🧰 Tools
🪛 Ruff

94-98: Use contextlib.suppress(AttributeError) instead of try-except-pass

(SIM105)


99-102: Use contextlib.suppress(AttributeError) instead of try-except-pass

Replace with contextlib.suppress(AttributeError)

(SIM105)

harpy/_validations.py (3)

172-181: Consider renaming unused variable.

The implementation looks good, but consider renaming the unused future variable to _future to indicate it's intentionally unused.

- for future in as_completed(futures):
+ for _future in as_completed(futures):

475-479: Consider renaming unused variable.

Similar to the earlier instance, consider renaming the unused future variable to _future.

- for future in as_completed(futures):
+ for _future in as_completed(futures):
🧰 Tools
🪛 Ruff

478-478: Loop control variable future not used within loop body

Rename unused future to _future

(B007)


481-524: Fix string formatting in error message.

The implementation looks good overall, but the error message formatting on line 521 could be improved for better readability.

- print_error("Incorrect format", f"Barcodes in [blue]{infile}[/blue] must all be a single length, but multiple lengths were detected: [yellow]" + ", ".join(lengths) + "[/yellow]")
+ print_error("Incorrect format", f"Barcodes in [blue]{infile}[/blue] must all be a single length, but multiple lengths were detected: [yellow]{', '.join(map(str, lengths))}[/yellow]")
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between fe1e773 and cab9dbd.

📒 Files selected for processing (3)
  • harpy/_validations.py (6 hunks)
  • harpy/bin/inline_to_haplotag.py (2 hunks)
  • harpy/snakefiles/simulate_linkedreads.smk (3 hunks)
🧰 Additional context used
🪛 Ruff
harpy/_validations.py

358-358: Use enumerate() for index variable idx in for loop

(SIM113)


478-478: Loop control variable future not used within loop body

Rename unused future to _future

(B007)

harpy/bin/inline_to_haplotag.py

12-12: f-string without any placeholders

Remove extraneous f prefix

(F541)


94-98: Use contextlib.suppress(AttributeError) instead of try-except-pass

(SIM105)


99-102: Use contextlib.suppress(AttributeError) instead of try-except-pass

Replace with contextlib.suppress(AttributeError)

(SIM105)

🔇 Additional comments (9)
harpy/bin/inline_to_haplotag.py (2)

30-61: Well-structured helper functions with proper validation

The helper functions are well-implemented with:

  • Robust FASTQ format validation
  • Proper error handling for missing reads
  • Clean barcode processing logic

80-81: Verify consistent barcode dictionary usage across codebase

Let's verify that the barcode dictionary structure (nucleotide to ACBD format mapping) is used consistently across the codebase.

✅ Verification successful

Barcode dictionary usage is consistent across the codebase

The verification shows consistent barcode handling patterns:

  • The bc_dict is only used in inline_to_haplotag.py for mapping nucleotide sequences to ACBD format
  • Invalid or missing barcodes consistently use the format A00C00B00D00 across multiple files
  • The ACBD format follows the pattern [ABCD]\d{2}[ABCD]\d{2}[ABCD]\d{2}[ABCD]\d{2} throughout the codebase
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Search for barcode dictionary usage patterns
rg -A 2 "bc_dict\[.*\]" --type python

# Search for ACBD format pattern usage
rg -A 2 "[ABCD]\d{2}[ABCD]\d{2}[ABCD]\d{2}[ABCD]\d{2}" --type python

Length of output: 172


Script:

#!/bin/bash
# Search for barcode dictionary usage patterns without file type filter
rg -A 2 "bc_dict\[.*\]"

# Search for files with .py extension that might contain barcode processing
fd ".py$" --exec grep -l "bc_dict"

# Search for ACBD format pattern usage without file type filter
rg -A 2 "[ABCD]\d{2}[ABCD]\d{2}[ABCD]\d{2}[ABCD]\d{2}"

Length of output: 3207

harpy/snakefiles/simulate_linkedreads.smk (6)

5-5: LGTM! Well-structured imports and configuration.

The added imports and configuration settings are appropriate for the workflow's requirements. The wildcard constraint for haplotypes is a good practice for preventing invalid values.

Also applies to: 8-8, 16-17, 23-24


40-51: LGTM! Efficient file handling implementation.

The rule properly handles both gzipped and uncompressed files, with efficient file copying using shutil.copyfileobj and appropriate symlink handling.


72-90: Parameterize hardcoded values in dwgsim command.

This is a duplicate of a previous review comment. The hardcoded error rates and read lengths should be configurable parameters.


102-131: LGTM! Well-structured rule with proper parameter handling.

The rule effectively manages multiple inputs and parameters, with good use of temp() for intermediate files. The parameter passing to HaploSim.pl is comprehensive and well-organized.


159-173: Add missing barcode length parameter.

This is a duplicate of a previous review comment. The inline_to_haplotag.py command is missing the barcode length parameter.


Line range hint 175-213: LGTM! Comprehensive workflow summary implementation.

The rule provides a detailed summary of all workflow parameters and commands, with proper string formatting and organization.

harpy/_validations.py (1)

382-413: LGTM! Comprehensive FASTA validation.

The implementation provides thorough validation of FASTA format with clear error messages and helpful solutions.

harpy/snakefiles/simulate_linkedreads.smk Show resolved Hide resolved
harpy/_validations.py Outdated Show resolved Hide resolved
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 7

🧹 Outside diff range and nitpick comments (6)
harpy/bin/inline_to_haplotag.py (4)

13-13: Remove unnecessary f-string prefix

The usage string doesn't contain any placeholders, so the f prefix can be removed.

-    usage = f"inline_to_haplotag.py -f <forward.fq.gz> -r <reverse.fq.gz> -b <barcodes.txt> -p <prefix>",
+    usage = "inline_to_haplotag.py -f <forward.fq.gz> -r <reverse.fq.gz> -b <barcodes.txt> -p <prefix>",
🧰 Tools
🪛 Ruff

13-13: f-string without any placeholders

Remove extraneous f prefix

(F541)


31-34: Enhance FASTQ validation

The current validation only checks the header and plus line. Consider adding sequence and quality score validation:

  • Sequence should only contain valid nucleotides
  • Quality scores should be valid ASCII characters
  • Sequence and quality score lengths should match
 def valid_record(fq_rec, FR):
     """fastq format sanity check"""
     if not (fq_rec[0].startswith("@") and fq_rec[2] == "+"):
         raise ValueError(f"Invalid FASTQ format for {FR} reads")
+    if len(fq_rec[1]) != len(fq_rec[3]):
+        raise ValueError(f"Sequence and quality score lengths don't match in {FR} reads")
+    if not set(fq_rec[1].upper()).issubset({'A', 'C', 'G', 'T', 'N'}):
+        raise ValueError(f"Invalid nucleotides in sequence of {FR} reads")

58-84: Add docstring parameters and return values

The function's docstring should document its parameters and return values for better maintainability.

 def process_record(fw_entry, rv_entry, barcode_database, bc_len):
-    """convert the barcode to haplotag"""
+    """Convert the barcode to haplotag format.
+    
+    Args:
+        fw_entry (list): Forward read entry [header, seq, +, qual]
+        rv_entry (list): Reverse read entry [header, seq, +, qual]
+        barcode_database (sqlite3.Connection): SQLite database connection
+        bc_len (int): Length of the barcode
+    
+    Returns:
+        tuple: (processed_forward, processed_reverse) where each is either
+               a string containing the modified FASTQ record or None
+    """

129-137: Use contextlib.suppress for cleaner exception handling

Replace the try-except blocks with contextlib.suppress for more concise code.

+from contextlib import suppress
+
     for fw_record, rv_record in zip_longest(fw_i, rv_i):
-        try:
+        with suppress(AttributeError):
             record_F.append(fw_record.decode().rstrip("\n"))
-        except AttributeError:
-            # if the file ends before the other one
-            pass
-        try:
+        with suppress(AttributeError):
             record_R.append(rv_record.decode().rstrip("\n"))
-        except AttributeError:
-            pass
🧰 Tools
🪛 Ruff

129-133: Use contextlib.suppress(AttributeError) instead of try-except-pass

(SIM105)


134-137: Use contextlib.suppress(AttributeError) instead of try-except-pass

Replace with contextlib.suppress(AttributeError)

(SIM105)

harpy/scripts/HaploSim.pl (1)

1-17: Add version information to the header documentation.

While the header clearly documents the fork and modifications, it would be beneficial to add:

  • Version number of this fork
  • Last updated date
  • Minimum Perl version requirement
harpy/_validations.py (1)

172-181: Consider renaming unused loop variable.

The loop variable future in the as_completed loop is only used to get its result. Consider renaming it to _future to indicate it's intentionally unused.

- for future in as_completed(futures):
-     result = future.result()
+ for _future in as_completed(futures):
+     result = _future.result()
📜 Review details

Configuration used: CodeRabbit UI
Review profile: CHILL

📥 Commits

Reviewing files that changed from the base of the PR and between cab9dbd and d20ee09.

📒 Files selected for processing (4)
  • harpy/_validations.py (6 hunks)
  • harpy/bin/inline_to_haplotag.py (2 hunks)
  • harpy/scripts/HaploSim.pl (1 hunks)
  • harpy/snakefiles/simulate_linkedreads.smk (3 hunks)
🧰 Additional context used
🪛 Ruff
harpy/_validations.py

476-476: Loop control variable future not used within loop body

Rename unused future to _future

(B007)

harpy/bin/inline_to_haplotag.py

13-13: f-string without any placeholders

Remove extraneous f prefix

(F541)


129-133: Use contextlib.suppress(AttributeError) instead of try-except-pass

(SIM105)


134-137: Use contextlib.suppress(AttributeError) instead of try-except-pass

Replace with contextlib.suppress(AttributeError)

(SIM105)

🔇 Additional comments (8)
harpy/bin/inline_to_haplotag.py (1)

86-95: LGTM! Well-structured database setup

The in-memory SQLite database is an efficient choice for this use case, and the table structure is appropriate for key-value storage.

harpy/snakefiles/simulate_linkedreads.smk (3)

5-5: LGTM! Well-structured configuration setup.

The additions improve the code organization with:

  • Appropriate use of shutil for file operations
  • Clear wildcard constraint for haplotype values
  • Clean configuration variable structure

Also applies to: 16-17, 25-25


113-124: Verify HaploSim script availability and parameters.

The switch from LRSIM to HaploSim needs verification to ensure:

  1. The script exists at the specified location
  2. All parameters are compatible with the new implementation
✅ Verification successful

HaploSim.pl script exists and parameters are correctly aligned

The script exists at harpy/scripts/HaploSim.pl and all parameters used in the Snakefile match the script's expected parameters:

  • -a for input reads (comma-separated)
  • -g for FAI files (comma-separated)
  • -b for barcodes list
  • -p for output prefix
  • -i and -s for distance parameters
  • -x for read pairs
  • -f for molecule length
  • -t for partitions
  • -m for molecules per partition
  • -l for barcode length
🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify HaploSim script existence and parameter compatibility

# Check if HaploSim.pl exists in the workflow/scripts directory
fd -t f "HaploSim.pl" -x echo "Found HaploSim script: {}"

# Check for any usage documentation or parameter information
rg -A 10 "usage|help|parameters" --glob "HaploSim.pl"

Length of output: 2782


147-147: Verify extractReads command compatibility.

The command is used without version specification. Ensure that:

  1. The command is available in the container
  2. The output format is compatible with the new multiplex directory structure

Also applies to: 155-155

✅ Verification successful

extractReads command is properly integrated in the build system

The command extractReads is a part of the package itself:

  • It's compiled from harpy/bin/extractReads.cpp during installation
  • Build scripts ensure it's available in the ${CONDA_PREFIX}/bin directory
  • The command is used consistently across the codebase without version specification

The output path change from '10X' to 'multiplex' is intentional and the command remains compatible with the new directory structure as it only processes input files and writes to the specified output location.

🏁 Scripts executed

The following scripts were executed for the analysis:

Script:

#!/bin/bash
# Description: Verify extractReads command availability and usage

# Check if extractReads is mentioned in any documentation or container definition
rg "extractReads" --glob "Dockerfile"
rg "extractReads" --glob "*.md"

Length of output: 216


Script:

#!/bin/bash
# Let's check for any container or environment setup files with different extensions
fd -e yml -e yaml -e config -e env | xargs rg -i "extractReads"

# Check the actual Snakefile where the command is used
rg -A 2 -B 2 "extractReads" 

# Look for any conda/environment specifications
fd "environment" -e yml -e yaml | xargs cat

# Check if there are any other similar commands or patterns in Snakefiles
fd "Snakefile" | xargs rg "collect.*fastq"

Length of output: 2378

harpy/scripts/HaploSim.pl (1)

259-262: Verify thread safety in concurrent operations.

The script creates multiple threads but there might be race conditions in:

  1. File operations
  2. Shared resource access
  3. Memory management
harpy/_validations.py (3)

325-336: LGTM! Correct handling of contig lengths.

The removal of the zero-based adjustment is correct. Contig lengths should represent the actual number of bases without adjustment.


380-411: LGTM! Comprehensive FASTA validation.

The implementation thoroughly validates FASTA format with clear error messages and helpful solutions for each case.


Line range hint 457-477: LGTM! Proper validation of barcode tags.

The implementation correctly ensures mutually exclusive BX/BC tags with appropriate error handling.

🧰 Tools
🪛 Ruff

476-476: Loop control variable future not used within loop body

Rename unused future to _future

(B007)

harpy/bin/inline_to_haplotag.py Show resolved Hide resolved
harpy/bin/inline_to_haplotag.py Show resolved Hide resolved
harpy/snakefiles/simulate_linkedreads.smk Show resolved Hide resolved
harpy/scripts/HaploSim.pl Show resolved Hide resolved
harpy/scripts/HaploSim.pl Show resolved Hide resolved
harpy/scripts/HaploSim.pl Show resolved Hide resolved
harpy/_validations.py Show resolved Hide resolved
@pdimens
Copy link
Owner Author

pdimens commented Nov 19, 2024

fixes #165

@pdimens pdimens merged commit b56bf1c into main Nov 19, 2024
19 of 20 checks passed
@pdimens pdimens deleted the haplotag_lrsims branch November 19, 2024 16:49
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant