haplotagging barcodes as default #162
Conversation
📝 Walkthrough

The pull request introduces modifications to several scripts and configuration files related to linked-read simulation.

Changes
Actionable comments posted: 4
🧹 Outside diff range and nitpick comments (11)
harpy/bin/inline_to_haplotag.py (4)
`2-2`: Enhance the module docstring with more details.

The current docstring is quite brief. Consider expanding it to include:
- Input format expectations
- Output format specifications
- Description of haplotag style barcodes
- Example usage
```diff
-"""Convert inline barcodes into haplotag style ones"""
+"""
+Convert inline barcodes into haplotag style ones.
+
+This script processes paired-end FASTQ files containing inline barcodes and converts
+them to haplotag format, where barcodes are stored in BX:Z: and OX:Z: SAM tags
+in the read headers.
+
+Input:
+    - Paired FASTQ files with inline barcodes in first 16bp of R1 reads
+    - List of valid barcodes
+
+Output:
+    - Modified FASTQ files with:
+        - Barcodes removed from R1 sequences
+        - Added BX:Z: and OX:Z: tags in headers
+    - Barcode conversion mapping to stdout
+
+Example haplotag format: A01C02B03D04 (represents concatenated indices)
+"""
```
Line range hint `32-43`: Improve function documentation and robustness.

Several improvements could be made to the `process_record` function:

- The docstring lacks parameter and return value documentation
- The function relies on global variables (`bc_dict`, `bc_generator`)
- Missing validation for the expected 16bp barcode length
Consider this refactoring:
```diff
-def process_record(fw_entry, rv_entry):
-    """convert the barcode to haplotag"""
+def process_record(fw_entry, rv_entry, bc_dict, bc_generator):
+    """Convert inline barcode to haplotag format.
+
+    Args:
+        fw_entry (list): Forward read entry [header, seq, '+', qual]
+        rv_entry (list): Reverse read entry [header, seq, '+', qual]
+        bc_dict (dict): Barcode conversion mapping
+        bc_generator: Generator for new haplotag barcodes
+
+    Returns:
+        tuple: (new_forward_entry, new_reverse_entry)
+
+    Raises:
+        ValueError: If forward read sequence is shorter than 16bp
+    """
+    if len(fw_entry[1]) < 16:
+        raise ValueError("Forward read sequence must be at least 16bp long")
+
     bc10x = fw_entry[1][:16]
     bchap = bc_dict.get(bc10x, "A00C00B00D00")
     if not bchap:
         bchap = "".join(next(bc_generator))
         bc_dict[bc10x] = bchap
```
Line range hint `74-97`: Add error handling and improve processing efficiency.

The main processing loop could be improved in several ways:

- Add error handling for malformed FASTQ files
- Process records more memory efficiently
- Add progress indication for large files
Consider implementing a FASTQ record iterator and adding error handling:
```python
def iter_fastq_records(file_handle):
    """Iterate over FASTQ records in a file.

    Args:
        file_handle: Opened gzip file handle

    Yields:
        list: FASTQ record [header, seq, '+', qual]

    Raises:
        ValueError: If file is not in FASTQ format
    """
    record = []
    for line in file_handle:
        line = line.decode().rstrip("\n")
        record.append(line)
        if len(record) == 4:
            if not (record[0].startswith("@") and record[2] == "+"):
                raise ValueError("Invalid FASTQ format")
            yield record
            record = []
    if record:
        raise ValueError("Incomplete FASTQ record at end of file")
```

Then use it in the main loop:
```python
from tqdm import tqdm
import os

# Get file size for progress bar
fw_size = os.path.getsize(fw_reads)

with gzip.open(fw_reads, "r") as fw_i, gzip.open(rv_reads, "r") as rv_i:
    pbar = tqdm(total=fw_size, unit='B', unit_scale=True)
    for fw_record, rv_record in zip(iter_fastq_records(fw_i), iter_fastq_records(rv_i)):
        new_fw, new_rv = process_record(fw_record, rv_record, bc_dict, bc_generator)
        fw_out.write(new_fw.encode("utf-8"))
        rv_out.write(new_rv.encode("utf-8"))
        pbar.update(sum(len(line) + 1 for line in fw_record))
```
Line range hint `52-63`: Fix barcode file handling and add validation.

The barcode file handling has two issues:

- The gzipped file check is incorrect (should check before opening)
- Missing validation of barcode format
```diff
-# read in barcodes
-with open(args.barcodes, "r") as bc_file:
-    while True:
-        # Get next line from file
-        if args.barcodes.endswith("gz"):
-            line = bc_file.readline().decode()
-        else:
-            line = bc_file.readline()
-        # if line is empty
-        # end of file is reached
-        if not line:
-            break
-        bc = line.rstrip("\n").split()
-        _10X = str(bc[0])
-        bc_dict[_10X] = None
+def validate_barcode(barcode):
+    """Validate barcode format (16bp of A,C,G,T)."""
+    if not (len(barcode) == 16 and set(barcode).issubset({'A','C','G','T'})):
+        raise ValueError(f"Invalid barcode format: {barcode}")
+
+# read in barcodes
+opener = gzip.open if args.barcodes.endswith('.gz') else open
+mode = 'rt' if args.barcodes.endswith('.gz') else 'r'
+
+with opener(args.barcodes, mode) as bc_file:
+    for line in bc_file:
+        barcode = line.rstrip().split()[0]
+        validate_barcode(barcode)
+        bc_dict[barcode] = None
```

harpy/bin/haplotag_barcodes.py (2)
`15-20`: Consider moving barcode sequences to a configuration file.

The hardcoded barcode sequences make the code harder to maintain and update. Consider:
- Moving the sequences to a separate JSON/YAML configuration file
- Adding validation to ensure sequences only contain valid DNA bases (ACGT)
Example implementation:
```python
import json
from pathlib import Path

def validate_sequence(seq):
    valid_bases = set('ACGT')
    if not all(base in valid_bases for base in seq):
        raise ValueError(f"Invalid DNA sequence: {seq}")
    return seq

def load_barcodes(config_path):
    with open(config_path) as f:
        data = json.load(f)
    return {
        key: [validate_sequence(seq) for seq in sequences]
        for key, sequences in data.items()
    }

# Load from config file
BX = load_barcodes(Path(__file__).parent / "barcodes.json")
```
`22-24`: Add error handling and progress indication.

The barcode generation logic is correct, but consider these improvements:
- Add error handling for write operations
- Add progress indication for large output (4700+ combinations)
```diff
+from tqdm import tqdm
+
 bc_generator = product(BX["A"], BX["C"], BX["B"], BX["D"])
-for BC in bc_generator:
-    sys.stdout.write("".join(BC) + "\n")
+try:
+    for BC in tqdm(bc_generator, desc="Generating barcodes"):
+        sys.stdout.write("".join(BC) + "\n")
+except IOError as e:
+    print(f"Error writing to stdout: {e}", file=sys.stderr)
+    sys.exit(1)
```

Don't forget to add `tqdm` to your project's dependencies.

harpy/snakefiles/simulate_linkedreads.smk (3)
`53-59`: Add error handling and logging for barcode generation.

While the rule correctly generates haplotag barcodes, it lacks error handling and logging. Consider adding these to ensure robustness:

```diff
 rule create_barcodes:
     output:
         barcodefile
+    log:
+        outdir + "/logs/create_barcodes.log"
     container:
         None
     shell:
-        "haplotag_barcodes.py > {output}"
+        "haplotag_barcodes.py > {output} 2> {log} || (rm -f {output} && exit 1)"
```
`197-198`: Enhance workflow summary with barcode conversion details.

While the summary correctly reflects the script rename, consider adding more details about the barcode conversion process to help users understand the workflow better.

```diff
 bxconvert = "Inline barcodes were converted in haplotag BX:Z tags using:\n"
-bxconvert += "\tinline_to_haplotag.py"
+bxconvert += f"\tinline_to_haplotag.py -f READ1.fq -r READ2.fq -b {barcodefile} -p PREFIX"
```
Fix log redirection and update log directory name for consistency

Two issues need attention:

- The log redirection should use `2>` to capture stderr instead of `>`.
- The log directory name still uses the old "10XtoHaplotag" naming scheme.

```diff
     log:
-        outdir + "/logs/10XtoHaplotag/hap{hap}"
+        outdir + "/logs/demultiplex_barcodes/hap{hap}"
     shell:
-        "inline_to_haplotag.py -f {input.fw} -r {input.rv} -b {input.barcodes} -p {params} > {log}"
+        "inline_to_haplotag.py -f {input.fw} -r {input.rv} -b {input.barcodes} -p {params} 2> {log}"
```

🔗 Analysis chain
Line range hint `151-166`: Fix log redirection and verify script availability.

The rule has been correctly renamed to better reflect its functionality. However, there are two concerns:

- The log redirection uses `>` which captures stdout instead of stderr. This might miss error messages.
- We should verify that the renamed script exists in the expected location.

```diff
     shell:
-        "inline_to_haplotag.py -f {input.fw} -r {input.rv} -b {input.barcodes} -p {params} > {log}"
+        "inline_to_haplotag.py -f {input.fw} -r {input.rv} -b {input.barcodes} -p {params} 2> {log}"
```

🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
```bash
#!/bin/bash
# Description: Verify the existence and permissions of the renamed script

# Check if the script exists and is executable
fd -t x "inline_to_haplotag.py"

# Check if there are any remaining references to the old script name
rg "10XtoHaplotag\.py"
```

Length of output: 88
Script:
```bash
#!/bin/bash
# Let's verify if the script is actually executable and check its content
ls -l harpy/bin/inline_to_haplotag.py

# Check if there are any references to the old name without the .py extension
rg "10XtoHaplotag"

# Check the script content to ensure it's the correct implementation
head -n 10 harpy/bin/inline_to_haplotag.py
```

Length of output: 515
harpy/simulate.py (2)
`129-129`: Document the rationale for the molecules-per limit.

While adding the upper limit of 4700 aligns with the PR objectives, it would be helpful to document why this specific limit was chosen, either in the help text or as a code comment.
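Encoding the rationale next to the bound itself keeps them from drifting apart. Harpy's CLI is built on click, but the idea can be sketched with plain argparse; the option name mirrors the PR, while the parser, helper, and default here are illustrative only:

```python
import argparse

MAX_MOLECULES_PER = 4700  # cap referenced in the author note mentioned by the PR

def bounded_molecules(value):
    """Parse --molecules-per, rejecting values above the documented cap."""
    iv = int(value)
    if not 1 <= iv <= MAX_MOLECULES_PER:
        raise argparse.ArgumentTypeError(
            f"must be between 1 and {MAX_MOLECULES_PER}"
        )
    return iv

parser = argparse.ArgumentParser()
parser.add_argument(
    "--molecules-per",
    type=bounded_molecules,
    default=10,
    help=f"molecules per partition (max {MAX_MOLECULES_PER}; see author note)",
)
```

A value above the cap is rejected at parse time, so the reason for the limit surfaces in the error message rather than in a distant comment.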
`151-153`: Avoid hardcoding file size in documentation.

The docstring mentions a specific file size ("~2GB disk space") which could become outdated if the barcode generation changes. Consider making this more general or removing the size reference to reduce maintenance overhead.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (5)
- `.github/filters.yml` (1 hunks)
- `harpy/bin/haplotag_barcodes.py` (1 hunks)
- `harpy/bin/inline_to_haplotag.py` (2 hunks)
- `harpy/simulate.py` (3 hunks)
- `harpy/snakefiles/simulate_linkedreads.smk` (5 hunks)
🔇 Additional comments (4)
.github/filters.yml (1)
`156-157`: Verify module assignments for new scripts.

The addition of the new scripts to the `simreads` module aligns with the PR objectives. However, since `haplotag_barcodes.py` handles barcode generation similar to `haplotag_acbd.py`, consider if it should also be included in the `demux` module's file list.
✅ Verification successful
Let me try another verification approach to check the actual content and purpose of these files.
Module assignments for new scripts are correct
Based on the verification results, both scripts are correctly assigned to the `simreads` module only:

- `inline_to_haplotag.py` converts 10x linked reads to haplotag format, which is a simulation-specific task
- `haplotag_barcodes.py` generates barcode lists specifically for the simulation workflow
- Both files are exclusively referenced in `simulate_linkedreads.smk` and are part of the simulation pipeline
- While there are demultiplexing-related files in the codebase, these new scripts serve different purposes and don't share functionality with the demultiplexing module
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
```bash
#!/bin/bash
# Description: Check if haplotag_barcodes.py has similar functionality to files in demux module

# Check for similar function signatures or patterns between the two files
echo "Comparing haplotag_barcodes.py with haplotag_acbd.py:"
rg -A 5 "def" "harpy/bin/haplotag_barcodes.py" "harpy/bin/haplotag_acbd.py" || echo "Files not found"

# Check if the new files are referenced in demux-related files
echo -e "\nChecking references in demux files:"
rg "haplotag_barcodes|inline_to_haplotag" "harpy/snakefiles/demultiplex*.smk" || echo "No references found in demux files"
```

Length of output: 549
Script:
```bash
#!/bin/bash
# Check the actual content and purpose of the new files
echo "Content of inline_to_haplotag.py:"
rg -A 10 "def|class|import" "harpy/bin/inline_to_haplotag.py" || echo "File not found"

echo -e "\nContent of haplotag_barcodes.py:"
rg -A 10 "def|class|import" "harpy/bin/haplotag_barcodes.py" || echo "File not found"

# Check if these files are referenced in simulation-related files
echo -e "\nChecking references in simulation files:"
fd -e smk -e py -e sh . | xargs rg "haplotag_barcodes|inline_to_haplotag"

# Check for any demux-related files in the codebase
echo -e "\nListing demux-related files:"
fd "demux|demultiplex" .
```

Length of output: 2941
harpy/bin/haplotag_barcodes.py (1)
`1-6`: LGTM! Clean and minimal imports.
The script setup is well-structured with appropriate imports and a clear docstring.
harpy/snakefiles/simulate_linkedreads.smk (1)
`20-20`: LGTM: Barcode file path configuration is correct.
The ternary operator correctly handles both custom and default barcode file paths, aligning with the PR's objective of using haplotagging barcodes by default.
harpy/simulate.py (1)
Line range hint `214-600`: LGTM: Other simulation commands maintain consistency.
The remaining simulation commands (snpindel, inversion, cnv, translocation) maintain a consistent structure and proper error handling.
Actionable comments posted: 5
🧹 Outside diff range and nitpick comments (9)
harpy/bin/parse_phaseblocks.py (1)
Line range hint `26-38`: Consider adding input validation and error handling

The script assumes a specific file format but doesn't validate it. Consider adding basic error handling for:

- Empty files
- Malformed block headers
- Missing required fields in data lines
Here's a suggested implementation:
```diff
 with open(args.input, "r", encoding="utf-8") as blocks:
     FIRST_LINE = True
+    found_block = False
     for line in blocks:
         lsplit = line.split()
+        if not lsplit:  # Skip empty lines
+            continue
         if lsplit[0] == "BLOCK:":
+            if len(lsplit) < 9:
+                raise ValueError(f"Malformed block header: {line.strip()}")
             n_snp = int(lsplit[6])
             len_block = int(lsplit[8])
             FIRST_LINE = True
+            found_block = True
         else:
             if FIRST_LINE:
+                if len(lsplit) < 5:
+                    raise ValueError(f"Malformed data line: {line.strip()}")
                 pos_start = int(lsplit[4])
                 contig = lsplit[3]
                 FIRST_LINE = False
                 sys.stdout.write(f"{samplename}\t{contig}\t{n_snp}\t{pos_start}\t{len_block}\n")
             else:
                 continue
+    if not found_block:
+        raise ValueError("No valid blocks found in input file")
```

harpy/bin/make_windows.py (2)
Line range hint `55-59`: LGTM! The for loop is a more Pythonic approach.

The change from a while loop to a for loop for reading FAI files is a good improvement that makes the code more readable and maintainable.

Consider applying similar improvements to the FASTA reading section (lines 64-91) which still uses while loops. Here's a suggested refactor:
```diff
-    line = fopen.readline()
-    while True:
-        C_LEN=0
-        if not line:
-            break
-        if line.startswith(">"):
-            # remove newline, > starting symbol, and any comments after name
-            contig = line.rstrip("\n").lstrip(">").split()[0]
-            # keep reading until hitting another ">"
-            HEADER = False
-            while not HEADER:
-                line = fopen.readline()
-                if not line:
-                    break
-                line = line.rstrip("\n")
-                if line.startswith(">"):
-                    HEADER = True
-                    break
-                else:
-                    C_LEN += len(line)-1
-            HEADER = False
+    current_contig = None
+    current_len = 0
+
+    for line in fopen:
+        line = line.rstrip("\n")
+        if line.startswith(">"):
+            if current_contig:
+                current_len += args.mode
+                makewindows(current_contig, current_len, args.window)
+            current_contig = line.lstrip(">").split()[0]
+            current_len = 0
+        else:
+            current_len += len(line)
+
+    if current_contig:
+        current_len += args.mode
+        makewindows(current_contig, current_len, args.window)
```

This refactoring:
- Eliminates nested while loops
- Simplifies the logic by tracking the current contig and length
- Processes windows when encountering a new contig or at EOF
- Maintains the same functionality with cleaner code
Line range hint `1-91`: Consider enhancing robustness and user experience.

While the core functionality is solid, consider these improvements:

- Add validation to prevent negative window positions
- Add progress indication for large files
- Optimize memory usage for large FASTA files

Here are specific suggestions:
1. Add position validation in the `makewindows` function:

```python
def makewindows(_contig, _c_len, windowsize):
    """create a file of the specified windows"""
    if _c_len < 0:
        raise ValueError(f"Invalid contig length {_c_len} for {_contig}")
    # ... rest of the function
```

2. Add progress reporting using `sys.stderr`:

```python
def log_progress(msg):
    print(msg, file=sys.stderr)
```

3. For large FASTA files, consider processing sequences in chunks:

```python
def process_sequence_chunk(chunk, chunk_size=1024*1024):
    """Process sequence data in chunks to optimize memory usage"""
    for i in range(0, len(chunk), chunk_size):
        yield chunk[i:i + chunk_size]
```

harpy/bin/molecule_coverage.py (1)
`48-48`: LGTM! Great improvement to use Pythonic file iteration.

The change from `while True` loop to `for` loop is a good improvement that makes the code more idiomatic and efficient.

Consider using a more descriptive variable name than `fai` to improve code readability, for example:

```diff
-with open(args.fai, "r", encoding= "utf-8") as fai:
-    for line in fai:
+with open(args.fai, "r", encoding= "utf-8") as fasta_index:
+    for line in fasta_index:
```

harpy/snakefiles/simulate_variants.smk (1)
Line range hint `83-89`: LGTM! Consider adding explicit error handling.

The change from a `while True` loop to a `for` loop is a good improvement that makes the code more Pythonic and efficient.

Consider adding explicit error handling for potential I/O issues:

```diff
-            for line in in_vcf:
-                if line.startswith("#") or rng.uniform(0, 1) >= params.het:
-                    # write header lines and homozygous variants to both files
-                    hap1.write(line)
-                    hap2.write(line)
-                elif rng.random() < 0.5:
-                    hap1.write(line)
-                else:
-                    hap2.write(line)
+            try:
+                for line in in_vcf:
+                    if line.startswith("#") or rng.uniform(0, 1) >= params.het:
+                        # write header lines and homozygous variants to both files
+                        hap1.write(line)
+                        hap2.write(line)
+                    elif rng.random() < 0.5:
+                        hap1.write(line)
+                    else:
+                        hap2.write(line)
+            except IOError as e:
+                raise WorkflowError(f"Error processing VCF file: {e}")
```

harpy/snakefiles/simulate_snpindel.smk (1)
Line range hint `121-128`: LGTM! Consider extracting the random assignment logic.

The change from a while loop to a for loop improves readability while maintaining the same functionality. The random assignment of variants to haplotypes is correctly implemented.

Consider extracting the random assignment logic into a helper function to improve maintainability:

```diff
 rule diploid_snps:
     input:
         f"{outdir}/{outprefix}.snp.vcf"
     output:
         f"{outdir}/diploid/{outprefix}.snp.hap1.vcf",
         f"{outdir}/diploid/{outprefix}.snp.hap2.vcf"
     params:
         het = heterozygosity
     run:
+        def assign_variant(rng, heterozygosity):
+            """Helper function to determine if and where to assign a variant."""
+            if rng.uniform(0, 1) >= heterozygosity:
+                return "both"
+            return "hap1" if rng.random() < 0.5 else "hap2"
+
         rng = random.Random(randomseed) if randomseed else random.Random()
         with open(input[0], "r") as in_vcf, open(output[0], "w") as hap1, open(output[1], "w") as hap2:
             for line in in_vcf:
-                if line.startswith("#") or rng.uniform(0, 1) >= params.het:
-                    # write header lines and homozygous variants to both files
+                if line.startswith("#"):
                     hap1.write(line)
                     hap2.write(line)
-                elif rng.random() < 0.5:
-                    hap1.write(line)
-                else:
-                    hap2.write(line)
+                    continue
+
+                assignment = assign_variant(rng, params.het)
+                if assignment == "both":
+                    hap1.write(line)
+                    hap2.write(line)
+                elif assignment == "hap1":
+                    hap1.write(line)
+                else:
+                    hap2.write(line)
```

harpy/snakefiles/snp_mpileup.smk (1)
`38-41`: Add input validation for region file format.

The code assumes each line contains exactly three space-separated columns. Consider adding validation to handle malformed lines gracefully.

```diff
 for line in reg_in:
-    cont,startpos,endpos = line.split()
-    intervals.add(f"{cont}:{startpos}-{endpos}")
+    parts = line.strip().split()
+    if len(parts) != 3:
+        raise ValueError(f"Invalid region format in line: {line.strip()}")
+    cont, startpos, endpos = parts
+    try:
+        # Validate positions are integers
+        int(startpos), int(endpos)
+        intervals.add(f"{cont}:{startpos}-{endpos}")
+    except ValueError:
+        raise ValueError(f"Invalid position values in line: {line.strip()}")
```

harpy/bin/inline_to_haplotag.py (2)
`53-53`: Correct typo in error message

There's a typo in the word "captial"; it should be "capital" in the error message.

Apply this diff to fix the typo:

```diff
-    raise ValueError(f"Invalid barcode format: {barcode}. Barcodes must be captial letters and only contain standard nucleotide values ATCG.")
+    raise ValueError(f"Invalid barcode format: {barcode}. Barcodes must be capital letters and only contain standard nucleotide values ATCG.")
```
`86-87`: Use context managers when opening output files

To ensure proper resource management, use context managers (`with` statements) when opening the output files.

Apply this diff:

```diff
-fw_out = gzip.open(f"{args.prefix}.R1.fq.gz", "wb", 6)
-rv_out = gzip.open(f"{args.prefix}.R2.fq.gz", "wb", 6)
+with gzip.open(f"{args.prefix}.R1.fq.gz", "wb", 6) as fw_out, \
+     gzip.open(f"{args.prefix}.R2.fq.gz", "wb", 6) as rv_out:
+    # Process the input files within this block
+    with gzip.open(args.forward, "r") as fw_i, gzip.open(args.reverse, "r") as rv_i:
+        for fw_record, rv_record in zip(iter_fastq_records(fw_i), iter_fastq_records(rv_i)):
+            new_fw, new_rv = process_record(fw_record, rv_record, bc_dict, bc_generator)
+            fw_out.write(new_fw.encode("utf-8"))
+            rv_out.write(new_rv.encode("utf-8"))
```

🧰 Tools
🪛 Ruff
86-86: Use a context manager for opening files
(SIM115)
87-87: Use a context manager for opening files
(SIM115)
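For context, Ruff's SIM115 flags exactly the pattern the review comment above describes: a file handle opened without a context manager can leak if an exception occurs before `close()` runs. A minimal, self-contained illustration (using a temporary path, not harpy's actual output files):

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "data.txt")

# Pattern flagged by SIM115: the handle leaks if an exception
# occurs between open() and close()
f = open(path, "w")
f.write("leaky")
f.close()

# Preferred: the context manager closes the file even if an error occurs
with open(path, "w") as f:
    f.write("safe")
```

The same reasoning applies to the gzip handles in `inline_to_haplotag.py`, since `gzip.open` returns a file object that supports the context-manager protocol.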
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (9)
- `harpy/bin/inline_to_haplotag.py` (1 hunks)
- `harpy/bin/make_windows.py` (1 hunks)
- `harpy/bin/molecule_coverage.py` (1 hunks)
- `harpy/bin/parse_phaseblocks.py` (1 hunks)
- `harpy/simulate.py` (3 hunks)
- `harpy/snakefiles/simulate_snpindel.smk` (1 hunks)
- `harpy/snakefiles/simulate_variants.smk` (1 hunks)
- `harpy/snakefiles/snp_freebayes.smk` (1 hunks)
- `harpy/snakefiles/snp_mpileup.smk` (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- harpy/simulate.py
🧰 Additional context used
🪛 Ruff
harpy/bin/inline_to_haplotag.py
86-86: Use a context manager for opening files
(SIM115)
87-87: Use a context manager for opening files
(SIM115)
🔇 Additional comments (3)
harpy/bin/parse_phaseblocks.py (1)
`26-26`: LGTM! More Pythonic file iteration

The change to use a `for` loop for file iteration is a good improvement. It's more idiomatic Python and eliminates the need for explicit EOF checking.
harpy/snakefiles/snp_mpileup.smk (1)
`38-41`: LGTM! Good simplification of the region parsing logic.
The change from a while loop to a for loop makes the code more Pythonic and easier to read.
harpy/bin/inline_to_haplotag.py (1)
`11-11`: Fix inconsistent program description
The description still refers to "inline linked read barcodes," which may not accurately reflect that the script handles generic inline barcodes. Consider updating it for clarity.
```python
for fw_record, rv_record in zip_longest(iter_fastq_records(fw_i), iter_fastq_records(rv_i)):
    new_fw, new_rv = process_record(fw_record, rv_record, bc_dict, bc_generator)
    fw_out.write(new_fw.encode("utf-8"))
    rv_out.write(new_rv.encode("utf-8"))
```
Handle mismatched number of records in paired FASTQ files

Using `zip_longest` can lead to `None` values if the files have different numbers of records, causing errors in `process_record`. Paired FASTQ files should have matching records.
Modify the code to use `zip` and check for mismatches:

```diff
 with gzip.open(args.forward, "r") as fw_i, gzip.open(args.reverse, "r") as rv_i:
-    for fw_record, rv_record in zip_longest(iter_fastq_records(fw_i), iter_fastq_records(rv_i)):
+    for fw_record, rv_record in zip(iter_fastq_records(fw_i), iter_fastq_records(rv_i)):
         new_fw, new_rv = process_record(fw_record, rv_record, bc_dict, haplotag_bc)
         fw_out.write(new_fw.encode("utf-8"))
         rv_out.write(new_rv.encode("utf-8"))
+    # Check for any remaining records
+    if next(iter_fastq_records(fw_i), None) or next(iter_fastq_records(rv_i), None):
+        raise ValueError("Mismatched number of records in paired FASTQ files.")
```
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
for fw_record, rv_record in zip(iter_fastq_records(fw_i), iter_fastq_records(rv_i)):
    new_fw, new_rv = process_record(fw_record, rv_record, bc_dict, bc_generator)
    fw_out.write(new_fw.encode("utf-8"))
    rv_out.write(new_rv.encode("utf-8"))
# Check for any remaining records
if next(iter_fastq_records(fw_i), None) or next(iter_fastq_records(rv_i), None):
    raise ValueError("Mismatched number of records in paired FASTQ files.")
```
This PR changes the default behavior for linked read simulation to use the standard set of haplotagging barcodes rather than the standard 10X set of barcodes.

- Adds a `haplotag_barcodes.py` script that's similar to `haplotag_acbd.py` but writes a single file to `stdout`
- `10XtoHaplotag.py` renamed to `inline_to_haplotag.py` because that's really what it's doing
- Adds an upper limit to `--molecules-per` to reflect the author note to not let this value exceed `4700`
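The core of the `haplotag_barcodes.py` approach described above (taking the Cartesian product of the four segment pools and streaming each joined barcode to stdout) can be sketched as follows. Note the two-sequence pools here are shortened placeholders for illustration, not the full per-segment sets the real script defines:

```python
from itertools import product

# Placeholder pools; the real script defines full sequence lists
# for each of the A, C, B, D haplotag segments
BX = {
    "A": ["ACGT", "AATT"],
    "C": ["CCGG", "CATG"],
    "B": ["GGTT", "GACT"],
    "D": ["TTAA", "TGCA"],
}

def haplotag_barcodes(bx):
    """Yield every A-C-B-D segment combination as one concatenated barcode."""
    for combo in product(bx["A"], bx["C"], bx["B"], bx["D"]):
        yield "".join(combo)

for bc in haplotag_barcodes(BX):
    print(bc)
```

Because `itertools.product` is lazy, the full barcode list is never held in memory; each combination is joined and written as it is generated, which matches the single-file-to-stdout design of the new script.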
Summary by CodeRabbit
Release Notes
New Features
Updates

- `linkedreads` command with improved barcode generation and input validation.

Bug Fixes