haplotagging barcodes as default #162
Conversation
📝 Walkthrough

The pull request introduces modifications to several scripts and configuration files related to linked-read simulation.

Changes
Actionable comments posted: 4
🧹 Outside diff range and nitpick comments (11)
harpy/bin/inline_to_haplotag.py (4)
`2-2`: Enhance the module docstring with more details.

The current docstring is quite brief. Consider expanding it to include:
- Input format expectations
- Output format specifications
- Description of haplotag style barcodes
- Example usage
```diff
-"""Convert inline barcodes into haplotag style ones"""
+"""
+Convert inline barcodes into haplotag style ones.
+
+This script processes paired-end FASTQ files containing inline barcodes and converts
+them to haplotag format, where barcodes are stored in BX:Z: and OX:Z: SAM tags
+in the read headers.
+
+Input:
+    - Paired FASTQ files with inline barcodes in first 16bp of R1 reads
+    - List of valid barcodes
+
+Output:
+    - Modified FASTQ files with:
+        - Barcodes removed from R1 sequences
+        - Added BX:Z: and OX:Z: tags in headers
+    - Barcode conversion mapping to stdout
+
+Example haplotag format: A01C02B03D04 (represents concatenated indices)
+"""
```
Line range hint `32-43`: Improve function documentation and robustness.

Several improvements could be made to the `process_record` function:

- The docstring lacks parameter and return value documentation
- The function relies on global variables (`bc_dict`, `bc_generator`)
- Missing validation for the expected 16bp barcode length
Consider this refactoring:
```diff
-def process_record(fw_entry, rv_entry):
-    """convert the barcode to haplotag"""
+def process_record(fw_entry, rv_entry, bc_dict, bc_generator):
+    """Convert inline barcode to haplotag format.
+
+    Args:
+        fw_entry (list): Forward read entry [header, seq, '+', qual]
+        rv_entry (list): Reverse read entry [header, seq, '+', qual]
+        bc_dict (dict): Barcode conversion mapping
+        bc_generator: Generator for new haplotag barcodes
+
+    Returns:
+        tuple: (new_forward_entry, new_reverse_entry)
+
+    Raises:
+        ValueError: If forward read sequence is shorter than 16bp
+    """
+    if len(fw_entry[1]) < 16:
+        raise ValueError("Forward read sequence must be at least 16bp long")
+
     bc10x = fw_entry[1][:16]
     bchap = bc_dict.get(bc10x, "A00C00B00D00")
     if not bchap:
         bchap = "".join(next(bc_generator))
         bc_dict[bc10x] = bchap
```
Line range hint `74-97`: Add error handling and improve processing efficiency.

The main processing loop could be improved in several ways:

- Add error handling for malformed FASTQ files
- Process records more memory efficiently
- Add progress indication for large files
Consider implementing a FASTQ record iterator and adding error handling:
```python
def iter_fastq_records(file_handle):
    """Iterate over FASTQ records in a file.

    Args:
        file_handle: Opened gzip file handle

    Yields:
        list: FASTQ record [header, seq, '+', qual]

    Raises:
        ValueError: If file is not in FASTQ format
    """
    record = []
    for line in file_handle:
        line = line.decode().rstrip("\n")
        record.append(line)
        if len(record) == 4:
            if not (record[0].startswith("@") and record[2] == "+"):
                raise ValueError("Invalid FASTQ format")
            yield record
            record = []
    if record:
        raise ValueError("Incomplete FASTQ record at end of file")
```

Then use it in the main loop:
```python
from tqdm import tqdm
import os

# Get file size for progress bar
fw_size = os.path.getsize(fw_reads)

with gzip.open(fw_reads, "r") as fw_i, gzip.open(rv_reads, "r") as rv_i:
    pbar = tqdm(total=fw_size, unit='B', unit_scale=True)
    for fw_record, rv_record in zip(iter_fastq_records(fw_i), iter_fastq_records(rv_i)):
        new_fw, new_rv = process_record(fw_record, rv_record, bc_dict, bc_generator)
        fw_out.write(new_fw.encode("utf-8"))
        rv_out.write(new_rv.encode("utf-8"))
        pbar.update(sum(len(line) + 1 for line in fw_record))
```
Line range hint `52-63`: Fix barcode file handling and add validation.

The barcode file handling has two issues:

- The gzipped file check is incorrect (should check before opening)
- Missing validation of barcode format
```diff
-# read in barcodes
-with open(args.barcodes, "r") as bc_file:
-    while True:
-        # Get next line from file
-        if args.barcodes.endswith("gz"):
-            line = bc_file.readline().decode()
-        else:
-            line = bc_file.readline()
-        # if line is empty
-        # end of file is reached
-        if not line:
-            break
-        bc = line.rstrip("\n").split()
-        _10X = str(bc[0])
-        bc_dict[_10X] = None
+def validate_barcode(barcode):
+    """Validate barcode format (16bp of A,C,G,T)."""
+    if not (len(barcode) == 16 and set(barcode).issubset({'A','C','G','T'})):
+        raise ValueError(f"Invalid barcode format: {barcode}")
+
+# read in barcodes
+opener = gzip.open if args.barcodes.endswith('.gz') else open
+mode = 'rt' if args.barcodes.endswith('.gz') else 'r'
+
+with opener(args.barcodes, mode) as bc_file:
+    for line in bc_file:
+        barcode = line.rstrip().split()[0]
+        validate_barcode(barcode)
+        bc_dict[barcode] = None
```

harpy/bin/haplotag_barcodes.py (2)
`15-20`: Consider moving barcode sequences to a configuration file.

The hardcoded barcode sequences make the code harder to maintain and update. Consider:
- Moving the sequences to a separate JSON/YAML configuration file
- Adding validation to ensure sequences only contain valid DNA bases (ACGT)
Example implementation:
```python
import json
from pathlib import Path

def validate_sequence(seq):
    valid_bases = set('ACGT')
    if not all(base in valid_bases for base in seq):
        raise ValueError(f"Invalid DNA sequence: {seq}")
    return seq

def load_barcodes(config_path):
    with open(config_path) as f:
        data = json.load(f)
    return {
        key: [validate_sequence(seq) for seq in sequences]
        for key, sequences in data.items()
    }

# Load from config file
BX = load_barcodes(Path(__file__).parent / "barcodes.json")
```
`22-24`: Add error handling and progress indication.

The barcode generation logic is correct, but consider these improvements:
- Add error handling for write operations
- Add progress indication for large output (4700+ combinations)
```diff
+from tqdm import tqdm
+
 bc_generator = product(BX["A"], BX["C"], BX["B"], BX["D"])
-for BC in bc_generator:
-    sys.stdout.write("".join(BC) + "\n")
+try:
+    for BC in tqdm(bc_generator, desc="Generating barcodes"):
+        sys.stdout.write("".join(BC) + "\n")
+except IOError as e:
+    print(f"Error writing to stdout: {e}", file=sys.stderr)
+    sys.exit(1)
```

Don't forget to add `tqdm` to your project's dependencies.

harpy/snakefiles/simulate_linkedreads.smk (3)
`53-59`: Add error handling and logging for barcode generation.

While the rule correctly generates haplotag barcodes, it lacks error handling and logging. Consider adding these to ensure robustness:

```diff
 rule create_barcodes:
     output:
         barcodefile
+    log:
+        outdir + "/logs/create_barcodes.log"
     container:
         None
     shell:
-        "haplotag_barcodes.py > {output}"
+        "haplotag_barcodes.py > {output} 2> {log} || (rm -f {output} && exit 1)"
```
`197-198`: Enhance workflow summary with barcode conversion details.

While the summary correctly reflects the script rename, consider adding more details about the barcode conversion process to help users understand the workflow better.

```diff
 bxconvert = "Inline barcodes were converted in haplotag BX:Z tags using:\n"
-bxconvert += "\tinline_to_haplotag.py"
+bxconvert += f"\tinline_to_haplotag.py -f READ1.fq -r READ2.fq -b {barcodefile} -p PREFIX"
```
Fix log redirection and update log directory name for consistency

Two issues need attention:

- The log redirection should use `2>` to capture stderr instead of `>`.
- The log directory name still uses the old "10XtoHaplotag" naming scheme.

```diff
     log:
-        outdir + "/logs/10XtoHaplotag/hap{hap}"
+        outdir + "/logs/demultiplex_barcodes/hap{hap}"
     shell:
-        "inline_to_haplotag.py -f {input.fw} -r {input.rv} -b {input.barcodes} -p {params} > {log}"
+        "inline_to_haplotag.py -f {input.fw} -r {input.rv} -b {input.barcodes} -p {params} 2> {log}"
```

🔗 Analysis chain
Line range hint `151-166`: Fix log redirection and verify script availability.

The rule has been correctly renamed to better reflect its functionality. However, there are two concerns:

- The log redirection uses `>` which captures stdout instead of stderr. This might miss error messages.
- We should verify that the renamed script exists in the expected location.

```diff
     shell:
-        "inline_to_haplotag.py -f {input.fw} -r {input.rv} -b {input.barcodes} -p {params} > {log}"
+        "inline_to_haplotag.py -f {input.fw} -r {input.rv} -b {input.barcodes} -p {params} 2> {log}"
```

🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
```bash
#!/bin/bash
# Description: Verify the existence and permissions of the renamed script

# Check if the script exists and is executable
fd -t x "inline_to_haplotag.py"

# Check if there are any remaining references to the old script name
rg "10XtoHaplotag\.py"
```

Length of output: 88
Script:
```bash
#!/bin/bash
# Let's verify if the script is actually executable and check its content
ls -l harpy/bin/inline_to_haplotag.py

# Check if there are any references to the old name without the .py extension
rg "10XtoHaplotag"

# Check the script content to ensure it's the correct implementation
head -n 10 harpy/bin/inline_to_haplotag.py
```

Length of output: 515
harpy/simulate.py (2)
`129-129`: Document the rationale for the molecules-per limit.

While adding the upper limit of 4700 aligns with the PR objectives, it would be helpful to document why this specific limit was chosen, either in the help text or as a code comment.
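Encoding the rationale next to the bound itself keeps them from drifting apart. Harpy's CLI is built on click, but the idea can be sketched with plain argparse; the option name mirrors the PR, while the parser, helper, and default here are illustrative only:

```python
import argparse

MAX_MOLECULES_PER = 4700  # cap referenced in the author note mentioned by the PR

def bounded_molecules(value):
    """Parse --molecules-per, rejecting values above the documented cap."""
    iv = int(value)
    if not 1 <= iv <= MAX_MOLECULES_PER:
        raise argparse.ArgumentTypeError(
            f"must be between 1 and {MAX_MOLECULES_PER}"
        )
    return iv

parser = argparse.ArgumentParser()
parser.add_argument(
    "--molecules-per",
    type=bounded_molecules,
    default=10,
    help=f"molecules per partition (max {MAX_MOLECULES_PER}; see author note)",
)
```

A value above the cap is rejected at parse time, so the reason for the limit surfaces in the error message rather than in a distant comment.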
`151-153`: Avoid hardcoding file size in documentation.

The docstring mentions a specific file size ("~2GB disk space") which could become outdated if the barcode generation changes. Consider making this more general or removing the size reference to reduce maintenance overhead.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (5)
- `.github/filters.yml` (1 hunks)
- `harpy/bin/haplotag_barcodes.py` (1 hunks)
- `harpy/bin/inline_to_haplotag.py` (2 hunks)
- `harpy/simulate.py` (3 hunks)
- `harpy/snakefiles/simulate_linkedreads.smk` (5 hunks)
🔇 Additional comments (4)
.github/filters.yml (1)
`156-157`: Verify module assignments for new scripts.

The addition of the new scripts to the `simreads` module aligns with the PR objectives. However, since `haplotag_barcodes.py` handles barcode generation similar to `haplotag_acbd.py`, consider if it should also be included in the `demux` module's file list.
✅ Verification successful
Let me try another verification approach to check the actual content and purpose of these files.
Module assignments for new scripts are correct
Based on the verification results, both scripts are correctly assigned to the `simreads` module only:

- `inline_to_haplotag.py` converts 10x linked reads to haplotag format, which is a simulation-specific task
- `haplotag_barcodes.py` generates barcode lists specifically for the simulation workflow
- Both files are exclusively referenced in `simulate_linkedreads.smk` and are part of the simulation pipeline
- While there are demultiplexing-related files in the codebase, these new scripts serve different purposes and don't share functionality with the demultiplexing module
🏁 Scripts executed
The following scripts were executed for the analysis:
Script:
```bash
#!/bin/bash
# Description: Check if haplotag_barcodes.py has similar functionality to files in demux module

# Check for similar function signatures or patterns between the two files
echo "Comparing haplotag_barcodes.py with haplotag_acbd.py:"
rg -A 5 "def" "harpy/bin/haplotag_barcodes.py" "harpy/bin/haplotag_acbd.py" || echo "Files not found"

# Check if the new files are referenced in demux-related files
echo -e "\nChecking references in demux files:"
rg "haplotag_barcodes|inline_to_haplotag" "harpy/snakefiles/demultiplex*.smk" || echo "No references found in demux files"
```

Length of output: 549
Script:
```bash
#!/bin/bash
# Check the actual content and purpose of the new files
echo "Content of inline_to_haplotag.py:"
rg -A 10 "def|class|import" "harpy/bin/inline_to_haplotag.py" || echo "File not found"

echo -e "\nContent of haplotag_barcodes.py:"
rg -A 10 "def|class|import" "harpy/bin/haplotag_barcodes.py" || echo "File not found"

# Check if these files are referenced in simulation-related files
echo -e "\nChecking references in simulation files:"
fd -e smk -e py -e sh . | xargs rg "haplotag_barcodes|inline_to_haplotag"

# Check for any demux-related files in the codebase
echo -e "\nListing demux-related files:"
fd "demux|demultiplex" .
```

Length of output: 2941
harpy/bin/haplotag_barcodes.py (1)
`1-6`: LGTM! Clean and minimal imports.
The script setup is well-structured with appropriate imports and a clear docstring.
harpy/snakefiles/simulate_linkedreads.smk (1)
`20-20`: LGTM: Barcode file path configuration is correct.
The ternary operator correctly handles both custom and default barcode file paths, aligning with the PR's objective of using haplotagging barcodes by default.
harpy/simulate.py (1)
Line range hint `214-600`: LGTM: Other simulation commands maintain consistency.
The remaining simulation commands (snpindel, inversion, cnv, translocation) maintain a consistent structure and proper error handling.
Actionable comments posted: 5
🧹 Outside diff range and nitpick comments (9)
harpy/bin/parse_phaseblocks.py (1)
Line range hint `26-38`: Consider adding input validation and error handling

The script assumes a specific file format but doesn't validate it. Consider adding basic error handling for:

- Empty files
- Malformed block headers
- Missing required fields in data lines
Here's a suggested implementation:
```diff
 with open(args.input, "r", encoding="utf-8") as blocks:
     FIRST_LINE = True
+    found_block = False
     for line in blocks:
         lsplit = line.split()
+        if not lsplit:  # Skip empty lines
+            continue
         if lsplit[0] == "BLOCK:":
+            if len(lsplit) < 9:
+                raise ValueError(f"Malformed block header: {line.strip()}")
             n_snp = int(lsplit[6])
             len_block = int(lsplit[8])
             FIRST_LINE = True
+            found_block = True
         else:
             if FIRST_LINE:
+                if len(lsplit) < 5:
+                    raise ValueError(f"Malformed data line: {line.strip()}")
                 pos_start = int(lsplit[4])
                 contig = lsplit[3]
                 FIRST_LINE = False
                 sys.stdout.write(f"{samplename}\t{contig}\t{n_snp}\t{pos_start}\t{len_block}\n")
             else:
                 continue
+    if not found_block:
+        raise ValueError("No valid blocks found in input file")
```

harpy/bin/make_windows.py (2)
Line range hint `55-59`: LGTM! The for loop is a more Pythonic approach.

The change from a while loop to a for loop for reading FAI files is a good improvement that makes the code more readable and maintainable.

Consider applying similar improvements to the FASTA reading section (lines 64-91) which still uses while loops. Here's a suggested refactor:
```diff
-    line = fopen.readline()
-    while True:
-        C_LEN=0
-        if not line:
-            break
-        if line.startswith(">"):
-            # remove newline, > starting symbol, and any comments after name
-            contig = line.rstrip("\n").lstrip(">").split()[0]
-            # keep reading until hitting another ">"
-            HEADER = False
-            while not HEADER:
-                line = fopen.readline()
-                if not line:
-                    break
-                line = line.rstrip("\n")
-                if line.startswith(">"):
-                    HEADER = True
-                    break
-                else:
-                    C_LEN += len(line)-1
-            HEADER = False
+    current_contig = None
+    current_len = 0
+
+    for line in fopen:
+        line = line.rstrip("\n")
+        if line.startswith(">"):
+            if current_contig:
+                current_len += args.mode
+                makewindows(current_contig, current_len, args.window)
+            current_contig = line.lstrip(">").split()[0]
+            current_len = 0
+        else:
+            current_len += len(line)
+
+    if current_contig:
+        current_len += args.mode
+        makewindows(current_contig, current_len, args.window)
```

This refactoring:
- Eliminates nested while loops
- Simplifies the logic by tracking the current contig and length
- Processes windows when encountering a new contig or at EOF
- Maintains the same functionality with cleaner code
Line range hint `1-91`: Consider enhancing robustness and user experience.

While the core functionality is solid, consider these improvements:

- Add validation to prevent negative window positions
- Add progress indication for large files
- Optimize memory usage for large FASTA files

Here are specific suggestions:
1. Add position validation in the `makewindows` function:

```python
def makewindows(_contig, _c_len, windowsize):
    """create a file of the specified windows"""
    if _c_len < 0:
        raise ValueError(f"Invalid contig length {_c_len} for {_contig}")
    # ... rest of the function
```

2. Add progress reporting using `sys.stderr`:

```python
def log_progress(msg):
    print(msg, file=sys.stderr)
```

3. For large FASTA files, consider processing sequences in chunks:

```python
def process_sequence_chunk(chunk, chunk_size=1024*1024):
    """Process sequence data in chunks to optimize memory usage"""
    for i in range(0, len(chunk), chunk_size):
        yield chunk[i:i + chunk_size]
```

harpy/bin/molecule_coverage.py (1)
`48-48`: LGTM! Great improvement to use Pythonic file iteration.

The change from `while True` loop to `for` loop is a good improvement that makes the code more idiomatic and efficient.

Consider using a more descriptive variable name than `fai` to improve code readability, for example:

```diff
-with open(args.fai, "r", encoding= "utf-8") as fai:
-    for line in fai:
+with open(args.fai, "r", encoding= "utf-8") as fasta_index:
+    for line in fasta_index:
```

harpy/snakefiles/simulate_variants.smk (1)
Line range hint `83-89`: LGTM! Consider adding explicit error handling.

The change from a `while True` loop to a `for` loop is a good improvement that makes the code more Pythonic and efficient.

Consider adding explicit error handling for potential I/O issues:

```diff
-            for line in in_vcf:
-                if line.startswith("#") or rng.uniform(0, 1) >= params.het:
-                    # write header lines and homozygous variants to both files
-                    hap1.write(line)
-                    hap2.write(line)
-                elif rng.random() < 0.5:
-                    hap1.write(line)
-                else:
-                    hap2.write(line)
+            try:
+                for line in in_vcf:
+                    if line.startswith("#") or rng.uniform(0, 1) >= params.het:
+                        # write header lines and homozygous variants to both files
+                        hap1.write(line)
+                        hap2.write(line)
+                    elif rng.random() < 0.5:
+                        hap1.write(line)
+                    else:
+                        hap2.write(line)
+            except IOError as e:
+                raise WorkflowError(f"Error processing VCF file: {e}")
```

harpy/snakefiles/simulate_snpindel.smk (1)
Line range hint `121-128`: LGTM! Consider extracting the random assignment logic.

The change from a while loop to a for loop improves readability while maintaining the same functionality. The random assignment of variants to haplotypes is correctly implemented.

Consider extracting the random assignment logic into a helper function to improve maintainability:

```diff
 rule diploid_snps:
     input:
         f"{outdir}/{outprefix}.snp.vcf"
     output:
         f"{outdir}/diploid/{outprefix}.snp.hap1.vcf",
         f"{outdir}/diploid/{outprefix}.snp.hap2.vcf"
     params:
         het = heterozygosity
     run:
+        def assign_variant(rng, heterozygosity):
+            """Helper function to determine if and where to assign a variant."""
+            if rng.uniform(0, 1) >= heterozygosity:
+                return "both"
+            return "hap1" if rng.random() < 0.5 else "hap2"
+
         rng = random.Random(randomseed) if randomseed else random.Random()
         with open(input[0], "r") as in_vcf, open(output[0], "w") as hap1, open(output[1], "w") as hap2:
             for line in in_vcf:
-                if line.startswith("#") or rng.uniform(0, 1) >= params.het:
-                    # write header lines and homozygous variants to both files
+                if line.startswith("#"):
                     hap1.write(line)
                     hap2.write(line)
-                elif rng.random() < 0.5:
-                    hap1.write(line)
-                else:
-                    hap2.write(line)
+                    continue
+
+                assignment = assign_variant(rng, params.het)
+                if assignment == "both":
+                    hap1.write(line)
+                    hap2.write(line)
+                elif assignment == "hap1":
+                    hap1.write(line)
+                else:
+                    hap2.write(line)
```

harpy/snakefiles/snp_mpileup.smk (1)
`38-41`: Add input validation for region file format.

The code assumes each line contains exactly three space-separated columns. Consider adding validation to handle malformed lines gracefully.

```diff
 for line in reg_in:
-    cont,startpos,endpos = line.split()
-    intervals.add(f"{cont}:{startpos}-{endpos}")
+    parts = line.strip().split()
+    if len(parts) != 3:
+        raise ValueError(f"Invalid region format in line: {line.strip()}")
+    cont, startpos, endpos = parts
+    try:
+        # Validate positions are integers
+        int(startpos), int(endpos)
+        intervals.add(f"{cont}:{startpos}-{endpos}")
+    except ValueError:
+        raise ValueError(f"Invalid position values in line: {line.strip()}")
```

harpy/bin/inline_to_haplotag.py (2)
`53-53`: Correct typo in error message

There's a typo in the word "captial"; it should be "capital" in the error message.

Apply this diff to fix the typo:

```diff
-    raise ValueError(f"Invalid barcode format: {barcode}. Barcodes must be captial letters and only contain standard nucleotide values ATCG.")
+    raise ValueError(f"Invalid barcode format: {barcode}. Barcodes must be capital letters and only contain standard nucleotide values ATCG.")
```
`86-87`: Use context managers when opening output files

To ensure proper resource management, use context managers (`with` statements) when opening the output files.

Apply this diff:

```diff
-fw_out = gzip.open(f"{args.prefix}.R1.fq.gz", "wb", 6)
-rv_out = gzip.open(f"{args.prefix}.R2.fq.gz", "wb", 6)
+with gzip.open(f"{args.prefix}.R1.fq.gz", "wb", 6) as fw_out, \
+     gzip.open(f"{args.prefix}.R2.fq.gz", "wb", 6) as rv_out:
+    # Process the input files within this block
+    with gzip.open(args.forward, "r") as fw_i, gzip.open(args.reverse, "r") as rv_i:
+        for fw_record, rv_record in zip(iter_fastq_records(fw_i), iter_fastq_records(rv_i)):
+            new_fw, new_rv = process_record(fw_record, rv_record, bc_dict, bc_generator)
+            fw_out.write(new_fw.encode("utf-8"))
+            rv_out.write(new_rv.encode("utf-8"))
```

🧰 Tools
🪛 Ruff
86-86: Use a context manager for opening files
(SIM115)
87-87: Use a context manager for opening files
(SIM115)
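For context, Ruff's SIM115 flags exactly the pattern the review comment above describes: a file handle opened without a context manager can leak if an exception occurs before `close()` runs. A minimal, self-contained illustration (using a temporary path, not harpy's actual output files):

```python
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "data.txt")

# Pattern flagged by SIM115: the handle leaks if an exception
# occurs between open() and close()
f = open(path, "w")
f.write("leaky")
f.close()

# Preferred: the context manager closes the file even if an error occurs
with open(path, "w") as f:
    f.write("safe")
```

The same reasoning applies to the gzip handles in `inline_to_haplotag.py`, since `gzip.open` returns a file object that supports the context-manager protocol.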
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
📒 Files selected for processing (9)
- `harpy/bin/inline_to_haplotag.py` (1 hunks)
- `harpy/bin/make_windows.py` (1 hunks)
- `harpy/bin/molecule_coverage.py` (1 hunks)
- `harpy/bin/parse_phaseblocks.py` (1 hunks)
- `harpy/simulate.py` (3 hunks)
- `harpy/snakefiles/simulate_snpindel.smk` (1 hunks)
- `harpy/snakefiles/simulate_variants.smk` (1 hunks)
- `harpy/snakefiles/snp_freebayes.smk` (1 hunks)
- `harpy/snakefiles/snp_mpileup.smk` (1 hunks)
🚧 Files skipped from review as they are similar to previous changes (1)
- harpy/simulate.py
🧰 Additional context used
🪛 Ruff
harpy/bin/inline_to_haplotag.py
86-86: Use a context manager for opening files
(SIM115)
87-87: Use a context manager for opening files
(SIM115)
🔇 Additional comments (3)
harpy/bin/parse_phaseblocks.py (1)
`26-26`: LGTM! More Pythonic file iteration

The change to use a `for` loop for file iteration is a good improvement. It's more idiomatic Python and eliminates the need for explicit EOF checking.
harpy/snakefiles/snp_mpileup.smk (1)
`38-41`: LGTM! Good simplification of the region parsing logic.
The change from a while loop to a for loop makes the code more Pythonic and easier to read.
harpy/bin/inline_to_haplotag.py (1)
`11-11`: Fix inconsistent program description
The description still refers to "inline linked read barcodes," which may not accurately reflect that the script handles generic inline barcodes. Consider updating it for clarity.
```python
for fw_record, rv_record in zip_longest(iter_fastq_records(fw_i), iter_fastq_records(rv_i)):
    new_fw, new_rv = process_record(fw_record, rv_record, bc_dict, bc_generator)
    fw_out.write(new_fw.encode("utf-8"))
    rv_out.write(new_rv.encode("utf-8"))
```
Handle mismatched number of records in paired FASTQ files

Using `zip_longest` can lead to `None` values if the files have different numbers of records, causing errors in `process_record`. Paired FASTQ files should have matching records.
Modify the code to use `zip` and check for mismatches:

```diff
 with gzip.open(args.forward, "r") as fw_i, gzip.open(args.reverse, "r") as rv_i:
-    for fw_record, rv_record in zip_longest(iter_fastq_records(fw_i), iter_fastq_records(rv_i)):
+    for fw_record, rv_record in zip(iter_fastq_records(fw_i), iter_fastq_records(rv_i)):
         new_fw, new_rv = process_record(fw_record, rv_record, bc_dict, haplotag_bc)
         fw_out.write(new_fw.encode("utf-8"))
         rv_out.write(new_rv.encode("utf-8"))
+    # Check for any remaining records
+    if next(iter_fastq_records(fw_i), None) or next(iter_fastq_records(rv_i), None):
+        raise ValueError("Mismatched number of records in paired FASTQ files.")
```
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```python
for fw_record, rv_record in zip(iter_fastq_records(fw_i), iter_fastq_records(rv_i)):
    new_fw, new_rv = process_record(fw_record, rv_record, bc_dict, bc_generator)
    fw_out.write(new_fw.encode("utf-8"))
    rv_out.write(new_rv.encode("utf-8"))
# Check for any remaining records
if next(iter_fastq_records(fw_i), None) or next(iter_fastq_records(rv_i), None):
    raise ValueError("Mismatched number of records in paired FASTQ files.")
```
This PR changes the default behavior for linked read simulation to use the standard set of haplotagging barcodes rather than the standard 10X set of barcodes.

- Adds a `haplotag_barcodes.py` script that's similar to `haplotag_acbd.py` but writes a single file to `stdout`
- `10XtoHaplotag.py` renamed to `inline_to_haplotag.py` because that's really what it's doing
- Adds an upper limit to `--molecules-per` to reflect the author note to not let this value exceed `4700`
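The core of the `haplotag_barcodes.py` approach described above (taking the Cartesian product of the four segment pools and streaming each joined barcode to stdout) can be sketched as follows. Note the two-sequence pools here are shortened placeholders for illustration, not the full per-segment sets the real script defines:

```python
from itertools import product

# Placeholder pools; the real script defines full sequence lists
# for each of the A, C, B, D haplotag segments
BX = {
    "A": ["ACGT", "AATT"],
    "C": ["CCGG", "CATG"],
    "B": ["GGTT", "GACT"],
    "D": ["TTAA", "TGCA"],
}

def haplotag_barcodes(bx):
    """Yield every A-C-B-D segment combination as one concatenated barcode."""
    for combo in product(bx["A"], bx["C"], bx["B"], bx["D"]):
        yield "".join(combo)

for bc in haplotag_barcodes(BX):
    print(bc)
```

Because `itertools.product` is lazy, the full barcode list is never held in memory; each combination is joined and written as it is generated, which matches the single-file-to-stdout design of the new script.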
Summary by CodeRabbit
Release Notes
New Features
Updates

- `linkedreads` command with improved barcode generation and input validation.

Bug Fixes