Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

htseq-count test data error #32

Open
arsenij-ust opened this issue Nov 9, 2022 · 3 comments
Open

htseq-count test data error #32

arsenij-ust opened this issue Nov 9, 2022 · 3 comments

Comments

@arsenij-ust
Copy link

Hi @zhxiaokang

i run into a problem when i execute the pipeline with the paired-end test data using htseq-count. There appears the following warning and finally the Error caused by missing file:

        count   jobs
        1       featureCount
        1
17933 GFF lines processed.

Warning: Read SRR1039509.3494278 claims to have an aligned mate which could not be found in an adjacent line.
Warning: 81048 reads with missing mate encountered.

89918 SAM alignment pairs processed.
MissingOutputException in line 99 of /data/tools/RASflow/workflow/align_count_genome.rules:
Missing files after 5 seconds:
output/test/genome/countFile/SRR1039509_count.tsv.summary
This might be due to filesystem latency. If that is the case, consider to increase the wait time with --latency-wait.
Exiting because a job execution failed. Look above for error message
Shutting down, this might take some time.
Exiting because a job execution failed. Look above for error message
Complete log: /data/tools/RASflow/.snakemake/log/2022-11-09T132244.600008.snakemake.log
DEA is not required and RASflow is done!

The problem is, that the file SRR1039509_count.tsv.summary is missing but I don't know if the warning of htseq-count is related to this problem.

When I execute the isolated command

htseq-count -f bam -i gene_id -s no -t exon output/test/test/genome/bamFileSort/SRR1039509.sort.bam data/example/ref/annotation/Homo_sapiens.GRCh38.93.1.1.10M.gtf | sed '/^__/ d' > output/test/genome/countFile/SRR1039509_count.tsv
only the SRR1039509_count.tsv file is created.

With kind regards
Arsenij

@zhxiaokang
Copy link
Owner

Hi Arsenij,

My suspicion is that: at some point, I added htseq-count but didn't test the whole workflow properly. featurecounts would output a summary file but not htseq-count. If you do not insist on summary report, the last rule for report can be removed. Therefore you can replace the align_count_genome.rules with the following content:

import pandas as pd
configfile: "configs/config_main.yaml"

samples = pd.read_csv(config["METAFILE"], sep = '\t', header = 0)['sample']
trimmed = config["TRIMMED"]
if trimmed:
    input_path = config["OUTPUTPATH"] + "/" + config["PROJECT"] + "/trim"
else:
    input_path = config["READSPATH"]
key = config["KEY"]
end = config["END"]
intermediate_path = config["OUTPUTPATH"] + "/" + config["PROJECT"] + "/genome"
final_path = config["FINALOUTPUT"] + "/" + config["PROJECT"] + "/genome"
alignmentQC = config["alignmentQC"]

rule end:
    input:
        count = expand(final_path + "/countFile/{sample}_count.tsv", sample = samples)

if end == "pair":
    rule getReads:
        output:
            forward = temp(intermediate_path + "/reads/{sample}_forward.fastq.gz"),
            reverse = temp(intermediate_path + "/reads/{sample}_reverse.fastq.gz")
        params:
            key = key,
            input_path = input_path
        run:
            shell("scp -i {params.key} {params.input_path}/{wildcards.sample}_*R1*.f*q.gz {output.forward}"),
            shell("scp -i {params.key} {params.input_path}/{wildcards.sample}_*R2*.f*q.gz {output.reverse}")

else:
    rule getReads:
        output:
            read = temp(intermediate_path + "/reads/{sample}.fastq.gz")
        params:
            key = key,
            input_path = input_path
        shell:
            """
            shopt -s extglob
            scp -i {params.key} {params.input_path}/{wildcards.sample}?(_*)?(.*).f*q.gz {output.read}
            """

rule indexGenome:
    input:
        genome = config["GENOME"]
    output:
        indexes = directory(intermediate_path + "/indexes"),
        splicesites = intermediate_path + "/splicesites.txt"
    params:
        index = intermediate_path + "/indexes/index"
    shell:
        "mkdir {output.indexes} && hisat2-build -p {config[NCORE]} {input.genome} {params.index}"
        "&& hisat2_extract_splice_sites.py {config[ANNOTATION]} > {output.splicesites}"

if end == "pair":
    rule alignment:
        input:
            indexes = directory(intermediate_path + "/indexes"),
            splicesites = intermediate_path + "/splicesites.txt",
            forward = intermediate_path + "/reads/{sample}_forward.fastq.gz",
            reverse = intermediate_path + "/reads/{sample}_reverse.fastq.gz"
        output:
            sam = temp(intermediate_path + "/samFile/{sample}.sam"),
            bam = temp(intermediate_path + "/bamFile/{sample}.bam")
        params:
            index = intermediate_path + "/indexes/index"
        benchmark:
            intermediate_path + "/benchmarks/{sample}.hisat2.benchmark.txt"
        run:
            shell("hisat2 -p {config[NCORE]} --known-splicesite-infile {input.splicesites} -x {params.index} -1 {input.forward} -2 {input.reverse} -S {output.sam}")
            shell("samtools view -@ {config[NCORE]} -b -S {output.sam} > {output.bam}")
else:
    rule alignment:
        input:
            indexes = directory(intermediate_path + "/indexes"),
            splicesites = intermediate_path + "/splicesites.txt",
            forward = intermediate_path + "/reads/{sample}.fastq.gz"
        output:
            sam = temp(intermediate_path + "/samFile/{sample}.sam"),
            bam = temp(intermediate_path + "/bamFile/{sample}.bam")
        params:
            index = intermediate_path + "/indexes/index"
        benchmark:
            intermediate_path + "/benchmarks/{sample}.hisat2.benchmark.txt"
        run:
            shell("hisat2 -p {config[NCORE]} --known-splicesite-infile {input.splicesites} -x {params.index} -U {input.forward} -S {output.sam}")
            shell("samtools view -@ {config[NCORE]} -b -S {output.sam} > {output.bam}")

rule sortBAM:
    input:
        bam = intermediate_path + "/bamFile/{sample}.bam"
    output:
        sort = intermediate_path + "/bamFileSort/{sample}.sort.bam"
    shell:
        "samtools sort -@ {config[NCORE]} {input.bam} -o {output.sort}"

rule featureCount:
    input:
        sort = intermediate_path + "/bamFileSort/{sample}.sort.bam",
        annotation = config["ANNOTATION"]
    output:
        count = final_path + "/countFile/{sample}_count.tsv"
    run:
        if config["COUNTER"]=="featureCounts":
            if config["END"]=="pair":
                shell("featureCounts -p -T {config[NCORE]} -t exon -g {config[ATTRIBUTE]} -a {input.annotation} -o {output.count} {input.sort} && tail -n +3 {output.count} | cut -f1,7 > temp.{wildcards.sample} && mv temp.{wildcards.sample} {output.count}")
            else:
                shell("featureCounts -T {config[NCORE]} -t exon -g {config[ATTRIBUTE]} -a {input.annotation} -o {output.count} {input.sort} && tail -n +3 {output.count} | cut -f1,7 > temp.{wildcards.sample} && mv temp.{wildcards.sample} {output.count}")
        elif config["COUNTER"]=="htseq-count":
            shell("htseq-count -f bam -i {config[ATTRIBUTE]} -s no -t exon {input.sort} {input.annotation} | sed '/^__/ d' > {output.count}")

Let me know if that does not work for you.

@arsenij-ust
Copy link
Author

Hi @zhxiaokang,
thanks for the fast reply and the provided solution. This is exactly how I fixed the problem. But it's of course only a temporary solution.

With kind regards
Arsenij

@zhxiaokang
Copy link
Owner

Hi Arsenij, sure, that's only temporary. Will fix it when I have time for this. Will keep this issue open and close it when it's fixed, in a good way.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants