Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Input file shifts inconsistent (Arabidopsis) #197

Open
hermandebeukelaer opened this issue Jun 25, 2024 · 4 comments
Open

Input file shifts inconsistent (Arabidopsis) #197

hermandebeukelaer opened this issue Jun 25, 2024 · 4 comments

Comments

@hermandebeukelaer
Copy link

I am trying to train ChromBPNet on Arabidopsis ATAC-seq data, but run into the error "Input file shifts inconsistent" as in #153, #174 and #176. Having read through these other issues, I was able to generate the PWM for my BAM file:

atac_no_shift

The raw fastq data was processed using the nf-core ATAC-seq pipeline with Bowtie2 as the chosen aligner. To train ChromBPNet I took the merged BAM file containing the alignments of all replicates (found under bowtie2/merged_replicate/ among the output of the Nextflow pipeline).

My command used for training (I am running on a computing cluster with Apptainer):

apptainer exec  --nv -e \
                --no-mount /scratch \
                --bind data:/data \
                --bind bias_model:/bias_model \
                --bind potter-kc:/potter \
                --bind out:/output \
                chrombpnet.sif \
                chrombpnet pipeline \
                -ibam /potter/CONTROL.mRp.clN.sorted.bam \
                -d  ATAC \
                -g  /potter/ath.fasta \
                -c  /potter/ath.chrom.sizes.txt \
                -p  /data/peaks_no_blacklist.bed \
                -n  /data/background_negatives.bed \
                -fl /data/train1-3_val4_test5.json \
                -b  /bias_model/ENCSR868FGK_bias_fold_0.h5 \
                -o  /output

This worked fine when using the data from the tutorial, but yields the inconsistent shift error on my own data. To my knowledge, the nf-core ATAC-seq pipeline does not apply any shifts, at least I did not find it in the documentation.

The commands I used to generate the above PWM from my BAM file:

samtools view -b /potter/CONTROL.mRp.clN.sorted.bam Chr1 > /out/out.bam
samtools view -b -@8 /out/out.bam | bedtools bamtobed -i stdin | awk -v OFS="\t" '{if ($6=="-"){print $1,$2,$3,$4,$5,$6} else if ($6=="+") {print $1,$2,$3,$4,$5,$6}}' | bedtools genomecov -bg -5 -i stdin -g /potter/ath.chrom.sizes.txt | bedtools sort -i stdin > /out/tmp2
bedGraphToBigWig /out/tmp2 /potter/ath.chrom.sizes.txt /out/unstranded.bw
python build_pwm_from_bigwig.py -i /out/unstranded.bw -g /potter/ath.fasta -op /out/atac_no_shift -cr Chr1 -c /potter/ath.chrom.sizes.txt

Do you have any suggestions on how to proceed? I don't have sufficient background on Tn5 bias to judge if the PWM looks as expected. Are there any references you suggest to learn more about Tn5 bias? I am wondering if the issue could be due to specific bias in Arabidopsis or plants in general, compared to the human genome.

@panushri25
Copy link
Collaborator

Can you generate the PWM on the plus and minus strands separately?

@hermandebeukelaer
Copy link
Author

Dear @panushri25,

I have generated the logos for both strands separately.

Positive strand

atac_no_shift_pos

Code used to generate logo:

samtools view -b /potter/CONTROL.mRp.clN.sorted.bam Chr1 > /out/out.bam
samtools view -b -@8 /out/out.bam | bedtools bamtobed -i stdin | awk -v OFS="\t" '$6=="+"' | bedtools genomecov -bg -5 -i stdin -g /potter/ath.chrom.sizes.txt | bedtools sort -i stdin > /out/pos.bedGraph
bedGraphToBigWig /out/pos.bedGraph /potter/ath.chrom.sizes.txt /out/pos.bw
python build_pwm_from_bigwig.py -i /out/pos.bw -g /potter/ath.fasta -op /out/atac_no_shift_pos -cr Chr1 -c /potter/ath.chrom.sizes.txt

Negative strand

atac_no_shift_neg

Code used to generate logo:

samtools view -b /potter/CONTROL.mRp.clN.sorted.bam Chr1 > /out/out.bam
samtools view -b -@8 /out/out.bam | bedtools bamtobed -i stdin | awk -v OFS="\t" '$6=="-"' | bedtools genomecov -bg -5 -i stdin -g /potter/ath.chrom.sizes.txt | bedtools sort -i stdin > /out/neg.bedGraph
bedGraphToBigWig /out/neg.bedGraph /potter/ath.chrom.sizes.txt /out/neg.bw
python /scratch/chrombpnet/chrombpnet/helpers/preprocessing/analysis/build_pwm_from_bigwig.py -i /out/neg.bw -g /potter/ath.fasta -op /out/atac_no_shift_neg -cr Chr1 -c /potter/ath.chrom.sizes.txt

Does this give you any hint on what might be the issue?

@panushri25
Copy link
Collaborator

The strands have some sort of unconventional shifts on them, the ATAC-seq processing pipeline you are using might be introducing these shifts can you verify this and re-run the pipeline so that it does not perform any shifts on the positive and negative reads

@hermandebeukelaer
Copy link
Author

We are using nf-core ATAC-seq to process the raw fastq files. I didn't find any information about shifts in the documentation, so I assumed no shifts were applied. Do you have experience with nf-core ATAC-seq?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants