NUL (\x00, ^@) and other control characters in output #107

Open
Kaddea opened this issue Jun 27, 2024 · 4 comments

Comments

@Kaddea

Kaddea commented Jun 27, 2024

Hi,

I've been using the bam_readcount wrapper "mgibio/bam_readcount_helper-cwl". The output files (snv or indel) contain control characters that the vcf_readcount_annotator cannot process.

Which substitution for the control characters would be suitable for further processing?

Variation (vcf)
20 405939 . TTTC T . weak_evidence AS_FilterStatus=weak_evidence;AS_SB_TABLE=0,0|0,0;DP=1;ECNT=1;GERMQ=23;MBQ=0,32;MFRL=0,204;MMQ=60,60;MPOS=43;POPAF=7.3;TLOD=4.21;CSQ=-|upstream_gene_variant|MODIFIER|RBCK1|ENSG00000125826|Transcript|ENST00000356286.10|protein_coding|||||||||||2357|1||HGNC|HGNC:15864|1||| GT:AD:AF:DP:F1R2:F2R1:FAD:SB 0/1:0,1:0.667:1:0,1:0,0:0,1:0,0,1,0

bam_readcount output (indel)
20 405940 N 1 =:0:0.00:0.00:0.00:0:0:0.00:0.00:0.00:0:0.00:0.00:0.00 A:0:0.00:0.00:0.00:0:0:0.00:0.00:0.00:0:0.00:0.00:0.00 C:0:0.00:0.00:0.00:0:0:0.00:0.00:0.00:0:0.00:0.00:0.00 G:0:0.00:0.00:0.00:0:0:0.00:0.00:0.00:0:0.00:0.00:0.00 T:0:0.00:0.00:0.00:0:0:0.00:0.00:0.00:0:0.00:0.00:0.00 N:0:0.00:0.00:0.00:0:0:0.00:0.00:0.00:0:0.00:0.00:0.00 -^@^@^@:1:255.00:0.00:0.00:1:0:0.88:0.03:0.00:1:0.42:101.00:0.42
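To narrow down where the control characters occur, a small scan of the read-count file can help. The following is a minimal sketch, not part of bam-readcount or the annotator: it assumes the output is tab-separated (the lines shown here are whitespace-flattened), and the function name `find_control_chars` is made up for illustration. In the single example line above, the three NUL bytes sit in the deletion label of the indel entry and coincide with the 3-base deletion TTTC>T; whether that holds in general is not established here.

```python
import re

# Control characters other than tab, line feed and carriage return.
CONTROL_RE = re.compile(r"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f]")


def find_control_chars(lines):
    """Yield (line_number, column_number, field) for every tab-separated
    field that contains a control character. Numbers are 1-based."""
    for line_no, line in enumerate(lines, start=1):
        fields = line.rstrip("\r\n").split("\t")
        for col_no, field in enumerate(fields, start=1):
            if CONTROL_RE.search(field):
                yield line_no, col_no, field


# Shortened, hypothetical stand-in for the line shown above (tab-separated).
sample = ["20\t405940\tN\t1\t=:0\tA:0\tC:0\tG:0\tT:0\tN:0\t-\x00\x00\x00:1:255.00\n"]

for line_no, col_no, field in find_control_chars(sample):
    # repr() makes the otherwise invisible bytes visible, e.g. '\x00'.
    print(line_no, col_no, repr(field))
```

For a real file, passing an open file handle (for example opened with `encoding="latin-1"` so that arbitrary bytes decode without errors) instead of `sample` should work the same way.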

@chrisamiller
Collaborator

Weird. We have processed lots of bams through this type of workflow and I've never seen anything like that. Happy to take a look though. Can you provide a tiny example bam with the steps needed to recreate the problem?

@Kaddea
Author

Kaddea commented Jul 8, 2024

Thanks for your help!!
I've cropped one of the bam files and the corresponding vcf file (both from RNAseq reads) to reproduce the readcount output files.
The strange characters in the output files appear only from column 11 on, and it seems only at sites with varying deletions (2-5 bases).
The files (bam, vep-annotated vcf and the snv/indel tsv) can be downloaded from
https://kaddea.com/s/J76BAJsg4d5zytN (approx. 45 MB)
Sequence alignment and variant analysis based on Ensembl GRCh38, release 110.

Best,
Mathias

@chrisamiller
Collaborator

Thank you. Can you also provide the exact commands that were used, along with software versions, etc - just trying to reproduce it on our end here.

@Kaddea
Author

Kaddea commented Jul 31, 2024

read_count_pipeline.txt
Hmmm ... the attached file indicates the steps for alignment, variant calling, annotation and preparation for the read counts. I've omitted the mandatory parameters (like input/output, etc.). Hope it helps ...
Btw, truncating the read-count output files to the first 10 columns makes it possible to proceed with the VCF annotation, but I'm not sure about the validity of the resulting files ...
Mathias
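The truncation workaround can be sketched as follows. This is only an illustration under the assumption of tab-separated output; `truncate_columns` is a hypothetical helper, not a bam-readcount or annotator function. Note that in the example line from the first comment, column 11 is the deletion entry itself, so this approach drops the indel counts on such lines rather than repairing them.

```python
def truncate_columns(line, keep=10):
    """Return the line reduced to its first `keep` tab-separated columns."""
    fields = line.rstrip("\r\n").split("\t")
    return "\t".join(fields[:keep]) + "\n"


def truncate_lines(lines, keep=10):
    """Apply truncate_columns to every line of an iterable of lines."""
    return [truncate_columns(line, keep) for line in lines]
```

Lines with 10 or fewer columns pass through unchanged, so only the rows carrying extra indel entries are affected.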
