Smart-seq2? #94

Open
itslittman opened this issue Dec 13, 2024 · 2 comments

Comments

@itslittman

itslittman commented Dec 13, 2024

I'm a little confused about how this pipeline works:

  1. Why does the preprocessing separate each input BAM into chromosomes? I'm trying to call variants genome-wide.
  2. Why does it need to scan read sequences for cell barcodes? I can see the utility of this for 10x, but I have Smart-seq data (demultiplexed BAMs). The germline calling produced a single VCF file containing chr20 variants separated by BAM file/cell, but the somatic step doesn't work the same way; I'm guessing it wants a bulk file?

I know this was only benchmarked for 10x, but the publication mentions it should work with Smart-seq as well. Let me know if there's anything I can do to make this run efficiently with my demultiplexed BAMs.

@ZiyiWang7

Hi @itslittman,

Thank you for your interest in our package!

  1. In the germline variant calling step, imputation is performed on the raw calls, so it's more efficient to process the data by chromosome. If you want genome-wide variant calls, you can merge the per-chromosome VCF files after processing.
  2. Germline variant calling is done via pseudobulk analysis, followed by imputation using the 1KG3 reference. However, in the somatic variant calling step, we aim to recover mutations at the single-cell level. This requires scanning the read sequences to identify and analyze cell barcodes.

Please let me know if you have any additional questions or need further clarification about the process!
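
For anyone wanting the genome-wide merge mentioned in point 1, here is a minimal sketch. It assumes `bcftools` is installed and that every per-chromosome VCF contains the same samples in the same order. The file names (`chr20.phased.vcf.gz` and so on) and the helper names are hypothetical and not taken from the Monopogen documentation. The script only builds the `bcftools concat` command, so you can inspect it before running it.

```python
import re
from pathlib import Path


def chrom_sort_key(path):
    """Sort chr1..chr22 numerically, then named contigs (chrX, chrY, ...) alphabetically."""
    match = re.match(r"chr([0-9]+|[A-Za-z]+)", Path(path).name)
    label = match.group(1) if match else Path(path).name
    if label.isdigit():
        return (0, int(label), "")
    return (1, 0, label)


def build_concat_command(vcf_paths, output):
    """Return the argument list for `bcftools concat` over per-chromosome VCFs."""
    ordered = sorted(vcf_paths, key=chrom_sort_key)
    # -O z writes bgzip-compressed VCF; -o names the output file.
    return ["bcftools", "concat", "-O", "z", "-o", str(output), *map(str, ordered)]


if __name__ == "__main__":
    # Hypothetical file names; substitute whatever the germline step wrote.
    files = ["out/chr2.phased.vcf.gz", "out/chr10.phased.vcf.gz", "out/chr1.phased.vcf.gz"]
    print(" ".join(build_concat_command(files, "merged.vcf.gz")))
```

To actually run the merge, pass the returned list to `subprocess.run(cmd, check=True)`. The sort puts `chr2` before `chr10`, which plain alphabetical sorting would not do. If the contig order in your VCF headers differs, adjust the ordering to match.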

@itslittman
Author

Hi @ZiyiWang7
Thanks for the reply! I understand the need to do that for 10x Genomics data, but I have Smart-seq2 data. I demultiplexed the FASTQ files before alignment, so each cell already has its own BAM and the barcodes have already been separated.

I understand I could theoretically merge the BAMs back into one bulk file and let Monopogen re-parse the barcodes read by read, but that seems like a massive waste of resources: the separation has already been done, and a merged bulk BAM would double the storage I'd have to allocate to this project. Is there a way to streamline this for use with demultiplexed BAMs?
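
For reference, the merge workaround described above would need each read to carry its cell identity before the BAMs are combined. Below is a minimal sketch of that tagging step on SAM text, for example as streamed from `samtools view -h`. It appends the `CB:Z:` optional field, which the SAM tags specification reserves for the cell identifier. Whether Monopogen's somatic step accepts reads tagged this way is an assumption that this thread does not confirm. The function names and the idea of deriving the cell ID from the BAM file name are hypothetical.

```python
def tag_sam_line(line, cell_id):
    """Append a CB:Z:<cell_id> tag to one SAM alignment line; pass header lines through."""
    if line.startswith("@"):
        return line  # header lines carry no per-read tags
    newline = "\n" if line.endswith("\n") else ""
    body = line.rstrip("\n")
    fields = body.split("\t")
    # Optional fields start at column 12; leave reads that already have a CB tag alone.
    if any(field.startswith("CB:Z:") for field in fields[11:]):
        return line
    return body + "\tCB:Z:" + cell_id + newline


def tag_stream(lines, cell_id):
    """Lazily tag every line of a SAM text stream with the same cell ID."""
    for line in lines:
        yield tag_sam_line(line, cell_id)
```

Because the lines are processed lazily, the tagged output can be piped straight back into a BAM writer without storing an intermediate file per cell. The merged BAM itself would still take the extra storage noted above.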
