specify custom adapters to split_on_adapter #19

Open
MortenEneberg opened this issue Sep 5, 2022 · 14 comments

Comments

@MortenEneberg

MortenEneberg commented Sep 5, 2022

We see a large number of chimeras with no intermediate sequencing adapters when using the Ligation Sequencing Kit 110 (SQK-LSK110). We use the kit on multiplexed samples that carry 24-nt barcodes at each end. After barcoding (in an index PCR), the samples are pooled and used as input to the SQK-LSK110 kit. We have up to 24 different barcode sets.

Chimeric sequences have structures like:

5'-ADAPTER-Y-TOP-BARCODEX-PRIMER-SEQ-PRIMER-BARCODEX-BARCODEY-PRIMER-SEQ-PRIMER-BARCODEY-3'

where ADAPTER-Y-TOP is the ONT sequencing adapter, and BARCODEX and BARCODEY are the barcodes introduced by an index PCR targeting the PRIMER sites. The primer sites are ligated to the sequence of interest (SEQ). We have observed up to 14 barcode pairs in a single read.

We would like to split at sites where barcodes are adjacent to each other, so that the read above becomes:

5'-ADAPTER-Y-TOP-BARCODEX-PRIMER-SEQ-PRIMER-BARCODEX-3' and 5'-BARCODEY-PRIMER-SEQ-PRIMER-BARCODEY-3'
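
To make the requested behaviour concrete, here is a toy sketch that works on the schematic structure strings used in this issue, not on real nucleotide sequences. The helper name `split_structure` and the regular expression are illustrative assumptions and are not part of duplex_tools.

```python
import re
from typing import List

# Split only where one barcode label is directly followed by another.
# This operates on the schematic labels from this issue, not on nucleotides.
JUNCTION = re.compile(r"(?<=BARCODE[A-Z])-(?=BARCODE[A-Z])")


def split_structure(structure: str) -> List[str]:
    """Return the sub-read structures expected after splitting a chimera."""
    body = structure
    if body.startswith("5'-"):
        body = body[3:]
    if body.endswith("-3'"):
        body = body[:-3]
    return ["5'-" + part + "-3'" for part in JUNCTION.split(body)]


chimera = ("5'-ADAPTER-Y-TOP-BARCODEX-PRIMER-SEQ-PRIMER-BARCODEX-"
           "BARCODEY-PRIMER-SEQ-PRIMER-BARCODEY-3'")
for part in split_structure(chimera):
    print(part)
```

A read with no adjacent barcodes comes back unchanged as a single element, which matches the expectation that non-chimeric reads are not split.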

@onordesjo

Hi @MortenEneberg, could you try setting PCR primers here:

https://github.com/nanoporetech/duplex-tools/blob/master/duplex_tools/split_on_adapter.py#L36

then, run with the "PCR" setting like the command below:

duplex_tools split_on_adapter <fastq_directory> <output_directory> PCR

It may give you the following, in which case we would need to think about how to retain the tail barcode in the left read and the head barcode in the right read.

5'-ADAPTER-Y-TOP-BARCODEX-PRIMER-SEQ-PRIMER-3' and 5'-PRIMER-SEQ-PRIMER-BARCODEY-3'

Cheers

@MortenEneberg
Author

MortenEneberg commented Sep 6, 2022

Hi @onordesjo

So I tried with just a few reads where I knew the structure.

One was a read where nothing was supposed to happen, with a structure like:
5'-ADAPTER-Y-TOP-BARCODEX-PRIMER1-SEQ-PRIMER2-BARCODEX-3'
This read was not split.

The next one was a read with the following structure, which should be split into 3 reads, as I set --allow-multiple-splits:
5'-ADAPTER-Y-TOP-BARCODEX-PRIMER1-SEQ-PRIMER2-BARCODEX-BARCODEY-PRIMER1-SEQ-PRIMER2-BARCODEY-BARCODEZ-PRIMER1-SEQ-PRIMER2-BARCODEZ-3'

And this was split into 2 reads with the following structures:
Read 1: 5'-ADAPTER-Y-TOP-BARCODEX-PRIMER1-SEQ-PRIMER2-BARCODEX-BARCODEY-3'

Read 2: 5'-BARCODEY-BARCODEZ-PRIMER1-SEQ-PRIMER2-BARCODEZ-3'

meaning that the following was discarded:
5'-PRIMER1-SEQ-PRIMER2-3'

Interestingly, the beginning of read 2 contains the end of PRIMER2, meaning that not all of the primer was in the cut out part of the read.

I would have liked these 3 reads to be the output instead:
5'-ADAPTER-Y-TOP-BARCODEX-PRIMER1-SEQ-PRIMER2-BARCODEX-3'
5'-BARCODEY-PRIMER1-SEQ-PRIMER2-BARCODEY-3'
5'-BARCODEZ-PRIMER1-SEQ-PRIMER2-BARCODEZ-3'

I get the same splits when running without --allow-multiple-splits.

I used the following primers in the split_on_adapter.py file:

pcr_primers=(
            'ACACTCTTTCCCTACACGACGCTCTTCCGATCT',
            'AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC'),

I ran it on the following 4 reads, of which I have only described the first 2 here:
input_seq1.fastq.gz

With this command: duplex_tools split_on_adapter $input_seq $output_seq PCR --allow-multiple-splits

Cheers!

@onordesjo

Hi @MortenEneberg,

Getting there it seems!

If the last SEQ is rather short, it may be masked (the ends are masked to avoid accidentally splitting reads right at the end).

You could try to use the additional options --trim_end 10 --trim_start 10 (which would reduce trimming down to 10bp on either side).
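
As a simplified illustration of why end masking can hide a junction near the end of a read, the sketch below searches only the unmasked interior of a read using exact matching. The function name `find_in_unmasked`, the motif, and the value 200 are assumptions made for this example; the actual masking logic and defaults in duplex_tools may differ.

```python
def find_in_unmasked(read, target, trim_start, trim_end):
    """Return the start of `target` in `read`, ignoring masked ends, or -1."""
    search_end = len(read) - trim_end
    if search_end <= trim_start:
        return -1  # the whole read is masked
    # str.find only reports hits that lie fully inside [trim_start, search_end)
    return read.find(target, trim_start, search_end)


# Toy read: the motif of interest sits 50 bases before the end of the read.
read = "A" * 500 + "GATTACA" + "C" * 50

# With wide masking, the motif falls inside the masked tail and is not found.
print(find_in_unmasked(read, "GATTACA", trim_start=200, trim_end=200))
# With 10-base masking, the motif is inside the searched region.
print(find_in_unmasked(read, "GATTACA", trim_start=10, trim_end=10))
```

The first call reports no hit and the second reports position 500, so a short final SEQ can push the last junction into the masked region.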

@MortenEneberg
Author

MortenEneberg commented Sep 6, 2022

Hi @onordesjo,

I just updated the splits for the second example (7/9 at 10 AM); I had made a small error.

I tried your suggested settings.

The read where nothing was supposed to happen, with a structure like:
5'-ADAPTER-Y-TOP-BARCODEX-PRIMER1-SEQ-PRIMER2-BARCODEX-3'
now splits into:
5'-ADAPTER-Y-TOP-BARCODEX-PRIMER1-3'
and
5'-BARCODEX-3'

thus discarding: 5'-SEQ-PRIMER2-3' ...
As with the examples in the previous comment, small remnants of the adapters are left on each side of the cut. The tool does not identify the whole sequence, which would be crucial for the subsequent demultiplexing once the cut is placed correctly.

The next one was a read with the following structure, supposed to be split into 3 reads:
5'-ADAPTER-Y-TOP-BARCODEX-PRIMER1-SEQ-PRIMER2-BARCODEX-BARCODEY-PRIMER1-SEQ-PRIMER2-BARCODEY-BARCODEZ-PRIMER1-SEQ-PRIMER2-BARCODEZ-3'

With the new settings, this was split into 2 reads with the following structures:
Read 1: 5'-ADAPTER-Y-TOP-BARCODEX-PRIMER1-SEQ-PRIMER2-BARCODEX-BARCODEY-PRIMER1-SEQ-3'
and
Read 2: 5'-SEQ-PRIMER2-BARCODEZ-3'

meaning that the following was discarded:
5'-PRIMER2-BARCODEY-BARCODEZ-PRIMER1-3'

Do you have any idea how to solve this?

Cheers!

@MortenEneberg
Author

Hi @onordesjo,

I appreciate your help!

Did you have a chance to look at it yet?

Kind regards

@onordesjo

Hi @MortenEneberg, sorry, I don't have much bandwidth to look at this at the moment.

Would you be able to use a debugger to step through split_on_adapter and see where the decisions are made? I'd suggest starting on this line, which is where all results are found for matches against the subsequence:

https://github.com/nanoporetech/duplex-tools/blob/master/duplex_tools/split_on_adapter.py#L142

@MortenEneberg
Author

Hi @onordesjo,

Have you had the time to give it a look?

Kind regards,
Morten

@onordesjo

Hi @MortenEneberg,

I have started to look at it, but will probably need to add the barcode sequences to this plot to get a better view of what should be written out. I've added imperfect matches to both of your primer sequences (using seqkit locate), but it's not entirely clear yet.
image

@MortenEneberg
Author

Hi @onordesjo,

Thank you for looking into it!

I have attached the barcodes here, along with the sequencing adapter: Single_barcodes_rev_for.txt

Note that the attached file lists the primer sequences. When reading one strand, the barcodes at the 5' and 3' ends will be the same.

Kind regards,
Morten

@onordesjo

Thanks Morten!

I'll add those in and try to get it straightened out. My feeling is that it'll be easiest in this use case to use a standalone tool (since the front adapter is not actually expected to be in the middle).

By the way, note that the targets being matched are the ones linked below. I forgot to point that out previously, but it is obviously relevant if you're not actually expecting part of the adapter to be between the primers:
https://github.com/nanoporetech/duplex-tools/blob/master/duplex_tools/split_on_adapter.py#L58

So basically what's being searched for is <primer-x-rc><adapter-rc><variable-length-N><adapter><primer-y>, where primer-x and primer-y may be same or different.

        'PCR': [
            (rev_comp(x)
                + tail_adapter[:len(tail_adapter) - n_bases_to_mask_tail]
                + middle_sequence + head_adapter[n_bases_to_mask_head:] + y)
            for x in pcr_primers for y in pcr_primers]
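
To see what these targets look like, the sketch below evaluates the same expression on its own. The adapter strings, `middle_sequence`, and the mask lengths are short PLACEHOLDERS chosen for readability (they are not the real ONT adapter sequences or the tool's defaults), and `rev_comp` is a local stand-in for the helper used in split_on_adapter.py. The primers are the two posted earlier in this thread.

```python
from itertools import product


def rev_comp(seq):
    """Reverse complement of a DNA string (local helper for this sketch)."""
    return seq.translate(str.maketrans("ACGTN", "TGCAN"))[::-1]


# PLACEHOLDER values, chosen only to make the target structure visible.
tail_adapter = "TTTTAAAA"
head_adapter = "CCCCGGGG"
middle_sequence = "NNNN"
n_bases_to_mask_tail = 2
n_bases_to_mask_head = 2

# Primers as posted earlier in this issue.
pcr_primers = (
    "ACACTCTTTCCCTACACGACGCTCTTCCGATCT",
    "AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC",
)

# Same construction as the 'PCR' entry: one target per ordered primer pair.
pcr_targets = [
    rev_comp(x)
    + tail_adapter[:len(tail_adapter) - n_bases_to_mask_tail]
    + middle_sequence
    + head_adapter[n_bases_to_mask_head:]
    + y
    for x, y in product(pcr_primers, repeat=2)
]

for target in pcr_targets:
    print(len(target), target)
```

Every target carries adapter sequence between the two primers, so a junction where barcodes sit directly next to each other, with no adapter in between, would not be expected to match these targets well.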

@onordesjo

In principle, you should have better luck with something like this, which looks for two barcodes directly next to each other:

        'BARCODES': [
            (rev_comp(x) + y)
            for x in barcodes for y in barcodes]
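
A standalone sketch of that target list, assuming the barcodes are given in the orientation in which they appear at the 5' end of a read. The three 24-nt sequences are made-up PLACEHOLDERS rather than the barcodes from the attached file, and `rev_comp` is a local stand-in for the helper in split_on_adapter.py.

```python
def rev_comp(seq):
    """Reverse complement of a DNA string (local helper for this sketch)."""
    return seq.translate(str.maketrans("ACGTN", "TGCAN"))[::-1]


# PLACEHOLDER 24-nt barcodes; substitute the real ones from your barcode file.
barcodes = {
    "BARCODEX": "AACCACACCAACCCAAACACCACA",
    "BARCODEY": "CACAACCAACCCACAACCAACACC",
    "BARCODEZ": "CCCAACACAACACCCACACACAAC",
}

# One junction target per ordered pair: the tail barcode of the left read
# (reverse-complemented) followed directly by the head barcode of the right read.
junction_targets = {
    (x_name, y_name): rev_comp(x_seq) + y_seq
    for x_name, x_seq in barcodes.items()
    for y_name, y_seq in barcodes.items()
}

# n barcodes give n * n targets: 9 here, and 576 for a full set of 24.
print(len(junction_targets))
print(junction_targets[("BARCODEX", "BARCODEY")])
```

Keying the targets by barcode names makes it possible to report which pair was found at each junction, which may help when checking the splits against known read structures.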

I've marked up the regions (red bars right under the tick marks) that I understand you'd want written out as separate reads, is this correct?

image

@MortenEneberg
Author

Dear @onordesjo ,

Yes it looks correct! I have attached a paint image (not pretty..) just to make sure we are on the same page :)

Thank you!

Morten

@MortenEneberg
Author

Dear @onordesjo,

Thanks for your help! Did you have a chance to look at it yet?

Kind regards,
Morten

@onordesjo

Hi Morten!

Sorry, I wasn't clear in the last message. I don't think it's something we're planning to support since it's a rather special use case.

You could definitely give it a go and replace:

        'PCR': [
            (rev_comp(x)
                + tail_adapter[:len(tail_adapter) - n_bases_to_mask_tail]
                + middle_sequence + head_adapter[n_bases_to_mask_head:] + y)
            for x in pcr_primers for y in pcr_primers]

with:

        'BARCODES': [
            (rev_comp(x) + y)
            for x in barcodes for y in barcodes]

and see if you get the right matches then.
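
For anyone who wants to try the standalone route instead of patching duplex_tools, here is a minimal pure-Python sketch of the idea: look for rev_comp(x) + y with a small edit-distance tolerance and cut between the two barcodes. Everything in it (the function names, `max_dist`, the toy barcodes and inserts) is an assumption made for illustration. It is not how duplex_tools matches targets internally, and it is far too slow for real datasets.

```python
def rev_comp(seq):
    """Reverse complement of a DNA string (local helper for this sketch)."""
    return seq.translate(str.maketrans("ACGTN", "TGCAN"))[::-1]


def fuzzy_find(target, read, max_dist):
    """Best approximate match of `target` inside `read`.

    Returns (edit_distance, start, end) or None if nothing is within max_dist.
    """
    m = len(target)
    # prev[j] = (cost, start) for aligning target[:j] so it ends at this read position
    prev = [(j, 0) for j in range(m + 1)]
    best = None
    for i, base in enumerate(read, start=1):
        cur = [(0, i)]  # a match may start anywhere in the read at no cost
        for j in range(1, m + 1):
            sub = (prev[j - 1][0] + (base != target[j - 1]), prev[j - 1][1])
            extra = (prev[j][0] + 1, prev[j][1])        # extra base in the read
            missing = (cur[j - 1][0] + 1, cur[j - 1][1])  # base missing from the read
            cur.append(min(sub, extra, missing))
        if cur[m][0] <= max_dist and (best is None or cur[m][0] < best[0]):
            best = (cur[m][0], cur[m][1], i)
        prev = cur
    return best


def split_at_junctions(read, barcodes, max_dist=4):
    """Recursively cut `read` between adjacent barcodes (rev_comp(x) then y)."""
    best = None
    for x in barcodes:
        for y in barcodes:
            hit = fuzzy_find(rev_comp(x) + y, read, max_dist)
            if hit and (best is None or hit[0] < best[0]):
                best = hit + (len(y),)
    if best is None:
        return [read]
    _, _, end, y_len = best
    cut = end - y_len  # approximate if the match contains indels inside y
    return (split_at_junctions(read[:cut], barcodes, max_dist)
            + split_at_junctions(read[cut:], barcodes, max_dist))


# Toy data: PLACEHOLDER barcodes and short stand-ins for PRIMER-SEQ-PRIMER.
bx = "AACCACACCAACCCAAACACCACA"
by = "CACAACCAACCCACAACCAACACC"
bz = "CCCAACACAACACCCACACACAAC"
inserts = ["GATTACAGGCTTAGCATCGA", "TTGACCGTAGGCTAATCGGT", "CGTAGCTAGGATCCTGATCG"]
expected = [b + ins + rev_comp(b) for b, ins in zip((bx, by, bz), inserts)]
chimera = "".join(expected)

pieces = split_at_junctions(chimera, [bx, by, bz])
print(len(pieces), pieces == expected)
```

On this toy chimera the sketch returns the three barcode-flanked reads, each keeping both of its own barcodes. On real data, a library such as edlib or parasail would be a more practical choice for the approximate matching step.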

Again, sorry for not being clear and for not being able to put more resources into this!
