specify custom adapters to split_on_adapter #19

MortenEneberg · 2022-09-05T12:42:18Z

We see a large number of chimeric reads with no intermediate sequencing adapters when using the Ligation Sequencing Kit 110 (SQK-LSK110). We use the kit on multiplexed samples that carry 24-nt barcodes at each end. After barcoding (in an index PCR), the samples are pooled and used as input to the SQK-LSK110 kit. We have up to 24 different barcode sets.

Chimeric sequences have structures like:

5'-ADAPTER-Y-TOP-BARCODEX-PRIMER-SEQ-PRIMER-BARCODEX-BARCODEY-PRIMER-SEQ-PRIMER-BARCODEY-3'

Here, ADAPTER-Y-TOP is the ONT sequencing adapter, and BARCODEX and BARCODEY are the barcodes introduced by an index PCR targeting the PRIMER sites. The primer sites are ligated to the sequence of interest (SEQ). We have observed up to 14 barcode pairs in a single read.

We would like to split at sites where two barcodes are adjacent to each other, so that the read above becomes:

5'-ADAPTER-Y-TOP-BARCODEX-PRIMER-SEQ-PRIMER-BARCODEX-3' and 5'-BARCODEY-PRIMER-SEQ-PRIMER-BARCODEY-3'
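The requested splitting rule can be sketched at the level of the symbolic read structure. This is only an illustration of the rule (cut wherever two barcode elements sit next to each other), not how duplex_tools works; the function name `split_at_adjacent_barcodes` and the token list are hypothetical and operate on labels rather than on real sequence.

```python
def split_at_adjacent_barcodes(tokens: list[str]) -> list[list[str]]:
    """Split a symbolic read structure between two neighbouring barcode tokens."""
    reads: list[list[str]] = []
    current: list[str] = []
    for token in tokens:
        # A barcode directly following another barcode marks a chimera junction.
        if current and token.startswith("BARCODE") and current[-1].startswith("BARCODE"):
            reads.append(current)
            current = []
        current.append(token)
    if current:
        reads.append(current)
    return reads


# The chimeric structure from the example above, written as a list of labels.
chimera = [
    "ADAPTER-Y-TOP", "BARCODEX", "PRIMER", "SEQ", "PRIMER", "BARCODEX",
    "BARCODEY", "PRIMER", "SEQ", "PRIMER", "BARCODEY",
]

for read in split_at_adjacent_barcodes(chimera):
    print("5'-" + "-".join(read) + "-3'")
```

Each output read keeps its own head and tail barcode, which is what the subsequent demultiplexing needs.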


onordesjo · 2022-09-05T12:50:36Z

Hi @MortenEneberg, could you try setting PCR primers here:

https://github.com/nanoporetech/duplex-tools/blob/master/duplex_tools/split_on_adapter.py#L36

Then run with the "PCR" setting, as in the command below:

duplex_tools split_on_adapter <fastq_directory> <output_directory> PCR

It may give you the following, in which case we would have to think about how to retain the tail barcode in the left read and the head barcode in the right read.

5'-ADAPTER-Y-TOP-BARCODEX-PRIMER-SEQ-PRIMER-3' and 5'-PRIMER-SEQ-PRIMER-BARCODEY-3'

Cheers

MortenEneberg · 2022-09-06T08:35:42Z

Hi @onordesjo

So I tried with just a few reads where I knew the structure.

One was a read where nothing was supposed to happen, with this structure:
5'-ADAPTER-Y-TOP-BARCODEX-PRIMER1-SEQ-PRIMER2-BARCODEX-3'
This read was not split.

The next one was a read with the following structure, which was supposed to be split into 3 reads, as I set --allow-multiple-splits:
5'-ADAPTER-Y-TOP-BARCODEX-PRIMER1-SEQ-PRIMER2-BARCODEX-BARCODEY-PRIMER1-SEQ-PRIMER2-BARCODEY-BARCODEZ-PRIMER1-SEQ-PRIMER2-BARCODEZ-3'

And this was split into 2 reads with the following structures:
Read 1: 5'-ADAPTER-Y-TOP-BARCODEX-PRIMER1-SEQ-PRIMER2-BARCODEX-BARCODEY-3'

Read 2: 5'-BARCODEY-BARCODEZ-PRIMER1-SEQ-PRIMER2-BARCODEZ-3'

meaning that the following was discarded:
5'-PRIMER1-SEQ-PRIMER2-3'

Interestingly, the beginning of read 2 contains the end of PRIMER2, meaning that not all of the primer was in the cut out part of the read.

I would have liked these 3 reads to be the output instead:
5'-ADAPTER-Y-TOP-BARCODEX-PRIMER1-SEQ-PRIMER2-BARCODEX-3'
5'-BARCODEY-PRIMER1-SEQ-PRIMER2-BARCODEY-3'
5'-BARCODEZ-PRIMER1-SEQ-PRIMER2-BARCODEZ-3'

I get the same splits when running without --allow-multiple-splits.

I used the following primers in the split_on_adapter.py file:

pcr_primers=(
            'ACACTCTTTCCCTACACGACGCTCTTCCGATCT',
            'AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC'),

I ran it on the following 4 reads, of which I have only presented the first 2 here:
input_seq1.fastq.gz

With this command: duplex_tools split_on_adapter $input_seq $output_seq PCR --allow-multiple-splits

Cheers!

onordesjo · 2022-09-06T08:56:49Z

Hi @MortenEneberg,

Getting there it seems!

If the last SEQ is rather short, it may be the case that it's masked (to not accidentally split reads right at the end).

You could try to use the additional options --trim_end 10 --trim_start 10 (which would reduce trimming down to 10bp on either side).

MortenEneberg · 2022-09-06T11:54:45Z

Hi @onordesjo,

I just updated the splits for the second example (7/9 at 10AM), as I had made a small error.

I tried your suggested settings.

The read where nothing was supposed to happen, with this structure:
5'-ADAPTER-Y-TOP-BARCODEX-PRIMER1-SEQ-PRIMER2-BARCODEX-3'
now splits into:
5'-ADAPTER-Y-TOP-BARCODEX-PRIMER1-3'
and
5'-BARCODEX-3'

thus discarding: 5'-SEQ-PRIMER2-3' ...
As with the examples in the previous comment, small remnants of the adapters are left on each side of the cut. The tool does not identify the whole adapter, which would be crucial for the subsequent demultiplexing once the cuts are placed correctly.

The next one was a read with the following structure, supposed to be split into 3 reads:
5'-ADAPTER-Y-TOP-BARCODEX-PRIMER1-SEQ-PRIMER2-BARCODEX-BARCODEY-PRIMER1-SEQ-PRIMER2-BARCODEY-BARCODEZ-PRIMER1-SEQ-PRIMER2-BARCODEZ-3'

With the new settings, this was split into 2 reads with the following structures:
Read 1: 5'-ADAPTER-Y-TOP-BARCODEX-PRIMER1-SEQ-PRIMER2-BARCODEX-BARCODEY-PRIMER1-SEQ-3'
and
Read 2: 5'-SEQ-PRIMER2-BARCODEZ-3'

meaning that the following was discarded:
5'-PRIMER2-BARCODEY-BARCODEZ-PRIMER1-3'

Do you have any clue on how to solve this?

Cheers!

MortenEneberg · 2022-09-12T09:30:53Z

Hi @onordesjo,

I appreciate your help!

Did you have a chance to look at it yet?

Kind regards

onordesjo · 2022-09-12T10:11:06Z

Hi @MortenEneberg, sorry, I don't have much bandwidth to look at this at the moment.

Would you be able to use a debugger to step through split_on_adapter and see where the decisions are made? I'd suggest starting on this line, which is where all results are found for matches against the subsequence:

https://github.com/nanoporetech/duplex-tools/blob/master/duplex_tools/split_on_adapter.py#L142

MortenEneberg · 2022-10-24T08:33:42Z

Hi @onordesjo,

Have you had the time to give it a look?

Kind regards,
Morten

onordesjo · 2022-10-24T10:09:58Z

Hi @MortenEneberg,

I have started to look at it, but I will probably need to add the barcode sequences to this plot to get a better view of what should be written out. I've added imperfect matches to both of your primer sequences (using seqkit locate), but it's not entirely clear yet.

MortenEneberg · 2022-10-24T10:30:02Z

Hi @onordesjo,

Thank you for looking into it!

I have attached the barcodes here, along with the sequencing adapter: Single_barcodes_rev_for.txt

Note that the attached file contains the primer sequences. When reading one strand, the barcodes at the 5' and 3' ends will be the same.

Kind regards,
Morten

onordesjo · 2022-10-24T10:38:46Z

Thanks Morten!

I'll add those in and try to get it straightened out. My feeling is that it'll be easiest in this use case to use a standalone tool (since the front adapter is not actually expected to be in the middle).

By the way, do note the targets that are being matched against (linked below). I forgot to point that out previously, but it is obviously relevant if you're not actually expecting part of the adapter to be between the primers:
https://github.com/nanoporetech/duplex-tools/blob/master/duplex_tools/split_on_adapter.py#L58

So basically what's being searched for is <primer-x-rc><adapter-rc><variable-length-N><adapter><primer-y>, where primer-x and primer-y may be the same or different.

        'PCR': [
            (rev_comp(x)
                + tail_adapter[:len(tail_adapter) - n_bases_to_mask_tail]
                + middle_sequence + head_adapter[n_bases_to_mask_head:] + y)
            for x in pcr_primers for y in pcr_primers]
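To make the shape of these targets concrete, here is a minimal standalone sketch that evaluates the same expression outside the tool. The two primers are the ones posted earlier in this thread; the adapter sequences, the `middle_sequence` value and the mask lengths are placeholders chosen for illustration and are not the values used by duplex_tools. The wrapper `build_pcr_targets` and this `rev_comp` are hypothetical stand-ins.

```python
# Complement table for plain DNA plus N (no other ambiguity codes handled).
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def rev_comp(seq: str) -> str:
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def build_pcr_targets(pcr_primers, head_adapter, tail_adapter, middle_sequence,
                      n_bases_to_mask_head=0, n_bases_to_mask_tail=0):
    """Same expression as the 'PCR' entry quoted above, one target per primer pair."""
    return [
        (rev_comp(x)
         + tail_adapter[:len(tail_adapter) - n_bases_to_mask_tail]
         + middle_sequence + head_adapter[n_bases_to_mask_head:] + y)
        for x in pcr_primers for y in pcr_primers]


pcr_primers = (
    "ACACTCTTTCCCTACACGACGCTCTTCCGATCT",
    "AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC",
)

# Placeholder adapters: NOT the real ONT adapter sequences.
targets = build_pcr_targets(pcr_primers, head_adapter="GGGGGGGG",
                            tail_adapter="CCCCCCCC", middle_sequence="NNNN")
for target in targets:
    print(target)
```

Every target contains adapter sequence between the two primers, so a chimera whose barcodes are joined directly, with no adapter in between, is not expected to match these targets well.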

onordesjo · 2022-10-24T10:50:26Z

In principle, you should have better luck with something like this, which detects two barcodes directly next to each other:

        'BARCODES': [
            (rev_comp(x) + y)
            for x in barcodes for y in barcodes]

I've marked up the regions (red bars right under the tick marks) that I understand you'd want written out as separate reads, is this correct?

MortenEneberg · 2022-10-31T11:45:12Z

Dear @onordesjo ,

Yes it looks correct! I have attached a paint image (not pretty..) just to make sure we are on the same page :)

Thank you!

Morten

MortenEneberg · 2022-11-28T08:31:23Z

Dear @onordesjo,

Thanks for your help! Did you have a chance to look at it yet?

Kind regards,
Morten

onordesjo · 2022-11-28T09:06:35Z

Hi Morten!

Sorry, I wasn't clear in the last message. I don't think it's something we're planning to support since it's a rather special use case.

You could definitely give it a go and replace:

        'PCR': [
            (rev_comp(x)
                + tail_adapter[:len(tail_adapter) - n_bases_to_mask_tail]
                + middle_sequence + head_adapter[n_bases_to_mask_head:] + y)
            for x in pcr_primers for y in pcr_primers]

with:

        'BARCODES': [
            (rev_comp(x) + y)
            for x in barcodes for y in barcodes]

and see if you get the right matches then.
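As a rough standalone check of the BARCODES idea, the sketch below builds the `rev_comp(x) + y` junctions and cuts a read between the two barcodes, so the left read keeps its tail barcode and the right read keeps its head barcode. It assumes exact string matches, which real nanopore reads will rarely give, so an approximate matcher would be needed in practice. The helpers `find_cut_sites` and `split_read` and the toy barcodes are hypothetical and not part of duplex_tools.

```python
from itertools import product

_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def rev_comp(seq: str) -> str:
    """Reverse complement of an upper-case DNA string."""
    return seq.translate(_COMPLEMENT)[::-1]


def find_cut_sites(read: str, barcodes: list[str]) -> list[int]:
    """Positions between a tail barcode (rev_comp(x)) and a head barcode (y)."""
    cuts = set()
    for x, y in product(barcodes, repeat=2):
        left = rev_comp(x)
        junction = left + y
        start = read.find(junction)
        while start != -1:
            # Cut right after the tail barcode, before the head barcode.
            cuts.add(start + len(left))
            start = read.find(junction, start + 1)
    return sorted(cuts)


def split_read(read: str, barcodes: list[str]) -> list[str]:
    """Split a read at every exact barcode-barcode junction."""
    bounds = [0, *find_cut_sites(read, barcodes), len(read)]
    return [read[a:b] for a, b in zip(bounds, bounds[1:])]


# Toy 4-nt barcodes and inserts; real barcodes in this thread are 24 nt.
bc_x, bc_y = "AAAC", "GGTA"
chimera = (bc_x + "CCCCCC" + rev_comp(bc_x)
           + bc_y + "TTTTTT" + rev_comp(bc_y))

for part in split_read(chimera, [bc_x, bc_y]):
    print(part)
```

Cutting at the boundary between the two barcodes, rather than removing the matched region, is a design choice made here so that both resulting reads can still be demultiplexed.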

Again, sorry for not being clear and for not being able to put more resources into this!
