CCS copy number in each read #13

Open
mmaitenat opened this issue Jan 14, 2022 · 4 comments

Comments

@mmaitenat

Hi again,

I would like to know if the number of times a circRNA is repeated in each read (which I think you call the CCS copy number) is reported somewhere in the output of CIRI-long. My idea is to make plots similar to those in Supplementary Figure 7 of your article "Comprehensive profiling of circular RNAs with nanopore sequencing and CIRI-long" with my own data.

Thanks!

Maitena.

@Kevinzjy
Member

Hi, you can find the information you need in the 7th column of *.cand_circ.fa generated by the CIRI-long call command, which contains the start and end position of CCS segments in the raw ONT reads.

>9b2ec396-b290-4b90-b115-ff3fbc33076d   chr1:87938154-87940306  +       87938154-87938265|114,87938346-87938438|94,87939196-87939315|121,87940177-87940306|130      AG-GT|2-1       288|0-463       10-452;452-905;905-1358;1358-1467
GCATTCAGGGAGATAGCACAGTCCCACAGAGCCATGGAACAGGAGCTCGCACATGCTGTCAATGCCAGCTCCAAAGCCATGGAGCAGTATACAGCAAGCCCAGAACTGCAGAGGGTTGAACTGCCAGCTTTGTTCTGGAGATGGTGAATAACATCAGAGCACTGCGCAGTGAGACAGAGCTGCTGCTGGCTGGGAAGATGGCCCTGCAATTGGATCCCCCTCAGAAGGAACGGCAGAAACCGGGGCTGCCCTAATTGAGATGGACCAGCAGCTCAGGAAGCTGACAGACACTCCCTGGCTTTACGCCAGCCCTTGGAAGCCTGGTGAGGAAGAGTCTCTCCAACAGAATGTGATGCTGGATCTTACTAAACGCAGCCGTAGTGGTAAATTCCGCCTTGTGACCAAGTTTAAAAAGGAGAAAAACAATAAGAACAAAGAAGTTCACAGTAACCTAGGAGGCCCT
>82ca865f-cd01-4c4b-a12b-5343d9f8464b   chr4:95850782-95851509  +       95850782-95851509|731   AG-GT*|-3--6    611|3-725       41-758;758-845
GGTAGTCCTCTAGAGCTGATGAGGTTTGTAGAGTCAGACCCCAGCTACAGCTGTAGAACCAGGCATCCTTGGTTGCTGGAAACCAATCCTGGAAGCAGAGTACTAGCGCATGCCCAAACTCATGAAACAGCCAGTATAGAGCTGGAAGAAAGTCAGACCCCCAGCTACCAGCTGAGAACCAGGCACTTCAACCACTTGCCCGCATGCCCCAGTGTTAGAAGTGACAAACCAGGTGTTCTAATAATTTTTAATAATTGGGAATTCAATTTGCTGTGACTGCCTGAGTGTGGCAGACCCTGTGCTAAGTTCTTTAGTATAGCTCTCCTAATGCATATAATACCCTTTCATGGCCTGTAAGAGGGCCAGAAACTTACAAACACAGACCATTAGAAACCTCCAGTGGCAGAAGCCCATTTCCAGTTTAAGAATGGAGCTGGGCATGTGGCTTGGTGCTTAAAGCACTTCTGTCTTCCAGAGGACCTGCATCAATTTCCAGTACATTGTTGGTTCATCTGTGGAGTTATCATCTGTAACTCCGGTACCAGGAGTCTACTGCCCTCTCCTTCTGGAATTACCCTGGTGGTGGTGCCTATGCATAAACCTATCATTCAATCTATACAAAACAAACTAATCAATTACTCAATACGAAATAATATGTGCAACTAATTGTCATTGGATGGGCTGACTGTAGTGATGAATTGTCTCATAAAAGGTCAGTCTGGGCA
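
As an illustration of reading that 7th column, here is a minimal Python sketch that counts the CCS segments listed in each header. It assumes the header fields are whitespace-separated, as in the example above, and treats the number of segments as a proxy for copy number; the last segment may be a partial copy, so this is not necessarily how the figure in the article was computed. The function names `ccs_segments` and `segment_counts` are hypothetical and not part of CIRI-long.

```python
def ccs_segments(header):
    """Return the (start, end) CCS segments from a *.cand_circ.fa header line.

    Assumption: fields are whitespace-separated and the 7th field looks like
    "start-end;start-end;...", as in the example records above.
    """
    fields = header.lstrip(">").split()
    if len(fields) < 7:
        raise ValueError(f"expected at least 7 fields, got {len(fields)}")
    segments = []
    for part in fields[6].split(";"):
        start, end = part.split("-")
        segments.append((int(start), int(end)))
    return segments


def segment_counts(fasta_path):
    """Yield (read_id, number_of_segments) for every header in a FASTA file."""
    with open(fasta_path) as handle:
        for line in handle:
            if line.startswith(">"):
                read_id = line[1:].split()[0]
                yield read_id, len(ccs_segments(line))


# Header copied from the first example record above.
example = "\t".join([
    ">9b2ec396-b290-4b90-b115-ff3fbc33076d",
    "chr1:87938154-87940306",
    "+",
    "87938154-87938265|114,87938346-87938438|94,"
    "87939196-87939315|121,87940177-87940306|130",
    "AG-GT|2-1",
    "288|0-463",
    "10-452;452-905;905-1358;1358-1467",
])
print(len(ccs_segments(example)))  # 4 segments; the last one may be partial
```

Calling `list(segment_counts("sample.cand_circ.fa"))` (a placeholder file name) would then give one `(read_id, count)` pair per read, which could be fed into a histogram.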

The *.reads output of the CIRI-long collapse command also lists each read ID (1st column) together with the collapsed circRNA ID it was assigned to (2nd column).

read_id circ_id tmp_id  strand  cirexons        signal  alignment       segments        sample  type
d6637a72-5a5b-41e6-8341-25ed39330ed2    chr1:3421702-3526342    chr1:3421702-3526342    -       3421702-3421901|201,3516918-3517016|100,3517613-3517717|106,3523427-3523692|267,3526200-3526342|137 AG-GT|1-7       263|9-831       41-858;858-1254 Long_SMARTer_H-_repfull
0ab11e74-33e9-4272-aae0-2a22035e7bc1    chr1:3421702-3526342    chr1:3421702-3526342    -       3421702-3421901|199,3516918-3517016|100,3517613-3517717|106,3523427-3523692|267,3526200-3526342|146 AG-GT|-1--2     151|0-831       10-827;827-1232 Long_SMARTer_H-_repfull
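
To get per-circRNA values from the *.reads table, the segment counts can be grouped by circ_id. The sketch below assumes the column order shown in the header line above (read_id first, circ_id second, segments eighth) and that no field contains spaces; `segments_per_circ` is a hypothetical helper name, not part of CIRI-long.

```python
from collections import defaultdict


def segments_per_circ(lines):
    """Group (read_id, segment_count) pairs by collapsed circRNA id.

    Assumption: whitespace-separated columns in the order
    read_id, circ_id, tmp_id, strand, cirexons, signal, alignment, segments, ...
    """
    per_circ = defaultdict(list)
    for line in lines:
        fields = line.split()
        # Skip blank lines, the header line, and truncated rows.
        if len(fields) < 8 or fields[0] == "read_id":
            continue
        read_id, circ_id, segments = fields[0], fields[1], fields[7]
        per_circ[circ_id].append((read_id, len(segments.split(";"))))
    return dict(per_circ)


# Two rows abbreviated from the example above (cirexons shortened for brevity).
rows = [
    "read_id\tcirc_id\ttmp_id\tstrand\tcirexons\tsignal\talignment\tsegments\tsample\ttype",
    "d6637a72-5a5b-41e6-8341-25ed39330ed2\tchr1:3421702-3526342\tchr1:3421702-3526342\t-\t"
    "3421702-3421901|201\tAG-GT|1-7\t263|9-831\t41-858;858-1254\tLong_SMARTer_H-_repfull",
    "0ab11e74-33e9-4272-aae0-2a22035e7bc1\tchr1:3421702-3526342\tchr1:3421702-3526342\t-\t"
    "3421702-3421901|199\tAG-GT|-1--2\t151|0-831\t10-827;827-1232\tLong_SMARTer_H-_repfull",
]
for circ_id, reads in segments_per_circ(rows).items():
    print(circ_id, reads)
```

For a real file, passing an open file handle instead of `rows` works the same way, since the function only iterates over lines.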

@mmaitenat
Author

Hi!
That's clear, thanks!
I am so sorry for asking so many questions, but I'm afraid I have a few more...
When I was going through this, I found in the *.info files circRNAs with negative length values. Let me show you some examples:
grep 'circ_len "-' barcode03.info | head -5
2 CIRI-long circRNA 76696524 76696522 3 - . circ_id "2:76696524-76696522"; splice_site "AG-GT*|0--2"; equivalent_seq ""; circ_type "Unknown"; circ_len "-2"; isoform "76696524-76696522";
2 CIRI-long circRNA 117281827 117281825 5 + . circ_id "2:117281827-117281825"; splice_site "AG-GT*|7-5"; equivalent_seq "G"; circ_type "Unknown"; circ_len "-2"; isoform "117281827-117281825";
2 CIRI-long circRNA 121347282 121347280 2 + . circ_id "2:121347282-121347280"; splice_site "AG-GT*|10-8"; equivalent_seq ""; circ_type "Unknown"; circ_len "-2"; isoform "121347282-121347280";
2 CIRI-long circRNA 128669829 128669827 5 - . circ_id "2:128669829-128669827"; splice_site "AG-GT*|-7--9"; equivalent_seq ""; circ_type "Unknown"; circ_len "-2"; isoform "128669829-128669827";
3 CIRI-long circRNA 89958600 89958598 2 - . circ_id "3:89958600-89958598"; splice_site "AG-GT*|5-3"; equivalent_seq ""; circ_type "Unknown"; circ_len "-2"; isoform "89958600-89958598";
I also found in the same file circRNAs with unknown strand and splice_site info, as follows:
grep 'splice_site "None' barcode03.info | head -5
1 CIRI-long circRNA 3215147 3215449 5 None . circ_id "1:3215147-3215449"; splice_site "None"; equivalent_seq ""; circ_type "Unknown"; circ_len "302"; isoform "3215147-3215449";
1 CIRI-long circRNA 9940139 9940778 5 None . circ_id "1:9940139-9940778"; splice_site "None"; equivalent_seq ""; circ_type "Unknown"; circ_len "639"; isoform "9940139-9940778";
1 CIRI-long circRNA 15396359 15396994 7 None . circ_id "1:15396359-15396994"; splice_site "None"; equivalent_seq ""; circ_type "Unknown"; circ_len "635"; isoform "15396359-15396994";
1 CIRI-long circRNA 22552121 22552535 2 None . circ_id "1:22552121-22552535"; splice_site "None"; equivalent_seq "ggg"; circ_type "Unknown"; circ_len "414"; isoform "22552121-22552535";
1 CIRI-long circRNA 32390644 32391226 2 None . circ_id "1:32390644-32391226"; splice_site "None"; equivalent_seq ""; circ_type "Unknown"; circ_len "582"; isoform "32390644-32391226";

Could you be so kind as to explain which situations these circRNAs correspond to and how I should treat them?

Thank you very much!

@mmaitenat
Author

I am so sorry, I just saw an issue about circRNAs with negative and zero length, and your recommendation to remove them because they come from erroneous reads. Still, I was wondering whether I should keep those with splice_site="None", or whether these may be errors too.

Thanks!

@Kevinzjy
Member

Hi, the current version of CIRI-long on GitHub removes these negative-length circRNAs, and I will update the version on PyPI with the next formal release.

splice_site='None' means that no pre-defined splice site could be found in the BSJ region of the CCS reads, so it's hard to tell whether these circRNAs are reverse transcription artifacts or real circRNAs. If you're working with a model species with well-defined splice sites, it's better to filter them out.
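
For anyone still on a release that emits these records, the filtering described above can be sketched in Python as follows. It assumes the attribute layout of the *.info lines quoted earlier in this thread (`key "value";` pairs); `parse_attrs`, `keep_record`, and the `drop_unknown_splice` switch are hypothetical names, not CIRI-long options.

```python
import re

# Matches attribute pairs such as: circ_len "-2"
ATTR = re.compile(r'(\w+) "([^"]*)"')


def parse_attrs(line):
    """Return the key/value attributes of one *.info line as a dict."""
    return dict(ATTR.findall(line))


def keep_record(line, drop_unknown_splice=True):
    """Decide whether a *.info record should be kept.

    Records with circ_len <= 0 are always dropped; records with
    splice_site "None" are dropped only if drop_unknown_splice is True.
    """
    attrs = parse_attrs(line)
    if int(attrs["circ_len"]) <= 0:
        return False
    if drop_unknown_splice and attrs["splice_site"] == "None":
        return False
    return True


# Two records copied from the examples earlier in this thread.
negative = ('2 CIRI-long circRNA 76696524 76696522 3 - . circ_id "2:76696524-76696522"; '
            'splice_site "AG-GT*|0--2"; equivalent_seq ""; circ_type "Unknown"; '
            'circ_len "-2"; isoform "76696524-76696522";')
no_site = ('1 CIRI-long circRNA 3215147 3215449 5 None . circ_id "1:3215147-3215449"; '
           'splice_site "None"; equivalent_seq ""; circ_type "Unknown"; '
           'circ_len "302"; isoform "3215147-3215449";')

print(keep_record(negative))                            # False
print(keep_record(no_site))                             # False
print(keep_record(no_site, drop_unknown_splice=False))  # True
```

Setting `drop_unknown_splice=False` keeps the splice_site "None" records, which may be preferable for non-model species where canonical splice sites are less well defined.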
