Fusion-making .py script does not take into account gene orientation #26

nvolkovaGEL · 2020-06-11T09:47:37Z

Thanks for creating this tool! I have noticed that the scripts/make_fusion_genes.py reports wrong order of exons for genes in '-' orientation, which results in considering some fusions as untranscribed and showing wrong fusion structure.

E.g. NTRK3 in cancer.hg38.csv:

NTRK3_ENST00000394480.6,chr15:87859751-88256768
1,88256644,88256747
2,88256284,88256486
3,88255906,88256168
4,88184225,88184299
...

Versus NTRK3 from the fusions.csv file generated by the script:

NTRK3_NM_002530,chr15:87859750-88256739
1,87859750,87877120
2,87880269,87880428
3,87929190,87929434
4,87933011,87933184
...

Here is a quick fix I introduced to make_fusion_genes.py:

Line 33:
_, transcript, chrom, strand, start, end, _, _, _, exonstart, exonend = line.rstrip("\n").split("\t")
Line 44:
_, transcript, chrom, strand, start, end, _, _, _, exonstart, exonend = line.rstrip("\n").split("\t")
And insert these after line 50:
if strand == '-':
exons = exons[::-1]

This returns the exons in the right order.

The text was updated successfully, but these errors were encountered:

sfchen · 2020-06-23T00:47:59Z

@ZKai0801

Can you please check and fix this issue?

nvolkovaGEL · 2020-06-26T14:56:58Z

Hi, just to add to this, I encountered one more bug in the fusion list making script: it picks up the wrong gene if the gene name is contained in some other gene name (like RET and RETREG1) and the relevant transcript is not supplied. Example:
grep RET refFlat.txt

RETREG1 NM_019000 chr5 - 16473037 16508998 16474740 16508612 7 16473037,16477661,16478033,16478849,16481008,16483345,16508577, 16475234,16477788,16478098,16478987,16481093,16483472,16508998,
RET NM_020975 chr10 + 43077068 43130349 43077258 43128269 20 43077068,43100458,43102341,43104951,43106375,43109030,43111206,43112098,43112852,43113555,43114479,43116583,43118372,43119530,43120080,43121945,43123670,43124882,43126574,43128111, 43077331,43100722,43102629,43105193,43106571,43109230,43111465,43112224,43112963,43113675,43114736,43116731,43118480,43119745,43120203,43122016,43123808,43124982,43126722,43130349,

and if I supply RET to current script, I'll get this:

RET_NM_001034850,chr5:16473037-16617058

I suspect this may be an issue for transcripts as well. I invented a quick fix, but someone more keen with regular expressions can just check if gene[0] or gene[1] plus tab symbol is in the line:

    if len(gene) == 1:
        transcripts = {}
        with open(refflat, "r") as fh:
            for line in fh:
                cur_gene, transcript, chrom, strand, start, end, _, _, _, exonstart, exonend = line.rstrip("\n").split("\t")
                if cur_gene != gene[0]:
                    continue
                transcripts[transcript] = (chrom, start, end, exonstart, exonend)
        transcript = get_longest_transcript(transcripts.keys(), refflat)
        chrom, start, end, exonstart, exonend  = transcripts[transcript]
 # use user-specified transcript
    elif len(gene) == 2:
        with open(refflat, "r") as fh:
            for line in fh:
                _, transcript, chrom, _, start, end, _, _, _, exonstart, exonend = line.rstrip("\n").split("\t")
                if transcript != gene[1]:
                    continue
                break`

ZKai0801 · 2020-06-28T01:53:12Z

Thank you very much for reporting bugs and providing solutions! I have fixed the two bugs you mentioned as the way you suggested.
@sfchen please could you merge the pull request I just created?

fix bugs mentioned in #26

nvolkovaGEL · 2020-06-28T19:13:23Z

Just found out that I did not fully fix the strand issue, should have added strand to the identification of the longest transcript as well:

    if len(gene) == 1:
        transcripts = {}
        with open(refflat, "r") as fh:
            for line in fh:
                cur_gene, transcript, chrom, strand, start, end, _, _, _, exonstart, exonend = line.rstrip("\n").split("\t")
                if cur_gene != gene[0]:
                        continue
                transcripts[transcript] = (chrom, strand, start, end, exonstart, exonend)
        transcript = get_longest_transcript(transcripts.keys(), refflat)
        chrom, strand, start, end, exonstart, exonend  = transcripts[transcript]

Now should be fully correct! Happy to help!

sfchen · 2020-06-28T23:53:37Z

Just found out that I did not fully fix the strand issue, should have added strand to the identification of the longest transcript as well:

    if len(gene) == 1:
        transcripts = {}
        with open(refflat, "r") as fh:
            for line in fh:
                cur_gene, transcript, chrom, strand, start, end, _, _, _, exonstart, exonend = line.rstrip("\n").split("\t")
                if cur_gene != gene[0]:
                        continue
                transcripts[transcript] = (chrom, strand, start, end, exonstart, exonend)
        transcript = get_longest_transcript(transcripts.keys(), refflat)
        chrom, strand, start, end, exonstart, exonend  = transcripts[transcript]

Now should be fully correct! Happy to help!

You should submit another PR?

ZKai0801 · 2020-06-29T02:07:52Z

@nvolkovaGEL You are perfectly right, I should have more test before made commit.
Besides adding the amendment as you suggested, I added some error msgs to the code just in case people put wrong gene symbol/transcript to the input file.
@sfchen Could you please merge my second PR? sorry for the trouble

sfchen · 2020-06-29T02:19:46Z

Merged.

nvolkovaGEL · 2020-06-29T16:18:47Z

Sorry, should have just made a pull request, but couldn't get my locked local machine to synchronize properly with github. Thanks, please feel free to close this ticket!

sfchen mentioned this issue Jun 23, 2020

update for gen_fusion_file #23

Merged

sfchen added a commit that referenced this issue Jun 28, 2020

Merge pull request #28 from ZKai0801/master

5958b57

fix bugs mentioned in #26

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fusion-making .py script does not take into account gene orientation #26

Fusion-making .py script does not take into account gene orientation #26

nvolkovaGEL commented Jun 11, 2020

sfchen commented Jun 23, 2020

nvolkovaGEL commented Jun 26, 2020 •

edited

Loading

ZKai0801 commented Jun 28, 2020

nvolkovaGEL commented Jun 28, 2020

sfchen commented Jun 28, 2020

ZKai0801 commented Jun 29, 2020

sfchen commented Jun 29, 2020

nvolkovaGEL commented Jun 29, 2020

Fusion-making .py script does not take into account gene orientation #26

Fusion-making .py script does not take into account gene orientation #26

Comments

nvolkovaGEL commented Jun 11, 2020

sfchen commented Jun 23, 2020

nvolkovaGEL commented Jun 26, 2020 • edited Loading

ZKai0801 commented Jun 28, 2020

nvolkovaGEL commented Jun 28, 2020

sfchen commented Jun 28, 2020

ZKai0801 commented Jun 29, 2020

sfchen commented Jun 29, 2020

nvolkovaGEL commented Jun 29, 2020

nvolkovaGEL commented Jun 26, 2020 •

edited

Loading