Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

EPACTS tool support of GRCh38 #20

Open
ana-stankovic opened this issue Apr 9, 2020 · 11 comments
Open

EPACTS tool support of GRCh38 #20

ana-stankovic opened this issue Apr 9, 2020 · 11 comments

Comments

@ana-stankovic
Copy link

Hello, I have files run on both GRCh37 and 38 reference and would like to use EPACTS tools to process them.

In the post on the EPACTS Google Group ( here ) the question of tool support for GRCh38 was asked and the answer was given to change the EPACTS/scripts/epacts.pm file.
But modifying the file by hard coding for one or the other is not something that is compatible with my implementation.
Is there another way to achieve compatibility for both references?

Thank you.

@jonathonl
Copy link

Can you try using the --ref option? See epacts <command> --help for more information.

@ana-stankovic
Copy link
Author

ana-stankovic commented Apr 9, 2020

What I found in the code from EPACTS/scripts/epacts.pm file, you can see that the tool is hard-coded for GRCh37 - chromosome names and lengths that correspond to 37 reference, but not 38.

BEGIN {
## Variables below are hard-coded based on GRCh37
    @chrs = (1..22,"X","Y","MT");
    @szchrs = qw(249250621 243199373 198022430 191154276 180915260 171115067 159138663 146364022 141213431 135534747 135006516 133851895 115169878 107349540 102531392 90354753 81195210 78077248 59128983 63025520 48129895 51304566 155270560 59373566 16569);
    %ichrs = ();
    
    @cumszchrsMb = (0);
    for(my $i=0; $i < @chrs; ++$i) {
	push(@cumszchrsMb,$szchrs[$i]/1e6+$cumszchrsMb[$i]);
	$ichrs{$chrs[$i]} = $i;
    }
}

With that in mind, are you sure that giving -ref as GRCh38 would result in correct values?

@jonathonl
Copy link

It probably hasn't made it to the master branch yet, but this was fixed with 60fe5db. I would recommend using the latest pre-release (https://github.com/statgen/EPACTS/releases).

@ana-stankovic
Copy link
Author

ana-stankovic commented Apr 14, 2020

Hello, thank you for your response.
Per your recommendation, I have installed the pre-release version and run with the 38 reference file.
However, in the logs I see that it is still using files for hg19:
Load gene file /usr/local/share/EPACTS/hg19_gencodeV14.txt.gz...

With that in mind, and the parts of code that are hardcoded (like the different chromosome lengths that correspond to 37 reference that I mentioned above), how does this affect the outputs? Do you have any experience with this?
Thank you

@jonathonl
Copy link

Which analysis are you trying to run? Can you attach the entire log file?

@ana-stankovic
Copy link
Author

ana-stankovic commented Apr 15, 2020

For now I am trying a test run for EPACTS Single Variant Test.
This is the command line
epacts single --vcf true_1000G_exome_chr20_example_softFiltered.calls.vcf_GRCh38_liftover.vcf.gz --ped 1000G_dummy_pheno.ped --test b.score --pheno DISEASE --cov AGE --cov SEX --min-maf 0.001 --anno --ref GRCh38ERCC.ensembl95.fasta --out test --run 2
And the log file is attached here job.err.log
For now I am just testing the implementation with Single Variant Test, but I plan to use Create Marker Group File and Create Kinship Matrix, Groupwise Burden Test, Zoom Plot and Plot, so I am interested in the effect of this on them as well.

@jonathonl
Copy link

I see. It's because you have the --anno option enabled. We will need to add a parameter in epacts single for specifying the gene file. In the meantime, you can remove --anno and then manually run epacts anno on the resulting *.epacts file, using --genef to specify the appropriate gene file. That is if you desire gene annotations in your results.

@ana-stankovic
Copy link
Author

Thank you, this solves the problem of the gene anottation.
How does the hardcoded part of the EPACTS/scripts/epacts.pm script affect results? As I mentioned, this part of the script is hardcoded:

## Variables below are hard-coded based on GRCh37
    @chrs = (1..22,"X","Y","MT");
    @szchrs = qw(249250621 243199373 198022430 191154276 180915260 171115067 159138663 146364022 141213431 135534747 135006516 133851895 115169878 107349540 102531392 90354753 81195210 78077248 59128983 63025520 48129895 51304566 155270560 59373566 16569);
    %ichrs = ();
    
    @cumszchrsMb = (0);
    for(my $i=0; $i < @chrs; ++$i) {
	push(@cumszchrsMb,$szchrs[$i]/1e6+$cumszchrsMb[$i]);
	$ichrs{$chrs[$i]} = $i;
    }
}

For this to be compatible for 38, do we need to change this, or is there a workaround?

@jonathonl
Copy link

It doesn't. The initRef function is called when an alternative reference is provided, which overrides the initial values. See https://github.com/statgen/EPACTS/blob/develop/scripts/epacts.pm#L576-L648.

@ana-stankovic
Copy link
Author

Hello, thank you for all the help so far.
I am having trouble finding a corresponding version of the Gene Map file (the genetic_map_GRCh37_wgs.txt.gz file from the data package) for the GRCh38.
I was not able to locate it in the UCSC Genome Browser or Downloads.
Thank you.

@jonathonl
Copy link

You can find a build 38 map file at http://csg.sph.umich.edu/locuszoom/download/recomb-hg38.tar.gz

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants