Specify column names for header in GWAS summary stats file : followup to Issue #76 #92

Open
samreenzafer opened this issue Jan 16, 2025 · 1 comment

Comments

@samreenzafer

Hi,
I recently downloaded the May 2024 version of PRScs and updated my scripts as we discussed in issue #76. I wanted to share the tested code with you.

parse_genet_newColNames.py.txt
PRScs_colnames.py.txt

These scripts work well with GWAS summary statistics downloaded from the GWAS Catalog, which typically have the following header:

zcat ../GWAS/GCST90399677.h.tsv.gz | head 
chromosome      base_pair_location      effect_allele   other_allele    beta    standard_error  effect_allele_frequency p_value rsid    is_strand_flip  rs_id   N_ctrl  n_bbk   is_diff_AF_gnomAD       n_dataset       inv_var-het_p   direction       N_case  hm_coordinate_conversion        hm_code variant_id
1       16226   A       AG      -0.1643 0.079081        0.01037 0.037739999999999996    rs755466349     no      rs755466349     330526  2       no      2       0.6037  ??--??????      24773   lo      10      1_16226_AG_A
1       48186   G       T       -0.20347        0.15655 0.003232        0.1937  rs199900651     no      rs199900651     269406  2       no      2       0.5154  ???-??-???      17186   lo      10      1_48186_T_G
1       55326   C       T       0.0046873       0.14151 0.07157000000000001     0.9736  rs3107975       no      rs3107975       94636   2       no      2       0.9798  ?+???+????      1282    lo      10      1_55326_T_C

The modified script is run as follows:
python /softwares/PRScs/PRScs_colnames.py --ref_dir=$ldref/ldblk_1kg_eur --bim_prefix=$bim_dir/inputfile --sst_file=$SUM_STATS_FILE --n_gwas=$GWAS_SAMPLE_SIZE --out_dir=$out_dir/output.PRScs --SNP=rs_id --A1=effect_allele --A2=other_allele --BETA=beta --P=p_value

I do have a follow-up request.
Often, GWAS summary statistics files provide only the tested allele (minor allele, A1). Could you please modify PRScs so that it can work with such input? It should not be hard to find the SNPs common to the LD reference, BIM, and summary statistics files by matching on rsID (SNP ID) and, at a minimum, the A1 allele, and to adjust the beta when A1 is the major allele rather than the minor one, by looking it up in the HapMap SNP list provided with your software at $PRScsREFDIR/ldblk_1kg_eur/snpinfo_1kg_hm3.

Example:
This summary statistics file from the GWAS Catalog (GCST006479 [https://www.ebi.ac.uk/gwas/studies/GCST006479]) has no A2 column but provides all the other required columns.

SNP ALLELE iscores NBETA-clinical_c_K57 NSE-clinical_c_K57 PV-clinical_c_K57
10:62535_C_A A 0.8509559999999999 -0.04962 0.087016 0.56852
10:66208_T_C C 0.239414 -0.05802 0.051094 0.25614000000000003
10:67991_A_C C 0.299157 -0.022293 0.036826 0.54495
rs11252546 C 0.9844569999999999 -0.00016589 0.0005241000000000001 0.7516
rs12255619 C 0.995585 -0.00072331 0.00088709 0.41486000000000006
rs7909677 G 1.0 -0.00067211 0.00088585 0.44803000000000004
rs10904494 C 0.9968440000000001 -4.3866e-05 0.00051934 0.93269
rs11591988 T 0.978547 -0.00018362 0.00080765 0.82015
rs4508132 C 0.99779 -0.00084295 0.00066752 0.20665999999999998
rs9419461 T 0.9875889999999999 -0.0004753 0.00071174 0.50427
rs10904561 G 0.99069 -0.00021088 0.00053584 0.69391
rs11253478 T 0.978096 -0.00016417 0.00080747 0.8388899999999999
rs4495823 A 0.995431 -0.0008757 0.00066695 0.18919

The SNPs it shares with the $PRScsREFDIR/ldblk_1kg_eur/snpinfo_1kg_hm3 file are:

10	rs12255619	98481	C	A	0.066600
10	rs11252546	104427	C	T	0.369800
10	rs7909677	111955	G	A	0.067590
10	rs10904494	113934	C	A	0.368800
10	rs9419461	124767	T	C	0.126200
10	rs11591988	126070	T	C	0.103400
10	rs4508132	131636	T	C	0.154100
10	rs10904561	135656	G	T	0.351900
10	rs7917054	135708	A	G	0.469200

It would be very helpful if PRScs worked when the A2 column is not provided in the summary statistics file.
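
To illustrate the lookup described above, here is a minimal sketch of filling in a missing A2 from the reference SNP list. It is not PRScs code: the function names `load_ref_alleles` and `infer_a2` are hypothetical, and it assumes the snpinfo file is tab-separated with a header row `CHR SNP BP A1 A2 MAF`. It only recovers A2 and leaves the beta untouched, on the assumption that effect direction can be aligned afterwards once both alleles are known. Variants whose reported allele matches neither reference allele are skipped rather than guessed.

```python
import csv
import io


def load_ref_alleles(handle):
    """Map SNP id -> (A1, A2) from a snpinfo-style table.

    Assumes a tab-separated file with header CHR SNP BP A1 A2 MAF.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    return {row["SNP"]: (row["A1"], row["A2"]) for row in reader}


def infer_a2(snp, a1, ref):
    """Return the missing A2 for (snp, a1), or None if it cannot be inferred."""
    if snp not in ref:
        return None  # SNP is not in the reference list
    ref_a1, ref_a2 = ref[snp]
    a1 = a1.upper()
    if a1 == ref_a1:
        return ref_a2
    if a1 == ref_a2:
        return ref_a1
    # Allele matches neither reference allele (possible strand flip or
    # multi-allelic site); skip it instead of guessing.
    return None


# Two reference rows copied from the snpinfo excerpt above.
REF_TEXT = (
    "CHR\tSNP\tBP\tA1\tA2\tMAF\n"
    "10\trs12255619\t98481\tC\tA\t0.066600\n"
    "10\trs4508132\t131636\tT\tC\t0.154100\n"
)
ref = load_ref_alleles(io.StringIO(REF_TEXT))

print(infer_a2("rs12255619", "C", ref))    # A
print(infer_a2("rs4508132", "C", ref))     # T
print(infer_a2("10:62535_C_A", "A", ref))  # None
```

Note that without A2 a strand flip at a palindromic site (A/T or C/G) cannot be detected, so such variants may be better dropped before running this kind of lookup.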

I hope my scripts help others too.

@getian107
Owner

Thank you for sharing the scripts. I believe they will be beneficial for many people.

Regarding your request, I currently don’t think I have the bandwidth to modify PRScs to accommodate the new format. GWAS summary statistics come in various formats, some of which do not follow best practices (such as failing to report A2), making it challenging to accommodate all variations. Therefore, I've decided to leave it to users to preprocess the summary statistics into a specified format.

However, as I mentioned in issue 76, we are working on an algorithmic extension of PRScs, along with command-line options to select columns, which will accommodate a much larger range of formats. We hope to release those tools soon.
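
In the meantime, the preprocessing step mentioned above can be sketched roughly as follows. This is only an illustration, not part of PRScs: the function name `to_prscs_format` and the `COLMAP` dictionary are hypothetical, the source column names are taken from the GWAS Catalog header shown earlier in this thread, and the target header `SNP A1 A2 BETA SE` is assumed to be what the installed PRScs version expects (please check the README for your version).

```python
import csv
import io

# Target column -> source column (GWAS Catalog names from the header above).
COLMAP = {
    "SNP": "rs_id",
    "A1": "effect_allele",
    "A2": "other_allele",
    "BETA": "beta",
    "SE": "standard_error",
}


def to_prscs_format(src, dst, colmap=COLMAP):
    """Copy the mapped columns from a tab-separated file, renaming the header.

    Returns the number of data rows written.
    """
    reader = csv.DictReader(src, delimiter="\t")
    missing = [c for c in colmap.values() if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError("missing columns: " + ", ".join(missing))
    writer = csv.writer(dst, delimiter="\t", lineterminator="\n")
    writer.writerow(list(colmap.keys()))
    n_rows = 0
    for row in reader:
        writer.writerow([row[source] for source in colmap.values()])
        n_rows += 1
    return n_rows


# Small in-memory example using one row from the file shown above.
src = io.StringIO(
    "chromosome\teffect_allele\tother_allele\tbeta\tstandard_error\trs_id\n"
    "1\tA\tAG\t-0.1643\t0.079081\trs755466349\n"
)
dst = io.StringIO()
n_rows = to_prscs_format(src, dst)
print(dst.getvalue())
```

For a real file, `src` could be opened with `gzip.open(path, "rt")` and `dst` with `open(out_path, "w", newline="")`.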
