Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Errors in missing vs pad values in VCF #1190

Merged
merged 7 commits into from
Mar 5, 2024

Conversation

jeromekelleher
Copy link
Collaborator

Through developing the alternative implementation of vcf-to-zarr conversion in #1185 I think there's some bugs in how we're currently handling missing data. Opening this PR for discussion purposes.

There's some related issues around handling mixed ploidy calls, and string missing values, but let's leave those alone for now.

I realise now that the extra fields I added in to the simple test here duplicates later tests - but they're handy for discussion here for now anyway.

sgkit/io/vcf/vcf_reader.py Outdated Show resolved Hide resolved
@jeromekelleher
Copy link
Collaborator Author

Basically I think we're returning FILL values when we should be returning missing. Consider INFO/NS in the example:

##fileformat=VCFv4.0
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=1000GenomesPilot-NCBI36
##phasing=partial
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##INFO=<ID=AC,Number=.,Type=Integer,Description="Allele count in genotypes, for each ALT allele, in the same order as listed">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=.,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FILTER=<ID=q10,Description="Quality below 10">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
##ALT=<ID=DEL:ME:ALU,Description="Deletion of ALU element">
##ALT=<ID=CNV,Description="Copy number variable region">
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  NA00001 NA00002 NA00003
19      111     .       A       C       9.6     .       .       GT:HQ   0|0:10,15       0|0:10,10       0/1:3,3
19      112     .       A       G       10      .       .       GT:HQ   0|0:10,10       0|0:10,10       0/1:3,3
20      14370   rs6054257       G       A       29      PASS    NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ     0|0:48:1:51,51  1|0:48:8:51,51  1/1:43:5:.,.
20      17330   .       T       A       3       q10     NS=3;DP=11;AF=0.017     GT:GQ:DP:HQ     0|0:49:3:58,50  0|1:3:5:65,3    0/0:41:3:.,.
20      1110696 rs6040355       A       G,T     67      PASS    NS=2;DP=10;AF=0.333,0.667;AA=T;DB       GT:GQ:DP:HQ     1|2:21:6:23,27  2|1:2:0:18,2    2/2:35:4:.,.
20      1230237 .       T       .       47      PASS    NS=3;DP=13;AA=T GT:GQ:DP:HQ     0|0:54:.:56,60  0|0:48:4:51,51  0/0:61:2:.,.
20      1234567 microsat1       G       GA,GAC  50      PASS    NS=3;DP=9;AA=G;AN=6;AC=3,1      GT:GQ:DP        0/1:.:4 0/2:17:2        ./.:40:3
20      1235237 .       T       .       .       .       .       GT      0/0     0|0     ./.
X       10      rsTest  AC      A,ATG,C 10      PASS    .       GT      0       0/1     0|2

We're currently returning [-2, -2, 3, 3, 2, 3, 3, -2, -2] when it should be [-1, -1, 3, 3, 2, 3, 3, -1, -1]. If the NS info field is not present, then surely that's interpreted as missing data not as a missing dimension in the current data. Dimension padding is a special case, when data is present, but the current row has dimension smaller then the overall column.

That's my interpretation anyway - what do you think @tomwhite?

I'll fix up the tests if we agree there is a bug, but that's the essence of it.

@tomwhite
Copy link
Collaborator

tomwhite commented Feb 9, 2024

Thanks for giving an example @jeromekelleher. I think your interpretation is correct. The NS values are simply missing, not the end of a vector that needs padding/filling.

@timothymillar
Copy link
Collaborator

@jeromekelleher your interpretation looks correct to me. We should also test for a case where -2 should appear in an INFO field (e.g., with length A or R):

##INFO=<ID=IDX,Number=R,Type=Integer,Description="Index of each allele">
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO          FORMAT        SAMPLE1        SAMPLE2
1       100     .       A       C,T     .       .       IDX=0,1,2     GT:HQ         0|1            0|2 
1       200     .       A       C       .       .       IDX=0,1       GT:HQ         0|0            0|1

With max_alt_alleles=2, the IDX field should correspond to the (variants * alleles) array:

[
    [0, 1,  2],
    [0, 1, -2],
]

@jeromekelleher
Copy link
Collaborator Author

I think this is ready for a look now. There is a little duplication in the tests I've added for the sample VCF and existing ones which have been fixed up, but I think that's OK.

The change is a pretty noisy one I'm afraid, as there's been a few downstream things broken as well (#1195, #1196). I've temporarily skipped those tests to make this more manageable.

@jeromekelleher
Copy link
Collaborator Author

Sigh - skipping those tests pushes the required coverage below 100% so the build still fails.

Any suggestions here? Temporarily push coverage down, and create an issue to track setting it back to 100?

sgkit/tests/io/vcf/test_vcf_reader.py Outdated Show resolved Hide resolved
sgkit/tests/io/vcf/test_vcf_roundtrip.py Outdated Show resolved Hide resolved
@tomwhite
Copy link
Collaborator

Sigh - skipping those tests pushes the required coverage below 100% so the build still fails.

Any suggestions here? Temporarily push coverage down, and create an issue to track setting it back to 100?

I'm OK allowing coverage to dip as long as there is a path to get it back to 100%.

@jeromekelleher
Copy link
Collaborator Author

Thanks @tomwhite - I'm happy to go with your preferred option. If we can fix the other problems fairly easily in this PR then that would be simplest. I'm just worried about doing too many changes at once.

@jeromekelleher
Copy link
Collaborator Author

OK, this is ready for another look @tomwhite. I've temporarily worked around some fiddly corner cases - hopefully we'll still have 100% coverage.

#1197 is a tricky one, but I don't think it's worth getting hung up on.

@jeromekelleher
Copy link
Collaborator Author

OK! Looks like it's ready for a final look now @tomwhite.

@timothymillar, this is a reasonably big change so would be good to get your eyes on it as well if possible.

Copy link
Collaborator

@tomwhite tomwhite left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Collaborator

@timothymillar timothymillar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

sgkit/tests/io/vcf/test_vcf_reader.py Show resolved Hide resolved
@jeromekelleher jeromekelleher added the auto-merge Auto merge label for mergify test flight label Mar 4, 2024
@mergify mergify bot merged commit 6b10a77 into sgkit-dev:main Mar 5, 2024
8 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
auto-merge Auto merge label for mergify test flight
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants