Errors in missing vs pad values in VCF #1190

jeromekelleher · 2024-02-08T22:41:29Z

While developing the alternative implementation of vcf-to-zarr conversion in #1185, I found what I think are some bugs in how we currently handle missing data. Opening this PR for discussion purposes.

There are some related issues around handling mixed ploidy calls and string missing values, but let's leave those alone for now.

I realise now that the extra fields I added to the simple test here duplicate later tests, but they're handy for discussion for now anyway.

sgkit/io/vcf/vcf_reader.py

jeromekelleher · 2024-02-09T12:24:40Z

Basically I think we're returning FILL values when we should be returning missing. Consider INFO/NS in the example:

##fileformat=VCFv4.0
##fileDate=20090805
##source=myImputationProgramV3.1
##reference=1000GenomesPilot-NCBI36
##phasing=partial
##INFO=<ID=NS,Number=1,Type=Integer,Description="Number of Samples With Data">
##INFO=<ID=AN,Number=1,Type=Integer,Description="Total number of alleles in called genotypes">
##INFO=<ID=AC,Number=.,Type=Integer,Description="Allele count in genotypes, for each ALT allele, in the same order as listed">
##INFO=<ID=DP,Number=1,Type=Integer,Description="Total Depth">
##INFO=<ID=AF,Number=.,Type=Float,Description="Allele Frequency">
##INFO=<ID=AA,Number=1,Type=String,Description="Ancestral Allele">
##INFO=<ID=DB,Number=0,Type=Flag,Description="dbSNP membership, build 129">
##INFO=<ID=H2,Number=0,Type=Flag,Description="HapMap2 membership">
##FILTER=<ID=s50,Description="Less than 50% of samples have data">
##FILTER=<ID=q10,Description="Quality below 10">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
##FORMAT=<ID=GQ,Number=1,Type=Integer,Description="Genotype Quality">
##FORMAT=<ID=DP,Number=1,Type=Integer,Description="Read Depth">
##FORMAT=<ID=HQ,Number=2,Type=Integer,Description="Haplotype Quality">
##ALT=<ID=DEL:ME:ALU,Description="Deletion of ALU element">
##ALT=<ID=CNV,Description="Copy number variable region">
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO    FORMAT  NA00001 NA00002 NA00003
19      111     .       A       C       9.6     .       .       GT:HQ   0|0:10,15       0|0:10,10       0/1:3,3
19      112     .       A       G       10      .       .       GT:HQ   0|0:10,10       0|0:10,10       0/1:3,3
20      14370   rs6054257       G       A       29      PASS    NS=3;DP=14;AF=0.5;DB;H2 GT:GQ:DP:HQ     0|0:48:1:51,51  1|0:48:8:51,51  1/1:43:5:.,.
20      17330   .       T       A       3       q10     NS=3;DP=11;AF=0.017     GT:GQ:DP:HQ     0|0:49:3:58,50  0|1:3:5:65,3    0/0:41:3:.,.
20      1110696 rs6040355       A       G,T     67      PASS    NS=2;DP=10;AF=0.333,0.667;AA=T;DB       GT:GQ:DP:HQ     1|2:21:6:23,27  2|1:2:0:18,2    2/2:35:4:.,.
20      1230237 .       T       .       47      PASS    NS=3;DP=13;AA=T GT:GQ:DP:HQ     0|0:54:.:56,60  0|0:48:4:51,51  0/0:61:2:.,.
20      1234567 microsat1       G       GA,GAC  50      PASS    NS=3;DP=9;AA=G;AN=6;AC=3,1      GT:GQ:DP        0/1:.:4 0/2:17:2        ./.:40:3
20      1235237 .       T       .       .       .       .       GT      0/0     0|0     ./.
X       10      rsTest  AC      A,ATG,C 10      PASS    .       GT      0       0/1     0|2

We're currently returning [-2, -2, 3, 3, 2, 3, 3, -2, -2] when it should be [-1, -1, 3, 3, 2, 3, 3, -1, -1]. If the NS INFO field is not present, then surely that should be interpreted as missing data, not as a missing dimension in the current data. Dimension padding is a special case: the data is present, but the current row has a smaller dimension than the overall column.

That's my interpretation anyway - what do you think @tomwhite?

I'll fix up the tests if we agree there is a bug, but that's the essence of it.
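To make the distinction concrete, here is a minimal sketch of the expected behaviour for INFO/NS, using the INFO column of the example above. It is not the sgkit reader: `parse_info_int` is a hypothetical helper written for illustration, and the constants simply mirror the -1 (missing) and -2 (fill) values discussed here.

```python
import numpy as np

# Sentinel values as used in this discussion (defined locally for the sketch).
INT_MISSING = -1  # the value is absent for this record
INT_FILL = -2     # padding past the end of a shorter vector


def parse_info_int(info, key):
    """Hypothetical helper: return the integer value of `key` from a raw
    INFO string, or INT_MISSING if the key is absent or set to '.'."""
    if info == ".":
        return INT_MISSING
    for entry in info.split(";"):
        name, _, value = entry.partition("=")
        if name == key:
            return INT_MISSING if value in ("", ".") else int(value)
    return INT_MISSING


# INFO column of the nine records in the example VCF, in order.
info_column = [
    ".",
    ".",
    "NS=3;DP=14;AF=0.5;DB;H2",
    "NS=3;DP=11;AF=0.017",
    "NS=2;DP=10;AF=0.333,0.667;AA=T;DB",
    "NS=3;DP=13;AA=T",
    "NS=3;DP=9;AA=G;AN=6;AC=3,1",
    ".",
    ".",
]

variant_NS = np.array(
    [parse_info_int(info, "NS") for info in info_column], dtype=np.int32
)
print(variant_NS.tolist())  # [-1, -1, 3, 3, 2, 3, 3, -1, -1]
```

NS is a scalar field (Number=1), so there is nothing to pad and INT_FILL never appears in the result.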

tomwhite · 2024-02-09T12:37:03Z

Thanks for giving an example @jeromekelleher. I think your interpretation is correct. The NS values are simply missing, not the end of a vector that needs padding/filling.

timothymillar · 2024-02-14T20:59:25Z

@jeromekelleher your interpretation looks correct to me. We should also test for a case where -2 should appear in an INFO field (e.g., with length A or R):

##INFO=<ID=IDX,Number=R,Type=Integer,Description="Index of each allele">
#CHROM  POS     ID      REF     ALT     QUAL    FILTER  INFO          FORMAT        SAMPLE1        SAMPLE2
1       100     .       A       C,T     .       .       IDX=0,1,2     GT:HQ         0|1            0|2 
1       200     .       A       C       .       .       IDX=0,1       GT:HQ         0|0            0|1

With max_alt_alleles=2, the IDX field should correspond to the (variants * alleles) array:

[
    [0, 1,  2],
    [0, 1, -2],
]
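A rough sketch of how that padded array could be built, assuming the fill value is -2 and the row width is max_alt_alleles + 1 for a Number=R field. `pad_rows` is a hypothetical helper for illustration only, not part of sgkit, and it only covers rows where the field is present.

```python
import numpy as np

INT_FILL = -2  # padding past the end of a shorter vector (value assumed from this thread)


def pad_rows(rows, width):
    """Hypothetical helper: stack variable-length integer rows into a
    (variants, width) array, padding short rows with INT_FILL."""
    out = np.full((len(rows), width), INT_FILL, dtype=np.int32)
    for i, values in enumerate(rows):
        out[i, : len(values)] = values
    return out


max_alt_alleles = 2
# IDX values for the two records above (Number=R: one value per allele).
idx_values = [[0, 1, 2], [0, 1]]
variant_IDX = pad_rows(idx_values, width=max_alt_alleles + 1)
print(variant_IDX.tolist())  # [[0, 1, 2], [0, 1, -2]]
```

Here -2 is the right answer for the second record: the field is present, but that record has one fewer allele than the widest row.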

jeromekelleher · 2024-02-15T13:42:30Z

I think this is ready for a look now. There is a little duplication in the tests I've added for the sample VCF and existing ones which have been fixed up, but I think that's OK.

The change is a pretty noisy one I'm afraid, as a few downstream things were broken as well (#1195, #1196). I've temporarily skipped those tests to make this more manageable.

jeromekelleher · 2024-02-15T15:59:00Z

Sigh - skipping those tests pushes the required coverage below 100% so the build still fails.

Any suggestions here? Temporarily push coverage down, and create an issue to track setting it back to 100?

sgkit/tests/io/vcf/test_vcf_reader.py

sgkit/tests/io/vcf/test_vcf_roundtrip.py

tomwhite · 2024-02-16T09:42:06Z

> Sigh - skipping those tests pushes the required coverage below 100% so the build still fails.
>
> Any suggestions here? Temporarily push coverage down, and create an issue to track setting it back to 100?

I'm OK allowing coverage to dip as long as there is a path to get it back to 100%.

jeromekelleher · 2024-02-16T09:48:02Z

Thanks @tomwhite - I'm happy to go with your preferred option. If we can fix the other problems fairly easily in this PR then that would be simplest. I'm just worried about doing too many changes at once.

jeromekelleher · 2024-02-16T11:22:33Z

OK, this is ready for another look @tomwhite. I've temporarily worked around some fiddly corner cases - hopefully we'll still have 100% coverage.

#1197 is a tricky one, but I don't think it's worth getting hung up on.

jeromekelleher · 2024-02-16T12:49:22Z

OK! Looks like it's ready for a final look now @tomwhite.

@timothymillar, this is a reasonably big change so would be good to get your eyes on it as well if possible.

tomwhite

LGTM

timothymillar

LGTM!

tomwhite reviewed Feb 9, 2024

sgkit/io/vcf/vcf_reader.py (outdated, resolved)

jeromekelleher mentioned this pull request Feb 9, 2024

VCF parser returning FILL instead of MISSING values #1192

Closed

jeromekelleher mentioned this pull request Feb 15, 2024

vcf_writer emitting incorrect missing values for INFO fields #1196

Closed

jeromekelleher force-pushed the vcf-missing-data-issues branch from c77857c to 7a0035d Compare February 15, 2024 13:39

jeromekelleher marked this pull request as ready for review February 15, 2024 13:40

jeromekelleher force-pushed the vcf-missing-data-issues branch from 7a0035d to f7f8aed Compare February 15, 2024 16:01

tomwhite reviewed Feb 16, 2024

sgkit/tests/io/vcf/test_vcf_reader.py (outdated, resolved)

sgkit/tests/io/vcf/test_vcf_roundtrip.py (outdated, resolved)

jeromekelleher mentioned this pull request Feb 16, 2024

Cannot round-trip explicitly set missing INFO values in VCF #1197

Closed

jeromekelleher force-pushed the vcf-missing-data-issues branch from f7f8aed to 57d84d1 Compare February 16, 2024 11:06

tomwhite approved these changes Feb 16, 2024

jeromekelleher force-pushed the vcf-missing-data-issues branch from 8d6a760 to a652a0c Compare February 16, 2024 12:22

tomwhite approved these changes Feb 16, 2024

timothymillar approved these changes Mar 4, 2024

sgkit/tests/io/vcf/test_vcf_reader.py (resolved)

jeromekelleher added the auto-merge label Mar 4, 2024

jeromekelleher added 7 commits March 5, 2024 09:12

Errors in missing vs pad values in VCF

b61a09b

Closes sgkit-dev#1192

Skip tests for reported bugs sgkit-dev#1195, sgkit-dev#1196

43d2f75

Mask VCF missing rather than fill values

355ea2b

Closes sgkit-dev#1196

Enable tests, and skip problematic round trip

b5708ed

Skip problematic fields in scikit-allel comparison

a0edba7

Temporary workaround to get tests to pass before addressing sgkit-dev#1195

Remove all-missing fields from example VCF

b03b952

Closes sgkit-dev#1197

Simplify VCF reader logic

c58bb62

Remove distinction between present with missing value, versus missing

jeromekelleher force-pushed the vcf-missing-data-issues branch from a652a0c to c58bb62 Compare March 5, 2024 09:12

mergify bot merged commit 6b10a77 into sgkit-dev:main Mar 5, 2024
8 checks passed
