In `vcf_to_zarr`, specifying `max_alt_alleles` smaller than the actual maximum is meant to truncate the alleles dimension (with a warning). However, if there is an INFO field with `Number=R` (one value per allele, including the reference), this results in an `IndexError`: the array for the field is allocated with only `max_alt_alleles + 1` columns, but the handler still iterates over every value in the record. Here is a minimal example:
`example.vcf` with two alternate alleles:

```
##fileformat=VCFv4.3
##FILTER=<ID=PASS,Description="All filters passed">
##fileDate=20240307
##source=None
##contig=<ID=chr1,length=21898217>
##INFO=<ID=AFP,Number=R,Type=Float,Description="Posterior mean allele frequencies">
##FORMAT=<ID=GT,Number=1,Type=String,Description="Genotype">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT SAMPLE0
chr1 1 . A T,G . . AFP=0.8,0.1,0.1 GT 0/1
```
`convert.py`:

```python
from sgkit.io.vcf import vcf_to_zarr

vcf_to_zarr(
    input="example.vcf",
    output="example.zarr",
    max_alt_alleles=1,  # truncate to one alternate allele
    fields=[
        "INFO/AFP",
        "FORMAT/GT",
    ],
)
```
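For comparison, here is roughly what I would expect on a successful run, assuming `sgkit.load_dataset` for reading the result back (the shape comment is my expectation, not verified output):

```python
import sgkit

ds = sgkit.load_dataset("example.zarr")
# With max_alt_alleles=1 the alleles dimension should have size 2 (REF + 1 ALT),
# so variant_AFP should be truncated to shape (variants, 2) with a warning.
print(ds.sizes)
```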
Instead, running `convert.py` produces this traceback:

```
---------------------------------------------------------------------------
IndexError Traceback (most recent call last)
Cell In[8], line 1
----> 1 vcf_to_zarr(
2 input="example.vcf",
3 output="example.zarr",
4 max_alt_alleles=1,
5 fields=[
6 "INFO/AFP",
7 "FORMAT/GT",
8 ]
9 )
File /path/to/sgkit/sgkit/io/vcf/vcf_reader.py:1002, in vcf_to_zarr(input, output, target_part_size, regions, chunk_length, chunk_width, compressor, encoding, temp_chunk_length, tempdir, tempdir_storage_options, ploidy, mixed_ploidy, truncate_calls, max_alt_alleles, fields, exclude_fields, field_defs, read_chunk_length, retain_temp_files)
966 sequential_function = functools.partial(
967 vcf_to_zarr_sequential,
968 output=output,
(...)
980 field_defs=field_defs,
981 )
982 parallel_function = functools.partial(
983 vcf_to_zarr_parallel,
984 output=output,
(...)
1000 retain_temp_files=retain_temp_files,
1001 )
-> 1002 process_vcfs(
1003 input,
1004 sequential_function,
1005 parallel_function,
1006 regions=regions,
1007 target_part_size=target_part_size,
1008 )
1010 # Issue a warning if max_alt_alleles caused data to be dropped
1011 ds = zarr.open(output)
File /path/to/sgkit/sgkit/io/vcf/vcf_reader.py:1328, in process_vcfs(input, sequential_function, parallel_function, regions, target_part_size)
1320 regions = [
1321 partition_into_regions(input, target_part_size=target_part_size)
1322 for input in inputs
1323 ]
1325 if (isinstance(input, str) or isinstance(input, Path)) and (
1326 regions is None or isinstance(regions, str)
1327 ):
-> 1328 return sequential_function(input=input, region=regions)
1329 else:
1330 return parallel_function(input=input, regions=regions)
File /path/to/sgkit/sgkit/io/vcf/vcf_reader.py:516, in vcf_to_zarr_sequential(input, output, region, chunk_length, chunk_width, compressor, encoding, ploidy, mixed_ploidy, truncate_calls, max_alt_alleles, fields, exclude_fields, field_defs, read_chunk_length)
514 raise ValueError(f"Filter '{f}' is not defined in the header.")
515 for field_handler in field_handlers:
--> 516 field_handler.add_variant(i, variant)
518 # Truncate np arrays (if last chunk is smaller than read_chunk_length)
519 if i + 1 < read_chunk_length:
File /path/to/sgkit/sgkit/io/vcf/vcf_reader.py:293, in InfoAndFormatFieldHandler.add_variant(self, i, variant)
291 try:
292 for j, v in enumerate(val):
--> 293 self.array[i, j] = (
294 v if v is not None else self.missing_value
295 )
296 except TypeError: # val is a scalar
297 self.array[i, 0] = val
IndexError: index 2 is out of bounds for axis 1 with size 2
```
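The failing loop in `InfoAndFormatFieldHandler.add_variant` copies every value from the record into an array whose alleles axis was allocated with `max_alt_alleles + 1` columns, so the third AFP value (index 2) overruns the size-2 axis. A minimal sketch of a possible fix, clamping the loop to the allocated width (this is an assumption about the fix, not the actual sgkit patch):

```python
# Sketch: `self.array`, `i`, `val`, and `self.missing_value` are the names
# used in vcf_reader.py above; the `j >= n` clamp is the hypothetical change.
try:
    n = self.array.shape[1]  # max_alt_alleles + 1 columns for Number=R fields
    for j, v in enumerate(val):
        if j >= n:
            break  # drop values for truncated alleles, matching the warning
        self.array[i, j] = v if v is not None else self.missing_value
except TypeError:  # val is a scalar
    self.array[i, 0] = val
```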
timothymillar added the bug and IO labels on Mar 7, 2024.
Possibly a similar bug with a FORMAT field which is length R. Getting some random crashes which may be due to out of bounds memory access in Numba code.
That's odd - the VCF parsing code doesn't use any Numba, AFAIK.
With an updated, slightly larger dataset, including the length-R FORMAT field resulted in `ValueError: Codec does not support buffers of > 2147483647 bytes`. This led me to zarr-developers/zarr-python#487, and with smaller chunk sizes I was able to convert the VCF. I'm not sure why it was crashing earlier.
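For reference, a minimal sketch of the chunk-size workaround, assuming the `chunk_length` and `chunk_width` parameters from the `vcf_to_zarr` signature in the traceback above (the file names and chunk values here are hypothetical, not the ones I used):

```python
from sgkit.io.vcf import vcf_to_zarr

# Smaller chunks keep each compressed buffer under the codec's
# 2147483647-byte (2 GiB) limit reported in the ValueError.
vcf_to_zarr(
    input="larger.vcf.gz",   # hypothetical larger dataset
    output="larger.zarr",
    chunk_length=10_000,     # variants per chunk
    chunk_width=1_000,       # samples per chunk
    fields=["INFO/AFP", "FORMAT/GT"],
)
```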