IO for genomic file formats #6082

Closed
ghost opened this issue Apr 10, 2020 · 25 comments

Comments

@ghost

ghost commented Apr 10, 2020

Greetings,

I would like to implement IO for some genomic file formats. The formats I want to read directly into Dask dataframes follow the SAM specification as well as the VCF specification. TL;DR on the specs: tab-delimited files with a large header containing global state for all of the records in the file.

I have a denormalized schema for the data in the files, meaning that the header information is copied into a column in every record. Also, most of the time these files are compressed using block gzip (bgzip) and indexed with an R-tree for random access. The challenge is that I would like to provide code that reads directly from these files into a Dask dataframe following the denormalized schema. Are there base classes I can implement for this?

@mrocklin
Member

Hi @Jamesthegiantpeach.

Broadly you should look through https://docs.dask.org/en/latest/dataframe-create.html

In particular I suspect that you'll want to use Dask delayed with Dask dataframe, particularly if you already have Python functions that can create Pandas dataframes from chunks of your file.

For tab-delimited files and remote access (S3, HDFS, GCS) you probably also want to look at read_bytes, and note the delimiter keyword.
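
For illustration, a minimal sketch of that combination; the S3 path, the parsing step, and the column handling are placeholders rather than anything specific to your files:

import io

import dask
import dask.bytes
import dask.dataframe as dd
import pandas as pd

# Split a tab-delimited file on S3 into blocks that end on newline boundaries.
sample, blocks = dask.bytes.read_bytes(
    "s3://my-bucket/records.sam",   # hypothetical path
    delimiter=b"\n",                # blocks break on record boundaries
)

def parse_block(data: bytes) -> pd.DataFrame:
    # Placeholder parser; a real reader would also interpret the SAM/VCF header.
    return pd.read_csv(io.BytesIO(data), sep="\t", header=None, comment="@")

parts = [dask.delayed(parse_block)(b) for block_list in blocks for b in block_list]
df = dd.from_delayed(parts)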

I think that having an open source reader for SAM and VCF files for Dask would be great. Is this work that you're planning on / open to publishing?

@ghost
Author

ghost commented Apr 10, 2020

Hi @mrocklin

Thanks for the links! These are helpful.

The files are almost always stored compressed with an index that specifies byte offsets for extracting random records from the files without using DEFLATE on the entire file. So I think the blocks would have to be calculated from the file's index and extracted using some kind of S3 file seeker. I don't know if the s3fs lib that Dask currently uses supports byte range queries on GET requests to S3.

I would like to have the community's help maintaining the IO libraries. It's a little more complicated than a dd.from_vcf() and dd.to_vcf() function, though. There can be two schemas associated with a VCF file, for instance: a variant schema and a genotype schema. I think I can take care of this with an argument like dd.from_vcf(schema='genotype'), but I'm not sure yet.

I want to write up the denormalized table schemas that the files need to be converted to and circle back to this thread with that info.

Note: There are some useful schemas defined in this project.
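
To make the schema-selection idea above concrete, here is a purely hypothetical signature (dd.from_vcf does not exist in Dask; this is only a sketch of the interface I have in mind):

import dask.dataframe as dd

# Hypothetical interface only -- nothing like this exists in Dask today.
def from_vcf(path, schema="variant", storage_options=None) -> dd.DataFrame:
    """Read a bgzipped, indexed VCF into a Dask dataframe.

    schema="variant"  -> one row per variant site
    schema="genotype" -> one row per (variant, sample) genotype call
    """
    ...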

@ihnorton
Contributor

In case it is of interest, TileDB has an open-source VCF ingestion and query library here which models variants as sparse TileDB arrays. The library provides a Python API which returns query results as dask (delayed) data frames built on top of pandas DFs, similar to mrocklin's suggestion (internally the results are actually returned as Arrow buffers, which are also used for the Spark API and could be plumbed in to other Arrow-friendly languages like R).
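
Roughly, using that library with Dask might look like the following; the dataset URI, attribute names, and region strings are assumptions based on the TileDB-VCF docs, not anything specific to this thread:

import dask
import dask.dataframe as dd
import tiledbvcf  # TileDB-VCF Python bindings

def read_region(uri, region):
    # Open the TileDB-VCF dataset and pull one genomic region as a pandas frame.
    ds = tiledbvcf.Dataset(uri, mode="r")
    return ds.read(attrs=["sample_name", "pos_start", "fmt_GT"], regions=[region])

uri = "s3://my-bucket/my-vcf-dataset"          # placeholder
regions = ["chr1:1-10000000", "chr1:10000001-20000000"]
parts = [dask.delayed(read_region)(uri, r) for r in regions]
df = dd.from_delayed(parts)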

@martindurant
Member

I don't know if the s3fs lib that Dask currently uses supports byte range queries on GET requests to S3.

Yes, you can treat the files just like ordinary file-like objects and seek will work as expected.
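
For example, a ranged read through s3fs looks like an ordinary file operation; the bucket, key, and offsets below are made up:

import s3fs

fs = s3fs.S3FileSystem()
with fs.open("s3://my-bucket/sample.vcf.gz", "rb") as f:
    f.seek(1_048_576)        # jump to an offset taken from the file's index
    block = f.read(65_536)   # only this byte range is fetched from S3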

@ghost
Author

ghost commented Apr 10, 2020

@ihnorton

Is TileDB maintaining a Dask fork?

import dask
import dask.array as da

array = da.from_tiledb('s3://my-bucket/my-dense-array',
                        attribute='attr',
                        storage_options={'vfs.s3.aws_access_key_id': 'mykeyid',
                                         'vfs.s3.aws_secret_access_key': 'mysecret'})
print(array.mean().compute())

EDIT: Never mind, I found the code here.

I think this is an interesting approach. I personally think the SAM and VCF specs are hacky, and I have been trying to get the comp bio community to drop them. It might be better to eschew the specs entirely: just convert the VCF files to a proper columnar on-disk interchange format like Parquet, then ingest from Parquet into Arrow record batches, buffers, etc.

@mrocklin
Member

mrocklin commented Apr 10, 2020

So to be explicit about @martindurant 's point, you could do something like the following:

import dask
import dask.dataframe
import pandas


def read_sam(filename, open=open) -> dask.dataframe.DataFrame:
    # Parse the header once up front (size_of_header / process_header are
    # placeholders for format-specific logic).
    with open(filename) as f:
        header = f.read(size_of_header)
        header = process_header(header)
    index_locations = ...

    # One delayed task per indexed byte range.
    partitions = [
        dask.delayed(read_part)(filename, start, stop, open=open)
        for start, stop in index_locations
    ]

    return dask.dataframe.from_delayed(partitions)


def read_part(filename, start, stop, open=open) -> pandas.DataFrame:
    # Seek to this partition's byte offset and read only its bytes.
    with open(filename) as f:
        f.seek(start)
        data = f.read(stop - start)

    df = process(data)
    return df

@mrocklin
Member

Misfire on commenting earlier. Updated.

@ghost
Author

ghost commented Apr 10, 2020

Thanks, this makes sense.

I need to rethink this. It might not be a great idea to simply import the VCF records as rows in a dd, since in any given human genome the overwhelming majority of locations will match the reference genome (homozygous reference, 0/0, etc.), causing 99% of the rows to be wasted space. The approach here makes a lot of sense to me. I am going to tinker with TileDB and the Dask integration today and circle back.

@mrocklin
Member

mrocklin commented Apr 10, 2020 via email

@ghost
Author

ghost commented Apr 11, 2020

TileDB works well for my needs, and the integration with dask arrays works nicely. I think any IO efforts from me should go into helping with IO of genomic files over there and just using the tdb/dask integration.

Cool project.

Also, as a side note to TileDB people: I am able to version control a changing array (adding/subtracting individuals from the cohort) by storing TileDB files in a Pachyderm repository. Time traveling then looks like the usual git semantics. This works even if you've merged the fragment arrays. You can revert back to before you merged the fragments without losing any data.

Anyway, thanks for the help everyone.

@mrocklin
Member

TileDB works well for my needs, and the integration with dask arrays works nicely. I think any IO efforts from me should go into helping with IO of genomic files over there and just using the tdb/dask integration.

I'm glad to hear it! Thanks for jumping in @ihnorton .

@Jamesthegiantpeach if you ever have time to write about your work, we'd love a blogpost (even if only very brief) at https://github.com/dask/dask-blog (same goes for you @ihnorton )

@ghost
Author

ghost commented May 22, 2020

I ended up toying around with doing this straight from the files in my spare time, and I will probably migrate over to using TileDB when I can.

Sharding these files ends up being fairly complicated because of how some aligned reads hang over the coordinates that you're binning on.

---- ---             ---------
 ----      -----        -------
       ---- ----   ----- ---- 
             ------   ----    ------      ------ ------                          
   ----- ----- ----- -----      ----
|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|

Some of the reads hang over the bin edges, and you have to make sure these fragments end up going to both bins. There are a bunch of other weird things to worry about when splitting and merging (like deduping around the bin edges).
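
As a toy illustration of the bookkeeping (bin size, read IDs, and coordinates are made up): each read gets assigned to every bin it overlaps, and anything that stitches neighbouring bins back together has to dedupe on read ID.

BIN_SIZE = 1_000

reads = [
    ("read1", 950, 1_050),   # hangs over the edge between bin 0 and bin 1
    ("read2", 100, 200),     # falls entirely inside bin 0
]

def bins_for(start, end, bin_size=BIN_SIZE):
    # Every bin index that the half-open interval [start, end) touches.
    return range(start // bin_size, (end - 1) // bin_size + 1)

assignments = {}
for read_id, start, end in reads:
    for b in bins_for(start, end):
        assignments.setdefault(b, []).append(read_id)

print(assignments)   # {0: ['read1', 'read2'], 1: ['read1']} -- read1 is duplicated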

Further, I had to write some C++ glue between htslib and Arrow so that the IO to pandas was reasonably efficient. Seems like a dependency that isn't appropriate for this repo.

I think this probably warrants its own project. It would look something like scikit-allel, but maybe more centered around Apache Arrow to cut down on how much data pandas tends to materialize unnecessarily.

@jakirkham
Member

Is Hi-C also relevant to this discussion? Or is that not really being considered here?

@jakirkham
Member

cc @alimanfoo (who works on scikit-allel in case he has thoughts on any of this 🙂)

@ghost
Author

ghost commented Jul 13, 2020

@jakirkham Hi-C is relevant if it uses the standard hts formats. I don't have any experience with Hi-C.

I have used allel, and I dig it. One thing that I think is missing is an efficient disk format. It would be interesting to introduce some classes to allel that get you zero-copy data from S3 via TileDB.

@quasiben
Member

@nvictus and I briefly chatted about Hi-C today. He also mentioned bigWig and bigBed.

@jakirkham
Member

I have used allel, and I dig it. One thing that I think is missing is an efficient disk format.

AIUI this is one of the motivations behind Zarr ( cggh/scikit-allel#102 ). Though @alimanfoo please feel free to jump in and correct me if I'm misunderstanding things.

@alimanfoo
Contributor

Hi folks, thanks for tagging me in. Just to mention that we commonly convert VCF to zarr via scikit-allel's vcf_to_zarr() function, following a process like the one shown in this worked example. We then upload the data to GCS using gsutil, access it via gcsfs, and analyse it via dask. Here's an example notebook running a simple popgen analysis, runnable via Google Colab (although it will go much faster if you have your own dask-kubernetes cluster in Google Cloud :-).

TileDB may be an equally good or better route to dask for VCF; I'd be interested in any experience there.
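
A condensed sketch of that pipeline (file names, bucket, and the field selection are placeholders):

import allel
import dask.array as da
import gcsfs
import zarr

# Convert locally, then (after `gsutil cp -r example.zarr gs://my-bucket/`)
# read the same store back over GCS.
allel.vcf_to_zarr("example.vcf.gz", "example.zarr", fields="*", overwrite=True)

gcs = gcsfs.GCSFileSystem()
store = gcs.get_mapper("my-bucket/example.zarr")
root = zarr.open_group(store, mode="r")

# Lazy dask array over the genotype calls, ready for chunked analysis.
gt = da.from_array(root["calldata/GT"], chunks=root["calldata/GT"].chunks)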

@ghost
Author

ghost commented Jul 13, 2020

Cool! I didn't know about zarr. Does anyone know the main differences between zarr and TileDB?

@ghost
Author

ghost commented Jul 13, 2020

Looks like Stavros wrote something that mentions zarr

https://medium.com/tiledb/tiledb-a-database-for-data-scientists-ddf4ca122176

Is the main difference the native support for sparse arrays?

Edit: from what I've grokked on forums, it looks like zarr is more of a file format and TileDB is more of an engine for multidimensional indices?

@nvictus

nvictus commented Jul 13, 2020

Thanks @quasiben! FYI, I'm also the developer of the cooler format for Hi-C data mentioned above. It implements a sparse matrix data model using HDF5 primitives, and by extension Zarr works too (it is currently almost a drop-in replacement; we plan to support it as an alternative backend). You can see cooler data displayed in HiGlass.

@nvictus

nvictus commented Jul 14, 2020

As I was telling @quasiben, similar to the bgzipped binary htslib formats, bigWig and bigBed (developed as part of the UCSC genome browser) are binary, compressed, and indexed. However, they also store multiple "zoom levels" of summary statistics, since they were designed for browsing. With a simple pandas range-query interface they could be trivially turned into dask dataframes. These are more "downstream" formats than BAM, as they eschew specific sequence or alignment details. Their tab-delimited brethren are wiggle/bedGraph, which describes a step function along the genome, and BED, which describes a set of intervals that may overlap.
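
A hypothetical range-query-to-dask sketch (pyBigWig is just one library that can do the range queries; the path and the 10 Mb partitioning are made up):

import dask
import dask.dataframe as dd
import pandas as pd
import pyBigWig

def read_bigwig_range(path, chrom, start, end):
    # One partition per genomic range; intervals() yields (start, end, value).
    bw = pyBigWig.open(path)
    ivs = bw.intervals(chrom, start, end) or []
    bw.close()
    return pd.DataFrame(ivs, columns=["start", "end", "value"]).assign(chrom=chrom)

ranges = [("chr1", i, i + 10_000_000) for i in range(0, 100_000_000, 10_000_000)]
parts = [dask.delayed(read_bigwig_range)("signal.bw", c, s, e) for c, s, e in ranges]
df = dd.from_delayed(parts)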

I should also mention another htslib format that is historically interesting: tabix. It uses bgzip to compress a TSV file and builds an index over it, so you can read remote chunks using HTTP range requests. Though not super common, I see it sometimes used with VCF text files as an alternative to BCF. The main takeaway is that in genomics, people tend to prefer large indexed files to sharded ones. That's partly because we deal with so many moderately big datasets.
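
A minimal sketch of reading a remote tabix-indexed file by region and stitching the regions into a Dask dataframe (pysam is just one common reader; the URL and regions are placeholders):

import dask
import dask.dataframe as dd
import pandas as pd
import pysam

def read_tabix_region(url, chrom, start, end):
    # htslib fetches only the compressed blocks covering the region,
    # using HTTP range requests for remote files.
    tbx = pysam.TabixFile(url)
    rows = [line.split("\t") for line in tbx.fetch(chrom, start, end)]
    tbx.close()
    return pd.DataFrame(rows)

regions = [("chr1", 0, 50_000_000), ("chr1", 50_000_000, 100_000_000)]
parts = [dask.delayed(read_tabix_region)("https://example.org/data.vcf.gz", c, s, e)
         for c, s, e in regions]
df = dd.from_delayed(parts)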

@ghost
Author

ghost commented Jul 14, 2020

With respect to tabix: I believe tabix just uses an R-tree to look up records that fall in a k-cell. I think TileDB supports that natively.

@alimanfoo
Contributor

Cool! I didn't know about zarr. Does anyone know the main differences between zarr and TileDB?

Some discussion here and here.

@martindurant
Member

Given the extensive discussion here, I don't think there's much to add.
I will, however, briefly mention https://github.com/intake/fsspec-reference-maker/, which would allow (parallel) access to chunks of a binary file when the set of offsets is easily available. That can be used in conjunction with Zarr to read binary data from some other file format. Worth a thought!
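
A sketch of the reference-filesystem idea, assuming a references.json produced by fsspec-reference-maker and chunks living on S3 (all paths are placeholders):

import fsspec
import zarr

# The JSON maps Zarr chunk keys to (url, offset, length) tuples, so Zarr can
# read slices of some other binary file as if they were its own chunks.
fs = fsspec.filesystem(
    "reference",
    fo="references.json",
    remote_protocol="s3",
)
store = fs.get_mapper("")
group = zarr.open_group(store, mode="r")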
