IO for genomic file formats #6082

Closed
ghost opened this issue Apr 10, 2020 · 25 comments

Comments

@ghost

ghost commented Apr 10, 2020

Greetings,

I would like to implement IO for some genomic file formats. The formats I want to read directly into Dask dataframes follow the SAM specification as well as the VCF specification. TL;DR on the specs: tab-delimited files with a large header containing global state for all of the records in the file.

I have a denormalized schema for the data in the files, meaning that the header information is copied into a column in every record. Also, most of the time these files are compressed using block gzip (bgzip) and indexed with an R-tree for random access. The challenge is that I would like to provide code that reads directly from these files into a Dask dataframe following the denormalized schema. Are there base classes I can implement for this?

@mrocklin
Member

Hi @Jamesthegiantpeach.

Broadly you should look through https://docs.dask.org/en/latest/dataframe-create.html

In particular I suspect that you'll want to use Dask delayed with Dask dataframe, particularly if you already have Python functions that can create Pandas dataframes from chunks of your file.

For tab-delimited files and remote access (S3, HDFS, GCS) you probably also want to look at read_bytes, and note the delimiter keyword.
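
For illustration, a minimal sketch of that combination; the S3 path, the parsing step, and the column handling are placeholders rather than anything specific to your files:

import io

import dask
import dask.bytes
import dask.dataframe as dd
import pandas as pd

# Split a tab-delimited file on S3 into blocks that end on newline boundaries.
sample, blocks = dask.bytes.read_bytes(
    "s3://my-bucket/records.sam",   # hypothetical path
    delimiter=b"\n",                # blocks break on record boundaries
)

def parse_block(data: bytes) -> pd.DataFrame:
    # Placeholder parser; a real reader would also interpret the SAM/VCF header.
    return pd.read_csv(io.BytesIO(data), sep="\t", header=None, comment="@")

parts = [dask.delayed(parse_block)(b) for block_list in blocks for b in block_list]
df = dd.from_delayed(parts)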

I think that having an open source reader for SAM and VCF files for Dask would be great. Is this work that you're planning on / open to publishing?

@ghost
Author

ghost commented Apr 10, 2020

Hi @mrocklin

Thanks for the links! These are helpful.

The files are almost always stored compressed with an index that specifies byte offsets for extracting random records from the files without using DEFLATE on the entire file. So I think the blocks would have to be calculated from the file's index and extracted using some kind of S3 file seeker. I don't know if the s3fs lib that Dask currently uses supports byte range queries on GET requests to S3.

I would like to have the community's help maintaining the IO libraries. It's a little more complicated than a dd.from_vcf() and dd.to_vcf() function, though. There can be two schemas associated with a VCF file, for instance: a variant schema and a genotype schema. I think I can take care of this with an argument like dd.from_vcf(schema='genotype'), but I'm not sure yet.

I want to write up the denormalized table schemas that the files need to be converted to and circle back to this thread with that info.

Note: There are some useful schemas defined in this project.
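
To make the schema-selection idea above concrete, here is a purely hypothetical signature (dd.from_vcf does not exist in Dask; this is only a sketch of the interface I have in mind):

import dask.dataframe as dd

# Hypothetical interface only -- nothing like this exists in Dask today.
def from_vcf(path, schema="variant", storage_options=None) -> dd.DataFrame:
    """Read a bgzipped, indexed VCF into a Dask dataframe.

    schema="variant"  -> one row per variant site
    schema="genotype" -> one row per (variant, sample) genotype call
    """
    ...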

@ihnorton
Contributor

In case it is of interest, TileDB has an open-source VCF ingestion and query library here which models variants as sparse TileDB arrays. The library provides a Python API which returns query results as dask (delayed) data frames built on top of pandas DFs, similar to mrocklin's suggestion (internally the results are actually returned as Arrow buffers, which are also used for the Spark API and could be plumbed in to other Arrow-friendly languages like R).
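
Roughly, using that library with Dask might look like the following; the dataset URI, attribute names, and region strings are assumptions based on the TileDB-VCF docs, not anything specific to this thread:

import dask
import dask.dataframe as dd
import tiledbvcf  # TileDB-VCF Python bindings

def read_region(uri, region):
    # Open the TileDB-VCF dataset and pull one genomic region as a pandas frame.
    ds = tiledbvcf.Dataset(uri, mode="r")
    return ds.read(attrs=["sample_name", "pos_start", "fmt_GT"], regions=[region])

uri = "s3://my-bucket/my-vcf-dataset"          # placeholder
regions = ["chr1:1-10000000", "chr1:10000001-20000000"]
parts = [dask.delayed(read_region)(uri, r) for r in regions]
df = dd.from_delayed(parts)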

@martindurant
Member

I don't know if the s3fs lib that Dask currently uses supports byte range queries on GET requests to S3.

Yes, you can treat the files just like ordinary file-like objects and seek will work as expected.
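
For example, a ranged read through s3fs looks like an ordinary file operation; the bucket, key, and offsets below are made up:

import s3fs

fs = s3fs.S3FileSystem()
with fs.open("s3://my-bucket/sample.vcf.gz", "rb") as f:
    f.seek(1_048_576)        # jump to an offset taken from the file's index
    block = f.read(65_536)   # only this byte range is fetched from S3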

@ghost
Author

ghost commented Apr 10, 2020

@ihnorton

Is TileDB maintaining a Dask fork?

import dask
import dask.array as da

array = da.from_tiledb('s3://my-bucket/my-dense-array',
                        attribute='attr',
                        storage_options={'vfs.s3.aws_access_key_id': 'mykeyid',
                                         'vfs.s3.aws_secret_access_key': 'mysecret'})
print(array.mean().compute())

EDIT: Never mind, I found the code here.

I think this is an interesting approach. I personally think the SAM and VCF specs are hacky, and I have been trying to get the comp bio community to drop them. It might be better to eschew the specs entirely: just convert the VCF files to a proper columnar on-disk interchange format like Parquet, then ingest from Parquet into Arrow record batches, buffers, etc.

@mrocklin
Member

mrocklin commented Apr 10, 2020

So to be explicit about @martindurant 's point, you could do something like the following:

import dask
import dask.dataframe
import pandas


def read_sam(filename, open=open) -> dask.dataframe.DataFrame:
    # Parse the header once up front (size_of_header / process_header are
    # placeholders for format-specific logic).
    with open(filename) as f:
        header = f.read(size_of_header)
        header = process_header(header)
    index_locations = ...

    # One delayed task per indexed byte range.
    partitions = [
        dask.delayed(read_part)(filename, start, stop, open=open)
        for start, stop in index_locations
    ]

    return dask.dataframe.from_delayed(partitions)


def read_part(filename, start, stop, open=open) -> pandas.DataFrame:
    # Seek to this partition's byte offset and read only its bytes.
    with open(filename) as f:
        f.seek(start)
        data = f.read(stop - start)

    df = process(data)
    return df

@mrocklin
Member

Misfire on commenting earlier. Updated.

@ghost
Author

ghost commented Apr 10, 2020

Thanks, this makes sense.

I need to rethink this. It might not be a great idea to simply import the VCF records as rows in a dd, since in any given human genome the overwhelming majority of locations will match the reference genome (homozygous reference, 0/0, etc.), causing 99% of the rows to be wasted space. The approach here makes a lot of sense to me. I am going to tinker with TileDB and the Dask integration today and circle back.

@mrocklin
Member

mrocklin commented Apr 10, 2020 via email

@ghost
Author

ghost commented Apr 11, 2020

TileDB works well for my needs, and the integration with dask arrays works nicely. I think any IO efforts from me should go into helping with IO of genomic files over there and just using the tdb/dask integration.

Cool project.

Also, as a side note to TileDB people: I am able to version control a changing array (adding/subtracting individuals from the cohort) by storing TileDB files in a Pachyderm repository. Time traveling then looks like the usual git semantics. This works even if you've merged the fragment arrays. You can revert back to before you merged the fragments without losing any data.

Anyway, thanks for the help everyone.

@mrocklin
Member

TileDB works well for my needs, and the integration with dask arrays works nicely. I think any IO efforts from me should go into helping with IO of genomic files over there and just using the tdb/dask integration.

I'm glad to hear it! Thanks for jumping in @ihnorton .

@Jamesthegiantpeach if you ever have time to write about your work, we'd love a blogpost (even if only very brief) at https://github.com/dask/dask-blog (same goes for you @ihnorton )

@ghost
Author

ghost commented May 22, 2020

I ended up toying around with doing this straight from the files in my spare time, and I will probably migrate over to using TileDB when I can.

Sharding these files ends up being fairly complicated because of how some aligned reads hang over the coordinates that you're binning on.

---- ---             ---------
 ----      -----        -------
       ---- ----   ----- ---- 
             ------   ----    ------      ------ ------                          
   ----- ----- ----- -----      ----
|-----|-----|-----|-----|-----|-----|-----|-----|-----|-----|

Some of the reads hang over the bin edges, and you have to make sure these fragments end up going to both bins. There are a bunch of other weird things to worry about when splitting and merging (like deduping around the bin edges).
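
As a toy illustration of the bookkeeping (bin size, read IDs, and coordinates are made up): each read gets assigned to every bin it overlaps, and anything that stitches neighbouring bins back together has to dedupe on read ID.

BIN_SIZE = 1_000

reads = [
    ("read1", 950, 1_050),   # hangs over the edge between bin 0 and bin 1
    ("read2", 100, 200),     # falls entirely inside bin 0
]

def bins_for(start, end, bin_size=BIN_SIZE):
    # Every bin index that the half-open interval [start, end) touches.
    return range(start // bin_size, (end - 1) // bin_size + 1)

assignments = {}
for read_id, start, end in reads:
    for b in bins_for(start, end):
        assignments.setdefault(b, []).append(read_id)

print(assignments)   # {0: ['read1', 'read2'], 1: ['read1']} -- read1 is duplicated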

Further, I had to write some C++ glue between htslib and Arrow so that the IO to pandas was reasonably efficient. Seems like a dependency that isn't appropriate for this repo.

I think this probably warrants its own project. It would look something like scikit-allel, but maybe more centered around Apache Arrow to cut down on how much data pandas tends to materialize unnecessarily.

@jakirkham
Member

Is Hi-C also relevant to this discussion? Or is that not really being considered here?

@jakirkham
Member

cc @alimanfoo (who works on scikit-allel in case he has thoughts on any of this 🙂)

@ghost
Author

ghost commented Jul 13, 2020

@jakirkham Hi-C is relevant if it uses the standard hts formats. I don't have any experience with Hi-C.

I have used allel, and I dig it. One thing that I think is missing is an efficient disk format. It would be interesting to introduce some classes to allel that get you zero-copy data from S3 via TileDB.

@quasiben
Member

@nvictus and I briefly chatted about Hi-C today. He also mentioned bigWig and bigBed.

@jakirkham
Member

I have used allel, and I dig it. One thing that I think is missing is an efficient disk format.

AIUI this is one of the motivations behind Zarr ( cggh/scikit-allel#102 ). Though @alimanfoo please feel free to jump in and correct me if I'm misunderstanding things.

@alimanfoo
Contributor

Hi folks, thanks for tagging me in. Just to mention that we commonly convert VCF to zarr via scikit-allel's vcf_to_zarr() function, following a process like the one shown in this worked example. We then upload the data to GCS using gsutil, access it via gcsfs, and analyse it via dask. Here's an example notebook running a simple popgen analysis, runnable via Google Colab (although it will go much faster if you have your own dask-kubernetes cluster in Google Cloud :-).

TileDB may be an equally good or better route to dask for VCF; I'd be interested in any experience there.
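
A condensed sketch of that pipeline (file names, bucket, and the field selection are placeholders):

import allel
import dask.array as da
import gcsfs
import zarr

# Convert locally, then (after `gsutil cp -r example.zarr gs://my-bucket/`)
# read the same store back over GCS.
allel.vcf_to_zarr("example.vcf.gz", "example.zarr", fields="*", overwrite=True)

gcs = gcsfs.GCSFileSystem()
store = gcs.get_mapper("my-bucket/example.zarr")
root = zarr.open_group(store, mode="r")

# Lazy dask array over the genotype calls, ready for chunked analysis.
gt = da.from_array(root["calldata/GT"], chunks=root["calldata/GT"].chunks)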

@ghost
Author

ghost commented Jul 13, 2020

Cool! I didn't know about zarr. Does anyone know the main differences between zarr and TileDB?

@ghost
Author

ghost commented Jul 13, 2020

Looks like Stavros wrote something that mentions zarr

https://medium.com/tiledb/tiledb-a-database-for-data-scientists-ddf4ca122176

Is the main difference the native support for sparse arrays?

Edit: from what I've grokked on forums, it looks like zarr is more of a file format and TileDB is more of an engine for multidimensional indices?

@nvictus

nvictus commented Jul 13, 2020

Thanks @quasiben! FYI, I'm also the developer of the cooler format for Hi-C data mentioned above. It implements a sparse matrix data model using HDF5 primitives, and by extension Zarr works too (it is currently almost a drop-in replacement; we plan to support it as an alternative backend). You can see cooler data displayed in HiGlass.

@nvictus

nvictus commented Jul 14, 2020

As I was telling @quasiben, similar to the bgzipped binary htslib formats, bigWig and bigBed (developed as part of the UCSC genome browser) are binary, compressed, and indexed. However, they also store multiple "zoom levels" of summary statistics, since they were designed for browsing. With a simple pandas range-query interface they could be trivially turned into dask dataframes. These are more "downstream" formats than BAM, as they eschew specific sequence or alignment details. Their tab-delimited brethren are wiggle/bedGraph, which describes a step function along the genome, and BED, which describes a set of intervals that may overlap.
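
A hypothetical range-query-to-dask sketch (pyBigWig is just one library that can do the range queries; the path and the 10 Mb partitioning are made up):

import dask
import dask.dataframe as dd
import pandas as pd
import pyBigWig

def read_bigwig_range(path, chrom, start, end):
    # One partition per genomic range; intervals() yields (start, end, value).
    bw = pyBigWig.open(path)
    ivs = bw.intervals(chrom, start, end) or []
    bw.close()
    return pd.DataFrame(ivs, columns=["start", "end", "value"]).assign(chrom=chrom)

ranges = [("chr1", i, i + 10_000_000) for i in range(0, 100_000_000, 10_000_000)]
parts = [dask.delayed(read_bigwig_range)("signal.bw", c, s, e) for c, s, e in ranges]
df = dd.from_delayed(parts)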

I should also mention another htslib format that is historically interesting: tabix. It uses bgzip to compress a TSV file and builds an index over it, so you can read remote chunks using HTTP range requests. Though not super common, I see it sometimes used with VCF text files as an alternative to BCF. The main takeaway is that in genomics, people tend to prefer large indexed files to sharded ones. That's partly because we deal with so many moderately big datasets.
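
A minimal sketch of reading a remote tabix-indexed file by region and stitching the regions into a Dask dataframe (pysam is just one common reader; the URL and regions are placeholders):

import dask
import dask.dataframe as dd
import pandas as pd
import pysam

def read_tabix_region(url, chrom, start, end):
    # htslib fetches only the compressed blocks covering the region,
    # using HTTP range requests for remote files.
    tbx = pysam.TabixFile(url)
    rows = [line.split("\t") for line in tbx.fetch(chrom, start, end)]
    tbx.close()
    return pd.DataFrame(rows)

regions = [("chr1", 0, 50_000_000), ("chr1", 50_000_000, 100_000_000)]
parts = [dask.delayed(read_tabix_region)("https://example.org/data.vcf.gz", c, s, e)
         for c, s, e in regions]
df = dd.from_delayed(parts)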

@ghost
Author

ghost commented Jul 14, 2020

With respect to tabix: I believe tabix just uses an R-tree to look up records that fall in a k-cell. I think TileDB supports that natively.

@alimanfoo
Contributor

Cool! I didn't know about zarr. Does anyone know the main differences between zarr and TileDB?

Some discussion here and here.

@martindurant
Member

Given the extensive discussion here, I don't think there's much to add.
I will, however, briefly mention https://github.com/intake/fsspec-reference-maker/, which would allow (parallel) access to chunks of a binary file when the set of offsets is easily available. That can be used in conjunction with Zarr to read binary data from some other file format. Worth a thought!
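
A sketch of the reference-filesystem idea, assuming a references.json produced by fsspec-reference-maker and chunks living on S3 (all paths are placeholders):

import fsspec
import zarr

# The JSON maps Zarr chunk keys to (url, offset, length) tuples, so Zarr can
# read slices of some other binary file as if they were its own chunks.
fs = fsspec.filesystem(
    "reference",
    fo="references.json",
    remote_protocol="s3",
)
store = fs.get_mapper("")
group = zarr.open_group(store, mode="r")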
