IO for genomic file formats #6082
Hi @Jamesthegiantpeach. Broadly, you should look through https://docs.dask.org/en/latest/dataframe-create.html. In particular, I suspect that you'll want to use Dask delayed with Dask dataframe, particularly if you already have Python functions that can create Pandas dataframes from chunks of your file.
For tab-delimited files and remote access (S3, HDFS, GCS) you probably also want to look at read_bytes, and note the delimiter keyword.
I think that having an open source reader for SAM and VCF files for Dask would be great. Is this work that you're planning on / open to publishing? |
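A minimal sketch of the read_bytes route mentioned above, assuming a newline-delimited tab-separated file; the S3 path and parsing details are placeholders, not part of the original discussion:

```python
# Hypothetical sketch: split a remote tab-delimited file into newline-aligned
# byte blocks with dask.bytes.read_bytes, then parse each block with pandas.
import io

import pandas as pd
import dask.dataframe as dd
from dask import delayed
from dask.bytes import read_bytes

# Each element of `blocks` is a list (one per file) of dask.delayed objects
# yielding raw bytes, with block boundaries snapped to the b"\n" delimiter.
sample, blocks = read_bytes("s3://my-bucket/sample.sam", delimiter=b"\n")


def bytes_to_frame(chunk: bytes) -> pd.DataFrame:
    # Parse one chunk of tab-delimited records; real code would apply the
    # header and set proper column names and dtypes.
    return pd.read_csv(io.BytesIO(chunk), sep="\t", header=None, comment="@")


parts = [delayed(bytes_to_frame)(b) for file_blocks in blocks for b in file_blocks]
df = dd.from_delayed(parts)
```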
Hi @mrocklin, thanks for the links! These are helpful.
The files are almost always stored compressed, with an index that specifies byte offsets for extracting random records without running DEFLATE over the entire file. So the blocks would have to be calculated from the file's index and extracted using some kind of S3 file seeker. I don't know whether the s3fs library that Dask currently uses supports byte-range queries on GET requests to S3.
I would like to have the community's help maintaining the IO libraries. It's a little more complicated than a dd.from_vcf() and dd.to_vcf() function, though. There can be two schemas associated with a VCF file, for instance: a Variant schema and a Genotype schema. I think I can handle this with an argument like dd.from_vcf(schema='genotype'), but I'm not sure yet. I want to write up the denormalized table schemas that the files need to be converted to and circle back to this thread with that info.
Note: There are some useful schemas defined in this project. |
In case it is of interest, TileDB has an open-source VCF ingestion and query library here which models variants as sparse TileDB arrays. The library provides a Python API which returns query results as Dask (delayed) dataframes built on top of Pandas DFs, similar to mrocklin's suggestion (internally the results are actually returned as Arrow buffers, which are also used for the Spark API and could be plumbed into other Arrow-friendly languages like R). |
Yes, you can treat the files just like ordinary file-like objects and seek will work as expected. |
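To make that concrete, a minimal sketch of a ranged read via s3fs; the bucket, key, and offsets are placeholders:

```python
# Minimal sketch of a byte-range read from S3 via s3fs. Under the hood s3fs
# issues a ranged GET request, so only the requested bytes are transferred.
import s3fs

fs = s3fs.S3FileSystem(anon=False)

with fs.open("my-bucket/cohort.vcf.gz", "rb") as f:
    f.seek(1_048_576)          # jump to an offset taken from the file's index
    chunk = f.read(64 * 1024)  # read only that byte range
```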
Is TileDB maintaining a Dask fork?
EDIT: Nvm, I found the code here. I think this is an interesting approach. I personally think the SAM and VCF specs are hacky, and I have been trying to get the comp bio community to drop them. It might be better to eschew the spec entirely and just convert the VCF files to a proper columnar on-disk interchange format like Parquet, then ingest from Parquet into Arrow RecordBatches, buffers, etc. |
So to be explicit about @martindurant 's point, you could do something like the following:

```python
import dask
import dask.dataframe
import pandas


def read_sam(filename, open=open) -> dask.dataframe.DataFrame:
    with open(filename) as f:
        header = f.read(size_of_header)
        header = process_header(header)
        index_locations = ...

    partitions = [
        dask.delayed(read_part)(filename, start, stop, open=open)
        for start, stop in index_locations
    ]
    return dask.dataframe.from_delayed(partitions)


def read_part(filename, start, stop, open=open) -> pandas.DataFrame:
    with open(filename) as f:
        f.seek(start)
        data = f.read(stop - start)
    df = process(data)
    return df
```
|
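A hypothetical usage of that sketch against S3, passing an s3fs open function so both the header read and the per-partition reads go through ranged GET requests (paths are placeholders, and mode handling is glossed over):

```python
# Hypothetical usage of the read_sam sketch above with a remote filesystem.
import s3fs

fs = s3fs.S3FileSystem()
ddf = read_sam("my-bucket/alignments.sam", open=fs.open)
```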
Misfire on commenting earlier. Updated. |
Thanks, this makes sense. I need to rethink this. It might not be a great idea to simply import the VCF records as rows in a Dask dataframe, since in any given human genome the overwhelming majority of locations match the reference genome (homozygous reference, 0/0, etc.), so 99% of the rows would be wasted space. The approach here makes a lot of sense to me. I am going to tinker with TileDB and the Dask integration today and circle back. |
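A toy pandas illustration of that sparsity point, with made-up column names and values: keeping only rows that differ from the reference drops the bulk of a per-sample genotype table.

```python
# Toy illustration of the sparsity argument; column names and data are made up.
import pandas as pd

genotypes = pd.DataFrame({
    "chrom":  ["chr1", "chr1", "chr1", "chr2"],
    "pos":    [10_001, 10_002, 10_003, 20_001],
    "sample": ["NA12878"] * 4,
    "GT":     ["0/0", "0/1", "0/0", "1/1"],
})

# Homozygous-reference calls ("0/0") dominate a real single-sample VCF,
# so filtering them out shrinks the table dramatically.
non_ref = genotypes[genotypes["GT"] != "0/0"]
print(f"kept {len(non_ref)} of {len(genotypes)} rows")
```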
Great. I'm glad to see people working on this. |
TileDB works well for my needs, and the integration with Dask arrays works nicely. I think any IO efforts from me should go into helping with IO of genomic files over there and just using the TileDB/Dask integration. Cool project.
Also, as a side note to the TileDB people: I am able to version-control a changing array (adding/subtracting individuals from the cohort) by storing TileDB files in a Pachyderm repository. Time traveling then follows the usual git semantics. This works even if you've merged the fragment arrays: you can revert to before you merged the fragments without losing any data.
Anyway, thanks for the help everyone. |
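For reference, a rough sketch of the Dask/TileDB array integration mentioned above; the array URI and attribute name are placeholders, and exact keyword arguments may differ by version:

```python
# Rough sketch: wrap a TileDB array as a lazy Dask array and run an
# out-of-core reduction over it. URI and attribute are hypothetical.
import dask.array as da

gt = da.from_tiledb("s3://my-bucket/genotypes-tiledb", attribute="GT")

# Example lazy computation, assuming integer-coded genotype calls:
# per-site count of non-reference alleles across the cohort.
alt_counts = (gt > 0).sum(axis=-1)
result = alt_counts.compute()
```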
I'm glad to hear it! Thanks for jumping in @ihnorton . @Jamesthegiantpeach if you ever have time to write about your work, we'd love a blogpost (even if only very brief) at https://github.com/dask/dask-blog (same goes for you @ihnorton ) |
I ended up toying around with doing this straight from the files in my spare time, and I will probably migrate over to using TileDB when I can. Sharding these files ends up being fairly complicated because of how some aligned reads hang over the coordinates that you're binning on.
Some of the reads hang over the bin edges, and you have to make sure these fragments end up going to both bins. There are a bunch of other weird things to worry about when splitting and merging (like deduping around the bin edges). Further, I had to write some C++ glue between htslib and Arrow so that the IO to pandas was reasonably efficient. That seems like a dependency that isn't appropriate for this repo. I think this probably warrants its own project. It would look something like scikit-allel, but maybe more centered around Apache Arrow to cut down on how much data pandas tends to unnecessarily materialize. |
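A small sketch of that bin-edge issue: a read whose alignment spans a bin boundary has to be emitted to every bin it overlaps, and downstream merges then dedupe around the edges. Bin size and coordinates here are arbitrary.

```python
# Sketch of the sharding concern: assign each aligned read to every genomic
# bin it overlaps, so reads hanging over a bin edge land in both bins.
from typing import Iterator

BIN_SIZE = 1_000_000  # 1 Mb bins, chosen arbitrarily for illustration


def bins_for_read(start: int, end: int, bin_size: int = BIN_SIZE) -> Iterator[int]:
    """Yield every bin index touched by the half-open interval [start, end)."""
    for bin_id in range(start // bin_size, (end - 1) // bin_size + 1):
        yield bin_id


# A read aligned at [999_900, 1_000_150) hangs over the first bin edge,
# so it belongs to both bin 0 and bin 1.
assert list(bins_for_read(999_900, 1_000_150)) == [0, 1]
```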
Is Hi-C also relevant to this discussion? Or is that not really being considered here? |
cc @alimanfoo (who works on scikit-allel in case he has thoughts on any of this 🙂) |
@jakirkham Hi-C is relevant if it uses the standard hts formats. I don't have any experience with Hi-C. I have used allel, and I dig it. One thing that I think is missing is an efficient disk format. It would be interesting to introduce some classes to allel that get you zero copy data from S3 via tiledb. |
@nvictus and I briefly chatted about Hi-C today. He also mentioned bigwig and bigbed |
AIUI this is one of the motivations behind Zarr ( cggh/scikit-allel#102 ). Though @alimanfoo please feel free to jump in and correct me if I'm misunderstanding things. |
Hi folks, thanks for tagging me in. Just to mention that we commonly convert VCF to Zarr via scikit-allel's vcf_to_zarr() function, following a process like the one shown in this worked example. We then upload the data to GCS using gsutil, access it via gcsfs, and analyse it via dask. Here's an example notebook running a simple popgen analysis, runnable via Google Colab (although it will go much faster if you have your own dask-kubernetes cluster in Google Cloud :-). TileDB may be an equally good or better route to Dask for VCF; I'd be interested in any experience there. |
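A condensed sketch of that VCF-to-Zarr-to-Dask route; the file names are placeholders and the exact keyword arguments should be checked against the scikit-allel docs:

```python
# Rough sketch of the workflow described above: convert VCF to Zarr once,
# then open the store lazily and compute with Dask-backed arrays.
import allel
import zarr

# One-off conversion: write the VCF fields into a Zarr store on disk.
allel.vcf_to_zarr("cohort.vcf.gz", "cohort.zarr", fields="*", overwrite=True)

# Later (or from GCS via a gcsfs mapper), open the store and wrap the
# genotype calls as a Dask-backed genotype array for out-of-core analysis.
callset = zarr.open_group("cohort.zarr", mode="r")
gt = allel.GenotypeDaskArray(callset["calldata/GT"])
allele_counts = gt.count_alleles().compute()
```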
Cool! I didn't know of Zarr. Does anyone know the central differences between Zarr and TileDB? |
Looks like Stavros wrote something that mentions Zarr: https://medium.com/tiledb/tiledb-a-database-for-data-scientists-ddf4ca122176 Is the main difference the native support for sparse arrays?
Edit: from what I've grokked on forums, it looks like Zarr is more like a file format and TileDB is more like an engine for multidimensional indices? |
Thanks @quasiben! FYI, I'm also the developer of the cooler format for Hi-C data mentioned above. It implements a sparse matrix data model using HDF5 primitives, and by extension Zarr works too (it is currently almost a drop-in replacement, we plan to support it as an alternative backend). You can see cooler data displayed in HiGlass. |
As I was telling @quasiben, similar to the bgzipped binary htslib formats, bigWig and bigBed (developed as part of the UCSC genome browser) are binary, compressed and indexed. However, they also store multiple "zoom levels" of summary statistics, as they were designed for browsing. With a simple pandas range-query interface they could be trivially turned into dask dataframes. These are more "downstream" formats from BAM as they eschew specific sequence or alignment details. Their tab-delimited brethren are wiggle/bedGraph which describes a step function along the genome and BED which describes a set of intervals that may overlap.

I should also mention another htslib format that is historically interesting: tabix. It uses bgzip to compress and index a TSV file and you can therefore read remote chunks using http range requests. Though not super common, I see it sometimes used with VCF text files as an alternative to BCF.

The main takeaway is that in genomics, people tend to prefer large indexed files to sharded ones. That's partly because we deal with so many moderately big datasets. |
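A hedged sketch of that "one big indexed file + range queries" pattern using pysam's tabix bindings; the file, columns, and regions are placeholders, and the file is assumed to be bgzipped with a .tbi index alongside it:

```python
# Sketch: turn tabix range queries into partitions of a Dask dataframe.
import pandas as pd
import pysam
import dask
import dask.dataframe as dd

COLUMNS = ["chrom", "pos", "id", "ref", "alt"]  # illustrative subset of VCF columns


def fetch_region(path: str, chrom: str, start: int, end: int) -> pd.DataFrame:
    # Tabix uses the index to locate the requested window, so only that
    # region of the bgzipped file is read and decompressed.
    tbx = pysam.TabixFile(path)
    try:
        rows = [line.split("\t")[: len(COLUMNS)] for line in tbx.fetch(chrom, start, end)]
    finally:
        tbx.close()
    return pd.DataFrame(rows, columns=COLUMNS)


regions = [("chr1", 0, 10_000_000), ("chr1", 10_000_000, 20_000_000)]
parts = [dask.delayed(fetch_region)("cohort.vcf.gz", *r) for r in regions]
ddf = dd.from_delayed(parts)
```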
With respect to tabix: |
Given the extensive discussion here, I don't think there's much to add. |
Greetings,
I would like to implement IO for some genomic file formats. The file formats I would like to read directly into Dask dataframes follow the SAM specification and the VCF specification. TL;DR on the specs: tab-delimited files with a large header containing global state for all of the records in the file.
I have a denormalized schema for the data in the files, meaning that the header information is copied into a column in every record. Also, most of the time these files are compressed using block gzip (bgzip) and indexed using an R-tree for random access. The challenge is that I would like to provide code that reads directly from these files into a Dask dataframe following the denormalized schema. Are there base classes I can implement for this?