For cell sets, as far as what conversions to support, my thought would be:
support conversion from a dataframe that matches the CSV format that Vitessce uses with the "Export as CSV" button of the cell set manager (only supported with 1-level hierarchies), output a JSON file containing the 1-level hierarchy
support conversion from a dataframe that matches the CSV format that we receive from the Satija lab (flat cell type annotations that we expand into a hierarchy using the EBI Cell Ontology)
if someone wants a full hierarchy consisting of something else they will need to define the full JSON themselves, but something else that this package could offer is validation (in python, against the same JSON schemas that we use in the client)
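A minimal sketch of the first conversion (flat dataframe to a 1-level hierarchy), just to make the idea concrete. The column names `cell_id` and `set_name`, the function name, and the output layout are all assumptions for illustration; they are not the actual "Export as CSV" columns or the JSON schema used in the client.

```python
import json

import pandas as pd


def cell_sets_to_json(df, cell_col="cell_id", set_col="set_name", group_name="Cell Sets"):
    """Build a 1-level hierarchy from flat (cell, set) rows.

    Column names and the output layout are hypothetical, not the Vitessce schema.
    """
    missing = {cell_col, set_col} - set(df.columns)
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    # One child per set name; cell ids keep their original row order.
    children = [
        {"name": str(name), "set": [str(cell) for cell in group[cell_col]]}
        for name, group in df.groupby(set_col, sort=True)
    ]
    return {"name": group_name, "children": children}


df = pd.DataFrame(
    {
        "cell_id": ["c1", "c2", "c3"],
        "set_name": ["B cell", "T cell", "B cell"],
    }
)
tree = cell_sets_to_json(df)
print(json.dumps(tree, indent=2))
```

Validation against the client's JSON schemas could then run on the returned dict before it is written to disk.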
Overview
Right now we have code all over the place for creating Vitessce data/configs:
https://github.com/hubmapconsortium/portal-containers
https://github.com/hubmapconsortium/vitessce-data
https://github.com/hubmapconsortium/portal-ui/blob/master/context/app/api/vitessce.py
This is problematic because it makes launching new Vitessce configs difficult and hard to explain to people not familiar with our code. This problem is only going to grow, and as we gain users (probably other data portals), it would be good to have not only schemas for validating the data, but also a way of reliably generating the data.
The overarching goal here is to take in a Pandas dataframe and output compliant Arrow (in the future), Zarr, OME-TIFF, and JSON data for Vitessce. A secondary goal could be to also create Vitessce configurations based on what data has been generated - basically pre-defined view configurations based on certain standard inputs (e.g. genes/clusters + raster + cells/cell-sets without a scatterplot gives what we have for CODEX, and with a scatterplot gives Linnarsson minus one of the scatterplots).
I'll organize this issue by data type.
Genes/Clusters (Heatmap)
Our genes and clusters schemas convey very similar information, i.e. data per observation and a max for rendering. We should think about merging these, if possible, since if we can show one, we can show the other:
https://github.com/hubmapconsortium/portal-containers/blob/fb1910324fc796ff4b7d4e643de27ff2861e7d8c/containers/sprm-to-json/context/main.py#L125-L160
https://github.com/hubmapconsortium/vitessce-data/blob/master/python/cluster.py
https://github.com/hubmapconsortium/vitessce-data/blob/master/snakemake/satija/src/convert_h5ad_to_zarr.py
This might require an Arrow loader if it's too hard to parse the data properly in the client using only one schema across the two use cases, since they are used differently.
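To make the merged genes/clusters idea concrete, here is a rough sketch of a single conversion from a Cell x Gene DataFrame to a JSON-ready dict holding per-cell values plus a max for rendering. The function name and the output keys (`genes`, `max`, `cells`) are assumptions for illustration only, not the existing genes or clusters schema.

```python
import json

import pandas as pd


def cell_by_gene_to_json(df):
    """Convert a Cell x Gene matrix to a plain dict.

    Expects cell names in the index and gene names in the columns.
    The output layout is hypothetical, not the current Vitessce schema.
    """
    if not df.index.is_unique:
        raise ValueError("cell names (index) must be unique")
    if not df.columns.is_unique:
        raise ValueError("gene names (columns) must be unique")
    values = df.astype(float)
    matrix = values.to_numpy()
    return {
        "genes": [str(gene) for gene in values.columns],
        # Global maximum, usable for scaling colors when rendering.
        "max": float(matrix.max()),
        "cells": {
            str(cell): [float(v) for v in row]
            for cell, row in zip(values.index, matrix)
        },
    }


df = pd.DataFrame(
    [[0.0, 2.0], [4.0, 1.0]],
    index=["cell_1", "cell_2"],
    columns=["GeneA", "GeneB"],
)
payload = cell_by_gene_to_json(df)
print(json.dumps(payload))
```

An Arrow writer could take the same DataFrame as input, so the choice of output format would not affect the caller.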
In any case, I think a function that takes in a Pandas DataFrame containing a Cell x Gene matrix and outputs JSON/Arrow should be the goal here. The index of such a DataFrame would be cell names and the column names would be genes. This will help with Cells/Cell-Sets.
Cell-Sets/Cells
@keller-mark knows best (feel free to comment/edit this issue!), but this is a little bit more complicated since the two are intertwined yet not necessary/sufficient in both directions (like the above); that is, one could have "Cells" without "Cell-Sets" but not really "Cell-Sets" without "Cells."
Like the above, we want a function that takes in a Pandas DataFrame and outputs JSON/Arrow, but the structure of the DataFrame is a little bit hairier (not just a labeled Cell x Gene matrix where the labels are basically unchecked). I foresee us needing to either strongly define an API or rely on a properly named DataFrame (i.e. each column has a specific name like poly or xy). I think we should probably go the route of an API, with a call in which each string argument is a column in the dataframe df to be put into the JSON portion corresponding roughly to the arg key. The index of this dataframe will be cell ids, just like the above.
I think Cell_sets is going to be a little harder. Maybe you could add something about this here, @keller-mark, in terms of what input data could look like.
Raster
This one is tricky as well. We should probably support both tiff and zarr via a flag. We'll need to set up the docker container for bioformats2raw/raw2ometiff as a dependency (which I think can be done via the setup.py file). Beyond that, the other major pain point will be input data. Are we expecting numpy arrays? dask arrays? zarr stores? File paths? Perhaps all 4 can be possible?
@manzt can probably comment on this as well. I imagine most people will input OME-TIFF to bioformats2raw, but I think we can also handle other inputs and use our custom pyramid generator or something Python-specific (in contrast to bioformats2raw) that Glencoe writes.
Molecules
I think this will be relatively straightforward, like the genes data: an input DataFrame with the index being molecule names, plugged into an API, is what we will use.
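A rough sketch of what such a call could look like, assuming the DataFrame index holds molecule names and two columns hold coordinates. The function name `molecules_to_json`, the `x`/`y` arguments, the example column names, and the output layout are all hypothetical; none of this is an existing Vitessce API or schema.

```python
import json

import pandas as pd


def molecules_to_json(df, x, y):
    """Group (x, y) positions by molecule name taken from the index.

    `x` and `y` name the coordinate columns in `df`.
    The output layout is a guess, not the Vitessce molecules schema.
    """
    for col in (x, y):
        if col not in df.columns:
            raise KeyError(f"column not found in DataFrame: {col!r}")
    molecules = {}
    for name, x_val, y_val in zip(df.index, df[x], df[y]):
        # Repeated index values are collected under the same molecule name.
        molecules.setdefault(str(name), []).append([float(x_val), float(y_val)])
    return molecules


df = pd.DataFrame(
    {"x_pos": [1.0, 2.5, 3.0], "y_pos": [4.0, 5.5, 6.0]},
    index=["Acta2", "Acta2", "Gad1"],
)
molecules = molecules_to_json(df, x="x_pos", y="y_pos")
print(json.dumps(molecules))
```

Passing column names as string arguments keeps the input DataFrame free-form, which matches the API-over-naming-convention route suggested for cells.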