[python] Account for missing end-rows in shape-upgrader #3538

Merged
merged 1 commit into from
Jan 9, 2025

@johnkerl johnkerl commented Jan 8, 2025

Context

As tracked on #2407 / [sc-51048] and documented at
https://cloud.tiledb.com/academy/structure/life-sciences/single-cell/tutorials/shapes/
we have a new shape feature as of TileDB-SOMA 1.15. This is a core-database-managed construct which keeps track of the 'bounding box' of a sparse array, regardless of how many cells have (or have not) been written to the array.

All TileDB-SOMA experiments created at 1.15 and above will have this new shape on all component dataframes/arrays.

For TileDB-SOMA experiments created by TileDB-SOMA software versions < 1.15, we offer tiledbsoma.io.upgrade_shape as documented also here
https://cloud.tiledb.com/academy/structure/life-sciences/single-cell/tutorials/shapes/
This leverages the old used_shape (which is now deprecated and replaced by the new-and-improved shape).

[sc-61530]

Bug

There is a recently discovered corner-case bug. Suppose an experiment created by TileDB-SOMA software version < 1.15 has nobs of 1,000,000 and nvar of 60,000 for measurement "RNA". Then in tiledbsoma.io.upgrade_shape we want to endow the X arrays in the "RNA" measurement with the shape (1_000_000, 60_000). Before this PR, we were consulting the old used_shape for those X arrays. There are corner cases where an X array has no occupied cells whatsoever in the last one or more rows. In such a case, the used_shape will be, say, (999_998, 60_000). If the user subsequently runs an ExperimentAxisQuery which includes the last rows of obs, namely soma_joinid of 999,998 or 999,999, then the ExperimentAxisQuery will error with

Query: A range was set outside of the current domain
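
The mismatch can be illustrated without TileDB at all. What follows is a minimal pure-Python sketch, not the library's code: `used_shape` and `in_domain` are hypothetical helpers that model a bounding box derived only from occupied cells, and the two occupied coordinates are made-up sample data chosen to match the numbers above.

```python
def used_shape(coords):
    """Smallest (nrows, ncols) box that covers every occupied (row, col) cell.

    Hypothetical helper: it models a shape inferred from written cells only.
    """
    rows = [r for r, _ in coords]
    cols = [c for _, c in coords]
    return (max(rows) + 1, max(cols) + 1)


def in_domain(shape, row):
    """True if a zero-based row index falls inside the given shape."""
    return 0 <= row < shape[0]


nobs, nvar = 1_000_000, 60_000

# Made-up occupancy: the last two obs rows (999_998 and 999_999) have no X cells.
occupied = [(0, 0), (999_997, 59_999)]

derived = used_shape(occupied)   # shape inferred from occupied cells
intended = (nobs, nvar)          # shape implied by obs and var

for joinid in (999_998, 999_999):
    print(joinid, "derived:", in_domain(derived, joinid),
          "intended:", in_domain(intended, joinid))
```

Under the derived shape both trailing rows fall outside the domain, which is the situation that produces the error above; under the intended shape they are in range.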

Solution

On this PR, when endowing old experiments with new shapes, we correctly consult the nobs and nvar for X (likewise with obsm, obsp, varm, and varp, mutatis mutandis). In the above example, the shape would be properly set to (1_000_000, 60_000).
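
The rule for each collection can be sketched as a small pure-Python function. This is an illustration under stated assumptions, not the code in this PR: `corrected_shape` is a hypothetical name, and the treatment of obsm and varm (first dimension from nobs or nvar, second dimension kept from the existing array) is our reading of "mutatis mutandis".

```python
def corrected_shape(collection, old_shape, nobs, nvar):
    """Return the shape an array should have, given obs and var counts.

    Hypothetical helper. `collection` is one of "X", "obsm", "obsp",
    "varm", "varp"; `old_shape` is the array's current 2-D shape.
    """
    if collection == "X":
        return (nobs, nvar)
    if collection == "obsp":
        return (nobs, nobs)
    if collection == "varp":
        return (nvar, nvar)
    # Assumption: the second dimension of obsm/varm is array-specific
    # (for example, the number of embedding components), so keep it.
    if collection == "obsm":
        return (nobs, old_shape[1])
    if collection == "varm":
        return (nvar, old_shape[1])
    raise ValueError(f"unknown collection: {collection!r}")


nobs, nvar = 1_000_000, 60_000
print(corrected_shape("X", (999_998, 60_000), nobs, nvar))
print(corrected_shape("obsm", (999_998, 50), nobs, nvar))
```

The point of the design is that the occupied cells of the array itself are never consulted for the dimensions that obs and var already determine.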

Repair

How to check:

fix-check.py

#!/usr/bin/env python

import tiledbsoma
import tiledbsoma.io
import sys

# ----------------------------------------------------------------
def check_array(array, old_shape, new_shape):
    print()
    print("URI      ", array.uri)
    print("Old shape", old_shape)
    print("New shape", new_shape)
    ok, msg = array.resize(new_shape, check_only=True)
    if ok:
        print("OK to resize")
    else:
        print("Cannot resize: ", msg)

# ----------------------------------------------------------------
for uri in sys.argv[1:]:
    print()
    print("================================================================")
    print("Experiment URI:", uri)

    nobs = None
    nvars = {}
    with tiledbsoma.Experiment.open(uri) as exp:
        nobs = exp.obs.count
        for name, measurement in exp.ms.items():
            nvars[name] = measurement.var.count

    with tiledbsoma.Experiment.open(uri, "w") as exp:
        for measurement_name, measurement in exp.ms.items():
            nvar = nvars[measurement_name]

            if "X" in measurement:
                for array in measurement.X.values():
                    old_shape = array.shape
                    new_shape = (nobs, nvar)
                    check_array(array, old_shape, new_shape)

            # obsm and varm are densely occupied and not in need of change

            if "obsp" in measurement:
                for array in measurement.obsp.values():
                    old_shape = array.shape
                    new_shape = (nobs, nobs)
                    check_array(array, old_shape, new_shape)

            if "varp" in measurement:
                for array in measurement.varp.values():
                    old_shape = array.shape
                    new_shape = (nvar, nvar)
                    check_array(array, old_shape, new_shape)

How to fix:

fix.py

#!/usr/bin/env python

import tiledbsoma
import tiledbsoma.io
import sys

# ----------------------------------------------------------------
def resize_array(array, old_shape, new_shape):
    print()
    print("URI      ", array.uri)
    print("Old shape", old_shape)
    print("New shape", new_shape)
    if old_shape != new_shape:
        print("Resizing")
        array.resize(new_shape)
    else:
        print("Already correct")

# ----------------------------------------------------------------
for uri in sys.argv[1:]:
    print()
    print("================================================================")
    print("Experiment URI:", uri)

    nobs = None
    nvars = {}
    with tiledbsoma.Experiment.open(uri) as exp:
        nobs = exp.obs.count
        for name, measurement in exp.ms.items():
            nvars[name] = measurement.var.count

    with tiledbsoma.Experiment.open(uri, "w") as exp:
        for measurement_name, measurement in exp.ms.items():
            nvar = nvars[measurement_name]

            if "X" in measurement:
                for array in measurement.X.values():
                    old_shape = array.shape
                    new_shape = (nobs, nvar)
                    resize_array(array, old_shape, new_shape)

            # obsm and varm are densely occupied and not in need of change

            if "obsp" in measurement:
                for array in measurement.obsp.values():
                    old_shape = array.shape
                    new_shape = (nobs, nobs)
                    resize_array(array, old_shape, new_shape)

            if "varp" in measurement:
                for array in measurement.varp.values():
                    old_shape = array.shape
                    new_shape = (nvar, nvar)
                    resize_array(array, old_shape, new_shape)
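
The decision the two scripts make per array can be modeled offline as follows. This is a sketch only: `plan_resize` is a hypothetical helper, and it assumes that a shape can be grown but not shrunk, which is our understanding of resize and is not verified here against the library. In the scripts themselves, the authoritative answer comes from array.resize(new_shape, check_only=True).

```python
def plan_resize(old_shape, new_shape):
    """Classify what should happen to one array (hypothetical helper).

    Returns "already correct", "resize", or "cannot shrink".
    """
    old_shape, new_shape = tuple(old_shape), tuple(new_shape)
    if len(old_shape) != len(new_shape):
        raise ValueError("shapes must have the same number of dimensions")
    if old_shape == new_shape:
        return "already correct"
    # Assumption: resizing may only enlarge a dimension, never reduce it.
    if any(new < old for old, new in zip(old_shape, new_shape)):
        return "cannot shrink"
    return "resize"


print(plan_resize((999_998, 60_000), (1_000_000, 60_000)))
print(plan_resize((1_000_000, 60_000), (1_000_000, 60_000)))
```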

@johnkerl johnkerl requested a review from nguyenv January 8, 2025 21:23

codecov bot commented Jan 8, 2025

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 86.32%. Comparing base (eabdb30) to head (dba4157).
Report is 2 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #3538      +/-   ##
==========================================
+ Coverage   86.25%   86.32%   +0.06%     
==========================================
  Files          55       55              
  Lines        6359     6369      +10     
==========================================
+ Hits         5485     5498      +13     
+ Misses        874      871       -3     
Flag Coverage Δ
python 86.32% <100.00%> (+0.06%) ⬆️

Components Coverage Δ
python_api 86.32% <100.00%> (+0.06%) ⬆️
libtiledbsoma ∅ <ø> (∅)

johnkerl commented Jan 9, 2025

Thanks @nguyenv !

@johnkerl johnkerl merged commit 52361d8 into main Jan 9, 2025
11 checks passed
@johnkerl johnkerl deleted the kerl/upgrade-shape-fixup branch January 9, 2025 20:51
johnkerl added a commit that referenced this pull request Jan 9, 2025