Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Materialize DimArray or DimStack From a Table #739

Open
wants to merge 40 commits into
base: main
Choose a base branch
from

Conversation

JoshuaBillson
Copy link
Contributor

Description:

resolves #335

This PR aims to let users construct either a DimStack or DimArray from a table with one or more coordinate columns.

Unlike the existing contructor, rows may be out of order or even missing altogether.

Performance:

The algorithm is O(n), requiring two forward passes for each dimension to determine the correct order of rows.

1000x1000: 0.005 Seconds

2000x2000: 0.025 Seconds

4000x4000: 0.108 Seconds

8000x8000: 0.376 Seconds

Example:

julia> r = DimArray(rand(UInt8, 1000, 1000), (X, Y));

julia> t = r |> DataFrame |> Random.shuffle!;

julia> restored = DimArray(t, dims(r));

julia> all(r .== restored)
true

Next Steps:

  1. Finalize method signatures.
  2. Decide what to do with missing rows. Currently, users may choose a missing value to fill in (missing by default).
  3. Add support for a :geometry column.
  4. Write test cases.
  5. Update docs.

src/table_ops.jl Outdated Show resolved Hide resolved
src/table_ops.jl Outdated Show resolved Hide resolved
src/table_ops.jl Show resolved Hide resolved
src/table_ops.jl Outdated Show resolved Hide resolved
src/table_ops.jl Outdated Show resolved Hide resolved
src/table_ops.jl Outdated Show resolved Hide resolved
@JoshuaBillson
Copy link
Contributor Author

There's still a few questions that need resolving:

  1. What do we do with missing rows? This can be handled explicitly in Rasters.jl, but DimArrays have no concept of missing values. We could let users choose a missing value to write, or we could encode it explicitly with missing. We could also disallow the existence of missing rows, but I don't think this is an ideal solution.
  2. How should we handle a :geometry column? I know that several packages in the Julia ecosystem are using this convention, but I don't have much experience with them. Can we assume :geometry will contain tuples of coordinates, or could they also be some sort of geometry like Meshes.Point?

@asinghvi17
Copy link
Collaborator

asinghvi17 commented Jun 20, 2024

I think it's best to test that the geometry column's elements have GI.PointTrait - then you can always extract e.g. GI.x(point) and the same for y, z, and m. You can also interrogate the dimension via GI.is3d, GI.ismeasured.

Going forward to get the geometry column it's probably best to have first(GI.geometrycolumns(table)) - the fallback implementation will give you (:geometry,), but to handle tables with other geometry columns which may be indicated by e.g. metadata, it's better to use this.

@rafaqz
Copy link
Owner

rafaqz commented Jun 20, 2024

DimensionalData.jl does not depend on GeoInterface

@JoshuaBillson
Copy link
Contributor Author

DimensionalData.jl does not depend on GeoInterface

Being able to interact with geometries in a generic fashion would help us interop with other packages. GeoInterface's only dependency is Extents, which we already depend on. After importing DimensionalData, GeoInterface loads in 0.008 seconds.

I also see that Rasters already depends on GeoInterface. We could simply ignore the :geometry column in DimensionalData, then implement the additional functionality in Rasters. However, I think the problem is general enough to be implemented here.

@rafaqz
Copy link
Owner

rafaqz commented Jun 20, 2024

It's not about deps or timing, it's about clean feature scope.

The promise here is that Rasters (and YAX) have all the geo deps and features, so non-geo people don't have to worry about them. Some of the biggest contributors here use DD for unrelated fields.

I would ignore point columns here entirely and instead write the code so it's easy for Rasters to handle them. Taking the underscores off a few functions and documenting them as an real interface will help that.

@JoshuaBillson
Copy link
Contributor Author

I've updated the docstrings for both the DimArray and DimStack constructors. By default, materializers will now use the Contains selector for irregular and non-numeric coordinates, with the option to specify an alternative when desired. We're exporting the restore_array and coords_to_index methods to be used by downstream packages like Rasters.jl. I've also written several test cases to show that we can handle tables with out of order and missing rows. If everything looks good, I think we can go ahead and merge this PR.

Copy link
Owner

@rafaqz rafaqz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is good, but a few minor changes are needed for it to be correct. Constructed selectors should be allowed so At can have atol. We can just construct selector types at the outer level (they're just filled with nothing.

Then there is a bit more dispatch needed to make the fast paths correct for At/Near/Contains with their standard behaviour.

src/DimensionalData.jl Outdated Show resolved Hide resolved
src/array/array.jl Outdated Show resolved Hide resolved
src/array/array.jl Outdated Show resolved Hide resolved
src/table_ops.jl Outdated Show resolved Hide resolved
src/table_ops.jl Outdated Show resolved Hide resolved
src/table_ops.jl Show resolved Hide resolved
src/table_ops.jl Outdated Show resolved Hide resolved
src/table_ops.jl Outdated Show resolved Hide resolved
src/table_ops.jl Outdated Show resolved Hide resolved
src/table_ops.jl Outdated Show resolved Hide resolved
@JoshuaBillson
Copy link
Contributor Author

JoshuaBillson commented Aug 12, 2024

I've implemented the guess_dims method to infer dimensions directly from the data. It should work for both sorted and shuffled data and can handle missing rows. If the data is sorted, you shouldn't need to specify the ordering, since it can be inferred directly. In the case of shuffled data, you will need to give the order as a Dim => Order pair. If the names of your coordinate columns can be safely guessed (currently :X, :Y, :Z, :Ti, and :Band) and the data is sorted, then we should be able to infer the dimensions without any input from the user.

Here's an example of it in action:

julia> xdims = X(LinRange{Float64}(610000.0, 661180.0, 2560));

julia> ydims = Y(LinRange{Float64}(6.84142e6, 6.79024e6, 2560));

julia> bdims = Dim{:Band}([:B02, :B03, :B04]);

julia> d = DimArray(rand(UInt16, 2560, 2560, 3), (xdims, ydims, bdims));

julia> t = DataFrame(d);

julia> t_rand = Random.shuffle(t);

julia> dims(d)
 X    Sampled{Float64} LinRange{Float64}(610000.0, 661180.0, 2560) ForwardOrdered Regular Points,
 Y    Sampled{Float64} LinRange{Float64}(6.84142e6, 6.79024e6, 2560) ReverseOrdered Regular Points,
↗ Band Categorical{Symbol} [:B02, :B03, :B04] ForwardOrdered

julia> DD.guess_dims(t)
 X    Sampled{Float64} LinRange{Float64}(610000.0, 661180.0, 2560) ForwardOrdered Regular Points,
 Y    Sampled{Float64} LinRange{Float64}(6.84142e6, 6.79024e6, 2560) ReverseOrdered Regular Points,
↗ Band Categorical{Symbol} [:B02, :B03, :B04] ForwardOrdered

julia> DD.guess_dims(t, (X, Y, :Band))
 X    Sampled{Float64} LinRange{Float64}(610000.0, 661180.0, 2560) ForwardOrdered Regular Points,
 Y    Sampled{Float64} LinRange{Float64}(6.84142e6, 6.79024e6, 2560) ReverseOrdered Regular Points,
↗ Band Categorical{Symbol} [:B02, :B03, :B04] ForwardOrdered

julia> DD.guess_dims(t_rand, (X => DD.ForwardOrdered(), Y => DD.ReverseOrdered(), :Band => DD.ForwardOrdered()))
 X    Sampled{Float64} LinRange{Float64}(610000.0, 661180.0, 2560) ForwardOrdered Regular Points,
 Y    Sampled{Float64} LinRange{Float64}(6.84142e6, 6.79024e6, 2560) ReverseOrdered Regular Points,
↗ Band Categorical{Symbol} [:B02, :B03, :B04] ForwardOrdered

julia> DD.guess_dims(t_rand[1:10000,:], (X => DD.ForwardOrdered(), Y => DD.ReverseOrdered(), :Band => DD.ForwardOrdered()))
 X    Sampled{Float64} LinRange{Float64}(610000.0, 661180.0, 2560) ForwardOrdered Regular Points,
 Y    Sampled{Float64} LinRange{Float64}(6.84142e6, 6.79024e6, 2560) ReverseOrdered Regular Points,
↗ Band Categorical{Symbol} [:B02, :B03, :B04] ForwardOrdered

As can be seen in the last example, guess_dims works even when we select only a subset of the shuffled rows.

@rafaqz
Copy link
Owner

rafaqz commented Aug 14, 2024

This is awesome to have, thanks.

It may be a bit before I can fully review and test it but it looks good from a quick skim.

src/stack/stack.jl Outdated Show resolved Hide resolved
src/table_ops.jl Outdated Show resolved Hide resolved
src/table_ops.jl Outdated Show resolved Hide resolved
@@ -429,6 +430,13 @@ function DimArray(A::AbstractBasicDimArray;
newdata = collect(data)
DimArray(newdata, format(dims, newdata); refdims, name, metadata)
end
# Write a single column from a table with one or more coordinate columns to a DimArray
function DimArray(table, dims; name=NoName(), selector=DimensionalData.Near(), precision=6, kw...)
data = restore_array(table, dims; selector=selector, missingval=missing, name=name, precision=precision)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should probably do a Tables.istable check at this point for a nicer error and stack trace, rather than failing deep in the internals.

@@ -420,5 +421,12 @@ function DimStack(data::NamedTuple, dims::Tuple;
all(map(d -> axes(d) == axes(first(data)), data)) || _stack_size_mismatch()
DimStack(data, format(dims, first(data)), refdims, layerdims, metadata, layermetadata)
end
# Write each column from a table with one or more coordinate columns to a layer in a DimStack
function DimStack(table, dims::Tuple; selector=DimensionalData.Contains(), kw...)
data_cols = _data_cols(table, dims)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again we probably need a Tables.istable check here

src/table_ops.jl Outdated
end

# Determine the ordinality of a set of coordinates
_coords_to_ords(coords::AbstractVector, dim::DD.Dimension, sel::DD.Selector) = _coords_to_ords(coords, dim, sel, DD.locus(dim), DD.span(dim))
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are all of these DD and DimensionalData qualifiers needed?

_guess_dims(coords::AbstractVector, dim::DD.Dimension, args...) = dim
_guess_dims(coords::AbstractVector, dim::Type{<:DD.Dimension}, args...) = _guess_dims(coords, DD.name(dim), args...)
_guess_dims(coords::AbstractVector, dim::Pair, args...) = _guess_dims(coords, first(dim), last(dim), args...)
function _guess_dims(coords::AbstractVector, dim::Symbol, precision::Int)
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be AbstractVector{Union{Number,DateTime} ?

I'm wondering what happens to strings, symbols and other objects that need to go in Categorical lookups.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It should work with almost any type. If the coordinates are non-numerical, then we will internally dispatch on the following methods:

# Extract all unique coordinates from the given vector
_unique_vals(coords::AbstractVector, precision::Int) = _round_dim_val.(coords, precision) |> unique

# Round dimension value within the specified precision
_round_dim_val(x, ::Int) = x

# Determine if the given coordinates are forward ordered, reverse ordered, or unordered
function _guess_dim_order(coords::AbstractVector)
    if issorted(coords)
        return DD.ForwardOrdered()
    elseif issorted(coords, rev=true)
        return DD.ReverseOrdered()
    else
        return DD.Unordered()
    end
end

# Estimate the span between consecutive coordinates
_guess_dim_span(::AbstractVector, ::DD.Order, ::Int) = DD.Irregular()

_unique_vals will just return all unique values, where _round_dim_val is just the identity function for non-numerical coordinates.

_guess_dim_order should work for anything that can be sorted, which is the case for both String and Symbol.

_guess_dim_span will return DD.Irregular() for non-numerical coordinates.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We used to need a try catch for issorted in case < is not defined for a type

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can also find regular spans for Dates?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should let us handle data that can't be sorted:

function _guess_dim_order(coords::AbstractVector)
    try
        if issorted(coords)
            return DD.ForwardOrdered()
        elseif issorted(coords, rev=true)
            return DD.ReverseOrdered()
        else
            return DD.Unordered()
        end
    catch 
        return DD.Unordered()
    end
end

And this should retrieve the span from Date and DateTime objects:

function _guess_dim_span(coords::AbstractVector{<:Dates.AbstractTime}, ::DD.Ordered, precision::Int)
    steps = (@view coords[2:end]) .- (@view coords[1:end-1])
    span = argmin(abs, steps)
    return all(isinteger, round.(steps ./ span, digits=precision)) ? DD.Regular(span) : DD.Irregular()
end

However, there seems to be a problem with constructing a LinRange from Date objects:

julia> vals = [Date("2022-11-16") + Day(i * 7) for i in 0:4];

julia> LinRange(first(vals), last(vals), 5)
5-element LinRange{Day, Int64}:
Error showing value of type LinRange{Day, Int64}:
ERROR: InexactError: Int64(553856.25)

Thus, I'm not sure how we should construct a Dimension with regularly spaced dates.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A StepRange works for Dates. Probably we should use StepRangeLen instead of LinRange where possible anyway

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That works. Do you want to use StepRangeLen for numerical coordinates or just dates?

Here's the result:

julia> xdims = X(LinRange{Float64}(610000.0, 661180.0, 2560));

julia> ydims = Y(LinRange{Float64}(6.84142e6, 6.79024e6, 2560));

julia> bdims = Dim{:Band}([:B02, :B03, :B04]);

julia> tdims = Dim{:Ti}([d1 + Day(i * 7) for i in 0:4]);

julia> d = DimArray(rand(UInt16, 2560, 2560, 3, 5), (xdims, ydims, bdims, tdims));

julia> t = DataFrame(d);

julia> DD.guess_dims(t)
 X    Sampled{Float64} 610000.0:20.0:661180.0 ForwardOrdered Regular Points,
 Y    Sampled{Float64} 6.84142e6:-20.0:6.79024e6 ReverseOrdered Regular Points,
↗ Ti   Sampled{Date} Date("2024-11-18"):Day(7):Date("2024-12-16") ForwardOrdered Regular Points,
⬔ Band Categorical{Symbol} [:B02, :B03, :B04] ForwardOrdered

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's probably better for everything, except a bit slower. It didn't exist when I first wrote this package and uses of LinRange are just legacy from that.

src/table_ops.jl Outdated Show resolved Hide resolved
src/table_ops.jl Outdated Show resolved Hide resolved
src/table_ops.jl Outdated Show resolved Hide resolved
src/table_ops.jl Outdated Show resolved Hide resolved
src/table_ops.jl Outdated Show resolved Hide resolved
src/array/array.jl Outdated Show resolved Hide resolved
Copy link
Owner

@rafaqz rafaqz left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, a bunch of minor fixes.

Probably the main question is how Number and DateTime is handled

@rafaqz
Copy link
Owner

rafaqz commented Sep 15, 2024

Any news on this PR @JoshuaBillson ? would be good to have this available

JoshuaBillson and others added 6 commits September 17, 2024 23:55
Co-authored-by: Rafael Schouten <[email protected]>
Co-authored-by: Rafael Schouten <[email protected]>
Co-authored-by: Rafael Schouten <[email protected]>
Co-authored-by: Rafael Schouten <[email protected]>
Co-authored-by: Rafael Schouten <[email protected]>
Co-authored-by: Rafael Schouten <[email protected]>
@JoshuaBillson
Copy link
Contributor Author

Any news on this PR @JoshuaBillson ? would be good to have this available

Sorry, I've been busy with some other projects. I'll try to resolve your suggested fixes this week. Once that's done, I think we just need to update the docs for DimStack and DimArray and write some more test cases.

@rafaqz
Copy link
Owner

rafaqz commented Sep 18, 2024

No worries at all, whenever you have time.

There are just some nice consequences of having this, like loading GeoJSON to a vector data cube, so I'm keen to have it.

# Write a single column from a table with one or more coordinate columns to a DimArray
function DimArray(table, dims; name=NoName(), selector=Near(), precision=6, missingval=missing, kw...)
# Confirm that the Tables interface is implemented
Tables.istable(table) || throw(ArgumentError("`table` must satisfy the `Tables.jl` interface."))
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
Tables.istable(table) || throw(ArgumentError("`table` must satisfy the `Tables.jl` interface."))
Tables.istable(table) || throw(ArgumentError("`obj` must be an `AbstractArray` or satisfy the `Tables.jl` interface."))

People will also hit this method if they do something weird like pass a non-AbstractArray to DimArray.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Add a Tables.materializer method
3 participants