Derivative loading differences #10

adelavega · 2022-12-05T20:23:02Z

ANCPBIds eagery loads derivatives, whereas pybids only loads when specifically asked to (e.g. derivatives=True)

The text was updated successfully, but these errors were encountered:

adelavega · 2022-12-05T20:27:43Z

ancpbids seems to load all derivatives eagerly, even if derivatives=True is not set. This actually is fairly important as you may specifically not want to load derivatives. We may want to discuss if its better to "filter" derivatives prior to instantiation of a layout, or at the time of a query. I think the latter is more flexible and generic, but I think there may be times you really don't want to index a derivative.

erdalkaraca · 2022-12-05T21:12:26Z

I think parsing the dataset structure (on file system) is fast enough in ancpbids, such that there is no need to omit parsing of derivatives. I would prefer to ignore derivatives when querying a dataset which is already possible via the scope parameter.

adelavega · 2022-12-05T22:22:05Z

Yes, that's what I figured. In general, I think the big philosophical difference between pybids & ancpbids is that pybids has various ways to effectively filter on ingestion, rather than at query time.

This was done for a few reasons: 1) In part due to not having a powerful enough query language to exclude results at query time that were not relevant 2) Also made queries shorter since often those filters would be applied to all subsequent queries in a session (i.e. if i'm never interested in derivatives, or only select derivatives, I would only want to say that once at the beginning). 3) To make indexing faster (no longer relevant).

I'm in favor of in the future going in the direction of ancpbids and letting more of this to occur at query time.

That said, this will be a breaking change

adelavega · 2022-12-05T22:22:28Z

Also a question: Can you load derivatives that are in a different directory (i.e. not under derivatives/) with ancpbids?

erdalkaraca · 2022-12-06T12:23:59Z

Yes, that's what I figured. In general, I think the big philosophical difference between pybids & ancpbids is that pybids has various ways to effectively filter on ingestion, rather than at query time.

This was done for a few reasons: 1) In part due to not having a powerful enough query language to exclude results at query time that were not relevant 2) Also made queries shorter since often those filters would be applied to all subsequent queries in a session (i.e. if i'm never interested in derivatives, or only select derivatives, I would only want to say that once at the beginning). 3) To make indexing faster (no longer relevant).

I'm in favor of in the future going in the direction of ancpbids and letting more of this to occur at query time.

That said, this will be a breaking change

What could be done is to ignore the derivatives subtree (in the in-memory graph) by default from queries if derivatives=False is provided.

erdalkaraca · 2022-12-06T12:27:16Z

Also a question: Can you load derivatives that are in a different directory (i.e. not under derivatives/) with ancpbids?

You mean in a deeply nested folder within derivatives? For example, detivatives/pipeline_1/test1/derivative? ancpbids assumes any (sub-)folder within derivatives/ may be a derivative folder (i.e. may contain the dataset_description.json), if not, it will still be able to query files/artifacts.

adelavega · 2022-12-06T22:13:45Z

Also a question: Can you load derivatives that are in a different directory (i.e. not under derivatives/) with ancpbids?

You mean in a deeply nested folder within derivatives? For example, detivatives/pipeline_1/test1/derivative? ancpbids assumes any (sub-)folder within derivatives/ may be a derivative folder (i.e. may contain the dataset_description.json), if not, it will still be able to query files/artifacts.

No I mean a directory not inside of the main directory at all.

For example the main dataset may be at /data/dataset and the derivative may be at /data/fmriprep/derivative_1

adelavega · 2022-12-06T22:14:42Z

Yes, that's what I figured. In general, I think the big philosophical difference between pybids & ancpbids is that pybids has various ways to effectively filter on ingestion, rather than at query time.
This was done for a few reasons: 1) In part due to not having a powerful enough query language to exclude results at query time that were not relevant 2) Also made queries shorter since often those filters would be applied to all subsequent queries in a session (i.e. if i'm never interested in derivatives, or only select derivatives, I would only want to say that once at the beginning). 3) To make indexing faster (no longer relevant).
I'm in favor of in the future going in the direction of ancpbids and letting more of this to occur at query time.
That said, this will be a breaking change

What could be done is to ignore the derivatives subtree (in the in-memory graph) by default from queries if derivatives=False is provided.

This might work. In such cases, I'd like to guide these decisions by the cost of implementing/maintenance vs cost of making a backwards breaking change.

erdalkaraca · 2022-12-07T05:58:35Z

There is a recent/related discussion:

ANCPLabOldenburg/ancp-bids#68

@ghisvail

adelavega · 2022-12-08T21:50:30Z

This is actually currently how you load a derivative dataset as a primary dataset:

layout = BIDSLayout(dataset_path, derivatives=dataset_path)

In a way you say its the main raw and a derivative dataset. Not ideal

ghisvail · 2022-12-09T09:04:02Z

In a way you say its the main raw and a derivative dataset. Not ideal

I have had a good read of the derivative spec recently, and I am somehow feeling there is mismatch between both the pybids and ancpbids APIs and the spirit of the spec.

There are a few specific aspects that caught my attention:

Storage conventions. Derivatives may be stored in-tree (under /derivatives) or out-of-tree. The current API fails at covering the out-of-tree convention.
Non-compliant derivatives may be stored in-tree under the derivatives folder, for instance FreeSurfer's output. How are these taken into account by the derivatives loading logic?
Provenance of derivatives are stored under the SourceDatasets field in dataset_description.json.

So, assuming eager loading is desired, it sounds more intuitive to have it implemented the other way around: when loading a derivative, load the associated source dataset(s). This way, you can avoid loading a derivative which has nothing to do with the source dataset, and load derivatives in both in-tree and out-of-tree storage conventions.

adelavega · 2022-12-09T14:04:21Z

I can only answer for pybids (and agree it's suboptimal), but I have a few comments:

In a way you say its the main raw and a derivative dataset. Not ideal

I have had a good read of the derivative spec recently, and I am somehow feeling there is mismatch between both the pybids and ancpbids APIs and the spirit of the spec.

There are a few specific aspects that caught my attention:
1. Storage conventions. Derivatives may be stored in-tree (under `/derivatives`) or out-of-tree. The current API fails at covering the out-of-tree convention.

I believe pybids fully supports this. You can list an out of tree derivative under derivatives=

2. Non-compliant derivatives may be stored in-tree under the `derivatives` folder, for instance FreeSurfer's output. How are these taken into account by the derivatives loading logic?

I believe if there is no dataset_description.json it will not index. If it does have one, but the files are non compliant, the behavior will depend on the validate argument.

3. Provenance of derivatives are stored under the `SourceDatasets` field in `dataset_description.json`.
So, assuming eager loading is desired, it sounds more intuitive to have it implemented the other way around: when loading a derivative, load the associated source dataset(s). This way, you can avoid loading a derivative which has nothing to do with the source dataset, and load derivatives in both in-tree and out-of-tree storage conventions.

The problem with this suggestion is you can have a complete stand alone derivative, and that may be sufficient enough for an analysis. For example, an fmriprep output is all that is needed for many subsequent fMRI analyses. I think you should be able to load it as a primary dataset, and pybids/ancpbids should detect that its a derivative and index it appropriately.

Let me know if I misunderstood you, because I'm not sure I understand the point of "loading derivatives which have nothing to do with the source dataset".

ghisvail · 2022-12-09T14:43:08Z

The problem with this suggestion is you can have a complete stand alone derivative [...] you should be able to load it as a primary dataset.

I agree with every word. AFAIK however, this is not the case for neither pybids nor ancpbids.

Which means you need to start with a raw dataset and "attach" a derivative to it whether it's in-tree or out-of-tree. My point was that the BIDS provenance metadata suggest the inverse relationship, i.e. you start from a derivative and load it's upstream dataset listed in SourceDatasets.

adelavega · 2022-12-09T14:52:00Z

layout = BIDSLayout(dataset_path, derivatives=dataset_path)

You can with pybids awkwardly, but not the "correct" way.

I have to think more about the SourceDatasets suggestion. Isn't this a URL rather than a relative path? I'm not sure how that would work in practice.

I will note this discussion happened a long time ago when BIDS Derivatives were not a thing, and some expressed preference for sources being nested within derivatives dataset (i.e. derivatives/sources/{datalad_folder). but this organizational scheme did not win out. I would suggest the presence of a derivatives/ folder suggest the current way of doing things.

Regardless, in the meantime we will mimic current behavior but its something to consider for a future breaking release.

ghisvail · 2022-12-09T15:04:22Z

I have to think more about the SourceDatasets suggestion. Isn't this a URL rather than a relative path? I'm not sure how that would work in practice.

My understanding is that it could either be a URL or a relative path.

I would suggest the presence of a derivatives/ folder suggest the current way of doing things.

I agree that the in-tree storage convention is probably the most used. That being said, storing derivatives out-of-tree has its benefits, like allowing raw datasets be shared read-only to avoid accidental file modifications by processing pipelines.

adelavega · 2022-12-09T15:46:19Z

Agree, I'm a big fan of out-of-tree derivatives, but specifically the part about loading a derivative and then loading the referenced raw dataset is the part I'm unsure about. I suppose if its a relative path it could try to load it, and not attempt if its a remote URL

ghisvail · 2022-12-09T15:58:35Z

I suppose if its a relative path it could try to load it, and not attempt if its a remote URL.

Sounds appropriate to me.

Perhaps you'd want to make loading optional with something like:

dataset = BIDSLayout("/path/to/derivative/dataset", load_source_datasets=True)  # False by default.

adelavega · 2023-02-10T16:55:06Z

@effigies suggests keeping multple ancpbids datasets for each out of tree derivative.

That is pybids would handle the multiple derivatives by setting up various ancpbids datasets.

erdalkaraca added the effort: medium Estimated medium effort task label Dec 7, 2022

adelavega mentioned this issue Feb 27, 2023

FIXES: invalid filters, mark all xfails, etc.. #40

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Derivative loading differences #10

Derivative loading differences #10

adelavega commented Dec 5, 2022

adelavega commented Dec 5, 2022

erdalkaraca commented Dec 5, 2022

adelavega commented Dec 5, 2022

adelavega commented Dec 5, 2022

erdalkaraca commented Dec 6, 2022 •

edited

Loading

erdalkaraca commented Dec 6, 2022

adelavega commented Dec 6, 2022

adelavega commented Dec 6, 2022

erdalkaraca commented Dec 7, 2022

adelavega commented Dec 8, 2022 •

edited

Loading

ghisvail commented Dec 9, 2022

adelavega commented Dec 9, 2022

ghisvail commented Dec 9, 2022

adelavega commented Dec 9, 2022 •

edited

Loading

ghisvail commented Dec 9, 2022

adelavega commented Dec 9, 2022

ghisvail commented Dec 9, 2022

adelavega commented Feb 10, 2023 •

edited

Loading

Derivative loading differences #10

Derivative loading differences #10

Comments

adelavega commented Dec 5, 2022

adelavega commented Dec 5, 2022

erdalkaraca commented Dec 5, 2022

adelavega commented Dec 5, 2022

adelavega commented Dec 5, 2022

erdalkaraca commented Dec 6, 2022 • edited Loading

erdalkaraca commented Dec 6, 2022

adelavega commented Dec 6, 2022

adelavega commented Dec 6, 2022

erdalkaraca commented Dec 7, 2022

adelavega commented Dec 8, 2022 • edited Loading

ghisvail commented Dec 9, 2022

adelavega commented Dec 9, 2022

ghisvail commented Dec 9, 2022

adelavega commented Dec 9, 2022 • edited Loading

ghisvail commented Dec 9, 2022

adelavega commented Dec 9, 2022

ghisvail commented Dec 9, 2022

adelavega commented Feb 10, 2023 • edited Loading

erdalkaraca commented Dec 6, 2022 •

edited

Loading

adelavega commented Dec 8, 2022 •

edited

Loading

adelavega commented Dec 9, 2022 •

edited

Loading

adelavega commented Feb 10, 2023 •

edited

Loading