Skip to content
This repository has been archived by the owner on Aug 8, 2024. It is now read-only.

Derivative loading differences #10

Open
adelavega opened this issue Dec 5, 2022 · 18 comments
Open

Derivative loading differences #10

adelavega opened this issue Dec 5, 2022 · 18 comments
Labels
effort: medium Estimated medium effort task

Comments

@adelavega
Copy link
Collaborator

ANCPBIds eagery loads derivatives, whereas pybids only loads when specifically asked to (e.g. derivatives=True)

@adelavega
Copy link
Collaborator Author

ancpbids seems to load all derivatives eagerly, even if derivatives=True is not set. This actually is fairly important as you may specifically not want to load derivatives. We may want to discuss if its better to "filter" derivatives prior to instantiation of a layout, or at the time of a query. I think the latter is more flexible and generic, but I think there may be times you really don't want to index a derivative.

@erdalkaraca
Copy link
Collaborator

I think parsing the dataset structure (on file system) is fast enough in ancpbids, such that there is no need to omit parsing of derivatives. I would prefer to ignore derivatives when querying a dataset which is already possible via the scope parameter.

@adelavega
Copy link
Collaborator Author

Yes, that's what I figured. In general, I think the big philosophical difference between pybids & ancpbids is that pybids has various ways to effectively filter on ingestion, rather than at query time.

This was done for a few reasons: 1) In part due to not having a powerful enough query language to exclude results at query time that were not relevant 2) Also made queries shorter since often those filters would be applied to all subsequent queries in a session (i.e. if i'm never interested in derivatives, or only select derivatives, I would only want to say that once at the beginning). 3) To make indexing faster (no longer relevant).

I'm in favor of in the future going in the direction of ancpbids and letting more of this to occur at query time.

That said, this will be a breaking change

@adelavega
Copy link
Collaborator Author

Also a question: Can you load derivatives that are in a different directory (i.e. not under derivatives/) with ancpbids?

@erdalkaraca
Copy link
Collaborator

erdalkaraca commented Dec 6, 2022

Yes, that's what I figured. In general, I think the big philosophical difference between pybids & ancpbids is that pybids has various ways to effectively filter on ingestion, rather than at query time.

This was done for a few reasons: 1) In part due to not having a powerful enough query language to exclude results at query time that were not relevant 2) Also made queries shorter since often those filters would be applied to all subsequent queries in a session (i.e. if i'm never interested in derivatives, or only select derivatives, I would only want to say that once at the beginning). 3) To make indexing faster (no longer relevant).

I'm in favor of in the future going in the direction of ancpbids and letting more of this to occur at query time.

That said, this will be a breaking change

What could be done is to ignore the derivatives subtree (in the in-memory graph) by default from queries if derivatives=False is provided.

@erdalkaraca
Copy link
Collaborator

Also a question: Can you load derivatives that are in a different directory (i.e. not under derivatives/) with ancpbids?

You mean in a deeply nested folder within derivatives? For example, detivatives/pipeline_1/test1/derivative? ancpbids assumes any (sub-)folder within derivatives/ may be a derivative folder (i.e. may contain the dataset_description.json), if not, it will still be able to query files/artifacts.

@adelavega
Copy link
Collaborator Author

Also a question: Can you load derivatives that are in a different directory (i.e. not under derivatives/) with ancpbids?

You mean in a deeply nested folder within derivatives? For example, detivatives/pipeline_1/test1/derivative? ancpbids assumes any (sub-)folder within derivatives/ may be a derivative folder (i.e. may contain the dataset_description.json), if not, it will still be able to query files/artifacts.

No I mean a directory not inside of the main directory at all.

For example the main dataset may be at /data/dataset and the derivative may be at /data/fmriprep/derivative_1

@adelavega
Copy link
Collaborator Author

Yes, that's what I figured. In general, I think the big philosophical difference between pybids & ancpbids is that pybids has various ways to effectively filter on ingestion, rather than at query time.
This was done for a few reasons: 1) In part due to not having a powerful enough query language to exclude results at query time that were not relevant 2) Also made queries shorter since often those filters would be applied to all subsequent queries in a session (i.e. if i'm never interested in derivatives, or only select derivatives, I would only want to say that once at the beginning). 3) To make indexing faster (no longer relevant).
I'm in favor of in the future going in the direction of ancpbids and letting more of this to occur at query time.
That said, this will be a breaking change

What could be done is to ignore the derivatives subtree (in the in-memory graph) by default from queries if derivatives=False is provided.

This might work. In such cases, I'd like to guide these decisions by the cost of implementing/maintenance vs cost of making a backwards breaking change.

@erdalkaraca
Copy link
Collaborator

There is a recent/related discussion:

ANCPLabOldenburg/ancp-bids#68

@ghisvail

@erdalkaraca erdalkaraca added the effort: medium Estimated medium effort task label Dec 7, 2022
@adelavega
Copy link
Collaborator Author

adelavega commented Dec 8, 2022

This is actually currently how you load a derivative dataset as a primary dataset:

layout = BIDSLayout(dataset_path, derivatives=dataset_path)

In a way you say its the main raw and a derivative dataset. Not ideal

@ghisvail
Copy link

ghisvail commented Dec 9, 2022

In a way you say its the main raw and a derivative dataset. Not ideal

I have had a good read of the derivative spec recently, and I am somehow feeling there is mismatch between both the pybids and ancpbids APIs and the spirit of the spec.

There are a few specific aspects that caught my attention:

  1. Storage conventions. Derivatives may be stored in-tree (under /derivatives) or out-of-tree. The current API fails at covering the out-of-tree convention.

  2. Non-compliant derivatives may be stored in-tree under the derivatives folder, for instance FreeSurfer's output. How are these taken into account by the derivatives loading logic?

  3. Provenance of derivatives are stored under the SourceDatasets field in dataset_description.json.

So, assuming eager loading is desired, it sounds more intuitive to have it implemented the other way around: when loading a derivative, load the associated source dataset(s). This way, you can avoid loading a derivative which has nothing to do with the source dataset, and load derivatives in both in-tree and out-of-tree storage conventions.

@adelavega
Copy link
Collaborator Author

I can only answer for pybids (and agree it's suboptimal), but I have a few comments:

In a way you say its the main raw and a derivative dataset. Not ideal

I have had a good read of the derivative spec recently, and I am somehow feeling there is mismatch between both the pybids and ancpbids APIs and the spirit of the spec.

There are a few specific aspects that caught my attention:

1. Storage conventions. Derivatives may be stored in-tree (under `/derivatives`) or out-of-tree. The current API fails at covering the out-of-tree convention.

I believe pybids fully supports this. You can list an out of tree derivative under derivatives=

2. Non-compliant derivatives may be stored in-tree under the `derivatives` folder, for instance FreeSurfer's output. How are these taken into account by the derivatives loading logic?

I believe if there is no dataset_description.json it will not index. If it does have one, but the files are non compliant, the behavior will depend on the validate argument.

3. Provenance of derivatives are stored under the `SourceDatasets` field in `dataset_description.json`.

So, assuming eager loading is desired, it sounds more intuitive to have it implemented the other way around: when loading a derivative, load the associated source dataset(s). This way, you can avoid loading a derivative which has nothing to do with the source dataset, and load derivatives in both in-tree and out-of-tree storage conventions.

The problem with this suggestion is you can have a complete stand alone derivative, and that may be sufficient enough for an analysis. For example, an fmriprep output is all that is needed for many subsequent fMRI analyses. I think you should be able to load it as a primary dataset, and pybids/ancpbids should detect that its a derivative and index it appropriately.

Let me know if I misunderstood you, because I'm not sure I understand the point of "loading derivatives which have nothing to do with the source dataset".

@ghisvail
Copy link

ghisvail commented Dec 9, 2022

The problem with this suggestion is you can have a complete stand alone derivative [...] you should be able to load it as a primary dataset.

I agree with every word. AFAIK however, this is not the case for neither pybids nor ancpbids.

Which means you need to start with a raw dataset and "attach" a derivative to it whether it's in-tree or out-of-tree. My point was that the BIDS provenance metadata suggest the inverse relationship, i.e. you start from a derivative and load it's upstream dataset listed in SourceDatasets.

@adelavega
Copy link
Collaborator Author

adelavega commented Dec 9, 2022

layout = BIDSLayout(dataset_path, derivatives=dataset_path)

You can with pybids awkwardly, but not the "correct" way.

I have to think more about the SourceDatasets suggestion. Isn't this a URL rather than a relative path? I'm not sure how that would work in practice.

I will note this discussion happened a long time ago when BIDS Derivatives were not a thing, and some expressed preference for sources being nested within derivatives dataset (i.e. derivatives/sources/{datalad_folder). but this organizational scheme did not win out. I would suggest the presence of a derivatives/ folder suggest the current way of doing things.

Regardless, in the meantime we will mimic current behavior but its something to consider for a future breaking release.

@ghisvail
Copy link

ghisvail commented Dec 9, 2022

I have to think more about the SourceDatasets suggestion. Isn't this a URL rather than a relative path? I'm not sure how that would work in practice.

My understanding is that it could either be a URL or a relative path.

I would suggest the presence of a derivatives/ folder suggest the current way of doing things.

I agree that the in-tree storage convention is probably the most used. That being said, storing derivatives out-of-tree has its benefits, like allowing raw datasets be shared read-only to avoid accidental file modifications by processing pipelines.

@adelavega
Copy link
Collaborator Author

Agree, I'm a big fan of out-of-tree derivatives, but specifically the part about loading a derivative and then loading the referenced raw dataset is the part I'm unsure about. I suppose if its a relative path it could try to load it, and not attempt if its a remote URL

@ghisvail
Copy link

ghisvail commented Dec 9, 2022

I suppose if its a relative path it could try to load it, and not attempt if its a remote URL.

Sounds appropriate to me.

Perhaps you'd want to make loading optional with something like:

dataset = BIDSLayout("/path/to/derivative/dataset", load_source_datasets=True)  # False by default.

@adelavega
Copy link
Collaborator Author

adelavega commented Feb 10, 2023

@effigies suggests keeping multple ancpbids datasets for each out of tree derivative.

That is pybids would handle the multiple derivatives by setting up various ancpbids datasets.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
effort: medium Estimated medium effort task
Projects
None yet
Development

No branches or pull requests

3 participants