Derivative loading differences #10
Comments
ancpbids seems to load all derivatives eagerly, even if derivatives=True is not set. This is fairly important, as you may specifically not want to load derivatives. We may want to discuss whether it's better to "filter" derivatives prior to instantiation of a layout, or at the time of a query. I think the latter is more flexible and generic, but there may be times you really don't want to index a derivative.
I think parsing the dataset structure (on the file system) is fast enough in ancpbids that there is no need to omit parsing of derivatives. I would prefer to ignore derivatives when querying a dataset, which is already possible via the scope parameter.
Yes, that's what I figured. In general, I think the big philosophical difference between pybids & ancpbids is that pybids has various ways to effectively filter on ingestion, rather than at query time. This was done for a few reasons: 1) In part because the query language was not powerful enough to exclude irrelevant results at query time. 2) It also made queries shorter, since those filters would often apply to all subsequent queries in a session (i.e. if I'm never interested in derivatives, or only in select derivatives, I would only want to say that once at the beginning). 3) To make indexing faster (no longer relevant). I'm in favor of moving in the direction of ancpbids in the future and letting more of this occur at query time. That said, this will be a breaking change.
Also a question: Can you load derivatives that are in a different directory (i.e. not under |
What could be done is to exclude the derivatives subtree (in the in-memory graph) from queries by default if derivatives=False is provided.
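The query-time approach suggested here can be sketched roughly as follows. This is a toy in-memory index, not ancpbids or pybids code: `ToyLayout`, its `query` method, and the scope values `"raw"`, `"derivatives"`, and `"all"` are hypothetical names chosen for illustration. Everything is indexed eagerly, and the `derivatives` flag only sets the default scope for later queries.

```python
from pathlib import PurePosixPath


class ToyLayout:
    """Hypothetical stand-in for a layout; not the pybids or ancpbids API."""

    SCOPES = ("all", "raw", "derivatives")

    def __init__(self, files, derivatives=False):
        # Index everything eagerly, including the derivatives/ subtree.
        self.files = [PurePosixPath(f) for f in files]
        # The flag does not affect indexing; it only sets the default scope.
        self.default_scope = "all" if derivatives else "raw"

    @staticmethod
    def _is_derivative(path):
        # Assumption: in-tree derivatives live under a top-level derivatives/ folder.
        return path.parts[:1] == ("derivatives",)

    def query(self, suffix=None, scope=None):
        scope = scope or self.default_scope
        if scope not in self.SCOPES:
            raise ValueError(f"unknown scope: {scope!r}")
        hits = []
        for path in self.files:
            in_deriv = self._is_derivative(path)
            if scope == "raw" and in_deriv:
                continue
            if scope == "derivatives" and not in_deriv:
                continue
            if suffix is not None and not path.name.endswith(suffix):
                continue
            hits.append(str(path))
        return hits
```

With this design a caller who never wants derivatives says so once at construction, while an explicit `scope` on a single query can still reach the already-indexed subtree.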
You mean in a deeply nested folder within derivatives? For example, derivatives/pipeline_1/test1/derivative? ancpbids assumes any (sub-)folder within derivatives/ may be a derivative folder (i.e. may contain the dataset_description.json). If it does not, ancpbids will still be able to query its files/artifacts.
No I mean a directory not inside of the main directory at all. For example the main dataset may be at |
This might work. In such cases, I'd like to guide these decisions by weighing the cost of implementation and maintenance against the cost of making a backwards-incompatible change.
There is a recent/related discussion: |
This is actually currently how you load a derivative dataset as a primary dataset:
In a way, you are saying it's both the main raw dataset and a derivative dataset. Not ideal.
I have had a good read of the derivative spec recently, and I am somehow feeling there is mismatch between both the There are a few specific aspects that caught my attention:
So, assuming eager loading is desired, it sounds more intuitive to have it implemented the other way around: when loading a derivative, load the associated source dataset(s). This way, you can avoid loading a derivative which has nothing to do with the source dataset, and load derivatives in both in-tree and out-of-tree storage conventions. |
I can only answer for pybids (and agree it's suboptimal), but I have a few comments:
I believe pybids fully supports this. You can list an out of tree derivative under
I believe if there is no
The problem with this suggestion is you can have a complete stand alone derivative, and that may be sufficient enough for an analysis. For example, an Let me know if I misunderstood you, because I'm not sure I understand the point of "loading derivatives which have nothing to do with the source dataset". |
I agree with every word. AFAIK however, this is not the case for neither Which means you need to start with a raw dataset and "attach" a derivative to it, whether it's in-tree or out-of-tree. My point was that the BIDS provenance metadata suggest the inverse relationship, i.e. you start from a derivative and load its upstream dataset listed in
You can with pybids awkwardly, but not the "correct" way. I have to think more about the I will note this discussion happened a long time ago when BIDS Derivatives were not a thing, and some expressed preference for sources being nested within derivatives dataset (i.e. Regardless, in the meantime we will mimic current behavior, but it's something to consider for a future breaking release.
My understanding is that it could either be a URL or a relative path.
I agree that the in-tree storage convention is probably the most used. That being said, storing derivatives out-of-tree has its benefits, like allowing raw datasets to be shared read-only to avoid accidental file modifications by processing pipelines.
Agree, I'm a big fan of out-of-tree derivatives, but loading a derivative and then loading the referenced raw dataset is specifically the part I'm unsure about. I suppose if it's a relative path it could try to load it, and not attempt to if it's a remote URL.
Sounds appropriate to me. Perhaps you'd want to make loading optional with something like: dataset = BIDSLayout("/path/to/derivative/dataset", load_source_datasets=True) # False by default.
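The rule discussed above (try a relative path, skip a remote URL, and only do either when asked) might look something like the sketch below. It uses only the standard library and does not touch pybids. `resolve_local_source` and `local_sources` are hypothetical helpers, `load_source_datasets` is the proposed flag rather than an existing parameter, and the reference strings stand in for whatever the derivative's metadata lists as its upstream dataset.

```python
import os
from pathlib import Path
from urllib.parse import urlsplit


def resolve_local_source(derivative_root, reference):
    """Return a local path for `reference`, or None if it looks like a remote URL.

    Assumption: any reference with a URL scheme (http, https, s3, ...) is remote.
    Caveat: a Windows drive path such as C:\\data would also parse as a scheme.
    """
    if urlsplit(reference).scheme:
        return None
    # Interpret the reference relative to the derivative's root folder.
    # normpath collapses ".." lexically, without touching the file system.
    return Path(os.path.normpath(Path(derivative_root) / reference))


def local_sources(derivative_root, references, load_source_datasets=False):
    """List the source datasets that could be loaded locally (opt-in)."""
    if not load_source_datasets:
        return []
    resolved = (resolve_local_source(derivative_root, ref) for ref in references)
    return [path for path in resolved if path is not None]
```

Keeping the flag off by default preserves current behavior, so opting in would not by itself be a breaking change.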
@effigies suggests keeping multiple ancpbids datasets, one for each out-of-tree derivative. That is, pybids would handle the multiple derivatives by setting up various ancpbids datasets.
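One possible reading of this suggestion, as a sketch only: a thin wrapper keeps one independent dataset object per root and fans each query out to the selected ones. `MultiDataset` and `ListDataset` are hypothetical names, and `ListDataset` is a toy stand-in for a loaded dataset; neither pybids nor ancpbids is assumed to expose this interface.

```python
class ListDataset:
    """Toy dataset: a root plus a list of file names (illustration only)."""

    def __init__(self, root, files):
        self.root = root
        self.files = list(files)

    def query(self, suffix=""):
        return [f"{self.root}/{name}" for name in self.files if name.endswith(suffix)]


class MultiDataset:
    """Keeps one independent dataset object per root and fans queries out."""

    def __init__(self, raw, derivatives=()):
        # The raw dataset is registered under the fixed key "raw";
        # each derivative is registered under its own root.
        self.datasets = {"raw": raw}
        for dataset in derivatives:
            self.datasets[dataset.root] = dataset

    def query(self, scope=("raw",), **filters):
        # Only the datasets named in `scope` are consulted, in the given order.
        results = []
        for name in scope:
            results.extend(self.datasets[name].query(**filters))
        return results
```

Because each derivative is its own dataset, it makes no difference to the wrapper whether a derivative sits inside or outside the raw dataset's directory.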
ancpbids eagerly loads derivatives, whereas pybids only loads them when specifically asked to (e.g. derivatives=True).