
Processing functions should not strip metadata from processed xdata. #16

Open · Brow71189 opened this issue Apr 30, 2020 · 8 comments

Labels: audience - scripting, effort - few weeks, impact - high, needs planning, reach - medium

@Brow71189 (Collaborator) commented Apr 30, 2020

Right now, at least some of them try to add sensible dimensional and intensity calibrations to the output data, but all functions in Core.py strip the metadata dict from their inputs.
Processing functions with one input and one output should simply copy the metadata to their output.
For processing functions with multiple inputs, I'm not sure what the best solution would be. Three ideas:

  1. Use the metadata from the first input: very simple, but probably the worst solution (still better than the current situation, though).
  2. Attach a list of metadata dicts to the output: also very simple, but it can overload the output's metadata a bit. We already have a lot of stuff in there, and multiplying that might make it even harder for the user to browse through it.
  3. Merge the metadata dicts: the most complex, but probably the most elegant solution. We would recursively walk the metadata and put common keys into the result only once. If the values for a key are the same across all inputs, we list the value once; otherwise they get turned into a list with one value per input (a sketch of this follows below).
    Alternatively, we could keep the common key/value pairs in one dict and the "unique" ones in a list of dicts, as in point 2.
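
A minimal sketch of the recursive merge in idea 3; merge_metadata is a hypothetical helper, not an existing Core.py function:

from typing import Any, Dict, List

def merge_metadata(dicts: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Recursively merge metadata dicts: identical values collapse to a
    single entry, differing values become a list with one entry per input."""
    keys: List[str] = []
    for d in dicts:
        for key in d:
            if key not in keys:
                keys.append(key)  # preserve first-seen key order
    merged: Dict[str, Any] = {}
    for key in keys:
        values = [d.get(key) for d in dicts]
        if all(isinstance(v, dict) for v in values):
            merged[key] = merge_metadata(values)  # recurse into sub-dicts
        elif all(v == values[0] for v in values):
            merged[key] = values[0]  # common key, identical value: keep once
        else:
            merged[key] = values  # one value per input, in input order
    return merged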

@cmeyer (Collaborator) commented Apr 30, 2020

  1. Do not copy metadata. Copying almost always results in incorrect metadata and makes it impossible to use metadata for anything useful. For instance, say we try to detect "data acquired from a camera" using metadata, which we currently do. If all data processed from the original carries a copy of its metadata, how are we to distinguish the processed data from the original?
  2. Put source metadata into a sub-tag. This is also not a good idea, as it can propagate a lot of metadata quickly and still makes it difficult to access the metadata.

My plan for this in the future (which may be now) is to provide more metadata functions that allow tracing the sources of the metadata: an operation adds metadata about its sources, and additional functions help access a processed item's sources so that metadata can be grabbed from the source. The obvious downside is when the source is missing.
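
A rough sketch of what such source tracing could look like; both helpers and the "processing" key are hypothetical, not an existing niondata API:

from typing import Any, Callable, Dict, List, Optional

def record_sources(operation: str, source_uuids: List[str]) -> Dict[str, Any]:
    """Build output metadata that records the operation and its sources
    instead of copying the sources' metadata wholesale."""
    return {"processing": {"operation": operation, "sources": list(source_uuids)}}

def lookup_source_metadata(metadata: Dict[str, Any],
                           lookup: Callable[[str], Optional[Dict[str, Any]]]) -> List[Optional[Dict[str, Any]]]:
    """Follow the recorded source UUIDs back to their metadata; 'lookup'
    returns None for a missing source, the downside mentioned above."""
    sources = metadata.get("processing", {}).get("sources", [])
    return [lookup(source_uuid) for source_uuid in sources]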

Another approach would be to get rid of the concept of unstructured metadata completely and define more specific subsets that may be copied. We already do this to some extent with calibrations and data descriptions, and in the near future we may add coordinate systems. The first example might be a "description of processing". Another new example might be "original data acquisition information." This needs some thinking.
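
To illustrate, a hypothetical sketch of such structured, copyable subsets; the class and field names are illustrative only:

import dataclasses
import typing

@dataclasses.dataclass
class AcquisitionInfo:
    """'Original data acquisition information' that is well defined
    enough to be copied safely through processing."""
    detector: str
    high_tension_v: typing.Optional[float] = None
    exposure_s: typing.Optional[float] = None

@dataclasses.dataclass
class ProcessingStep:
    """One entry in a 'description of processing' appended by each operation."""
    operation: str
    parameters: typing.Dict[str, typing.Any] = dataclasses.field(default_factory=dict)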

@cmeyer (Collaborator) commented Apr 30, 2020

Before I close this issue, I'd like to understand your use case. What are you trying to accomplish by copying metadata?

@Brow71189 (Collaborator, Author)

I think there are a lot of processing routines where copying the metadata is valid. Think of things like a Gaussian blur (or any other filter); even aligning a sequence does not invalidate metadata.
Right now, if you process your data in Swift, you always lose all metadata.
If you then export or even just snapshot the result, you have no idea how the data was acquired, because all the information about it is simply gone.
There might be operations that invalidate metadata, but defaulting to dropping everything is the wrong approach in my opinion.

@cmeyer (Collaborator) commented May 4, 2020

I've considered the exporting case before, and my conclusion was that export is a special operation that should consolidate processing information and include metadata from sources. See nion-software/nionswift#397 and nion-software/nionswift#398.

If the user has a way to view sources and metadata of those sources directly in an expanded metadata editor and export has an option for including processing info and metadata from sources, would that satisfy your concerns?

@Brow71189 (Collaborator, Author)

It is better than nothing, but I would still prefer that functions which do not invalidate metadata simply copy it to their results.

@Brow71189 (Collaborator, Author)

New example where this behavior leads to an awkward implementation:

Consider a camera with the new acquire_synchronized capabilities. It returns partial data whose xdata attribute is the SI data, which usually still contains the flyback pixels. So the obvious thing to do would be to crop the data with the following code:

from nion.data import xdata_1_0 as xd

# get the next chunk of synchronized acquisition data from the camera
partial_data = camera.acquire_synchronized_continue(...)
# crop away the flyback pixels along the second axis
xdata = xd.data_slice(partial_data.xdata, (slice(None), slice(-2), ...))

This is great because it keeps the calibrations etc., and it is an API function.
But now we have to call a "private" function to get our metadata dict back:

xdata._set_metadata(partial_data.xdata.metadata)

Why does data_slice strip the metadata dict from the data? This makes absolutely no sense to me...
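
In the meantime, a small wrapper can hide the workaround; this is a hypothetical helper built on the private _set_metadata call from above, so it may break when that API changes:

from nion.data import xdata_1_0 as xd

def data_slice_keep_metadata(source_xdata, slices):
    """Crop like xd.data_slice, but carry the source metadata over."""
    result = xd.data_slice(source_xdata, slices)  # keeps calibrations
    result._set_metadata(source_xdata.metadata)  # private-API workaround
    return result

# usage, cropping the flyback pixels as above:
# xdata = data_slice_keep_metadata(partial_data.xdata, (slice(None), slice(-2), ...))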

@cmeyer (Collaborator) commented Feb 25, 2021

An additional rationale for this issue is described in the similar-to-niondata xarray project, where metadata is only copied when it is explicitly requested or when doing so is unambiguous.

xarray: What is your approach to metadata?
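
For comparison, a minimal illustration of that policy, assuming a recent xarray version (the exact defaults have varied across releases):

import numpy as np
import xarray as xr

da = xr.DataArray(np.arange(4.0), dims=("x",), attrs={"detector": "camera"})
print(da.mean().attrs)                 # {} -- attrs are dropped by default
print(da.mean(keep_attrs=True).attrs)  # {'detector': 'camera'} -- explicit opt-in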

@cmeyer (Collaborator) commented Jan 17, 2023

Additional notes 2021-07-21:

1:

  1. In general, metadata describes what we are looking at, how it was collected, and its calibrations. Obviously none of these things change during an align. An integrate, however, will affect the effective exposure time.
  2. Chris' comment, that an item in the metadata specific to "data acquired from a camera" means we should discard all information, is highly facetious. The proper way to handle this is to have a metadata tag "Origin" that gets set to the origin of the data after copying the metadata. I would also add a tag "Trail" that appends the current Origin to the Trail of the source, so we would see "Trail"="Camera ELA,Align,Integrate,Crop" for a basic workflow (see the sketch after this list).
  3. Losing calibrations is nuts, and it is not helped by the fact that the inspector cannot be used to copy them, since the inspector only shows a heavily rounded version of the calibration.
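
A minimal sketch of the proposed "Origin"/"Trail" tags; the key names follow point 2 above, while the helper itself is hypothetical:

import copy
from typing import Any, Dict

def propagate_metadata(source: Dict[str, Any], new_origin: str) -> Dict[str, Any]:
    """Copy the source metadata, set Origin to the current operation,
    and append that operation to the accumulated Trail."""
    result = copy.deepcopy(source)
    trail = result.get("Trail", "")
    result["Trail"] = f"{trail},{new_origin}" if trail else new_origin
    result["Origin"] = new_origin
    return result

# chaining the example from point 2:
# md = propagate_metadata({}, "Camera ELA")
# for op in ("Align", "Integrate", "Crop"):
#     md = propagate_metadata(md, op)
# md["Trail"] == "Camera ELA,Align,Integrate,Crop"  -> True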

2:

We would be hugely better off if we simply copied the metadata from the parent to the child. There is indeed a chance that this could lead to some confusion, but it would still be an improvement on the current situation, where confusion is guaranteed.

Generally speaking, a more robust method for handling metadata is needed. We need options for handling the complex cases, with sensible defaults for the most common operations (a sketch follows the use cases below).

Common use cases include:

  1. multi-acquire NxM, align, integrate: please do not lose the calibration information (scale or intensity), microscope kV, acquisition detector, etc.
    (this could be 10,000 spectra with a 1 ms acquisition time, or 1,000 images at 1 µs/pixel, or other)
  2. acquire a spectrum, subtract a dark image, multiply by a gain image: please do not lose the calibration information (scale or intensity), microscope kV, acquisition detector, etc.
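
One way to express those options and defaults, as a hypothetical sketch; the enum values and the operation names in the table are illustrative only:

import enum

class MetadataPolicy(enum.Enum):
    COPY = "copy"      # e.g. align, crop, filter: metadata stays valid
    MERGE = "merge"    # multi-input operations: merge as in idea 3 above
    ADJUST = "adjust"  # e.g. integrate: copy, then fix the effective exposure
    DROP = "drop"      # operations whose output metadata would be meaningless

# sensible defaults for the common operations discussed in this thread
DEFAULT_POLICIES = {
    "align": MetadataPolicy.COPY,
    "crop": MetadataPolicy.COPY,
    "gaussian_blur": MetadataPolicy.COPY,
    "integrate": MetadataPolicy.ADJUST,
    "subtract": MetadataPolicy.MERGE,
}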
