Build the ingestion pipeline #42

matthewcarbone · 2022-10-21T22:02:16Z

Building the ingestion pipeline

We are working with Eli to develop a pipeline for uploading his XAS beamline data into aimmdb. Specifically, this issue aims to accomplish the following:

Summary

For now, we will work from a .dat file. Lines starting with # are comments (some of which carry critical metadata); all other lines are space-delimited columns of data.
Using the existing xas schema, each channel (essentially every column other than the energy) will be read into aimmdb as an energy/mu pair: the energy column is used as-is, and the mu column holds one of the many channels. The chosen channel is recorded in the metadata as measurement_type. Eli's code below provides a good starting point. It is partly pseudocode and still needs some work.

import numpy as np
import pandas as pd


MEASUREMENT_INSTRUCTIONS = {
    "transmission": {
        "name": "transmission",
        "numerator": "it",
        "denominator": "i0",
        "log": True,
        "invert": True,
        "col_name": "mu_trans",
    },
    "fluorescence": {
        "name": "fluorescence",
        "numerator": "iff",
        "denominator": "i0",
        "log": False,
        "invert": False,
        "col_name": "mu_fluo",
    },
}


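def extract_mu(path, measurement_kind):
    """Return (data, metadata) for one measurement kind in a .dat file."""

    # The file is space-delimited and its "#" lines are comments, so tell
    # pandas about both. This assumes the column names sit on the first
    # non-comment line; adjust if they live in the commented header instead.
    df = pd.read_csv(path, sep=r"\s+", comment="#")

    measurement_description = MEASUREMENT_INSTRUCTIONS[measurement_kind]

    energy = df["energy"]

    mu = (
        df[measurement_description["numerator"]]
        / df[measurement_description["denominator"]]
    )

    if measurement_description["log"]:
        mu = np.log10(mu)

    if measurement_description["invert"]:
        mu = -mu

    # Also read the metadata from the file, include all commented lines, but
    # we need to pick out the particularly important databroker unique id
    # (still to do). The "comments" key is a placeholder name.
    with open(path) as f:
        comments = [
            line.strip().lstrip("#").strip()
            for line in f
            if line.lstrip().startswith("#")
        ]
    metadata = {
        "measurement_type": measurement_description["name"],
        "comments": comments,
    }

    # The xas schema expects exactly two columns: energy and mu
    data = pd.DataFrame({"energy": energy, "mu": mu})

    return data, metadata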
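
As a minimal sketch of the file-reading step described in the summary, the snippet below separates the # comment lines from the space-delimited data. The helper names (split_comments_and_data, read_dat) are hypothetical, and the column names are passed in by the caller because this issue does not say where the .dat file stores them.

```python
import io

import pandas as pd


def split_comments_and_data(text):
    """Separate '#' comment lines from the data lines of a .dat file."""
    comments, data_lines = [], []
    for line in text.splitlines():
        stripped = line.strip()
        if not stripped:
            continue  # ignore blank lines
        if stripped.startswith("#"):
            # Keep the comment text without the leading '#'
            comments.append(stripped.lstrip("#").strip())
        else:
            data_lines.append(stripped)
    return comments, data_lines


def read_dat(text, columns):
    """Hypothetical helper: return (raw DataFrame, list of comment strings).

    `columns` is supplied by the caller (an assumption of this sketch).
    """
    comments, data_lines = split_comments_and_data(text)
    df = pd.read_csv(
        io.StringIO("\n".join(data_lines)),
        sep=r"\s+",  # columns are separated by one or more spaces
        header=None,
        names=columns,
    )
    return df, comments
```

Calling read_dat(open(path).read(), ["energy", "i0", "it", "iff"]) would then give the raw columns plus every commented line, from which the databroker unique id still has to be picked out.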

Specific steps

Create a module aimmdb.ingest
Create a file aimmdb/ingest/eli.py (we'll rename this to the name of Eli's beamline later).
Create a single function (ingest) which takes a single path as an argument and returns a pd.DataFrame (the data) and a dict (the metadata).
Don't forget that the pd.DataFrame columns must be energy and mu. The actual column we use for mu will change depending on the channel we're looking at.
In Eli's examples, we only have "transmission" and "fluorescence". Eli has provided instructions (code above) on how to process these particular types of data and how they should be represented in aimmdb.
We MUST document every type of processing we do (see above code) before the data is uploaded into aimmdb. I recommend a README file in aimmdb.ingest for now, until we move to a more standard documentation solution.
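
The steps above could be sketched roughly as follows, assuming the raw columns have already been loaded into a DataFrame. The helpers compute_mu, ingest_all_channels and describe_processing are hypothetical names, not existing aimmdb functions, and the README line format is only a suggestion. MEASUREMENT_INSTRUCTIONS is copied from Eli's code so the sketch is self-contained.

```python
import numpy as np
import pandas as pd

# Copied from Eli's code above so that this sketch is self-contained.
MEASUREMENT_INSTRUCTIONS = {
    "transmission": {
        "name": "transmission", "numerator": "it", "denominator": "i0",
        "log": True, "invert": True, "col_name": "mu_trans",
    },
    "fluorescence": {
        "name": "fluorescence", "numerator": "iff", "denominator": "i0",
        "log": False, "invert": False, "col_name": "mu_fluo",
    },
}


def compute_mu(raw, measurement_kind):
    """Build the two-column (energy, mu) frame for one channel."""
    spec = MEASUREMENT_INSTRUCTIONS[measurement_kind]
    mu = raw[spec["numerator"]] / raw[spec["denominator"]]
    if spec["log"]:
        mu = np.log10(mu)
    if spec["invert"]:
        mu = -mu
    return pd.DataFrame({"energy": raw["energy"], "mu": mu})


def ingest_all_channels(raw, file_metadata):
    """Hypothetical helper: one (data, metadata) pair per known channel."""
    results = {}
    for kind, spec in MEASUREMENT_INSTRUCTIONS.items():
        needed = {"energy", spec["numerator"], spec["denominator"]}
        if not needed.issubset(raw.columns):
            continue  # this file does not contain the channel
        # Record which channel was used for mu, as the issue requires.
        metadata = dict(file_metadata, measurement_type=spec["name"])
        results[kind] = (compute_mu(raw, kind), metadata)
    return results


def describe_processing(measurement_kind):
    """Render one README line stating how mu is derived for a channel."""
    spec = MEASUREMENT_INSTRUCTIONS[measurement_kind]
    expr = f"{spec['numerator']} / {spec['denominator']}"
    if spec["log"]:
        expr = f"log10({expr})"
    if spec["invert"]:
        expr = f"-{expr}" if spec["log"] else f"-({expr})"
    return f"{spec['name']}: mu = {expr}"
```

Generating the README lines from the same dictionary that drives the processing keeps the documentation from drifting away from what the code actually does.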


matthewcarbone added the enhancement New feature or request label Oct 21, 2022

matthewcarbone self-assigned this Oct 21, 2022
