Build the ingestion pipeline #42

Open
matthewcarbone opened this issue Oct 21, 2022 · 0 comments
Labels
enhancement New feature or request

Building the ingestion pipeline

We are working with Eli to develop a pipeline for uploading his XAS beam line data into aimmdb. Specifically, this issue should accomplish the following:

Summary

  • Our input (for now) will be a .dat file containing comment lines that start with # (some of which hold critical metadata), followed by space-delimited columnar data.
  • Using the existing xas schema, each channel (that is, each column other than energy) will be read into aimmdb, with energy as the energy column and the channel's values as the mu column. The chosen channel is recorded in the metadata as measurement_type. Eli's code below provides a good starting point; it is partly pseudocode and still needs work.
import numpy as np
import pandas as pd


MEASUREMENT_INSTRUCTIONS = {
    "transmission": {
        "name": "transmission",
        "numerator": "it",
        "denominator": "i0",
        "log": True,
        "invert": True,
        "col_name": "mu_trans",
    },
    "fluorescence": {
        "name": "fluorescence",
        "numerator": "iff",
        "denominator": "i0",
        "log": False,
        "invert": False,
        "col_name": "mu_fluo",
    },
}


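def extract_mu(path, measurement_kind):
    # The .dat file is space-delimited and its comment lines start with "#".
    # Assumption: the first uncommented line holds the column names.
    df = pd.read_csv(path, sep=r"\s+", comment="#")
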
    measurement_description = MEASUREMENT_INSTRUCTIONS[measurement_kind]

    energy = df["energy"]

    mu = (
        df[measurement_description["numerator"]]
        / df[measurement_description["denominator"]]
    )

    if measurement_description["log"]:
        mu = np.log10(mu)

    if measurement_description["invert"]:
        mu = -mu

    # Also read the metadata from the file, include all commented lines, but
    # we need to pick out the particularly important databroker unique id
    metadata = ...

    # aimmdb expects exactly two columns: energy and mu
    df = pd.DataFrame({"energy": energy, "mu": mu})

    return df, metadata
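
The metadata step above is left open. One possible way to collect the commented lines is sketched below. The helper name read_comment_metadata is hypothetical, and the "key: value" layout of the comment lines is an assumption, since the actual header format of Eli's files is not specified in this issue. Every comment line is kept verbatim so that nothing is lost if the assumed layout does not hold.

```python
from pathlib import Path


def read_comment_metadata(path, comment_char="#", separator=":"):
    """Collect comment lines from a data file (hypothetical helper).

    Returns a dict with the raw comment text under "comments" and any
    lines matching the ASSUMED "key: value" layout under "parsed".
    """
    raw_lines = []
    parsed = {}
    for line in Path(path).read_text().splitlines():
        stripped = line.strip()
        if not stripped.startswith(comment_char):
            continue  # data row or column header, not metadata
        content = stripped.lstrip(comment_char).strip()
        if not content:
            continue  # skip empty comment lines
        raw_lines.append(content)
        # Assumed convention: split on the first separator only, so values
        # that themselves contain ":" (e.g. timestamps) stay intact.
        key, sep, value = content.partition(separator)
        if sep and key.strip():
            parsed[key.strip()] = value.strip()
    return {"comments": raw_lines, "parsed": parsed}
```

The databroker unique id would then be looked up in the "parsed" dict under whatever key the beam line actually writes; that key name is not given here and would need to be confirmed with Eli.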

Specific steps

  • Create a module aimmdb.ingest
  • Create a particular file aimmdb/ingest/eli.py (we'll rename this to the name of Eli's beam line later)
  • Create a single function (ingest) which takes a single path as an argument and returns a pd.DataFrame (the data) and dict (metadata).
  • Don't forget that the pd.DataFrame columns must be energy and mu. The actual column we use for mu will change depending on the channel we're looking at.
  • In Eli's examples, we only have "transmission" and "fluorescence". Eli has provided instructions (code above) on how to process these particular types of data and how they should be represented in aimmdb.
  • We MUST document every type of processing we do (see above code) before it gets uploaded into aimmdb. I recommend a README file in aimmdb.ingest for now, until we move to a more standard documentation solution.
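
As one way to keep the README in step with the code, the processing formulas could be generated from the instruction table itself. The sketch below copies MEASUREMENT_INSTRUCTIONS from the code above; the function names describe_processing and render_readme_section are hypothetical, and the output format is only a suggestion.

```python
MEASUREMENT_INSTRUCTIONS = {
    "transmission": {
        "name": "transmission",
        "numerator": "it",
        "denominator": "i0",
        "log": True,
        "invert": True,
        "col_name": "mu_trans",
    },
    "fluorescence": {
        "name": "fluorescence",
        "numerator": "iff",
        "denominator": "i0",
        "log": False,
        "invert": False,
        "col_name": "mu_fluo",
    },
}


def describe_processing(instruction):
    """Return the processing applied to one channel as a formula string."""
    ratio = f"{instruction['numerator']} / {instruction['denominator']}"
    if instruction["log"]:
        expr = f"log10({ratio})"
    else:
        expr = ratio
    if instruction["invert"]:
        # Parenthesize a bare ratio so the sign applies to the whole term
        expr = f"-{expr}" if instruction["log"] else f"-({expr})"
    return f"{instruction['col_name']} = {expr}"


def render_readme_section(instructions):
    """Render one bullet per measurement type for the README."""
    return "\n".join(
        f"- {kind}: {describe_processing(instruction)}"
        for kind, instruction in instructions.items()
    )


print(render_readme_section(MEASUREMENT_INSTRUCTIONS))
```

Generating the text this way means a new measurement type added to the table is documented automatically, although a hand-written explanation of why each transformation is applied would still be needed.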
@matthewcarbone matthewcarbone added the enhancement New feature or request label Oct 21, 2022
@matthewcarbone matthewcarbone self-assigned this Oct 21, 2022