A FEFF schema #37
Comments
The schema validation requirements are all supported. Tiled core supports updating metadata, such as to move a dataset between “states” as you describe, but we have not worked that feature into aimmdb (which we should think of as a Tiled plugin). With Joe’s departure and a high priority on ingesting new datasets, this may be more than a month away, but it will certainly happen. Would you be polling for the set of empty datasets and using Tiled as a kind of work queue? It lacks the synchronization primitives you would get from Redis or Kafka or Celery. For a single worker this may be fine. If you may grow to multiple concurrent workers, it may be better to move the work queue into a real queue and only store the finished results in Tiled.
I see. I suppose in principle then we could just have two schemas for now: one for completed jobs and one for incomplete jobs. It's certainly a hack but if we can do this then it will let us use the aimmdb framework as is. At least for initial testing. Once this feature is merged in we can just adopt it.
Yes and it would be only a single worker/machine. Basically, on HPC I will have a cronjob or something that, every minute, pings aimmdb for incomplete jobs, and pulls those down (and then pushes them back after they complete). Similarly, every few minutes on local, I will ping aimmdb for completed jobs.
No doubt, and I'm exploring those options too, but I don't have a better database solution than this one right now, and doing this has other indirect benefits, like letting you guys stress test the database a bit more. It's also the path of least resistance for me and Mike, and it will get all of us a nice scientific paper (hopefully!) 😊
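For illustration, the single-worker polling cycle described above could look roughly like the sketch below. Everything here is an assumption made for the example: the two dictionaries stand in for the incomplete and completed schemas (a real setup would query aimmdb instead), and `poll_once` and `run_feff` are hypothetical names, not part of aimmdb or Tiled.

```python
from typing import Callable, Dict

# Hypothetical stand-in for one database entry: metadata keys mapped to strings.
Entry = Dict[str, str]


def poll_once(
    pending: Dict[str, Entry],
    complete: Dict[str, Entry],
    run_feff: Callable[[str], str],
) -> int:
    """Run every pending job that has no completed counterpart yet.

    `pending` and `complete` are keyed by sample_id and stand in for the two
    schemas; `run_feff` is a hypothetical callable that takes the contents of
    feff.inp and returns the contents of feff.out.
    Returns the number of jobs processed in this pass.
    """
    processed = 0
    for sample_id, entry in list(pending.items()):
        if sample_id in complete:
            continue  # finished on an earlier pass, nothing to do
        feff_out = run_feff(entry["feff.inp"])
        # The completed entry keeps the pending metadata and adds the output.
        complete[sample_id] = {**entry, "feff.out": feff_out}
        processed += 1
    return processed
```

With one worker calling `poll_once` from a cron job, no locking is needed. With several concurrent workers, two of them could pick up the same pending entry, which is the synchronization gap mentioned earlier in the thread.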
Ok, after chatting with Mike and seeing Dan's thumbs up, it seems to make the most sense to have two separate schemas. I'm going to update the main post here with the details.
A FEFF schema
In this issue, I'll outline the plan for constructing a schema for FEFF data. We wish to store FEFF data for two purposes:
Point 2 is the more interesting one here. I would like the FEFF schema to allow for two "states" of completeness.
- Pending calculation: the `data` would consist of an empty data frame, with just the column names. The `metadata` would contain just the information required for submitting a job.
- Complete calculation: the `data` will now contain the actual spectral data/FEFF output. The `metadata` will contain output logs in addition to everything contained in the pending calculations.
Schema plan
Instead of one schema for both incomplete and complete jobs, let's have two schemas, one for completed FEFF jobs and one for incomplete jobs. I will detail below (lots of edits).
The data
Completed FEFF jobs
FEFF9 spectra output is quite simple. It consists of columnar data with the following columns:
omega
e
k
mu
mu0
chi
Each column simply contains floats. This should be quite straightforward to implement.
Incomplete FEFF jobs
The DataFrame will have the same columns but will be trivially empty.
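The two data layouts above can be sketched with pandas as follows. This is a minimal illustration, not the aimmdb schema itself: `FEFF_COLUMNS`, `empty_feff_frame` and `is_valid_feff_frame` are hypothetical names introduced for the example, and `float64` is an assumed dtype for "floats".

```python
import pandas as pd

# Column names taken from the list above.
FEFF_COLUMNS = ["omega", "e", "k", "mu", "mu0", "chi"]


def empty_feff_frame() -> pd.DataFrame:
    """Build the trivially empty frame used for an incomplete job."""
    return pd.DataFrame(
        {name: pd.Series(dtype="float64") for name in FEFF_COLUMNS}
    )


def is_valid_feff_frame(df: pd.DataFrame, complete: bool) -> bool:
    """Check column names, float dtypes, and emptiness for the given state."""
    if list(df.columns) != FEFF_COLUMNS:
        return False
    if not all(pd.api.types.is_float_dtype(df[name]) for name in FEFF_COLUMNS):
        return False
    # Completed jobs must carry rows; incomplete jobs must carry none.
    return len(df) > 0 if complete else len(df) == 0
```

Keeping the float dtype on the empty frame means both states share the same column layout, so a completed entry differs from its pending counterpart only in having rows.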
The metadata
Note that complete and incomplete FEFF jobs will be linked by a metadata field analogous to `sample_id`. I think we can actually just call it `sample_id`. For example, a molecule-site pair will have one entry in the incomplete database and one in the complete (once the job is done); these two data points will be linked by this `sample_id`. Always required.
Common metadata that will be searchable:
- `XDIElement` (edge+element pair). Though I do wish to reference Refactor XDIElement into Element and Edge #21 as I feel the name `XDIElement` is misleading... for now we'll stick with it though. Always required.
- `identifier`: string. This can mean a few things, but in particular, for molecules it could mean the SMILES string. It's important that this be searchable because a single molecule may have multiple absorbing sites and therefore multiple FEFF spectra. Always required.
- `absorbing_site_index`: int, zero indexed; always required.
- `calculation_type`: string, either `XANES` or `EXAFS`. Always required.
Completed FEFF jobs
- `feff.out`: string (output file read as a string); always required.
Incomplete FEFF jobs
- `feff.inp`: string (input file read as a string, or perhaps can be decomposed into different blocks); always required.
Comments
@danielballan I know this might not be exactly what you had in mind as far as aimmdb's use cases are concerned, but I would love your feedback on this. We'll be using it for dynamic querying of completed FEFF spectra for inverse design of molecules, and for Mike's really cool frontend GUI for visualizing XAS.
If this idea works we can duplicate the principle for e.g. Gaussian and do geometry optimization.
Finally, this does have a multi-modal aspect, since for a given molecule we'll compute e.g. the C, N and O XANES and use them all for multi-modal structure refinement.
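As a rough sketch of the metadata plan above, the required fields could be checked with plain Python before an entry is uploaded. This is only an illustration under stated assumptions: `check_metadata` and `COMMON_KEYS` are hypothetical names, `XDIElement` is treated as an opaque value whose real type is defined by aimmdb, and completed entries are only checked for `feff.out` as listed in the plan.

```python
from typing import Any, Dict, List

# Fields required for both completed and incomplete FEFF jobs.
COMMON_KEYS = (
    "sample_id",
    "XDIElement",
    "identifier",
    "absorbing_site_index",
    "calculation_type",
)


def check_metadata(metadata: Dict[str, Any], complete: bool) -> List[str]:
    """Return a list of problems; an empty list means the entry looks valid."""
    # Completed jobs carry feff.out, incomplete jobs carry feff.inp.
    file_key = "feff.out" if complete else "feff.inp"
    problems = [
        f"missing {key}"
        for key in COMMON_KEYS + (file_key,)
        if key not in metadata
    ]

    # Zero-indexed site: must be a non-negative int (bool is rejected).
    site = metadata.get("absorbing_site_index")
    if site is not None and (
        not isinstance(site, int) or isinstance(site, bool) or site < 0
    ):
        problems.append("absorbing_site_index must be a non-negative int")

    calc = metadata.get("calculation_type")
    if calc is not None and calc not in ("XANES", "EXAFS"):
        problems.append("calculation_type must be XANES or EXAFS")
    return problems
```

Returning a list of problems rather than raising on the first one makes it easier to report everything wrong with a batch of entries in a single pass.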