This code controls the export of FlyBase data to Alliance LinkML-compliant JSON
format. FlyBase and Alliance data types are represented by their own classes in
the fb_datatypes.py
and agr_datatypes.py
files, respectively.
There are various handlers dedicated to the transformation of specific
datatypes: e.g., gene_handler.py
.
Specific scripts are then used to call various handlers. A script may call on
multiple handlers for the generation of a single export file: e.g., Alliance
AGMs encompass both FB strains and genotypes. A script may also output multiple
files: e.g., AGM_data_retrieval_curation_gene.py
generates both gene and
allele-gene-association files.
The export files are then uploaded to the Alliance persistent store.
Modifications this repo should be made in new "release" branches corresponding to Alliance LinkML schema release branches: e.g., v1.3.1
.
These release branches should be used to spawn git releases/tags.
These scripts are intended to be run in Docker using GoCD pipelines.
Files should be validated locally before uploading to the Alliance.
Use the Alliance Curation API (details below) to find the latest supported LinkML version.
- Write your code to this LinkML version.
- Note that different objects may be have different version support as the persistent store develops.
The Alliance LinkML-based data schema is held in the agr_curation_schemas repo.
Get an OKTA account (through Chris Grove). You will need this to interact with the persistent store.
The Alliance Curation Site is the main hub. Once logged into the curation site, you'll have access to persistent store data:
- View/update entries.
- View file loading.
- Get your personal
Curation API Token
- required for file uploads and API use. You can also access the persistent store using the Alliance Curation API. - You will need to log into the API with your personal
Curation API Token
. - You can find the
Curation API Token
under your profile (top right corner) in the Alliance Curation Site
See the Alliance A-Team Confluence Site for more info.
Use the Alliance_LinkML_Submission
pipeline in the Reporting_Build
pipeline group to generate, validate and upload files.
With each run, files generated are stored locally in a directory within /data/alliance/
.
- The directory name will include the LinkML version and date (
MMMDD
), which correspond to theLINKML_VERSION
andITERATION
pipeline values.
- Specify pipeline-specific variables.
- Sometimes the FB release used for LinkML submission will not be the latest release build.
- This happens when a release build has just been sent to FB-Indiana, but the public GFF file is not yet available for that release.
- For this reason, do not rely upon the
Reporting_Build
environment values (except forALLIANCETOKEN
). - Specify these variables at the
Alliance_LinkML_Submission
pipeline level.SERVER
- flysql machine where the reporting db is located: e.g.,flysql25
DATABASE
- the name of the db to use (withaudit_chado
): e.g.,production_chado
orfb_2022_05_reporting_audit
RELEASE
- ensure that the release matches the db used: e.g.,2022_05
(reporting) or2022_05_EP3
(production).ITERATION
- the dateMMMDD
on which the pipeline is run.LINKML_VERSION
- the LinkML version for the data: e.g.,v1.3.1
.
The Alliance_LinkML_Submission
pipeline automates these steps:
- Builds directory for data and log output in
/data/alliance/
folder. - Gets HarvDev docker container.
- Builds docker image with FB
alliance-linkml-flybase
and Allianceagr_curation_schemas
repos (the latter for validation). - Fetches data using
alliance-linkml-flybase
scripts from a specified branch or release tag. - Validates data files using the Alliance
validate_agr_schema.py
against thejsonschema/allianceModel.schema
file. - Sends e-mail that files are ready for review (size, contents), then pauses the pipeline.
- Has a separate, manually initiated phase for each file upload (due to dependencies).
As mentioned above, the GoCD pipeline will upload files to the Alliance.
Details of how to upload files are here.
- Your personal
API Curation Token
is required (available from your profile the Alliance Curation Site).
Expect anOK
message upon completion of a successful file upload. There are dependencies: e.g., the ALLELE file should be fully loaded (takes many hours) at the Alliance before uploading the disease file.
Go to the Alliance Curation Site.
- From the left options, choose
Other Links
(at the bottom), thenData Loads
. - For each data type uploaded, there should be a
Direct (LinkML) * Loads
selection on the right. - For each data type, all files ever uploaded are listed.
- Click on the most recent one to see if it's been loaded (it can take hours).
- Review any errors in loading any of the data objects - may require code or data fix.
- Note - there can be load order dependencies: e.g., make sure all alleles are loaded before loading disease annotations.
- If you suspect that load order created problems, click on the pencil icon at the right to initiate a re-load of the file.
- If you want to re-load a file, might be best to first check with the A-Team though.
- There are dependencies on references too (see below).
Loading of most LinkML data is dependent on presence of FB references at the Alliance.
These references are in a separate database, the Alliance Bibliography Central (ABC), handled by the Alliance Blue Team.
The pub schema is based on the distinct agr_schemas model.
As such, FB code of pub submission is in the related alliance-flybase repo.
- See the README for this repo for details about submission and load issues.
Submissions to the ABC are handled by theAlliance_Pub_Export
pipeline (runs as part of theEpicycle
pipeline group).