Schema for the Experts@Minnesota project database.
UMN Libraries created this database to complement the Elsevier product, Pure, which we have branded Experts@Minnesota. We did this both to provide alternative means of access to some of the data, as well as to maintain a vendor-independent, UMN-local copy of that data.
Read-only direct SQL access is available to anyone with a UMN internet ID. This database is on the OIT Oracle Hotel,
currently only on the tst instance, with access granted via the oit_expert_rd_all
role, managed by OIT. Each role
member must be an Oracle Hotel account, or "schema" in Oracle-speak. These accounts can be personal internet IDs or
departmental/functional accounts. Note that personal internet IDs must use Duo two-factor authentication, so a
functional/departmental account will be better for non-interactive, automated access. For access to the role and
database connection information, and to optionally create an Oracle Hotel account, send email to: [email protected]
The DBAs at that address will probably tell you to go to the
Access Request Forms on Service Now and fill out the
"IT Hosted Databases" form with the information described above. If you already have an Oracle Hotel account you
want to use, it may be faster to just go to that form directly.
Also, Elsevier has just added to Pure a new JSON-based web services API, which we are currently evaluating. We had originally planned to design and implement our own web services API. However, it seems the Pure JSON-based API may meet user needs, so we are deferring that plan for now.
Announcements and discussion about this database and other Experts@Minnesota services happen on both the Experts Data Warehouse-API Google Group and on the #research-activity channel on Tech-People UMN Slack. Both are open to anyone with a UMN internet ID.
It may be charitable to assign this database an alpha level of maturity. Probably obvious that this repository is a bit of a mess. We want your feedback to help make it better! Our goal was to release a useful, usable product to users as soon as possible. We recognize that we can't give you exactly what you need or want without working directly with you to find out what that is. Please join the user community discussion by following the instructions above.
Of course the Pure data model heavily influenced the design of this data model, but Citation Style Language (CSL) was at least as big an influence. Again, one goal was vendor-independence. CSL is a popular standard used by many citation managers. Another goal was to find a robust data model already implemented in popular web services APIs. csl-data.json was the best such model we could find. Our upcoming web services will adhere to it as closely as possible.
As the diagrams and documentation below should make painfully obvious, the data model is most complex wherever organizations are involved. We especially welcome feedback from users about these parts of the data model. There are likely many improvements we could make to allow for more convenient and performant queries, as well as conceptual clarity and simplicity.
The following diagrams are exports from Oracle SQL Developer.
The following information, in a separate section for each table, is in the database itself, in the form of comments on all tables and columns. We re-produce it here for convenience.
A person, usually an author of research outputs. May be internal or external to UMN.
Column | Description |
---|---|
UUID | Universally-unique ID for the person, generated for this Experts@Minnesota database. |
PURE_UUID | Universally-unique ID for the person in our Elsevier Pure database. |
PURE_ID | Unique ID for the person in our Elsevier Pure database. For UMN persons whose data we loaded into the Elsevier predecessor product, SciVal, this will be the SciVal ID. For other UMN persons whose data we have loaded into Pure, this will be the UMN employee ID (emplid). For UMN-external persons, this will be NULL. Note that because we have not loaded data for all UMN persons into Pure, some UMN persons will be classified as external in Pure. |
ORCID | Open Researcher and Contributor ID for the person. |
SCOPUS_ID | Unique ID for the person in the Elsevier Scopus database. |
HINDEX | An index that attempts to measure both the productivity and impact of the published work of a scientist or scholar. Used only in some disciplines, so for many persons this will be NULL. More info on blog.scopus.com. |
EMPLID | UMN employee ID (emplid). |
INTERNET_ID | UMN internet ID. |
FIRST_NAME | The given name for the person. |
LAST_NAME | The family name for the person. |
PURE_INTERNAL | "Y" if Pure classifies the person as UMN-internal, "N" otherwise. Note that, because we have not loaded data for all UMN persons into Pure, some UMN persons will be classified as external in Pure. |
Research output. Named "pub", short for "publication", due to Oracle character-lenght limits.
Column | Description |
---|---|
UUID | Universally-unique ID for the item, generated for this Experts@Minnesota database. |
PURE_UUID | Universally-unique ID for the item in our Elsevier Pure database. |
SCOPUS_ID | Unique ID for the item in the Elsevier Scopus database. |
PMID | Unique ID for the item in the NCBI PubMed database. |
DOI | Digital Object Identifier for the item. |
TYPE | Publication type or format of the item. See the CSL spec for a list of values. |
TITLE | Primary title of the item. |
CONTAINER_TITLE | Title of the container holding the item (e.g. the book title for a book chapter, the journal title for a journal article). |
ISSUED | Date the item was issued/published. |
ISSUED_PRECISION | Precision of the ISSUED column, in days: 366 (year), 31 (month), 1 (day). |
VOLUME | Volume holding the item (e.g. “2” when citing a chapter from book volume 2). |
ISSUE | Issue holding the item (e.g. “5” when citing a journal article from journal volume 2, issue 5). |
PAGES | Range of pages the item (e.g. a journal article) covers in a container (e.g. a journal issue). |
CITATION_TOTAL | Number of citations of the item. |
ISSN | International Standard Serial Number. |
OWNER_PURE_ORG_UUID | Unique ID for the organization that owns the item in our Elsevier Pure database. |
Associates research outputs with persons (authors).
Column | Description |
---|---|
PUB_UUID | Foreign key to PUB. |
PERSON_UUID | Foreign key to PERSON. |
PERSON_ORDINAL | The position of the person in the author list for the research output in Pure. |
PERSON_ROLE | "author" or "editor". Need to find Pure documentation on any other possible values. |
PERSON_PURE_INTERNAL | "Y" if Pure classified the person as UMN-internal at the time of publication of the research output, "N" otherwise. Note that, because we have not loaded data for all UMN persons into Pure, some UMN persons will be classified as external in Pure. |
FIRST_NAME | The given name for the person as it appears in the author list for the research output in Pure. Note that this may be differ from PERSON.FIRST_NAME. |
LAST_NAME | The family name for the person as it appears in the author list for the research output in Pure. Note that this may be differ from PERSON.LAST_NAME. |
EMPLID | De-normalization column. See the description in PERSON. |
An organization (e.g. university, college, department, etc.) in Pure. May be internal or external to UMN. Pure requires all UMN-internal organizations to be part of a single hierarchy, with UMN itself as the root. Note that sometimes we combine multiple UMN departments into one Pure organization. UMN-external organizations are never part of a hierarchy in Pure, and Pure gives us limited information for them in general.
Column | Description |
---|---|
PURE_UUID | Universally-unique ID for the organization in our Elsevier Pure database. |
PURE_ID | Unique ID for the organization in our Elsevier Pure database. NULL for UMN-external organizations, and some UMN-internal organizations. |
PARENT_PURE_UUID | Universally-unique ID for the parent organization in our Elsevier Pure database. NULL for UMN-external organizations. |
PARENT_PURE_ID | Unique ID for the parent organization in our Elsevier Pure database. NULL for UMN-external organizations, and some UMN-internal organizations. |
PURE_INTERNAL | "Y" if Pure classifies the organization as UMN-internal, "N" otherwise. |
TYPE | "academic", "college", "corporate", "department", "government", "initiative", "institute", "medical", "private non-profit", "university", or "unknown" |
NAME_EN | Name of the organization. Called "name_en" to be consistent with Pure naming, and to indicate that this is an English name. |
NAME_VARIANT_EN | An alternative name of the organization. Called "name_variant_en" to be consistent with Pure naming, and to indicate that this is an English name. |
URL | The website for the organization. |
The hierarchy (tree) of Pure UMN-internal organizations. This tree uses nested sets, as implemented by the Python package sqlalchemy_mptt. However, because Oracle supports recursive queries, this may not be the best implementation. Because parent-child relationships (adjacency lists) already exist in the PURE_ORG table, this entire table may be unnecessary and may go away.
Column | Description |
---|---|
ID | The unique ID for the node. Defined by sqlalchemy_mptt. |
LFT | The left number for the node. Defined by sqlalchemy_mptt. |
PARENT_ID | The unique ID for the parent of the node. Defined by sqlalchemy_mptt. |
level | The depth (i.e. generation) of this node in the tree. Defined by sqlalchemy_mptt. |
RGT | The right number for the node. Defined by sqlalchemy_mptt. |
TREE_ID | The unique ID of the tree that contains the node. Defined by sqlalchemy_mptt, which supports multiple trees in a single table. |
PURE_UUID | See the description in PURE_ORG. |
PURE_ID | See the description in PURE_ORG. |
NAME_EN | See the description in PURE_ORG. |
Associates UMN departments with Pure organizations. Note that many UMN departments may map to one Pure organization.
Column | Description |
---|---|
UMN_DEPT_ID | Unique ID for the UMN department in PeopleSoft. |
UMN_DEPT_NAME | Name of the UMN department in PeopleSoft. De-normalization column. |
PURE_ORG_UUID | Foreign key to PURE_ORG. |
PURE_ORG_ID | De-normalization column. See the description in PURE_ORG. |
Associates persons that Pure classifies as UMN-internal with Pure organizations. We use this table, in addition to PERSON_PURE_ORG, because Pure attaches far more data to UMN-internal persons, some of which we use to ensure row uniqueness. Note that there are four columns in the primary key: PURE_ORG_UUID, PERSON_UUID, JOB_DESCRIPTION, and START_DATE. This is because UMN-internal persons may change positions, and also organization affiliations, over time. There may be multiple rows for the same person in this table.
Column | Description |
---|---|
PERSON_UUID | Foreign key to PERSON. |
PURE_ORG_UUID | Foreign key to PURE_ORG. |
JOB_DESCRIPTION | The description of this job in PeopleSoft. Maybe be better to use a job code here instead. |
EMPLOYED_AS | Always "Academic" for the data we have loaded so far. Uncertain whether we will have other values in the future. |
STAFF_TYPE | "academic" or "nonacademic". |
START_DATE | The date the person started this job with this organization. |
END_DATE | The date the person ended this job with this organization. |
PRIMARY | "Y" if this is the person"s primary organization affiliation, otherwise "N". |
EMPLID | De-normalization column. See the description in PERSON. |
PURE_PERSON_ID | De-normalization column. See the description for PERSON.PURE_ID. |
PURE_ORG_ID | De-normalization column. See the description for PURE_ORG.PURE_ID. |
Associates persons with their organizations.
Column | Description |
---|---|
PERSON_UUID | Foreign key to PERSON. |
PURE_ORG_UUID | Foreign key to PURE_ORG. |
Associates with persons with their organization affiliations at the time of publication of a research output.
Column | Description |
---|---|
PUB_UUID | Foreign key to PUB. |
PERSON_UUID | Foreign key to PERSON. |
PURE_ORG_UUID | Foreign key to PURE_ORG. |
This package uses these environment variables to configure Oracle connections:
Variable | Description |
---|---|
EXPERTS_DB_USER | Database connection username |
EXPERTS_DB_PASS | Database connection password |
EXPERTS_DB_SERVICE_NAME | Oracle database service name (e.g. TNSNAMES definition) |
EXPERTS_DB_HOSTNAME | Oracle database domain name. Necessary for making cx_Oracle connections. |
This project uses SQLAlchemy with Alembic to track database schema changes in migrations.
Alembic has been configured to read the required environment variables
from the environment if already present, or to load them automatically from dotenv files using
dotenv_switch
. Create dotenv files in the
project directory, e.g., .env.test
, .env.prod
, as needed.
dotenv_switch
will use the test
environment by default; instruct it to load a different
environment by setting the APP_ENV
variable. For instance APP_ENV=prod
will cause
dotenv_switch
to load .env.prod
.
This project uses poetry to manage dependencies, so Alembic must be
invoked within a poetry-created virtual environment. One way to do that is by preceding commands
with poetry run
. Examples, including the use of the APP_ENV
variable described above:
# Run migrations against test environment (default)
$ poetry run alembic upgrade head
# Run migrations against prod environment
$ APP_ENV=prod poetry run alembic upgrade head
# Upgrade to a specific revision from alembic/versions
$ poetry run alembic upgrade fd1239632c06
# Downgrade to a previous revision
$ poetry run alembic downgrade fd1239632c06