## Configurations

The Data Station architecture gives a view of the components of one Data Station. Compare it to a slice
in the core services diagram, including the Data Vault at the core. As stated before, there are
two slices in this diagram that are not Data Stations, but they can be viewed as variations on, or different
configurations of, the Data Station architecture. Apart from these, there is a temporary configuration that will
be used during the migration of datasets from the legacy repository system EASY to the Data Stations and the Data Vault.
Details about the various configurations can be found in the following sections:

- DataverseNL
- Vault as a Service
- Mirroring to EASY
## Core Services

**Migration in progress**

This documentation describes the future core services of DANS. As of writing, some of these services are still under
development. For the current status, contact the Data Station Manager of the relevant Data Station.
### Overview

The DANS Core Services are centered around the concept of a Data Station. A Data Station is a repository system that
is used for depositing, curating and disseminating datasets, as well as creating long-term preservation copies of those
datasets. These long-term preservation copies are stored in the DANS Data Vault. The following diagram gives a
high-level overview:
### Data Stations

Each Data Station targets a part of the scientific research community. There is a Data Station for each of:

- Archaeology
- Social Sciences and Humanities
- Life Sciences
- Physical and Technical Sciences
The Data Stations use Dataverse as their repository system. Dataverse is an open source repository
system developed by Harvard University. The Data Stations create a long-term preservation copy of each dataset in the
DANS Data Vault.
### Other services

The Data Stations will be trusted repositories. They are displayed as blue slices in the diagram. The grey slices are
not Data Stations, as they are not in themselves full trusted repositories. In the technical architecture, however,
they are described as variations on the Data Station architecture, as they are built using mostly the same components.
### DataverseNL

DataverseNL is a Dataverse installation that offers deposit and dissemination services. Datasets
stored in DataverseNL are also preserved in the DANS Data Vault. However, curation of the datasets is the responsibility
of the DataverseNL customer.
### Vault as a Service

Vault as a Service offers an interface for automated deposit of datasets directly into the DANS Data Vault.
This service can be used as a building block in a customer's own archival workflow.
### Interfaces

The services have both human and machine interfaces. This is represented in the diagram by the people and computer
icons. Note that Vault as a Service has no human interface. For more information, see External interfaces.
## Data Vault Storage Root

### Introduction

The Data Vault is subdivided into Storage Roots, each one containing the long-term preservation copies for either a
Data Station or a "Vault as a Service" (VaaS) customer. The Data Vault Storage Root (DVSR) can be viewed as a type of
interface, or exchange format, albeit an atypical one, as it is aimed at future users rather than current ones.
**dd-data-vault interface**

Do not confuse the DVSR with the service interface of dd-data-vault, which is an internal microservice interface that
is used by the transfer service to store data in the Data Vault.
### OCFL repositories

The DANS Data Vault is implemented as an array of OCFL repositories. OCFL stands for Oxford Common File Layout. It is
a community specification for the layout of a repository that stores versioned digital objects. Each repository, or
"storage root," is one Data Vault Storage Root (DVSR). The Data Stations each have their own DVSR, as does each
customer of the Vault as a Service.
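To give an impression of the layout that the OCFL specification prescribes, a minimal storage root might look like the
tree below. This is an illustrative sketch based on the OCFL specification; the object path and identifier are made up
and not taken from an actual DANS storage root.

```
[storage root]
├── 0=ocfl_1.1                    # "Namaste" file declaring the OCFL version
├── ocfl_layout.json              # describes how object IDs map to directory paths
└── 235/2da/7280/2352da7280f1     # one OCFL Object (path derived from its ID)
    ├── 0=ocfl_object_1.1
    ├── inventory.json            # manifest and version history of the object
    ├── inventory.json.sha512
    └── v1
        ├── inventory.json
        ├── inventory.json.sha512
        └── content               # the actual files of version 1
```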
### Serialization in layers

OCFL repositories can be serialized in different ways, for example as a directory structure on a file system, or as
objects in an object store. The DANS Data Vault uses the SURF Data Archive tape storage. The tape storage system that
is used by Data Archive organizes files in a file-folder structure, so in principle serialization should be the same
as to a disk-based file system, from OCFL's perspective. However, the tape storage system requires a minimum file size
of 1GB, which is much larger than the typical data file stored in the DANS Data Vault. To meet this requirement, the
OCFL repositories are stored as a series of DMF TAR archives (see the note below on how these differ from regular TAR
archives), each of which is larger than 1GB. Each archive forms a layer. To restore the OCFL repository, the layers
must be extracted in the correct order. For a more detailed description of the layers, see the documentation of
dans-layer-store-lib.
**DMF TAR**

The tape storage system used by Data Archive is managed by DMF, which stands for Data Migration Facility. SURF has
developed a utility called dmftar: "dmftar is a wrapper for the Linux tool gnutar and automatically creates
multi-volume archive files (...) and can incorporate the transfer of the files to the archive file system if
necessary." dmftar stores the TAR volumes in a directory with the extension .dmftar, which also contains an index and
a checksum file.
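The layered serialization implies that restoring a storage root is an ordered extraction. The sketch below illustrates
the idea in Python under stated assumptions: plain TAR files with a hypothetical naming scheme that sorts in creation
order. The real setup uses dmftar volumes and the logic in dans-layer-store-lib.

```python
import tarfile
from pathlib import Path

def restore_storage_root(layer_dir: Path, target: Path) -> None:
    """Rebuild an OCFL storage root by extracting its layers in order.

    Later layers may overwrite files written by earlier ones, so the
    extraction order is essential. The layer naming scheme used here
    (layer-0001.tar, layer-0002.tar, ...) is hypothetical.
    """
    target.mkdir(parents=True, exist_ok=True)
    for layer in sorted(layer_dir.glob("layer-*.tar")):
        with tarfile.open(layer) as tar:
            tar.extractall(target)  # later layers overwrite earlier files

restore_storage_root(Path("/archive/my-dvsr"), Path("/restore/my-dvsr"))
```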
### Dataset model mapping

OCFL is a generic storage model. It does not define the concept of a dataset. The DANS archival systems (Data Stations
and Vault as a Service), on the other hand, are built around the dataset concept. The mapping between the two models
is as follows:
| DANS dataset model | OCFL model          |
|--------------------|---------------------|
| Dataset            | OCFL Object         |
| Dataset Version    | OCFL Object Version |
| Datafile           | OCFL Content File   |
### Versions

Each Dataset Version Export (DVE) is stored in a separate OCFL Object Version. This means that there is a 1-to-1
mapping between a DVE and an OCFL Object Version. Note, however, that it is possible that one dataset version is
exported multiple times. The mapping of a dataset version to an OCFL Object Version is therefore a 1-to-n relationship.
**A multiple exports scenario**

One scenario in which a dataset version is exported multiple times is when the dataset was updated in the Data Station
without creating a new version. This can be done by a superuser and is known as "updatecurrent". A new Dataset Version
Export will be created, and therefore a new OCFL Object Version will be created as well. The Data Station version
history, however, will not display an additional version.
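The sketch below illustrates the resulting mapping: every DVE becomes a new OCFL Object Version, so re-exporting the
same dataset version produces an additional OCFL Object Version. All identifiers and timestamps are made up.

```python
# Illustrative only: shows how two exports of the same dataset version
# end up as two OCFL Object Versions within one OCFL Object.
ocfl_object_versions = []  # version history of one OCFL Object (one dataset)

def store_dve(dataset_version: str, export_timestamp: str) -> str:
    """Each Dataset Version Export (DVE) becomes a new OCFL Object Version."""
    ocfl_version = f"v{len(ocfl_object_versions) + 1}"
    ocfl_object_versions.append((ocfl_version, dataset_version, export_timestamp))
    return ocfl_version

store_dve("1.0", "2024-01-10T12:00:00Z")  # -> v1
# "updatecurrent": version 1.0 is exported again after an in-place edit
store_dve("1.0", "2024-02-01T09:30:00Z")  # -> v2 (same dataset version, new OCFL version)
store_dve("2.0", "2024-03-15T14:45:00Z")  # -> v3
```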
### Identifying metadata

To identify datasets, versions and data files in the OCFL repository, the following metadata is used:
The full metadata of each dataset version is stored, but the way it is stored depends on the export format used. The
current export format is based on the Dataverse implementation of the RDA Research Data Repository Interoperability WG
recommendations.
## Data Station architecture

### Overview

This document gives an overview of the Data Station architecture. The schema below displays the components of a Data
Station and how they relate to each other. The notation used is not a formal one and is intended to be self-explanatory.
To the extent that it is not, you might want to consult the legend that is included at the end of this page.

### Actors

- Data Station User - a user of the Data Station, typically a customer who downloads or deposits data.
- Data Manager - a user with special privileges, who curates and publishes datasets submitted for review by a user.
- SWORD2 Client - a software client that interacts with the DANS SWORD2 Service to deposit datasets.

### Components
#### Dataverse

> "The Dataverse Project is an open source web application to share, preserve, cite, explore, and analyze research
> data."
In the Data Station this repository system is used for depositing, storing and disseminating datasets, as well as
creating long-term preservation copies of those datasets.
#### Workflows

Dataverse provides event hooks that allow workflows to be configured to run just before and after a publication event.
These workflows can have multiple steps. A step can be implemented as part of Dataverse or as an external service. The
following microservices are configured to run as `PrePublishDataset` workflow steps:

- dd-vault-metadata

The following microservices are candidates to become part of the `PrePublishDataset` workflow in the future:

- dd-virus-scan
The RDA Bag Export flow step is implemented in Dataverse and is used to export an RDA-compliant bag (also called a
"Dataset Version Export" or DVE) for each dataset version after publication (i.e. in the `PostPublishDataset`
workflow). This exported bag is then picked up by dd-transfer-to-vault.
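Putting the pieces together, the publication flow looks roughly like the outline below. This is an illustrative Python
sketch, not Dataverse code; the stub functions stand in for the services described on this page, and the identifier
values shown are made up.

```python
def fill_vault_metadata(dataset: dict) -> bool:
    """Stand-in for dd-vault-metadata: fills in the Vault Metadata fields."""
    dataset["vault_metadata"] = {"nbn": "urn:nbn:nl:ui:13-example"}  # made-up value
    return True

def make_version_public(dataset: dict) -> None:
    dataset["published"] = True

def export_rda_bag(dataset: dict) -> None:
    """Stand-in for the RDA Bag Export step: creates a Dataset Version Export."""
    dataset["dve_exported"] = True  # picked up by dd-transfer-to-vault

def publish(dataset: dict) -> None:
    # PrePublishDataset workflow: a failing step blocks the publication
    for step in (fill_vault_metadata,):
        if not step(dataset):
            raise RuntimeError("pre-publish workflow step failed")
    make_version_public(dataset)
    # PostPublishDataset workflow: export the bag for long-term preservation
    export_rda_bag(dataset)

publish({"doi": "10.5072/example"})  # illustrative persistent identifier
```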
| Docs      | Code                              |
|-----------|-----------------------------------|
| Dataverse | https://github.com/IQSS/dataverse |
| Workflows | Part of the Dataverse code base   |
#### dd-sword2

DANS implementation of the SWORD v2 protocol for automated deposits.
| Docs                    | Code                                                 |
|-------------------------|------------------------------------------------------|
| dd-sword2               | https://github.com/DANS-KNAW/dd-sword2               |
| dd-dans-sword2-examples | https://github.com/DANS-KNAW/dd-dans-sword2-examples |
#### dd-dataverse-authenticator

A proxy that authenticates clients on behalf of Dataverse, using the basic auth protocol or a Dataverse API token. It
is used by dd-sword2 to authenticate its clients by their Dataverse account credentials.
| Docs                       | Code                                                    |
|----------------------------|---------------------------------------------------------|
| dd-dataverse-authenticator | https://github.com/DANS-KNAW/dd-dataverse-authenticator |
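As a rough illustration of the two authentication options, the Python snippet below sends a request with either basic
auth credentials or a Dataverse API token (Dataverse conventionally accepts tokens in the `X-Dataverse-key` header).
The proxy URL is hypothetical, and the exact endpoint and response behavior are assumptions, not documented API.

```python
import requests

PROXY_URL = "https://demo.example.org/dataverse-authenticator"  # hypothetical endpoint

# Option 1: basic auth with a Dataverse account's credentials
resp = requests.get(PROXY_URL, auth=("user@example.org", "s3cret"))

# Option 2: a Dataverse API token, sent in the X-Dataverse-key header
resp = requests.get(PROXY_URL, headers={"X-Dataverse-key": "0000-aaaa-1111-bbbb"})

print(resp.status_code)  # expect 200 when the credentials or token are accepted
```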
#### dd-ingest-flow

Service for ingesting deposit directories into Dataverse.
| Docs           | Code                                        |
|----------------|---------------------------------------------|
| dd-ingest-flow | https://github.com/DANS-KNAW/dd-ingest-flow |
#### dd-validate-dans-bag

Service that checks whether a bag complies with DANS BagIt Profile v1. It is used by dd-ingest-flow to validate bags
that are uploaded via dd-sword2.
| Docs                  | Code                                              |
|-----------------------|---------------------------------------------------|
| dd-validate-dans-bag  | https://github.com/DANS-KNAW/dd-validate-dans-bag |
| DANS BagIt Profile v1 | https://github.com/DANS-KNAW/dans-bagit-profile   |
| DANS schema           | https://github.com/DANS-KNAW/dans-schema          |
#### dd-manage-deposit

Service that manages and maintains information about deposits in a deposit area.
| Docs              | Code                                           |
|-------------------|------------------------------------------------|
| dd-manage-deposit | https://github.com/DANS-KNAW/dd-manage-deposit |
#### dans-datastation-tools

Command line utilities for Data Station application management.
| Docs                   | Code                                                |
|------------------------|-----------------------------------------------------|
| dans-datastation-tools | https://github.com/DANS-KNAW/dans-datastation-tools |
#### dd-virus-scan

A service that scans all files in a dataset for viruses using clamav and blocks publication if a virus is found.
| Docs          | Code                                       |
|---------------|--------------------------------------------|
| dd-virus-scan | https://github.com/DANS-KNAW/dd-virus-scan |
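The core idea can be sketched with clamav's standalone clamscan scanner, which exits with code 0 for a clean file and
1 when a virus is found. This is an illustration of the scan-and-block logic, not dd-virus-scan's actual
implementation; the dataset path is made up, and clamscan must be installed for the snippet to run.

```python
import subprocess
from pathlib import Path

def has_virus(path: Path) -> bool:
    """Scan one file with clamscan (exit code 0 = clean, 1 = virus found)."""
    result = subprocess.run(
        ["clamscan", "--no-summary", str(path)],
        capture_output=True, text=True,
    )
    if result.returncode not in (0, 1):
        raise RuntimeError(f"clamscan error: {result.stderr.strip()}")
    return result.returncode == 1

# Block publication if any file in the dataset is infected.
dataset_dir = Path("/data/dataset-files")  # made-up location
if any(has_virus(f) for f in dataset_dir.rglob("*") if f.is_file()):
    raise SystemExit("virus found: publication blocked")
```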
#### dd-vault-metadata

A service that fills in the "Vault Metadata" for a dataset version. These metadata will be used later on by
dd-transfer-to-vault to catalogue the long-term preservation copy of the dataset version when it is stored on tape.
| Docs              | Code                                           |
|-------------------|------------------------------------------------|
| dd-vault-metadata | https://github.com/DANS-KNAW/dd-vault-metadata |
#### Skosmos

A thesaurus service developed by the National Library of Finland. It is used to serve the external controlled
vocabulary fields.
| Docs    | Code                                |
|---------|-------------------------------------|
| Skosmos | https://github.com/NatLibFi/Skosmos |
#### dd-transfer-to-vault

Service for preparing Dataset Version Exports for storage in the DANS Data Vault. This includes validation,
aggregation into larger files and creating a vault catalog entry for each export.
| Docs                 | Code                                              |
|----------------------|---------------------------------------------------|
| dd-transfer-to-vault | https://github.com/DANS-KNAW/dd-transfer-to-vault |
#### dd-vault-catalog

Service that manages a catalog of all Dataset Version Exports in the DANS Data Vault. It will expose a summary page
for each stored dataset.
| Docs             | Code                                          |
|------------------|-----------------------------------------------|
| dd-vault-catalog | https://github.com/DANS-KNAW/dd-vault-catalog |
#### dd-data-vault

Interface to the DANS Data Vault for depositing and managing Dataset Version Exports.
| Docs          | Code                                       |
|---------------|--------------------------------------------|
| dd-data-vault | https://github.com/DANS-KNAW/dd-data-vault |
#### dd-data-vault-cli

Provides the `data-vault` command line tool for interacting with the DANS Data Vault.
| Docs              | Code                                           |
|-------------------|------------------------------------------------|
| dd-data-vault-cli | https://github.com/DANS-KNAW/dd-data-vault-cli |
#### BRI-GMH

The NBN resolver service operated by DANS in cooperation with the Koninklijke Bibliotheek. It resolves NBN persistent
identifiers to their current location. The resolver is hosted at https://persistent-identifier.nl/.
| Docs and code                                         |
|-------------------------------------------------------|
| NBN                                                   |
| https://github.com/DANS-KNAW/gmh-registration-service |
| https://github.com/DANS-KNAW/gmh-resolver-ui          |
| https://github.com/DANS-KNAW/gmh-meresco              |
#### DANS Data Vault

The DANS long-term preservation archive. It is implemented as an array of OCFL repositories, stored in DMF TAR files
on tape. Each TAR file represents a layer. If the layers are extracted to disk in the correct order, the result is an
OCFL repository. For more details see the docs on the Data Vault internal interface.
| Docs                          |
|-------------------------------|
| SURF Data Archive             |
| OCFL                          |
| Data Vault internal interface |
### Libraries

The components mentioned above use many open source libraries. A couple of these are developed by DANS and are
available on GitHub.
| Library                   | Code                                                   |
|---------------------------|--------------------------------------------------------|
| dans-bagit-lib            | https://github.com/DANS-KNAW/dans-bagit-lib            |
| dans-dataverse-client-lib | https://github.com/DANS-KNAW/dans-dataverse-client-lib |
| dans-java-utils           | https://github.com/DANS-KNAW/dans-java-utils           |
## DataverseNL

The differences between DataverseNL and the Data Stations are mainly contractual and organisational. In DataverseNL,
the customer organization is responsible for the curation of datasets. The technical differences are minimal:

- DataverseNL has no SWORD2 service.
## Deposit directory

A deposit directory is a directory containing:

- deposit files
- deposit properties

```
.
└── deposit-directory
    ├── <deposit files>
    └── deposit.properties
```

### `<deposit files>`

The deposit files are one or more files or directories. Typically, it will be one directory, a bag, and more
specifically one conforming to the DANS BagIt Profile v1. However, applications have different requirements with
respect to the contents and layout of the deposit.

### `deposit.properties`

Processing metadata about the deposit are stored in a properties file called `deposit.properties`. It shall have at
minimum the following properties:
| Key                  | Format                                                    | Description                                                                                        |
|----------------------|-----------------------------------------------------------|----------------------------------------------------------------------------------------------------|
| `creation.timestamp` | ISO 8601 datetime, including timezone and in ms precision |  Date/time when the deposit directory was created                                                  |
| `state.label`        |                                                           | A label indicating the current state of the deposit                                                |
| `state.description`  |                                                           | A human-readable description of the state, or an error message if `state.label` indicates an error |
Applications may use additional properties.
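For illustration, a minimal `deposit.properties` with only the required keys might look as follows. The values,
including the state label, are made-up examples rather than an exhaustive list of actual state labels.

```
creation.timestamp = 2024-05-17T10:15:30.123+02:00
state.label = SUBMITTED
state.description = Deposit is ready for processing
```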
## Development

The following sections discuss development on various parts of a Data Station. Some components are developed by DANS,
other components are provided by open source projects in which DANS may participate.
### DANS microservices

The DANS microservices are based on the dans-module-archetype. This archetype creates a skeleton microservice based on
the DropWizard framework. When working on DANS microservices, please comply with the best practices documented in the
dans-module-archetype documentation.
### Command line tools

DANS command line tools are written in Python. Poetry is used as the build tool. The Python-based modules are released
to the DANS-KNAW PyPI account. PyCharm is the preferred IDE for Python-based projects.
### Dataverse

The architecture overview makes clear that Dataverse plays a key role in the Data Station. That is why DANS is
actively involved in its development via the Dataverse community. When working on Dataverse code, take notice of the
developer docs.
**Debugging Dataverse**

For DANS developers it is not necessary (nor preferable) to set up Dataverse and its dependencies Solr and PostgreSQL
on your development laptop, as described in the developer docs. Instead, you should use the pre-built Vagrant box.
Information can be found in the private repository dd-dtap (only accessible to DANS developers).
### Skosmos

Skosmos is used to serve vocabulary terms to Dataverse. DANS is currently not actively involved in its development,
but it is entirely possible that bug fixes will need to be contributed in the future. The project is written in PHP,
but there is no information on the Skosmos website about the development environment set-up.
### Documentation

The documentation for DANS projects (including this site) is written using mkdocs. The source code for those sites
consists of markdown in combination with other resources, such as images. Images are often created with yEd. The
graphml source code of the images is committed along with the image exports. You should always check that your changes
render correctly. This is made easy by the `start-mkdocs.sh` script in dans-dev-tools.
## Mirroring to EASY

Until a Data Station obtains the Core Trust Seal (CTS), it will mirror all deposited datasets to EASY, thus ensuring
that these datasets are stored in a trusted repository. This temporary configuration swaps out the Transfer Server for
the EASY server.
## External interfaces

The following interfaces are exposed to the outside world:
## About this site

This documentation site discusses the technical details of the core software-based services of DANS, the Dutch
national centre of expertise and repository for research data. It is intended for developers and system
administrators, both within DANS and outside.
For more general information on DANS, its mission and services, see the DANS website.
The documentation discusses the following topics:

- Architecture. The core services of DANS are centered around the concept of a Data Station. Each Data Station has
  basically the same structure; see the section about the Data Station architecture, which is also a jumping-off point
  for more detailed information about the various components of the system.
- Configurations. The basic Data Station configuration has a couple of variations, such as DataverseNL and Vault as a
  Service.
- Interfaces. The Data Stations have a number of interfaces, both internal and external.
- Migration. DANS is currently in the process of migrating all datasets from the legacy repository system EASY to the
  Data Stations and/or the DANS Data Vault.
- Development. The Data Stations are developed by the DANS Core Systems Team.
## Internal interfaces

The Data Station architecture follows the microservices architectural style as far as possible. Dataverse is the only
component that is not a microservice. It is a monolith that is used as a black box.

The interfaces between the microservices fall into the following categories:
## Migration from EASY

EASY (Electronic Archiving SYstem) is the legacy repository system of DANS. It was created in 2007 and has been
operational for over 15 years. It has the Core Trust Seal (CTS). We are currently in the process of migrating all
datasets from EASY to the Data Stations and/or the DANS Data Vault.

For the current status of the migration, contact the Data Station Manager of the Data Station you are interested in.
+SWORD3 not yet available
+The latest version of SWORD is version 3. This version is not yet available in the Data Stations. Currently, we are +focusing on guaranteeing the continued availability of the SWORD2 interface for our existing customers. It is not yet +decided whether we will implement SWORD3 in the future.
+The following documents and examples are available for developers who want to use the DANS SWORD2 service:
+ +The bags that are deposited to the DANS SWORD2 service must comply with the DANS BagIt Profile.
+ + +Clients can also use the DANS Data Vault as a building block in their own archival workflows. To this end DANS offers +Vault as a Service. It exposes the same SWORD2 interface as the Data Stations. Instead of storing the datasets in +Dataverse, they are directly converted into RDA compliant bags and transferred to the Data Vault. In this scenario, +curation as well as dissemination of the datasets remain the responsibility of the customer.
+ +The "Vault as a Service" configuration has mostly the same components as a Data Station. The main difference is that +Dataverse is not part of the configuration. Instead, a new component is introduced to convert the datasets into RDA +compliant bags and transfer them to the Data Vault.
+Service to convert datasets into RDA compliant bags and transfer them to the Data Vault.
+Docs | +Code | +
---|---|
dd-vault-ingest-flow | +https://github.com/DANS-KNAW/dd-vault-ingest-flow | +