- What is data validation?
- Why is data validation difficult?
- How can data validation be made easy?
- Where can I use it?
::: incremental
- Eventually all data is sequences of bits
- Data must conform to expected shapes
- Data validation $=$ check expectations
:::
::: incremental
- completeness: e.g. all records have a year
- constraints: e.g. year < 2022
- consistency: e.g. birth < death (except time-travellers)
:::
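The three kinds of expectation above can be sketched as small checks. This is only an illustration: the record fields (`year`, `birth`, `death`) and the function name `check_record` are made up for the example, and the 2022 bound is simply the one used on the slide.

```python
def check_record(record):
    """Return a list of error messages for one record (a dict)."""
    errors = []

    # Completeness: every record must have a year.
    year = record.get("year")
    if year is None:
        errors.append("missing year")
    # Constraint: year < 2022 (bound taken from the slide).
    elif year >= 2022:
        errors.append("year must be before 2022")

    # Consistency: birth must precede death when both are known.
    birth, death = record.get("birth"), record.get("death")
    if birth is not None and death is not None and birth >= death:
        errors.append("birth must be before death")

    return errors


print(check_record({"year": 1984, "birth": 1903, "death": 1950}))
print(check_record({"birth": 1950, "death": 1903}))
```

An empty list means the record met all three expectations; each violated expectation adds one message.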
- completeness
  - internal: e.g. all authors have names
  - external: e.g. all authors are listed

- code: unit tests against software rot
- data: data validation against propagation of errors
::: incremental
- Big data & data integration: e.g. bibliographic data + knowledge graphs
- Many formats & different expectations
- Diverse validation technologies
:::
::: incremental
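- Custom parser/rules (if ... then ...)
- Schema languages

  | Data format | Schema languages      |
  |-------------|-----------------------|
  | JSON        | JSON Schema           |
  | XML         | XSD, DTD, Schematron  |
  | RDF         | SHACL/ShEx            |
  | String      | RegEx, EBNF           |
  | MARC        | Avram                 |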
:::
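As a small illustration of a schema language for strings, a regular expression can state the expected shape of a value. The pattern below (a four-digit year) and the helper name `is_year` are assumptions chosen for the example, not part of any schema listed above.

```python
import re

# Hypothetical expectation: a year is written as exactly four ASCII digits.
YEAR_PATTERN = re.compile(r"[0-9]{4}")


def is_year(value):
    """True if the whole string matches the pattern."""
    return YEAR_PATTERN.fullmatch(value) is not None


print(is_year("1984"))
print(is_year("84"))
```

Using `fullmatch` rather than `search` matters here: the expectation is about the whole value, not about some substring of it.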
digraph {
node [fontname="sans-serif"];
Bytes -> XML;
{ rank=same; sequence -> Bytes[style=invis]; sequence[shape=plaintext] }
XML -> marcxml;
{ rank=same; tree -> XML[style=invis]; tree[shape=plaintext] }
{ rank=same;
XML -> parser [dir=back,arrowtail=vee]
parser[shape=plaintext]
}
marcxml[label="MARC/XML"];
{ rank=same;
marcxml -> schema [dir=back,arrowtail=vee]
schema[shape=plaintext]
}
marcxml -> MARC;
{ rank=same; fields -> MARC[style=invis]; fields[shape=plaintext] }
MARC -> Custom;
Custom[label="MARC21 Subset"]
{ rank=same;
Custom -> rules [dir=back,arrowtail=vee]
rules[shape=plaintext]
}
}
digraph {
rankdir="LR";
node [shape=box, style=rounded, fontname="sans-serif"]
edge [fontname = "sans-serif"]
Service[label="Validation Service"]
Data[shape=none]
Formats -> Service:sw [label=configuration]
Schemas -> Formats
Data -> Service:nw [penwidth=2,label=validate]
Data -> Service:w [label=errors,dir=back]
}
Request
: data (file, URL, or stream) and format identifier (+ optional version)

Response
: list of errors

Error
: message (+ optional positions)
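The request, response, and error shapes can be written down as a small data model. This is a sketch under assumptions: the class and field names are invented for illustration, and the `validate` function is a placeholder that only shows the return type, not how any real validator works.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Error:
    message: str
    # Optional positions, keyed by locator type (e.g. "jsonpointer").
    position: dict = field(default_factory=dict)


@dataclass
class Request:
    data: bytes                    # the data itself (from a file, URL, or stream)
    format: str                    # format identifier
    version: Optional[str] = None  # optional format version


def validate(request):
    """Placeholder validator: the response is a list of errors (empty if valid)."""
    if not request.data:
        return [Error("data is empty")]
    return []


print(validate(Request(b"", "vzg-article")))
```

Modelling the response as a plain list keeps the valid case trivial: no errors, empty list.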
- Web service to validate data
- Configured with formats and schemas
::: incremental
- HTTP GET & POST
  - raw data or web form file upload
  - Use in any web application (CORS)
- Web interface
- Command line (requires configuration)
:::
https://format.gbv.de/validate
curl https://format.gbv.de/validate/vzg-article \
--data-binary @article.json
[
{
"message": "must be array",
"position": {
"jsonpointer": "/authors"
}
}
]
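The same call can be sketched in Python with the standard library only. The endpoint and the response shape are taken from the example above; the function names are hypothetical. How the service signals problems at the HTTP level is not shown here, so status handling is left out (`urlopen` raises `HTTPError` on 4xx/5xx responses).

```python
import json
import urllib.request

BASE_URL = "https://format.gbv.de/validate"  # public instance from the example


def build_request(format_id, data):
    """Prepare (but do not send) a POST request carrying raw data."""
    return urllib.request.Request(f"{BASE_URL}/{format_id}", data=data, method="POST")


def parse_errors(body):
    """Turn a JSON response body into (message, position) pairs."""
    return [(e["message"], e.get("position", {})) for e in json.loads(body)]


def validate(format_id, data):
    """Send the data and return the reported errors (needs network access)."""
    with urllib.request.urlopen(build_request(format_id, data)) as response:
        return parse_errors(response.read())


# Parsing the sample response shown above, without touching the network.
sample = '[{"message": "must be array", "position": {"jsonpointer": "/authors"}}]'
print(parse_errors(sample))
```

Keeping request building and response parsing separate from the network call makes both easy to check offline.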
- Registry of known formats and schemas
- No local installation required
- Unified API
::: incremental
- Authority based: done by experts is good by definition
- Evidence based: continuous measuring & improving
:::
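One way to make "continuous measuring" concrete is to track a simple number over time, such as the share of records in which a field is filled. The metric, the function name `completeness`, and the sample records are assumptions for illustration only.

```python
def completeness(records, field):
    """Share of records in which `field` is present and non-empty."""
    if not records:
        return 0.0
    filled = sum(1 for r in records if r.get(field) not in (None, ""))
    return filled / len(records)


records = [{"year": 1984}, {"year": None}, {"title": "x"}, {"year": 2001}]
print(completeness(records, "year"))
```

Recomputing such a value after each data load shows whether quality is improving or degrading, rather than assuming it.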
digraph {
rankdir="LR";
node [shape=box, style=rounded, fontname="sans-serif"]
edge [fontname = "sans-serif"]
Service[label="Validation Service"]
Data[shape=none]
Formats -> Service:sw
Schemas -> Formats
Data -> Service:nw [penwidth=2,label=validate]
}
- Validation Server
- Configure formats with schemas
- Public instance format.gbv.de/validate

- Support more schema languages (Avram, EBNF, Schematron, SHACL/ShEx...)
- Support validating MARC21
- Show error context
- Built-in rules of black-box library system 😕
- Validator engines for each schema language (e.g. xmllint) 😐
- Metadata Quality Assurance Framework 😀