Preliminary DQ / Data Quality experiments and related utilities.
Note: these tools have evolved beyond just "Data Quality"; see the various other sections below.
Fully runnable "uber" jar is available here:
- https://github.com/LucidWorks/data-quality/releases/tag/0.5
- Click the link for data-quality-java-1.0-SNAPSHOT.jar and the download will start
See also Building From Source below.
- Not an official Lucidworks product; not maintained in product release cycles
- Not officially supported by Lucidworks Support
- No pretty doc (other than this README file, some included examples, syntax messages if you don't pass any command line args, and source code comments)
- Very little / no unit tests; see the main code for some helpful notes
- See the TODO section at the end for additional shortcomings :-)
Despite these caveats, I find the tool very useful and pretty darn fast, and many features were added as a direct result of trying to get work done.
empty_fields
- reports which fields are populated, including percentages (com.lucidworks.dq.data.EmptyFieldStats)
term_stats
- token length, and terms > 3 standard deviations from average (com.lucidworks.dq.data.TermStats)
date_checker
- report on date fields, fit to an idealized exponential growth curve (com.lucidworks.dq.data.DateChecker)
code_points
- look for potentially corrupted tokens by finding strings that span the most Unicode classes (com.lucidworks.dq.data.TermCodepointStats)
diff_ids
- documents that are only in A or B (com.lucidworks.dq.diff.DiffIds)
diff_schema
- compares fields, types, dynamic field patterns, etc. (com.lucidworks.dq.diff.DiffSchema)
diff_empty_fields
- compare the field population of two collections (com.lucidworks.dq.diff.DiffEmptyFieldStats)
doc_count
- count the active documents in a collection and send the result to standard out / stdout (com.lucidworks.dq.data.DocCount)
dump_ids
- dump all the IDs from a collection to standard out / stdout (com.lucidworks.dq.data.DumpIds)
delete_by_ids
- delete documents by their ID, passed on the command line, read from a file, or read from standard in / stdin (com.lucidworks.dq.data.DeleteByIds)
solr_to_solr
- copy records from one Solr collection or core to another; can control which fields and records are copied (com.lucidworks.dq.data.SolrToSolr)
solr_to_csv
- export records from a Solr collection or core to a delimited file, such as CSV (com.lucidworks.dq.data.SolrToCsv)
hash_and_shard
- calculate the hash and shard for a document ID (com.lucidworks.dq.util.HashAndShard)
See src/main/resources/sample-reports/
See also Download Prebuilt Binary above.
This project assumes Java 7 (aka Java 1.7).
If you were given a pre-built .jar file, skip to the section Running.
To check out and build the project you'll also need git and maven. Issue the commands:
git clone [email protected]:LucidWorks/data-quality.git
cd data-quality
mvn package
It will create a convenient SELF CONTAINED jar file at target/data-quality-java-1.0-SNAPSHOT.jar
Henceforth we'll refer to this as just data-quality.jar, but substitute the full path and name of the file you created.
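To verify the build, run the jar with no arguments; it should print the list of available commands (see Running below):

java -jar target/data-quality-java-1.0-SNAPSHOT.jar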
Error:
[ERROR] COMPILATION ERROR :
[INFO] -------------------------------------------------------------
[ERROR] Failure executing javac, but could not parse the error:
javac: invalid target release: 1.7
Fix: Try setting JAVA_HOME to your Java 1.7 instance.
For example on Mac OS X, using the stock /usr/libexec/java_home utility:

```
export JAVA_HOME=`/usr/libexec/java_home -v 1.7`
```
Or update ~/.mavenrc
Or define separate Java variables for Java 6 and 7 and then call for Java 7 in the pom.xml
Example Mac OS X ~/.mavenrc
export JAVA_HOME_6=`/usr/libexec/java_home -v 1.6`
export JAVA_HOME_7=`/usr/libexec/java_home -v 1.7`
Additions to pom.xml to specifically call for Java 7:
<project ...>
  ...
  <build>
    <plugins>
      ...
      <plugin>
        <artifactId>maven-surefire-plugin</artifactId>
        <configuration>
          <!-- Set in ~/.mavenrc -->
          <!-- export JAVA_HOME_7=`/usr/libexec/java_home -v 1.7` -->
          <jvm>${env.JAVA_HOME_7}/bin/java</jvm>
        </configuration>
      </plugin>
    </plugins>
  </build>
</project>
I might add this to a future version of this project's pom.xml
In the following examples we refer to data-quality.jar, but the actual file you have might be called something like data-quality-java-1.0-SNAPSHOT.jar; use that full name wherever we say data-quality.jar. Also, if the jar file isn't in your current directory, you should include the full file path.
The jar is self-contained, bundling all other project dependencies (including SolrJ), and is a little over 30 megabytes in size.
The jar was built to be used with the java -jar convention. Since this requires that only one class be declared as primary, we include a CmdLineLauncher class that routes to the other classes. It's also possible to put the jar on your classpath and call specific Java classes directly, provided you know the full package and class name.
Developer Note: The mapping between command name and full class name is in com.lucidworks.dq.util.CmdLineLauncher.java in the static CLASSES field; the "commands" are really just class aliases.
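Example: Since the commands are just class aliases, these two invocations are equivalent; the second bypasses the launcher and calls the class directly from the classpath:

java -jar data-quality.jar empty_fields
java -cp data-quality.jar com.lucidworks.dq.data.EmptyFieldStats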
Example: See what classes and commands are available:
java -jar data-quality.jar
Example output (and list of valid commands):
Pass a command name on the command line to see help for that class:
empty_fields: Look for fields that aren't fully populated.
term_stats: Look at indexed tokens and lengths in each field.
code_points: Look for potentially corrupted tokens. Assumption is corrupted data is more random and will therefore tend to span more Unicode classes.
date_checker: Look at the dates stored in the collection.
diff_empty_fields: Compare fields that aren't fully populated between two cores/collections.
diff_ids: Compare IDs between two cores/collections.
diff_schema: Compare schemas between two cores/collections.
doc_count: Count of active documents in a collection to standard out / stdout.
dump_ids: Dump all the IDs from a collection to standard out / stdout.
delete_by_ids: Delete documents by their ID, either passed on the command line, or from a file, or from standard in / stdin.
solr_to_solr: Copy records from one Solr collection or core to another.
solr_to_csv: Export records from Solr collection or core to delimited file, such as CSV.
hash_and_shard: Calculate hash and shard for a document ID
Example: Show the syntax for a specific command, for example empty_fields:
java -jar data-quality.jar empty_fields
Modified Example: Show the same thing using more traditional Java syntax:
java -jar data-quality.jar com.lucidworks.dq.data.EmptyFieldStats
Example output, using either syntax:
usage: EmptyFieldStats -u http://localhost:8983 [-c <arg>] [-f <arg>] [-h
<arg>] [-i] [-p <arg>] [-s] [-u <arg>]
-c,--collection <arg> Collection/Core for Solr, Eg: collection1
-f,--fields <arg> Fields to analyze, Eg: fields=name,category,
default is all fields
-h,--host <arg> IP address for Solr, default=localhost
-i,--ids Include IDs of docs with empty fields. WARNING:
may create large report
-p,--port <arg> Port for Solr, default=8983
-s,--stored_fields Also check stats of Stored fields. WARNING: may
take lots of time and memory for large
collections
-u,--url <arg> URL for Solr, OR set host, port and possibly
collection
If you'll be doing this frequently you might wish to write a shell script wrapper.
data-quality.sh (Linux, Mac, etc.)
#!/bin/bash
JAR="/full/path/data-quality-java-1.0-SNAPSHOT.jar"
java -jar "$JAR" $*
On Unix, don't forget to chmod +x data-quality.sh
Run with data-quality.sh empty_fields
data-quality.cmd (Windows)
@echo off
set JAR=c:\full\path\data-quality-java-1.0-SNAPSHOT.jar
java -jar "%JAR%" %*
Run with data-quality empty_fields
General rules:
- Give the command_name first, if using java -jar syntax.
- For most classes with a main, running with no arguments will print the command line syntax.
- Do not use "-h" for help; "-h" is short for "--host", not "--help". Running with no arguments gives syntax help.
Set the full URL:
- -u | --url http://.....
Or just set portions of it:
- -h | --host localhost
- -p | --port 8983 or 8888, etc
- -c | --collection demo_shard1_replica1
For example, to get information about partially populated fields:
java -jar data-quality.jar empty_fields --host localhost --collection demo_shard1_replica1
The idea is that you're referring to two Solr instances, A and B:
- Lowercase single letters refer to Solr instance A
- Uppercase single letters refer to Solr instance B
- Long options have the suffix "_a" or "_b" added.
For example, to compare IDs of 2 cores, the following commands are equivalent:
java -jar data-quality.jar diff_ids -h localhost -p 8983 -H localhost -P 8984
java -jar data-quality.jar diff_ids --host_a localhost --port_a 8983 --host_b localhost --port_b 8984
DiffSchema can also read from XML files or automatically provide a Solr default schema.
A few arguments are specific to only 1 or 2 commands, either because they don't make sense elsewhere or because they're experimental. If an option becomes popular, it could be added to other commands.
Example: For partially populated fields, include all the actual IDs of docs with missing values.
java -jar data-quality.jar empty_fields --ids --host localhost --collection demo_shard1_replica1
The --ids option only exists in this one report at the moment, and can generate a very long report!
All three of these examples do the same thing but using different command line syntax:
java -jar data-quality.jar dump_ids -u http://localhost:8983/solr/collection1 > collection1_ids.txt
java -jar data-quality.jar dump_ids -h localhost -c collection1 > collection1_ids.txt
java -jar data-quality.jar dump_ids --host localhost --collection collection1 > collection1_ids.txt
Find IDs that are in the old collection but are not in the new collection:
java -jar data-quality.jar diff_ids --url_a http://localhost:8983/solr/old_collection --url_b http://localhost:8983/solr/new_collection --mode a_only --output_file old_records.txt
Note: You can also compare a live collection to a file containing document IDs. This is useful if you cannot access both solr instances at the same time; you can run on one system, then ftp or scp the file, and compare it to a second system.
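For example, a sketch of that workflow (the hostnames here are hypothetical, and running diff_ids with no arguments will show the exact option for reading IDs from a file):

java -jar data-quality.jar dump_ids -u http://hostA:8983/solr/collection1 > hostA_ids.txt
scp hostA_ids.txt user@hostB:

Then on hostB, run diff_ids comparing the live collection against hostA_ids.txt.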
Read IDs from a file and remove those records from the collection; this time we just specify the host and collection, using the short form of the options:
java -jar data-quality.jar delete_by_ids -h localhost -c collection1 -f bad_ids.txt
Copy data from an old Solr instance to a new instance:
java -jar data-quality.jar solr_to_solr --url_a http://old_server:8983/solr/old_collection --url_b http://new_server:8983/solr/new_collection --exclude_fields timestamp,text_en --xml
Remember, to see all the options available from solr_to_solr just run it without any options:
java -jar data-quality.jar solr_to_solr
Example output:
Copy records from one Solr collection or core to another.
usage: SolrToSolr --url_a http://localhost:8983/collection1 --url_b
http://localhost:8983/collection2 [-b] [-c <arg>] [-C <arg>] [-f <arg>]
[-F <arg>] [-H <arg>] [-h <arg>] [-i] [-l] [-P <arg>] [-p <arg>] [-q
<arg>] [-U <arg>] [-u <arg>] [-x]
Useful for tasks such as copying data to/from Solr clusters, migrating between
Solr versions, schema debugging, or synchronizing Solr instances. Can ONLY
COPY Stored Fields, though this is the default for many fields in Solr. In
syntax messages below, SolrA=source and SolrB=destination. Will use Solr
"Cursor Marks", AKA "Deep Paging", if available which is in Solr version 4.7+,
see https://cwiki.apache.org/confluence/display/solr/Pagination+of+Results and
SOLR-5463
Options:
-b,--batch_size Batch size, 1=doc-by-doc. 0=all-at-once but be
careful memory-wise and 0 also disables deep paging
cursors. Default=1000
-c,--collection_a <arg> Collection/Core for SolrA, Eg: collection1
-C,--collection_b <arg> Collection/Core for SolrB, Eg: collection2
-f,--include_fields <arg> Fields to copy, Eg: include_fields=id,name,category
Make sure to include the id! By default all stored
fields are included except system fields like
_version_ and _root_. You can also use simple
globbing patterns like billing_* but make sure to
use quotes on the command line to protect them from
the operating system. Field name and pattern
matching IS case sensitive unless you set
ignore_case. Patterns do NOT match system fields
either, so if you really need a field like _version_
then add the full name to include_fields not using
a wildcard. Solr field names should not contain
commas, spaces or wildcard pattern characters. Does
not use quite the same rules as dynamicField
pattern matching, different implementation. See
also exclude_fields
-F,--exclude_fields <arg> Fields to NOT copy over, Eg:
exclude_fields=timestamp,text_en Useful for
skipping fields that will be re-populated by
copyField in SolrB schema.xml. System fields like
_version_ are already skipped by default. Use
literal field names or simple globbing patterns
like text_*; remember to use quotes on the command
line to protect wildcard characters from the
operating system. Excludes override includes when
comparing literal field names or when comparing
patterns, except that literal fields always take
precedence over patterns. If a literal field name
appears in both include and exclude, it will not be
included. If a field matches both include and
exclude patterns, it will not be included. However,
if a field appears as a literal include but also
happens to match an exclude pattern, then the
literal reference will win and it WILL be included.
See also include_fields
-H,--host_b <arg> IP address for SolrB, destination of records,
default=localhost
-h,--host_a <arg> IP address for SolrA, source of records,
default=localhost
-i,--ignore_case Ignore UPPER and lowercase differences when
matching field names and patterns, AKA case
insensitive; the original form of the fieldname
will still be output to the destination collection
unless output_lowercase_names is used
-l,--lowercase_names Change fieldnames to lowercase before submitting to
destination collection; does NOT affect field name
matching. Note: May create multi-valued fields from
previously single-valued fields, Eg: Type=food,
type=fruit -> type=[food, fruit]; if you see an
error about "multiple values encountered for non
multiValued field type" this setting can be changed
in SolrB's schema.xml file. There is no
output_uppercase since that would complicate the id
field. See also ignore_case
-P,--port_b <arg> Port for SolrB, default=8983
-p,--port_a <arg> Port for SolrA, default=8983
-q,--query <arg> Query to select which records will be copied; by
default all records are copied.
-U,--url_b <arg> URL SolrB, destination of records, OR set host_b
(and possibly port_b / collection_b)
-u,--url_a <arg> URL for SolrA, source of records, OR set host_a
(and possibly port_a / collection_a)
-x,--xml Use XML transport (XMLResponseParser) instead of
default javabin; useful when working with older
versions of Solr, though slightly slower. Helps fix
errors "RuntimeException: Invalid version or the
data in not in 'javabin' format",
"org.apache.solr.common.util.JavaBinCodec.unmarshal
", or similar errors.
Have you ever wondered which shard a document will wind up in? This can be useful when testing, if you suspect your shards are not equally full. This could happen, for example, if document keys happen to be generated with a hash algorithm similar to that used by Solr (MurmurHash3).
As you may know, once documents are indexed, you can always find out which shard they were put in by including [shard] in the field list:
http://localhost:8983/solr/collection1/select?q=*&fl=*,[shard]
Or more compact output:
http://localhost:8983/solr/collection1/select?q=*&fl=id,[shard]&wt=csv
But you can also use a utility that's included in this toolkit, hash_and_shard, to view the hash of a document ID and see which shard it would be routed to.
To figure out the 32-bit hash value, it only needs the document ID.
But to figure out which shard it would be routed to, it also needs to know the total number of shards. And this is only an approximation; if you've split shards this output won't be correct.
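Here's a rough sketch, in plain Java, of how an evenly divided 32-bit hash range maps to shard numbers. This is my own illustration, not the project's code: it assumes equal slices starting at 0x80000000 and ignores Solr's rounding of shard boundaries (discussed below), so boundary values can differ slightly from hash_and_shard's output.

```java
// Rough sketch only: assumes numShards equal hash slices starting at
// 0x80000000, ignoring Solr's boundary rounding and any shard splitting.
public class ShardSketch {
  static int shardFor(int hash, int numShards) {
    // Flip the sign bit so 0x80000000 maps to 0; this matches the shard
    // ordering shown in the report output below.
    long shifted = (hash & 0xFFFFFFFFL) ^ 0x80000000L;
    long sliceSize = 0x100000000L / numShards; // size of each equal slice
    // Clamp so the rounding remainder stays in the last slice.
    return (int) Math.min(shifted / sliceSize, numShards - 1) + 1;
  }
  public static void main(String[] args) {
    System.out.println(shardFor(0xd8ced634, 4)); // hash of "doc1" -> shard 2
  }
}
```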
Here's how to find out the hash for "doc1", and how it'll be routed in a 4-shard system:
java -jar data-quality.jar hash_and_shard doc1 4
This gives the output:
docId: "doc1"
32-bit Hash (signed decimal int): -657533388
32-bit Hash (unsigned dec int): 3637433908
32-bit Hash (hex): 0xd8ced634
32-bit Hash (binary): 11011000110011101101011000110100
Number of Shards: 4
Shard # 1
Range: 0x80000000 to 0xbfffffff
Shard # 2
Range: 0xc0000000 to 0xffffffff
contains 0xd8ced634
Shard # 3
Range: 0x00000000 to 0x3fffffff
Shard # 4
Range: 0x40000000 to 0x7fffffff
Shard boundaries are inclusive. Running with 3 shards instead of 4 gives more interesting output for shard boundaries:
java -jar data-quality.jar hash_and_shard doc1 3
docId: "doc1"
32-bit Hash (signed decimal int): -657533388
32-bit Hash (unsigned dec int): 3637433908
32-bit Hash (hex): 0xd8ced634
32-bit Hash (binary): 11011000110011101101011000110100
Number of Shards: 3
Shard # 1
Range: 0x80000000 to 0xd554ffff
Shard # 2
Range: 0xd5550000 to 0x2aa9ffff
contains 0xd8ced634
Shard # 3
Range: 0x2aaa0000 to 0x7fffffff
The ranges might look a bit confusing:
- Remember that Java uses only signed integers, so hex numbers starting with 8 or above are actually negative numbers; the shards therefore ARE sorted numerically from smallest to largest (see the snippet below).
- Further, Solr likes to put shard boundaries at certain powers of 2, rather than using plain integer division.
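A quick illustration of the signed-integer point, using the hash of "doc1" from above (plain Java, nothing project-specific):

```java
public class SignedDemo {
  public static void main(String[] args) {
    int hash = 0xd8ced634; // Java ints are signed 32-bit values
    System.out.println(hash);                      // -657533388 (signed decimal)
    System.out.println(hash & 0xFFFFFFFFL);        // 3637433908 (unsigned decimal)
    System.out.println(Integer.toHexString(hash)); // d8ced634
  }
}
```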
If you only give the utility a document ID, but not the number of shards, it'll just show the 32-bit hash.
You can also include the -q option for very compact output: just the docID, then a space and a checksum, and, if you also passed the number of shards, a final space and the shard number it would be routed to. This compact output is useful for scripting! (Unlike other DQ utilities, the -q must come last.)
Here's a script to check the simple doc IDs "1" through "10":
calc-shards.sh
#!/bin/bash
JAR=data-quality.jar
SHARDS=3 # try values like 2, 3, 4, 7, 23!
echo
echo Predicting final shard for $SHARDS shards
echo
for (( i = 1; i <= 10; i++ ))
do
java -jar "$JAR" hash_and_shard $i $SHARDS -q
done
echo
echo Reminder: Those predictions were for $SHARDS shards
Summary of output:
id hash shard
1 0x9416ac93 1
2 0x0129e217 2
3 0x0fc7a1b4 2
4 0xe131cc88 2
5 0x531a35e4 3
6 0x27fa7cc0 2
7 0x23ea8628 2
8 0xbd920017 1
9 0x248be6a1 2
10 0x86e4093f 1
All under src/main/java/com/lucidworks/dq/util/
- General:
- Mostly static methods, for easy/safe reuse
- SolrUtils - SolrJ Wrappers!
- Example code showing how to use SolrJ for more than just searching!
- Source code comments show some equivalent HTTP URL syntax
- Get all values from a field
- Indexed vs. Stored fields
- Wrapper around /terms
- Wrapper around /admin/luke
- Wrapper around /schema/...
- Wrapper around /clustering; requires -Dsolr.clustering.enabled=true on Solr's Java command line
- Grabbing Facet values
- Using Solr Stats
- Traversing SolrJ NamedList and SimpleOrderedMap collection data types
- Whether your Solr instance supports the new "Cursor Marks", AKA "Deep Paging"; see https://cwiki.apache.org/confluence/display/solr/Pagination+of+Results and SOLR-5463 (a minimal sketch of this pattern appears just after this utilities list)
- SetUtils:
- inAOnly, inBOnly
- Union, Intersection
- choice of destructive (slightly faster) or non-destructive (safer! probably what you want)
- Stable maps... usually preserves insertion order
- head, tail, reverse
- sortMapByValues
- DateUtils:
- to / from various formats
- workaround to SolrJ's habit of sometimes returning dates as strings
- corrects for timezone issues
- StringUtils:
- Glob Patterns and Regex Utils
- StatsUtils:
- sum, min, max
- average, standardDeviation
- Note: Solr can also do this, which is often faster
- LeastSquares line fit and Exponential curve fitting
- LLR / Log-Likelihood Ratio:
- more advanced statistics
- start at Log Likelihood Ratio / G2
- may have +/- sign issue
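For reference, here is a minimal SolrJ sketch of the cursor-mark / deep-paging pattern mentioned above (Solr 4.7+). This is my own illustration of the technique using the stock SolrJ 4.x API, not this project's actual code; the URL and collection name are placeholders:

```java
import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.SolrServerException;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;
import org.apache.solr.common.params.CursorMarkParams;

public class CursorMarkSketch {
  public static void main(String[] args) throws SolrServerException {
    HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");
    SolrQuery query = new SolrQuery("*:*");
    query.setRows(1000);
    query.setSort("id", SolrQuery.ORDER.asc); // cursors require a sort on the unique key
    String cursor = CursorMarkParams.CURSOR_MARK_START; // "*"
    while (true) {
      query.set(CursorMarkParams.CURSOR_MARK_PARAM, cursor);
      QueryResponse rsp = server.query(query);
      for (SolrDocument doc : rsp.getResults()) {
        System.out.println(doc.getFieldValue("id")); // process each page of documents
      }
      String next = rsp.getNextCursorMark();
      if (cursor.equals(next)) break; // cursor didn't advance: no more documents
      cursor = next;
    }
  }
}
```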
TODO:
- Doesn't yet support ZK ensemble syntax
- Blog posts w/ code snippets
- Pre-Built downloadable .jar
- Add Java 7 specific parameters to pom.xml ?
- Consider adding --ids to more classes
- Refactor to be more consistent about when data is actually fetched, when tabulations are actually performed, etc. Ideally allow for an empty constructor, then setters, then a "run now" mode.
- Call for "is clustering enabled"
- Call to enable dynamic schemas, field guessing, etc.
- Maybe wrappers for simple collection maint, aliases, etc
- Then refactor command line wrapper
- util.SolrUtils is getting pretty large...
- Maybe use logging instead of println... although that drags in library and config issues, warning messages, etc. Dealing with slf4j warnings if you're a bit new to command line java is a hassle.
- Maybe use a real reporting framework... but worried about overhead...
- Unit tests: would need mock/static solr cores
- LLR: verify sign, package with report, command line args, etc
- Curve fitting: alternative to Least Squares
- Maybe include .sh and .cmd scripts, but also need overall .zip packaging
- Not very friendly for non-command-line users
- Ajax/HTML5 wrapper might be nice
- Fix indenting to be consistently just 2 spaces
- Javadoc