- None is now treated as the missing data indicator. Deprecation warnings are issued for older types of missing data indicators
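As a rough illustration of the None convention, the sketch below shows a record whose missing field is None and a small helper that skips such fields. The record layout and the helper name `present_fields` are assumptions made for this example; they are not part of the library.

```python
# Hypothetical record: the missing phone number is represented by None,
# not by an empty string or some other placeholder.
record = {"name": "Acme Corp", "address": "12 Main St", "phone": None}


def present_fields(rec):
    """Return the names of fields that actually hold data (illustrative helper)."""
    return [field for field, value in rec.items() if value is not None]


print(present_fields(record))
```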
Features
- Handle FuzzyCategoricalType in datamodel
Features
- Speed up learning
- Parallelize sampling
- Optional CRF Edit Distance
Support for Python 3.4 added. Support for Python 2.6 dropped.
Features
- Windows OS supported
- train method has argument for not considering index predicates
- TfIDFNGram Index Predicate added (for shorter strings)
- SuffixArray Predicate
- Double Metaphone Predicates
- Predicates for numbers, OrderOfMagnitude, Round
- Set Predicate OrderOfCardinality
- Final, learned predicates list will now often be smaller without loss of coverage
- Variables refactored to support external extensions like https://github.com/datamade/dedupe-variable-address
- Categorical distance, regularized logistic regression, affine gap distance, canonicalization have been turned into separate libraries.
- Simplejson is now a dependency
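To make the number predicates listed above (OrderOfMagnitude, Round) more concrete, here is a minimal sketch of how such blocking keys could be computed. The function names `order_of_magnitude_key` and `round_key` are hypothetical, and the library's own predicates may differ in detail.

```python
import math


def order_of_magnitude_key(value):
    """Blocking key from the rounded base-10 logarithm of a positive number."""
    if value is None or value <= 0:
        return ()  # no key for missing or non-positive values
    return (str(int(round(math.log10(value)))),)


def round_key(value):
    """Blocking key from the number rounded to the nearest integer."""
    if value is None:
        return ()
    return (str(int(round(value))),)


# Records whose numeric field yields the same key land in the same block.
print(order_of_magnitude_key(1200), order_of_magnitude_key(950))
print(round_key(12.4), round_key(12.3))
```

Under this sketch, two values share a block whenever their keys are equal, so nearby numbers are compared while distant ones are not.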
Features
- Individual record cluster membership scores
- New predicates
- New Exists Variable Type
Bug Fixes
- Latlong predicate fixed
- Set TFIDF canopy working properly
Features
- Sampling methods now use blocked sampling
Version 0.7.0 is backwards compatible, except for the match method of the Gazetteer class
Features
- new index, unindex, and match methods in Gazetteer matching. Useful for streaming matching
Version 0.6.0 is not backwards compatible.
Features
- new Text, ShortString, and exact string types
- multiple variables can be defined on same field
- new Gazetteer linker for matching dirty records against a master list
- performance improvements, particularly in memory usage
- canonicalize function in dedupe.convenience for creating a canonical representation of a cluster of records
- tons of bugfixes
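One simple way to picture a canonical representation of a cluster is to take, for each field, the most common non-missing value among the records. The sketch below does only that. It is an illustration under that assumption, with a hypothetical function name, and the canonicalize function in dedupe.convenience may choose values differently.

```python
from collections import Counter


def most_common_canonical(cluster):
    """Build one record from a cluster by taking each field's most frequent value."""
    canonical = {}
    fields = {field for record in cluster for field in record}
    for field in fields:
        values = [r[field] for r in cluster if r.get(field) is not None]
        # Counter.most_common(1) returns [(value, count)] for the top value.
        canonical[field] = Counter(values).most_common(1)[0][0] if values else None
    return canonical


cluster = [
    {"name": "Acme Corp", "city": "Chicago"},
    {"name": "Acme Corp", "city": None},
    {"name": "ACME Corporation", "city": "Chicago"},
]
print(most_common_canonical(cluster))
```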
API breaks
- when initializing an ActiveMatching object, variable_definition replaces field_definition and is a list of dictionaries instead of a dictionary. See the documentation for details
- also when initializing a Matching object, num_processes has been replaced by num_cores, which now defaults to the number of cpus on the machine
- when initializing a StaticMatching object, settings_file is now expected to be a file object not a string. The readTraining, writeTraining, writeSettings methods also all now expect file objects
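For code written against the older dictionary form, the change to a list of dictionaries can be sketched as a small conversion. The helper name `to_variable_definition`, the "field" and "type" keys, and the example field names are assumptions for illustration; check the documentation for the exact keys each variable type expects.

```python
def to_variable_definition(field_definition):
    """Convert a {field_name: {options}} mapping into a list of dictionaries."""
    variables = []
    for field_name, options in field_definition.items():
        entry = {"field": field_name}  # the "field" key is an assumption here
        entry.update(options)
        variables.append(entry)
    return variables


# Old style: one dictionary keyed by field name.
old_style = {"name": {"type": "String"}, "city": {"type": "String"}}

# New style: a list with one dictionary per variable.
new_style = to_variable_definition(old_style)
print(new_style)
```

Because the new form is a list, the same field name can appear in more than one entry, which a dictionary keyed by field name could not express.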
Version 0.5 is not backwards compatible.
Features
- Special-case code for linking two datasets that are each individually unique
- Parallel processing using python standard library multiprocessing
- Much faster canopy creation using zope.index
- Asynchronous active learning methods
API breaks
- duplicateClusters has been removed, it has been replaced by match and matchBlocks
- goodThreshold has been removed, it has been replaced by threshold and thresholdBlocks
- the meaning of train has changed. To train from training file use readTraining. To use console labeling, pass a dedupe instance to the consoleLabel function
- The convenience function dataSample has been removed. It has been replaced by the sample methods
- It is no longer necessary to pass frozendicts to Matching classes
- blockingFunction has been removed and been replaced by the blocker method