
Release 0.12.0

@zoyahav released this 20 Feb 19:39

Major Features and Improvements

  • Python 3.5 readiness complete (all tests pass). Full Python 3.5 compatibility
    is expected to be available with the next version of Transform (after
    Apache Beam 2.11 is released).
  • Performance improvements for vocabulary generation when using top_k.
  • Added AnalyzeDatasetWithCache, a new optimized and highly experimental
    API for analyzing a dataset that allows reading and writing the analyzer
    cache.
  • Updated DatasetMetadata to be a wrapper around the
    tensorflow_metadata.proto.v0.schema_pb2.Schema proto. TensorFlow Metadata
    will be the schema used to define data parsing across TFX. The serialized
    DatasetMetadata is now the Schema proto in ASCII format, though the
    previous format can still be read.
  • Changed the ApplySavedModel implementation to use tf.Session.make_callable
    instead of tf.Session.run for improved performance.
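The top_k performance improvement above concerns frequency-ordered vocabulary generation. As a minimal pure-Python sketch of the idea (illustrative only; the function name build_vocab and its parameters are hypothetical, and tft.vocabulary handles this internally over a Beam pipeline):

```python
from collections import Counter

def build_vocab(tokens, top_k=None, frequency_threshold=1):
    """Sketch: order tokens by descending frequency, drop tokens below
    frequency_threshold, and optionally keep only the top_k entries."""
    counts = Counter(tokens)
    entries = [t for t, c in counts.most_common() if c >= frequency_threshold]
    return entries[:top_k] if top_k is not None else entries

# "a" appears 3 times and "b" twice, so with top_k=2 the rarer "c" is pruned.
vocab = build_vocab(["a", "b", "a", "c", "a", "b"], top_k=2)
```

The real implementation never materializes the full count table on one worker; ordering by descending frequency is what makes a top_k cutoff cheap.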

Bug Fixes and Other Changes

  • tft.vocabulary and tft.compute_and_apply_vocabulary now support
    filtering based on adjusted mutual information when
    use_adjusted_mutual_info is set to True.
  • tft.vocabulary and tft.compute_and_apply_vocabulary now take a
    regularization term min_diff_from_avg that adjusts the mutual information
    to zero whenever the difference between the count of the feature with any
    label and its expected count is lower than the threshold.
  • Added an option to tft.vocabulary and tft.compute_and_apply_vocabulary
    to compute a coverage vocabulary, using the new coverage_top_k,
    coverage_frequency_threshold and key_fn parameters.
  • Added tft.ptransform_analyzer for advanced use cases.
  • Modified QuantilesCombiner to use tf.Session.make_callable instead of
    tf.Session.run for improved performance.
  • ExampleProtoCoder now also supports non-serialized Example representations.
  • tft.tfidf now accepts a scalar Tensor as vocab_size.
  • Uses of assertItemsEqual in unit tests have been replaced by
    assertCountEqual.
  • NumPyCombiner now outputs TF dtypes in output_tensor_infos instead of
    numpy dtypes.
  • Added tft.apply_pyfunc, which provides limited support for tf.py_func.
    Note that this is incompatible with serving; see the documentation for
    more details.
  • CombinePerKey now adds a dimension for the key.
  • Depends on numpy>=1.14.5,<2.
  • Depends on apache-beam[gcp]>=2.10,<3.
  • Depends on protobuf==3.7.0rc2.
  • ExampleProtoCoder.encode now converts a feature whose value is None to
    an empty value, whereas previously it did not accept None as a valid
    value.
  • AnalyzeDataset, AnalyzeAndTransformDataset and TransformDataset can now
    accept dictionaries that contain None, which will be interpreted the same
    as an empty list. They will never produce an output containing None.
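The coverage-vocabulary option above keeps rare-but-important tokens that a plain global top_k would drop. A minimal pure-Python sketch of the key-wise coverage idea (illustrative only; the function coverage_vocab and its internals are hypothetical, while the parameter names top_k, coverage_top_k and key_fn mirror the real tft options):

```python
from collections import Counter, defaultdict

def coverage_vocab(tokens, top_k, coverage_top_k, key_fn):
    """Sketch: take the global top_k tokens by frequency, then ensure each
    key (as computed by key_fn) contributes at least its coverage_top_k most
    frequent tokens, even if they missed the global cutoff."""
    counts = Counter(tokens)
    vocab = [t for t, _ in counts.most_common(top_k)]
    per_key = defaultdict(list)
    for token, _ in counts.most_common():  # descending frequency within each key
        per_key[key_fn(token)].append(token)
    for key_tokens in per_key.values():
        for token in key_tokens[:coverage_top_k]:
            if token not in vocab:
                vocab.append(token)
    return vocab

# With top_k=1 only "apple" survives globally, but grouping by first letter
# with coverage_top_k=1 pulls "banana" back in to cover the "b" key.
tokens = ["apple", "apple", "apple", "banana", "banana", "avocado", "blueberry"]
vocab = coverage_vocab(tokens, top_k=1, coverage_top_k=1, key_fn=lambda t: t[0])
```

A coverage_frequency_threshold (not sketched here) would additionally drop per-key tokens below a minimum count before the coverage pass.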

Breaking changes

  • ColumnSchema and related classes (Domain, Axis,
    ColumnRepresentation, and their subclasses) have been removed. To create
    a schema, use from_feature_spec; to inspect a schema, use the
    as_feature_spec and domains methods of Schema. The constructors of
    these classes are replaced by functions that still work when creating a
    Schema, but this usage is deprecated.
  • Requires pre-installed TensorFlow >=1.12,<2.
  • ExampleProtoCoder.decode now converts a feature with an empty value
    (e.g. features { feature { key: "varlen" value { } } }) or a missing key
    for a feature (e.g. features { }) to None in the output dictionary.
    Previously it would represent these with an empty list. This better
    reflects the original example proto and is consistent with TensorFlow
    Data Validation.
  • Coders now return a list instead of an ndarray for a VarLenFeature.
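Taken together, the encode and decode changes in this release establish a symmetric convention: a missing or empty feature corresponds to None, and present values are plain Python lists. A minimal pure-Python sketch of that convention (illustrative only; decode_feature and encode_feature are hypothetical stand-ins, not ExampleProtoCoder's actual internals):

```python
def decode_feature(raw_values):
    """Sketch of the 0.12 decode convention for a VarLenFeature:
    a missing (None) or empty feature becomes None, and present values
    come back as a plain Python list rather than an ndarray."""
    if raw_values is None or len(raw_values) == 0:
        return None
    return list(raw_values)

def encode_feature(value):
    """Sketch of the matching encode convention: None is now accepted
    and written as an empty value."""
    return [] if value is None else list(value)

# Round trip: an empty feature decodes to None, and None encodes back
# to an empty value.
assert encode_feature(decode_feature([])) == []
```

Because decode no longer yields empty lists for absent features, downstream code that tested `if values:` keeps working, but code that relied on `len(values)` for missing features needs a None check.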

Deprecations