
Release 0.12.0

@zoyahav released this 20 Feb 19:39

Major Features and Improvements

  • Python 3.5 readiness complete (all tests pass). Full Python 3.5 compatibility
    is expected to be available with the next version of Transform (after
    Apache Beam 2.11 is released).
  • Performance improvements for vocabulary generation when using top_k.
  • Added AnalyzeDatasetWithCache, a new optimized and highly experimental
    API for analyzing a dataset that allows reading and writing the analyzer
    cache.
  • Updated DatasetMetadata to be a wrapper around the
    tensorflow_metadata.proto.v0.schema_pb2.Schema proto. TensorFlow Metadata
    will be the schema used to define data parsing across TFX. The serialized
    DatasetMetadata is now the Schema proto in ASCII format, though the
    previous format can still be read.
  • Changed the ApplySavedModel implementation to use tf.Session.make_callable
    instead of tf.Session.run for improved performance.
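The top_k performance improvement above concerns frequency-ordered vocabulary generation. As a minimal pure-Python sketch of the idea (illustrative only; the function name build_vocab and its parameters are hypothetical, and tft.vocabulary handles this internally over a Beam pipeline):

```python
from collections import Counter

def build_vocab(tokens, top_k=None, frequency_threshold=1):
    """Sketch: order tokens by descending frequency, drop tokens below
    frequency_threshold, and optionally keep only the top_k entries."""
    counts = Counter(tokens)
    entries = [t for t, c in counts.most_common() if c >= frequency_threshold]
    return entries[:top_k] if top_k is not None else entries

# "a" appears 3 times and "b" twice, so with top_k=2 the rarer "c" is pruned.
vocab = build_vocab(["a", "b", "a", "c", "a", "b"], top_k=2)
```

The real implementation never materializes the full count table on one worker; ordering by descending frequency is what makes a top_k cutoff cheap.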

Bug Fixes and Other Changes

  • tft.vocabulary and tft.compute_and_apply_vocabulary now support
    filtering based on adjusted mutual information when
    use_adjusted_mutual_info is set to True.
  • tft.vocabulary and tft.compute_and_apply_vocabulary now take a
    regularization term min_diff_from_avg that adjusts the mutual information
    to zero whenever the difference between the count of the feature with any
    label and its expected count is lower than the threshold.
  • Added an option to tft.vocabulary and tft.compute_and_apply_vocabulary
    to compute a coverage vocabulary, using the new coverage_top_k,
    coverage_frequency_threshold and key_fn parameters.
  • Added tft.ptransform_analyzer for advanced use cases.
  • Modified QuantilesCombiner to use tf.Session.make_callable instead of
    tf.Session.run for improved performance.
  • ExampleProtoCoder now also supports non-serialized Example representations.
  • tft.tfidf now accepts a scalar Tensor as vocab_size.
  • Uses of assertItemsEqual in unit tests have been replaced by
    assertCountEqual.
  • NumPyCombiner now outputs TF dtypes in output_tensor_infos instead of
    numpy dtypes.
  • Added tft.apply_pyfunc, which provides limited support for tf.py_func.
    Note that this is incompatible with serving; see the documentation for
    more details.
  • CombinePerKey now adds a dimension for the key.
  • Depends on numpy>=1.14.5,<2.
  • Depends on apache-beam[gcp]>=2.10,<3.
  • Depends on protobuf==3.7.0rc2.
  • ExampleProtoCoder.encode now converts a feature whose value is None to
    an empty value, whereas previously it did not accept None as a valid
    value.
  • AnalyzeDataset, AnalyzeAndTransformDataset and TransformDataset can now
    accept dictionaries that contain None, which will be interpreted the same
    as an empty list. They will never produce an output containing None.
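The coverage-vocabulary option above keeps rare-but-important tokens that a plain global top_k would drop. A minimal pure-Python sketch of the key-wise coverage idea (illustrative only; the function coverage_vocab and its internals are hypothetical, while the parameter names top_k, coverage_top_k and key_fn mirror the real tft options):

```python
from collections import Counter, defaultdict

def coverage_vocab(tokens, top_k, coverage_top_k, key_fn):
    """Sketch: take the global top_k tokens by frequency, then ensure each
    key (as computed by key_fn) contributes at least its coverage_top_k most
    frequent tokens, even if they missed the global cutoff."""
    counts = Counter(tokens)
    vocab = [t for t, _ in counts.most_common(top_k)]
    per_key = defaultdict(list)
    for token, _ in counts.most_common():  # descending frequency within each key
        per_key[key_fn(token)].append(token)
    for key_tokens in per_key.values():
        for token in key_tokens[:coverage_top_k]:
            if token not in vocab:
                vocab.append(token)
    return vocab

# With top_k=1 only "apple" survives globally, but grouping by first letter
# with coverage_top_k=1 pulls "banana" back in to cover the "b" key.
tokens = ["apple", "apple", "apple", "banana", "banana", "avocado", "blueberry"]
vocab = coverage_vocab(tokens, top_k=1, coverage_top_k=1, key_fn=lambda t: t[0])
```

A coverage_frequency_threshold (not sketched here) would additionally drop per-key tokens below a minimum count before the coverage pass.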

Breaking changes

  • ColumnSchema and related classes (Domain, Axis,
    ColumnRepresentation, and their subclasses) have been removed. To create
    a schema, use from_feature_spec; to inspect a schema, use the
    as_feature_spec and domains methods of Schema. The constructors of
    these classes are replaced by functions that still work when creating a
    Schema, but this usage is deprecated.
  • Requires pre-installed TensorFlow >=1.12,<2.
  • ExampleProtoCoder.decode now converts a feature with an empty value
    (e.g. features { feature { key: "varlen" value { } } }) or a missing key
    for a feature (e.g. features { }) to None in the output dictionary.
    Previously it would represent these with an empty list. This better
    reflects the original example proto and is consistent with TensorFlow
    Data Validation.
  • Coders now return a list instead of an ndarray for a VarLenFeature.
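Taken together, the encode and decode changes in this release establish a symmetric convention: a missing or empty feature corresponds to None, and present values are plain Python lists. A minimal pure-Python sketch of that convention (illustrative only; decode_feature and encode_feature are hypothetical stand-ins, not ExampleProtoCoder's actual internals):

```python
def decode_feature(raw_values):
    """Sketch of the 0.12 decode convention for a VarLenFeature:
    a missing (None) or empty feature becomes None, and present values
    come back as a plain Python list rather than an ndarray."""
    if raw_values is None or len(raw_values) == 0:
        return None
    return list(raw_values)

def encode_feature(value):
    """Sketch of the matching encode convention: None is now accepted
    and written as an empty value."""
    return [] if value is None else list(value)

# Round trip: an empty feature decodes to None, and None encodes back
# to an empty value.
assert encode_feature(decode_feature([])) == []
```

Because decode no longer yields empty lists for absent features, downstream code that tested `if values:` keeps working, but code that relied on `len(values)` for missing features needs a None check.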

Deprecations