Releases: tensorflow/transform

Release 0.22.0

13 May 20:19

Major Features and Improvements

Bug Fixes and Other Changes

  • tft.bucketize_per_key no longer assumes that the keys seen during
    transformation existed in the analysis dataset. If a key is missing, the
    assigned bucket will be -1 (see the sketch after this list).
  • tft.estimated_probability_density, when categorical=True, no longer
    assumes that the values during transformation existed in the analysis dataset,
    and will assume 0 density in that case.
  • Switched analyzer cache representation of dataset keys from using a primitive
    str to a DatasetKey class.
  • tft_beam.analyzer_cache.ReadAnalysisCacheFromFS can now filter cache entry
    keys when given a cache_entry_keys parameter. cache_entry_keys can be
    produced by utilizing get_analysis_cache_entry_keys.
  • Reduced number of shuffles via packing multiple combine merges into a
    single Beam combiner.
  • Switched tft.TransformFeaturesLayer to use the TF 2 tf.saved_model.load API
    to load a previously exported SavedModel.
  • Adds tft.sparse_tensor_left_align as a utility which aligns
    tf.SparseTensors to the left.
  • Depends on avro-python3>=1.8.1,!=1.9.2.*,<2.0.0 for Python 3.5 + macOS.
  • Depends on apache-beam[gcp]>=2.20.0,<3.
  • Depends on tensorflow>=1.15,!=2.0.*,<2.3.
  • Depends on tensorflow-metadata>=0.22.0,<0.23.0.
  • Depends on tfx-bsl>=0.22.0,<0.23.0.
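
Below is a minimal sketch of the new tft.bucketize_per_key behavior, assuming hypothetical features price and category:

```python
import tensorflow_transform as tft

def preprocessing_fn(inputs):
  # Quantile boundaries are computed per 'category' during analysis. At
  # transform time, a 'category' value that was never seen during analysis
  # now yields bucket -1 instead of being assumed to exist.
  return {
      'price_bucket': tft.bucketize_per_key(
          inputs['price'], inputs['category'], num_buckets=10),
  }
```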

Breaking changes

  • tft.AnalyzeDatasetWithCache no longer accepts a flat PCollection as an
    input. Instead, it flattens the datasets in the input_values_pcoll_dict
    input if needed.
  • tft.TransformFeaturesLayer no longer takes a parameter
    drop_unused_features. Its default behavior is now equivalent to having set
    drop_unused_features to True.

Deprecations

Release 0.21.2

04 Mar 21:24

Major Features and Improvements

  • Expanded capability for per-key analyzers to analyze larger sets of keys that
    would not fit in memory, by storing the key-value pairs in vocabulary files.
    This is enabled by passing a per_key_filename to tft.count_per_key and
    tft.scale_to_z_score_per_key.
  • Added tft.TransformFeaturesLayer and
    tft.TFTransformOutput.transform_features_layer to allow transforming
    features for a TensorFlow Keras model (see the sketch after this list).
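
Below is a minimal sketch of the new Keras integration, assuming a hypothetical transform output path and a raw feature x:

```python
import tensorflow as tf
import tensorflow_transform as tft

tft_output = tft.TFTransformOutput('/tmp/transform_output')  # hypothetical path
transform_layer = tft_output.transform_features_layer()

# The layer maps raw features to transformed features, so it can be composed
# with a Keras model that consumes the transformed features.
raw_features = {'x': tf.constant([[1.0], [2.0]])}
transformed_features = transform_layer(raw_features)
```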

Bug Fixes and Other Changes

  • tft.apply_buckets_with_interpolation now handles NaN values by imputing with
    the middle of the normalized range.
  • Depends on tfx-bsl>=0.21.3,<0.22.

Breaking changes

Deprecations

Release 0.21.0

17 Jan 18:57

Major Features and Improvements

  • Added a new version of the census example to demonstrate usage in TF 2.0.
  • New mapper estimated_probability_density to compute either exact
    probabilities (for discrete categorical variables) or approximate density
    over fixed intervals (for continuous variables).
  • New analyzers count_per_key and histogram to return counts of unique
    elements or of values within predefined ranges. Calling tft.histogram on a
    non-categorical value will assign each data point to the appropriate fixed
    bucket and then count for each bucket.
  • Provided capability for per-key analyzers to analyze larger sets of keys
    that would not fit in memory, by storing the key-value pairs in vocabulary
    files. This is enabled by passing a per_key_filename to
    tft.scale_by_min_max_per_key and tft.scale_to_0_1_per_key (see the sketch
    after this list).
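
Below is a minimal sketch of per-key scaling, assuming hypothetical features value and store_id; the large-key-set variant is enabled by additionally passing the filename parameter described above:

```python
import tensorflow_transform as tft

def preprocessing_fn(inputs):
  # Min and max are computed per 'store_id' during analysis, and each
  # 'value' is scaled to [0, 1] within its own key's range.
  return {
      'value_scaled': tft.scale_to_0_1_per_key(
          inputs['value'], inputs['store_id']),
  }
```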

Bug Fixes and Other Changes

  • Added beam counters to log analyzer and mapper usage.
  • Cleaned up deprecated APIs used in the census and sentiment examples.
  • Support Windows-style paths in analyzer_cache.
  • tft_beam.WriteTransformFn and tft_beam.WriteMetadata have been made
    idempotent to allow retrying them in case of a failure.
  • tft_beam.WriteMetadata takes an optional argument write_to_unique_subdir
    and returns the path to which metadata was written. If
    write_to_unique_subdir is True, metadata is written to a unique subdirectory
    under path, otherwise it is written to path.
  • Support non-UTF-8 characters when reading vocabularies in
    tft.TFTransformOutput.
  • tft.TFTransformOutput.vocabulary_by_name now returns bytes instead of str
    with Python 3 (see the sketch after this list).
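
Below is a minimal sketch of the bytes behavior, assuming a hypothetical transform output path and vocabulary name:

```python
import tensorflow_transform as tft

tft_output = tft.TFTransformOutput('/tmp/transform_output')  # hypothetical path
# With Python 3 the returned entries are bytes, not str, so decode explicitly.
vocab = tft_output.vocabulary_by_name('my_vocab')  # hypothetical vocab name
tokens = [entry.decode('utf-8', errors='replace') for entry in vocab]
```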

Breaking changes

Deprecations

Release 0.15.0

21 Oct 21:32

Major Features and Improvements

  • This release introduces initial beta support for TF 2.0. TF 2.0 programs
    running in "safety" mode (i.e. using TF 1.X APIs through the
    tensorflow.compat.v1 compatibility module) are expected to work. Newly
    written TF 2.0 programs may not work if they exercise functionality that is
    not yet supported. If you encounter an issue when using
    tensorflow-transform with TF 2.0, please file an issue at
    https://github.com/tensorflow/transform/issues with instructions on how to
    reproduce it.
  • Performance improvements for preprocessing_fns with many Quantiles
    analyzers.
  • tft.quantiles and tft.bucketize now use new TF core quantiles ops
    instead of contrib ops (see the sketch after this list).
  • Performance improvements due to packing multiple combine analyzers into a
    single Beam Combiner.
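
The switch to core quantile ops does not change the API; below is a minimal sketch, assuming a hypothetical numeric feature x:

```python
import tensorflow_transform as tft

def preprocessing_fn(inputs):
  # Bucket boundaries are learned over the whole dataset during analysis,
  # now via core TF quantile ops rather than contrib ops.
  return {
      'x_bucketized': tft.bucketize(inputs['x'], num_buckets=10),
  }
```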

Bug Fixes and Other Changes

  • Existing analyzer cache is invalidated.
  • Saved transforms now support composite tensors (such as tf.RaggedTensor).
  • Vocabulary's cache coder now supports non utf-8 encodable tokens.
  • Fixes encoding of the tft.covariance accumulator cache.
  • Fixes encoding per-key analyzers accumulator cache.
  • Make various utility methods in tft.inspect_preprocessing_fn support
    tf.RaggedTensor.
  • Moved beam/shared lib to tfx-bsl. If running with latest master, tfx-bsl
    must also be latest master.
  • preprocessing_fns now have beta support of calls to tf.functions, as long
    as they don't contain calls to tf.Transform analyzers/mappers or table
    initializers.
  • Depends on tfx-bsl>=0.15,<0.16.
  • Depends on tensorflow-metadata>=0.15,<0.16.
  • Depends on apache-beam[gcp]>=2.16,<3.
  • Depends on tensorflow>=1.15,<2.2.
    • Starting from 1.15, the tensorflow package comes with GPU support.
      Users won't need to choose between tensorflow and tensorflow-gpu.
    • Caveat: tensorflow 2.0.0 is an exception and does not have GPU
      support. If tensorflow-gpu 2.0.0 is installed before installing
      tensorflow-transform, it will be replaced with tensorflow 2.0.0.
      Re-install tensorflow-gpu 2.0.0 if needed.

Breaking changes

  • always_return_num_quantiles changed to default to True in tft.quantiles
    and tft.bucketize, resulting in the exact bucket count being returned.
  • Removes the input_fn_maker module which has been deprecated since TFT 0.11.
    For idiomatic construction of input_fn, see tensorflow_transform examples.

Deprecations

Release 0.14.0

05 Aug 16:46

Major Features and Improvements

  • New tft.word_count mapper to identify the number of tokens in each row
    (for pre-tokenized strings); see the sketch after this list.
  • All tft.scale_to_* mappers now have per-key variants, along with analyzers
    for mean_and_var_per_key and min_and_max_per_key.
  • New tft_beam.AnalyzeDatasetWithCache allows analyzing ranges of data while
    producing and utilizing cache. tft.analyzer_cache can help read and write
    such cache to a filesystem between runs. This caching feature is worth
    using when analyzing a rolling range of data in a continuous pipeline.
    This is an experimental feature.
  • Added reduce_instance_dims support to tft.quantiles and elementwise
    support to tft.bucketize, while avoiding separate Beam calls for each
    feature.
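
Below is a minimal sketch of tft.word_count, assuming a hypothetical string feature text tokenized with tf.compat.v1.string_split:

```python
import tensorflow as tf
import tensorflow_transform as tft

def preprocessing_fn(inputs):
  # Tokenize each row, then count the number of tokens per row.
  tokens = tf.compat.v1.string_split(inputs['text'])
  return {
      'num_tokens': tft.word_count(tokens),
  }
```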

Bug Fixes and Other Changes

  • sparse_tensor_to_dense_with_shape now accepts an optional default_value
    parameter.
  • tft.vocabulary and tft.compute_and_apply_vocabulary now support
    fingerprint_shuffle to sort the vocabularies by fingerprint instead of
    counts. This is useful for load balancing the training parameter servers.
    This is an experimental feature.
  • Fix numerical instability in tft.vocabulary mutual information calculations.
  • tft.vocabulary and tft.compute_and_apply_vocabulary now support computing
    vocabularies over integer categoricals and multivalent input features, and
    computing mutual information for non-binary labels.
  • New numeric normalization method available:
    tft.apply_buckets_with_interpolation (see the sketch after this list).
  • Changes to make this library more compatible with TensorFlow 2.0.
  • Fix sanitizing of vocabulary filenames.
  • Emit a friendly error message when context isn't set.
  • Analyzer output dtypes are enforced to be TensorFlow dtypes, and by extension
    ptransform_analyzer's output_dtypes is enforced to be a list of TensorFlow
    dtypes.
  • Make tft.apply_buckets_with_interpolation support SparseTensors.
  • Adds an experimental api for analyzers to annotate the post-transform schema.
  • TFTransformOutput.transform_raw_features now accepts an optional
    drop_unused_features parameter to exclude unused features in output.
  • If not specified, the min_diff_from_avg parameter of tft.vocabulary now
    defaults to a reasonable value based on the size of the dataset (relevant
    only if computing vocabularies using mutual information).
  • Convert some tf.contrib functions to be compatible with TF2.0.
  • New tft.bag_of_words mapper to compute the unique set of ngrams for each row
    (for pre-tokenized strings).
  • Fixed a bug in tf_utils.reduce_batch_count_mean_and_var (and, as a
    result, the mean_and_var analyzer) that miscalculated variance for the
    sparse elementwise=True case.
  • Added test utility tft_unit.cross_named_parameters for creating
    parameterized tests that involve the cartesian product of various
    parameters.
  • Depends on tensorflow-metadata>=0.14,<0.15.
  • Depends on apache-beam[gcp]>=2.14,<3.
  • Depends on numpy>=1.16,<2.
  • Depends on absl-py>=0.7,<2.
  • Allow preprocessing_fn to emit a tf.RaggedTensor. In this case, the
    output Schema proto cannot be converted to a feature spec, and so the
    output data cannot be materialized with tft.coders.
  • Ability to directly set exact num_buckets with new parameter
    always_return_num_quantiles for analyzers.quantiles and
    mappers.bucketize, defaulting to False in general but True when
    reduce_instance_dims is False.
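
Below is a minimal sketch of the new interpolation mapper combined with tft.quantiles, assuming a hypothetical numeric feature x:

```python
import tensorflow_transform as tft

def preprocessing_fn(inputs):
  # Each value is mapped into [0, 1] by linear interpolation within the
  # quantile bucket it falls into.
  boundaries = tft.quantiles(inputs['x'], num_buckets=10, epsilon=0.01)
  return {
      'x_normalized': tft.apply_buckets_with_interpolation(
          inputs['x'], boundaries),
  }
```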

Breaking changes

  • tf_utils.reduce_batch_count_mean_and_var, which feeds into
    tft.mean_and_var, now returns 0 instead of inf for empty columns of a
    sparse tensor.
  • tensorflow_transform.tf_metadata.dataset_schema.Schema class is removed.
    Wherever a dataset_schema.Schema was used, users should now provide a
    tensorflow_metadata.proto.v0.schema_pb2.Schema proto. For backwards
    compatibility, dataset_schema.Schema is now a factory method that produces
    a Schema proto. Updating code should be straightforward because the
    dataset_schema.Schema class was already a wrapper around the Schema proto.
  • Only explicitly public analyzers are exported to the tft module, e.g.
    combiners are no longer exported and have to be accessed directly through
    tft.analyzers.
  • Requires pre-installed TensorFlow >=1.14,<2.

Deprecations

  • DatasetSchema is now a deprecated factory method (see above).
  • tft.tf_metadata.dataset_schema.from_feature_spec is now deprecated.
    Equivalent functionality is provided by
    tft.tf_metadata.schema_utils.schema_from_feature_spec.

Release 0.13.0

01 Mar 21:55

Major Features and Improvements

  • Now AnalyzeDataset, TransformDataset and AnalyzeAndTransformDataset can
    accept input data that contains only the columns needed for that operation,
    as opposed to all columns defined in the schema. Utility methods to infer
    the list of needed columns have been added to tft.inspect_preprocessing_fn
    (see the sketch after this list). This makes it easier to take advantage
    of columnar projection when data is stored in columnar storage formats.
  • Python 3.5 is supported.
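
Below is a minimal sketch of column inference, assuming the get_analyze_input_columns and get_transform_input_columns helpers in tft.inspect_preprocessing_fn and a hypothetical feature spec:

```python
import tensorflow as tf
import tensorflow_transform as tft
from tensorflow_transform import inspect_preprocessing_fn

feature_spec = {  # hypothetical feature spec
    'x': tf.FixedLenFeature([], tf.float32),
    'y': tf.FixedLenFeature([], tf.float32),
}

def preprocessing_fn(inputs):
  return {'x_scaled': tft.scale_to_0_1(inputs['x'])}

# Only 'x' is used above, so 'y' can be projected away before analysis.
analyze_cols = inspect_preprocessing_fn.get_analyze_input_columns(
    preprocessing_fn, feature_spec)
transform_cols = inspect_preprocessing_fn.get_transform_input_columns(
    preprocessing_fn, feature_spec)
```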

Bug Fixes and Other Changes

  • Version is now accessible as tensorflow_transform.__version__.
  • Depends on apache-beam[gcp]>=2.11,<3.
  • Depends on protobuf>=3.7,<4.

Breaking changes

  • Coders now return index and value features rather than a combined feature for
    SparseFeature.
  • Requires pre-installed TensorFlow >=1.13,<2.

Deprecations

Release 0.12.0

20 Feb 19:39

Major Features and Improvements

  • Python 3.5 readiness complete (all tests pass). Full Python 3.5 compatibility
    is expected to be available with the next version of Transform (after
    Apache Beam 2.11 is released).
  • Performance improvements for vocabulary generation when using top_k.
  • A new, optimized, highly experimental API for analyzing a dataset was
    added: AnalyzeDatasetWithCache, which allows reading and writing analyzer
    cache.
  • Update DatasetMetadata to be a wrapper around the
    tensorflow_metadata.proto.v0.schema_pb2.Schema proto. TensorFlow Metadata
    will be the schema used to define data parsing across TFX. The serialized
    DatasetMetadata is now the Schema proto in ASCII format, but the previous
    format can still be read (see the sketch after this list).
  • Change ApplySavedModel implementation to use tf.Session.make_callable
    instead of tf.Session.run for improved performance.
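
Below is a minimal sketch of constructing the wrapper from a hypothetical feature spec, using this era's dataset_schema.from_feature_spec:

```python
import tensorflow as tf
from tensorflow_transform.tf_metadata import dataset_metadata, dataset_schema

# DatasetMetadata now wraps a tensorflow_metadata Schema proto; the proto
# is what gets serialized (in ASCII format).
metadata = dataset_metadata.DatasetMetadata(
    dataset_schema.from_feature_spec({
        'x': tf.FixedLenFeature([], tf.float32),
    }))
```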

Bug Fixes and Other Changes

  • tft.vocabulary and tft.compute_and_apply_vocabulary now support
    filtering based on adjusted mutual information when
    use_adjusted_mutual_info is set to True.
  • tft.vocabulary and tft.compute_and_apply_vocabulary now take a
    regularization term min_diff_from_avg that adjusts mutual information to
    zero whenever the difference between the count of a feature with any label
    and its expected count is lower than the threshold.
  • Added an option to tft.vocabulary and tft.compute_and_apply_vocabulary
    to compute a coverage vocabulary, using the new coverage_top_k,
    coverage_frequency_threshold and key_fn parameters (see the sketch after
    this list).
  • Added tft.ptransform_analyzer for advanced use cases.
  • Modified QuantilesCombiner to use tf.Session.make_callable instead of
    tf.Session.run for improved performance.
  • ExampleProtoCoder now also supports non-serialized Example representations.
  • tft.tfidf now accepts a scalar Tensor as vocab_size.
  • assertItemsEqual in unit tests are replaced by assertCountEqual.
  • NumPyCombiner now outputs TF dtypes in output_tensor_infos instead of
    numpy dtypes.
  • Adds function tft.apply_pyfunc that provides limited support for
    tf.py_func. Note that this is incompatible with serving. See the
    documentation for more details.
  • CombinePerKey now adds a dimension for the key.
  • Depends on numpy>=1.14.5,<2.
  • Depends on apache-beam[gcp]>=2.10,<3.
  • Depends on protobuf==3.7.0rc2.
  • ExampleProtoCoder.encode now converts a feature whose value is None to an
    empty value, where before it did not accept None as a valid value.
  • AnalyzeDataset, AnalyzeAndTransformDataset and TransformDataset can now
    accept dictionaries which contain None, and which will be interpreted the
    same as an empty list. They will never produce an output containing None.
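
Below is a minimal sketch of a coverage vocabulary, assuming a hypothetical tokens feature; the key_fn shown is purely illustrative:

```python
import tensorflow_transform as tft

def preprocessing_fn(inputs):
  # Keep the global top 10000 tokens, plus the top 100 tokens for each key
  # produced by key_fn (here, an illustrative prefix-based keying).
  return {
      'token_ids': tft.compute_and_apply_vocabulary(
          inputs['tokens'],
          top_k=10000,
          coverage_top_k=100,
          key_fn=lambda token: token.split(b'_')[0]),
  }
```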

Breaking changes

  • ColumnSchema and related classes (Domain, Axis and
    ColumnRepresentation and their subclasses) have been removed. In order to
    create a schema, use from_feature_spec. In order to inspect a schema
    use the as_feature_spec and domains methods of Schema. The
    constructors of these classes are replaced by functions that still work when
    creating a Schema but this usage is deprecated.
  • Requires pre-installed TensorFlow >=1.12,<2.
  • ExampleProtoCoder.decode now converts a feature with empty value (e.g.
    features { feature { key: "varlen" value { } } }) or missing key for a
    feature (e.g. features { }) to a None in the output dictionary. Before
    it would represent these with an empty list. This better reflects the
    original example proto and is consistent with TensorFlow Data Validation.
  • Coders now return a list instead of an ndarray for a VarLenFeature.

Deprecations

Release 0.11.0

02 Nov 14:40

Major Features and Improvements

Bug Fixes and Other Changes

  • tft.vocabulary and tft.compute_and_apply_vocabulary now support
    filtering based on mutual information when labels is provided (see the
    sketch after this list).
  • Export all package level exports of tensorflow_transform, from the
    tensorflow_transform.beam subpackage. This allows users to just import the
    tensorflow_transform.beam subpackage for all functionality.
  • Added API docs.
  • Fix bug where Transform returned a different dtype for a VarLenFeature with
    0 elements.
  • Depends on apache-beam[gcp]>=2.8,<3.
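
Below is a minimal sketch of label-aware filtering, assuming hypothetical terms and label features:

```python
import tensorflow_transform as tft

def preprocessing_fn(inputs):
  # With labels provided, vocabulary entries are filtered by their mutual
  # information with the label rather than by raw frequency alone.
  return {
      'term_ids': tft.compute_and_apply_vocabulary(
          inputs['terms'], labels=inputs['label'], top_k=5000),
  }
```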

Breaking changes

  • Requires pre-installed TensorFlow >=1.11,<2.

Deprecations

  • All functions in tensorflow_transform.saved.input_fn_maker are deprecated.
    See the examples for how to construct the input_fn for training and serving.
    Note that the examples demonstrate the use of the tf.estimator API. The
    functions named *_serving_input_fn were for use with the
    tf.contrib.estimator API which is now deprecated. We do not provide
    examples of usage of the tf.contrib.estimator API, instead users should
    upgrade to the tf.estimator API.

Release 0.9.0

06 Sep 20:46

Major Features and Improvements

  • Performance improvements for vocabulary generation when using top_k.
  • Utility to deep-copy Beam PCollections was added to avoid unnecessary
    materialization.
  • Utilize deep_copy to avoid unnecessary materialization of PCollections
    when the input data is immutable. This feature is currently off by default
    and can be enabled by setting tft.Context.use_deep_copy_optimization=True.
  • Add bucketize_per_key which computes separate quantiles for each key and then
    bucketizes each value according to the quantiles computed for its key.
  • tft.scale_to_z_score is now implemented with a single pass over the data.
  • Export schema_utils package to convert from the tensorflow-metadata package
    to the (soon to be deprecated) tf_metadata subpackage of
    tensorflow-transform.

Bug Fixes and Other Changes

  • Memory reduction during vocabulary generation.
  • Clarify documentation on return values from tft.compute_and_apply_vocabulary
    and tft.string_to_int.
  • tft_unit now explicitly creates Beam PCollections and validates the
    transformed dataset by writing and then reading it from disk.
  • tft.min, tft.size, tft.sum, tft.scale_to_z_score and tft.bucketize
    now support tf.SparseTensor.
  • Fix to tft.scale_to_z_score so it no longer attempts to divide by 0 when the
    variance is 0.
  • Fix bug where internal graph analysis didn't handle the case where an
    operation has control inputs that are operations (as opposed to tensors).
  • tft.sparse_tensor_to_dense_with_shape added which allows densifying a
    SparseTensor while specifying the resulting Tensor's shape.
  • Add load_transform_graph method to TFTransformOutput to load the transform
    graph without applying it. This has the effect of adding variables to the
    checkpoint when calling it from the training input_fn when using
    tf.Estimator.
  • tft.vocabulary and tft.compute_and_apply_vocabulary now accept an
    optional weights argument. When weights is provided, weighted frequencies
    are used instead of frequencies based on counts (see the sketch after this
    list).
  • tft.quantiles and tft.bucketize now accept an optional weights
    argument. When weights is provided, the weighted count is used for
    quantiles instead of the counts themselves.
  • Updated examples to construct the schema using
    dataset_schema.from_feature_spec.
  • Updated the census example to allow the 'education-num' feature to be missing
    and fill in a default value when it is.
  • Depends on tensorflow-metadata>=0.9,<1.
  • Depends on apache-beam[gcp]>=2.6,<3.
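
Below is a minimal sketch of the weighted variant, assuming hypothetical terms and term_weight features:

```python
import tensorflow_transform as tft

def preprocessing_fn(inputs):
  # With weights provided, vocabulary entries are ordered by accumulated
  # weight instead of raw occurrence counts.
  return {
      'term_ids': tft.compute_and_apply_vocabulary(
          inputs['terms'], weights=inputs['term_weight']),
  }
```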

Breaking changes

  • We now validate a Schema in its constructor to make sure that it can be
    converted to a feature spec. In particular only tf.int64, tf.string and
    tf.float32 types are allowed.
  • We now disallow default values for FixedColumnRepresentation.
  • It is no longer possible to set a default value in the Schema, and validation
    of shape parameters will occur earlier.
  • Removed Schema.as_batched_placeholders() method.
  • Removed all components of DatasetMetadata except the schema, and removed all
    related classes and code.
  • Removed the merge method for DatasetMetadata and related classes.
  • read_metadata can now only read from a single metadata directory and
    read_metadata and write_metadata no longer accept the versions parameter.
    They now only read/write the JSON format.
  • Requires pre-installed TensorFlow >=1.9,<2.

Deprecations

  • apply_function is no longer needed and is deprecated.
    apply_function(fn, *args) is now equivalent to fn(*args). tf.Transform
    is able to handle while loops and tables without the user wrapping the
    function call in apply_function.

Release 0.8.0

28 Jun 20:33

Major Features and Improvements

  • Add TFTransformOutput utility class that wraps the output of tf.Transform
    for use in training. This makes it easier to consume the output written by
    tf.Transform (see updated examples for usage).
  • Increase efficiency of quantiles (and therefore bucketize).

Bug Fixes and Other Changes

  • Change tft.sum/tft.mean/tft.var to only support basic numeric types.
  • Widen the output type of tft.sum for some input types to avoid overflow
    and/or to preserve precision.
  • For int32 and int64 input types, change the output type of
    tft.mean/tft.var/tft.scale_to_z_score from float64 to float32.
  • Change the output type of tft.size to be always int64.
  • Context now accepts passthrough_keys which can be used when additional
    information should be attached to dataset instances in the pipeline which
    should not be part of the transformation graph, for example: instance keys.
  • In addition to using TFTransformOutput, the examples demonstrate new workflows
    where a vocabulary is computed, but not applied, in the preprocessing_fn.
  • Added dependency on the absl-py package.
  • TransformTestCase test cases can now be parameterized.
  • Add support for partitioned variables when loading a model.
  • Export the coders subpackage so that users can access it as tft.coders,
    e.g. tft.coders.ExampleProtoCoder (see the sketch after this list).
  • Setting dtypes for numpy arrays in tft.coders.ExampleProtoCoder and
    tft.coders.CsvCoder.
  • tft.mean, tft.max and tft.var now support tf.SparseTensor.
  • Update examples to use "core" TensorFlow estimator API (tf.estimator).
  • Depends on protobuf>=3.6.0,<4.
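
Below is a minimal sketch of the exported coder, assuming a hypothetical single-feature schema built with this era's dataset_schema.from_feature_spec:

```python
import tensorflow as tf
import tensorflow_transform as tft
from tensorflow_transform.tf_metadata import dataset_schema

schema = dataset_schema.from_feature_spec({
    'x': tf.FixedLenFeature([], tf.float32),
})
coder = tft.coders.ExampleProtoCoder(schema)

serialized = coder.encode({'x': 1.0})  # serialized tf.Example bytes
decoded = coder.decode(serialized)     # back to an instance dict
```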

Breaking changes

  • apply_saved_transform is removed. See note on
    partially_apply_saved_transform in the Deprecations section.
  • No longer set vocabulary_file in IntDomain when using
    tft.compute_and_apply_vocabulary or tft.apply_vocabulary.
  • Requires pre-installed TensorFlow >=1.8,<2.

Deprecations

  • The expected_asset_file_contents of
    TransformTestCase.assertAnalyzeAndTransformResults has been deprecated, use
    expected_vocab_file_contents instead.
  • transform_fn_io.TRANSFORMED_METADATA_DIR and
    transform_fn_io.TRANSFORM_FN_DIR should not be used, they are now aliases
    for TFTransformOutput.TRANSFORMED_METADATA_DIR and
    TFTransformOutput.TRANSFORM_FN_DIR respectively.
  • partially_apply_saved_transform is deprecated; users should use the
    transform_raw_features method of TFTransformOutput instead. These differ
    in that partially_apply_saved_transform can also be used to return both
    the input placeholders and the outputs. But users do not need this
    functionality because they will typically create the input placeholders
    themselves based on the feature spec.
  • Renamed tft.uniques to tft.vocabulary, tft.string_to_int to
    tft.compute_and_apply_vocabulary and tft.apply_vocab to
    tft.apply_vocabulary. The existing methods will remain for a few more
    minor releases but are now deprecated and should be migrated away from.