
v0.6.0 - 2021-10-29

@amontanez24 released this 29 Oct 21:28

This release makes major changes to the underlying code for RDT as well as the API for both the HyperTransformer and BaseTransformer.
The changes enable the following functionality:

  • The HyperTransformer can now apply a sequence of transformers to a column.
  • Transformers can now take multiple columns as an input.
  • RDT now allows any number of new data types to be added, instead of being restricted to pandas.dtypes.
  • Users can define acceptable output types for running HyperTransformer.transform.
  • The HyperTransformer keeps applying transformations to the input fields until only acceptable data types remain in the output.
  • Transformers can return data of any data type.
  • Transformers now have named outputs and output types.
  • Transformers can suggest which transformer to use on any of their outputs.
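
The chaining behavior described in the list above can be illustrated with a small, self-contained sketch. It does not use RDT: the two toy transformers, the DEFAULT_TRANSFORMERS table, the "<field>.value" output naming and the transform_until_acceptable function are all hypothetical stand-ins, chosen only to show how fields might be transformed repeatedly until every output has an acceptable data type.

```python
# Toy stand-ins, NOT RDT's classes. Each "transformer" takes a column name and
# its values and returns named outputs as {output_name: (data_type, values)}.
def boolean_to_integer(name, values):
    return {f"{name}.value": ("integer", [int(v) for v in values])}


def integer_to_float(name, values):
    return {f"{name}.value": ("float", [float(v) for v in values])}


# Hypothetical default transformer per data type.
DEFAULT_TRANSFORMERS = {
    "boolean": boolean_to_integer,
    "integer": integer_to_float,
}


def transform_until_acceptable(columns, acceptable_types):
    """Keep transforming columns until every output type is acceptable.

    `columns` maps a field name to a (data_type, values) pair.
    """
    pending = dict(columns)
    done = {}
    while pending:
        name, (data_type, values) = pending.popitem()
        if data_type in acceptable_types:
            done[name] = (data_type, values)
            continue
        if data_type not in DEFAULT_TRANSFORMERS:
            raise ValueError(f"No transformer for data type {data_type!r}")
        # The outputs are derived fields; queue them for another pass.
        pending.update(DEFAULT_TRANSFORMERS[data_type](name, values))
    return done


columns = {
    "active": ("boolean", [True, False, True]),
    "age": ("float", [31.0, 45.0, 27.0]),
}
result = transform_until_acceptable(columns, acceptable_types={"float"})
```

In this sketch the boolean field passes through two transformers in sequence (boolean to integer, then integer to float) because only "float" is listed as acceptable, while the "age" field is left untouched. The loop assumes the toy transformers never form a cycle.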

To take advantage of this functionality, the following API changes were made:

  • The HyperTransformer has new initialization parameters that allow users to specify data types for any field in their data as well as
    specify which transformer to use for a field or data type. The parameters are:
    • field_transformers - A dictionary allowing users to specify which transformer to use for a field or derived field. Derived fields
      are fields created by running transform on the input data.
    • field_data_types - A dictionary allowing users to specify the data type of a field.
    • default_data_type_transformers - A dictionary allowing users to specify the default transformer to use for a data type.
    • transform_output_types - A dictionary allowing users to specify which data types are acceptable for the output of transform.
      This is needed because transformers can now be applied in sequence, and not every transformer returns numeric data.
  • Methods were also added to the HyperTransformer to allow these parameters to be modified. These include get_field_data_types,
    update_field_data_types, get_default_data_type_transformers, update_default_data_type_transformers and set_first_transformers_for_fields.
  • The BaseTransformer now requires the column names it will transform to be provided to fit, transform and reverse_transform.
  • The BaseTransformer has a new method, get_output_types, that lets users see its output fields and output types.
  • The BaseTransformer has a new method, get_next_transformers, that lets users see the next suggested transformer for each
    output field.
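
As a rough illustration of the BaseTransformer changes listed above, the following sketch defines a toy transformer that receives column names in fit, transform and reverse_transform, and reports its named outputs. It is not RDT's BaseTransformer: the class name, the dict-of-lists data layout, the "<column>.value" output naming and the "ToyIntegerToFloat" suggestion are assumptions made for the example, and the real method signatures may differ.

```python
class ToyBooleanTransformer:
    """Illustrative only; a hypothetical stand-in, not RDT's BaseTransformer."""

    def __init__(self):
        self._columns = []

    def fit(self, data, columns):
        # Remember which columns this transformer is responsible for.
        self._columns = list(columns)

    def get_output_types(self):
        # One named output per fitted column, each with its data type.
        return {f"{column}.value": "integer" for column in self._columns}

    def get_next_transformers(self):
        # Suggest (by hypothetical name) what to apply to each output next.
        return {f"{column}.value": "ToyIntegerToFloat" for column in self._columns}

    def transform(self, data, columns):
        # Copy untouched columns, then replace each target with its output.
        out = {k: v for k, v in data.items() if k not in columns}
        for column in columns:
            out[f"{column}.value"] = [int(v) for v in data[column]]
        return out

    def reverse_transform(self, data, columns):
        outputs = {f"{column}.value" for column in columns}
        out = {k: v for k, v in data.items() if k not in outputs}
        for column in columns:
            out[column] = [bool(v) for v in data[f"{column}.value"]]
        return out


data = {"active": [True, False, True], "age": [31, 45, 27]}
transformer = ToyBooleanTransformer()
transformer.fit(data, ["active"])
transformed = transformer.transform(data, ["active"])
restored = transformer.reverse_transform(transformed, ["active"])
```

Here get_output_types and get_next_transformers are keyed by output field name, which is one plausible way a caller could decide whether a derived field needs another transformation step.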

In addition to the changes to the API and the capabilities of RDT, many automated checks and tests were added to ensure that contributions
to the library follow the current code style, stay performant and produce high-quality data. These tests run on every push to the
repository. They can also be run locally via the following functions:

  • validate_transformer_code_style - Checks that new code follows the code style.
  • validate_transformer_quality - Tests that new transformers yield data that maintains relationships between columns.
  • validate_transformer_performance - Tests that new transformers don't take too much time or memory.
  • validate_transformer_unit_tests - Checks that the unit tests cover all new code, follow naming conventions and pass.
  • validate_transformer_integration - Checks that the integration tests follow naming conventions and pass.

New Features

Bugs fixed

  • If the input data has a different index, the reverse transformed data may be out of order - Issue #277 by @amontanez24

Documentation changes

Internal improvements

Other issues closed

  • DeprecationWarning: np.float is a deprecated alias for the builtin float - Issue #304 by @csala
  • Add pip check to CI workflows - Issue #290 by @csala
  • Should Transformers subclasses exist for specific configurations? - Issue #243 by @fealho