v0.6.0 - 2021-10-29
This release makes major changes to the underlying code for RDT as well as the API for both the HyperTransformer and BaseTransformer.
The changes enable the following functionality:
- The HyperTransformer can now apply a sequence of transformers to a column.
- Transformers can now take multiple columns as an input.
- RDT has been expanded to allow for infinite data types to be added instead of being restricted to pandas.dtypes.
- Users can define acceptable output types for running HyperTransformer.transform.
- The HyperTransformer will continuously apply transformations to the input fields until only acceptable data types are in the output.
- Transformers can return data of any data type.
- Transformers now have named outputs and output types.
- Transformers can suggest which transformer to use on any of their outputs.
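The chaining behaviour described above can be pictured with a small, self-contained sketch. It does not use RDT: the classes ToyLabelEncoder and ToyIntegerToFloat, the DEFAULT_TRANSFORMERS registry, the transform_until_acceptable function and the "field.output" naming of derived fields are all hypothetical stand-ins chosen for illustration. Only the method names get_output_types and get_next_transformers are taken from these notes, and their return shapes here are assumptions.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical toy model: each field is stored as (data_type, values).
Field = Tuple[str, List]


class ToyLabelEncoder:
    """Toy transformer: a 'categorical' field becomes one 'integer' output."""

    def get_output_types(self) -> Dict[str, str]:
        # Named outputs and their data types.
        return {"value": "integer"}

    def get_next_transformers(self) -> Dict[str, type]:
        # Suggest which transformer to apply to the 'value' output.
        return {"value": ToyIntegerToFloat}

    def transform(self, values: List) -> Dict[str, List]:
        codes = {label: i for i, label in enumerate(sorted(set(values)))}
        return {"value": [codes[v] for v in values]}


class ToyIntegerToFloat:
    """Toy transformer: an 'integer' field becomes one 'float' output."""

    def get_output_types(self) -> Dict[str, str]:
        return {"value": "float"}

    def get_next_transformers(self) -> Dict[str, type]:
        return {}

    def transform(self, values: List) -> Dict[str, List]:
        return {"value": [float(v) for v in values]}


# Assumed fallback: one default transformer per data type.
DEFAULT_TRANSFORMERS = {
    "categorical": ToyLabelEncoder,
    "integer": ToyIntegerToFloat,
}


def transform_until_acceptable(
    fields: Dict[str, Field], acceptable: Set[str]
) -> Dict[str, Field]:
    """Keep transforming fields until every output has an acceptable type."""
    pending = dict(fields)
    suggested: Dict[str, type] = {}  # derived field name -> suggested transformer
    done: Dict[str, Field] = {}
    while pending:
        name, (data_type, values) = pending.popitem()
        if data_type in acceptable:
            done[name] = (data_type, values)
            continue
        # Prefer a suggestion from the producing transformer, else the default.
        transformer_class = suggested.get(name) or DEFAULT_TRANSFORMERS[data_type]
        transformer = transformer_class()
        outputs = transformer.transform(values)
        next_transformers = transformer.get_next_transformers()
        for output_name, output_type in transformer.get_output_types().items():
            derived = f"{name}.{output_name}"
            pending[derived] = (output_type, outputs[output_name])
            if next_transformers.get(output_name) is not None:
                suggested[derived] = next_transformers[output_name]
    return done
```

In this sketch a categorical field passes through two toy transformers before it reaches an acceptable type, while a field that is already acceptable is passed through untouched.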
To take advantage of this functionality, the following API changes were made:
- The HyperTransformer has new initialization parameters that allow users to specify data types for any field in their data as well as specify which transformer to use for a field or data type. The parameters are:
  - field_transformers - A dictionary allowing users to specify which transformer to use for a field or derived field. Derived fields are fields created by running transform on the input data.
  - field_data_types - A dictionary allowing users to specify the data type of a field.
  - default_data_type_transformers - A dictionary allowing users to specify the default transformer to use for a data type.
  - transform_output_types - A dictionary allowing users to specify which data types are acceptable for the output of transform. This parameter is needed because transformers can now be applied in a sequence, and not every transformer will return numeric data.
- Methods were also added to the HyperTransformer to allow these parameters to be modified. These include get_field_data_types, update_field_data_types, get_default_data_type_transformers, update_default_data_type_transformers and set_first_transformers_for_fields.
- The BaseTransformer now requires the column names it will transform to be provided to fit, transform and reverse_transform.
- The BaseTransformer added the following method to allow for users to see its output fields and output types: get_output_types.
- The BaseTransformer added the following method to allow for users to see the next suggested transformer for each output field: get_next_transformers.
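As a rough illustration of how the field-level and data-type-level settings could interact, the sketch below resolves a transformer for a field from three plain dictionaries named after the new parameters. It does not import RDT. The choose_transformer helper, the transformer names (given as placeholder strings) and the precedence rule (a per-field entry wins over the data-type default) are assumptions made for this example, not a statement of how the HyperTransformer is implemented.

```python
from typing import Dict, Optional


def choose_transformer(
    field: str,
    field_transformers: Dict[str, str],
    field_data_types: Dict[str, str],
    default_data_type_transformers: Dict[str, str],
) -> Optional[str]:
    """Pick a transformer name for a field (hypothetical resolution order)."""
    # Assumption: an explicit per-field choice takes priority.
    if field in field_transformers:
        return field_transformers[field]
    # Otherwise fall back to the default registered for the field's data type.
    data_type = field_data_types.get(field)
    return default_data_type_transformers.get(data_type)


# Placeholder configuration; the transformer names are illustrative strings.
config = {
    "field_transformers": {"signup_date": "ToyDatetimeTransformer"},
    "field_data_types": {
        "signup_date": "datetime",
        "plan": "categorical",
        "age": "integer",
    },
    "default_data_type_transformers": {
        "categorical": "ToyOneHotEncoder",
        "integer": "ToyNumericalTransformer",
    },
}

for field_name in ("signup_date", "plan", "age"):
    print(field_name, "->", choose_transformer(field_name, **config))
```

A field with neither a per-field entry nor a known data type resolves to None in this sketch; how the library itself handles that case is not covered by these notes.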
On top of the changes to the API and the capabilities of RDT, many automated checks and tests were also added to ensure that contributions
to the library abide by the current code style, stay performant and result in data of a high quality. These tests run on every push to the
repository. They can also be run locally via the following functions:
- validate_transformer_code_style - Checks that new code follows the code style.
- validate_transformer_quality - Tests that new transformers yield data that maintains relationships between columns.
- validate_transformer_performance - Tests that new transformers don't take too much time or memory.
- validate_transformer_unit_tests - Checks that the unit tests cover all new code, follow naming conventions and pass.
- validate_transformer_integration - Checks that the integration tests follow naming conventions and pass.
New Features
- Update HyperTransformer API - Issue #298 by @amontanez24
- Create validate_pull_request function - Issue #254 by @pvk-developer
- Create validate_transformer_unit_tests function - Issue #249 by @pvk-developer
- Create validate_transformer_performance function - Issue #251 by @katxiao
- Create validate_transformer_quality function - Issue #253 by @amontanez24
- Create validate_transformer_code_style function - Issue #248 by @pvk-developer
- Create validate_transformer_integration function - Issue #250 by @katxiao
- Enable users to specify transformers to use in HyperTransformer - Issue #233 by @amontanez24 and @csala
- Addons implementation - Issue #225 by @pvk-developer
- Create ways for HyperTransformer to know which transformers to apply to each data type - Issue #232 by @amontanez24 and @csala
- Update categorical transformers - PR #231 by @fealho
- Update numerical transformer - PR #227 by @fealho
- Update datetime transformer - PR #230 by @fealho
- Update boolean transformer - PR #228 by @fealho
- Update null transformer - PR #229 by @fealho
- Update the baseclass - PR #224 by @fealho
Bugs fixed
- If the input data has a different index, the reverse transformed data may be out of order - Issue #277 by @amontanez24
Documentation changes
- RDT contributing guide - Issue #301 by @katxiao and @amontanez24
Internal improvements
- Add PR template for new transformers - Issue #307 by @katxiao
- Implement Quality Tests for Transformers - Issue #252 by @amontanez24
- Update performance test structure - Issue #257 by @katxiao
- Automated integration test for transformers - Issue #223 by @katxiao
- Move datasets to its own module - Issue #235 by @katxiao
- Fix missing coverage in rdt unit tests - Issue #219 by @fealho
- Add repo-wide automation - Issue #309 by @katxiao