CHANGELOG.md

[16.4.1] - 2022-04-14

Changed

  • log4j2 updated to 2.17.1 (was 2.4.1); these are provided dependencies.
  • spring-boot updated to 1.5.22.RELEASE (was 1.3.8.RELEASE).
  • spring-core updated to 4.3.9.RELEASE (was 4.2.8.RELEASE).
  • Fixed Google library conflicts with Guava and Gson.

[16.4.0] - 2021-08-24

Changed

  • Various code changes to allow compilation and build on Java 11.
  • hotels-oss-parent version to 6.2.1 (was 5.0.0).

[16.3.3] - 2020-12-10

Fixed

  • Issue where rename table operation would be incorrect if tables are in different databases.
  • Added check in delete operation so that it doesn't try to delete empty lists of keys.

[16.3.2] - 2020-10-27

Fixed

  • Issue where external Avro schemas generated lots of copy jobs. See #203.

[16.3.1] - 2020-09-15

Changed

  • Added fields sourceTable and sourcePartitions to the CopierContext class.

[16.3.0] - 2020-09-01

Added

  • Added method newInstance(CopierContext) to com.hotels.bdp.circustrain.api.copier.CopierFactory. This provides Copiers with more configuration information in a future proof manner. See #195.

Deprecated

  • Deprecated other newInstance() methods on com.hotels.bdp.circustrain.api.copier.CopierFactory.

[16.2.0] - 2020-07-01

Changed

  • Changed hive.version to 2.3.7 (was 2.3.2). This allows Circus Train to be used on JDK >= 9.

Added

  • Replication mode FULL_OVERWRITE to overwrite a previously replicated table and delete its data. Useful for incompatible schema changes.
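
For illustration, a minimal sketch of enabling this mode, assuming it is selected per table replication via a replication-mode key (table names and locations are placeholders):

```
table-replications:
  - source-table:
      database-name: source_db
      table-name: my_table
    replica-table:
      database-name: replica_db
      table-name: my_table
      table-location: s3://replica-bucket/my_table/
    replication-mode: FULL_OVERWRITE
```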

[16.1.0] - 2020-03-18

Changed

  • Updated S3S3Copier to have a configurable max number of threads to pass to TransferManager.
  • Fixed AssumeRoleCredentialProvider not auto-renewing credentials on expiration.

Fixed

  • Fixed issue where replication breaks if struct columns have changed. See #173.

[16.0.0] - 2020-02-26

Changed

  • Minimum supported Java version is now 8 (was 7).
  • hotels-oss-parent version to 5.0.0 (was 4.3.1).
  • Updated property aws-jdk.version to 1.11.728 (was 1.11.505).
  • Updated property httpcomponents.httpclient.version to 4.5.11 (was 4.5.5).

[15.1.1] - 2020-02-06

Fixed

  • When replicating tables with large numbers of partitions, Replica.updateMetadata now calls add/alter partition in batches of 1000. See #166.

[15.1.0] - 2020-01-28

Changed

  • AVRO Schema Copier now re-uses the normal 'data' copier instead of its own. See #162.
  • Changed the order of the generated partition filter used by "HiveDiff" - it is now reverse natural order (which means new partitions come first when partitions are date/time strings). When in doubt, use the circus-train-tool script check-filters.sh to see what would be generated.

Fixed

  • Fixed issue where partition-limit is not correctly applied when generating a partition filter. See #164.

[15.0.0] - 2019-11-12

Changed

  • Default avro-serde-options must now be included within transform-options. This is a backwards incompatible change to the configuration file. Please see Avro Schema Replication for more information; a sketch of the new layout is shown after this list.
  • Updated jackson version to 2.10.0 (was 2.9.10).
  • hotels-oss-parent version to 4.2.0 (was 4.0.0). Contains updates to the copyright header.
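
A sketch of the new layout, under the assumption that the previously top-level avro-serde-options block (for example, a base-url key) now nests under transform-options; see Avro Schema Replication for the authoritative format:

```
transform-options:
  avro-serde-options:
    base-url: s3://replica-bucket/avro-schemas/
```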

Fixed

  • Table properties can now be added to default transformations.

Added

  • Added copier-options.assume-role to assume a role when using the S3S3 copier.
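
A minimal sketch of this copier option, assuming its value is the ARN of the IAM role to assume (the ARN below is a placeholder):

```
copier-options:
  assume-role: arn:aws:iam::123456789012:role/my-replication-role
```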

[14.1.0] - 2019-10-04

Added

  • Table transformation to add custom properties to tables during a replication.
  • If a user doesn't specify avro-serde-options, Circus Train will still copy the external schema over to the target table. See #131.
  • Added copier-options.assume-role to assume a role when using the S3MapReduceCp copier class. See README.md for details.

Removed

  • Excluded org.pentaho:pentaho-aggdesigner-algorithm from build.

Fixed

  • Bug in AbstractAvroSerDeTransformation where the config state wasn't refreshed on every replication.

Changed

  • Updated jackson version to 2.9.10 (was 2.9.8).
  • Updated beeju version to 2.0.0 (was 1.2.1).
  • Updated circus-train-minimal.yml.template to include the required housekeeping configuration for using the default schema with H2.

[14.0.1] - 2019-04-09

Changed

  • Updated housekeeping version to 3.1.0 (was 3.0.6). Contains various housekeeping fixes.

[14.0.0] - 2019-03-04

Changed

  • Updated housekeeping version to 3.0.6 (was 3.0.5). This change modifies the default script for creating a housekeeping schema (from classpath:/schema.sql to an empty string) and can cause errors for users who rely on the schema provided by default. To fix the errors, set the property housekeeping.db-init-script to classpath:/schema.sql, which uses a file provided by default by Circus Train (see the sketch after this list).
  • Updated hotels-oss-parent version to 4.0.0 (was 2.3.5).
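
A sketch of restoring the previous behaviour, assuming the usual nested YAML form of the housekeeping.db-init-script property:

```
housekeeping:
  db-init-script: classpath:/schema.sql
```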

Fixed

  • Clear partitioned state correctly for SnsListener. See #104.
  • Fixed issue where in certain cases the table location of a partitioned table would be scheduled for housekeeping.
  • Removed default script for creating a housekeeping schema to allow the use of schemas that are already created. See #111.
  • Upgraded AWS SDK to remove deprecation warning. See #102.
  • Upgraded hcommon-hive-metastore version to 1.3.0 (was 1.2.4) to fix Thrift compatibility bug. See #115.

Added

  • Configurable retry mechanism to handle flaky AWS S3 to S3 copying. See #56.

[13.2.1] - 2019-01-24

Changed

  • Refactored project to remove checkstyle and findbugs warnings.
  • Upgraded hotels-oss-parent version to 2.3.5 (was 2.3.3).
  • Upgraded housekeeping version to 3.0.5 (was 3.0.0).
  • Upgraded jackson version to 2.9.8 (was 2.9.7).

Added

  • Support for getting AWS Credentials within a FARGATE instance in ECS. See #109.

[13.2.0] - 2019-01-11

Added

  • Added replication-strategy configuration that can be used to support propagating deletes (drop table/partition operations). See README.md for more details and the sketch after this list.
  • Ability to specify an S3 canned ACL via copier-options.canned-acl. See #99.
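
A configuration sketch covering both options; the placement of replication-strategy per table replication and the value PROPAGATE_DELETES are assumptions for illustration, and the canned ACL value is just one of the standard S3 ACLs - see README.md for the supported values:

```
table-replications:
  - source-table:
      ...
    replica-table:
      ...
    replication-strategy: PROPAGATE_DELETES

copier-options:
  canned-acl: bucket-owner-full-control
```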

Fixed

  • Upgraded hcommon-hive-metastore to 1.2.4 to fix an issue where the wrong exception was being propagated in the compatibility layer.

[13.1.0] - 2018-12-20

Changed

  • Housekeeping can be configured to control the query batch size, which controls memory usage. See #40.
  • Housekeeping readme moved to Housekeeping project. See #31.
  • Upgraded Housekeeping library to also store replica database and table name in Housekeeping database. See #30.
  • Upgraded hotels-oss-parent pom to 2.3.3 (was 2.0.6). See #97.

[13.0.0] - 2018-10-15

Changed

  • Narrowed component scanning to be internal base packages instead of com.hotels.bdp.circustrain. See #95. Note this change is not backwards compatible for any Circus Train extensions that are in the com.hotels.bdp.circustrain package - these were in effect being implicitly scanned and loaded but won't be now. Instead these extensions will now need to be added using Circus Train's standard extension loading mechanism.
  • Upgraded jackson.version to 2.9.7 (was 2.6.6), aws-jdk.version to 1.11.431 (was 1.11.126) and httpcomponents.httpclient.version to 4.5.5 (was 4.5.2). See #91.
  • Refactored general metastore tunnelling code to leverage hcommon-hive-metastore libraries. See #85.
  • Refactored the remaining code in core.metastore from circus-train-core to leverage hcommon-hive-metastore libraries.

[12.1.0] - 2018-08-08

Changed

  • circus-train-gcp: avoids a temporary copy of the key file to user.dir when an absolute path to the Google Cloud credentials file is used, by transforming it into a relative path.
  • circus-train-gcp: a relative path can now be provided in the configuration for the Google Cloud credentials file.

[12.0.0] - 2018-07-13

Changed

  • circus-train-vacuum-tool moved into Housekeeping project under the module housekeeping-vacuum-tool.
  • Configuration classes moved from Core to API sub-project. See #78.

[11.5.2] - 2018-06-15

Changed

  • Refactored general purpose Hive metastore code to leverage hcommon-hive-metastore libraries. See #72.

Fixed

  • Avro schemas were not being replicated when an avro.schema.url without a scheme was specified. See #74.

[11.5.1] - 2018-05-24

Fixed

  • Avro schemas were not being replicated when an HA NameNode is configured and the Avro replication feature is used. See #69.

[11.5.0] - 2018-05-24

Added

  • Added SSH timeout and SSH strict host key checking capabilities. See #64.

Changed

  • Using the hcommon-ssh-1.0.1 dependency to fix an issue where metastore exceptions were lost and not propagated properly over tunnelled connections.
  • Replaced built-in SSH support with the hcommon-ssh library. See #46.

Fixed

  • Housekeeping was failing when attempting to delete a path which no longer exists on the replica filesystem. Upgraded Circus Train's Housekeeping dependency to a version which fixes this bug. See #61.

[11.4.0] - 2018-04-11

Added

  • Ability to select Copier via configuration. See #55.

[11.3.1] - 2018-03-21

Changed

  • Clearer replica-check exception message. See #47.

Fixed

  • S3-S3 Hive Diff calculating incorrect checksum on folders. See #49.

[11.3.0] - 2018-02-27

Changed

  • SNS message now indicates if message was truncated. See #41.
  • Exclude Guava 17.0 in favour of Guava 20.0 for Google Cloud library compatibility.
  • Add dependency management bom for Google Cloud dependencies.

Fixed

  • Backwards compatibility with Hive 1.2.x.

[11.2.0] - 2018-02-16

Added

  • Added ability to configure AWS Server Side Encryption for S3S3Copier via copier-options.s3-server-side-encryption configuration property.
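
A minimal sketch of this option; whether the property takes a boolean or an algorithm name is not stated here, so the value below is an assumption:

```
copier-options:
  s3-server-side-encryption: true
```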

Changed

  • Upgrade housekeeping to version 1.0.2.

[11.1.1] - 2018-02-15

Fixed

  • Google FileSystem classes not being placed onto the mapreduce.application.classpath in S3MapReduceCp and DistCp mapreduce jobs.

Changed

  • Google FileSystem and S3 FileSystems added to mapreduce.application.classpath in circus-train-gcp and circus-train-aws respectively.

[11.1.0] - 2018-02-05

Fixed

  • #23 - Housekeeping failing due to missing credentials.

Added

  • Added replicaTableLocation, replicaMetastoreUris and partitionKeys to SNS message.

Changed

  • SNS Message protocolVersion changed from "1.0" to "1.1".
  • Updated documentation for circus-train-aws-sns module (full reference of SNS message format, more examples).
  • Fixed references to README.md in command line runner help messages to point to correct GitHub locations.

[11.0.0] - 2018-01-16

Changed

  • Upgraded Hive version from 1.2.1 to 2.3.2 (changes are backwards compatible).
  • Upgraded Spring Platform version from 2.0.3.RELEASE to 2.0.8.RELEASE.
  • Replaced the "concrete" TunnellingMetaStoreClient implementation with TunnellingMetaStoreClientInvocationHandler, which uses Java reflection.
  • Replicating a partitioned table containing no partitions will now succeed instead of silently not replicating the table metadata.
  • Most functionality from Housekeeping module moved to https://github.com/HotelsDotCom/housekeeping.

[10.0.0] - 2017-11-21

Changed

  • Maven group ID changed to com.hotels.
  • Exclude logback in parent POM.
  • First open source release.
  • Various small code cleanups.

[9.2.0] - 2017-11-20

Fixed

  • S3S3Copier now handles cross-region replications from the US-Standard AWS region.

Added

  • Mock S3 end-point for HDFS-S3 and S3-S3 replications.
  • New S3MapReduceCp properties to control the size of the buffer used by the S3 TransferManager and the upload retries of the S3 client. Refer to README.md for details.

[9.1.1] - 2017-10-17

Fixed

  • EventIdExtractor regex changed so that it captures both new and legacy event IDs.
  • Added a read limit to prevent the AWS library from trying to read more data than the size of the buffer provided by the Hadoop FileSystem.

[9.1.0] - 2017-10-03

Added

  • circus-train-housekeeping support for storing housekeeping data in JDBC compliant SQL databases.

Changed

  • circus-train-parent updated to inherit from hww-parent version 12.1.3.

[9.0.2] - 2017-09-20

Added

  • Support for replication of Hive views.

Changed

  • Removed circus-train-aws dependency on internal patched hadoop-aws.

[9.0.1] - 2017-09-14

Fixed

  • Fixed error when replicating partitioned tables with empty partitions.

[9.0.0] - 2017-09-05

Changed

  • Removed circus-train-aws dependency from circus-train-core.
  • Circus Train Tools are now packaged as a TGZ.
  • Updated to parent POM 12.1.0 (latest versions of dependencies and plugins).
  • Relocated only Guava rather than the Google Cloud Platform dependencies plus Guava.

Removed

  • CircusTrainContext interface, a legacy leftover mechanism for making Circus Train pluggable.

Fixed

  • Fixed broken Circus Train Tool Scripts.

[8.0.0] - 2017-07-28

Added

  • S3/HDFS to GS Hive replication.
  • Support for users to specify a list of extension-packages in their YAML configuration (see the sketch after this list). This adds the specified packages to Spring's component scan, thereby allowing the loading of extensions via standard Spring annotations, such as @Component and @Configuration.
  • Added circus-train-package module that builds a TGZ file with a runnable circus-train script.
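
A minimal sketch of the extension-packages option (the package name is a placeholder):

```
extension-packages:
  - com.example.circustrain.myextension
```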

Changed

  • Changed public interface com.hotels.bdp.circustrain.api.event.TableReplicationListener, for easier error handling.

Removed

  • RPM module has been pulled out to a top-level project.

[7.0.0]

Added

  • New and improved HDFS-to-S3 copier.

Changed

  • Changed the default instance.home from $user.dir to $user.home. If you relied on $user.dir, please set the instance.home variable in the YAML config (see the sketch below).
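
A sketch of pinning the previous behaviour, assuming instance.home maps to a nested instance/home key in the YAML file and that Spring-style placeholders such as ${user.dir} are resolved:

```
instance:
  home: ${user.dir}
```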

Fixed

  • Fixed issue in LoggingListener where the wrong number of altered partitions was reported.

6.0.0

  • Fixed issue with the circus-train-aws-sns module.

5.2.0

  • Added S3 to S3 replication.
  • This should have been a major release: the signature of the public com.hotels.bdp.circustrain.api.copier.CopierFactory.supportsSchemes method has changed. Please adjust your code if you rely on it for Circus Train extensions.

5.1.1

  • Clean up of AWS Credential Provider classes.
  • Fixed tables in documentation to display correctly.

5.1.0

  • Added support for skipping missing partition folder errors.
  • Added support for transporting AvroSerDe files when the avro.schema.url is specified within the SERDEPROPERTIES rather than the TBLPROPERTIES.
  • Fixed a bug where a table-location matching the Avro schema base-url caused an IOException to be thrown in subsequent reads of the replica table.
  • Added per-table-replication transformation config (used in the Avro SerDe transformation).
  • Replaced usage of reflections.org with Spring's scanning provider.

5.0.2

  • Documented methods for implementing custom copiers and data transformations.
  • Cleaned up copier options configuration classes.

5.0.1

  • Support for composite copiers.

4.0.0

  • Extensions for HiveMetaStoreClientFactory to allow integration with AWS Athena.
  • Updated to extend hdw-parent 9.2.2 which in turn upgrades hive.version to 1.2.1000.2.4.3.3-2.

3.4.0

  • Support for downloading Avro schemas from the URI on the source table and uploading them to a user-specified URL on replication.
  • Multiple transformations can now be loaded onto the classpath for application on replication rather than just one.

3.3.1

  • Update to new parent with HDP dependency updates for the Hadoop upgrade.
  • Fixed bug where incorrect table name resulted in a NullPointerException.

3.3.0

  • Added new "replication mode: METADATA_UPDATE" feature which provides the ability to replicate metadata only which is useful for updating the structure and configuration of previously replicated tables.

3.2.0

  • Added new "replication mode: METADATA_MIRROR" feature which provides the ability to replicate metadata only, pointing the replica table at the original source data locations.

3.1.1

  • Replication and Housekeeping can now be executed in separate processes.
    • Add the option --modules=replication to the circus-train.sh script to perform replication only.
    • Use the scripts housekeeping.sh and housekeeping-rush.sh to perform housekeeping in its own process.

3.0.2

  • Made scripts and code work with HDP-2.4.3.

3.0.1

  • Fixed issue where wrong replication was listed as the failed replication.

3.0.0

  • Support for replication from AWS to on-premises when running on an on-premises cluster.
  • The configuration element replica-catalog.s3 has been replaced by a top-level security element. The following is an example of how to migrate your configuration files to this new version:

Old configuration file

```
...
replica-catalog:
  name: my-replica
  hive-metastore-uris: thrift://hive.localdomain:9083
  s3:
    credential-provider: jceks://hdfs/<hdfs-location-of-jceks-file>/my-credentials.jceks
...
```

New configuration file

```
...
replica-catalog:
  name: my-replica
  hive-metastore-uris: thrift://hive.localdomain:9083
security:
  credential-provider: jceks://hdfs/<hdfs-location-of-jceks-file>/my-credentials.jceks
...
```

2.2.4

  • Exit codes based on success or error.

2.2.3

  • Ignoring params that seem to be added in the replication process.
  • Support sending S3_DIST_CP_BYTES_REPLICATED/DIST_CP_BYTES_REPLICATED metrics to graphite for running (S3)DistCp jobs.

2.2.2

  • Support for SSH tunneling on the source catalog.

2.2.1

  • Fixes for filter partition generator.

2.2.0

  • Added the ability to generate partition filters for incremental replication.

2.1.3

  • Introduction of the transformations API: users can now provide a metadata transformation function for tables, partitions and column statistics.

2.1.2

  • Fixed issue with deleted paths.

2.1.1

  • Added some stricter preconditions to the vacuum tool so that data is not unintentionally removed from tables with inconsistent metadata.

2.1.0

  • Added the 'vacuum' tool for removing data orphaned by a bug in circus-train versions earlier than 2.0.0.
  • Moved the 'filter tool' into a 'tools' sub-module.

2.0.1

  • Fixed issue where housekeeping would fail when two processes deleted the same entry.

2.0.0

  • SSH tunnels with multiple hops. The property replica-catalog.metastore-tunnel.user has been replaced with replica-catalog.metastore-tunnel.route and the property replica-catalog.metastore-tunnel.private-key has been replaced with replica-catalog.metastore-tunnel.private-keys. Refer to README.md for details; a sketch is shown after this list.
  • The executable script has been split to provide both non-RUSH and RUSH executions. If you are not using RUSH, keep using circus-train.sh. If you are using RUSH, either change your scripts to invoke circus-train-rush.sh instead, or add the new parameter rush as the first argument when invoking circus-train.sh.
  • Removal of property graphite.enabled.
  • Improvements and fixes to the housekeeping process that manages old data deletion:
    • Binds S3AFileSystem to s3[n]:// schemes in the tool for housekeeping.
    • Only remove housekeeping entries from the database if:
      • the leaf path described by the record no longer exists AND another sibling exists that can look after the ancestors,
      • OR the ancestors of the leaf path no longer exist.
    • Stores the eventId of the deleted path along with the path.
    • If an existing record does not have the previous eventId, it is reconstructed from the path (to support legacy data for the time being).
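
A sketch of the new tunnel properties mentioned in the first item above; the hop route format and the key path are assumptions for illustration, refer to README.md for the exact syntax:

```
replica-catalog:
  ...
  metastore-tunnel:
    route: user@bastion-host -> user@cluster-master
    private-keys: /home/user/.ssh/id_rsa
```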

1.5.1

  • DistCp temporary path is now set per task.