- `log4j2` updated to `2.17.1` (was `2.4.1`). These are provided dependencies.
- `spring-boot` updated to `1.5.22.RELEASE` (was `1.3.8.RELEASE`).
- `spring-core` updated to `4.3.9.RELEASE` (was `4.2.8.RELEASE`).
- Fixed Google library conflicts with `guava` and `gson`.
- Various code changes to allow compilation and build on Java 11.
- Updated `hotels-oss-parent` version to 6.2.1 (was 5.0.0).
- Fixed issue where the rename table operation would be incorrect if tables are in different databases.
- Added check in `delete` operation so that it doesn't try to delete empty lists of keys.
- Fixed issue where external AVRO schemas generated lots of copy jobs. See #203.
- Added fields `sourceTable` and `sourcePartitions` to the `CopierContext` class.
- Added method `newInstance(CopierContext)` to `com.hotels.bdp.circustrain.api.copier.CopierFactory`. This provides Copiers with more configuration information in a future-proof manner. See #195.
- Deprecated other `newInstance()` methods on `com.hotels.bdp.circustrain.api.copier.CopierFactory`.
- Changed version of `hive.version` to `2.3.7` (was `2.3.2`). This allows Circus Train to be used on JDK >= 9.
- Added replication mode `FULL_OVERWRITE` to overwrite a previously replicated table and delete its data. Useful for incompatible schema changes. A sketch of the configuration is shown below.
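  As an illustration only, a minimal sketch of selecting this mode for a table replication. The `replication-mode` property name and the surrounding `table-replications` layout are assumptions based on the project's README; database names, table names and the location are placeholders:

  ```
  table-replications:
    - source-table:
        database-name: source_db
        table-name: orders
      replica-table:
        database-name: replica_db
        table-name: orders
        table-location: s3://replica-bucket/warehouse/orders/
      replication-mode: FULL_OVERWRITE
  ```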
- Updated S3S3Copier to have a configurable max number of threads to pass to TransferManager.
- Fixed `AssumeRoleCredentialProvider` not auto-renewing credentials on expiration.
- Fixed issue where replication breaks if struct columns have changed. See #173.
- Minimum supported Java version is now 8 (was 7).
- Updated `hotels-oss-parent` version to 5.0.0 (was 4.3.1).
- Updated property `aws-jdk.version` to 1.11.728 (was 1.11.505).
- Updated property `httpcomponents.httpclient.version` to 4.5.11 (was 4.5.5).
- When replicating tables with large numbers of partitions, `Replica.updateMetadata` now calls add/alter partition in batches of 1000. See #166.
- AVRO Schema Copier now re-uses the normal 'data' copier instead of its own. See #162.
- Changed the order of the generated partition filter used by "HiveDiff": it is now reverse natural order (which means new partitions first when partitions are date/time strings). When in doubt, use the circus-train-tool `check-filters.sh` to see what would be generated.
- Fixed issue where `partition-limit` was not correctly applied when generating a partition filter. See #164.
- Default `avro-serde-options` must now be included within `transform-options`. This is a backwards incompatible change to the configuration file; please see Avro Schema Replication for more information. A sketch of the new layout is shown below.
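  For illustration, a hedged sketch of the new layout only. The `base-url` option name and the schema location are assumptions based on the Avro Schema Replication documentation:

  ```
  transform-options:
    avro-serde-options:
      base-url: s3://replica-bucket/avro-schemas/
  ```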
- Updated `jackson` version to 2.10.0 (was 2.9.10).
- Updated `hotels-oss-parent` version to 4.2.0 (was 4.0.0). Contains updates to the copyright header.
- Table properties can now be added to default transformations.
- Added `copier-options.assume-role` to assume a role when using the S3S3 copier, for example:
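  A minimal sketch under the assumption that the option takes a role ARN; the ARN below is a placeholder:

  ```
  copier-options:
    assume-role: arn:aws:iam::123456789012:role/circus-train-replication
  ```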
- Table transformation to add custom properties to tables during a replication.
- If a user doesn't specify `avro-serde-options`, Circus Train will still copy the external schema over to the target table. See #131.
- Added `copier-options.assume-role` to assume a role when using the S3MapReduceCp copier class. See README.md for details.
- Excluded `org.pentaho:pentaho-aggdesigner-algorithm` from build.
- Fixed bug in `AbstractAvroSerDeTransformation` where the config state wasn't refreshed on every replication.
- Updated `jackson` version to 2.9.10 (was 2.9.8).
- Updated `beeju` version to 2.0.0 (was 1.2.1).
- Updated `circus-train-minimal.yml.template` to include the required `housekeeping` configuration for using the default schema with H2.
- Updated `housekeeping` version to 3.1.0 (was 3.0.6). Contains various housekeeping fixes.
- Updated `housekeeping` version to 3.0.6 (was 3.0.5). This change modifies the default script for creating a housekeeping schema (from `classpath:/schema.sql` to the empty string) and can cause errors for users that use the schema provided by default. To fix the errors, the property `housekeeping.db-init-script` can be set back to `classpath:/schema.sql`, which uses a file provided by default by Circus Train, as sketched below.
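  A minimal sketch of that fix, assuming the property nests under `housekeeping` in the YAML configuration:

  ```
  housekeeping:
    db-init-script: classpath:/schema.sql
  ```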
- Updated `hotels-oss-parent` version to 4.0.0 (was 2.3.5).
- Clear partitioned state correctly for `SnsListener`. See #104.
- Fixed issue where in certain cases the table location of a partitioned table would be scheduled for housekeeping.
- Removed default script for creating a housekeeping schema to allow the use of schemas that are already created. See #111.
- Upgraded AWS SDK to remove deprecation warning. See #102.
- Upgraded `hcommon-hive-metastore` version to 1.3.0 (was 1.2.4) to fix Thrift compatibility bug. See #115.
- Configurable retry mechanism to handle flaky AWS S3 to S3 copying. See #56.
- Refactored project to remove checkstyle and findbugs warnings.
- Upgraded `hotels-oss-parent` version to 2.3.5 (was 2.3.3).
- Upgraded `housekeeping` version to 3.0.5 (was 3.0.0).
- Upgraded `jackson` version to 2.9.8 (was 2.9.7).
- Support for getting AWS Credentials within a FARGATE instance in ECS. See #109.
- Added `replication-strategy` configuration that can be used to support propagating deletes (drop table/partition operations). See README.md for more details and the sketch below.
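  For illustration, a hedged sketch of how this might look for one table replication; the `PROPAGATE_DELETES` value name and its placement under `table-replications` are assumptions, so defer to README.md:

  ```
  table-replications:
    - replication-strategy: PROPAGATE_DELETES
  ```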
- Ability to specify an S3 canned ACL via `copier-options.canned-acl`. See #99. For example:
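  A minimal sketch, assuming the option accepts the standard S3 canned ACL names; the value below is just one of those:

  ```
  copier-options:
    canned-acl: bucket-owner-full-control
  ```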
- Increased version (1.2.4) of hcommon-hive-metastore to fix an issue where the wrong exception was being propagated in the compatibility layer.
- Housekeeping can be configured to control query batch size, which controls memory usage. See #40.
- Housekeeping readme moved to Housekeeping project. See #31.
- Upgraded Housekeeping library to also store replica database and table name in Housekeeping database. See #30.
- Upgraded `hotels-oss-parent` pom to 2.3.3 (was 2.0.6). See #97.
- Narrowed component scanning to internal base packages instead of `com.hotels.bdp.circustrain`. See #95. Note this change is not backwards compatible for any Circus Train extensions that are in the `com.hotels.bdp.circustrain` package: these were in effect being implicitly scanned and loaded, but won't be now. Instead, these extensions will need to be added using Circus Train's standard extension loading mechanism.
- Upgraded `jackson.version` to 2.9.7 (was 2.6.6), `aws-jdk.version` to 1.11.431 (was 1.11.126) and `httpcomponents.httpclient.version` to 4.5.5 (was 4.5.2). See #91.
- Refactored general metastore tunnelling code to leverage hcommon-hive-metastore libraries. See #85.
- Refactored the remaining code in `core.metastore` from `circus-train-core` to leverage hcommon-hive-metastore libraries.
- circus-train-gcp: avoid a temporary copy of the key file to `user.dir` when using an absolute path to the Google Cloud credentials file, by transforming it into a relative path.
- circus-train-gcp: a relative path can now be provided in the configuration for the Google Cloud credentials file.
- circus-train-vacuum-tool moved into Housekeeping project under the module housekeeping-vacuum-tool.
- Configuration classes moved from Core to API sub-project. See #78.
- Refactored general purpose Hive metastore code to leverage hcommon-hive-metastore libraries. See #72.
- Avro schemas were not being replicated when an `avro.schema.url` without a scheme was specified. See #74.
- Avro schemas were not being replicated when a HA NameNode is configured and the Avro replication feature is used. See #69.
- Add SSH timeout and SSH strict host key checking capabilities. See #64.
- Using hcommon-ssh-1.0.1 dependency to fix issue where metastore exceptions were lost and not propagated properly over tunnelled connections.
- Replace SSH support with the hcommon-ssh library. See #46.
- Housekeeping was failing when attempting to delete a path which no longer exists on the replica filesystem. Upgraded Circus Train's Housekeeping dependency to a version which fixes this bug. See #61.
- Ability to select Copier via configuration. See #55.
- Clearer replica-check exception message. See #47.
- S3-S3 Hive Diff calculating incorrect checksum on folders. See #49.
- SNS message now indicates if message was truncated. See #41.
- Exclude Guava 17.0 in favour of Guava 20.0 for Google Cloud library compatibility.
- Add dependency management bom for Google Cloud dependencies.
- Backwards compatibility with Hive 1.2.x.
- Added ability to configure AWS Server Side Encryption for `S3S3Copier` via the `copier-options.s3-server-side-encryption` configuration property, for example:
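  A hedged sketch only; whether the property takes a boolean or an algorithm name is an assumption here:

  ```
  copier-options:
    s3-server-side-encryption: true
  ```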
- Upgrade housekeeping to version 1.0.2.
- Google FileSystem classes not being placed onto the mapreduce.application.classpath in S3MapReduceCp and DistCp mapreduce jobs.
- Google FileSystem and S3 FileSystems added to mapreduce.application.classpath in circus-train-gcp and circus-train-aws respectively.
- Fixed Housekeeping failing due to missing credentials. See #23.
- Added `replicaTableLocation`, `replicaMetastoreUris` and `partitionKeys` to SNS message.
- SNS Message `protocolVersion` changed from "1.0" to "1.1".
- Updated documentation for circus-train-aws-sns module (full reference of SNS message format, more examples).
- Fixed references to README.md in command line runner help messages to point to correct GitHub locations.
- Upgraded Hive version from 1.2.1 to 2.3.2 (changes are backwards compatible).
- Upgraded Spring Platform version from 2.0.3.RELEASE to 2.0.8.RELEASE.
- Replaced `TunnellingMetaStoreClient` "concrete" implementation with a Java reflection `TunnellingMetaStoreClientInvocationHandler`.
- Replicating a partitioned table containing no partitions will now succeed instead of silently not replicating the table metadata.
- Most functionality from Housekeeping module moved to https://github.com/HotelsDotCom/housekeeping.
- Maven group ID changed to com.hotels.
- Exclude logback in parent POM.
- First open source release.
- Various small code cleanups.
- `S3S3Copier` captures cross-region replications from US-Standard AWS regions.
- Mock S3 end-point for HDFS-S3 and S3-S3 replications.
- New `S3MapreduceCp` properties to control the size of the buffer used by the S3 `TransferManager` and to control the upload retries of the S3 client. Refer to README.md for details.
- `EventIdExtractor` RegEx changed so that it captures new event IDs and legacy event IDs.
- Added read limit to prevent the AWS library from trying to read more data than the size of the buffer provided by the Hadoop `FileSystem`.
- circus-train-housekeeping support for storing housekeeping data in JDBC compliant SQL databases.
- circus-train-parent updated to inherit from hww-parent version 12.1.3.
- Support for replication of Hive views.
- Removed circus-train-aws dependency on internal patched hadoop-aws.
- Fixed error when replicating partitioned tables with empty partitions.
- Removed circus-train-aws dependency from circus-train-core.
- Circus Train Tools to be packaged as TGZ.
- Updated to parent POM 12.1.0 (latest versions of dependencies and plugins).
- Relocate only Guava rather than Google Cloud Platform dependencies + Guava.
- Removed `CircusTrainContext` interface as it was a legacy leftover way to make CT pluggable.
- Fixed broken Circus Train Tool Scripts.
- S3/HDFS to GS Hive replication.
- Support for users to specify a list of `extension-packages` in their YAML configuration. This adds the specified packages to Spring's component scan, thereby allowing the loading of extensions via standard Spring annotations such as `@Component` and `@Configuration`. A sketch is shown below.
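  For illustration only; the package name is hypothetical, and whether multiple packages are given as a comma-separated string or a YAML list is an assumption to check against the README:

  ```
  extension-packages: com.example.circustrain.extensions
  ```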
- Added circus-train-package module that builds a TGZ file with a runnable circus-train script.
- Changed public interface `com.hotels.bdp.circustrain.api.event.TableReplicationListener` for easier error handling.
- RPM module has been pulled out to a top-level project.
- New and improved HDFS-to-S3 copier.
- Changed default `instance.home` from `$user.dir` to `$user.home`. If you relied on `$user.dir`, please set the `instance.home` variable in the YAML config, for example:
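  A minimal sketch, assuming the property nests in the YAML as shown; the path is a placeholder:

  ```
  instance:
    home: /var/circus-train
  ```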
- Fixed issue in LoggingListener where the wrong number of altered partitions was reported.
- Fixed issue with the circus-train-aws-sns module.
- Added S3 to S3 replication.
- Should have been a major release. The public `com.hotels.bdp.circustrain.api.copier.CopierFactory.supportsSchemes` method has changed signature. Please adjust your code if you rely on this for Circus Train extensions.
- Clean up of AWS Credential Provider classes.
- Fixed tables in documentation to display correctly.
- Added support for skipping missing partition folder errors.
- Added support for transporting AvroSerDe files when the avro.schema.url is specified within the SERDEPROPERTIES rather than the TBLPROPERTIES.
- Fixed bug where table-location matching avro schema base-url causes an IOException to be thrown in future reads of the replica table.
- Added transformation config per table replication (used in Avro SerDe transformation).
- Replaced usage of reflections.org with Spring's scanning provider.
- Documented methods for implementing custom copiers and data transformations.
- Cleaned up copier options configuration classes.
- Support for composite copiers.
- Extensions for `HiveMetaStoreClientFactory` to allow integration with AWS Athena.
- Updated to extend hdw-parent 9.2.2, which in turn upgrades `hive.version` to 1.2.1000.2.4.3.3-2.
- Support for downloading Avro schemas from the URI on the source table and uploading it to a user specified URL on replication.
- Multiple transformations can now be loaded onto the classpath for application on replication rather than just one.
- Update to new parent with HDP dependency updates for the Hadoop upgrade.
- Fixed bug where incorrect table name resulted in a NullPointerException.
- Added new "replication mode: METADATA_UPDATE" feature which provides the ability to replicate metadata only which is useful for updating the structure and configuration of previously replicated tables.
- Added new "replication mode: METADATA_MIRROR" feature which provides the ability to replicate metadata only, pointing the replica table at the original source data locations.
- Replication and Housekeeping can now be executed in separate processes.
- Add the option `--modules=replication` to the script `circus-train.sh` to perform replication only.
- Use the scripts `housekeeping.sh` and `housekeeping-rush.sh` to perform housekeeping in its own process.
- Made scripts and code workable with HDP-2.4.3.
- Fixed issue where wrong replication was listed as the failed replication.
- Support for replication from AWS to on-premises when running on on-premises cluster.
- Configuration element `replica-catalog.s3` is now `security`. The following is an example of how to migrate your configuration files to this new version:
Old configuration file:

```
...
replica-catalog:
  name: my-replica
  hive-metastore-uris: thrift://hive.localdomain:9083
  s3:
    credential-provider: jceks://hdfs/<hdfs-location-of-jceks-file>/my-credentials.jceks
...
```

New configuration file:

```
...
replica-catalog:
  name: my-replica
  hive-metastore-uris: thrift://hive.localdomain:9083
  security:
    credential-provider: jceks://hdfs/<hdfs-location-of-jceks-file>/my-credentials.jceks
...
```
- Exit codes based on success or error.
- Ignoring params that seem to be added in the replication process.
- Support sending `S3_DIST_CP_BYTES_REPLICATED`/`DIST_CP_BYTES_REPLICATED` metrics to Graphite for running (S3)DistCp jobs.
- Support for SSH tunneling on source catalog.
- Fixes for filter partition generator.
- Enabled the generation of filter partitions for incremental replication.
- Introduction of the transformations API: users can now provide a metadata transformation function for tables, partitions and column statistics.
- Fixed issue with deleted paths.
- Added some stricter preconditions to the vacuum tool so that data is not unintentionally removed from tables with inconsistent metadata.
- Added the 'vacuum' tool for removing data orphaned by a bug in circus-train versions earlier than 2.0.0.
- Moved the 'filter tool' into a 'tools' sub-module.
- Fixed issue where housekeeping would fail when two processes deleted the same entry.
- SSH tunnels with multiple hops. The property `replica-catalog.metastore-tunnel.user` has been replaced with `replica-catalog.metastore-tunnel.route`, and the property `replica-catalog.metastore-tunnel.private-key` has been replaced with `replica-catalog.metastore-tunnel.private-keys`. Refer to README.md for details; a sketch is shown below.
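  For illustration, a hedged sketch of a multi-hop tunnel; the hop format and the comma-separated key list are assumptions based on the README, and all hosts and paths are placeholders:

  ```
  replica-catalog:
    metastore-tunnel:
      route: ec2-user@bastion-host -> hadoop@cluster-node
      private-keys: /home/user/.ssh/bastion-key.pem,/home/user/.ssh/cluster-key.pem
  ```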
- The executable script has been split to provide both non-RUSH and RUSH executions. If you are not using RUSH then keep using `circus-train.sh`; if you are using RUSH then you can either change your scripts to invoke `circus-train-rush.sh` instead, or add the new parameter `rush` in the first position when invoking `circus-train.sh`.
- Removal of property `graphite.enabled`.
- Improvements and fixes to the housekeeping process that manages old data deletion:
  - Binds `S3AFileSystem` to `s3[n]://` schemes in the tool for housekeeping.
  - Only remove housekeeping entries from the database if:
    - the leaf path described by the record no longer exists AND another sibling exists who can look after the ancestors,
    - OR the ancestors of the leaf path no longer exist.
  - Stores the eventId of the deleted path along with the path.
  - If an existing record does not have the previous eventId then reconstruct it from the path (to support legacy data for the time being).
- `DistCP` temporary path is now set per task.