Releases: snowplow/snowplow-rdb-loader
4.2.0
New authorization options for Snowflake and Databricks
Up until version 4.1.0, RDB Loader required that the warehouse had been pre-configured with read access to the data files in S3.
For Snowflake, this meant setting up an external stage with a storage integration.
For Databricks, it meant setting up a cluster to assume an AWS instance profile.
Starting with version 4.2.0, RDB Loader is able to generate temporary credentials using STS and pass these credentials to Snowflake/Databricks. This removes the need to pre-configure the warehouse with access permission.
To start using the new authorization method, you must add a loadAuthMethod to the storage block in your config file:
"storage": {
// other required fields go here
"loadAuthMethod": {
"type": "TempCreds"
"roleArn": "arn:aws:iam::123456789:role/example_role_name"
}
}
...where roleArn is a role with permission to read files from the S3 bucket. The loader must have permission to assume this role.
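For illustration, here is a minimal sketch of the IAM policies such a role might need; the role names, account ID and bucket are hypothetical, and your setup may differ. The trust policy allows the loader's own role to assume it, and the permissions policy grants read access to the transformed data in S3.
Trust policy on example_role_name (the principal is a hypothetical role the loader runs as):
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Principal": { "AWS": "arn:aws:iam::123456789:role/rdb-loader-role" },
    "Action": "sts:AssumeRole"
  }]
}
Permissions policy on example_role_name (hypothetical bucket and prefix):
{
  "Version": "2012-10-17",
  "Statement": [{
    "Effect": "Allow",
    "Action": ["s3:GetObject", "s3:ListBucket"],
    "Resource": ["arn:aws:s3:::bucket", "arn:aws:s3:::bucket/transformed/*"]
  }]
}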
Our GitHub repo has some examples of this configuration for Snowflake and for Databricks.
Note that for Snowflake loading, depending on your event volume and warehouse configuration, there may still be an advantage to setting up the storage integration, because the underlying COPY INTO statement is more efficient. For Databricks loading, though, there should be no impact from switching to the new authorization method.
Retry on target initialization
The target initialization block is now wrapped in a retry block, so that if an exception is thrown during initialization, the loader retries according to the specified backoff strategy instead of crashing.
To enable this feature, an initRetries block must be added to the config file:
"initRetries": {
"backoff": "30 seconds"
"strategy": "EXPONENTIAL"
"attempts": 3,
"cumulativeBound": "1 hour"
},
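For context, here is a minimal sketch of where this block sits in the loader config file, assuming the usual top-level HOCON layout (other fields elided with ...):
{
  "storage": { ... },

  "initRetries": {
    "backoff": "30 seconds"
    "strategy": "EXPONENTIAL"
    "attempts": 3,
    "cumulativeBound": "1 hour"
  }

  # remaining loader settings (monitoring, retries, etc.) stay as they are
}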
Adjusting the path appended to the Snowflake stage in the Snowflake Loader
As mentioned in the first section, we've added a new authorization option to the Snowflake Loader. However, the old Snowflake stage method can still be used.
Previously, the Snowflake stage path had to be exactly the path where the transformed run folders reside. If the path of a parent folder was given as the stage path, loading wouldn't work.
We've fixed this issue in this release: even if the stage path is set to a parent directory of the transformed folders, loading will still work correctly.
To use this feature, you need to update the transformedStage and folderMonitoringStage blocks:
"transformedStage": {
# The name of the stage
"name": "snowplow_stage"
# The S3 path used as stage location
"location": "s3://bucket/transformed/"
}
"folderMonitoringStage": {
# The name of the stage
"name": "snowplow_folders_stage"
# The S3 path used as stage location
"location": "s3://bucket/monitoring/"
}
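To illustrate with a hypothetical bucket and run ID: the stage location can now point at the parent directory, while the transformer keeps writing run folders beneath it, e.g.
s3://bucket/transformed/                            <- stage location given above
s3://bucket/transformed/run=2022-05-01-00-00-14/    <- a run folder produced by the transformer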
Bug fix for streaming transformer on multiple instances
In the previous release of RDB Loader we announced that the streaming transformer can now scale to multiple instances, which was a really important requirement for high-volume pipelines.
We got one little thing wrong though, and it led to some app crashes with error messages about lost Kinesis leases. This bug is now fixed in version 4.2.0, and we hope this unblocks scaling your pipeline to higher event volumes with the streaming transformer.
Upgrading to 4.2.0
If you are already using a recent version of RDB Loader (3.0.0 or higher) then upgrading to 4.2.0 is as simple as pulling the newest docker images. There are no changes needed to your configuration files.
docker pull snowplow/transformer-kinesis:4.2.0
docker pull snowplow/rdb-loader-redshift:4.2.0
docker pull snowplow/rdb-loader-snowflake:4.2.0
docker pull snowplow/rdb-loader-databricks:4.2.0
The Snowplow docs site has a full guide to running the RDB Loader.
Changelog
- Transformer kinesis: Recover from losing lease to a new worker (#962)
- Loader: make the part appended to folder monitoring staging path configurable (#969)
- Loader: retry on target initialization (#964)
- Loader: Trim alert message payloads to 4096 characters (#956)
- Snowflake loader: make on_error continue when type of the incoming data is parquet (#970)
- Snowflake loader: make the path used with stage adjustable (#968)
- Snowflake loader: use STS tokens for copying from S3 (#955)
- Snowflake loader: Specify file format in the load statement (#957)
- Databricks loader: Generate STS tokens for copying from S3 (#954)
4.1.0
Concurrent streaming transformers for horizontal scaling
Before version 4.1.0, it was only possible to run a single instance of the streaming transformer at any one time. If you tried to run multiple instances at the same time, there was a race condition, which was described in detail in a previous Discourse thread. The old setup worked great for low-volume pipelines, but it meant the streaming solution was not ideal for scaling up to higher volumes.
In version 4.1.0 we have worked around the problem simply by changing the directory names in S3 to contain a UUID unique to each running transformer. Before version 4.1.0, an output directory might be called run=2022-05-01-00-00-14, but in 4.1.0 the output directory might be called run=2022-05-01-00-00-14-b4cac3e5-9948-40e3-bd68-38abcf01cdf9. Directory names for the batch transformer are not affected.
With this simple change, you can now safely scale out your streaming transformer to have multiple instances running in parallel.
Databricks loader supports generated columns
If you load into Databricks, a great way to set up your table is to partition based on the date of the event using a generated column:
CREATE TABLE IF NOT EXISTS snowplow.events (
  app_id VARCHAR(255),
  collector_tstamp TIMESTAMP NOT NULL,
  event_name VARCHAR(1000),
  -- Lots of other fields go here
  -- Collector timestamp date for partitioning
  collector_tstamp_date DATE GENERATED ALWAYS AS (DATE(collector_tstamp))
)
PARTITIONED BY (collector_tstamp_date, event_name);
This partitioning strategy is very efficient for analytic queries that filter by collector_tstamp. The Snowplow/Databricks dbt web model works particularly well with this partitioning scheme.
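As an illustrative example (using the table and columns from the snippet above), a query like the following filters on collector_tstamp, and Delta Lake can derive a filter on the generated collector_tstamp_date partition column to prune the partitions it scans:
-- Event counts for one week; only the matching collector_tstamp_date partitions are read
SELECT event_name, COUNT(*) AS event_count
FROM snowplow.events
WHERE collector_tstamp >= '2022-05-01' AND collector_tstamp < '2022-05-08'
GROUP BY event_name;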
In RDB Loader version 4.1.0 we made a small change to the Databricks loader to account for these generated columns.
Upgrading to 4.1.0
If you are already using a recent version of RDB Loader (3.0.0 or higher) then upgrading to 4.1.0 is as simple as pulling the newest docker images. There are no changes needed to your configuration files.
docker pull snowplow/transformer-kinesis:4.1.0
docker pull snowplow/rdb-loader-redshift:4.1.0
docker pull snowplow/rdb-loader-snowflake:4.1.0
docker pull snowplow/rdb-loader-databricks:4.1.0
The Snowplow docs site has a full guide to running the RDB Loader.
Changelog
- Databricks loader: Support for generated columns (#951)
- Loader: Use explicit schema name everywhere (#952)
- Loader: Jars cannot load jsch (#942)
- Snowflake loader: region and account configuration fields should be optional (#947)
- Loader: Include the SQLState when logging a SQLException (#941)
- Loader: Handle run directories with UUID suffix in folder monitoring (#949)
- Add UUID to streaming transformer directory structure (#945)
4.0.4
A patch release to make transformer-kinesis more configurable via the HOCON file.
Changelog
- Transformer kinesis: make Kinesis consumer more configurable (#865)
- Transformer: split batch and streaming configs (#937)
4.0.3
4.0.2
This patch release has several improvements to make the loaders and streaming transformer more resilient against failures. It also patches dependencies to latest versions to mitigate security vulnerabilities.
Common
- Set region in the SQS client builder (#587)
- Common: Snyk action should only run on push to master (#929)
Loaders
- Use forked version of jsch lib for ssh (#927)
- Recover from exceptions on alerting webhook (#925)
- Add logging around using SSH tunnel (#923)
- Timeouts on JDBC statements (#914)
- Bump snowflake-jdbc to 3.13.9 (#928)
- Make ON_ERROR copy option configurable (#912)
Transformer Kinesis
- Bump parquet-hadoop to 1.12.3 (#933)
- Exclude hadoop transitive dependencies (#932)
- Always end up in consistent state (#873)
- No checkpointing until after SQS message is sent (#917)
- Add missing hadoop-aws dependency for s3 parquet files upload (#920)
Batch Transformer
- Add fileFormat field to formats section of example hocon (#848)
4.0.1
4.0.0
In this release, we are introducing our new Databricks Loader. The Databricks Loader loads data transformed into Parquet, so we've added wide row Parquet support to both the Batch Transformer and the Stream Transformer.
We've also included various improvements and bug fixes for the Stream Transformer, to get one step closer to making it production-ready.
Common
Loader
- Add Databricks as a destination (#860)
- Check if target is ready before submitting the statement (#846)
- Emit latency statistics on constant intervals (#795)
- Add load_tstamp (#815, #571)
Batch Transformer
- Support Parquet output option (#896)
Transformer Kinesis
- Support Parquet output option (#900)
- Report metrics (#862)
- Add telemetry (#863)
- Write shredding_complete.json to S3 (#867)
- Use output of transformation in updating global state (#824)
- Fix updating total and bad number of events counter in global state (#823)
- Add tests for whole processing pipeline (#835)
- Fix passing checkpoint action during creation of windowed records (#762)