forked from pingcap/docs
ticdc: add data integration docs (pingcap#9692)
* ticdc: add data integration docs
* translate integrating with kafka
* format
* lint
* fix gramar
* Apply suggestions from code review Co-authored-by: Ran <[email protected]>
* Update integration-overview.md Co-authored-by: Ran <[email protected]>
* fix code format and adjust a note
* further fix code format
* ci Co-authored-by: Ran <[email protected]>
1 parent
5a907a6
commit e6a1e89
Showing 16 changed files with 470 additions and 160 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,16 @@ | ||
--- | ||
title: Data Integration Overview | ||
summary: Learn the overview of data integration scenarios. | ||
--- | ||
|
||
# Data Integration Overview | ||
|
||
Data integration means the flow, transfer, and consolidation of data among various data sources. As data volumes grow and more value is extracted from data, data integration has become increasingly important. To prevent TiDB from becoming a data silo and to integrate its data with other platforms, TiCDC can replicate incremental data change logs from TiDB to other data platforms. This document describes data integration applications that use TiCDC. You can choose the integration solution that suits your business scenario. | ||
|
||
## Integrate with Confluent Cloud | ||
|
||
You can use TiCDC to replicate incremental data from TiDB to Confluent Cloud, and replicate the data to ksqlDB, Snowflake, and SQL Server via Confluent Cloud. For details, see [Integrate with Confluent Cloud](/ticdc/integrate-confluent-using-ticdc.md). | ||
|
||
## Integrate with Apache Kafka and Apache Flink | ||
|
||
You can use TiCDC to replicate incremental data from TiDB to Apache Kafka, and consume the data using Apache Flink. For details, see [Integrate with Apache Kafka and Apache Flink](/replicate-data-to-kafka.md). |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,229 @@ | ||
--- | ||
title: Migrate Data from TiDB to MySQL-compatible Databases | ||
summary: Learn how to migrate data from TiDB to MySQL-compatible databases. | ||
--- | ||
|
||
# Migrate Data from TiDB to MySQL-compatible Databases | ||
|
||
This document describes how to migrate data from TiDB clusters to MySQL-compatible databases, such as Aurora, MySQL, and MariaDB. The process consists of four steps: | ||
|
||
1. Set up the environment. | ||
2. Migrate full data. | ||
3. Migrate incremental data. | ||
4. Switch services to the downstream cluster. | ||
|
||
## Step 1. Set up the environment | ||
|
||
1. Deploy a TiDB cluster upstream. | ||
|
||
Deploy a TiDB cluster by using TiUP Playground. For more information, refer to [Deploy and Maintain an Online TiDB Cluster Using TiUP](/tiup/tiup-cluster.md). | ||
|
||
```shell | ||
# Create a TiDB cluster | ||
tiup playground --db 1 --pd 1 --kv 1 --tiflash 0 --ticdc 1 | ||
# View cluster status | ||
tiup status | ||
``` | ||
|
||
2. Deploy a MySQL instance downstream. | ||
|
||
- In a lab environment, you can use Docker to quickly deploy a MySQL instance by running the following command: | ||
|
||
```shell | ||
docker run --name some-mysql -e MYSQL_ROOT_PASSWORD=my-secret-pw -p 3306:3306 -d mysql | ||
``` | ||
|
||
- In a production environment, you can deploy a MySQL instance by following instructions in [Installing MySQL](https://dev.mysql.com/doc/refman/8.0/en/installing.html). | ||
|
||
3. Simulate service workload. | ||
|
||
In the lab environment, you can use `go-tpc` to write data to the upstream TiDB cluster, which generates change events in the cluster. Run the following commands with TiUP bench to create a database named `tpcc` in the TiDB cluster and then write data to it. | ||
|
||
```shell | ||
tiup bench tpcc -H 127.0.0.1 -P 4000 -D tpcc --warehouses 4 prepare | ||
tiup bench tpcc -H 127.0.0.1 -P 4000 -D tpcc --warehouses 4 run --time 300s | ||
``` | ||
|
||
For more details about `go-tpc`, refer to [How to Run TPC-C Test on TiDB](/benchmark/benchmark-tidb-using-tpcc.md). | ||
|
||
## Step 2. Migrate full data | ||
|
||
After setting up the environment, you can use [Dumpling](/dumpling-overview.md) to export the full data from the upstream TiDB cluster. | ||
|
||
> **Note:** | ||
> | ||
> In production clusters, performing a backup with GC disabled might affect cluster performance. It is recommended that you complete this step in off-peak hours. | ||
|
||
1. Disable Garbage Collection (GC). | ||
|
||
To ensure that newly written data is not deleted during incremental migration, disable GC for the upstream cluster before exporting full data. With GC disabled, historical data is retained. | ||
|
||
Run the following command to disable GC: | ||
|
||
```sql | ||
MySQL [test]> SET GLOBAL tidb_gc_enable=FALSE; | ||
``` | ||
|
||
``` | ||
Query OK, 0 rows affected (0.01 sec) | ||
``` | ||
|
||
To verify that the change takes effect, query the value of `tidb_gc_enable`: | ||
|
||
```sql | ||
MySQL [test]> SELECT @@global.tidb_gc_enable; | ||
``` | ||
|
||
``` | ||
+-------------------------+ | ||
| @@global.tidb_gc_enable | | ||
+-------------------------+ | ||
| 0 | | ||
+-------------------------+ | ||
1 row in set (0.00 sec) | ||
``` | ||
|
||
2. Back up data. | ||
|
||
1. Export data in SQL format using Dumpling: | ||
|
||
```shell | ||
tiup dumpling -u root -P 4000 -h 127.0.0.1 --filetype sql -t 8 -o ./dumpling_output -r 200000 -F256MiB | ||
``` | ||
|
||
2. After the export finishes, run the following command to check the metadata. `Pos` in the metadata is the TSO of the export snapshot. Record it as the BackupTS. | ||
|
||
```shell | ||
cat dumpling_output/metadata | ||
``` | ||
|
||
``` | ||
Started dump at: 2022-06-28 17:49:54 | ||
SHOW MASTER STATUS: | ||
Log: tidb-binlog | ||
Pos: 434217889191428107 | ||
GTID: | ||
Finished dump at: 2022-06-28 17:49:57 | ||
``` | ||
|
||
3. Restore data. | ||
|
||
Use MyLoader (an open-source tool) to import data to the downstream MySQL instance. For details about how to install and use MyLoader, see [MyDumper/MyLoader](https://github.com/mydumper/mydumper). Run the following command to import the full data exported by Dumpling to MySQL: | ||
|
||
```shell | ||
myloader -h 127.0.0.1 -P 3306 -d ./dumpling_output/ | ||
``` | ||
|
||
4. (Optional) Validate data. | ||
|
||
You can use [sync-diff-inspector](/sync-diff-inspector/sync-diff-inspector-overview.md) to check data consistency between upstream and downstream at a certain time. | ||
|
||
```shell | ||
sync_diff_inspector -C ./config.toml | ||
``` | ||
|
||
For details about how to configure the sync-diff-inspector, see [Configuration file description](/sync-diff-inspector/sync-diff-inspector-overview.md#configuration-file-description). In this document, the configuration is as follows: | ||
|
||
```toml | ||
# Diff Configuration. | ||
######################### Datasource config ######################### | ||
[data-sources] | ||
[data-sources.upstream] | ||
host = "127.0.0.1" # Replace the value with the IP address of your upstream cluster | ||
port = 4000 | ||
user = "root" | ||
password = "" | ||
snapshot = "434217889191428107" # Set snapshot to the actual backup time (BackupTS in the "Back up data" section in [Step 2. Migrate full data](#step-2-migrate-full-data)) | ||
[data-sources.downstream] | ||
host = "127.0.0.1" # Replace the value with the IP address of your downstream cluster | ||
port = 3306 | ||
user = "root" | ||
password = "" | ||
######################### Task config ######################### | ||
[task] | ||
output-dir = "./output" | ||
source-instances = ["upstream"] | ||
target-instance = "downstream" | ||
target-check-tables = ["*.*"] | ||
``` | ||
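The BackupTS recorded in the "Back up data" step is reused twice: as the `snapshot` value above and later as the changefeed `--start-ts`. The following is a minimal sketch of reading it from the Dumpling metadata file instead of copying it by hand. The function names `backup_ts_from_metadata` and `tso_to_epoch_ms` are hypothetical helpers, not part of Dumpling or TiUP, and the sketch assumes the `Pos: <number>` line format shown in the sample metadata.

```shell
# Hypothetical helper: print the Pos value (the snapshot TSO) from a
# Dumpling metadata file. Assumes a line of the form "Pos: <number>".
backup_ts_from_metadata() {
    awk '$1 == "Pos:" { print $2; exit }' "$1"
}

# Hypothetical helper: convert a TSO to milliseconds since the Unix epoch.
# The low 18 bits of a TSO hold a logical counter; the remaining high bits
# hold the physical time in milliseconds.
tso_to_epoch_ms() {
    echo $(( $1 >> 18 ))
}
```

For example, `BACKUP_TS=$(backup_ts_from_metadata dumpling_output/metadata)` stores the value in a shell variable, and `tso_to_epoch_ms "$BACKUP_TS"` gives a physical timestamp that you can compare with the `Started dump at` line as a sanity check.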
|
||
## Step 3. Migrate incremental data | ||
|
||
1. Deploy TiCDC. | ||
|
||
After finishing full data migration, deploy and configure a TiCDC cluster to replicate incremental data. In production environments, deploy TiCDC as instructed in [Deploy TiCDC](/ticdc/deploy-ticdc.md). In this document, a TiCDC node has been started upon the creation of the test cluster. Therefore, you can skip the step of deploying TiCDC and proceed with the next step to create a changefeed. | ||
|
||
2. Create a changefeed. | ||
|
||
In the upstream cluster, run the following command to create a changefeed from the upstream cluster to the downstream cluster: | ||
|
||
```shell | ||
tiup ctl:v6.1.0 cdc changefeed create --pd=http://127.0.0.1:2379 --sink-uri="mysql://root:@127.0.0.1:3306" --changefeed-id="upstream-to-downstream" --start-ts="434217889191428107" | ||
``` | ||
|
||
In this command, the parameters are as follows: | ||
|
||
- `--pd`: PD address of the upstream cluster | ||
- `--sink-uri`: URI of the downstream cluster | ||
- `--changefeed-id`: changefeed ID, which must match the regular expression `^[a-zA-Z0-9]+(\-[a-zA-Z0-9]+)*$` | ||
- `--start-ts`: start timestamp of the changefeed, which must be the BackupTS recorded in the "Back up data" section in [Step 2. Migrate full data](#step-2-migrate-full-data) | ||
|
||
For more information about the changefeed configurations, see [Task configuration file](/ticdc/manage-ticdc.md#task-configuration-file). | ||
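The `--changefeed-id` format rule can be checked locally before creating a changefeed. The following is a minimal sketch, assuming `grep` with extended regular expressions is available. The function name `is_valid_changefeed_id` is a hypothetical helper and not part of the TiCDC command line.

```shell
# Hypothetical helper: return 0 if the ID matches the documented pattern,
# that is, alphanumeric groups joined by single hyphens.
is_valid_changefeed_id() {
    printf '%s\n' "$1" | grep -Eq '^[a-zA-Z0-9]+(-[a-zA-Z0-9]+)*$'
}
```

With this helper, `is_valid_changefeed_id "upstream-to-downstream"` succeeds, while an ID that contains underscores or starts with a hyphen is rejected.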
|
||
3. Enable GC. | ||
|
||
In incremental migration using TiCDC, GC only removes historical data that has already been replicated. Therefore, after creating a changefeed, you need to re-enable GC. For details, see [What is the complete behavior of TiCDC garbage collection (GC) safepoint](/ticdc/ticdc-faq.md#what-is-the-complete-behavior-of-ticdc-garbage-collection-gc-safepoint). | ||
|
||
To enable GC, run the following command: | ||
|
||
```sql | ||
MySQL [test]> SET GLOBAL tidb_gc_enable=TRUE; | ||
``` | ||
|
||
``` | ||
Query OK, 0 rows affected (0.01 sec) | ||
``` | ||
|
||
To verify that the change takes effect, query the value of `tidb_gc_enable`: | ||
|
||
```sql | ||
MySQL [test]> SELECT @@global.tidb_gc_enable; | ||
``` | ||
|
||
``` | ||
+-------------------------+ | ||
| @@global.tidb_gc_enable | | ||
+-------------------------+ | ||
| 1 | | ||
+-------------------------+ | ||
1 row in set (0.00 sec) | ||
``` | ||
|
||
## Step 4. Switch services | ||
|
||
After you create a changefeed, data written to the upstream cluster is replicated to the downstream cluster with low latency. You can gradually migrate read traffic to the downstream cluster and observe it for a period. If the downstream cluster is stable, you can switch write traffic to the downstream cluster as well by taking the following steps: | ||
|
||
1. Stop write services in the upstream cluster. Make sure that all upstream data is replicated to the downstream cluster before stopping the changefeed. | ||
|
||
```shell | ||
# Stop the changefeed from the upstream cluster to the downstream cluster | ||
tiup cdc cli changefeed pause -c "upstream-to-downstream" --pd=http://127.0.0.1:2379 | ||
# View the changefeed status | ||
tiup cdc cli changefeed list | ||
``` | ||
|
||
``` | ||
[ | ||
{ | ||
"id": "upstream-to-downstream", | ||
"summary": { | ||
"state": "stopped", # Ensure that the status is stopped | ||
"tso": 434218657561968641, | ||
"checkpoint": "2022-06-28 18:38:45.685", # This time should be later than the time when writes were stopped | ||
"error": null | ||
} | ||
} | ||
] | ||
``` | ||
|
||
2. After migrating write services to the downstream cluster, observe the cluster for a period. If the downstream cluster is stable, you can take the upstream cluster offline. |
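The state check in step 1 can also be scripted. The following is a minimal sketch that reads the `changefeed list` output on standard input and succeeds only if the named changefeed reports `"state": "stopped"`. The function name `changefeed_is_stopped` is a hypothetical helper, and the text matching assumes the output layout shown in the sample above. A JSON-aware tool would be more robust if the layout differs.

```shell
# Hypothetical helper: exit 0 only if the changefeed with the given ID
# is followed by a line containing "state": "stopped".
changefeed_is_stopped() {
    awk -v id="$1" '
        index($0, "\"id\": \"" id "\"") { found = 1; next }
        found && /"state":/ { ok = ($0 ~ /"state": "stopped"/); exit }
        END { exit ok ? 0 : 1 }
    '
}
```

For example, `tiup cdc cli changefeed list | changefeed_is_stopped "upstream-to-downstream"` returns a zero exit status once the changefeed is stopped, which makes it usable as a guard in a cutover script.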