diff --git a/TOC.md b/TOC.md index f03fc96981015..9ae5892b04252 100644 --- a/TOC.md +++ b/TOC.md @@ -40,15 +40,21 @@ - [Test TiDB Using TPC-C](/benchmark/benchmark-tidb-using-tpcc.md) - Migrate - [Overview](/migration-overview.md) - - Migrate from MySQL - - [Migrate from Amazon Aurora MySQL Using TiDB Lightning](/migrate-from-aurora-using-lightning.md) - - [Migrate from MySQL SQL Files Using TiDB Lightning](/migrate-from-mysql-dumpling-files.md) - - [Migrate from Amazon Aurora MySQL Using DM](/migrate-from-aurora-mysql-database.md) - - Migrate from CSV Files - - [Use TiDB Lightning](/tidb-lightning/migrate-from-csv-using-tidb-lightning.md) - - [Use `LOAD DATA` Statement](/sql-statements/sql-statement-load-data.md) - - [Migrate from SQL Files](/migrate-from-mysql-dumpling-files.md) - - [Replicate Incremental Data between TiDB Clusters in Real Time](/incremental-replication-between-clusters.md) + - [Migration Tools](/migration-tools.md) + - Migration Scenarios + - [Migrate from Aurora](/migrate-aurora-to-tidb.md) + - [Migrate MySQL of Small Datasets](/migrate-small-mysql-to-tidb.md) + - [Migrate MySQL of Large Datasets](/migrate-large-mysql-to-tidb.md) + - [Migrate and Merge MySQL Shards of Small Datasets](/migrate-small-mysql-shards-to-tidb.md) + - [Migrate and Merge MySQL Shards of Large Datasets](/migrate-large-mysql-shards-to-tidb.md) + - [Migrate from CSV Files](/migrate-from-csv-files-to-tidb.md) + - [Migrate from SQL Files](/migrate-from-sql-files-to-tidb.md) + - [Replicate Incremental Data between TiDB Clusters](/incremental-replication-between-clusters.md) + - Advanced Migration + - [Continuous Replication with gh-ost or pt-osc](/migrate-with-pt-ghost.md) + - [Filter Binlog Events](/filter-binlog-event.md) + - [Filter DML Events Using SQL Expressions](/filter-dml-event.md) + - [Migrate to a Downstream Table with More Columns](/migrate-with-more-columns-downstream.md) - Maintain - Upgrade - [Use TiUP (Recommended)](/upgrade-tidb-using-tiup.md) diff --git a/_index.md b/_index.md index 896dc05ce32a9..859b3e29674e6 100644 --- a/_index.md +++ b/_index.md @@ -49,10 +49,9 @@ Designed for the cloud, TiDB provides flexible scalability, reliability and secu Migrate Data - [Migration Overview](/migration-overview.md) -- [Migrate full data from Aurora](/migrate-from-aurora-using-lightning.md) -- [Migrate continuously from Aurora/MySQL Database](/migrate-from-aurora-mysql-database.md) -- [Migrate from CSV Files](/tidb-lightning/migrate-from-csv-using-tidb-lightning.md) -- [Migrate from MySQL SQL Files](/migrate-from-mysql-dumpling-files.md) +- [Migrate Data from CSV Files to TiDB](/migrate-from-csv-files-to-tidb.md) +- [Migrate Data from SQL Files to TiDB](/migrate-from-sql-files-to-tidb.md) +- [Migrate Data from Amazon Aurora to TiDB](/migrate-aurora-to-tidb.md) diff --git a/filter-binlog-event.md b/filter-binlog-event.md new file mode 100644 index 0000000000000..d5ae8f81e3108 --- /dev/null +++ b/filter-binlog-event.md @@ -0,0 +1,124 @@ +--- +title: Filter Binlog Events +summary: Learn how to filter binlog events when migrating data. +--- + +# Filter Binlog Events + +This document describes how to filter binlog events when you use DM to perform continuous incremental data replication. 
For the detailed replication instructions, refer to the following documents by scenarios: + +- [Migrate MySQL of Small Datasets to TiDB](/migrate-small-mysql-to-tidb.md) +- [Migrate MySQL of Large Datasets to TiDB](/migrate-large-mysql-to-tidb.md) +- [Migrate and Merge MySQL Shards of Small Datasets to TiDB](/migrate-small-mysql-shards-to-tidb.md) +- [Migrate and Merge MySQL Shards of Large Datasets to TiDB](/migrate-large-mysql-shards-to-tidb.md) + +## Configuration + +To use binlog event filter, add a `filter` to the task configuration file of DM, as shown below: + +```yaml +filters: + rule-1: + schema-pattern: "test_*" + table-pattern: "t_*" + events: ["truncate table", "drop table"] + sql-pattern: ["^DROP\\s+PROCEDURE", "^CREATE\\s+PROCEDURE"] + action: Ignore +``` + +- `schema-pattern`/`table-pattern`: Filters matching schemas or tables +- `events`: Filters binlog events. Supported events are listed in the table below: + + | Event | Category | Description | + | --------------- | ---- | --------------------------| + | all | | Includes all events | + | all dml | | Includes all DML events | + | all ddl | | Includes all DDL events | + | none | | Includes no event | + | none ddl | | Excludes all DDL events | + | none dml | | Excludes all DML events | + | insert | DML | Insert DML event | + | update | DML | Update DML event | + | delete | DML | Delete DML event | + | create database | DDL | Create database event | + | drop database | DDL | Drop database event | + | create table | DDL | Create table event | + | create index | DDL | Create index event | + | drop table | DDL | Drop table event | + | truncate table | DDL | Truncate table event | + | rename table | DDL | Rename table event | + | drop index | DDL | Drop index event | + | alter table | DDL | Alter table event | + +- `sql-pattern`:Filters specified DDL SQL statements. The matching rule supports using a regular expression. +- `action`: `Do` or `Ignore` + + - `Do`: the allow list. A binlog event is replicated if meeting either of the following two conditions: + + - The event matches the rule setting. + - sql-pattern has been specified and the SQL statement of the event matches any of the sql-pattern options. + + - `Ignore`: the block list. A binlog event is filtered out if meeting either of the following two conditions: + + - The event matches the rule setting. + - sql-pattern has been specified and the SQL statement of the event matches any of the sql-pattern options. + + If both `Do` and `Ignore` are configured, `Ignore` has higher priority over `Do`. That is, an event satisfying both `Ignore` and `Do` conditions will be filtered out. + +## Application scenarios + +This section describes the application scenarios of binlog event filter. 
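The scenarios below only show the `filters` definitions. In a complete DM task file, each upstream data source must also reference the rules that apply to it. The following trimmed sketch assumes a data source named `mysql-replica-01` and uses the `filter-rules` reference key; verify the key name against the task configuration template of your DM version.

```yaml
name: "test"
task-mode: all

mysql-instances:
  - source-id: "mysql-replica-01"      # assumed source ID
    filter-rules: ["rule-1"]           # applies the rule defined below to this data source

filters:
  rule-1:
    schema-pattern: "test_*"
    table-pattern: "t_*"
    events: ["truncate table", "drop table"]
    action: Ignore
```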
+ +### Filter out all sharding deletion operations + +To filter out all deletion operations, configure a `filter-table-rule` and a `filter-schema-rule`, as shown below: + +``` +filters: + filter-table-rule: + schema-pattern: "test_*" + table-pattern: "t_*" + events: ["truncate table", "drop table", "delete"] + action: Ignore + filter-schema-rule: + schema-pattern: "test_*" + events: ["drop database"] + action: Ignore +``` + +### Migrate only DML operations of sharded schemas and tables + +To replicate only DML statements, configure two `Binlog event filter rule`, as shown below: + +``` +filters: + do-table-rule: + schema-pattern: "test_*" + table-pattern: "t_*" + events: ["create table", "all dml"] + action: Do + do-schema-rule: + schema-pattern: "test_*" + events: ["create database"] + action: Do +``` + +### Filter out SQL statements not supported by TiDB + +To filter out SQL statements not supported by TiDB, configure a `filter-procedure-rule`, as shown below: + +``` +filters: + filter-procedure-rule: + schema-pattern: "*" + sql-pattern: [".*\\s+DROP\\s+PROCEDURE", ".*\\s+CREATE\\s+PROCEDURE", "ALTER\\s+TABLE[\\s\\S]*ADD\\s+PARTITION", "ALTER\\s+TABLE[\\s\\S]*DROP\\s+PARTITION"] + action: Ignore +``` + +> **Warning:** +> +> To avoid filtering out data that needs to be migrated, configure the global filtering rule as strictly as possible. + +## See also + +[Filter Binlog Events Using SQL Expressions](/filter-dml-event.md) diff --git a/filter-dml-event.md b/filter-dml-event.md new file mode 100644 index 0000000000000..5836dbcb48252 --- /dev/null +++ b/filter-dml-event.md @@ -0,0 +1,80 @@ +--- +title: Filter DML Events Using SQL Expressions +summary: Learn how to filter DML events using SQL expressions. +--- + +# Filter DML Events Using SQL Expressions + +This document introduces how to filter binlog events using SQL expressions when you use DM to perform continuous incremental data replication. For the detailed replication instruction, refer to the following documents: + +- [Migrate MySQL of Small Datasets to TiDB](/migrate-small-mysql-to-tidb.md) +- [Migrate MySQL of Large Datasets to TiDB](/migrate-large-mysql-to-tidb.md) +- [Migrate and Merge MySQL Shards of Small Datasets to TiDB](/migrate-small-mysql-shards-to-tidb.md) +- [Migrate and Merge MySQL Shards of Large Datasets to TiDB](/migrate-large-mysql-shards-to-tidb.md) + +When performing incremental data replication, you can use the [Binlog Event Filter](/filter-binlog-event.md) to filter certain types of binlog events. For example, you can choose not to replicate `DELETE` events to the downstream for the purposes like archiving and auditing. However, the Binlog Event Filter cannot determine whether to filter the `DELETE` event of a row that requires finer granularity. + +To address the issue, since v2.0.5, DM supports using `binlog value filter` in incremental data replication to filter data. Among the DM-supported and `ROW`-formatted binlog, the binlog events carry values of all columns, and you can configure SQL expressions based on these values. If the expression calculates a row change as `TRUE`, DM does not replicate this row change to the downstream. + +Similar to [Binlog Event Filter](/filter-binlog-event.md), you need to configure `binlog value filter` in the task configuration file. For details, see the following configuration example. 
For the advanced task configuration and the description, refer to [DM advanced task configuration file](https://docs.pingcap.com/tidb-data-migration/stable/task-configuration-file-full#task-configuration-file-template-advanced). + +```yaml +name: test +task-mode: all + +mysql-instances: + - source-id: "mysql-replica-01" + expression-filters: ["even_c"] + +expression-filter: + even_c: + schema: "expr_filter" + table: "tbl" + insert-value-expr: "c % 2 = 0" +``` + +In the above configuration example, the `even_c` rule is configured and referenced by the data source `mysql-replica-01`. According to this rule, for the `tb1` table in the `expr_filter` schema, when an even number is inserted into the `c` column (`c % 2 = 0`), this `insert` statement is not replicated to the downstream. The following example shows the effect of this rule. + +Incrementally insert the following data in the upstream data source: + +```sql +INSERT INTO tbl(id, c) VALUES (1, 1), (2, 2), (3, 3), (4, 4); +``` + +Then query the `tb1` table on downstream. You can see that only the rows with odd numbers on `c` are replicated. + +```sql +MySQL [test]> select * from tbl; ++------+------+ +| id | c | ++------+------+ +| 1 | 1 | +| 3 | 3 | ++------+------+ +2 rows in set (0.001 sec) +``` + +## Configuration parameters and description + +- `schema`: The name of the upstream schema to match. Wildcard matching or regular matching is not supported. +- `table`: The name of the upstream table to match. Wildcard matching or regular matching is not supported. +- `insert-value-expr`: Configures an expression that takes effect on values carried by the `INSERT` type binlog events (WRITE_ROWS_EVENT). You cannot use this expression together with `update-old-value-expr`, `update-new-value-expr` or `delete-value-expr` in the same configuration item. +- `update-old-value-expr`: Configures an expression that takes effect on the old values carried by the `UPDATE` type binlog events (UPDATE_ROWS_EVENT). You cannot use this expression together with `insert-value-expr` or `delete-value-expr` in the same configuration item. +- `update-new-value-expr`: Configures an expression that takes effect on the new values carried by the `UPDATE` type binlog events (UPDATE_ROWS_EVENT). You cannot use this expression together with `insert-value-expr` or `delete-value-expr` in the same configuration item. +- `delete-value-expr`: Configures an expression that takes effect on values carried by the `DELETE` type binlog events (DELETE_ROWS_EVENT). You cannot use this expression together with `insert-value-expr`, `update-old-value-expr` or `update-new-value-expr`. + +> **Note:** +> +> - You can configure `update-old-value-expr` and `update-new-value-expr` together. +> - When `update-old-value-expr` and `update-new-value-expr` are configured together, the rows whose "update + old values" meet `update-old-value-expr` **and** whose "update + new values" meet `update-new-value-expr` are filtered. +> - When one of `update-old-value-expr` and `update-new-value-expr` is configured, the configured expression determines whether to filter the **entire row change**, which means that the deletion of old values and the insertion of new values are filtered as a whole. + +You can use the SQL expression on one column or on multiple columns. You can also use the SQL functions supported by TiDB, such as `c % 2 = 0`, `a*a + b*b = c*c`, and `ts > NOW()`. + +The `TIMESTAMP` default time zone is the time zone specified in the task configuration file. 
The default value is the time zone of the downstream. You can explicitly specify the time zone in a way like `c_timestamp = '2021-01-01 12:34:56.5678+08:00'`. + +You can configure multiple filtering rules under the `expression-filter` configuration item. The upstream data source references the required rule in `expression-filters` to make it effective. When multiple rules are used, if **any** one of the rules are matched, the entire row change is filtered. + +> **Note:** +> +> Configuring too many expression filtering rules increases the calculation overhead of DM and slows down the data replication. diff --git a/media/migrate-shard-tables-within-1tb-en.png b/media/migrate-shard-tables-within-1tb-en.png new file mode 100644 index 0000000000000..769a15c887496 Binary files /dev/null and b/media/migrate-shard-tables-within-1tb-en.png differ diff --git a/media/shard-merge-using-lightning-en.png b/media/shard-merge-using-lightning-en.png new file mode 100644 index 0000000000000..ceee3da550cee Binary files /dev/null and b/media/shard-merge-using-lightning-en.png differ diff --git a/migrate-aurora-to-tidb.md b/migrate-aurora-to-tidb.md new file mode 100644 index 0000000000000..d2f699a6fcc7b --- /dev/null +++ b/migrate-aurora-to-tidb.md @@ -0,0 +1,307 @@ +--- +title: Migrate Data from Amazon Aurora to TiDB +summary: Learn how to migrate data from Amazon Aurora to TiDB using DB snapshot. +aliases: ['/tidb/dev/migrate-from-aurora-using-lightning','/docs/dev/migrate-from-aurora-mysql-database/','/docs/dev/how-to/migrate/from-mysql-aurora/','/docs/dev/how-to/migrate/from-aurora/', '/tidb/dev/migrate-from-aurora-mysql-database'] +--- + +# Migrate Data from Amazon Aurora to TiDB + +This document describes how to migrate data from Amazon Aurora to TiDB. The migration process uses [DB snapshot](https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/Aurora.Managing.Backups.html), which saves a lot of space and time. + +The whole migration has two processes: + +- Import full data to TiDB using TiDB Lightning +- Replicate incremental data to TiDB using DM (optional) + +## Prerequisites + +- [Install Dumpling and TiDB Lightning](/migration-tools.md) +- [Get the target database privileges required for TiDB Lightning](/tidb-lightning/tidb-lightning-faq.md#what-are-the-privilege-requirements-for-the-target-database). + +## Import full data to TiDB + +### Step 1. Export an Aurora snapshot to Amazon S3 + +1. In Aurora, query the current binlog position by running the following command: + + ```sql + mysql> SHOW MASTER STATUS; + ``` + + The output is similar to the following. Record the binlog name and position for later use. + + ``` + +------------------+----------+--------------+------------------+-------------------+ + | File | Position | Binlog_Do_DB | Binlog_Ignore_DB | Executed_Gtid_Set | + +------------------+----------+--------------+------------------+-------------------+ + | mysql-bin.000002 | 52806 | | | | + +------------------+----------+--------------+------------------+-------------------+ + 1 row in set (0.012 sec) + ``` + +2. Export the Aurora snapshot. For detailed steps, refer to [Exporting DB snapshot data to Amazon S3](https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/USER_ExportSnapshot.html). + +After you obtain the binlog position, export the snapshot within 5 minutes. Otherwise, the recorded binlog position might be outdated and thus cause data conflict during the incremental replication. 
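Because the incremental replication described later in this document starts from the recorded binlog position, also make sure that Aurora retains binlogs long enough to cover the snapshot export and the full import. On Amazon Aurora MySQL, binlog retention is usually controlled with the `mysql.rds_set_configuration` stored procedure; the following statements are a sketch, so confirm the procedure and an appropriate retention value against the AWS documentation for your engine version.

```sql
-- Check the current binlog retention setting. NULL means binlogs can be purged almost immediately.
CALL mysql.rds_show_configuration;

-- Retain binlogs for 72 hours. Choose a value that comfortably covers the export and import duration.
CALL mysql.rds_set_configuration('binlog retention hours', 72);
```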
+ +After the two steps above, make sure you have the following information ready: + +- The Aurora binlog name and position at the time of the snapshot creation. +- The S3 path where the snapshot is stored, and the SecretKey and AccessKey with access to the S3 path. + +### Step 2. Export schema + +Because the snapshot file from Aurora does not contain the DDL statements, you need to export the schema using Dumpling and create the schema in the target database using TiDB Lightning. If you want to manually create the schema, you can skip this step. + +Export the schema using Dumpling by running the following command. The command includes the `--filter` parameter to only export the desired table schema: + +{{< copyable "shell-regular" >}} + +```shell +tiup dumpling --host ${host} --port 3306 --user root --password ${password} --filter 'my_db1.table[12]' --no-data --output 's3://my-bucket/schema-backup?region=us-west-2' --filter "mydb.*" +``` + +The parameters used in the command above are as follows. For more parameters, refer to [Dumpling overview](/dumpling-overview.md). + +|Parameter |Description | +|- |- | +|`-u` or `--user` |Aurora MySQL user| +|`-p` or `--password` |MySQL user password| +|`-P` or `--port` |MySQL port| +|`-h` or `--host` |MySQL IP address| +|`-t` or `--thread` |The number of threads used for export| +|`-o` or `--output` |The directory that stores the exported file. Supports local path or [external storage URL](/br/backup-and-restore-storages.md)| +|`-r` or `--row` |The maximum number of rows in a single file| +|`-F` |The maximum size of a single file, in MiB. Recommended value: 256 MiB.| +|`-B` or `--database` |Specifies a database to be exported| +|`-T` or `--tables-list`|Exports the specified tables| +|`-d` or `--no-data` |Does not export data. Only exports schema.| +|`-f` or `--filter` |Exports tables that match the pattern. Do not use `-f` and `-T` at the same time. Refer to [table-filter](/table-filter.md) for the syntax.| + +### Step 3. Create the TiDB Lightning configuration file + +Create the `tidb-lightning.toml` configuration file as follows: + +{{< copyable "shell-regular" >}} + +```shell +vim tidb-lightning.toml +``` + +{{< copyable "" >}} + +```toml +[tidb] + +# The target TiDB cluster information. +host = ${host} # e.g.: 172.16.32.1 +port = ${port} # e.g.: 4000 +user = "${user_name} # e.g.: "root" +password = "${password}" # e.g.: "rootroot" +status-port = ${status-port} # Obtains the table schema information from TiDB status port, e.g.: 10080 +pd-addr = "${ip}:${port}" # The cluster PD address, e.g.: 172.16.31.3:2379. TiDB Lightning obtains some information from PD. When backend = "local", you must specify status-port and pd-addr correctly. Otherwise, the import will be abnormal. + +[tikv-importer] +# "local": Default backend. The local backend is recommended to import large volumes of data (1 TiB or more). During the import, the target TiDB cluster cannot provide any service. +# "tidb": The "tidb" backend is recommended to import data less than 1 TiB. During the import, the target TiDB cluster can provide service normally. +backend = "local" + +# Set the temporary storage directory for the sorted Key-Value files. The directory must be empty, and the storage space must be enough to hold the largest single table in the data source. For better import performance, it is recommended to use a directory different from `data-source-dir` and use flash storage, which can use I/O exclusively. 
+sorted-kv-dir = "/mnt/ssd/sorted-kv-dir" + +[mydumper] +# The path that stores the snapshot file. +data-source-dir = "${s3_path}" # e.g.: s3://my-bucket/sql-backup?region=us-west-2 + +[[mydumper.files]] +# The expression that parses the parquet file. +pattern = '(?i)^(?:[^/]*/)*([a-z0-9_]+)\.([a-z0-9_]+)/(?:[^/]*/)*(?:[a-z0-9\-_.]+\.(parquet))$' +schema = '$1' +table = '$2' +type = '$3' +``` + +If you need to enable TLS in the TiDB cluster, refer to [TiDB Lightning Configuration](/tidb-lightning/tidb-lightning-configuration.md). + +### Step 4. Import full data to TiDB + +1. Create the tables in the target database using TiDB Lightning: + + {{< copyable "shell-regular" >}} + + ```shell + tiup tidb-lightning -config tidb-lightning.toml -d ./schema -no-schema=false + ``` + +2. Start the import by running `tidb-lightning`. If you launch the program directly in the command line, the process might exit unexpectedly after receiving a SIGHUP signal. In this case, it is recommended to run the program using a `nohup` or `screen` tool. For example: + + Pass the SecretKey and AccessKey that have access to the S3 storage path as environment variables to the Dumpling node. You can also read the credentials from `~/.aws/credentials`. + + {{< copyable "shell-regular" >}} + + ```shell + export AWS_ACCESS_KEY_ID=${access_key} + export AWS_SECRET_ACCESS_KEY=${secret_key} + nohup tiup tidb-lightning -config tidb-lightning.toml -no-schema=true > nohup.out 2>&1 & + ``` + +3. After the import starts, you can check the progress of the import by either of the following methods: + + - `grep` the keyword `progress` in the log. The progress is updated every 5 minutes by default. + - Check progress in [the monitoring dashboard](/tidb-lightning/monitor-tidb-lightning.md). + - Check progress in [the TiDB Lightning web interface](/tidb-lightning/tidb-lightning-web-interface.md). + +4. After TiDB Lightning completes the import, it exits automatically. If you find the last 5 lines of its log print `the whole procedure completed`, the import is successful. + +> **Note:** +> +> Whether the import is successful or not, the last line of the log shows `tidb lightning exit`. It means that TiDB Lightning exits normally, but does not necessarily mean that the import is successful. + +If you encounter any problem during the import, refer to [TiDB Lightning FAQ](/tidb-lightning/tidb-lightning-faq.md) for troubleshooting. + +## Replicate incremental data to TiDB (optional) + +### Prerequisites + +- [Install DM](https://docs.pingcap.com/tidb-data-migration/stable/deploy-a-dm-cluster-using-tiup). +- [Get the source database and target database privileges required for DM](https://docs.pingcap.com/tidb-data-migration/stable/dm-worker-intro). + +### Step 1: Create the data source + +1. Create the `source1.yaml` file as follows: + + {{< copyable "" >}} + + ```yaml + # Must be unique. + source-id: "mysql-01" + # Configures whether DM-worker uses the global transaction identifier (GTID) to pull binlogs. To enable this mode, the upstream MySQL must also enable GTID. If the upstream MySQL service is configured to switch master between different nodes automatically, GTID mode is required. + enable-gtid: false + + from: + host: "${host}" # e.g.: 172.16.10.81 + user: "root" + password: "${password}" # Supported but not recommended to use plaintext password. It is recommended to use `dmctl encrypt` to encrypt the plaintext password before using it. + port: 3306 + ``` + +2. 
Load the data source configuration to the DM cluster using `tiup dmctl` by running the following command: + + {{< copyable "shell-regular" >}} + + ```shell + tiup dmctl --master-addr ${advertise-addr} operate-source create source1.yaml + ``` + + The parameters used in the command above are described as follows: + + |Parameter |Description | + |- |- | + |`--master-addr` |The `{advertise-addr}` of any DM-master in the cluster where `dmctl` is to be connected, e.g.: 172.16.10.71:8261| + |`operate-source create`|Loads the data source to the DM cluster.| + +### Step 2: Create the migration task + +Create the `task1.yaml` file as follows: + +{{< copyable "" >}} + +```yaml +# Task name. Multiple tasks that are running at the same time must each have a unique name. +name: "test" +# Task mode. Options are: +# - full: only performs full data migration. +# - incremental: only performs binlog real-time replication. +# - all: full data migration + binlog real-time replication. +task-mode: "incremental" +# The configuration of the target TiDB database. +target-database: + host: "${host}" # e.g.: 172.16.10.83 + port: 4000 + user: "root" + password: "${password}" # Supported but not recommended to use a plaintext password. It is recommended to use `dmctl encrypt` to encrypt the plaintext password before using it. + +# Global configuration for block and allow lists. Each instance can reference the configuration by name. +block-allow-list: # If the DM version is earlier than v2.0.0-beta.2, use black-white-list. + listA: # Name. + do-tables: # Allow list for the upstream tables to be migrated. + - db-name: "test_db" # Name of databases to be migrated. + tbl-name: "test_table" # Name of tables to be migrated. + +# Configures the data source. +mysql-instances: + - source-id: "mysql-01" # Data source ID,i.e., source-id in source1.yaml + block-allow-list: "listA" # References the block-allow-list configuration above. +# syncer-config-name: "global" # References the syncers incremental data configuration. + meta: # When task-mode is "incremental" and the downstream database does not have a checkpoint, DM uses the binlog position as the starting point. If the downstream database has a checkpoint, DM uses the checkpoint as the starting point. + binlog-name: "mysql-bin.000004" # The binlog position recorded in "Step 1. Export an Aurora snapshot to Amazon S3". When the upstream database has source-replica switching, GTID mode is required. + binlog-pos: 109227 + # binlog-gtid: "09bec856-ba95-11ea-850a-58f2b4af5188:1-9" + + # (Optional) If you need to incrementally replicate data that has already been migrated in the full data migration, you need to enable the safe mode to avoid the incremental data replication error. + # This scenario is common in the following case: the full migration data does not belong to the data source's consistency snapshot, and after that, DM starts to replicate incremental data from a position earlier than the full migration. + # syncers: # The running configurations of the sync processing unit. + # global: # Configuration name. + # safe-mode: true # If this field is set to true, DM changes INSERT of the data source to REPLACE for the target database, and changes UPDATE of the data source to DELETE and REPLACE for the target database. This is to ensure that when the table schema contains a primary key or unique index, DML statements can be imported repeatedly. In the first minute of starting or resuming an incremental replication task, DM automatically enables the safe mode. 
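# (Optional) To skip specific binlog events (for example, TRUNCATE TABLE or DROP TABLE statements) during
# the incremental replication, you can additionally define binlog event filter rules and reference them
# from the data source. See /filter-binlog-event.md. The sketch below is an assumption to be verified
# against the task configuration template of your DM version:
# filters:
#   rule-1:
#     schema-pattern: "test_db"
#     events: ["truncate table", "drop table"]
#     action: Ignore
# Then, under mysql-instances, reference it with: filter-rules: ["rule-1"]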
+``` + +The YAML file above is the minimum configuration required for the migration task. For more configuration items, refer to [DM Advanced Task Configuration File](https://docs.pingcap.com/tidb-data-migration/stable/task-configuration-file-full). + +### Step 3. Run the migration task + +Before you start the migration task, to reduce the probability of errors, it is recommended to confirm that the configuration meets the requirements of DM by running the `check-task` command: + +{{< copyable "shell-regular" >}} + +```shell +tiup dmctl --master-addr ${advertise-addr} check-task task.yaml +``` + +After that, start the migration task by running `tiup dmctl`: + +{{< copyable "shell-regular" >}} + +```shell +tiup dmctl --master-addr ${advertise-addr} start-task task.yaml +``` + +The parameters used in the command above are described as follows: + +|Parameter |Description | +|- |- | +|`--master-addr` |The `{advertise-addr}` of any DM-master in the cluster where `dmctl` is to be connected, e.g.: 172.16.10.71:8261| +|`start-task` |Starts the migration task.| + +If the task fails to start, check the prompt message and fix the configuration. After that, you can re-run the command above to start the task. + +If you encounter any problem, refer to [DM error handling](https://docs.pingcap.com/tidb-data-migration/stable/error-handling) and [DM FAQ](https://docs.pingcap.com/tidb-data-migration/stable/faq). + +### Step 4. Check the migration task status + +To learn whether the DM cluster has an ongoing migration task and the task status, run the `query-status` command using `tiup dmctl`: + +{{< copyable "shell-regular" >}} + +```shell +tiup dmctl --master-addr ${advertise-addr} query-status ${task-name} +``` + +For a detailed interpretation of the results, refer to [Query Status](https://docs.pingcap.com/tidb-data-migration/stable/query-status). + +### Step 5. Monitor the task and view logs + +To view the history status of the migration task and other internal metrics, take the following steps. + +If you have deployed Prometheus, Alertmanager, and Grafana when you deployed DM using TiUP, you can access Grafana using the IP address and port specified during the deployment. You can then select DM dashboard to view DM-related monitoring metrics. + +When DM is running, DM-worker, DM-master, and dmctl print the related information in logs. The log directories of these components are as follows: + +- DM-master: specified by the DM-master process parameter `--log-file`. If you deploy DM using TiUP, the log directory is `/dm-deploy/dm-master-8261/log/` by default. +- DM-worker: specified by the DM-worker process parameter `--log-file`. If you deploy DM using TiUP, the log directory is `/dm-deploy/dm-worker-8262/log/` by default. + +## What's next + +- [Pause the migration task](https://docs.pingcap.com/tidb-data-migration/stable/pause-task). +- [Resume the migration task](https://docs.pingcap.com/tidb-data-migration/stable/resume-task). +- [Stop the migration task](https://docs.pingcap.com/tidb-data-migration/stable/stop-task). +- [Export and import the cluster data source and task configuration](https://docs.pingcap.com/tidb-data-migration/stable/export-import-config). +- [Handle failed DDL statements](https://docs.pingcap.com/tidb-data-migration/stable/handle-failed-ddl-statements). 
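For routine operation, you can also wrap the `check-task`, `start-task`, and `query-status` calls from Step 3 and Step 4 in a small helper script. The following is a sketch that assumes the DM-master address, task file, and task name used in this document; adjust them for your environment.

```shell
#!/bin/bash
# Validate, start, and then periodically poll a DM migration task.
MASTER_ADDR="172.16.10.71:8261"   # the {advertise-addr} of any DM-master
TASK_FILE="task1.yaml"
TASK_NAME="test"

set -e

# Check the task configuration before starting it.
tiup dmctl --master-addr "${MASTER_ADDR}" check-task "${TASK_FILE}"

# Start the incremental replication task.
tiup dmctl --master-addr "${MASTER_ADDR}" start-task "${TASK_FILE}"

# Poll the task status every 30 seconds until interrupted.
while true; do
    tiup dmctl --master-addr "${MASTER_ADDR}" query-status "${TASK_NAME}"
    sleep 30
done
```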
diff --git a/migrate-from-aurora-mysql-database.md b/migrate-from-aurora-mysql-database.md deleted file mode 100644 index 913cbfff9eef6..0000000000000 --- a/migrate-from-aurora-mysql-database.md +++ /dev/null @@ -1,9 +0,0 @@ ---- -title: Migrate from Amazon Aurora MySQL Using DM -summary: Learn how to migrate from MySQL (using a case of Amazon Aurora) to TiDB by using TiDB Data Migration (DM). -aliases: ['/docs/dev/migrate-from-aurora-mysql-database/','/docs/dev/how-to/migrate/from-mysql-aurora/','/docs/dev/how-to/migrate/from-aurora/'] ---- - -# Migrate from Amazon Aurora MySQL Using DM - -To migrate data from MySQL (Amazon Aurora) to TiDB, refer to [Migrate from MySQL (Amazon Aurora)](/dm/migrate-from-mysql-aurora.md). diff --git a/migrate-from-aurora-using-lightning.md b/migrate-from-aurora-using-lightning.md deleted file mode 100644 index e0d8d298d6d2d..0000000000000 --- a/migrate-from-aurora-using-lightning.md +++ /dev/null @@ -1,116 +0,0 @@ ---- -title: Migrate from Amazon Aurora MySQL Using TiDB Lightning -summary: Learn how to migrate full data from Amazon Aurora MySQL to TiDB using TiDB Lightning. ---- - -# Migrate from Amazon Aurora MySQL Using TiDB Lightning - -This document introduces how to migrate full data from Amazon Aurora MySQL to TiDB using TiDB Lightning. - -## Step 1: Export full data from Aurora to Amazon S3 - -Refer to [AWS Documentation - Exporting DB snapshot data to Amazon S3](https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/USER_ExportSnapshot.html) to export the snapshot data of Aurora to Amazon S3. - -## Step 2: Deploy TiDB Lightning - -For detailed deployment methods, see [Deploy TiDB Lightning](/tidb-lightning/deploy-tidb-lightning.md). - -## Step 3: Configure the data source of TiDB Lightning - -Based on different deployment methods, edit the `tidb-lighting.toml` configuration file as follows: - -1. Configure `data-source-dir` under `[mydumper]` as the S3 Bucket path of exported data in [step 1](#step-1-export-full-data-from-aurora-to-amazon-s3). - - ``` - [mydumper] - # Data source directory - data-source-dir = "s3://bucket-name/data-path" - ``` - -2. Configure the target TiDB cluster as follows: - - ``` - [tidb] - # The target cluster information. Fill in one address of tidb-server. - host = "172.16.31.1" - port = 4000 - user = "root" - password = "" - # The PD address of the cluster. - pd-addr = "127.0.0.1:2379" - ``` - -3. Configure the backend mode: - - ``` - [tikv-importer] - # Uses Local-backend. - backend = "local" - # The storage path of local temporary files. Ensure that the corresponding directory does not exist or is empty and that the disk capacity is large enough for storage. - sorted-kv-dir = "/path/to/local-temp-dir" - ``` - -4. Configure the file routing. - - ``` - [mydumper] - no-schema = true - - [[mydumper.files]] - # Uses single quoted strings to avoid escaping. - pattern = '(?i)^(?:[^/]*/)*([a-z0-9_]+)\.([a-z0-9_]+)/(?:[^/]*/)*(?:[a-z0-9\-_.]+\.(parquet))$' - schema = '$1' - table = '$2' - type = '$3' - ``` - -> **Note:** -> -> - The above example uses the Local-backend for best performance. You can also choose TiDB-backend or Importer-backend according to your need. For detailed introduction of the three backend modes, see [TiDB Lightning Backends](/tidb-lightning/tidb-lightning-backends.md). -> - Because the path for exporting snapshot data from Aurora is different from the default file naming format supported by TiDB Lightning, you need to set additional file routing configuration. 
-> - If TLS is enabled in the target TiDB cluster, you also need to configure TLS. - -For other configurations, see [TiDB Lightning Configuration](/tidb-lightning/tidb-lightning-configuration.md). - -## Step 4: Create table schema - -Because the snapshot data exported from Aurora to S3 does not contain the SQL statement file used to create database tables, you need to manually export and import the table creation statements corresponding to the database tables into TiDB. You can use Dumpling and TiDB Lightning to create all table schemas: - -1. Use Dumpling to export table schema files: - - ``` - ./dumpling --host database-1.cedtft9htlae.us-west-2.rds.amazonaws.com --port 3306 --user root --password password --consistency none --no-data --output ./schema --filter "mydb.*" - ``` - - > **Note:** - > - > - Set the parameters of the data source address and the path of output files according to your actual situation. For example, `database-1.cedtft9htlae.us-west-2.rds.amazonaws.com` is the address of Aurora MySQL. - > - If you need to export all database tables, you do not need to set the `--filter` parameter. If you only need to export some of the database tables, configure `--filter` according to [table-filter](https://github.com/pingcap/tidb-tools/blob/master/pkg/table-filter/README.md). - -2. Use TiDB Lightning to create table schemas: - - ``` - ./tidb-lightning -config tidb-lightning.toml -d ./schema -no-schema=false - ``` - - In this example, TiDB Lightning is only used to create table schemas, so you need to execute the above command quickly. At a regular speed, ten table creation statements can be executed in one second. - -> **Note:** -> -> If the number of database tables to create is relatively small, you can manually create the corresponding databases and tables in TiDB directly, or use other tools such as mysqldump to export the schema and then import it into TiDB. - -## Step 5: Run TiDB Lightning to import data - -Run TiDB Lightning to start the import operation. If you start TiDB Lightning by using `nohup` directly in the command line, the program might exit because of the `SIGHUP` signal. Therefore, it is recommended to write `nohup` in a script. For example: - -```bash -# !/bin/bash -export AWS_ACCESS_KEY_ID=${AccessKey} -export AWS_SECRET_ACCESS_KEY=${SecretKey} -nohup ./tidb-lightning -config tidb-lightning.toml > nohup.out & -``` - -When the import operation is started, view the progress by the following two ways: - -- `grep` the keyword `progress` in logs, which is updated every 5 minutes by default. -- Access the monitoring dashboard. See [Monitor TiDB Lightning](/tidb-lightning/monitor-tidb-lightning.md) for details. diff --git a/migrate-from-csv-files-to-tidb.md b/migrate-from-csv-files-to-tidb.md new file mode 100644 index 0000000000000..e4898563806a6 --- /dev/null +++ b/migrate-from-csv-files-to-tidb.md @@ -0,0 +1,203 @@ +--- +title: Migrate Data from CSV Files to TiDB +summary: Learn how to migrate data from CSV files to TiDB. +--- + +# Migrate Data from CSV Files to TiDB + +This document describes how to migrate data from CSV files to TiDB. + +TiDB Lightning can read data from CSV files and other delimiter formats, such as tab-separated values (TSV). For other flat file data sources, you can also refer to this document and migrate data to TiDB. + +## Prerequisites + +- [Install TiDB Lightning](/migration-tools.md). 
+- [Get the target database privileges required for TiDB Lightning](/tidb-lightning/tidb-lightning-faq.md#what-are-the-privilege-requirements-for-the-target-database). + +## Step 1. Prepare the CSV files + +Put all the CSV files in the same directory. If you need TiDB Lightning to recognize all CSV files, the file names should meet the following requirements: + +- If a CSV file contains the data for an entire table, name the file `${db_name}.${table_name}.csv`. +- If the data of one table is separated into multiple CSV files, append a numeric suffix to these CSV files. For example, `${db_name}.${table_name}.003.csv`. The numeric suffixes can be inconsecutive but must be in ascending order. You also need to add extra zeros before the number to ensure all the suffixes are in the same length. + +## Step 2. Create the target table schema + +Because CSV files do not contain schema information, before importing data from CSV files into TiDB, you need to create the target table schema. You can create the target table schema by either of the following two methods: + +* **Method 1**: create the target table schema using TiDB Lightning. + + 1. Write SQL files that contain the required DDL statements. + + - Add `CREATE DATABASE` statements in the `${db_name}-schema-create.sql` files. + - Add `CREATE TABLE` statements in the `${db_name}.${table_name}-schema.sql` files. + + 2. During the migration, add the following configuration in `tidb-lightning.toml`: + + ```toml + [mydumper] + no-schema = false # To create a target table schema using Lightning, set the value to false. + ``` + +* **Method 2**: create the target table schema manually. + + During the migration, add the following configuration in `tidb-lightning.toml`: + + ```toml + [mydumper] + no-schema = true # If you have already created the target table schema, set the value to true, which means skipping the schema creation. + ``` + +## Step 3. Create the configuration file + +Create a `tidb-lightning.toml` file with the following content: + +{{< copyable "shell-regular" >}} + +```toml +[lightning] +# Log +level = "info" +file = "tidb-lightning.log" + +[tikv-importer] +# "local": Default backend. The local backend is recommended to import large volumes of data (1 TiB or more). During the import, the target TiDB cluster cannot provide any service. +# "tidb": The "tidb" backend is recommended to import data less than 1 TiB. During the import, the target TiDB cluster can provide service normally. +backend = "local" +# Set the temporary storage directory for the sorted Key-Value files. The directory must be empty, and the storage space must be enough to hold the largest single table from the data source. For better import performance, it is recommended to use a directory different from `data-source-dir` and use flash storage, which can use I/O exclusively. +sorted-kv-dir = "/mnt/ssd/sorted-kv-dir" + +[mydumper] +# Directory of the data source. +data-source-dir = "${data-path}" # A local path or S3 path. For example, 's3://my-bucket/sql-backup?region=us-west-2'. + +# Configures whether to create the target database and table. +# If you need TiDB Lightning to create the target database and table, set the value to false. +# If you have already created the target database and table, set the value to true. +no-schema = true + +# Defines CSV format. +[mydumper.csv] +# Field separator of the CSV file. Must not be empty. 
If the source file contains fields that are not string or numeric, such as binary, blob, or bit, it is recommended not to usesimple delimiters such as ",", and use an uncommon character combination like "|+|" instead. +separator = ',' +# Delimiter. Can be zero or multiple characters. +delimiter = '"' +# Configures whether the CSV file has a table header. +# If this item is set to true, TiDB Lightning uses the first line of the CSV file to parse the corresponding relationship of fields. +header = true +# Configures whether the CSV file contains NULL. +# If this item is set to true, any column of the CSV file cannot be parsed as NULL. +not-null = false +# If `not-null` is set to false (CSV contains NULL), +# The following value is parsed as NULL. +null = '\N' +# Whether to treat the backslash ('\') in the string as an escape character. +backslash-escape = true +# Whether to trim the last separator at the end of each line. +trim-last-separator = false + +[tidb] +# The target cluster. +host = ${host} # e.g.: 172.16.32.1 +port = ${port} # e.g.: 4000 +user = "${user_name}" # e.g.: "root" +password = "${password}" # e.g.: "rootroot" +status-port = ${status-port} # During the import, TiCb Lightning needs to obtain the table schema information from the TiDB status port. e.g.: 10080 +pd-addr = "${ip}:${port}" # The address of the PD cluster, e.g.: 172.16.31.3:2379. TiDB Lightning obtains some information from PD. When backend = "local", you must specify status-port and pd-addr correctly. Otherwise, the import will be abnormal. +``` + +For more information on the configuration file, refer to [TiDB Lightning Configuration](/tidb-lightning/tidb-lightning-configuration.md). + +## Step 4. Tune the import performance (optional) + +When you import data from CSV files with a uniform size of about 256 MiB, TiDB Lightning works in the best performance. However, if you import data from a single large CSV file, TiDB Lightning can only use one thread to process the import by default, which might slow down the import speed. + +To speed up the import, you can split a large CSV file into smaller ones. For a CSV file in a common format, before TiDB Lightning reads the entire file, it is hard to quickly locate the beginning and ending positions of each line. Therefore, TiDB Lightning does not automatically split CSV files by default. But if your CSV files to be imported meet certain format requirements, you can enable the `strict-format` mode. In this mode, TiDB Lightning automatically splits a single large CSV file into multiple files, each in about 256 MiB, and processes them in parallel. + +> **Note:** +> +> If a CSV file is not in a strict format but the `strict-format` mode is set to `true` by mistake, a field that spans multiple lines will be split into two fields. This causes the parsing to fail, and TiDB Lightning might import the corrupted data without reporting any error. + +In a strict-format CSV file, each field only takes up one line. It must meet the following requirements: + +- The delimiter is empty. +- Each field does not contain CR (\r) or LF (\n). + +If your CSV file meets the above requirements, you can speed up the import by enabling the `strict-format` mode as follows: + +```toml +[mydumper] +strict-format = true +``` + +## Step 5. Import the data + +To start the import, run `tidb-lightning`. If you launch the program in the command line, the process might exit unexpectedly after receiving a SIGHUP signal. In this case, it is recommended to run the program using a `nohup` or `screen` tool. 
For example: + +{{< copyable "shell-regular" >}} + +```shell +nohup tiup tidb-lightning -config tidb-lightning.toml > nohup.out 2>&1 & +``` + +After the import starts, you can check the progress of the import by either of the following methods: + +- `grep` the keyword `progress` in the log. The progress is updated every 5 minutes by default. +- Check progress in [the monitoring dashboard](/tidb-lightning/monitor-tidb-lightning.md). +- Check progress in [the TiDB Lightning web interface](/tidb-lightning/tidb-lightning-web-interface.md). + +After TiDB Lightning completes the import, it exits automatically. If you find the last 5 lines of its log print `the whole procedure completed`, the import is successful. + +> **Note:** +> +> Whether the import is successful or not, the last line of the log shows `tidb lightning exit`. It means that TiDB Lightning exits normally, but does not necessarily mean that the import is successful. + +If the import fails, refer to [TiDB Lightning FAQ](/tidb-lightning/tidb-lightning-faq.md) for troubleshooting. + +## Other file formats + +If your data source is in other formats, to migrate data from your data source, you must end the file name with `.csv` and make corresponding changes in the `[mydumper.csv]` section of the `tidb-lightning.toml` configuration file. Here are example changes for common formats: + +**TSV:** + +```toml +# Format example +# ID Region Count +# 1 East 32 +# 2 South NULL +# 3 West 10 +# 4 North 39 + +# Format configuration +[mydumper.csv] +separator = "\t" +delimiter = '' +header = true +not-null = false +null = 'NULL' +backslash-escape = false +trim-last-separator = false +``` + +**TPC-H DBGEN:** + +```toml +# Format example +# 1|East|32| +# 2|South|0| +# 3|West|10| +# 4|North|39| + +# Format configuration +[mydumper.csv] +separator = '|' +delimiter = '' +header = false +not-null = true +backslash-escape = false +trim-last-separator = true +``` + +## What's next + +- [CSV Support and Restrictions](/tidb-lightning/migrate-from-csv-using-tidb-lightning.md). diff --git a/migrate-from-mysql-dumpling-files.md b/migrate-from-mysql-dumpling-files.md deleted file mode 100644 index a15af57be99f1..0000000000000 --- a/migrate-from-mysql-dumpling-files.md +++ /dev/null @@ -1,81 +0,0 @@ ---- -title: Migrate from MySQL SQL Files Using TiDB Lightning -summary: Learn how to migrate data from MySQL SQL files to TiDB using TiDB Lightning. -aliases: ['/docs/dev/migrate-from-mysql-mydumper-files/','/tidb/dev/migrate-from-mysql-mydumper-files/'] ---- - -# Migrate from MySQL SQL Files Using TiDB Lightning - -This document describes how to migrate data from MySQL SQL files to TiDB using TiDB Lightning. For details on how to generate MySQL SQL files, refer to [Dumpling](/dumpling-overview.md). - -The data migration process described in this document uses TiDB Lightning. The steps are as follows. - -## Step 1: Deploy TiDB Lightning - -Before you start the migration, [deploy TiDB Lightning](/tidb-lightning/deploy-tidb-lightning.md). - -> **Note:** -> -> - If you choose the Local-backend to import data, the TiDB cluster cannot provide services during the import process. This mode imports data quickly, which is suitable for importing a large amount of data (above the TB level). -> - If you choose the TiDB-backend, the cluster can provide services normally during the import, at a slower import speed. -> - For detailed differences between the two backend modes, see [TiDB Lightning Backends](/tidb-lightning/tidb-lightning-backends.md). 
- -## Step 2: Configure data source of TiDB Lightning - -This document takes the TiDB-backend as an example. Create the `tidb-lightning.toml` configuration file and add the following major configurations in the file: - -1. Set the `data-source-dir` under `[mydumper]` to the path of the MySQL SQL file. - - ``` - [mydumper] - # Data source directory - data-source-dir = "/data/export" - ``` - - > **Note:** - > - > If a corresponding schema already exists in the downstream, set `no-schema=true` to skip the creation of the schema. - -2. Add the configuration of the target TiDB cluster. - - ``` - [tidb] - # The target cluster information. Fill in one address of tidb-server. - host = "172.16.31.1" - port = 4000 - user = "root" - password = "" - ``` - -3. Add the necessary parameter for the backend. This document uses the TiDB-backend mode. Here, "backend" can also be set to "local" or "importer" according to your actual application scenario. For details, refer to [Backend Mode](/tidb-lightning/tidb-lightning-backends.md). - - ``` - [tikv-importer] - backend = "tidb" - ``` - -4. Add necessary parameters for importing the TiDB cluster. - - ``` - [tidb] - host = "{{tidb-host}}" - port = {{tidb-port}} - user = "{{tidb-user}}" - password = "{{tidb-password}}" - ``` - -For other configurations, see [TiDB Lightning Configuration](/tidb-lightning/tidb-lightning-configuration.md). - -## Step 3: Run TiDB Lightning to import data - -Run TiDB Lightning to start the import operation. If you start TiDB Lightning by using `nohup` directly in the command line, the program might exit because of the `SIGHUP` signal. Therefore, it is recommended to write `nohup` in a script. For example: - -```bash -# !/bin/bash -nohup ./tidb-lightning -config tidb-lightning.toml > nohup.out & -``` - -When the import operation is started, view the progress by the following two ways: - -- `grep` the keyword `progress` in logs, which is updated every 5 minutes by default. -- Access the monitoring dashboard. See [Monitor TiDB Lightning](/tidb-lightning/monitor-tidb-lightning.md) for details. diff --git a/migrate-from-sql-files-to-tidb.md b/migrate-from-sql-files-to-tidb.md new file mode 100644 index 0000000000000..a642af790164b --- /dev/null +++ b/migrate-from-sql-files-to-tidb.md @@ -0,0 +1,115 @@ +--- +title: Migrate Data from SQL Files to TiDB +summary: Learn how to migrate data from SQL files to TiDB. +aliases: ['/docs/dev/migrate-from-mysql-mydumper-files/','/tidb/dev/migrate-from-mysql-mydumper-files/','/tidb/dev/migrate-from-mysql-dumpling-files'] +--- + +# Migrate Data from SQL Files to TiDB + +This document describes how to migrate data from MySQL SQL files to TiDB using TiDB Lightning. For how to generate MySQL SQL files, refer to [Export to SQL files using Dumpling](/dumpling-overview.md#export-to-sql-files). + +## Prerequisites + +- [Install TiDB Lightning using TiUP](/migration-tools.md) +- [Grant the required privileges to the target database for TiDB Lightning](/tidb-lightning/tidb-lightning-faq.md#what-are-the-privilege-requirements-for-the-target-database) + +## Step 1. Prepare SQL files + +Put all the SQL files in the same directory, like `/data/my_datasource/` or `s3://my-bucket/sql-backup?region=us-west-2`. TiDB Lighting recursively searches for all `.sql` files in this directory and its subdirectories. + +## Step 2. Define the target table schema + +To import data to TiDB, you need to create the table schema for the target database. 
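For reference, hand-written schema files for a hypothetical database `my_db1` with a single table `table1` might look like the following; the file naming rules that TiDB Lightning expects are described in Method 1 below.

```sql
-- my_db1-schema-create.sql
CREATE DATABASE IF NOT EXISTS `my_db1`;

-- my_db1.table1-schema.sql
CREATE TABLE IF NOT EXISTS `table1` (
    `id` BIGINT NOT NULL PRIMARY KEY,
    `name` VARCHAR(255) DEFAULT NULL
);
```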
+ +If you use Dumpling to export data, the table schema file is automatically exported. For the data exported in other ways, you can create the table schema in one of the following methods: + ++ **Method 1**: Create the target table schema using TiDB Lightning. + + 1. Write SQL files that contain the required DDL statements. + + - The format of the file name is `${db_name}-schema-create.sql`, and this file should have the `CREATE DATABASE` statements. + - The format of the file name is `${db_name}.${table_name}-schema.sql`, and this file should have the `CREATE TABLE` statements. + + 2. During the migration, add the following configuration in `tidb-lightning.toml`: + + ```toml + [mydumper] + no-schema = false # To create the table schema in the target database using TiDB Lightning, set the value to false + ``` + ++ **Method 2**: Create the target table schema manually. + + Before the migration, add the following configuration in `tidb-lightning.toml`: + + ```toml + [mydumper] + no-schema = true # If you have already created the target table schema, set the value to true, which means skipping the schema creation. + ``` + +## Step 3. Create the configuration file + +Create a `tidb-lightning.toml` file with the following content: + +{{< copyable "" >}} + +```toml +[lightning] +# Log +level = "info" +file = "tidb-lightning.log" + +[tikv-importer] +# "local":Default. The local backend is used to import large volumes of data (around or more than 1 TiB). During the import, the target TiDB cluster cannot provide any service. +# "tidb":The "tidb" backend can also be used to import small volumes of data (less than 1 TiB). During the import, the target TiDB cluster can provide service normally. For the information about backend mode, refer to https://docs.pingcap.com/tidb/stable/tidb-lightning-backends. + +backend = "local" +# Sets the temporary storage directory for the sorted key-value files. The directory must be empty, and the storage space must be enough to store the largest single table from the data source. For better import performance, it is recommended to use a directory different from `data-source-dir` and use flash storage and exclusive I/O for the directory. +sorted-kv-dir = "${sorted-kv-dir}" + +[mydumper] +# Directory of the data source +data-source-dir = "${data-path}" # Local or S3 path, such as 's3://my-bucket/sql-backup?region=us-west-2' + +# If you have manually created the target table schema in #Step 2, set it to true; otherwise, it is false. +# no-schema = true + +[tidb] +# The information of target cluster +host = ${host} # For example, 172.16.32.1 +port = ${port} # For example, 4000 +user = "${user_name}" # For example, "root" +password = "${password}" # For example, "rootroot" +status-port = ${status-port} # During the import process, TiDB Lightning needs to obtain table schema information from the "Status Port" of TiDB, such as 10080. +pd-addr = "${ip}:${port}" # The address of the cluster's PD. TiDB Lightning obtains some information through PD, such as 172.16.31.3:2379. When backend = "local", you must correctly specify status-port and pd-addr. Otherwise, the import will encounter errors. +``` + +For more information about the configuration file, refer to [TiDB Lightning Configuration](/tidb-lightning/tidb-lightning-configuration.md). + +## Step 4. Import the data + +To start the import, run `tidb-lightning`. If you launch the program in the command line, the program might exit because of the `SIGHUP` signal. 
In this case, it is recommended to run the program with a `nohup` or `screen` tool. + +If you import the data from S3, you need to pass in `SecretKey` and `AccessKey` of the account as environment variables. The account has the permission to access the S3 backend storage. + +{{< copyable "shell-regular" >}} + +```shell +export AWS_ACCESS_KEY_ID=${access_key} +export AWS_SECRET_ACCESS_KEY=${secret_key} +nohup tiup tidb-lightning -config tidb-lightning.toml -no-schema=true > nohup.out 2>&1 & +``` + +TiDB Lightning also supports reading credential files from `~/.aws/credentials`. + +After the import is started, you can check the progress in one of the following ways: + +- Search the `progress` keyword in the `grep` log, which is updated every 5 minutes by default. +- Use the Grafana dashboard. For details, see [TiDB Lightning Monitoring](/tidb-lightning/monitor-tidb-lightning.md). +- Use web interface. For details, see [TiDB Lightning Web Interface](/tidb-lightning/tidb-lightning-web-interface.md). + +After the import is completed, TiDB Lightning automatically exits. If `the whole procedure completed` is in the last 5 lines of the log, it means that the import is successfully completed. + +> **Note:** +> +> No matter whether the import is successful or not, the last line displays `tidb lightning exit`. It only means that TiDB Lightning has exited normally, not the completion of the task. +If you encounter problems during the import process, refer to [TiDB Lightning FAQ](/tidb-lightning/tidb-lightning-faq.md) for troubleshooting. diff --git a/migrate-large-mysql-shards-to-tidb.md b/migrate-large-mysql-shards-to-tidb.md new file mode 100644 index 0000000000000..6fad7a78b910a --- /dev/null +++ b/migrate-large-mysql-shards-to-tidb.md @@ -0,0 +1,431 @@ +--- +title: Migrate and Merge MySQL Shards of Large Datasets to TiDB +summary: Learn how to migrate and merge large datasets of shards from MySQL into TiDB using Dumpling and TiDB Lightning, as well as how to configure the DM task to replicate incremental data changes from different MySQL shards into TiDB. +--- + +# Migrate and Merge MySQL Shards of Large Datasets to TiDB + +If you want to migrate a large MySQL dataset (for example, more than 1 TiB) from different partitions into TiDB, and you are able to suspend all the TiDB cluster write operations from your business during the migration, you can use TiDB Lightning to do the migration quickly. After migration, you can also use TiDB DM to perform incremental replication according to your business needs. "Large datasets" in this document usually mean data around one TiB or more. + +This document uses an example to walk through the whole procedure of such kind of migration. + +If the data size of the MySQL shards is less than 1 TiB, you can follow the procedure described in [Migrate and Merge MySQL Shards of Small Datasets to TiDB](/migrate-small-mysql-shards-to-tidb.md), which supports both full and incremental migration and the steps are easier. + +The following diagram shows how to migrate and merge MySQL sharded tables to TiDB using Dumpling and TiDB Lightning. + +![Use Dumpling and TiDB Lightning to migrate and merge MySQL shards to TiDB](/media/shard-merge-using-lightning-en.png) + +This example assumes that you have two databases, `my_db1` and `my_db2`. You use Dumpling to export two tables `table1` and `table2` from `my_db1`, and two tables `table3` and `table4` from `my_db2`, respectively. 
After that, you use TiDB Lightning to import and merge the four exported tables into the same `table5` in the `my_db` schema of the target TiDB.
+
+In this document, you can migrate data following this procedure:
+
+1. Use Dumpling to export full data. In this example, you export 2 tables respectively from 2 upstream databases:
+
+    - Export `table1` and `table2` from `my_db1`
+    - Export `table3` and `table4` from `my_db2`
+
+2. Start TiDB Lightning to migrate data to `my_db.table5` in TiDB.
+
+3. (Optional) Use TiDB DM to perform incremental replication.
+
+## Prerequisites
+
+Before getting started, see the following documents to prepare for the migration task.
+
+- [Deploy a DM Cluster Using TiUP](https://docs.pingcap.com/tidb-data-migration/stable/deploy-a-dm-cluster-using-tiup)
+- [Use TiUP to Deploy Dumpling and TiDB Lightning](/migration-tools.md)
+- [Privileges required by DM-worker](https://docs.pingcap.com/tidb-data-migration/stable/dm-worker-intro#privileges-required-by-dm-worker)
+- [Downstream (target database) privileges required by TiDB Lightning](/tidb-lightning/tidb-lightning-faq.md#what-are-the-privilege-requirements-for-the-target-database)
+- [Upstream (source database) privileges required by Dumpling](/dumpling-overview.md#export-data-from-tidbmysql)
+
+### Resource requirements
+
+**Operating system**: Examples in this document use new, clean CentOS 7 instances. You can deploy a virtual machine on your own host locally, or on a vendor-provided cloud platform. TiDB Lightning consumes as much CPU resources as needed by default, so it is recommended to deploy TiDB Lightning on a dedicated machine. If you do not have a dedicated machine for TiDB Lightning, you can deploy TiDB Lightning on a machine shared with other components (such as `tikv-server`) and limit TiDB Lightning's CPU usage by setting `region-concurrency` to 75% of the number of logical CPUs.
+
+**Memory and CPU**: TiDB Lightning consumes high resources, so it is recommended to allocate more than 64 GB of memory and a 32-core CPU for TiDB Lightning. To get the best performance, make sure the CPU core to memory (GB) ratio is more than 1:2.
+
+**Disk space**:
+
+- Dumpling requires enough disk space to store the whole data source. SSD is recommended.
+- During the import, TiDB Lightning needs temporary space to store the sorted key-value pairs. The disk space should be enough to hold the largest single table from the data source.
+- If the full data volume is large, you can increase the binlog storage time in the upstream. This is to ensure that the binlogs are not lost during the incremental replication.
+
+**Note**: It is difficult to calculate the exact data volume exported by Dumpling from MySQL, but you can estimate the data volume by using the following SQL statement to summarize the `data-length` field in the `information_schema.tables` table:
+
+{{< copyable "" >}}
+
+```sql
+/* Calculate the size of all schemas, in MiB. Replace ${schema_name} with your schema name. */
+SELECT table_schema,SUM(data_length)/1024/1024 AS data_length,SUM(index_length)/1024/1024 AS index_length,SUM(data_length+index_length)/1024/1024 AS SUM FROM information_schema.tables WHERE table_schema = "${schema_name}" GROUP BY table_schema;
+
+/* Calculate the size of the largest table, in MiB. Replace ${schema_name} with your schema name.
*/ +SELECT table_name,table_schema,SUM(data_length)/1024/1024 AS data_length,SUM(index_length)/1024/1024 AS index_length,SUM(data_length+index_length)/1024/1024 AS SUM from information_schema.tables WHERE table_schema = "${schema_name}" GROUP BY table_name,table_schema ORDER BY SUM DESC LIMIT 5; +``` + +### Disk space for the target TiKV cluster + +The target TiKV cluster must have enough disk space to store the imported data. In addition to [the standard hardware requirements](/hardware-and-software-requirements.md), the storage space of the target TiKV cluster must be larger than **the size of the data source x [the number of replicas](/faq/deploy-and-maintain-faq.md#is-the-number-of-replicas-in-each-region-configurable-if-yes-how-to-configure-it) x 2**. For example, if the cluster uses 3 replicas by default, the target TiKV cluster must have a storage space larger than 6 times the size of the data source. The formula has `x 2` because: + +- Index might take extra space. +- RocksDB has a space amplification effect. + +### Check conflicts for Sharded Tables + +If the migration involves merging data from different sharded tables, primary key or unique index conflicts may occur during the merge. Therefore, before migration, you need to take a deep look at the current sharding scheme from the business point of view, and find a way to avoid the conflicts. For more details, see [Handle conflicts between primary keys or unique indexes across multiple sharded tables](https://docs.pingcap.com/tidb-data-migration/stable/shard-merge-best-practices#handle-conflicts-between-primary-keys-or-unique-indexes-across-multiple-sharded-tables). The following is a brief description. + +Assume that tables 1~4 have the same table structure as follows. + +```sql +CREATE TABLE `table1` ( + `id` bigint(20) NOT NULL AUTO_INCREMENT, + `sid` bigint(20) NOT NULL, + `pid` bigint(20) NOT NULL, + `comment` varchar(255) DEFAULT NULL, + PRIMARY KEY (`id`), + UNIQUE KEY `sid` (`sid`) +) ENGINE=InnoDB DEFAULT CHARSET=latin1 +``` + +For those four tables, the `id` column is the primary key. It is auto-incremental, which will cause different sharded tables to generate duplicated `id` ranges and cause the primary key conflict on the target table during the migration. On the other hand, the `sid` column is the sharding key, which ensures that the index is unique globally. So you can remove the unique constraint of the `id` column in the target `table5` to avoid the data merge conflicts. + +```sql +CREATE TABLE `table5` ( + `id` bigint(20) NOT NULL, + `sid` bigint(20) NOT NULL, + `pid` bigint(20) NOT NULL, + `comment` varchar(255) DEFAULT NULL, + INDEX (`id`), + UNIQUE KEY `sid` (`sid`) +) ENGINE=InnoDB DEFAULT CHARSET=latin1 +``` + +## Step1. Use Dumpling to export full data + +If those multiple sharded tables to be exported are in the same upstream MySQL instance, you can directly use the `-f` parameter of Dumpling to export them in a single operation. + +If the sharded tables are stored in different MySQL instances, you can use Dumpling to export them respectively and place the exported results in the same parent directory. + +In the following example, both methods are used, and then the exported data is stored in the same parent directory. 
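+
+The goal is a directory layout similar to the following after both exports finish. This listing is for illustration only; the actual file names are generated by Dumpling:
+
+```
+${data-path}/
+|-- my_db1/
+|   |-- metadata
+|   `-- ... (schema and data files exported from my_db1.table1 and my_db1.table2)
+`-- my_db2/
+    |-- metadata
+    `-- ... (schema and data files exported from my_db2.table3 and my_db2.table4)
+```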
+
+First, run the following command to use Dumpling to export `table1` and `table2` from `my_db1`:
+
+{{< copyable "shell-regular" >}}
+
+```shell
+tiup dumpling -h ${ip} -P 3306 -u root -t 16 -r 200000 -F 256MB -B my_db1 -f 'my_db1.table[12]' -o ${data-path}/my_db1
+```
+
+The following table describes parameters in the command above. For more information about Dumpling parameters, see [Dumpling Overview](/dumpling-overview.md).
+
+| Parameter | Description |
+|- |- |
+| `-u` or `--user` | Specifies the user name to be used. |
+| `-p` or `--password` | Specifies the password to be used. |
+| `-P` or `--port` | Specifies the port to be used.|
+| `-h` or `--host` | Specifies the IP address of the data source. |
+| `-t` or `--thread` | Specifies the number of threads for the export. Increasing the number of threads improves the concurrency of Dumpling and the export speed, but also increases the database's memory consumption. Therefore, it is not recommended to set the number too large. Usually, it is less than 64.|
+| `-o` or `--output` | Specifies the directory for the exported data, which supports a local file path or a [URL of an external storage](/br/backup-and-restore-storages.md).|
+| `-r` or `--row` | Specifies the maximum number of rows in a single file. If you use this parameter, Dumpling enables in-table concurrency to speed up the export and reduce the memory usage.|
+| `-F` | Specifies the maximum size of a single file. The unit is `MiB`. It is recommended to set the value to 256 MiB. |
+| `-B` or `--database` | Specifies databases to be exported. |
+| `-f` or `--filter` | Exports tables that match the filter pattern. For the filter syntax, see [table-filter](/table-filter.md). |
+
+Ensure that there is enough free space in `${data-path}`. It is strongly recommended to use the `-F` option to avoid interruptions in the backup process due to oversized single tables.
+
+Then, run the following command to use Dumpling to export `table3` and `table4` from `my_db2`. Note that the path is `${data-path}/my_db2` instead of `${data-path}/my_db1`.
+
+{{< copyable "shell-regular" >}}
+
+```shell
+tiup dumpling -h ${ip} -P 3306 -u root -t 16 -r 200000 -F 256MB -B my_db2 -f 'my_db2.table[34]' -o ${data-path}/my_db2
+```
+
+After the preceding procedures, all source data tables are exported to the `${data-path}` directory. Putting all the exported data in the same parent directory makes the subsequent import by TiDB Lightning convenient.
+
+The starting position information needed for incremental replication is in the `metadata` files in the `my_db1` and `my_db2` sub-directories of the `${data-path}` directory respectively. They are meta-information files automatically generated by Dumpling. To perform incremental replication, you need to record the binlog location information in these files.
+
+## Step 2. Start TiDB Lightning to import full exported data
+
+Before starting TiDB Lightning for migration, it is recommended that you understand how to handle checkpoints, and then choose the appropriate way to proceed according to your needs.
+
+### Checkpoints
+
+Migrating a large volume of data usually takes hours or even days. There is a certain chance that the long-running process is interrupted unexpectedly. It can be very frustrating to redo everything from scratch, even if some part of the data has already been imported.
+ +Fortunately, TiDB Lightning provides a feature called `checkpoints`, which makes TiDB Lightning save the import progress as `checkpoints` from time to time, so that an interrupted import task can be resumed from the latest checkpoint upon restart. + +If the TiDB Lightning task crashes due to unrecoverable errors (for example, data corruption), it will not pick up from the checkpoint, but will report an error and quit the task. To ensure the safety of the imported data, you must resolve these errors by using the `tidb-lightning-ctl` command before proceeding with other steps. The options include: + +* --checkpoint-error-destroy: This option allows you to restart importing data into failed target tables from scratch by destroying all the existing data in those tables first. +* --checkpoint-error-ignore: If migration has failed, this option clears the error status as if no errors ever happened. +* --checkpoint-remove: This option simply clears all checkpoints, regardless of errors. + +For more information, see [TiDB Lightning Checkpoints](https://docs.pingcap.com/tidb/stable/tidb-lightning-checkpoints). + +### Create the target schema + +After you make changes in the aforementioned [Check conflicts for sharded tables](/migrate-large-mysql-shards-to-tidb.md#check-conflicts-for-sharded-tables), you can now manually create the `my_db` schema and `table5` in downstream TiDB. After that, you need to configure `tidb-lightning.toml`. + +```toml +[mydumper] +no-schema = true # If you have created the downstream schema and tables, setting `true` tells TiDB Lightning not to create the downstream schema. +``` + +### Start the migration task + +Follow these steps to start `tidb-lightning`: + +1. Edit the toml file. `tidb-lightning.toml` is used in the following example: + + ```toml + [lightning] + # Logs + level = "info" + file = "tidb-lightning.log" + + [tikv-importer] + # Choose a local backend. + # "local": The default mode. It is used for large data volumes greater than 1 TiB. During migration, downstream TiDB cannot provide services. + # "tidb": Used for data volumes less than 1 TiB. During migration, downstream TiDB can provide services normally. + # For more information, see [TiDB Lightning Backends](https://docs.pingcap.com/tidb/stable/tidb-lightning-backends) + backend = "local" + # Set the temporary directory for the sorted key value pairs. It must be empty. + # The free space must be greater than the largest single table of the data source. + # It is recommended that you use a directory different from `data-source-dir` to get better migration performance by consuming I/O resources exclusively. + sorted-kv-dir = "${sorted-kv-dir}" + + # Set the renaming rules ('routes') from source to target tables, in order to support merging different table shards into a single target table. Here you migrate `table1` and `table2` in `my_db1`, and `table3` and `table4` in `my_db2`, to the target `table5` in downstream `my_db`. + [[routes]] + schema-pattern = "my_db1" + table-pattern = "table[1-2]" + target-schema = "my_db" + target-table = "table5" + + [[routes]] + schema-pattern = "my_db2" + table-pattern = "table[3-4]" + target-schema = "my_db" + target-table = "table5" + + [mydumper] + # The source data directory. Set this to the path of the Dumpling exported data. + # If there are several Dumpling-exported data directories, you need to place all these directories in the same parent directory, and use the parent directory here. 
+ data-source-dir = "${data-path}" # The local or S3 path, for example, 's3://my-bucket/sql-backup?region=us-west-2' + # Because table1~table4 from source are merged into another table5 in the target, you should tell TiDB Lightning no need to create schemas, so that table1 ~ table4 won't be created automatically according to the exported schema information + no-schema = true + + # Information of the target TiDB cluster. For example purposes only. Replace the IP address with your IP address. + [tidb] + # Information of the target TiDB cluster. + # Values here are only for illustration purpose. Replace them with your own values. + host = ${host} # For example: "172.16.31.1" + port = ${port} # For example: 4000 + user = "${user_name}" # For example: "root" + password = "${password}" # For example: "rootroot" + status-port = ${status-port} # The table information is read from the status port. For example: 10080 + # the IP address of the PD cluster. TiDB Lightning gets some information through the PD cluster. + # For example: "172.16.31.3:2379". + # When backend = "local", make sure that the values of status-port and pd-addr are correct. Otherwise an error will occur. + pd-addr = "${ip}:${port}" + ``` + +2. Run `tidb-lightning`. If you run the program by directly invoking the program name in a shell, the process may quit unexpectedly after receiving a SIGHUP signal. It is recommended that you run the program using tools such as `nohup` or `screen` or `tiup`, and put the process to the shell background. If you migrate from S3, the SecretKey and AccessKey of the account that has access to the Amazon S3 backend store needs to be passed into the Lightning node as environment variables. Reading credential files from `~/.aws/credentials` is also supported. For example: + + {{< copyable "shell-regular" >}} + + ```shell + export AWS_ACCESS_KEY_ID=${access_key} + export AWS_SECRET_ACCESS_KEY=${secret_key} + nohup tiup tidb-lightning -config tidb-lightning.toml -no-schema=true > nohup.out 2>&1 & + ``` + +3. After starting the migration task, you can check the progress by using either of the following methods: + + - Use `grep` tool to search the keyword `progress` in the log. By default, a message reporting the progress is flushed into the log file every 5 minutes. + - View progress via the monitoring dashboard. For more information, see [TiDB Lightning Monitoring]( /tidb-lightning/monitor-tidb-lightning.md). + - View the progress via the Web page. See [Web Interface](/tidb-lightning/tidb-lightning-web-interface.md). + +After the importing finishes, TiDB Lightning will exit automatically. To make sure that the data is imported successfully, check for `the whole procedure completed` among the last 5 lines in the log. + +> **Note:** +> +> Whether the migration is successful or not, the last line in the log will always be `tidb lightning exit`. It just means that TiDB Lightning quits normally, and does not guarantee that the importing task is completed successfully. + +If you encounter any problems during migration, see [TiDB Lightning FAQs](/tidb-lightning/tidb-lightning-faq.md). + +## Step 3. (Optional) Use DM to perform incremental replication + +To replicate the data changes based on binlog from a specified position in the source database to TiDB, you can use TiDB DM to perform incremental replication. 
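+
+The replication starting point comes from the `metadata` files that Dumpling generated in Step 1. Before configuring the task, you can view the recorded positions, for example:
+
+{{< copyable "shell-regular" >}}
+
+```shell
+# The "SHOW MASTER STATUS" section of each metadata file records the binlog name,
+# position, and GTID set of the source instance at the time of export. Fill these
+# values into the `meta` section of the DM task configuration below.
+cat ${data-path}/my_db1/metadata
+cat ${data-path}/my_db2/metadata
+```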
+ +### Add the data source + +Create a new data source file called `source1.yaml`, which configures an upstream data source into DM, and add the following content: + +{{< copyable "" >}} + +```yaml +# Configuration. +source-id: "mysql-01" # Must be unique. + +# Specifies whether DM-worker pulls binlogs with GTID (Global Transaction Identifier). +# The prerequisite is that you have already enabled GTID in the upstream MySQL. +# If you have configured the upstream database service to switch master between different nodes automatically, you must enable GTID. +enable-gtid: true + +from: + host: "${host}" # For example: 172.16.10.81 + user: "root" + password: "${password}" # Plaintext passwords are supported but not recommended. It is recommended that you use dmctl encrypt to encrypt plaintext passwords. + port: ${port} # For example: 3306 +``` + +Run the following command in a terminal. Use `tiup dmctl` to load the data source configuration into the DM cluster: + +{{< copyable "shell-regular" >}} + +```shell +tiup dmctl --master-addr ${advertise-addr} operate-source create source1.yaml +``` + +The parameters are described as follows. + +|Parameter | Description | +|- |- | +|--master-addr | {advertise-addr} of any DM-master node in the cluster that dmctl connects to. For example: 172.16.10.71:8261| +| operate-source create | Load data sources to DM clusters. | + +Repeat the above steps until all MySQL upstream instances are added to the DM as data sources. + +### Create a replication task + +Edit a task configuration file called `task.yaml`, to configure the incremental replication mode and replication starting point for each data source. + +{{< copyable "" >}} + +```yaml +name: task-test # The name of the task. Should be globally unique. +task-mode: incremental # The mode of the task. "incremental" means full data migration is skipped and only incremental replication is performed. +# Required for incremental replication from sharded tables. By default, the "pessimistic" mode is used. +# If you have a deep understanding of the principles and usage limitations of the optimistic mode, you can also use the "optimistic" mode. +# For more information, see [Merge and Migrate Data from Sharded Tables](https://docs.pingcap.com/zh/tidb-data-migration/stable/feature-shard-merge). + +shard-mode: "pessimistic" + +# Configure the access information of the target TiDB database instance: +target-database: # The target database instance + host: "${host}" # For example: 127.0.0.1 + port: 4000 + user: "root" + password: "${password}" # It is recommended to use a dmctl encrypted password. + +# Use block-allow-list to configure tables that require sync: +block-allow-list: # The set of filter rules on matching tables in the data sources, to decide which tables need to migrate and which not. Use the black-white-list if the DM version is earlier than or equal to v2.0.0-beta.2. + bw-rule-1: # The ID of the block and allow list rule. + do-dbs: ["my_db1"] # The databases to be migrated. Here, my_db1 of instance 1 and my_db2 of instance 2 are configured as two separate rules to demonstrate how to prevent my_db2 of instance 1 from being replicated. + bw-rule-2: + do-dbs: ["my_db2"] +routes: # Table renaming rules ('routes') from upstream to downstream tables, in order to support merging different sharded table into a single target table. + route-rule-1: # Rule name. Migrate and merge table1 and table2 from my_db1 to the downstream my_db.table5. + schema-pattern: "my_db1" # Rule for matching upstream schema names. 
It supports the wildcards "*" and "?". + table-pattern: "table[1-2]" # Rule for matching upstream table names. It supports the wildcards "*" and "?". + target-schema: "my_db" # Name of the target schema. + target-table: "table5" # Name of the target table. + route-rule-2: # Rule name. Migrate and merge table3 and table4 from my_db2 to the downstream my_db.table5. + schema-pattern: "my_db2" + table-pattern: "table[3-4]" + target-schema: "my_db" + target-table: "table5" + +# Configure data sources. The following uses two data sources as an example. +mysql-instances: + - source-id: "mysql-01" # Data source ID. It is the source-id in source1.yaml. + block-allow-list: "bw-rule-1" # Use the block and allow list configuration above. Replicate `my_db1` in instance 1. + route-rules: ["route-rule-1"] # Use the configured routing rule above to merge upstream tables. +# syncer-config-name: "global" # Use the syncers configuration below. + meta: # The migration starting point of binlog when task-mode is incremental and there is no checkpoint in the downstream database. If there is a checkpoint, the checkpoint will be used. + binlog-name: "${binlog-name}" # The log location recorded in ${data-path}/my_db1/metadata in Step 1. You can either specify binlog-name + binlog-pos or binlog-gtid. When the upstream database service is configured to switch master between different nodes automatically, use binlog GTID here. + binlog-pos: ${binlog-position} + # binlog-gtid: " For example: 09bec856-ba95-11ea-850a-58f2b4af5188:1-9" + - source-id: "mysql-02" # Data source ID. It is the source-id in source1.yaml. + block-allow-list: "bw-rule-2" # Use the block and allow list configuration above. Replicate `my_db2` in instance2. + route-rules: ["route-rule-2"] # Use the routing rule configured above. + +# syncer-config-name: "global" # Use the syncers configuration below. + meta: # The migration starting point of binlog when task-mode is incremental and there is no checkpoint in the downstream database. If there is a checkpoint, the checkpoint will be used. + # binlog-name: "${binlog-name}" # The log location recorded in ${data-path}/my_db2/metadata in Step 1. You can either specify binlog-name + binlog-pos or binlog-gtid. When the upstream database service is configured to switch master between different nodes automatically, use binlog GTID here. + # binlog-pos: ${binlog-position} + binlog-gtid: "09bec856-ba95-11ea-850a-58f2b4af5188:1-9" +# (Optional) If you need to incrementally replicate some data changes that have been covered in the full migration, you need to enable the safe mode to avoid data migration errors during incremental replication. +# This scenario is common when the fully migrated data is not part of a consistent snapshot of the data source, and the incremental data is replicated from a location earlier than the fully migrated data. +# syncers: # The running parameters of the sync processing unit. +# global: # Configuration name. +# If set to true, DM changes INSERT to REPLACE, and changes UPDATE to a pair of DELETE and REPLACE for data source replication operations. +# Thus, it can apply DML repeatedly during replication when primary keys or unique indexes exist in the table structure. +# TiDB DM automatically starts safe mode within 1 minute before starting or resuming an incremental replication task. +# safe-mode: true +``` + +For more configurations, see [DM Advanced Task Configuration File](https://docs.pingcap.com/tidb-data-migration/stable/task-configuration-file-full/). 
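+
+As noted in the configuration comments above, plaintext passwords are supported but not recommended. The following sketch shows how you might generate an encrypted password with dmctl and paste the output into the `password` fields. Depending on your DM version, the invocation may be the `encrypt` subcommand or the `--encrypt` flag, so check the dmctl help of your deployment:
+
+{{< copyable "shell-regular" >}}
+
+```shell
+# Illustrative only: encrypt a plaintext password with dmctl.
+tiup dmctl encrypt '${password}'
+```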
+ +Before you start the data migration task, it is recommended to use the `check-task` subcommand in `tiup dmctl` to check if the configuration meets the DM configuration requirements. + +{{< copyable "shell-regular" >}} + +```shell +tiup dmctl --master-addr ${advertise-addr} check-task task.yaml +``` + +Use `tiup dmctl` to run the following command to start the data migration task. + +{{< copyable "shell-regular" >}} + +```shell +tiup dmctl --master-addr ${advertise-addr} start-task task.yaml +``` + +The parameters in this command are described as follows. + +| Parameter | Description| +|-|-| +|--master-addr| {advertise-addr} of any DM-master node in the cluster that dmctl connects to. For example: 172.16.10.71:8261 | +|start-task | Starts the data migration task. | + +If the task fails to start, first make configuration changes according to the prompt messages from the returned result, and then run the `start-task task.yaml` subcommand in `tiup dmctl` to restart the task. If you encounter problems, see [Handle Errors](https://docs.pingcap.com/tidb-data-migration/stable/error-handling) and [TiDB Data Migration FAQ](https://docs.pingcap.com/tidb-data-migration/stable/faq). + +### Check the migration status + +You can check if there are running migration tasks in the DM cluster and their status by running the `query-status` command in `tiup dmctl`. + +{{< copyable "shell-regular" >}} + +```shell +tiup dmctl --master-addr ${advertise-addr} query-status ${task-name} +``` + +For more information, see [Query Status](https://docs.pingcap.com/zh/tidb-data-migration/stable/query-status). + +### Monitor tasks and view logs + +You can view the history of a migration task and internal operational metrics through Grafana or logs. + +- Via Grafana + + If Prometheus, Alertmanager, and Grafana are correctly deployed when you deploy the DM cluster using TiUP, you can view DM monitoring metrics in Grafana. Specifically, enter the IP address and port specified during deployment in Grafana and select the DM dashboard. + +- Via logs + + When DM is running, DM-master, DM-worker, and dmctl output logs, which includes information about migration tasks. The log directory of each component is as follows. + + - DM-master log directory: It is specified by the DM-master command line parameter `--log-file`. If DM is deployed using TiUP, the log directory is `/dm-deploy/dm-master-8261/log/`. + - DM-worker log directory: It is specified by the DM-worker command line parameter `--log-file`. If DM is deployed using TiUP, the log directory is `/dm-deploy/dm-worker-8262/log/`. 
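+
+In addition to the DM task status and monitoring metrics, you can spot-check the merged result directly in the downstream TiDB cluster. The following queries are only an example based on the schema used in this document (`my_db.table5` with the globally unique `sid` column):
+
+{{< copyable "sql" >}}
+
+```sql
+-- Confirm that rows from all four sharded tables are arriving in the merged table.
+SELECT COUNT(*) FROM my_db.table5;
+
+-- The sharding key is expected to remain unique after the merge; this query should return no rows.
+SELECT sid, COUNT(*) AS cnt FROM my_db.table5 GROUP BY sid HAVING cnt > 1;
+```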
+ +## See also + +- [Dumpling](/dumpling-overview.md) +- [TiDB Lightning](/tidb-lightning/tidb-lightning-overview.md) +- [Pessimistic mode and optimistic mode](https://docs.pingcap.com/tidb-data-migration/stable/feature-shard-merge) +- [Pause a Data Migration Task](https://docs.pingcap.com/tidb-data-migration/stable/pause-task) +- [Resume a Data Migration Task](https://docs.pingcap.com/tidb-data-migration/stable/resume-task) +- [Stop a Data Migration Task](https://docs.pingcap.com/tidb-data-migration/stable/stop-task) +- [Export and Import Data Sources and Task Configuration of Clusters](https://docs.pingcap.com/tidb-data-migration/stable/export-import-config) +- [Handle Failed DDL Statements](https://docs.pingcap.com/tidb-data-migration/stable/handle-failed-ddl-statements) diff --git a/migrate-large-mysql-to-tidb.md b/migrate-large-mysql-to-tidb.md new file mode 100644 index 0000000000000..2e9245b948a1f --- /dev/null +++ b/migrate-large-mysql-to-tidb.md @@ -0,0 +1,287 @@ +--- +title: Migrate MySQL of Large Datasets to TiDB +summary: Learn how to migrate MySQL of large datasets to TiDB. +--- + +# Migrate MySQL of Large Datasets to TiDB + +When the data volume to be migrated is small, you can easily [use DM to migrate data](/migrate-small-mysql-to-tidb.md), both for full migration and incremental replication. However, because DM imports data at a slow speed (30~50 GiB/h), when the data volume is large, the migration might take a long time. "Large datasets" in this document usually mean data around one TiB or more. + +This document describes how to migrate large datasets from MySQL to TiDB. The whole migration has two processes: + +1. *Full migration*. Use Dumpling and TiDB Lightning to perform the full migration. TiDB Lightning's **local backend** mode can import data at a speed of up to 500 GiB/h. +2. *Incremental replication*. After the full migration is completed, you can replicate the incremental data using DM. + +## Prerequisites + +- [Install DM](https://docs.pingcap.com/tidb-data-migration/stable/deploy-a-dm-cluster-using-tiup). +- [Install Dumpling and TiDB Lightning](/migration-tools.md). +- [Grant the source database and target database privileges required for DM](https://docs.pingcap.com/tidb-data-migration/stable/dm-worker-intro). +- [Grant the target database privileges required for TiDB Lightning](/tidb-lightning/tidb-lightning-faq.md#what-are-the-privilege-requirements-for-the-target-database). +- [Grant the source database privileges required for Dumpling](/dumpling-overview.md#export-data-from-tidbmysql). + +## Resource requirements + +**Operating system**: The example in this document uses fresh CentOS 7 instances. You can deploy a virtual machine either on your local host or in the cloud. Because TiDB Lightning consumes as much CPU resources as needed by default, it is recommended that you deploy it on a dedicated server. If this is not possible, you can deploy it on a single server together with other TiDB components (for example, `tikv-server`) and then configure `region-concurrency` to limit the CPU usage from TiDB Lightning. Usually, you can configure the size to 75% of the logical CPU. + +**Memory and CPU**: Because TiDB Lightning consumes high resources, it is recommended to allocate more than 64 GiB of memory and more than 32 CPU cores. To get the best performance, make sure that the CPU core to memory (GiB) ratio is greater than 1:2. + +**Disk space**: + +- Dumpling requires enough disk space to store the whole data source. SSD is recommended. 
+- During the import, TiDB Lightning needs temporary space to store the sorted key-value pairs. The disk space should be enough to hold the largest single table from the data source. +- If the full data volume is large, you can increase the binlog storage time in the upstream. This is to ensure that the binlogs are not lost during the incremental replication. + +**Note**: It is difficult to calculate the exact data volume exported by Dumpling from MySQL, but you can estimate the data volume by using the following SQL statement to summarize the `data-length` field in the `information_schema.tables` table: + +{{< copyable "" >}} + +```sql +/* Calculate the size of all schemas, in MiB. Replace ${schema_name} with your schema name. */ +SELECT table_schema,SUM(data_length)/1024/1024 AS data_length,SUM(index_length)/1024/1024 AS index_length,SUM(data_length+index_length)/1024/1024 AS SUM FROM information_schema.tables WHERE table_schema = "${schema_name}" GROUP BY table_schema; + +/* Calculate the size of the largest table, in MiB. Replace ${schema_name} with your schema name. */ +SELECT table_name,table_schema,SUM(data_length)/1024/1024 AS data_length,SUM(index_length)/1024/1024 AS index_length,SUM(data_length+index_length)/1024/1024 AS SUM from information_schema.tables WHERE table_schema = "${schema_name}" GROUP BY table_name,table_schema ORDER BY SUM DESC LIMIT 5; +``` + +### Disk space for the target TiKV cluster + +The target TiKV cluster must have enough disk space to store the imported data. In addition to [the standard hardware requirements](/hardware-and-software-requirements.md), the storage space of the target TiKV cluster must be larger than **the size of the data source x [the number of replicas](/faq/deploy-and-maintain-faq.md#is-the-number-of-replicas-in-each-region-configurable-if-yes-how-to-configure-it) x 2**. For example, if the cluster uses 3 replicas by default, the target TiKV cluster must have a storage space larger than 6 times the size of the data source. The formula has `x 2` because: + +- Index might take extra space. +- RocksDB has a space amplification effect. + +## Step 1. Export all data from MySQL + +1. Export all data from MySQL by running the following command: + + {{< copyable "shell-regular" >}} + + ```shell + tiup dumpling -h ${ip} -P 3306 -u root -t 16 -r 200000 -F 256MiB -B my_db1 -f 'my_db1.table[12]' -o 's3://my-bucket/sql-backup?region=us-west-2' + ``` + + Dumpling exports data in SQL files by default. You can specify a different file format by adding the `--filetype` option. + + The parameters used above are as follows. For more Dumpling parameters, refer to [Dumpling Overview](/dumpling-overview.md). + + |parameters |Description| + |- |-| + |`-u` or `--user` |MySQL user| + |`-p` or `--password` |MySQL user password| + |`-P` or `--port` |MySQL port| + |`-h` or `--host` |MySQL IP address| + |`-t` or `--thread` |The number of threads used for export| + |`-o` or `--output` |The directory that stores the exported file. Supports a local path or an [external storage URL](/br/backup-and-restore-storages.md)| + |`-r` or `--row` |The maximum number of rows in a single file| + |`-F` |The maximum size of a single file, in MiB. Recommended value: 256 MiB.| + |-`B` or `--database` |Specifies a database to be exported| + |`-f` or `--filter` |Exports tables that match the pattern. Refer to [table-filter](/table-filter.md) for the syntax.| + + Make sure `${data-path}` has enough space to store the exported data. 
To prevent the export from being interrupted by a large table consuming all the spaces, it is strongly recommended to use the `-F` option to limit the size of a single file. + +2. View the `metadata` file in the `${data-path}` directory. This is a Dumpling-generated metadata file. Record the binlog position information, which is required for the incremental replication in Step 3. + + ``` + SHOW MASTER STATUS: + Log: mysql-bin.000004 + Pos: 109227 + GTID: + ``` + +## Step 2. Import full data to TiDB + +1. Create the `tidb-lightning.toml` configuration file: + + {{< copyable "" >}} + + ```toml + [lightning] + # log. + level = "info" + file = "tidb-lightning.log" + + [tikv-importer] + # "local": Default backend. The local backend is recommended to import large volumes of data (1 TiB or more). During the import, the target TiDB cluster cannot provide any service. + # "tidb": The "tidb" backend is recommended to import data less than 1 TiB. During the import, the target TiDB cluster can provide service normally. For more information on the backends, refer to https://docs.pingcap.com/tidb/stable/tidb-lightning-backends. + backend = "local" + # Sets the temporary storage directory for the sorted Key-Value files. The directory must be empty, and the storage space must be enough to hold the largest single table in the data source. For better import performance, it is recommended to use a directory different from `data-source-dir` and use flash storage, which can use I/O exclusively. + sorted-kv-dir = "${sorted-kv-dir}" + + [mydumper] + # The data source directory. The same directory where Dumpling exports data in "Step 1. Export all data from MySQL". + data-source-dir = "${data-path}" # A local path or S3 path. For example, 's3://my-bucket/sql-backup?region=us-west-2'. + + [tidb] + # The target TiDB cluster information. + host = ${host} # e.g.: 172.16.32.1 + port = ${port} # e.g.: 4000 + user = "${user_name}" # e.g.: "root" + password = "${password}" # e.g.: "rootroot" + status-port = ${status-port} # During the import, TiCb Lightning needs to obtain the table schema information from the TiDB status port. e.g.: 10080 + pd-addr = "${ip}:${port}" # The address of the PD cluster, e.g.: 172.16.31.3:2379. TiDB Lightning obtains some information from PD. When backend = "local", you must specify status-port and pd-addr correctly. Otherwise, the import will be abnormal. + ``` + + For more information on TiDB Lightning configuration, refer to [TiDB Lightning Configuration](/tidb-lightning/tidb-lightning-configuration.md). + +2. Start the import by running `tidb-lightning`. If you launch the program directly in the command line, the process might exit unexpectedly after receiving a SIGHUP signal. In this case, it is recommended to run the program using a `nohup` or `screen` tool. For example: + + If you import data from S3, pass the SecretKey and AccessKey that have access to the S3 storage path as environment variables to the TiDB Lightning node. You can also read the credentials from `~/.aws/credentials`. + + {{< copyable "shell-regular" >}} + + ```shell + export AWS_ACCESS_KEY_ID=${access_key} + export AWS_SECRET_ACCESS_KEY=${secret_key} + nohup tiup tidb-lightning -config tidb-lightning.toml -no-schema=true > nohup.out 2>&1 & + ``` + +3. After the import starts, you can check the progress of the import by one of the following methods: + + - `grep` the keyword `progress` in the log. The progress is updated every 5 minutes by default. 
+ - Check progress in [the monitoring dashboard](/tidb-lightning/monitor-tidb-lightning.md). + - Check progress in [the TiDB Lightning web interface](/tidb-lightning/tidb-lightning-web-interface.md). + +4. After TiDB Lightning completes the import, it exits automatically. If you find the last 5 lines of its log print `the whole procedure completed`, the import is successful. + +> **Note:** +> +> Whether the import is successful or not, the last line of the log shows `tidb lightning exit`. It means that TiDB Lightning exits normally, but does not necessarily mean that the import is successful. + +If the import fails, refer to [TiDB Lightning FAQ](/tidb-lightning/tidb-lightning-faq.md) for troubleshooting. + +## Step 3. Replicate incremental data to TiDB + +### Add the data source + +1. Create a `source1.yaml` file as follows: + + {{< copyable "" >}} + + ```yaml + # Must be unique. + source-id: "mysql-01" + + # Configures whether DM-worker uses the global transaction identifier (GTID) to pull binlogs. To enable this mode, the upstream MySQL must also enable GTID. If the upstream MySQL service is configured to switch master between different nodes automatically, GTID mode is required. + enable-gtid: true + + from: + host: "${host}" # e.g.: 172.16.10.81 + user: "root" + password: "${password}" # Supported but not recommended to use a plaintext password. It is recommended to use `dmctl encrypt` to encrypt the plaintext password before using it. + port: 3306 + ``` + +2. Load the data source configuration to the DM cluster using `tiup dmctl` by running the following command: + + {{< copyable "shell-regular" >}} + + ```shell + tiup dmctl --master-addr ${advertise-addr} operate-source create source1.yaml + ``` + + The parameters used in the command above are described as follows: + + |Parameter |Description | + |- |- | + |`--master-addr` |The `{advertise-addr}` of any DM-master in the cluster where `dmctl` is to be connected, e.g.: 172.16.10.71:8261| + |`operate-source create`|Loads the data source to the DM cluster.| + +### Add a replication task + +1. Edit the `task.yaml` file. Configure the incremental replication mode and the starting point of each data source: + + {{< copyable "shell-regular" >}} + + ```yaml + name: task-test # Task name. Must be globally unique. + task-mode: incremental # Task mode. The "incremental" mode only performs incremental data replication. + + # Configures the target TiDB database. + target-database: # The target database instance. + host: "${host}" # e.g.: 127.0.0.1 + port: 4000 + user: "root" + password: "${password}" # It is recommended to use `dmctl encrypt` to encrypt the plaintext password before using it. + + # Use block and allow lists to specify the tables to be replicated. + block-allow-list: # The collection of filtering rules that matches the tables in the source database instance. If the DM version is earlier than v2.0.0-beta.2, use black-white-list. + bw-rule-1: # The block-allow-list configuration item ID. + do-dbs: ["${db-name}"] # Name of databases to be replicated. + + # Configures the data source. + mysql-instances: + - source-id: "mysql-01" # Data source ID,i.e., source-id in source1.yaml + block-allow-list: "bw-rule-1" # You can use the block-allow-list configuration above. + # syncer-config-name: "global" # You can use the syncers incremental data configuration below. + meta: # When task-mode is "incremental" and the target database does not have a checkpoint, DM uses the binlog position as the starting point. 
If the target database has a checkpoint, DM uses the checkpoint as the starting point. + # binlog-name: "mysql-bin.000004" # The binlog position recorded in "Step 1. Export all data from MySQL". If the upstream database service is configured to switch master between different nodes automatically, GTID mode is required. + # binlog-pos: 109227 + binlog-gtid: "09bec856-ba95-11ea-850a-58f2b4af5188:1-9" + + # (Optional) If you need to incrementally replicate data that has already been migrated in the full data migration, you need to enable the safe mode to avoid the incremental data replication error. + # This scenario is common in the following case: the full migration data does not belong to the data source's consistency snapshot, and after that, DM starts to replicate incremental data from a position earlier than the full migration. + # syncers: # The running configurations of the sync processing unit. + # global: # Configuration name. + # safe-mode: true # If this field is set to true, DM changes INSERT of the data source to REPLACE for the target database, and changes UPDATE of the data source to DELETE and REPLACE for the target database. This is to ensure that when the table schema contains a primary key or unique index, DML statements can be imported repeatedly. In the first minute of starting or resuming an incremental replication task, DM automatically enables the safe mode. + ``` + + The YAML above is the minimum configuration required for the migration task. For more configuration items, refer to [DM Advanced Task Configuration File](https://docs.pingcap.com/tidb-data-migration/stable/task-configuration-file-full). + + Before you start the migration task, to reduce the probability of errors, it is recommended to confirm that the configuration meets the requirements of DM by running the `check-task` command: + + {{< copyable "shell-regular" >}} + + ```shell + tiup dmctl --master-addr ${advertise-addr} check-task task.yaml + ``` + +2. Start the migration task by running the following command: + + {{< copyable "shell-regular" >}} + + ```shell + tiup dmctl --master-addr ${advertise-addr} start-task task.yaml + ``` + + The parameters used in the command above are described as follows: + + |Parameter |Description | + |- |- | + |`--master-addr` |The {advertise-addr} of any DM-master in the cluster where `dmctl` is to be connected, e.g.: 172.16.10.71:8261| + |`start-task` |Starts the migration task.| + + If the task fails to start, check the prompt message and fix the configuration. After that, you can re-run the command above to start the task. + + If you encounter any problem, refer to [DM error handling](https://docs.pingcap.com/tidb-data-migration/stable/error-handling) and [DM FAQ](https://docs.pingcap.com/tidb-data-migration/stable/faq). + +### Check the migration task status + +To learn whether the DM cluster has an ongoing migration task and view the task status, run the `query-status` command using `tiup dmctl`: + +{{< copyable "shell-regular" >}} + +```shell +tiup dmctl --master-addr ${advertise-addr} query-status ${task-name} +``` + +For a detailed interpretation of the results, refer to [Query Status](https://docs.pingcap.com/tidb-data-migration/stable/query-status). + +### Monitor the task and view logs + +To view the history status of the migration task and other internal metrics, take the following steps. 
+ +If you have deployed Prometheus, Alertmanager, and Grafana when you deployed DM using TiUP, you can access Grafana using the IP address and port specified during the deployment. You can then select DM dashboard to view DM-related monitoring metrics. + +When DM is running, DM-worker, DM-master, and dmctl print the related information in logs. The log directories of these components are as follows: + +- DM-master: specified by the DM-master process parameter `--log-file`. If you deploy DM using TiUP, the log directory is `/dm-deploy/dm-master-8261/log/` by default. +- DM-worker: specified by the DM-worker process parameter `--log-file`. If you deploy DM using TiUP, the log directory is `/dm-deploy/dm-worker-8262/log/` by default. + +## What's next + +- [Pause the migration task](https://docs.pingcap.com/tidb-data-migration/stable/pause-task). +- [Resume the migration task](https://docs.pingcap.com/tidb-data-migration/stable/resume-task). +- [Stop the migration task](https://docs.pingcap.com/tidb-data-migration/stable/stop-task). +- [Export and import the cluster data source and task configuration](https://docs.pingcap.com/tidb-data-migration/stable/export-import-config). +- [Handle failed DDL statements](https://docs.pingcap.com/tidb-data-migration/stable/handle-failed-ddl-statements). diff --git a/migrate-small-mysql-shards-to-tidb.md b/migrate-small-mysql-shards-to-tidb.md new file mode 100644 index 0000000000000..141b54706ca20 --- /dev/null +++ b/migrate-small-mysql-shards-to-tidb.md @@ -0,0 +1,239 @@ +--- +title: Migrate and Merge MySQL Shards of Small Datasets to TiDB +summary: Learn how to migrate and merge small datasets of shards from MySQL to TiDB. +--- + +# Migrate and Merge MySQL Shards of Small Datasets to TiDB + +If you want to migrate and merge multiple MySQL database instances upstream to one TiDB database downstream, and the amount of data is not too large, you can use DM to migrate MySQL shards. "Small datasets" in this document usually mean data around or less than one TiB. Through examples in this document, you can learn the operation steps, precautions, and troubleshooting of the migration. + +This document applies to migrating MySQL shards less than 1 TiB in total. If you want to migrate MySQL shards with a total of more than 1 TiB of data, it will take a long time to migrate only using DM. In this case, it is recommended that you follow the operation introduced in [Migrate and Merge MySQL Shards of Large Datasets to TiDB](/migrate-large-mysql-shards-to-tidb.md) to perform migration. + +This document takes a simple example to illustrate the migration procedure. The MySQL shards of the two data source MySQL instances in the example are migrated to the downstream TiDB cluster. The diagram is shown as follows. + +![Use DM to Migrate Sharded Tables](/media/migrate-shard-tables-within-1tb-en.png) + +Both MySQL Instance 1 and MySQL Instance 2 contain the following schemas and tables. In this example, you migrate and merge tables from `store_01` and `store_02` schemas with a `sale` prefix in both instances, into the downstream `sale` table in the `store` schema. 
+ +| Schema | Table | +|:------|:------| +| store_01 | sale_01, sale_02 | +| store_02 | sale_01, sale_02 | + +Target schemas and tables: + +| Schema | Table | +|:------|:------| +| store | sale | + +## Prerequisites + +Before starting the migration, make sure you have completed the following tasks: + +- [Deploy a DM Cluster Using TiUP](https://docs.pingcap.com/tidb-data-migration/stable/deploy-a-dm-cluster-using-tiup) +- [Privileges required by DM-worker](https://docs.pingcap.com/tidb-data-migration/stable/dm-worker-intro#privileges-required-by-dm-worker) + +### Check conflicts for the sharded tables + +If the migration involves merging data from different sharded tables, primary key or unique index conflicts may occur during the merge. Therefore, before migration, you need to take a deep look at the current sharding scheme from the business point of view, and find a way to avoid the conflicts. For more details, see [Handle conflicts between primary keys or unique indexes across multiple sharded tables](https://docs.pingcap.com/tidb-data-migration/stable/shard-merge-best-practices#handle-conflicts-between-primary-keys-or-unique-indexes-across-multiple-sharded-tables). The following is a brief description. + +In this example, `sale_01` and `sale_02` have the same table structure as follows + +{{< copyable "sql" >}} + +```sql +CREATE TABLE `sale_01` ( + `id` bigint(20) NOT NULL AUTO_INCREMENT, + `sid` bigint(20) NOT NULL, + `pid` bigint(20) NOT NULL, + `comment` varchar(255) DEFAULT NULL, + PRIMARY KEY (`id`), + UNIQUE KEY `sid` (`sid`) +) ENGINE=InnoDB DEFAULT CHARSET=latin1 +``` + +The `id` column is the primary key, and the `sid` column is the sharding key. The `id` column is auto-incremental, and duplicated multiple sharded table ranges will cause data conflicts. The `sid` can ensure that the index is globally unique, so you can follow the steps in [Remove the primary key attribute of the auto-incremental primary key](https://docs.pingcap.com/tidb-data-migration/stable/shard-merge-best-practices#remove-the-primary-key-attribute-from-the-column) to bypasses the `id` column. + +{{< copyable "sql" >}} + +```sql +CREATE TABLE `sale` ( + `id` bigint(20) NOT NULL, + `sid` bigint(20) NOT NULL, + `pid` bigint(20) NOT NULL, + `comment` varchar(255) DEFAULT NULL, + INDEX (`id`), + UNIQUE KEY `sid` (`sid`) +) ENGINE=InnoDB DEFAULT CHARSET=latin1 +``` + +## Step 1. Load data sources + +Create a new data source file called `source1.yaml`, which configures an upstream data source into DM, and add the following content: + +{{< copyable "shell-regular" >}} + +```yaml +# Configuration. +source-id: "mysql-01" # Must be unique. +# Specifies whether DM-worker pulls binlogs with GTID (Global Transaction Identifier). +# The prerequisite is that you have already enabled GTID in the upstream MySQL. +# If you have configured the upstream database service to switch master between different nodes automatically, you must enable GTID. +enable-gtid: true +from: + host: "${host}" # For example: 172.16.10.81 + user: "root" + password: "${password}" # Plaintext passwords are supported but not recommended. It is recommended that you use dmctl encrypt to encrypt plaintext passwords. + port: ${port} # For example: 3306 +``` + +Run the following command in a terminal. Use `tiup dmctl` to load the data source configuration into the DM cluster: + +{{< copyable "shell-regular" >}} + +```shell +tiup dmctl --master-addr ${advertise-addr} operate-source create source1.yaml +``` + +The parameters are described as follows. 
+ +|Parameter | Description | +|- |- | +|--master-addr | {advertise-addr} of any DM-master node in the cluster that dmctl connects to. For example: 172.16.10.71:8261| +|operate-source create | Load data sources to the DM clusters. | + +Repeat the above steps until all data sources are added to the DM cluster. + +## Step 2. Configure the migration task + +Create a task configuration file named `task1.yaml` and writes the following content to it: + +{{< copyable "shell-regular" >}} + +```yaml +name: "shard_merge" # The name of the task. Should be globally unique. +# Task mode. You can set it to the following: +# - full: Performs only full data migration (incremental replication is skipped) +# - incremental: Only performs real-time incremental replication using binlog. (full data migration is skipped) +# - all: Performs both full data migration and incremental replication. For migrating small to medium amount of data here, use this option. +task-mode: all +# Required for the MySQL shards. By default, the "pessimistic" mode is used. +# If you have a deep understanding of the principles and usage limitations of the optimistic mode, you can also use the "optimistic" mode. +# For more information, see [Merge and Migrate Data from Sharded Tables](https://docs.pingcap.com/tidb-data-migration/stable/feature-shard-merge) +shard-mode: "pessimistic" +meta-schema: "dm_meta" # A schema will be created in the downstream database to store the metadata +ignore-checking-items: ["auto_increment_ID"] # In this example, there are auto-incremental primary keys upstream, so you do not need to check this item. + +target-database: + host: "${host}" # For example: 192.168.0.1 + port: 4000 + user: "root" + password: "${password}" # Plaintext passwords are supported but not recommended. It is recommended that you use dmctl encrypt to encrypt plaintext passwords. + +mysql-instances: + - + source-id: "mysql-01" # ID of the data source, which is source-id in source1.yaml + route-rules: ["sale-route-rule"] # Table route rules applied to the data source + filter-rules: ["store-filter-rule", "sale-filter-rule"] # Binlog event filter rules applied to the data source + block-allow-list: "log-bak-ignored" # Block & Allow Lists rules applied to the data source + - + source-id: "mysql-02" + route-rules: ["sale-route-rule"] + filter-rules: ["store-filter-rule", "sale-filter-rule"] + block-allow-list: "log-bak-ignored" + +# Configurations for merging MySQL shards +routes: # Table renaming rules ('routes') from upstream to downstream tables, in order to support merging different sharded tables into a single target table. + sale-route-rule: # Rule name. Migrate and merge tables from upstream to the downstream. + schema-pattern: "store_*" # Rule for matching upstream schema names. It supports the wildcards "*" and "?". + table-pattern: "sale_*" # Rule for matching upstream table names. It supports the wildcards "*" and "?". + target-schema: "store" # Name of the target schema. + target-table: "sale" # Name of the target table. + +# Filters out some DDL events. +filters: + sale-filter-rule: # Filter name. + schema-pattern: "store_*" # The binlog events or DDL SQL statements of upstream MySQL instance schemas that match schema-pattern are filtered by the rules below. + table-pattern: "sale_*" # The binlog events or DDL SQL statements of upstream MySQL instance tables that match table-pattern are filtered by the rules below. + events: ["truncate table", "drop table", "delete"] # The binlog event array. 
+    action: Ignore # The string (`Do`/`Ignore`). `Do` is the allow list. `Ignore` is the block list.
+  store-filter-rule:
+    schema-pattern: "store_*"
+    events: ["drop database"]
+    action: Ignore
+
+# Block and allow list
+block-allow-list:           # Filter or only migrate all operations of some databases or some tables.
+  log-bak-ignored:          # Rule name.
+    do-dbs: ["store_*"]     # The allow list of the schemas to be migrated, similar to replicate-do-db in MySQL.
+```
+
+The above example is the minimum configuration to perform the migration task. For more information, see [DM Advanced Task Configuration File](https://docs.pingcap.com/tidb-data-migration/stable/task-configuration-file-full).
+
+For more information on `routes`, `filters`, and other configurations in the task file, see the following documents:
+
+- [Table routing](https://docs.pingcap.com/tidb-data-migration/stable/key-features#table-routing)
+- [Block & Allow Table Lists](https://docs.pingcap.com/tidb-data-migration/stable/key-features#block-and-allow-table-lists)
+- [Binlog event filter](https://docs.pingcap.com/tidb-data-migration/stable/key-features#binlog-event-filter)
+- [Filter Certain Row Changes Using SQL Expressions](https://docs.pingcap.com/tidb-data-migration/stable/feature-expression-filter)
+
+## Step 3. Start the task
+
+Before starting a migration task, run the `check-task` subcommand in `tiup dmctl` to check whether the configuration meets the requirements of DM so as to avoid possible errors.
+
+{{< copyable "shell-regular" >}}
+
+```shell
+tiup dmctl --master-addr ${advertise-addr} check-task task1.yaml
+```
+
+Run the following command in `tiup dmctl` to start the migration task:
+
+{{< copyable "shell-regular" >}}
+
+```shell
+tiup dmctl --master-addr ${advertise-addr} start-task task1.yaml
+```
+
+| Parameter | Description|
+|-|-|
+|--master-addr| {advertise-addr} of any DM-master node in the cluster that dmctl connects to. For example: 172.16.10.71:8261 |
+|start-task | Starts the data migration task. |
+
+If the migration task fails to start, modify the configuration according to the error information, and then run `start-task task1.yaml` again to start the migration task. If you encounter problems, see [Handle Errors](https://docs.pingcap.com/tidb-data-migration/stable/error-handling) and [FAQ](https://docs.pingcap.com/tidb-data-migration/stable/faq).
+
+## Step 4. Check the task
+
+After starting the migration task, you can run the `query-status` command using `tiup dmctl` to view the status of the task.
+
+{{< copyable "shell-regular" >}}
+
+```shell
+tiup dmctl --master-addr ${advertise-addr} query-status ${task-name}
+```
+
+If you encounter errors, use `query-status ${task-name}` to view more detailed information. For details about the query results, task status, and subtask status of the `query-status` command, see [TiDB Data Migration Query Status](https://docs.pingcap.com/tidb-data-migration/stable/query-status).
+
+## Step 5. Monitor tasks and check logs (optional)
+
+You can view the history of a migration task and internal operational metrics through Grafana or logs.
+
+- Via Grafana
+
+    If Prometheus, Alertmanager, and Grafana are correctly deployed when you deploy the DM cluster using TiUP, you can view DM monitoring metrics in Grafana. Specifically, enter the IP address and port specified during deployment in Grafana and select the DM dashboard.
+
+- Via logs
+
+    When DM is running, DM-master, DM-worker, and dmctl output logs, which include information about migration tasks.
The log directory of each component is as follows. + + - DM-master log directory: It is specified by the DM-master process parameter `--log-file`. If DM is deployed using TiUP, the log directory is `/dm-deploy/dm-master-8261/log/`. + - DM-worker log directory: It is specified by the DM-worker process parameter `--log-file`. If DM is deployed using TiUP, the log directory is `/dm-deploy/dm-worker-8262/log/`. + +## See also + +- [Migrate and Merge MySQL Shards of Large Datasets to TiDB](/migrate-large-mysql-shards-to-tidb.md)。 +- [Merge and Migrate Data from Sharded Tables](https://docs.pingcap.com/tidb-data-migration/stable/feature-shard-merge) +- [Best Practices of Data Migration in the Shard Merge Scenario](https://docs.pingcap.com/tidb-data-migration/stable/shard-merge-best-practices) +- [Handle Errors](https://docs.pingcap.com/tidb-data-migration/stable/error-handling) +- [Handle Performance Issues](https://docs.pingcap.com/zh/tidb-data-migration/stable/handle-performance-issues) +- [FAQ](https://docs.pingcap.com/tidb-data-migration/stable/faq) diff --git a/migrate-small-mysql-to-tidb.md b/migrate-small-mysql-to-tidb.md new file mode 100644 index 0000000000000..1660493e87f9a --- /dev/null +++ b/migrate-small-mysql-to-tidb.md @@ -0,0 +1,148 @@ +--- +title: Migrate MySQL of Small Datasets to TiDB +summary: Learn how to migrate MySQL of small datasets to TiDB. +--- + +# Migrate MySQL of Small Datasets to TiDB + +This document describes how to use TiDB Data Migration (DM) to migrate MySQL of small datasets to TiDB in the full migration mode and incremental replication mode. "Small datasets" in this document mean data size less than 1 TiB. + +The migration speed varies from 30 GB/h to 50 GB/h, depending on multiple factors such as the number of indexes in the table schema, hardware, and network environment. + + + +## Prerequisites + +- [Deploy a DM Cluster Using TiUP](https://docs.pingcap.com/tidb-data-migration/stable/deploy-a-dm-cluster-using-tiup) +- [Grant the required privileges for the source database and the target database of DM](https://docs.pingcap.com/tidb-data-migration/stable/dm-worker-intro) + +## Step 1. Create the data source + +First, create the `source1.yaml` file as follows: + +{{< copyable "" >}} + +```yaml +# The ID must be unique. +source-id: "mysql-01" + +# Configures whether DM-worker uses the global transaction identifier (GTID) to pull binlogs. To enable GTID, the upstream MySQL must have enabled GTID. If the upstream MySQL has automatic source-replica switching, the GTID mode is required. +enable-gtid: true + +from: + host: "${host}" # For example: 172.16.10.81 + user: "root" + password: "${password}" # Plaintext password is supported but not recommended. It is recommended to use dmctl encrypt to encrypt the plaintext password before using the password. + port: 3306 +``` + +Then, load the data source configuration to the DM cluster using `tiup dmctl` by running the following command: + +{{< copyable "shell-regular" >}} + +```shell +tiup dmctl --master-addr ${advertise-addr} operate-source create source1.yaml +``` + +The parameters used in the command above are described as follows: + +|Parameter |Description| +| :- | :- | +|`--master-addr` |The {advertise-addr} of any DM-master node in the cluster where `dmctl` is to connect. For example, 172.16.10.71:8261. +|`operate-source create`|Load the data source to the DM cluster.| + +## Step 2. Create the migration task + +Create the `task1.yaml` file as follows: + +{{< copyable "" >}} + +```yaml +# Task name. 
Each of the multiple tasks running at the same time must have a unique name. +name: "test" +# Task mode. Options are: +# full: only performs full data migration. +# incremental: only performs binlog real-time replication. +# all: full data migration + binlog real-time replication. +task-mode: "all" +# The configuration of the target TiDB database. +target-database: + host: "${host}" # For example: 172.16.10.83 + port: 4000 + user: "root" + password: "${password}" # Plaintext password is supported but not recommended. It is recommended to use dmctl encrypt to encrypt the plaintext password before using the password. + +# The configuration of all MySQL instances of source database required for the current migration task. +mysql-instances: +- + # The ID of an upstream instance or a replication group + source-id: "mysql-01" + # The names of the block list and allow list configuration of the schema name or table name that is to be migrated. These names are used to reference the global configuration of the block and allowlist. For the global configuration, refer to the `block-allow-list` configuration below. + block-allow-list: "listA" + +# The global configuration of blocklist and allowlist. Each instance is referenced by a configuration item name. +block-allow-list: + listA: # name + do-tables: # The allowlist of upstream tables that need to be migrated. + - db-name: "test_db" # The schema name of the table to be migrated. + tbl-name: "test_table" # The name of the table to be migrated. + +``` + +The above is the minimum task configuration to perform the migration. For more configuration items regarding the task, refer to [DM task complete configuration file introduction](https://docs.pingcap.com/zh/tidb-data-migration/stable/task-configuration-file-full/). + +## Step 3. Start the migration task + +To avoid errors, before starting the migration task, it is recommended to use the `check-task` command to check whether the configuration meets the requirements of DM configuration. + +{{< copyable "shell-regular" >}} + +```shell +tiup dmctl --master-addr ${advertise-addr} check-task task.yaml +``` + +Start the migration task by running the following command with `tiup dmctl`. + +{{< copyable "shell-regular" >}} + +```shell +tiup dmctl --master-addr ${advertise-addr} start-task task.yaml +``` + +The parameters used in the command above are described as follows: + +|Parameter|Description| +| - | - | +|`--master-addr`| The {advertise-addr} of any DM-master node in the cluster where `dmctl` is to connect. For example: 172.16.10.71:8261. | +|`start-task`| Start the migration task | + +If the task fails to start, after changing the configuration according to the returned result, you can run the `start-task task.yaml` command to restart the task. If you encounter problems, refer to [Handle Errors](https://docs.pingcap.com/tidb-data-migration/stable/error-handling/) and [FAQ](https://docs.pingcap.com/tidb-data-migration/stable/faq). + +## Step 4: Check the migration task status + +To learn whether the DM cluster has an ongoing migration task, the task status and some other information, run the `query-status` command using `tiup dmctl`: + +{{< copyable "shell-regular" >}} + +```shell +tiup dmctl --master-addr ${advertise-addr} query-status ${task-name} +``` + +For a detailed interpretation of the results, refer to [Query Status](https://docs.pingcap.com/tidb-data-migration/stable/query-status). + +## Step 5. 
Monitor the task and view logs (optional)

To view the historical status of the migration task and other internal metrics, use Grafana or check the component logs.

If you have deployed Prometheus, Alertmanager, and Grafana when deploying DM using TiUP, you can access Grafana using the IP address and port specified during the deployment. You can then select the DM dashboard to view DM-related monitoring metrics.

- The log directory of DM-master: specified by the DM-master process parameter `--log-file`. If you have deployed DM using TiUP, the log directory is `/dm-deploy/dm-master-8261/log/` by default.
- The log directory of DM-worker: specified by the DM-worker process parameter `--log-file`. If you have deployed DM using TiUP, the log directory is `/dm-deploy/dm-worker-8262/log/` by default.

## What's next

- [Pause the migration task](https://docs.pingcap.com/tidb-data-migration/stable/pause-task)
- [Resume the migration task](https://docs.pingcap.com/tidb-data-migration/stable/resume-task)
- [Stop the migration task](https://docs.pingcap.com/tidb-data-migration/stable/stop-task)
- [Export and import the cluster data source and task configuration](https://docs.pingcap.com/tidb-data-migration/stable/export-import-config)
- [Handle failed DDL statements](https://docs.pingcap.com/tidb-data-migration/stable/handle-failed-ddl-statements)

diff --git a/migrate-with-more-columns-downstream.md b/migrate-with-more-columns-downstream.md
new file mode 100644
index 0000000000000..84c99f3ec7baa
--- /dev/null
+++ b/migrate-with-more-columns-downstream.md
@@ -0,0 +1,108 @@
---
title: Migrate Data to a Downstream TiDB Table with More Columns
summary: Learn how to migrate data to a downstream TiDB table with more columns than the corresponding upstream table.
---

# Migrate Data to a Downstream TiDB Table with More Columns

This document provides the additional steps to take when you migrate data to a downstream TiDB table that has more columns than the corresponding upstream table. For the regular migration steps, see the following migration scenarios:

- [Migrate MySQL of Small Datasets to TiDB](/migrate-small-mysql-to-tidb.md)
- [Migrate MySQL of Large Datasets to TiDB](/migrate-large-mysql-to-tidb.md)
- [Migrate and Merge MySQL Shards of Small Datasets to TiDB](/migrate-small-mysql-shards-to-tidb.md)
- [Migrate and Merge MySQL Shards of Large Datasets to TiDB](/migrate-large-mysql-shards-to-tidb.md)

## Use DM to migrate data to a downstream TiDB table with more columns

When replicating the upstream binlog, DM tries to use the current table schema of the downstream to parse the binlog and generate the corresponding DML statements.
If the column number of the table in the upstream binlog does not match the column number in the downstream table schema, the following error occurs:

```json
"errors": [
    {
        "ErrCode": 36027,
        "ErrClass": "sync-unit",
        "ErrScope": "internal",
        "ErrLevel": "high",
        "Message": "startLocation: [position: (mysql-bin.000001, 2022), gtid-set:09bec856-ba95-11ea-850a-58f2b4af5188:1-9 ], endLocation: [ position: (mysql-bin.000001, 2022), gtid-set: 09bec856-ba95-11ea-850a-58f2b4af5188:1-9]: gen insert sqls failed, schema: log, table: messages: Column count doesn't match value count: 3 (columns) vs 2 (values)",
        "RawCause": "",
        "Workaround": ""
    }
]
```

The following is an example upstream table schema:

```sql
# Upstream table schema
CREATE TABLE `messages` (
  `id` int(11) NOT NULL,
  PRIMARY KEY (`id`)
)
```

The following is an example downstream table schema:

```sql
# Downstream table schema
CREATE TABLE `messages` (
  `id` int(11) NOT NULL,
  `message` varchar(255) DEFAULT NULL, # This is the additional column that only exists in the downstream table.
  PRIMARY KEY (`id`)
)
```

When DM tries to use the downstream table schema to parse the binlog event generated by the upstream, DM reports the above `Column count doesn't match` error.

In such cases, you can use the `operate-schema` command to set a table schema for the table to be migrated from the data source. The specified table schema needs to correspond to the binlog event data to be replicated by DM. If you are migrating sharded tables, you need to set a table schema in DM for each sharded table to parse its binlog event data. The steps are as follows:

1. Create a SQL file and add the `CREATE TABLE` statement that corresponds to the upstream table schema to the file. For example, save the following table schema to `log.messages.sql`:

    ```sql
    # Upstream table schema
    CREATE TABLE `messages` (
      `id` int(11) NOT NULL,
      PRIMARY KEY (`id`)
    )
    ```

2. Use the `operate-schema` command to set the table schema for the table to be migrated from the data source (at this time, the data migration task should be in the Paused state due to the above `Column count doesn't match` error):

    {{< copyable "shell-regular" >}}

    ```shell
    tiup dmctl --master-addr ${advertise-addr} operate-schema set -s ${source-id} ${task-name} -d ${database-name} -t ${table-name} ${schema-file}
    ```

    The descriptions of parameters in this command are as follows:

    |Parameter |Description|
    |- |-|
    |--master-addr |Specifies the `${advertise-addr}` of any DM-master node in the cluster where dmctl is to be connected. `${advertise-addr}` indicates the address that DM-master advertises to the outside world.|
    |operate-schema set| Manually sets the schema information.|
    |-s | Specifies the source. `${source-id}` indicates the source ID of the MySQL data. `${task-name}` indicates the name of the migration task defined in the `task.yaml` configuration file of the data migration task.|
    |-d | Specifies the database. `${database-name}` indicates the name of the upstream database. |
    |-t |Specifies the table. `${table-name}` indicates the name of the upstream data table. `${schema-file}` indicates the table schema file to be set.|

    For example:

    {{< copyable "shell-regular" >}}

    ```shell
    tiup dmctl --master-addr 172.16.10.71:8261 operate-schema set -s mysql-01 task-test -d log -t messages log.messages.sql
    ```

3. Use the `resume-task` command to resume the migration task in the Paused state.

    {{< copyable "shell-regular" >}}

    ```shell
    tiup dmctl --master-addr ${advertise-addr} resume-task ${task-name}
    ```

4. Use the `query-status` command to confirm that the data migration task is running correctly.

    {{< copyable "shell-regular" >}}

    ```shell
    tiup dmctl --master-addr ${advertise-addr} query-status ${task-name}
    ```

diff --git a/migrate-with-pt-ghost.md b/migrate-with-pt-ghost.md
new file mode 100644
index 0000000000000..094ba67ff7984
--- /dev/null
+++ b/migrate-with-pt-ghost.md
@@ -0,0 +1,67 @@
---
title: Continuous Replication from Databases that Use gh-ost or pt-osc
summary: Learn how to use DM to replicate incremental data from databases that use online DDL tools gh-ost or pt-osc
---

# Continuous Replication from Databases that Use gh-ost or pt-osc

In production scenarios, table locking during DDL execution can block reads from or writes to the database to a certain extent. Therefore, online DDL tools are often used to execute DDLs to minimize the impact on reads and writes. Common DDL tools are [gh-ost](https://github.com/github/gh-ost) and [pt-osc](https://www.percona.com/doc/percona-toolkit/3.0/pt-online-schema-change.html).

When using DM to migrate data from MySQL to TiDB, you can enable `online-ddl` to allow DM to work with gh-ost or pt-osc.

For the detailed replication instructions, refer to the following documents by scenarios:

- [Migrate MySQL of Small Datasets to TiDB](/migrate-small-mysql-to-tidb.md)
- [Migrate MySQL of Large Datasets to TiDB](/migrate-large-mysql-to-tidb.md)
- [Migrate and Merge MySQL Shards of Small Datasets to TiDB](/migrate-small-mysql-shards-to-tidb.md)
- [Migrate and Merge MySQL Shards of Large Datasets to TiDB](/migrate-large-mysql-shards-to-tidb.md)

## Enable online-ddl on DM

In the task configuration file of DM, set the global parameter `online-ddl` to `true`, as shown below:

```yaml
# ----------- Global configuration -----------
## ********* Basic configuration *********
name: test # The name of the task. Should be globally unique.
task-mode: all # The task mode. Can be set to `full`, `incremental`, or `all`.
shard-mode: "pessimistic" # The shard merge mode. Optional modes are `pessimistic` and `optimistic`. The `pessimistic` mode is used by default. After understanding the principles and restrictions of the "optimistic" mode, you can set it to the "optimistic" mode.
meta-schema: "dm_meta" # The downstream database that stores the `meta` information.
online-ddl: true # Enable online-ddl support on DM to automatically process "gh-ost" and "pt-osc" operations in the upstream database.
```

## Workflow after enabling online-ddl

After online-ddl is enabled on DM, the DDL statements generated by DM replicating gh-ost or pt-osc will change.

The workflow of gh-ost or pt-osc:

- Create a ghost table according to the table schema of the DDL real table.

- Apply DDLs on the ghost table.

- Replicate the data of the DDL real table to the ghost table.

- After the data are consistent between the two tables, use the rename statement to replace the real table with the ghost table.

The workflow of DM:

- Skip creating the ghost table downstream.

- Record the DDLs applied to the ghost table.

- Replicate data only from the ghost table.

- Apply the recorded DDLs downstream.
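
The following sketch illustrates this difference with a hypothetical table `db1`.`tbl` to which gh-ost adds a column. The ghost table name `_tbl_gho` and the old table name `_tbl_del` follow the gh-ost naming convention, and the actual statements executed by gh-ost include additional housekeeping steps:

```sql
-- Upstream: gh-ost applies the DDL to a ghost table, then swaps it in with a rename.
ALTER TABLE `db1`.`_tbl_gho` ADD COLUMN `c1` INT;
RENAME TABLE `db1`.`tbl` TO `db1`.`_tbl_del`, `db1`.`_tbl_gho` TO `db1`.`tbl`;

-- Downstream: with online-ddl enabled, DM skips the ghost table and only applies
-- the recorded DDL to the real table.
ALTER TABLE `db1`.`tbl` ADD COLUMN `c1` INT;
```
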
+ +![dm-online-ddl](/media/dm/dm-online-ddl.png) + +The change in the workflow brings the following advantages: + +- The downstream TiDB does not need to create and replicate the ghost table, saving the storage space and network transmission overhead. + +- When you migrate and merge data from sharded tables, the RENAME operation is ignored for each sharded ghost table to ensure the correctness of the replication. + +## See also + +[Working details for DM with online DDL tools](https://docs.pingcap.com/tidb-data-migration/stable/feature-online-ddl/#working-details-for-dm-with-online-ddl-tools) diff --git a/migration-overview.md b/migration-overview.md index 1bcf37be0adba..adede1c88d1b0 100644 --- a/migration-overview.md +++ b/migration-overview.md @@ -1,68 +1,61 @@ --- -title: Migration Overview -summary: This document describes how to migrate data from databases or data formats (CSV/SQL). -aliases: ['/docs/dev/migration-overview/'] +title: Data Migration Overview +summary: Learn the overview of data migration scenarios and the solutions. --- -# Migration Overview +# Data Migration Overview -This document describes how to migrate data to TiDB, including migrating data from MySQL and from CSV/SQL files. +This document gives an overview of the data migration solutions that you can use with TiDB. The data migration solutions are as follows: -## Migrate from Aurora to TiDB +- Full data migration. + - To import Amazon Aurora snapshots, CSV files, or Mydumper SQL files into TiDB, you can use TiDB Lightning to perform the full migration. + - To export all TiDB data as CSV files or Mydumper SQL files, you can use Dumpling to perform the full migration, which makes data migration from MySQL or MariaDB easier. + - To migrate all data from a database with a small data size volume (for example, less than 1 TiB), you can also use TiDB Data Migration (DM). -In a cloud environment, you can directly migrate full data to TiDB by exporting snapshot from Aurora. For details, see [Migrate from Amazon Aurora MySQL Using TiDB Lightning](/migrate-from-aurora-using-lightning.md). +- Quick initialization of TiDB. TiDB Lightning supports quickly importing data and can quickly initialize a specific table in TiDB. Before you use this feature, pay attention that the quick initialization has a great impact on TiDB and the cluster does not provide services during the initialization period. -## Migrate from MySQL to TiDB - -To migrate data from MySQL to TiDB, it is recommended to use one of the following methods: - -- [Use Dumpling and TiDB Lightning](#use-dumpling-and-tidb-lightning-full-data) to migrate full data. -- [Use TiDB Data Migration (DM)](#use-dm) to migrate full and incremental data. - -### Use Dumpling and TiDB Lightning (full data) - -#### Scenarios +- Incremental replication. You can use TiDB DM to replicate binlogs from MySQL, MariaDB, or Aurora to TiDB, which greatly reduces the window downtime during the replication period. -You can use Dumpling and TiDB Lightning to migrate full data when the data size is greater than 1 TB. If you need to replicate incremental data, it is recommended to [use DM](#use-dm) to create an incremental replication task. +- Data replication between TiDB clusters. TiDB supports backup and restore. This feature can initialize a snapshot in an existing TiDB cluster to a new TiDB cluster. -#### Migration method +You might choose different migration solutions according to the database type, deployment location, application data size, and application needs. 
The following sections introduce some common migration scenarios, and you can refer to these sections to determine the most suitable solution according to your needs. -1. Use Dumpling to export the full MySQL data. -2. Use TiDB Lightning to import the full data to TiDB. For details, refer to [Migrate data using Dumpling and TiDB Lightning](/migrate-from-mysql-dumpling-files.md). +## Migrate from Aurora MySQL to TiDB -### Use DM +When you migrate data from Aurora to a TiDB cluster deployed on AWS, your data migration takes two operations: full data migration and incremental replication. You can choose the corresponding operation according to your application needs. -#### Scenarios +- [Migrate Data from Amazon Aurora to TiDB](/migrate-aurora-to-tidb.md). -You can use DM to migrate full MySQL data and to replicate incremental data. It is suggested that the size of the full data is less than 1 TB. Otherwise, it is recommended to use Dumpling and TiDB Lightning to import the full data, and then use DM to replicate the incremental data. - -#### Migration method +## Migrate from MySQL to TiDB -For details, refer to [Migrate from MySQL (Amazon Aurora)](/dm/migrate-from-mysql-aurora.md). +If cloud storage (S3) service is not used, the network connectivity is good, and the network latency is low, you can use the following method to migrate data from MySQL to TiDB. -## Migrate data from files to TiDB +- [Migrate MySQL of Small Datasets to TiDB](/migrate-small-mysql-to-tidb.md) -You can migrate data from CSV/SQL files to TiDB. +If you have a high demand on migration speed, or if the data size is large (for example, larger than 1 TiB), and you do not allow other applications to write to TiDB during the migration period, you can use TiDB Lightning to quickly import data. Then, you can use DM to replicate incremental data (binlog) based on your application needs. -### Migrate data from CSV files to TiDB +- [Migrate MySQL of Large Datasets to TiDB](/migrate-large-mysql-to-tidb.md) -#### Scenarios +## Migrate and merge MySQL shards into TiDB -You can migrate data from heterogeneous databases that are not compatible with the MySQL protocol to TiDB. +Suppose that your application uses MySQL shards for data storage, and you need to migrate these shards into TiDB as one table. In this case, you can use DM to perform the shard merge and migration. -#### Migration method +- [Migrate and Merge MySQL Shards of Small Datasets to TiDB](/migrate-small-mysql-shards-to-tidb.md) -1. Export full data to CSV files. -2. Import CSV files to TiDB using one of the following methods: +If the data size of the sharded tables is large (for example, larger than 1 TiB), and you do not allow other applications to write to TiDB during the migration period, you can use TiDB Lightning to quickly merge and import the sharded tables. Then, you can use DM to replicate incremental sharding data (binlog) based on your application needs. - - Use TiDB Lightning. +- [Migrate and Merge MySQL Shards of Large Datasets to TiDB](/migrate-large-mysql-shards-to-tidb.md) - Its import speed is fast. It is recommended to use TiDB Lightning in the case of large amounts of data in CSV files. For details, refer to [TiDB Lightning CSV Support](/tidb-lightning/migrate-from-csv-using-tidb-lightning.md). +## Migrate data from files to TiDB - - Use the `LOAD DATA` statement. 
+- [Migrate data from CSV files to TiDB](/migrate-from-csv-files-to-tidb.md) +- [Migrate data from SQL files to TiDB](/migrate-from-sql-files-to-tidb.md) - Execute the `LOAD DATA` statement in TiDB to import CSV files. This is more convenient, but if an error or interruption occurs during the import, manual intervention is required to check the consistency and integrity of the data. Therefore, it is **not recommended** to use this method in the production environment. For details, refer to [LOAD DATA](/sql-statements/sql-statement-load-data.md). +## More complex migration solutions -### Migrate data from SQL files to TiDB +The following features can improve the migration process and might meet more needs in your application. -Use Mydumper and TiDB Lightning to migrate data from SQL files to TiDB. For details, refer to [Use Dumpling and TiDB Lightning](#use-dumpling-and-tidb-lightning-full-data). +- [Migrate with pt/gh-host](/migrate-with-pt-ghost.md) +- [Migrate with Binlog Event Filter](/filter-binlog-event.md) +- [Migrate with Filter Binlog Events Using SQL Expressions](/filter-dml-event.md) +- [Migrate with More Columns in Downstream](/migrate-with-more-columns-downstream.md) diff --git a/migration-tools.md b/migration-tools.md new file mode 100644 index 0000000000000..629ea7d798230 --- /dev/null +++ b/migration-tools.md @@ -0,0 +1,105 @@ +--- +title: TiDB Ecosystem Tools Overview +summary: Learn an overview of the TiDB ecosystem tools. +--- + +# TiDB Ecosystem Tools Overview + +TiDB provides multiple data migration tools for different scenarios such as full data migration, incremental data migration, backup and restore, and data replication. + +This document introduces the user scenarios, advantages, and limitations of these tools. You can choose the right tool according to your needs. + + + +The following table introduces the user scenarios, the supported upstreams and downstreams of migration tools. + +| Tool name | User scenario | Upstream (or the imported source file) | Downstream (or the output file) | Advantages | Limitation | +|:---|:---|:---|:---|:---|:---| +| [TiDB Data Migration (DM)](https://docs.pingcap.com/tidb-data-migration/stable/overview)| Data migration from MySQL-compatible databases to TiDB | MySQL, MariaDB, Aurora, MySQL| TiDB | | Data import speed is roughly the same as that of TiDB Lighting's TiDB-backend, and much lower than that of TiDB Lighting's Local-backend. So it is recommended to use DM to migrate full data with a size of less than 1 TiB. | +| [Dumpling](/dumpling-overview.md) | Full data export from MySQL or TiDB | MySQL, TiDB| SQL, CSV | | | +| [TiDB Lightning](/tidb-lightning/tidb-lightning-overview.md)| Full data import into TiDB | | TiDB | | | +|[Backup & Restore (BR)](/br/backup-and-restore-tool.md) | Backup and restore for TiDB clusters with a huge data size | TiDB| SST, backup.meta files, backup.lock files| | | +| [TiCDC](/ticdc/ticdc-overview.md)| This tool is implemented by pulling TiKV change logs. It can restore data to a consistent state with any upstream TSO, and support other systems to subscribe to data changes.|TiDB | TiDB, MySQL, Apache Pulsar, Kafka, Confluent| Provide TiCDC Open Protocol | TiCDC only replicates tables that have at least one valid index. 
The following scenarios are not supported:| +|[TiDB Binlog](/tidb-binlog/tidb-binlog-overview.md) | Incremental replication between TiDB clusters, such as using one TiDB cluster as the secondary cluster of another TiDB cluster | TiDB | TiDB, MySQL, Kafka, incremental backup files | Support real-time backup and restore. Back up TiDB cluster data to be restored for disaster recovery | Incompatible with some TiDB versions | +|[sync-diff-inspector](/sync-diff-inspector/sync-diff-inspector-overview.md) | Comparing data stored in the databases with the MySQL protocol |TiDB, MySQL | TiDB, MySQL| Can be used to repair data in the scenario where a small amount of data is inconsistent | | + +## Install tools using TiUP + +Since TiDB v4.0, TiUP acts as a package manager that helps you manage different cluster components in the TiDB ecosystem. Now you can manage any cluster component using a single command. + +### Step 1. Install TiUP + +{{< copyable "shell-regular" >}} + +```shell +curl --proto '=https' --tlsv1.2 -sSf https://tiup-mirrors.pingcap.com/install.sh | sh +``` + +Redeclare the global environment variable: + +{{< copyable "shell-regular" >}} + +```shell +source ~/.bash_profile +``` + +### Step 2. Install components + +You can use the following command to see all the available components: + +{{< copyable "shell-regular" >}} + +```shell +tiup list +``` + +The command output lists all the available components: + +```bash +Available components: +Name Owner Description +---- ----- ----------- +bench pingcap Benchmark database with different workloads +br pingcap TiDB/TiKV cluster backup restore tool +cdc pingcap CDC is a change data capture tool for TiDB +client pingcap Client to connect playground +cluster pingcap Deploy a TiDB cluster for production +ctl pingcap TiDB controller suite +dm pingcap Data Migration Platform manager +dmctl pingcap dmctl component of Data Migration Platform +errdoc pingcap Document about TiDB errors +pd-recover pingcap PD Recover is a disaster recovery tool of PD, used to recover the PD cluster which cannot start or provide services normally +playground pingcap Bootstrap a local TiDB cluster for fun +tidb pingcap TiDB is an open source distributed HTAP database compatible with the MySQL protocol +tidb-lightning pingcap TiDB Lightning is a tool used for fast full import of large amounts of data into a TiDB cluster +tiup pingcap TiUP is a command-line component management tool that can help to download and install TiDB platform components to the local system +``` + +Choose the components to install: + +{{< copyable "shell-regular" >}} + +```shell +tiup install dumpling tidb-lightning +``` + +> **Note:** +> +> To install a component of a specific version, use the `tiup install [:version]` command. + +### Step 3. Update TiUP and its components (optional) + +It is recommended to see the release log and compatibility notes of the new version. 
+ +{{< copyable "shell-regular" >}} + +```shell +tiup update --self && tiup update dm +``` + +## See also + +- [Deploy TiUP offline](/production-deployment-using-tiup.md#method-2-deploy-tiup-offline) +- [Download and install tools in binary](/download-ecosystem-tools.md) diff --git a/releases/release-5.0.0.md b/releases/release-5.0.0.md index dfe16e1c726b6..4720885fff4b4 100644 --- a/releases/release-5.0.0.md +++ b/releases/release-5.0.0.md @@ -381,7 +381,7 @@ TiDB data migration tools support using Amazon S3 (and other S3-compatible stora To use this feature, refer to the following documents: - [Export data to Amazon S3 cloud storage](/dumpling-overview.md#export-data-to-amazon-s3-cloud-storage), [#8](https://github.com/pingcap/dumpling/issues/8) -- [Migrate from Amazon Aurora MySQL Using TiDB Lightning](/migrate-from-aurora-using-lightning.md), [#266](https://github.com/pingcap/tidb-lightning/issues/266) +- [Migrate from Amazon Aurora MySQL Using TiDB Lightning](/migrate-aurora-to-tidb.md), [#266](https://github.com/pingcap/tidb-lightning/issues/266) ### Optimize the data import performance of TiDB Cloud diff --git a/tidb-lightning/tidb-lightning-faq.md b/tidb-lightning/tidb-lightning-faq.md index 972d4d5d96ef9..6f8391cfdf366 100644 --- a/tidb-lightning/tidb-lightning-faq.md +++ b/tidb-lightning/tidb-lightning-faq.md @@ -14,7 +14,7 @@ The version of TiDB Lightning should be the same as the cluster. If you use the Yes. -## What is the privilege requirements for the target database? +## What are the privilege requirements for the target database? TiDB Lightning requires the following privileges: diff --git a/tidb-lightning/tidb-lightning-overview.md b/tidb-lightning/tidb-lightning-overview.md index 94a0c0676c5d4..d6ee1ef01160a 100644 --- a/tidb-lightning/tidb-lightning-overview.md +++ b/tidb-lightning/tidb-lightning-overview.md @@ -15,7 +15,7 @@ Currently, TiDB Lightning can mainly be used in the following two scenarios: Currently, TiDB Lightning supports: -- The data source of the [Dumpling](/dumpling-overview.md), CSV or [Amazon Aurora Parquet](/migrate-from-aurora-using-lightning.md) exported formats. +- The data source of the [Dumpling](/dumpling-overview.md), CSV or [Amazon Aurora Parquet](/migrate-aurora-to-tidb.md) exported formats. - Reading data from a local disk or from the Amazon S3 storage. For details, see [External Storages](/br/backup-and-restore-storages.md). ## TiDB Lightning architecture