These scripts let you migrate metadata between a Hive metastore and the AWS Glue Data Catalog. You can migrate in three different directions:
- From Hive to AWS Glue
- From AWS Glue to Hive
- From one AWS Glue Data Catalog to another
There are two modes of migration:
- Direct migration
- Migration using Amazon S3 as an intermediary
-
Direct Migration: Set up an AWS Glue ETL job which extracts metadata from your Hive metastore (MySQL) and loads it into your AWS Glue Data Catalog. This method uses an AWS Glue connection to the Hive metastore as a JDBC source. An ETL script is provided to extract metadata from the Hive metastore and write it to AWS Glue Data Catalog.
-
Migration through Amazon S3: Extract your database, table, and partition objects from your Hive metastore into Amazon S3 objects. Then set up an AWS Glue ETL job using the script provided here to load the metadata from S3 into the AWS Glue Data Catalog.
- Direct Migration: A single job extracts metadata from specified databases in AWS Glue Data Catalog and loads it into a Hive metastore. This job is run on the AWS Glue console, and requires an AWS Glue connection to the Hive metastore as a JDBC source.
- Migration through Amazon S3: Two AWS Glue jobs are used. The first extracts metadata from specified databases in AWS Glue Data Catalog and loads them into S3. The second loads data from S3 into the Hive Metastore. The first job is run on AWS Glue Console. The second can be run either on the AWS Glue Console or on a cluster with Spark installed.
- Migration through Amazon S3: Two jobs are run on the AWS Glue console. The first extracts metadata from specified databases in an AWS Glue Data Catalog and loads them into S3. The second loads data from S3 into an AWS Glue Data Catalog. This is the only way to migrate between Data Catalogs in different accounts. It combines the workflow for AWS Glue to Hive through S3 with the workflow for Hive to AWS Glue through S3.
There are some limitations to what the migration scripts can do currently:
-
Only databases, tables and partitions can be migrated. Other entities such as column statistics, privileges, roles, functions, and transactions cannot be migrated.
-
The script only supports a Hive metastore stored in a MySQL-compatible JDBC source. Other Hive metastore database engines such as PostgreSQL are not supported.
-
There is no isolation guarantee, which means that if Hive is doing concurrent modifications to the metastore while the migration job is running, inconsistent data can be introduced in AWS Glue Data Catalog.
-
There is no streaming support. Hive metastore migration is done as a batch job.
-
If there is a naming conflict with existing objects in the target Data Catalog, then the existing data is overwritten by the new data.
-
Your account must have access to AWS Glue.
-
Your Hive metastore must reside in a MySQL database accessible to AWS Glue. Currently, AWS Glue is able to connect to the JDBC data sources in a VPC subnet, such as RDS, EMR local Hive metastore, or a self-managed database on EC2. If your Hive metastore is not directly accessible to AWS Glue, then you must use Amazon S3 as intermediate staging area for migration.
Below are instructions for using each of the migration workflows described above.
-
Set up AWS Glue as described in the Getting Started section of the developer guide. Make sure you've configured the security group to give access to the Hive metastore. If the Hive metastore is in a local database in the EMR master instance, just configure the EMR VPC subnet and EMR master security group, similar to the RDS configuration.
-
Gather your Hive metastore JDBC connection information. You can find the Hive metastore JDBC URL, username, and password in the file named
hive-site.xml
. In EMR, this file is located at:/etc/hive/conf/hive-site.xml
. For example:
<property>
<name>javax.jdo.option.ConnectionURL</name>
<value>jdbc:mysql://ip-10-0-12-34.ec2.internal:3306/hive?createDatabaseIfNotExist=true</value>
</property>
<property>
<name>javax.jdo.option.ConnectionUserName</name>
<value>hive</value>
</property>
<property>
<name>javax.jdo.option.ConnectionPassword</name>
<value>MySecretPassword</value>
</property>
-
In AWS Glue, create a connection to the Hive metastore as a JDBC data source. Follow the instructions in the developer guide to create a connection that references your Hive metastore. Use the connection JDBC URL, username, and password you gathered in a previous step. Specify the VPC, subnet, and security group associated with your Hive metastore. You can find these on the EMR console if the Hive metastore is on the EMR master node, or on the RDS console, if the metastore is an RDS instance.
-
(Optional) Test the connection. Choose "Test connection" on the AWS Glue connections page. If the connection test fails, your ETL job might not be able to connect to your Hive metastore data source. Fix the network configuration before moving on.
-
Upload the ETL job scripts to S3. Upload these 2 scripts to an S3 bucket:
hive_metastore_migration.py import_into_datacatalog.py
If you configured AWS Glue to access S3 from a VPC endpoint, you must upload the script to a bucket in the same region where your job runs.
-
Create a new job on the AWS Glue console to extract metadata from your Hive metastore and write it to AWS Glue Data Catalog. Specify the following job parameters. These parameters are defined in
import_into_datacatalog.py
.-
--mode
should be set to 'from-jdbc'`, which means the migration is from a JDBC source into an AWS Glue Data Catalog. -
--connection-name
should be set to the name of the AWS Glue connection you created to point to the Hive metastore. It is used to extract the Hive JDBC connection information using the native Spark library. -
--database-prefix
should be set to a string prefix that is applied to the database name created in AWS Glue Data Catalog. You can use it as a way to track the origin of the metadata, and avoid naming conflicts. The default is the empty string. -
--table-prefix
should be set to a string prefix that is applied to the table names created in AWS Glue Data Catalog. Again, you can use it as a way to track the origin of the metadata, and avoid naming conflicts. The default is the empty string.
In the Connections page, add the Hive metastore connection you created. Finally, review the job and create it.
-
-
Run the job on the AWS Glue console. When the job is finished, the metadata from the Hive metastore is visible in the Glue console. Check the databases and tables listed to verify that they were migrated correctly.
If your Hive metastore cannot connect to AWS Glue directly (for example, if it's on a private corporate network), you can use AWS Direct Connect to establish private connectivity to a AWS VPC, or use AWS Database Migration Service to migrate your Hive metastore to a database on AWS.
If the above solutions don't apply to your situation, you can choose to first migrate your Hive metastore to Amazon S3 as a staging area, then run an ETL job to import the metadata from S3 to the Data Catalog. To do this, you need to have a Spark 2.1.x cluster that can connect to your Hive metastore and export metadata to plain files on S3.
-
Make the MySQL connector jar available to the Spark cluster on master and all worker nodes, and include the jar in the Spark driver class path as well as with the
--jars
and--driver-class-path
parameters in thespark-submit
command. You can download the MySql connector here at MySql.com.If you use EMR to do this configuration, you can run the EMR bootstrap script
emr_bootstrap_action.sh
included in theshell
folder, and then provide--jars /usr/lib/hadoop/mysql-connector-java-5.1.42-bin.jar
and--driver-class-path <spark-default-driver-classpath>:/usr/lib/hadoop/mysql-connector-java-5.1.42-bin.jar
in the spark-submit script. See EMR documentation for how to run Bootstrap script on EMR. -
Submit the
hive_metastore_migration.py
Spark script to your Spark cluster using the following parameters:-
Set
--direction
tofrom_metastore
, or omit the argument sincefrom_metastore
is the default. -
Provide the JDBC connection information through these arguments:
--jdbc-url
,--jdbc-username
, and--jdbc-password
. -
The argument
--output-path
is required. It is either a local file system location or an S3 location. If the output path is a local directory, you can upload the data to an S3 location manually. If it is an S3 path, you need to make sure that the Spark cluster has EMRFS library on its class path. The script will export the metadata to a subdirectory of the output-path you provided. -
Example spark-submit command to migrate Hive metastore to S3, tested on EMR-4.7.1:
MYSQL_JAR_PATH=/usr/lib/hadoop/mysql-connector-java-5.1.42-bin.jar DRIVER_CLASSPATH=/home/hadoop/*:/etc/hadoop/conf:/etc/hive/conf:/usr/lib/hadoop-lzo/lib/*:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*:$MYSQL_JAR_PATH spark-submit --driver-class-path $DRIVER_CLASSPATH \ --jars $MYSQL_JAR_PATH \ /home/hadoop/hive_metastore_migration.py \ --mode from-metastore \ --jdbc-url jdbc:mysql://metastore.foo.us-east-1.rds.amazonaws.com:3306 \ --jdbc-user hive \ --jdbc-password myJDBCPassword \ --database-prefix myHiveMetastore_ \ --table-prefix myHiveMetastore_ \ --output-path s3://mybucket/myfolder/
-
-
Create an AWS Glue ETL job similar to the one described in the section to connect using JDBC to your Hive metastore. In the job, set the following parameters.
--mode
should be set tofrom-S3
--database-input-path
should be set to the S3 path containing only databases.--table-input-path
should be set to the S3 path containing only tables.--partition-input-path
should be set to the S3 path containing only partitions.
Also, because there is no need to connect to any JDBC source, the job doesn't require any connections.
Finally, if you don't have access to Spark from an on-premises network where your Hive metastore resides, you may implement your own script to load data into S3 using any script and platform, as long as the intermediary data on S3 conforms to the pre-defined format and schema.
The migration script requires a separate S3 folder for each entity type: database, table, and partition. Each folder contains one to many files. Each line of a file is a single JSON object. No comma or any other separation character should appear at the end of the line. Each JSON object has a field
type
with a value ofdatabase
,table
, orpartition
, and a fielditem
that contains the metadata payload. The table entity also contains adatabase
field to represent the database it belongs to. The partition entity contains adatabase
and atable
field. The exact schema of the entities are defined in thehive_metastore_migration.py
script.Below is a prettified example of a table entity; in practice, it must all appear on a single line with the whitespace removed:
{ "type" : "table", "database" : "my_metastore_default", "item" : { "createTime" : "Apr 04, 2017 08:58:43 AM", "lastAccessTime" : "Jan 01, 1970 00:00:00 AM", "owner" : "hadoop", "retention" : 0, "name" : " my_metastore_tbl_mytable", "tableType" : "EXTERNAL_TABLE", "parameters" : { "EXTERNAL" : "TRUE", "transient_lastDdlTime" : "1491296323" }, "partitionKeys" : [ { "name" : "key1", "type" : "int" }, { "name" : "key2", "type" : "string" }, { "name" : "key3", "type" : "double" } ], "storageDescriptor" : { "inputFormat" : "org.apache.hadoop.mapred.TextInputFormat", "compressed" : false, "storedAsSubDirectories" : false, "location" : "S3://mydefaultgluetest/test3keys", "numberOfBuckets" : -1, "outputFormat" : "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat", "columns" :[ { "name" : "col1", "type" : "int" } ], "serdeInfo" : { "serializationLibrary" : "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe", "parameters" : { "serialization.format" : "1" } } } } }
-
Follow steps 1 through 4 in the instructions for Migrate Directly from Hive to AWS Glue.
-
Upload the the following job scripts to an S3 bucket:
hive_metastore_migration.py export_from_datacatalog.py
If you configured AWS Glue to access S3 from VPC endpoint, you must upload the script to a bucket in the same region where your job runs. If you configured AWS Glue to access the public internet through a NAT gateway or NAT instance, cross-region S3 access is allowed.
-
Create the AWS Glue Job that performs the direct migration from Glue to the JDBC Hive metastore. Specify the following job parameters as key-value pairs. Note that these parameters are pre-defined in the script.
--mode
should be set toto-jdbc
, which means the migration is directly to a jdbc Hive Metastore--connection-name
should be set to the name of the AWS Glue connection you created to point to the Hive metastore. It is the destination of the migration.--database-names
should be set to a semi-colon(;) separated list of database names to export from Data Catalog.
In the Connections page, add the Hive metastore connection you created:
-
Run the job on the Glue Console. When the job is finished, run Hive queries on a cluster connected to the metastore to verify that metadata was successfully migrated.
-
Follow step 1 in Migrate from Hive to AWS Glue Through Amazon S3.
-
Create an AWS Glue ETL job similar to the one described in the Direct Migration instructions above. Since the destination is now an S3 bucket instead of a Hive metastore, no connections are required. In the job, set the following parameters:
--mode
should be set toto-S3
, which means the migration is to S3.--database-names
should be set to a semi-colon(;) separated list of database names to export from Data Catalog.--output-path
should be set to the S3 destination path.
-
Submit the
hive_metastore_migration.py
Spark script to your Spark cluster.-
Set
--direction
toto_metastore
. -
Provide the JDBC connection information through the arguments:
--jdbc-url
,--jdbc-username
, and--jdbc-password
. -
The argument
--input-path
is required. This can be a local directory or an S3 path. The path should point to the directory containingdatabases
,partitions
,tables
folders. This path should be the same as in the--output-path
parameter of step 2, with date and time information appended at the end.For example, if
--output-path
was:s3://gluemigrationbucket/export_output/
The result of the AWS Glue job in step 2 would create
databases
,partitions
, andtables
folders in:s3://gluemigrationbucket/export_output/<year-month-day-hour-minute-seconds>
Then
--input-path
here should be:s3://gluemigrationbucket/export_output/<year-month-day-hour-minute-seconds>/
-