Compatibility with Apache Hive

Deploying in Existing Hive Warehouses

Shark is designed to be "out of the box" compatible with existing Hive installations. When you deploy Shark in a cluster, you do not need to modify your existing Hive Metastore or change the data placement or partitioning of your tables.

Supported Hive Features

Shark supports the vast majority of Hive features, such as:

Hive query statements, including:
SELECT
GROUP BY
ORDER BY
CLUSTER BY
SORT BY
All Hive operators, including:
Relational operators (=, <=>, ==, <>, <, >, >=, <=, etc)
Arithmetic operators (+, -, *, /, %, etc)
Logical operators (AND, &&, OR, ||, etc)
Complex type constructors
Mathematical functions (sign, ln, cos, etc)
String functions (instr, length, printf, etc)
User defined functions (UDF)
User defined aggregation functions (UDAF)
User defined serialization formats (SerDes)
Joins
JOIN
{LEFT|RIGHT|FULL} OUTER JOIN
LEFT SEMI JOIN
CROSS JOIN
Unions
Sub queries
SELECT col FROM (SELECT a + b AS col FROM t1) t2
Sampling
Explain
Partitioned tables
All Hive DDL statements, including:
CREATE TABLE
CREATE TABLE AS SELECT
ALTER TABLE
All Hive Data types, including:
TINYINT
SMALLINT
INT
BIGINT
BOOLEAN
FLOAT
DOUBLE
STRING
BINARY
TIMESTAMP
ARRAY<>
MAP<>
STRUCT<>
UNIONTYPE<>
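
As an illustration of how several of the features listed above combine in one statement, here is a minimal HiveQL sketch. The tables (orders, active_customers) and their columns are hypothetical and not part of Shark or Hive; the query simply uses a LEFT SEMI JOIN, GROUP BY, and ORDER BY together.

```sql
-- Hypothetical tables and columns, for illustration only.
-- LEFT SEMI JOIN keeps rows of orders that have a match in
-- active_customers; only columns from the left table may be selected.
SELECT o.customer_id, SUM(o.amount) AS total
FROM orders o
LEFT SEMI JOIN active_customers c ON (o.customer_id = c.customer_id)
GROUP BY o.customer_id
ORDER BY total DESC;
```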

Unsupported Hive Functionality

The Hive features listed below are not yet supported in Shark. Most of them are rarely used in Hive deployments.

Major Hive Features

Tables with buckets: A bucket is a hash partition within a Hive table partition. Shark doesn't support buckets.
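
For reference, a bucketed table in Hive is one declared with a CLUSTERED BY ... INTO n BUCKETS clause. The sketch below uses a hypothetical table, hypothetical column names, and an arbitrary bucket count; it only shows the kind of table definition this limitation refers to.

```sql
-- Hypothetical table; names and the bucket count are illustrative only.
CREATE TABLE page_views (
  user_id BIGINT,
  url     STRING
)
PARTITIONED BY (dt STRING)
-- Rows within each partition are hash-distributed on user_id
-- into 32 buckets. This is the bucketing that Shark does not support.
CLUSTERED BY (user_id) INTO 32 BUCKETS;
```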

Esoteric Hive Features

Tables with partitions using different input formats: In Shark, all table partitions need to have the same input format.
Non-equi outer join: For the uncommon use case of outer joins with non-equi join conditions (e.g. the condition "key < 10"), Shark outputs incorrect results for the NULL tuple.
Unique join
Single query multi insert
Column statistics collection: Shark does not currently piggyback on scans to collect column statistics.
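
To make the non-equi outer join case above concrete, here is a minimal sketch of the query shape in question, assuming two hypothetical tables a and b that each have key and value columns. In Hive, every row of a is kept and b's columns are filled with NULL where the ON condition does not hold; as we read the note above, those NULL-padded rows are the ones whose output may be wrong in Shark.

```sql
-- Hypothetical tables a(key, value) and b(key, value).
-- The ON clause combines an equality with a non-equality condition
-- ("a.key < 10"), which is the case described above.
SELECT a.key, a.value, b.value
FROM a
LEFT OUTER JOIN b
  ON (a.key = b.key AND a.key < 10);
```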

Hive Input/Output Formats

File format for CLI: For results returned to the CLI, Shark only supports TextOutputFormat.
Hadoop archive

Hive Optimizations

A handful of Hive optimizations are not yet included in Shark. Some of these (such as indexes) are not necessary due to Shark's in-memory computational model. Others are slated for future releases of Shark.

Block level bitmap indexes and virtual columns (used to build indexes)
Automatic conversion of a join to a map join: When joining a large table with multiple small tables, Hive automatically converts the join into a map join. We are adding this auto conversion in the next release.
Automatic determination of the number of reducers for joins and GROUP BYs: Currently in Shark, you need to control the post-shuffle degree of parallelism using "set mapred.reduce.tasks=[num_tasks];". We are going to add auto-setting of parallelism in the next release.
Metadata-only queries: For queries that can be answered using metadata alone, Shark still launches tasks to compute the result.
Skew data flag: Shark does not follow the skew data flags in Hive.
STREAMTABLE hint in join: Shark does not follow the STREAMTABLE hint.
Merging multiple small files for query results: If the result output contains multiple small files, Hive can optionally merge them into fewer large files to avoid overflowing the HDFS metadata. Shark does not support this.
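
Since the number of reducers is not chosen automatically (see the item on joins and GROUP BYs above), a session would typically set it by hand before running a shuffle-heavy query. A minimal sketch, assuming a hypothetical page_views table; the value 64 is arbitrary and should be tuned for your cluster and data size.

```sql
-- Set the number of post-shuffle tasks for subsequent queries.
-- 64 is an illustrative value, not a recommendation.
set mapred.reduce.tasks=64;

-- Hypothetical table and column; the GROUP BY triggers a shuffle,
-- which runs with the parallelism configured above.
SELECT url, COUNT(*) AS views
FROM page_views
GROUP BY url;
```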
