rxin edited this page Oct 16, 2012 · 30 revisions

Shark 0.2 is the first Shark release since the original 0.1 prototype. It brings new features and performance improvements.

The major changes are documented below:

Hive Compatibility

  • We have upgraded Shark to work with Hive 0.9, which introduces numerous features over the original Hive 0.7.
  • Hive UDFs and UDAFs are now fully supported.
  • Shark 0.2 also supports distributing resource files (e.g. jars) to the slaves using Hive's ADD FILE command.

Simpler Deployment

  • We have significantly simplified the deployment process.
  • As documented on the Wiki page, you can download a binary distribution of Shark 0.2 and have it running locally in about 5 minutes.
  • In addition to running on Mesos, Shark now supports Spark's standalone deploy mode, which lets you quickly launch a cluster without installing an external cluster manager. The standalone mode only requires Java to be installed and Spark to be deployed on each machine.

Hive Thrift Server

  • Ram Sriharsha from Yahoo contributed a patch for the Shark Thrift server.
  • Shark's Thrift server is compatible with Hive's Thrift server and can support multiple clients connecting to the same server to access the same list of cached tables.

Query Execution

  • Map side aggregation is now turned on by default. If Shark does not observe enough reduction, it turns map side aggregation off automatically, so the user no longer needs to set hive.map.aggr explicitly.
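
The automatic switch-off can be illustrated with a small Python sketch. It is not Shark's implementation: the function names, the sample size, and the reduction threshold are all assumptions chosen for illustration. The idea is to aggregate the first few rows on the map side, compare the number of distinct keys to the number of rows seen, and stop aggregating if too little reduction is observed.

```python
from collections import defaultdict

# Hypothetical tuning knobs for this sketch; these are not Shark settings.
SAMPLE_SIZE = 4       # rows to observe before deciding
MIN_REDUCTION = 0.5   # keep aggregating only if groups / rows <= this ratio


def map_side_sum(rows, sample_size=SAMPLE_SIZE, min_reduction=MIN_REDUCTION):
    """Partially sum (key, value) rows on the map side.

    After `sample_size` rows, the ratio of distinct keys to rows seen is
    checked once. If it exceeds `min_reduction`, the partial sums are
    flushed and the remaining rows are passed through unaggregated.
    """
    partial = defaultdict(int)
    out = []
    seen = 0
    aggregating = True
    for key, value in rows:
        if not aggregating:
            out.append((key, value))  # pass-through: no hash table upkeep
            continue
        partial[key] += value
        seen += 1
        if seen == sample_size and len(partial) / seen > min_reduction:
            # Too many distinct keys: aggregation is not paying off.
            aggregating = False
            out.extend(partial.items())
            partial.clear()
    out.extend(partial.items())  # emit whatever is still buffered
    return out


def reduce_side_sum(map_outputs):
    """Final aggregation; correct whether or not the map side aggregated."""
    total = defaultdict(int)
    for output in map_outputs:
        for key, value in output:
            total[key] += value
    return dict(total)


few_keys = [("a", 1), ("a", 2), ("b", 3), ("a", 4), ("b", 5)]
many_keys = [("a", 1), ("b", 2), ("c", 3), ("d", 4), ("a", 5)]

print(map_side_sum(few_keys))   # aggregation stays on: two rows emitted
print(map_side_sum(many_keys))  # aggregation turns off: five rows emitted
print(reduce_side_sum([map_side_sum(few_keys), map_side_sum(many_keys)]))
```

Because the reduce side always performs the final aggregation, turning the map side step off changes only how much data is shuffled, not the query result.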

Performance Improvements

  • We have rewritten Shark's join and group by code. For queries that have a large number of distinct keys, join and group by performance can improve by 2x.

Spark Compatibility

  • Shark 0.2 requires Spark 0.6, because it takes advantage of the new features and performance improvements in that Spark release.

Credits

Shark 0.2 is the work of a large group of new contributors from Berkeley and elsewhere.

  • Ram Sriharsha from Yahoo contributed a patch for the Shark Thrift server.
  • Harvey Feng contributed the Hive 0.9 upgrade and an improved map join implementation.
  • Antonio Lupher contributed the map side aggregation tuning implementation.
  • Denny Britz contributed support for ADD FILE and UDF/UDAF dynamic class loading.
  • Patrick Wendell contributed the revamped documentation and extensive testing.
  • Paul Ruan helped with testing.