Skip to content
fjy edited this page May 30, 2013 · 62 revisions

Druid

Druid is an open-source analytics database designed for fast ad-hoc queries on large-scale data sets. Druid supports streaming data ingestion and complex data exploration.

Key Features

  • Designed for Analytics – Druid is built for exploratory analytics for OLAP workflows. It supports a variety of filters, aggregators and query types and provides a framework for plugging in new functionality. Users have leveraged Druid’s infrastructure to develop features such as top K queries and histograms.
  • Interactive Queries – Druid’s low latency data ingestion architecture allows events to be queried milliseconds after they are created. Druid’s query latency is optimized by only reading and scanning exactly what is needed. Aggregate and filter on data without sitting around waiting for results.
  • Highly Available – Druid is used to back SaaS implementations that need to be up all the time. Your data is still available and queryable during system updates. Scale up or down without data loss.
  • Scalable – Existing Druid deployments handle billions of events and terabytes of data per day. Druid is designed to be petabyte scale.

Why Druid?

Druid was originally created to resolve query latency issues seen with trying to use Hadoop to power an interactive service. Hadoop has shown the world that it’s possible to house your data warehouse on commodity hardware for a fraction of the price of typical solutions. As people adopt Hadoop for their data warehousing needs, they find two things.

  1. They can now query all of their data in a fairly flexible manner and answer any question they have
  2. The queries take a long time

The first one is the joy that everyone feels the first time they get Hadoop running. The latter is what they realize after they have used Hadoop interactively for a while because Hadoop is optimized for throughput, not latency. Druid is a system that you can set up in your organization next to Hadoop. It provides the ability to access your data in an interactive slice-and-dice fashion. It trades off some query flexibility and takes over the storage format in order to provide the speed.

Druid is especially useful if you are summarizing your data sets and then querying the summarizations. If you put your summarizations into Druid, you will get quick queryability out of a system that you can be confident will scale up as your data volumes increase. Deployments have scaled up to 2TB of data per hour at peak ingested and aggregated in real-time.

We have more details about the general design of the system and why you might want to use it in our White Paper or in our Design doc.

Getting Started, Configuration, and Setup

The best place to get started, and for demonstration purposes, is the demo of a Realtime node running standalone. The query process, index building, and realtime ingestion features are exercised without the complication of multiple processes on multiple machines. See the Realtime Examples page for how to get started with this key part. If you want to learn more about what is going on in the examples and how to query druid, check out our Getting Started Tutorial.

The most basic cluster setup of Druid requires the Compute, Broker and Master servers listed above, a MySQL metadata server, and a ZooKeeper cluster. A Realtime server can be added to enable real time queries. See Cluster Setup for more details.

For basic server configuration see the configuration page and for a discussion of how to query a server, see the Querying page.

Versioning and Releases

Libraries

Data Ingestion

Data ingestion can occur in either of two ways:

  1. Realtime
    Rapid aggregation, indexing, and querying of recent data. The indexed segments eventually become historical data.
  2. Batch Ingestion
    Using Hadoop batch jobs to aggregate and index historical data.

Contributing and learning more

If you would like to contribute to Druid or have any questions about usage or code, you can write to the mailing list(Google Group):

[email protected]
https://groups.google.com/d/forum/druid-development

We are also squatting in channel #druid-dev on irc.freenode.net.

On Twitter we are #druidio

If you are interested in contributing to the code, we accept pull requests. Note: we have only just completed decoupling our Metamarkets-specific code from the code base and we took some short-cuts in interface design to make it happen. So, there are a number of interfaces that exist right now which are likely to be in flux. If you are embedding Druid in your system, it will be safest for the time being to only extend/implement interfaces that this wiki describes, as those are intended as stable (unless otherwise mentioned).

For issue tracking, we are using the github issue tracker. Please fill out an issue from the Issues tab on the github screen.

We also have a Libraries page that lists external libraries that people have created for working with Druid.

Be sure to look at the Concepts and Terminology page.

Composition of a Druid Cluster

The durable/persisted data kept by Druid is in a form termed Segments which can be on local disk and/or in a key-value store like S3 or HDFS.

A druid cluster is composed of various node types or services:

  1. Compute
    The base node, processes queries at the segment level, supports replication
  2. Broker
    The query broker, understands the data topology and routes queries to the correct compute nodes
  3. Master
    The coordinator node, makes sure that all segments are being served, replicated and balanced. General custodian for the cluster
  4. Realtime
    The realtime node is the data pathway for rapid indexing and querying of new data in realtime

The following diagram shows the data flow for queries without showing batch indexing:

Simple Data Flow

Clone this wiki locally