Skip to content
Steven Harris edited this page May 6, 2013 · 62 revisions

Druid

Druid is an open source, real-time analytical data store that supports fast ad-hoc queries on large-scale data sets

What Druid cares about

  • Analytics – Druid is for exploratory analytics. It supports a variety of filters, aggregators and query types as well as pluggable analytics. Different users have developed things like top N, Histograms etc but the key is that they all scale linearly and return quickly by leveraging Druid’s core.
  • Interactivity – Druid users leverage it for exploratory querying of real-time ingested data.
  • Uptime – Druid is used to back SaaS implementations that need to be up all the time. No downtime for upgrades or scaling.
  • Predictable/Fast – Druid gives fast results every time
  • Scalability – Druid is handles billions of events and terabytes of data per day and growing
  • Efficiency – Druid uses as little storage and CPU as possible. Due to the volumes referenced above efficiency is key to being able to achieve the speeds and manage the volumes cost effectively.

Why Druid Exists

Buzzwords aside, it was originally created to resolve query latency issues seen with trying to use Hadoop to power an interactive service. Hadoop has shown the world that it’s possible to house your data warehouse on commodity hardware for a fraction of the price of typical solutions. As people adopt Hadoop for their data warehousing needs, they find two things.

  1. They can now query all of their data in a fairly flexible manner and answer any question they have
  2. The queries take a long time

The first one is the joy that everyone feels the first time they get Hadoop running. The latter is what they realize after they have used Hadoop interactively for a while because Hadoop is optimized for throughput, not latency. Druid is a system that you can set up in your organization next to Hadoop. It provides the ability to access your data in an interactive slice-and-dice fashion. It trades off some query flexibility and takes over the storage format in order to provide the speed.

Druid is especially useful if you are summarizing your data sets and then querying the summarizations. If you put your summarizations into Druid, you will get quick queryability out of a system that you can be confident will scale up as your data volumes increase. Deployments have scaled up to 2TB of data per hour at peak ingested and aggregated in real-time.

We have more details about the general design of the system and why you might want to use it in our Design doc.

Getting Started, Configuration, and Setup

The best place to get started, and for demonstration purposes, is the demo of a Realtime node running standalone. The query process, index building, and realtime ingestion features are exercised without the complication of multiple processes on multiple machines. See the Realtime Examples page for how to get started with this key part.

The most basic cluster setup of Druid requires the Compute, Broker and Master servers listed above, a MySQL metadata server, and a ZooKeeper cluster. A Realtime server can be added to enable real time queries. See Cluster Setup for more details.

For basic server configuration see the configuration page and for a discussion of how to query a server, see the Querying page.

Versioning and Releases

Libraries

Data Ingestion

Data ingestion can occur in either of two ways:

  1. Realtime
    Rapid aggregation, indexing, and querying of recent data. The indexed segments eventually become historical data.
  2. Batch Ingestion
    Using Hadoop batch jobs to aggregate and index historical data.

Contributing and learning more

If you would like to contribute to Druid or have any questions about usage or code, you can write to the mailing list(Google Group):

[email protected]
https://groups.google.com/d/forum/druid-development

We are also squatting in channel #druid-dev on irc.freenode.net.

On Twitter we are #druidio

If you are interested in contributing to the code, we accept pull requests. Note, that we have only just completed decoupling our Metamarkets-specific code from the code base and we took some short-cuts in interface design to make it happen. So, there are a number of interfaces that exist right now which are likely to be in flux. If you are embedding Druid in your system, it will be safest for the time being to only extend/implement interfaces that this wiki describes, as those are intended as stable (unless otherwise mentioned).

For issue tracking, we are using the github issue tracker. Please fill out an issue from the Issues tab on the github screen.

We also have a Libraries page that lists external libraries that people have created for working with Druid.

Be sure to look at the Concepts and Terminology page.

Composition of a Druid Cluster

The durable/persisted data kept by Druid is in a form termed Segments which can be on local disk and/or in a key-value store like S3 or HDFS.

A druid cluster is composed of various node types or services:

  1. Compute
    The base node, processes queries at the segment level, supports replication
  2. Broker
    The query broker, understands the data topology and routes queries to the correct compute nodes
  3. Master
    The coordinator node, makes sure that all segments are being served, replicated and balanced. General custodian for the cluster
  4. Realtime
    The realtime node is the data pathway for rapid indexing and querying of new data in realtime

The following diagram shows the data flow for queries without showing batch indexing:

Simple Data Flow

Clone this wiki locally