Druid Blog

Druid 0.10.1 release

Druid PMC · August 22, 2017

We're excited to announce the general availability of our latest release, Druid 0.10.1!

Read More

Druid 0.10.0 release

Druid PMC · April 18, 2017

We're excited to announce the general availability of our latest release, Druid 0.10.0!

Read More

Druid 0.9.2 release

Druid PMC · December 1, 2016

We're excited to announce the general availability of our latest release, Druid 0.9.2!

Read More

Druid 0.9.1.1 release

Druid PMC · June 28, 2016

We're excited to announce the general availability of our latest release, Druid 0.9.1.1!

Read More

Announcing New Committers

Druid PMC · January 6, 2016

Happy New Year everyone! We’re excited to announce that we’ve added 8 new committers to Druid. These committers have been making sustained contributions to the project, and we look forward to working with them to continue to develop the project in 2016.

Read More

Seeking New Committers

Druid PMC · November 3, 2015

We are excited to announce that we have formalized the governance of Druid as a community-led project! Druid has been informally community led for some time, with committers from various organizations regularly adding new features, improving performance, and making things easier to use. Project committers vote on proposals, review and write pull requests, provide community support, and help guide the technical direction of the project. You can find more information on the project’s goals and governance on our recently updated Druid webpage. Druid depends on its vibrant community of users for feedback on features, documentation, and bug reports.

Read More

Towards a Community Led Druid

Fangjin Yang, Xavier Léauté, and Eric Tschetter · February 20, 2015

We are very happy to announce that Druid has changed its license to Apache 2.0. We believe this is a change the community will welcome. As engineers, we love to see the things we make get used, and we want to provide value to the broader open source world that we have benefitted from for so long. We believe that switching to the Apache license will better promote the growth of the Druid community. We hope to send a clear message that we are all equal participants in the Druid community, a sentiment that is very important to us.

Read More

Five Tips for a F’ing Great Logo

David Hertog & Fangjin Yang · July 23, 2014

Everyone wants a great logo, but it’s notoriously difficult work—prone to miscommunications, heated debates and countless revisions. Still, after three years we couldn’t put it off any longer. Druid needed a visual identity, so we partnered with the talented folks at Focus Lab for help.

Read More

Open Source Leaders Sound Off on the Rise of the Real-Time Data Stack

Fangjin Yang & Gian Merlino · May 7, 2014

In February we were honored to speak at the O’Reilly Strata conference about building a robust, flexible, and completely open source data analytics stack. If you couldn’t make it, you can watch the video here. Preparing for our talk got us thinking about all the brilliant folks working on similar problems, so we organized a panel that same night to continue the conversation.

Read More

Introduction to pydruid

Igal Levy · April 15, 2014

We've already written about pairing R with RDruid, but Python has powerful and free open-source analysis tools too. Collectively, these are often referred to as the SciPy Stack. To pair SciPy's analytic power with the advantages of querying time-series data in Druid, we created the pydruid connector. This allows Python users to query Druid—and export the results to useful formats—in a way that makes sense to them.

Read More

Benchmarking Druid

Xavier Léauté · March 17, 2014

We often get asked how fast Druid is. Although we have published some benchmark numbers in previous blog posts and in our talks, until now we have not published any data to back those claims up in a reproducible way. This post intends to address that and make it easier for anyone to evaluate Druid and compare it to other systems out there.

Read More

Batch-Loading Sensor Data into Druid

Igal Levy · March 12, 2014

Sensors are everywhere these days, and that means sensor data is big data. Ingesting and analyzing sensor data at speed is an interesting problem, especially when scale is desired. In this post, we'll access some real-world sensor data, and show how Druid can be used to store that data and make it available for immediate querying.

Read More

How We Scaled HyperLogLog: Three Real-World Optimizations

Nelson Ray and Fangjin Yang · February 18, 2014

At Metamarkets, we specialize in converting mountains of programmatic ad data into real-time, explorable views. Because these datasets are so large and complex, we’re always looking for ways to maximize the speed and efficiency of how we deliver them to our clients. In this post, we’re going to continue our discussion of some of the techniques we use to calculate critical metrics such as unique users and device IDs with maximum performance and accuracy.

Read More

RDruid and Twitterstream

Igal Levy · February 3, 2014

What if you could combine a statistical analysis language with the power of an analytics database for instant insights into realtime data? You'd be able to draw conclusions from analyzing data streams at the speed of now. That's what combining the prowess of a Druid database with the power of R can do.

Read More

Querying Your Data

Russell Jurney · November 4, 2013

Before we start querying Druid, we're going to finish setting up a complete cluster on localhost. In our previous posts, we set up a Realtime node. In this tutorial we will also set up the other Druid node types: Compute, Master, and Broker.

Read More

Druid at XLDB

Russell Jurney · September 20, 2013

We recently attended Stanford XLDB and the experience was a blast. Once a year, XLDB invites speakers from different organizations to discuss the challenges of and solutions to dealing with Xtreme (with an X!) data sets. This year, Jeff Dean dropped knowledge bombs about architecting scalable systems, Michael Stonebraker provided inspiring advice about growing open source projects, CERN explained how they found the Higgs Boson, and several organizations spoke about their technology. We definitely recommend checking out the slides from the conference.

Read More

Launching Druid With Apache Whirr

Russell Jurney · September 19, 2013

Without Whirr, launching a Druid cluster means provisioning machines yourself and then installing each node type manually. This process is outlined here. With Whirr, you can boot a Druid cluster by editing a simple configuration file and then issuing a single command!

Read More

Upcoming Events

September 16, 2013

Druid hits the road this fall, with presentations in the Bay Area, North Carolina and New York!

Read More

The Art of Approximating Distributions: Histograms and Quantiles at Scale

Nelson Ray · September 12, 2013

I’d like to acknowledge Xavier Léauté for his extensive contributions (in particular, for suggesting several algorithmic improvements and work on implementation), helpful comments, and fruitful discussions. Featured image courtesy of CERN.

Read More

Understanding Druid Real-time Ingestion

Russell Jurney · August 30, 2013

In our last post, we got a realtime node working with example Twitter data. Now it's time to load our own data to see how Druid performs. Druid can ingest data in three ways: via Kafka and a realtime node, via the indexing service, and via the Hadoop batch loader. Data is ingested in realtime using a Firehose. In this post we'll outline how to ingest data from Kafka in realtime using the Kafka Firehose.

Read More

Understanding Druid Via Twitter Data

Russell Jurney · August 6, 2013

Druid is a rockin' exploratory analytical data store capable of offering interactive queries on big data in realtime, as data is ingested. Druid drives tens of billions of events per day for the Metamarkets platform, and Metamarkets is committed to building Druid in open source.

Read More

Real Real-Time. For Realz.

Eric Tschetter · May 10, 2013

Danny Yuan, Cloud System Architect at Netflix, and I recently co-presented at the Strata Conference in Santa Clara. The presentation discussed how Netflix engineers leverage Druid, Metamarkets’ open-source, distributed, real-time, analytical data store, to ingest 150,000 events per second (billions per day), equating to about 500MB/s of data at peak (terabytes per hour) while still maintaining real-time, exploratory querying capabilities. Before and after the presentation, we had some interesting chats with conference attendees. One common theme from those discussions was curiosity around the definition of “real-time” in the real world and how Netflix could possibly achieve it at those volumes. This post is a summary of the learnings from those conversations and a response to some of those questions.

Read More

Meet the Druid and Find Out Why We Set Him Free

Steve Harris · April 26, 2013

Before jumping straight into why Metamarkets open sourced Druid, I thought I would give a brief dive into what Druid is and how it came about. For more details, check out the Druid white paper.

Read More

Druid, R, Pizza and massively large data sets (Video)

Xavier Léauté · April 3, 2013

On April 3rd, 2013 we held our first Meetup, hosted by Metamarkets. The description and video follow.

Read More

15 Minutes to Live Druid

Jaypal Sethi · April 3, 2013

Big Data reflects today’s world, where data-generating events are measured in the billions and business decisions based on insights derived from that data are measured in seconds. Few tools provide deep insight into both live and stationary data as business events are occurring; Druid was designed specifically to serve this purpose.

Read More

Druid: Interactive Queries Meet Real-time Data (Video)

Eric Tschetter · February 28, 2013

Eric Tschetter (lead architect of Druid) and Danny Yuan (Netflix Platform Engineering Team) co-presented at the 2013 Strata conference in Santa Clara, CA.

Read More

Introducing Druid

Eric Tschetter · October 24, 2012

In April 2011, we introduced Druid, our distributed, real-time data store. Today I am extremely proud to announce that we are releasing the Druid data store to the community as an open source project. To mark this special occasion, I wanted to recap why we built Druid, and why we believe there is broader utility for Druid beyond Metamarkets' analytical SaaS offering.

Read More

Beyond Hadoop: Fast Ad-Hoc Queries on Big Data (Video)

Eric Tschetter · October 24, 2012

Eric Tschetter (lead architect of Druid)

Read More

Maximum Performance with Minimum Storage: Data Compression in Druid

Fangjin Yang · September 21, 2012

The Metamarkets solution allows for arbitrary exploration of massive data sets. Powered by Druid, our in-house distributed data store and processor, users can filter time series and top list queries based on Boolean expressions of dimension values. Given that some of our dataset dimensions contain millions of unique values, the subset of things that may match a particular filter expression may be quite large. To design for these challenges, we needed a fast and accurate (not a fast and approximate) solution, and we once again found ourselves buried under a stack of papers, looking for an answer.

Read More

Fast, Cheap, and 98% Right: Cardinality Estimation for Big Data

Fangjin Yang · May 4, 2012

The nascent era of big data brings new challenges, which in turn require new tools and algorithms. At Metamarkets, one such challenge focuses on cardinality estimation: efficiently determining the number of distinct elements within a dimension of a large-scale data set. Cardinality estimations have a wide range of applications from monitoring network traffic to data mining. If leveraged correctly, these algorithms can also be used to provide insights into user engagement and growth, via metrics such as “daily active users.”

Read More

Scaling the Druid Data Store

Eric Tschetter · January 19, 2012

“Give me a lever long enough… and I shall move the world” — Archimedes

Read More

Druid, Part Deux: Three Principles for Fast, Distributed OLAP

Eric Tschetter · May 20, 2011

In a previous blog post we introduced the distributed indexing and query processing infrastructure we call Druid. In that post, we characterized the performance and scaling challenges that motivated us to build this system in the first place. Here, we discuss three design principles underpinning its architecture.

Read More

Introducing Druid: Real-Time Analytics at a Billion Rows Per Second

Eric Tschetter · April 30, 2011

Here at Metamarkets we have developed a web-based analytics console that supports drill-downs and roll-ups of high dimensional data sets – comprising billions of events – in real-time. This is the first of two blog posts introducing Druid, the data store that powers our console. Over the last twelve months, we tried and failed to achieve scale and speed with relational databases (Greenplum, InfoBright, MySQL) and NoSQL offerings (HBase). So instead we did something crazy: we rolled our own database. Druid is the distributed, in-memory OLAP data store that resulted.

Read More