Given that it is impossible to process data with a modern processor without first loading the data into memory, yes, it is in-memory. The earliest iterations of Druid didn’t allow for data to be paged in from and out to disk, so we often called it an “in-memory” system. However, we very quickly realized that RAM hasn’t become cheap enough to actually store all data in RAM and sell a product at a price-point that customers are willing to pay. Since the Summer of 2011, we have leveraged memory-mapping to allow us to page data between disk and memory and extend the amount of data a single node can load up to the size of its disks.
That said, as we made the shift, we didn’t want to compromise on the ability to configure the system to run such that everything is essentially in memory. To this end, individual Historical nodes can be configured with the maximum amount of data they should be given. Couple that with the Coordinator’s ability to assign data blocks to different “tiers” based on differing query requirements and Druid essentially becomes a system that can be configured across the whole spectrum of performance requirements. Configuration can be such that all data can be in memory and processed, it can be such that data is heavily over-committed compared to the amount of memory available, and it can also be that the most recent month of data is in memory, while everything else is over-committed.
Druid has three external dependencies that must be running in order for the Druid cluster to operate
Simply put, deep storage is some storage infrastructure that Druid depends on for data availability. If this infrastructure goes down, then Druid cannot load new data into the system, nor can it recover from failures in the cluster. As long as this infrastructure is operational, Druid Historical nodes cannot lose data.
A more technical description of Deep Storage and the options available exists in our docs.
Druid requires some structure to the data it ingests. In general data should consist of a timestamp, dimensions and metrics. These are discussed in a bit more detail in our original blog post: Introducing Druid: Real-Time Analytics at a Billion Rows Per Second
No, although timestamped data will trigger certain optimizations and queries will generally perform better.
No, Zookeeper can be deployed to withstand a configurable number of individual node failures. Also, if ZooKeeper goes down, the cluster will continue to operate.
Losing ZooKeeper does, however, mean that the cluster cannot add new data segments, nor can it effectively react to the lose of one of the nodes. So, while it is safe to lose access to ZooKeeper, it is definitely a degraded state.
No, the “Coordinator” process is merely used for cluster coordination. It is not involved in the query path. Losing all coordinators means that new segments will not be loaded by historical nodes, i.e. no new data will enter the cluster. Also, segment balancing decisions will not be made, so it stops the cluster from being able to effectively scale up and down. Just like ZooKeeper, losing the coordinators puts the cluster in a degraded state, but it will keep operating just fine.
Also, there can be multiple coordinator nodes deployed. Extra coordinators will act as “hot spares” in case the active coordinator is lost.
Queries never touch the coordinator. Ever.
Yes, the coordinator has no impact on queries.
Updating data means that you have to re-generate the segments for the time period that you need to update. This can be done by reindexing the data for that time period. Once the indexer finishes, the Druid cluster will swap the new segment in and stop serving the old segment.
The Indexing Service is a job scheduling service for various Druid ingestion tasks.
More information can be found in the docs: