Druid can connect to any streaming data source through Tranquility, a package for pushing streams to Druid in real-time. Druid does not come bundled with Tranquility, so you must download the Tranquility distribution separately.
Note that with all streaming ingestion options, you must ensure that incoming data is recent enough (within a configurable windowPeriod of the current time). Older messages will not be processed in real-time. Historical data is best processed with batch ingestion.
Druid can use Tranquility Server, which lets you send data to Druid without developing a JVM app. You can run Tranquility Server colocated with Druid middleManagers and historical processes.
Tranquility server is started by issuing:
bin/tranquility server -configFile <path_to_config_file>/server.json
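Once the server is running, events are pushed to it over HTTP. The following is a minimal sketch of building such a request from Python. The `/v1/post/<dataSource>` path, port 8200, and newline-delimited JSON body are assumptions taken from the Tranquility Server documentation and quickstart rather than from this page, so check them against your server.json. The `build_post_request` helper and the `pageviews` dataSource name are hypothetical.

```python
import json
import urllib.request


def build_post_request(events, data_source, host="localhost", port=8200):
    """Build (but do not send) an HTTP POST carrying newline-delimited JSON events.

    Assumption: Tranquility Server listens on `port` and accepts events
    at /v1/post/<data_source>; verify both against your own configuration.
    """
    body = "\n".join(json.dumps(event) for event in events).encode("utf-8")
    url = f"http://{host}:{port}/v1/post/{data_source}"
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


# Hypothetical events; the timestamps must be recent to be accepted in real-time.
events = [
    {"time": "2024-01-01T12:00:00Z", "url": "/foo", "latencyMs": 32},
    {"time": "2024-01-01T12:00:01Z", "url": "/bar", "latencyMs": 11},
]
request = build_post_request(events, "pageviews")
# To actually send it: urllib.request.urlopen(request)
```

The request is built separately from sending so that the payload can be inspected or logged before anything goes over the network.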
To customize Tranquility Server:
server.json, customize the
Tranquility can also be embedded in JVM-based applications as a library. You can do this directly in your own program using the Core API, or you can use the connectors bundled in Tranquility for popular JVM-based stream processors such as Storm, Samza, Spark Streaming, and Flink.
Tranquility Kafka lets you load data from Kafka into Druid without writing any code. You only need a configuration file.
Tranquility Kafka is started by issuing:
bin/tranquility kafka -configFile <path_to_config_file>/kafka.json
To customize Tranquility Kafka in the single-machine quickstart configuration:
kafka.json, customize the
For tips on customizing kafka.json, see the Tranquility Kafka documentation.
Tranquility automates creation of Druid realtime indexing tasks, handling partitioning, replication, service discovery, and schema rollover for you, seamlessly and without downtime. You never have to write code to deal with individual tasks directly. However, it can be helpful to understand how Tranquility creates tasks.
Tranquility spawns relatively short-lived tasks periodically, and each one handles a small number of Druid segments. Tranquility coordinates all task creation through ZooKeeper. You can start up as many Tranquility instances as you like with the same configuration, even on different machines, and they will send to the same set of tasks.
See the Tranquility overview for more details about how Tranquility manages tasks.
The segmentGranularity is the time period covered by the segments produced by each task. For example, a segmentGranularity of "hour" will spawn tasks that create segments covering one hour each.
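As an illustration of how a timestamp maps to an hourly segment, here is a minimal sketch. The `hourly_segment_interval` function is hypothetical and only models the bucketing arithmetic; it is not part of Druid or Tranquility.

```python
from datetime import datetime, timedelta, timezone


def hourly_segment_interval(event_time):
    """Return the (start, end) of the hour-long interval containing event_time."""
    start = event_time.replace(minute=0, second=0, microsecond=0)
    return start, start + timedelta(hours=1)


event_time = datetime(2024, 1, 1, 13, 45, tzinfo=timezone.utc)
start, end = hourly_segment_interval(event_time)
# start is 13:00 UTC and end is 14:00 UTC on the same day
```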
The windowPeriod is the slack time permitted for events. For example, a windowPeriod of ten minutes (the default) means that any events with a timestamp older than ten minutes in the past, or more than ten minutes in the future, will be dropped.
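The rule described above can be sketched as a simple check. This is a simplified model based only on the description on this page; the `within_window` function is hypothetical, and the treatment of an event exactly on the boundary is an assumption.

```python
from datetime import datetime, timedelta, timezone


def within_window(event_time, now, window_period=timedelta(minutes=10)):
    """Return True if event_time is no further than window_period from now.

    Assumption: an event exactly window_period away is accepted.
    """
    return abs(now - event_time) <= window_period


now = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
recent = within_window(now - timedelta(minutes=5), now)      # accepted
too_old = within_window(now - timedelta(minutes=11), now)    # dropped
too_early = within_window(now + timedelta(minutes=11), now)  # dropped
```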
These are important configurations because they influence how long tasks stay alive, and how long data stays in the realtime system before being handed off to the historical nodes. For example, if your configuration has segmentGranularity "hour" and windowPeriod ten minutes, tasks will stay around listening for events for an hour and ten minutes. For this reason, to prevent excessive buildup of tasks, it is recommended that your windowPeriod be less than your segmentGranularity.
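The arithmetic in this example, and the recommendation that follows from it, can be written out as a short sketch. Both helper functions are hypothetical and model only what is described on this page; they ignore any other settings that may affect task lifetime.

```python
from datetime import timedelta


def task_listen_duration(segment_granularity, window_period):
    """How long a task listens for events: one segment period plus the window."""
    return segment_granularity + window_period


def window_is_shorter_than_segment(segment_granularity, window_period):
    """Check the recommendation that windowPeriod be less than segmentGranularity."""
    return window_period < segment_granularity


hour = timedelta(hours=1)
ten_minutes = timedelta(minutes=10)

print(task_listen_duration(hour, ten_minutes))            # 1:10:00
print(window_is_shorter_than_segment(hour, ten_minutes))  # True
```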
Druid streaming ingestion is append-only, meaning you cannot use streaming ingestion to update or delete individual records after they are inserted. If you need to update or delete individual records, you need to use a batch reindexing process. See the batch ingest page for more details.
Druid does support efficient deletion of entire time ranges without resorting to batch reindexing. This can be done automatically through setting up retention policies.
Tranquility operates under a best-effort design. It tries reasonably hard to preserve your data, by allowing you to set up replicas and by retrying failed pushes for a period of time, but it does not guarantee that your events will be processed exactly once. In some conditions, it can drop or duplicate events:
Under normal operation, these risks are minimal. But if you need absolute 100% fidelity for historical data, we recommend a hybrid batch/streaming architecture, described below.
You can combine batch and streaming methods in a hybrid batch/streaming architecture. In a hybrid architecture, you use a streaming method to do initial ingestion, and then periodically re-ingest older data in batch mode (typically every few hours, or nightly). When Druid re-ingests data for a time range, the new data automatically replaces the data from the earlier ingestion.
All streaming ingestion methods currently supported by Druid do introduce the possibility of dropped or duplicated messages in certain failure scenarios, and batch re-ingestion eliminates this potential source of error for historical data.
Batch re-ingestion also gives you the option to re-ingest your data if you need to revise it for any reason.
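To make the replace-by-time-range behavior concrete, here is a toy in-memory model. It is purely illustrative: the `ToyDatasource` class is hypothetical and does not reflect how Druid stores or versions segments. It only shows that a batch re-ingestion of an interval supersedes whatever streaming ingestion left there, including duplicates.

```python
from collections import defaultdict


class ToyDatasource:
    """Toy model: rows grouped by time interval, replaced wholesale by batch loads."""

    def __init__(self):
        self.rows_by_interval = defaultdict(list)

    def stream_append(self, interval, row):
        # Streaming ingestion is append-only.
        self.rows_by_interval[interval].append(row)

    def batch_replace(self, interval, rows):
        # Batch re-ingestion replaces all earlier data for the interval.
        self.rows_by_interval[interval] = list(rows)


ds = ToyDatasource()
interval = "2024-01-01T12:00/2024-01-01T13:00"
ds.stream_append(interval, {"id": 1})
ds.stream_append(interval, {"id": 1})  # duplicate caused by a retried push
ds.stream_append(interval, {"id": 2})

# Later batch run re-ingests the interval from the source of truth.
ds.batch_replace(interval, [{"id": 1}, {"id": 2}, {"id": 3}])
```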
Stream ingestion may generate a large number of small segments because it is difficult to optimize the segment size at ingestion time. The number of segments will increase over time, and this might cause query performance issues.
Details on how to optimize the segment size can be found on Segment size optimization.
Tranquility documentation can be found here.
Tranquility configuration can be found here.
Tranquility's tuningConfig can be found here.