This tutorial shows you how to load your own data files into Druid.
For this tutorial, we'll assume you've already downloaded Druid as described in the single-machine quickstart and have it running on your local machine. You don't need to have loaded any data yet.
Once that's complete, you can load your own dataset by writing a custom ingestion spec.
When loading files into Druid, you will use Druid's batch loading process. There's an example batch ingestion spec in quickstart/wikiticker-index.json that you can modify for your own needs.
The most important questions are:
- What should the dataset be called? This is the "dataSource" field of the "dataSchema".
- Where is the dataset located? The file paths belong in the "paths" of the "inputSpec". If you want to load multiple files, you can provide them as a comma-separated string.
- Which field should be treated as a timestamp? This belongs in the "column" of the "timestampSpec".
- Which fields should be treated as dimensions? This belongs in the "dimensions" of the "dimensionsSpec".
- Which fields should be treated as metrics? This belongs in the "metricsSpec".
If your data does not have a natural sense of time, you can tag each row with the current time. You can also tag all rows with a fixed timestamp, like "2000-01-01T00:00:00.000Z".
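For example, one way to tag every row with a fixed timestamp is the timestampSpec's "missingValue" setting, which applies whenever the named column is absent from the input. A sketch (the column name "timestamp" here is a placeholder for a column your data doesn't have; check that your Druid version supports "missingValue"):

"timestampSpec": {
  "format": "auto",
  "column": "timestamp",
  "missingValue": "2000-01-01T00:00:00.000Z"
}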
Let's use this pageviews dataset as an example. Druid supports TSV, CSV, and JSON out of the box. Note that nested JSON objects are not supported, so if you do use JSON, you should provide a file containing flattened objects.
{"time": "2015-09-01T00:00:00Z", "url": "/foo/bar", "user": "alice", "latencyMs": 32}
{"time": "2015-09-01T01:00:00Z", "url": "/", "user": "bob", "latencyMs": 11}
{"time": "2015-09-01T01:30:00Z", "url": "/foo/bar", "user": "bob", "latencyMs": 45}
Make sure the file has no newline at the end (one way to do this is shown after the list below). If you save this to a file called "pageviews.json", then for this dataset:
- Let's call the dataset "pageviews".
- The data is located in "pageviews.json".
- The timestamp is the "time" field.
- Good choices for dimensions are the string fields "url" and "user".
- Good choices for metrics are a count of pageviews, and the sum of "latencyMs". Collecting that sum when we load the data will allow us to compute an average at query time as well.
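To write the file without a trailing newline, you can use printf, for example (a minimal sketch):

printf '%s\n%s\n%s' \
  '{"time": "2015-09-01T00:00:00Z", "url": "/foo/bar", "user": "alice", "latencyMs": 32}' \
  '{"time": "2015-09-01T01:00:00Z", "url": "/", "user": "bob", "latencyMs": 11}' \
  '{"time": "2015-09-01T01:30:00Z", "url": "/foo/bar", "user": "bob", "latencyMs": 45}' \
  > pageviews.json

The format string '%s\n%s\n%s' puts newlines between the records but not after the last one.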
You can copy the existing quickstart/wikiticker-index.json indexing task to a new file:
cp quickstart/wikiticker-index.json my-index-task.json
And modify it by altering these sections:
"dataSource": "pageviews"
"inputSpec": {
"type": "static",
"paths": "pageviews.json"
}
"timestampSpec": {
"format": "auto",
"column": "time"
}
"dimensionsSpec": {
"dimensions": ["url", "user"]
}
"metricsSpec": [
{"name": "views", "type": "count"},
{"name": "latencyMs", "type": "doubleSum", "fieldName": "latencyMs"}
]
"granularitySpec": {
"type": "uniform",
"segmentGranularity": "day",
"queryGranularity": "none",
"intervals": ["2015-09-01/2015-09-02"]
}
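Putting those modifications together, the resulting my-index-task.json might look roughly like this. This is a sketch that assumes the overall structure of the quickstart file, including its "index_hadoop" task type, "hadoopyString" parser, and a default tuningConfig; your actual copy of the file is the authoritative starting point:

{
  "type": "index_hadoop",
  "spec": {
    "ioConfig": {
      "type": "hadoop",
      "inputSpec": {
        "type": "static",
        "paths": "pageviews.json"
      }
    },
    "dataSchema": {
      "dataSource": "pageviews",
      "parser": {
        "type": "hadoopyString",
        "parseSpec": {
          "format": "json",
          "timestampSpec": {
            "format": "auto",
            "column": "time"
          },
          "dimensionsSpec": {
            "dimensions": ["url", "user"]
          }
        }
      },
      "metricsSpec": [
        {"name": "views", "type": "count"},
        {"name": "latencyMs", "type": "doubleSum", "fieldName": "latencyMs"}
      ],
      "granularitySpec": {
        "type": "uniform",
        "segmentGranularity": "day",
        "queryGranularity": "none",
        "intervals": ["2015-09-01/2015-09-02"]
      }
    },
    "tuningConfig": {
      "type": "hadoop",
      "partitionsSpec": {
        "type": "hashed",
        "targetPartitionSize": 5000000
      },
      "jobProperties": {}
    }
  }
}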
To actually run this task, first make sure that the indexing task can read pageviews.json:
- If you're running everything on a single machine (the quickstart configuration), place the file in the root of the Druid distribution.
- If you're running a Hadoop cluster, place the file in HDFS instead and update the "paths" in the ingestion spec accordingly.

To kick off the indexing process, POST your indexing task to the Druid Overlord. In a standard Druid install, the URL is http://OVERLORD_IP:8090/druid/indexer/v1/task.
curl -X 'POST' -H 'Content-Type:application/json' -d @my-index-task.json OVERLORD_IP:8090/druid/indexer/v1/task
If you're running everything on a single machine, you can use localhost:
curl -X 'POST' -H 'Content-Type:application/json' -d @my-index-task.json localhost:8090/druid/indexer/v1/task
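A successful submission returns a small JSON object containing the task ID. If you'd like to check on the task programmatically rather than through the console, you can poll the Overlord's status endpoint (TASK_ID below is a placeholder for the ID returned by the submission):

curl http://localhost:8090/druid/indexer/v1/task/TASK_ID/status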
If anything goes wrong with this task (e.g. it finishes with status FAILED), you can troubleshoot by visiting the "Task log" on the Overlord console (http://localhost:8090/console.html).
Your data should become fully available within a minute or two. You can monitor this process on your Coordinator console at http://localhost:8081/#/.
Once your data is fully available, you can query it using any of the supported query methods.
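As a quick sanity check, here is a sketch of a native timeseries query over the new datasource. It sums the "views" count metric and the "latencyMs" sum we defined at ingestion time; the file name pageviews-query.json is a hypothetical choice:

{
  "queryType": "timeseries",
  "dataSource": "pageviews",
  "granularity": "all",
  "intervals": ["2015-09-01/2015-09-02"],
  "aggregations": [
    {"type": "longSum", "name": "views", "fieldName": "views"},
    {"type": "doubleSum", "name": "latencyMs", "fieldName": "latencyMs"}
  ]
}

Assuming the quickstart's default Broker address of localhost:8082, you can POST it with:

curl -X 'POST' -H 'Content-Type:application/json' -d @pageviews-query.json 'http://localhost:8082/druid/v2/?pretty'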
For more information on loading batch data, please see the batch ingestion documentation.