Apache Druid

Updated at 2018-09-10 23:43

Apache Druid is an analytics data store for event-driven data. In more technical terms, Druid is a real-time, column-oriented time-series database designed to scale horizontally.

It has a web application component to visualize the data.

It is common to use Apache Kafka as a buffer and feed events from Kafka to Druid. Druid should be able to consume any event-oriented or time-series Kafka topic without much hassle.

At the time of writing, Druid doesn't support joins, so you must do them in a pre-processing phase, e.g. with Spark or Flink.
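To illustrate what "joining in pre-processing" means, here is a minimal sketch in plain Python rather than Spark or Flink. The idea is to merge the lookup attributes into each event before ingestion, so that Druid only ever sees flat, denormalized rows. The `users` table, the event fields, and the `denormalize` helper are all hypothetical and exist only for this example.

```python
# Hypothetical lookup table and event stream; names and fields are illustrative.
users = {
    "u1": {"country": "FI", "plan": "free"},
    "u2": {"country": "SE", "plan": "pro"},
}

events = [
    {"timestamp": "2018-09-10T23:43:00Z", "user_id": "u1", "action": "login"},
    {"timestamp": "2018-09-10T23:44:10Z", "user_id": "u2", "action": "purchase"},
    {"timestamp": "2018-09-10T23:45:30Z", "user_id": "u3", "action": "login"},
]


def denormalize(event, users):
    """Return a flat copy of the event with the user's attributes merged in."""
    user = users.get(event["user_id"], {})
    flat = dict(event)  # copy, so the original event is left untouched
    # Unknown users get a placeholder instead of dropping the event.
    flat["country"] = user.get("country", "unknown")
    flat["plan"] = user.get("plan", "unknown")
    return flat


# These flat rows are what you would publish for ingestion.
flat_events = [denormalize(e, users) for e in events]
```

In a real pipeline the same left-join logic would run in the stream processor, and the flattened events would be written to the Kafka topic that Druid reads.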

Common use-cases:

  • Network flows
  • User activity
  • Device metrics
  • Application performance
  • Digital marketing advertisement data
  • Business intelligence

Data is stored in segments. Segments are immutable, and you configure the time granularity at which they are created: hourly, daily, or monthly.
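As a rough sketch of how time-based segmenting works, the snippet below computes which hourly, daily, or monthly interval an event timestamp falls into. It is not Druid code: the `segment_interval` function and its granularity names are assumptions made for illustration, and it ignores the fact that one time interval can hold more than one segment.

```python
from datetime import datetime, timedelta, timezone


def segment_interval(ts, granularity):
    """Return (start, end) of the time interval containing ts; end is exclusive."""
    if granularity == "hour":
        start = ts.replace(minute=0, second=0, microsecond=0)
        end = start + timedelta(hours=1)
    elif granularity == "day":
        start = ts.replace(hour=0, minute=0, second=0, microsecond=0)
        end = start + timedelta(days=1)
    elif granularity == "month":
        start = ts.replace(day=1, hour=0, minute=0, second=0, microsecond=0)
        # First day of the following month, rolling over the year in December.
        if start.month == 12:
            end = start.replace(year=start.year + 1, month=1)
        else:
            end = start.replace(month=start.month + 1)
    else:
        raise ValueError(f"unsupported granularity: {granularity}")
    return start, end


event_time = datetime(2018, 9, 10, 23, 43, tzinfo=timezone.utc)
hour_start, hour_end = segment_interval(event_time, "hour")
day_start, day_end = segment_interval(event_time, "day")
month_start, month_end = segment_interval(event_time, "month")
```

Every event whose timestamp lies in the same interval ends up in the same time bucket, which is why the choice of granularity affects both the number of segments and their size.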

Installing Druid is quite heavy. You need a SQL database for metadata, ZooKeeper for coordination, deep storage such as S3, and a number of servers. Kubernetes + Helm can help with the setup.