Druid Summit is the annual, 2-day virtual summit for developers, architects and data professionals who are interested in learning or expanding their knowledge of Apache Druid.
- 30+ breakout sessions from leading organizations like Reddit, Atlassian, Confluent, Netflix, Pinterest, Zillow Group and more!
- Hear real-world case studies of Druid in production by leading practitioners
- Gain actionable insights from deep-dive technical talks on Apache Druid
- Discover the latest development methods, architectural patterns and operational best practices from peers
How do developers at Netflix, Confluent, and Reddit do more with analytics? They joined a community of like-minded people who bring analytics to applications to power a new generation of operational workflows. We’ll dive into the use cases and applications built with Apache Druid.
In this Q&A, Matt Armstrong, Head of Engineering - Observability & Data Platform at Confluent, will discuss the use cases and architecture behind Confluent's observability platform. Confluent leverages both Druid and Kafka to ingest millions of events per second and deliver real-time operational visibility for engineering teams, product teams, and customers.
Real-time analytics has gone from cutting edge to competitive advantage, and is now becoming table stakes. Over 8 years and 3 companies working in the space, Gwen has talked to hundreds of companies building real-time data products. In this talk, she'll share the coolest examples of real-time data in B2B SaaS applications, the architecture patterns for implementing them, and the insights she learned from the teams that built them.
Advertising backend systems need low latency for the efficient delivery and availability of budget data to the ad-serving infrastructure. For the Reddit team, we helped automate user behavior analysis to serve ads in under 30 milliseconds.
Pinterest has been using Druid at large ingestion scale and low query latency for a few years. In this talk, we will share recent optimizations we made for querying skewed data, how we handle large-scale daily ingestion, tips on setting up dark reads, and ideas for cost savings on AWS.
Druid is a powerful tool that has paved the way for a whole new wave of data-focused tech. In the Confluence Analytics team, we ingest millions of events every day and use this tool to power a plethora of features that make Confluence smarter, empowering users with their own data. All with sub-second query latency. In this talk, we’ll dive into how Druid serves as the backbone of our tech stack and how Druid’s robust set of features, such as lookups, has allowed us to perform complex ingestion tasks. We will also cover Atlassian's plans to incorporate Druid's latest developments in SQL-based ingestion to optimize our backfill workflow.
In this session I would like to talk about the huge amount of data we ingest into Druid (9 TB of raw data per day) using EMR, all orchestrated by Airflow. As the data grew, we started experiencing many problems. After trying many scaling options, we decided to change our approach and came up with KUDRAS, the Kubernetes Druid Autoscaler.
This project is written in Python and is used in our Apache Druid production environment. KUDRAS is a service built with FastAPI that scales MiddleManager nodes up and down in the most effective way, minimizing ingestion task costs while maximizing ingestion speed.
Apache Druid 24.0 has some amazing new capabilities. The Multi-stage Query Architecture (MSQA) that is now a part of the database will enable significant new functionality. In this initial release it enables batch ingestion using SQL. In this session we go into the details of using SQL to define and run batch ingestions. It's simpler and faster than traditional Druid batch ingestion. Join us to learn how to use it.
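As a taste of what the session covers, a SQL-based batch ingestion reads external data through the EXTERN table function and writes it with REPLACE. A minimal sketch (the datasource name, input URL, and columns below are illustrative, not from the session):

```sql
-- SQL-based batch ingestion (Druid 24.0+): load an external JSON file
-- into a datasource. Names and URL are placeholders for illustration.
REPLACE INTO "wikipedia" OVERWRITE ALL
SELECT
  TIME_PARSE("timestamp") AS "__time",
  "page",
  "user"
FROM TABLE(
  EXTERN(
    '{"type": "http", "uris": ["https://example.com/wikipedia.json.gz"]}',
    '{"type": "json"}',
    '[{"name": "timestamp", "type": "string"}, {"name": "page", "type": "string"}, {"name": "user", "type": "string"}]'
  )
)
PARTITIONED BY DAY
```

The same statement form also covers transformation during ingestion: any SQL expression in the SELECT list becomes a column of the target datasource.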
Come learn about the way that TrueCar uses Druid as the engine to optimize the buying experience through clickstream analytics. Anil will introduce us to the role that Druid played in modernizing TrueCar's data pipeline and migrating from batch to streaming analytics. This has enabled them to unlock faster decision-making through real-time dashboards and anomaly detection across digital interaction data.
At Netflix, our engineers often need to see metrics as distributions to get a full picture of how our users are experiencing playback. For example, “play delay”: the time from hitting play to seeing the video start. Measuring this as an average would lose a lot of detail. We have tried various data types to store these distributions in a way that lets us query across a massive dataset and get results within seconds, so the experience remains interactive.
T-Digest and DataSketches didn't keep up with our needs, so we came up with our own storage format that can merge hundreds of billions of rows with sub-second query times.
In this session we’ll introduce Spectator Histogram, a storage format that provides good approximations of distributions, scales to trillions of rows while maintaining good query performance, and supports percentile queries through use with Netflix OSS Atlas and the Druid-Bridge module.
Machine learning and AI have progressed out of the initial technology development phase and into full production. In this new domain, the concept of monitoring has been redefined. Monitoring now fully encompasses the traditional APM and infrastructure monitoring domains of the past, but also extends them to include measuring the performance of the models themselves.
This new extension to the domain of monitoring brings with it the need to develop entirely new tools and approaches to capture, calculate, and present this data to users in a manner that makes sense. This requires extensive research and development, including the selection of the data storage technology that would power a system designed to meet these requirements.
In this talk I will dive into the technical requirements of a production model monitoring system, the architecture selected, and the system that was implemented to bring a production ML model monitoring product to market. In doing so, I will dive deeply into our choice of Apache Druid as the core data storage technology, and how we have leveraged and extended the platform Druid provides to build this product. This will include an in depth discussion of running Druid inside Kubernetes, custom data aggregation, and running Druid in production.
Machine learning models require thousands of data points to train. This session will look at a number of mechanisms Druid offers to speed up model training. Tuple sketches can be used to train models where many different metrics are required. The sessionization extension can speed up training for use cases that forecast viewership or session experience. Druid's ability to output metrics at different time granularities can be used to smooth out time-series forecasts.
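For instance, producing training data at a different time granularity is a one-line change in Druid SQL via TIME_FLOOR. A sketch (the datasource and metric names here are hypothetical):

```sql
-- Daily totals for a forecasting model; swap 'P1D' for 'PT1H'
-- to train on hourly data instead. Names are illustrative.
SELECT
  TIME_FLOOR("__time", 'P1D') AS "day",
  SUM("watch_time") AS total_watch_time
FROM "viewership"
GROUP BY 1
ORDER BY 1
```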
ironSource exposed a Druid cluster to its external users, called Real-Time Pivot, which runs at a scale of 2-3M events/sec, with tens of TBs added per day, while serving parallel queries within 1-2 seconds. We managed not only to achieve high throughput without backlogs and to serve parallel queries with low latency, but to do all that at minimal hardware cost. In our presentation we’ll elaborate on how we achieved these three goals.
Apache Druid is often used to power applications that are fun, fluid, and fast. It’s designed for the last mile, sitting behind a UI where customers, suppliers, and employees are waiting for fast responses to their analytic questions. In this presentation, we’ll review the best modeling and ingestion practices in order to optimize application performance.
In this talk, we'll learn more about how Lyft builds data pipelines using Apache Druid, which is useful for several use cases including metrics tracking, model forecasting, and internal tools. We'll also talk about the challenges we faced while setting up our real-time ingestion pipeline into Druid using Apache Flink and Kafka, and how we went about solving them.
Statsig enables companies to run experiments so that they can understand the impact of the changes they are making to their product, and do in depth analysis so that they can understand how people use their product. Visualizing and analyzing data in real-time is an important functionality that we provide. Druid is a key component behind it that supports ingesting millions of events and serving live queries. In this talk, we will explain how Druid empowers us to achieve these product features.
In addition, we will also go through our journey of operating Druid in Kubernetes and how we brought Druid into production from scratch in just a few weeks. We will discuss the steps we took to make it production-ready internally, as well as how we make continuous improvements and upgrades to Druid while keeping internal developers and external users happy.
Druid has enabled ThousandEyes to reduce dashboard latency by 10x and provide our customers with a real-time, interactive view of their terabytes of data. In this talk we will learn how the Data Visualization team in ThousandEyes upserts data in Druid in real-time.
Most Druid applications aim for sub-second analytics queries - but this becomes difficult when dealing with large volumes of real-time data. We’ll take you through our series of engineering and ops efforts that ultimately helped us reach sub-second real-time queries at scale with Druid by using a custom data sharding mechanism.
At Singular, we process and store our customers' marketing data so they can run complex analytical queries on it and ultimately make smarter, data-driven decisions. While many of these decisions can be based on historical batch-ingested data, more and more decisions are being made based on fresh real-time data.
With Apache Druid as our client-facing single source of truth, we have utilized its real-time Kafka ingestion effectively for some time, ingesting all the real-time data into a single data source. As we grew, however, our real-time data source got larger and larger and became a significant bottleneck, causing spikes in query times and service disruption. After researching the issue, we discovered the problem: since we had a single real-time data source, different customers' data was stored in the same segments. When running a heavy query, this caused almost all resources to be allocated to that query. It also caused the data to be stored sub-optimally and significantly increased the volume of data that needed to be scanned.
Spark Druid Segment Reader is a Spark connector that extends the Spark DataFrame reader and supports reading Druid data directly from PySpark. Using DataSource options, you can easily define the input path and the date range of directories to read.
The connector detects the segments for given intervals based on the directory structure on Deep Storage (without accessing the metadata store). For every interval, only the latest version is loaded. The data schema of a single segment file is inferred automatically, and all generated schemas are merged into one compatible schema.
At Deep.BI, we work heavily with Druid on a daily basis and it is a central component of our real-time stack for both Analytics and Machine Learning & AI use cases. Other key components of our stack include Kafka, Spark, and Cassandra.
One challenge we faced was that our data science team often needed to run ad hoc analyses on data from Druid, which put unnecessary pressure on the cluster, since the majority of those jobs could be done in Spark, which is also a more familiar environment for them.
In order to reduce the workload for them, our data engineering team, and the load on our Druid cluster, we created the Spark Druid Segment Reader.
It is an open-source (Apache 2.0 license) tool that acts as a raw Druid data extractor for Spark. It imports Druid data into Spark with no Druid involvement, letting data scientists use the data and run complex, ad hoc analyses without putting pressure on the Druid cluster.
The connector enables various teams (e.g. data scientists) to access all historical data on demand, going years back, and run any necessary analyses on it. They can re-process historical data without involving the Druid cluster (for extraction), which greatly reduces data duplication between the data lake and Druid Deep Storage.
You can find the GitHub repository here: https://github.com/deep-bi/spark-druid-segment-reader
In Apache Druid, after some data is ingested and a segment is published to Deep Storage, it is load rules, drop rules, and the kill process that govern the caching of segments on Historicals across multiple tiers. In this session we review the overall process that each segment undergoes and how to control it. We'll talk about how to deal with hot and cold tiers and provide details on how the Coordinator's algorithm manages the overall segment life cycle.
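For orientation, a hot/cold tier setup is typically expressed as an ordered list of retention rules on the Coordinator. A minimal sketch (the tier names and periods below are illustrative):

```json
[
  {
    "type": "loadByPeriod",
    "period": "P1M",
    "includeFuture": true,
    "tieredReplicants": {"hot": 2}
  },
  {
    "type": "loadForever",
    "tieredReplicants": {"cold": 1}
  }
]
```

Rules are evaluated top to bottom per segment: data from the last month is kept with two replicas on the hot tier, and everything older falls through to a single replica on the cold tier.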
In this talk, we will learn more about Medialab's analytics strategy to optimize ad campaigns across several global consumer internet brands using Druid. Juan will highlight the company's journey from Open Source Druid to Imply, and some of the dashboards that are used by end-users for real-time monitoring of advertising success and failure rates.
A case study sharing how easy it is to set up a working analytics DB using Imply Polaris: leveraging the batch ingestion APIs from Python, scheduled with Apache Airflow; using Theta sketches for unique property-listing calculations; and my overall experience with Polaris while building this PoC to fulfill our self-serve analytics needs.
VMware NSX Intelligence is a distributed analytics platform powered by Druid that can process and visualize network traffic flows, detect anomalous network behaviors, and recommend security policies. We will introduce our data pipeline (Kafka, Spark, Druid), how we ingest and store data with Druid, how we perform different kinds of queries to achieve our use cases, and our learnings over the past 3 years.
Apache Druid can handle streaming ingestion jobs at any scale. This occasionally poses challenges. Apache Kafka is a popular distributed event streaming platform, so it’s a natural choice for this presentation. As a community, we hope to improve our existing documentation of Kafka errors, so we’re hoping that this presentation will get the ball rolling. We’ll cover the basics of setting up a Kafka streaming ingestion job, including a best-practice tip or two, such as monitoring your deployment with metrics and logs. We’ll also talk about the location of Kafka-related logs within your deployment. We’ll conclude with some common Kafka ingestion errors and solutions, specifically lag and parsing errors.
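To make the setup step concrete, a Kafka ingestion job is defined by a supervisor spec submitted to the Overlord. A minimal sketch (the topic, broker address, and schema below are placeholders):

```json
{
  "type": "kafka",
  "spec": {
    "ioConfig": {
      "type": "kafka",
      "topic": "events",
      "consumerProperties": {"bootstrap.servers": "localhost:9092"},
      "inputFormat": {"type": "json"}
    },
    "dataSchema": {
      "dataSource": "events",
      "timestampSpec": {"column": "ts", "format": "iso"},
      "dimensionsSpec": {"dimensions": ["user", "action"]},
      "granularitySpec": {"segmentGranularity": "hour", "queryGranularity": "none"}
    },
    "tuningConfig": {"type": "kafka"}
  }
}
```

Once submitted, the supervisor manages the Kafka consumer tasks, and ingestion lag shows up in the supervisor's status report and metrics.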
Sometimes we want to understand how Druid will perform using various types of data, but the actual data may not be available. The developer relations team at Imply has created a data generator program that can simulate various types of data for you. This script may be helpful if you are creating a POC or projecting the performance of data that does not yet exist. In this session, we will introduce you to this generator, tell you how you can get it (it's free), and explain how to configure it.
Customer Intelligence facilitates informed decisions to increase business efficiency and success. Out of The Blue™ (OTB) provides SaaS solutions that autonomously and continuously monitor KPIs to create actionable insights. Digital operations and product leaders use resultant intelligence to prioritize and focus on business objectives, capitalize on opportunities and resolve critical problems without having to initiate manual analysis or use dashboards.
This talk will also discuss how monitoring metrics help build an intelligence-driven organization, tactics successful organizations use, and the typical tooling challenges around getting this right.
Poshmark, a joint OTB and Imply customer, uses OTB Perceive to validate the performance of internally driven actions like releases and promos and to catch any adverse impacts on business performance. Autonomous analytics automatically identifies problems and their causes to facilitate rapid resolution.
You will learn:
- Why Autonomous Customer Intelligence matters
- Typical organizational and tooling challenges
- How and why OTB uses Imply to generate Customer Intelligence, including detecting anomalous activity, and subtle and slow-changing trends
- The benefits OTB’s customers realize
Apache Druid 24.0 has some amazing new capabilities. One of the most exciting ones is the ability to load complex nested JSON structures automatically. In this session we review the motivation for this feature; the complexity of dealing with nested and dynamic JSON data. We'll show how it is now extremely simple to ingest and query this type of data. We'll talk about Druid's ability to deal with changing schemas in the nested JSON and how every JSON field is indexed to provide speed at query time. We'll also take a look at early benchmarks with some pretty impressive results. If you've dealt with flattening data in the past, you won't want to miss this session!
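To give a flavor of the feature: once a column is ingested with the new JSON type, nested fields can be queried directly with functions like JSON_VALUE. A sketch (the datasource and field names here are hypothetical):

```sql
-- Query a nested field inside a JSON-typed column without flattening.
-- Datasource and paths are illustrative.
SELECT
  JSON_VALUE("event", '$.user.country') AS country,
  COUNT(*) AS events
FROM "clickstream"
GROUP BY 1
ORDER BY events DESC
```

Because each nested field is indexed at ingestion time, a filter or GROUP BY on `JSON_VALUE(...)` can perform comparably to one on a regular flattened column.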
Indian Express, a leading media publishing company based in Mumbai, is using Imply for user activity analysis. In this talk, we share how Indian Express migrated their analysis from Google Analytics and Tableau to Imply for enhanced real-time analysis of their streaming data. In addition to this existing use case, they also monitor various detailed events across the platform, like trends in signups, login activity, subscription utilization, and geo-based campaign tracking.
Danggeun Market (Daangn) is an online marketplace in Korea focused on used goods sales in the neighbourhood. This session will detail how Daangn uses Druid to speed up clickstream data analytics. Real-time data is ingested into Druid from Kafka, and the ingested data is queried using Druid's nested-column support. The session will wrap up by looking at some performance numbers and dashboards and discussing future directions, including merging topics using Kafka Streams.
SQL is a powerful and elegant query language for databases. Druid is increasingly supporting and moving forward with SQL, which is arguably easier to learn and write than the original native (JSON) query language. This talk will discuss the Druid implementation of SQL, focusing on querying data. How does it work? What parts of SQL are and aren't supported? What are some gotchas, and tips and tricks, for people new to it? We'll focus on querying data. We'll also briefly discuss the new multi-stage query engine and SQL syntax for ingesting or transforming data, but the main focus will be on writing good, efficient queries.
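As an example of the kind of query the talk dissects, a typical Druid SQL query filters on `__time` so that only the relevant segments are scanned. A sketch (the table and column names are illustrative):

```sql
-- A well-formed Druid SQL query: time filter first, so segment
-- pruning kicks in; names are placeholders for illustration.
SELECT
  "channel",
  COUNT(*) AS edits
FROM "wikipedia"
WHERE "__time" >= CURRENT_TIMESTAMP - INTERVAL '1' DAY
GROUP BY "channel"
ORDER BY edits DESC
LIMIT 10
```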
Compaction in the Druid database has been evolving into use cases beyond improving query performance and reducing storage. This talk will first go over what compaction is, current use cases for compaction, a beginner’s guide to setting up compaction (and automatic compaction), and common pitfalls when setting up auto-compaction. The talk will then dive into short- and long-term plans for compaction in Druid. Future features include creating derived datasources / materialized views and automatically changing or maintaining different schemas for the same datasource based on the age of the data.
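For context, automatic compaction is configured per datasource on the Coordinator. A minimal sketch (the datasource name and tuning values below are illustrative):

```json
{
  "dataSource": "events",
  "skipOffsetFromLatest": "P1D",
  "granularitySpec": {"segmentGranularity": "day"},
  "tuningConfig": {
    "partitionsSpec": {"type": "dynamic", "maxRowsPerSegment": 5000000}
  }
}
```

Here `skipOffsetFromLatest` keeps auto-compaction away from the most recent day of data, which is the common pitfall when real-time ingestion is still appending to those intervals.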
What if I told you moving away from multi-level aggregates could actually speed up your reporting system? In this session, we look at how Druid can be used to retrieve data for reporting quickly and efficiently. We’ll discuss 3 reasons you should consider moving away from multi-level aggregates and 5 benefits gained by making the transition to Druid. Finally, we’ll take a brief look at the system architecture we built to support customer reporting needs.