In 2019, Gartner predicted that “by 2022, more than half of major new business systems will incorporate continuous intelligence that uses real-time context data to improve decisions,” and users have grown to expect real-time data, especially since the rise of social networks.
Companies are adopting real-time data for many reasons, including providing seamless and personalized experiences to users when interacting with services, and enabling real-time, data-driven decision making.
As the requirement for real-time data has grown, so have the technologies that enable it. Real-time analytics can be achieved in a number of ways, but approaches can generally be split into two camps: streaming analytics and analytics databases.
Streaming analytics happens inline, as data is streamed from one place to another. Analytics happens continuously and in real time, as data is fed through the pipeline. Analytics databases ingest data in as near real time as possible, and allow fast analytical queries to be done on this data.
In this post, we’ll talk through two technologies that implement these techniques: ksqlDB (earlier releases were known as KSQL or Kafka SQL), which provides streaming analytics, and Rockset, a real-time analytics database. We’ll dive into the pros and cons of each approach so you can decide which is right for you.
To deal with the scale and speed of the data being generated, a common pattern is to put this data onto a queue or stream. This decouples the mechanism for transporting the data away from any processing that you want to take place on the data. However, with this data being streamed in real-time, it makes sense to also process and analyze it in real-time, especially if you have a genuine use case for up-to-date analytics.
To overcome this, Confluent developed kqlDB. Developed to work with Apache Kafka, ksqlDB provides an SQL-like interface to data streams, allowing for filtering, aggregations and even joins across data streams. ksqlDB uses Kafka as the storage engine and then works as the compute engine. It also has built-in connectors for external data sources, such as connecting to databases over JDBC so they can be brought into Kafka to be joined with a real-time stream for enrichment.
You can perform analytics in two ways: pull queries or push queries. Pull queries allow you to look up results at a specific point in time and execute the query on the stream as a one-off. This is similar to running a query on a database where you execute the query and a result is returned; if you want to refresh the result, you run the query again. This is useful for synchronous applications and often run with lower latency, as the stream data can be fed into a materialized view, which is kept up to date automatically, so there is less work for the query to do.
Push queries allow you to subscribe to a table or a stream, and as the data is updated downstream, the query results will also reflect these updates in real-time. You execute the query once and the result changes as the data changes in the stream. This is a powerful use case for stream analytics as it allows you to subscribe to the result of a calculation on the data instead of subscribing to the data feed itself.
For example, let’s say you have a taxi app. When you request a taxi, the driver accepts the ride and then on the screen you are shown the driver’s location and your location and given an estimated time of arrival. To display the driver’s current location and the estimated time of arrival, you need to understand the driver’s position in real time and then from that continuously calculate the estimated time to arrive as the driver’s location updates.
You could do this in two ways. The first way is to frequently poll the driver’s location and every time you retrieve the location, display the new position on the screen and also perform the calculation to estimate their arrival time. Alternatively, you could use stream analytics.
The second way is to continuously stream the driver’s and the user’s locations in real-time. This same stream can be used to obtain the driver’s location for display purposes and also, by using a ksqlDB push query, you can calculate the time of arrival. Your application is then subscribed to the output from this push query and whenever the time of arrival changes it is automatically updated on the screen.
Real-Time Analytics Database
An analytics database, as its name suggests, allows for analytics on data stored in a database. Historically, this could mean batch ingesting data into a database and then performing analytical queries on that data. However, tools like Rockset allow you to keep the benefits of a database but provide tools to perform analytics in near real-time.
Fig 1. Difference between streaming analytics and real-time analytics database
Rockset provides out-of-the-box data connectors that allow data to be streamed into their analytics database. Rather than analyzing the data as it is streamed, the data is streamed into the database as close to real time as possible. Then, the analytics can take place on the data at rest. As shown in Fig 1, streaming analytics takes place on the stream itself whereas analytics databases ingest the data in real time and analytics is performed on the database.
There are a number of benefits to storing the data in a database. Firstly you can index the data according to the use case to increase performance and reduce query latency. Unfortunately, creating bespoke indexes in order to make queries run quickly adds significant administrative overhead. And if the database needs bespoke indexes to perform well, then users submitting ad hoc queries are not going to have a great experience. Rockset solved this problem with the Converged Index and an SQL engine implementation that doesn’t require administrators to create bespoke indexes.
With streaming analytics, the focus is often on what is happening right now and although analytics databases support this, they also enable analytics across larger historical data when required.
Some modern analytics databases also support schemaless ingest and can infer the schema on read to remove the burden of defining the schema upfront. For example, ksqlDB can connect to a Kafka topic that accepts unstructured data. However for ksqlDB to query this data, the schema of the underlying data needs to be defined upfront. On the other hand, modern analytics databases like Rockset allow the data to be ingested into a collection without defining the schema. This allows for flexible querying of the data, especially as the structure of the data evolves over time, as it doesn’t require any schema modifications to access the new properties.
Finally, cloud native analytics databases often separate the storage and compute resources. This gives you the ability to scale them independently. This is vital if you have applications with high query per second (QPS) workloads, as when your system needs to deal with a spike in queries. You can easily scale the compute to meet this demand without incurring extra storage costs.
Which Should I Use?
Overall, which system to use will ultimately depend on your use case. If your data is already flowing through Kafka topics and you want to run some real-time queries on this data in-flight, then ksqlDB may be the right choice. It will fulfil your use case and means you don’t have to invest in extra infrastructure to ingest this data into an analytics database. Remember, streaming analytics allows you to transform, filter and aggregate events as data is streamed in and your application can then subscribe to these results to get continuously updated results.
If your use cases are more varied, then a real-time analytics database like Rockset may be the right choice. Analytics databases are ideal if you have data from many different systems that you want to join together, as you can delay joins until query time to get the most up-to-date data. If you need to support ad-hoc queries on historical datasets on top of real-time analytics and require the compute and storage to be scaled separately (important if you have high or variable query concurrency), then a real-time analytics database is likely the right option.