This is the fourth post in a series by Rockset’s CTO Dhruba Borthakur on Designing the Next Generation of Data Systems for Real-Time Analytics. We’ll be publishing more posts in the series in the near future, so subscribe to our blog so you don’t miss them!
Posts published so far in the series:
- Why Mutability Is Essential for Real-Time Data Analytics
- Handling Out-of-Order Data in Real-Time Analytics Applications
- Handling Bursty Traffic in Real-Time Analytics Applications
- SQL and Complex Queries Are Needed for Real-Time Analytics
- Why Real-Time Analytics Requires Both the Flexibility of NoSQL and Strict Schemas of SQL Systems
Today’s data-driven businesses need not only fast answers derived from the freshest data, but they must also perform complex queries to solve complicated business problems.
For instance, customer personalization systems need to combine historic data sets with real-time data streams to instantly provide the most relevant product recommendations to customers. So must operational analytics systems providing mission-critical real-time business observability, such as the case of an online payments provider that needs to monitor its transactions worldwide for anomalies that could signal financial fraud.
Or imagine an e-learning platform that needs to provide up-to-the-minute insights into student and teacher usage for school district customers and internal customer-facing teams. Or a market news provider that needs to monitor and ensure that its financial customers are getting accurate, relevant updates within the narrow windows for profitable trades.
Limitations of NoSQL
SQL supports complex queries because it is a very expressive, mature language. Complex SQL queries have long been commonplace in business intelligence (BI). And when systems such as Hadoop and Hive arrived, it married complex queries with big data for the first time. Hive implemented an SQL layer on Hadoop’s native MapReduce programming paradigm. The tradeoff of these first-generation SQL-based big data systems was that they boosted data processing throughput at the expense of higher query latency. As a result, the use cases remained firmly in batch mode.
That changed when NoSQL databases such as key-value and document stores came on the scene. The design goal was low latency and scale. Now companies could take a massive data set, organize it into simple pairs of key values or documents and instantly perform lookups and other simple queries. The designers of these massive, scalable key-value stores or document databases decided that scale and speed were possible only if the queries were simple in nature. Looking up a value in a key-value store could be made lightning fast. By contrast, a SQL query, due to the inherent complexity of filters, sorts and aggregations, would be too technically challenging to execute fast on large amounts of data, they decided.
Pay No Attention to That Man Behind the Curtain
Unfortunately, due to the above, NoSQL databases tend to run into problems when queries are complex, nested and must return precise answers. This is intentionally not their forte. Their query languages, whether SQL-like variants such as CQL (Cassandra) and Druid SQL or wholly custom languages such as MQL (MongoDB), poorly support joins and other complex query commands that are standard to SQL, if they support them at all.
Vendors of NoSQL databases are like the Wizard of Oz, distracting you with smoke and mirrors and talking up narrow definitions of speed so you don’t notice the actual weaknesses of NoSQL databases when it comes to real-time analytics. Developers working with NoSQL databases end up being forced to embed joins and other data logic in their own application code — everything from fetching data from separate tables to doing the join optimizations and other analytical jobs.
While taking the NoSQL road is possible, it’s cumbersome and slow. Take an individual applying for a mortgage. To analyze their creditworthiness, you would create a data application that crunches data, such as the person’s credit history, outstanding loans and repayment history. To do so, you would need to combine several tables of data, some of which might be normalized, some of which are not. You might also analyze current and historical mortgage rates to determine what rate to offer.
With SQL, you could simply join tables of credit histories and loan payments together and aggregate large-scale historic data sets, such as daily mortgage rates. However, using something like Python or Java to manually recreate the joins and aggregations would multiply the lines of code in your application by tens or even a hundred compared to SQL.
More application code not only takes more time to create, but it almost always results in slower queries. Without access to a SQL-based query optimizer, accelerating queries is difficult and time-consuming because there is no demarcation between the business logic in the application and the query-based data access paths used by the application. Something as common as an intermediate join table, which SQL can handle efficiently and elegantly, can become a bloated memory hog in other languages.
Finally, a query written in application code is also more fragile, requiring constant maintenance and testing, and possible rewrites if data volumes change. And most developers lack the time and expertise to perform this constant maintenance.
There is only one NoSQL system I would consider reasonably competent at complex queries: GraphQL. GraphQL systems can associate data types with specific data fields, and provide functions to retrieve selected fields of a document. Its query API supports complex operations, such as filtering documents based on a set of matching fields and selectively returning a subset of fields from matching documents. GraphQL’s main analytics shortcoming is its lack of expressive power to join two disparate datasets based on the value of specific fields in those two datasets. Most analytical queries need this ability to join multiple data sources at query time.
Choosing the Best Tool for the Job – SQL
In technology as in life, every job has a tool that is best designed for it. For complex analytical queries, SQL is unquestionably the best tool. SQL has a rich set of powerful commands developed over half a century. It is easy to create queries, and even easier to tune and optimize them in order to accelerate results, shrink intermediate tables and reduce query costs.
There are some myths about SQL databases, but they are based on legacy relational systems from the 1990s. The truth is that modern cloud native SQL databases support all of the key features necessary for real-time analytics, including:
- Mutable data for incredibly fast data ingestion and smooth handling of late-arriving events.
- Flexible schemas that can adjust automatically based on the structure of the incoming streaming data.
- Instant scaleup of data writes or queries to handle bursts of data.
SQL remains incredibly popular, ranking among the most in-demand of all programming languages. As we’ve seen, it supports complex queries, which are a requirement for modern, real-time data analytics. By contrast, NoSQL databases are weak in executing joins and other complex query commands. Plus, finding an expert in a lesser-known custom query language can be time-consuming and expensive.
The bottom line is that you’ll have no problem finding skilled data engineers and data ops folks who know SQL and its capabilities with complex queries. And they’ll be able to put that knowledge and power to use, propelling your organization’s leap from batch to real-time analytics.
Dhruba Borthakur is CTO and co-founder of Rockset and is responsible for the company’s technical direction. He was an engineer on the database team at Facebook, where he was the founding engineer of the RocksDB data store. Earlier at Yahoo, he was one of the founding engineers of the Hadoop Distributed File System. He was also a contributor to the open source Apache HBase project.