
LinkedIn Implements New Data Trigger Solution to Reduce Resource Usage For Data Lakes


With its vast user base and the numerous interactions that occur on the platform, LinkedIn generates an enormous amount of data every day. These billions of data points fuel various applications, from ranking to search, and the addition of AI features has brought further complexity to the platform.

LinkedIn relies on a massive data lake architecture to handle this volume of data, enabling efficient access to the datasets the platform generates. Managing and utilizing an architecture at this scale, however, remains a significant infrastructure challenge.

Data pipelines have helped LinkedIn address these challenges to some extent. These pipelines continually consume data from different sources and transform it for analytics, and their timely execution is critical for extracting meaningful insights from the platform's data.

To further improve this process, LinkedIn has announced LakeChime, a unified data trigger solution designed to streamline data management within the data lake. Powered by an RDBMS backend, LakeChime can handle data triggers at the scale of very large data lakes such as LinkedIn's.

While metadata provides the essential information about the data stored in the data lake, data triggers respond to changes in that metadata by signaling that new data is available for processing.
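As a rough sketch of the concept, a data trigger amounts to watching table metadata and kicking off downstream work when it changes. The Python below is purely illustrative; the names (MetadataStore, run_pipeline) are assumptions made for this example, not LakeChime's actual interface.

```python
# Hypothetical sketch of a data trigger: poll table metadata and fire
# downstream processing when new data lands. Not LakeChime's API.
import time


class MetadataStore:
    """Stand-in for the data lake's metadata service."""

    def latest_version(self, table: str) -> int:
        raise NotImplementedError


def run_pipeline(table: str, version: int) -> None:
    print(f"Processing {table} at version {version}")


def watch(store: MetadataStore, table: str, interval_s: int = 60) -> None:
    seen = store.latest_version(table)
    while True:
        current = store.latest_version(table)
        if current > seen:  # metadata changed: new data is available
            run_pipeline(table, current)
            seen = current
        time.sleep(interval_s)
```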

The table formats used in a data lake play a key role in determining data trigger primitives and semantics. Until recently, the Apache Hive table format was the most popular choice for data lakes, but its limitations, namely the risk of partial data consumption and its coarse granularity, have made it less popular.

Data lakes have since evolved toward modern table formats such as Apache Iceberg, Delta Lake, and Apache Hudi. Significant challenges remain, however, including how to handle the scale, latency, and throughput of metadata in these modern formats.

There is also the question of how to migrate a data lake that relies on Hive partition semantics for its data triggers, and how to present data triggers as an abstraction to the user.

LinkedIn aims to solve some of these key challenges with LakeChime. It offers full backward compatibility with Hive by supporting partition triggers for all data types, while its snapshot trigger semantics provide forward compatibility with modern table formats.


The added benefit of snapshot triggers is that they offer a significant upgrade in UX compared to traditional partition triggers by enabling both low-latency computation and the ability to catch up on late data arrivals.
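To make that distinction concrete, the sketch below contrasts the two trigger styles; the type names and fields are assumptions made for this example rather than LakeChime's real data model.

```python
# Illustrative trigger events highlighting the difference in granularity.
from dataclasses import dataclass


@dataclass
class PartitionTrigger:
    table: str
    # e.g. "datepartition=2024-05-01"; fires once for the whole
    # partition, so late rows added to it afterwards go unnoticed
    partition: str


@dataclass
class SnapshotTrigger:
    table: str
    # fires per commit: late arrivals create a fresh snapshot range
    # that can simply be replayed
    start_snapshot: int  # exclusive lower bound of the new data
    end_snapshot: int


def catch_up(backlog: list[SnapshotTrigger]) -> SnapshotTrigger:
    """Coalesce a backlog of snapshot triggers into one range to replay."""
    return SnapshotTrigger(
        table=backlog[0].table,
        start_snapshot=min(t.start_snapshot for t in backlog),
        end_snapshot=max(t.end_snapshot for t in backlog),
    )
```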

LakeChime is built to facilitate the migration of data lakes from the Hive table format to more modern formats. Another key feature of LakeChime is its support for incremental computation at scale, bridging the gap between batch and stream processing and providing a gateway to more efficient compute workflows.
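LinkedIn has not published the consumer-side code, but reading only the rows committed between two snapshots is a documented capability of Apache Iceberg's Spark integration, so a snapshot trigger could drive a job along these lines. The table name and snapshot IDs are placeholders, and a Spark session with the Iceberg runtime configured is assumed.

```python
# Sketch: consume only the increment between two Iceberg snapshots,
# as a snapshot trigger might direct. Names and IDs are placeholders.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("incremental-read").getOrCreate()

increment = (
    spark.read.format("iceberg")
    .option("start-snapshot-id", "1234567890123456789")  # from the trigger
    .option("end-snapshot-id", "1234567890123456999")
    .load("warehouse.db.page_views")
)

# Process just the delta instead of recomputing over the full table.
increment.groupBy("member_id").count().show()
```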

The launch of LakeChime represents significant progress in addressing some of the key issues in handling large-scale data triggers. The roadmap LinkedIn has shared indicates that the next move will be to integrate LakeChime with Coral and dbt, simplifying the process for developers and boosting efficiency in data processing. Users will no longer need to figure out incremental processing logic themselves: they can simply express their logic in batch semantics, and the integration will handle the transformation and execution of that logic.
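As a hedged illustration of that idea: a user would write only a plain batch query like the one below (the table and columns are invented for this example), and the planned Coral/dbt layer, rather than the user, would be responsible for deriving the incremental execution over each new snapshot range.

```python
# Illustrative only: logic expressed in batch semantics over the full
# table. Under the planned integration, deriving and scheduling the
# incremental equivalent would be handled automatically.
BATCH_QUERY = """
SELECT member_id, COUNT(*) AS views
FROM warehouse.db.page_views
GROUP BY member_id
"""
```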

Related Items 

The Data Lakehouse Is On the Horizon, But It’s Not Smooth Sailing Yet

Data Engineering in 2024: Predictions For Data Lakes and The Serving Layer

5 Key Differences Between a Data Lake vs Data Warehouse