May 23, 2024

How We Built a Generic Kafka Materializer Using Delta Live Tables

Author: Rayan Alatab - Senior Data Engineer
At EarnIn, we leverage data to revolutionize earned wage access and to build the best financial intelligence platform possible. With those goals in mind, EarnIn’s Data Engineering team has been supporting a company-wide initiative to design event-driven services following industry standards; in particular, the team adheres to the CloudEvents specification.
As more and more services produce events, we need to democratize those events in our data lake for data processing, analytics, and machine learning. We wanted to build a self-serve stream-materialization solution that EarnIn engineers could easily work with.
EarnIn’s existing infrastructure uses Databricks, so adopting DLT (Delta Live Tables) was a natural decision. However, DLT does not provide a built-in integration with a schema registry, which made this a perfect opportunity to build a robust materialization capability that seamlessly integrates DLT with a schema registry.

High-Level Design

The overall materializer flow is the following:
  1. Fetch the latest schema from the Confluent schema registry
  2. Consume the events from Kafka
  3. Deserialize the payload to a Spark DataFrame
  4. Save the newly consumed data in the lake and register it in Unity Catalog
One of our requirements was to support both JSON- and Protobuf-formatted events.
Deserializing the Kafka payload to a Spark schema was straightforward with Protobuf but proved challenging with JSON. While many solutions for JSON schemas require users to hardcode a Spark schema for deserialization, we aimed for a completely generic approach, transparent to engineers and relying solely on the schema stored in the registry.
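
For context, the hardcoded approach we wanted to avoid looks something like the sketch below, where every topic needs its own hand-maintained Spark schema (the topic and fields shown are hypothetical, not one of our actual schemas):

from pyspark.sql.types import LongType, StringType, StructField, StructType

# A hand-written schema for a single hypothetical "orders" topic; maintaining
# one of these per topic does not scale and drifts from the registry over time
order_schema = StructType([
    StructField("id", StringType(), nullable=False),
    StructField("amount", LongType(), nullable=True),
    StructField("status", StringType(), nullable=True),
])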

The flow of a JSON schema topic

We fetch the schemas from the schema registry using the Confluent Python SDK:

from confluent_kafka.schema_registry import SchemaRegistryClient

# Configure the client with the registry URL and basic-auth credentials
schema_registry_conf = {
    'url': "<schema_registry_url>",
    'basic.auth.user.info': "<schema_registry_api_key>:<schema_registry_api_secret>"
}

schema_registry_client = SchemaRegistryClient(schema_registry_conf)

# get_latest_version() takes the subject name registered for the topic
json_schema = schema_registry_client.get_latest_version(topic_name).schema.schema_str

Fetch the events from Kafka

# self.spark is the active SparkSession inside our materializer class
kafka_df = self.spark.readStream.format("kafka") \
    .option("subscribe", topic_name) \
    .option("includeHeaders", "true") \
    .option("kafka.group.id", "<consumer_group_id>") \
    .option("kafka.bootstrap.servers", "<bootstrap_servers>") \
    .option("startingOffsets", "earliest") \
    .load()

Convert JSON schema to Spark schema

# JsonToSparkSchemaConverter comes from our internal conversion library
# (see the GitHub repo once it is released)
converter = JsonToSparkSchemaConverter()

spark_schema = converter.convert(json_schema)
As mentioned previously, there was no generic solution for converting a JSON schema to a Spark schema for deserialization. EarnIn built a small library to do this conversion, which we plan to release as an open-source project! Once we do, this post will be updated with the repo link.
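
To illustrate the idea, here is a minimal sketch of what such a conversion can look like: it walks the parsed JSON schema recursively and maps objects, arrays, and primitive types to their Spark counterparts. The actual library handles many more cases (formats, references, nested unions, and so on), so treat the function below as illustrative only.

import json
from pyspark.sql.types import (
    ArrayType, BooleanType, DataType, DoubleType,
    LongType, StringType, StructField, StructType,
)

# Mapping from basic JSON schema primitive types to Spark types
_PRIMITIVES = {
    "string": StringType(),
    "integer": LongType(),
    "number": DoubleType(),
    "boolean": BooleanType(),
}

def json_schema_to_spark(node: dict) -> DataType:
    node_type = node.get("type")
    if node_type == "object":
        # Fields absent from "required" become nullable struct fields
        required = set(node.get("required", []))
        return StructType([
            StructField(name, json_schema_to_spark(child), nullable=name not in required)
            for name, child in node.get("properties", {}).items()
        ])
    if node_type == "array":
        return ArrayType(json_schema_to_spark(node["items"]))
    return _PRIMITIVES.get(node_type, StringType())

# Usage: parse the schema string fetched from the registry, then convert it
spark_schema = json_schema_to_spark(json.loads(json_schema))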

Deserialize JSON payload to a Spark Struct

from pyspark.sql.functions import col, from_json

# Cast the raw Kafka value bytes to a string and parse with the derived schema
spark_df = kafka_df.withColumn("payload", from_json(col("value").cast("string"), spark_schema))

The flow of a Protobuf schema topic

For Protobuf topics, Spark has a built-in function, from_protobuf, that deserializes a binary column in Protobuf format to a Spark struct.
from pyspark.sql.functions import col
from pyspark.sql.protobuf.functions import from_protobuf

schema_registry_options = {
    "schema.registry.subject": "<subject_name>",
    "schema.registry.address": "<registry_address>",
    "confluent.schema.registry.basic.auth.credentials.source": "USER_INFO",
    "confluent.schema.registry.basic.auth.user.info": "{}:{}".format(
        "<schema_registry_api_key>", "<schema_registry_api_secret>"
    ),
}

# from_protobuf returns a column, so attach it to the Kafka DataFrame
spark_df = kafka_df.withColumn(
    "payload", from_protobuf(col("value"), options=schema_registry_options)
)

Materialization

For the materialization itself, we save the data in two configurable forms, an event log and a snapshot, by creating DLT pipelines in Databricks. The event log contains all historical data, while the snapshot table represents an aggregated view of the events and their latest state. The DLT tables are then registered with Unity Catalog and become available in the lake.
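
As a rough illustration, here is a sketch of the two outputs using the DLT Python API. The table names, entity key, and sequencing column below are hypothetical; the real materializer derives them from the topic configuration.

import dlt

# Illustrative sketch only: names and keys are not our actual definitions

@dlt.table(name="orders_event_log", comment="Append-only log of all historical events")
def event_log():
    # spark_df is the deserialized stream built in the previous steps
    return spark_df.select("payload.*", "timestamp")

# The snapshot keeps only the latest state per entity: apply_changes upserts
# incoming events into the target table, ordered by the Kafka record timestamp
dlt.create_streaming_table("orders_snapshot")

dlt.apply_changes(
    target="orders_snapshot",
    source="orders_event_log",
    keys=["id"],              # hypothetical entity key from the payload
    sequence_by="timestamp",
)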

Summary

The democratization of data in the lake empowers EarnIn to make well-informed decisions, thereby enhancing our ability to serve our community members. By integrating Delta Live Tables with the Confluent Schema Registry, we've established a self-serve, resilient materialization capability that gives us better control over data freshness in the lake.
Interested in being a part of a collaborative engineering culture? Come join us at EarnIn.
