May 23, 2024

How We Built a Generic Kafka Materializer Using Delta Live Tables

Author: Rayan Alatab - Senior Data Engineer
At EarnIn, we leverage data to revolutionize earned wage access and to build the best financial intelligence platform possible. With those goals in mind, EarnIn’s Data Engineering team has been supporting a company-wide initiative to design event-driven services following industry standards; in particular, the team adheres to the CloudEvents specification.
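For context, a CloudEvents-formatted event carries a small set of required attributes (`id`, `source`, `specversion`, `type`) alongside the payload. The event below is purely illustrative, not an actual EarnIn event:

```python
import json

# Illustrative CloudEvents envelope (hypothetical payload and field values).
# Required attributes per the CloudEvents spec: id, source, specversion, type.
event = {
    "specversion": "1.0",
    "id": "b7c0ff1e-0001",
    "source": "/payments/service",
    "type": "com.example.payment.settled",
    "time": "2024-05-23T00:00:00Z",
    "datacontenttype": "application/json",
    "data": {"payment_id": "p-123", "amount": 25.0},
}

# Events are serialized (here as JSON) before being produced to Kafka
serialized = json.dumps(event)
```

Standardizing on this envelope is what makes a generic materializer feasible: every topic carries the same metadata shape, and only the `data` payload's schema varies per topic.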
As more services produce events, the need grows to democratize those events in our data lake for data processing, analytics, and machine learning. We wanted to build a self-serve stream-materialization solution that EarnIn engineers could easily work with.
EarnIn’s existing infrastructure uses Databricks, so the decision to adopt DLT (Delta Live Tables) was a natural one. However, DLT does not provide a built-in integration with a schema registry. This was a perfect opportunity to build a robust materialization capability that seamlessly integrates DLT with a schema registry.

High-Level Design

The overall materializer flow is the following:
  1. Fetch the latest schema from the Confluent Schema Registry
  2. Consume the events from Kafka
  3. Deserialize the payload into a Spark DataFrame
  4. Save the newly consumed data in the lake and register it in Unity Catalog
One of our requirements was to support both JSON- and Protobuf-formatted events.
Deserializing the Kafka payload to a Spark schema was straightforward with Protobuf but proved challenging with JSON. While many solutions for JSON schemas require users to hardcode a Spark schema for deserialization, we aimed for a completely generic approach, transparent to engineers and relying solely on the schema stored in the registry.

The flow of a JSON schema topic

Fetching schemas from the schema registry uses the Confluent Python SDK:

from confluent_kafka.schema_registry import SchemaRegistryClient

schema_registry_conf = {
    "url": "<schema_registry_url>",
    "basic.auth.user.info": "<schema_registry_api_key>:<schema_registry_api_secret>",
}

schema_registry_client = SchemaRegistryClient(schema_registry_conf)

# Fetch the latest registered schema for the topic's subject
# (under the default TopicNameStrategy the subject is typically "<topic>-value")
json_schema = schema_registry_client.get_latest_version(topic_name).schema.schema_str

Fetch the events from Kafka

kafka_df = (
    spark.readStream.format("kafka")
    .option("subscribe", topic_name)
    .option("includeHeaders", "true")
    .option("kafka.group.id", "<consumer_group_id>")
    .option("kafka.bootstrap.servers", "<bootstrap_servers>")
    .option("startingOffsets", "earliest")
    .load()
)

Convert the JSON schema to a Spark schema

# EarnIn's JSON-schema-to-Spark-schema converter (open-source release planned)
converter = JsonToSparkSchemaConverter()

spark_schema = converter.convert(json_schema)
As mentioned previously, there was no generic solution for converting a JSON schema to a Spark schema for deserialization. EarnIn built a small library for this conversion, which we plan to release as an open-source project! Once we do, this post will be updated with the new repo link.
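While the library itself is not yet public, the core idea can be sketched in a few lines: walk the JSON Schema recursively and emit a Spark DDL type string, which `from_json` accepts in place of a `StructType`. This is an illustrative sketch under simplifying assumptions, not EarnIn's implementation; `TYPE_MAP` and `json_schema_to_ddl` are hypothetical names, and a real converter must also handle nullability, string formats, unions, and references:

```python
import json

# Map primitive JSON Schema types to Spark SQL types (hypothetical, minimal)
TYPE_MAP = {"string": "string", "integer": "bigint", "number": "double", "boolean": "boolean"}

def json_schema_to_ddl(schema: dict) -> str:
    """Recursively convert a (simplified) JSON Schema dict to a Spark DDL type string."""
    if schema.get("type") == "object":
        fields = ",".join(
            f"{name}:{json_schema_to_ddl(sub)}"
            for name, sub in schema.get("properties", {}).items()
        )
        return f"struct<{fields}>"
    if schema.get("type") == "array":
        return f"array<{json_schema_to_ddl(schema['items'])}>"
    return TYPE_MAP[schema["type"]]

# Example: a schema string as it might come back from the registry
schema_str = json.dumps({
    "type": "object",
    "properties": {"id": {"type": "string"}, "amount": {"type": "number"}},
})
ddl = json_schema_to_ddl(json.loads(schema_str))  # "struct<id:string,amount:double>"
```

Emitting DDL strings rather than building `StructType` objects keeps the converter free of any Spark dependency, since Spark parses the DDL form directly.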

Deserialize the JSON payload to a Spark struct

from pyspark.sql.functions import col, from_json

spark_df = kafka_df.withColumn("payload", from_json(col("value").cast("string"), spark_schema))

The flow of a Protobuf schema topic

For Protobuf topics, Spark on Databricks has a built-in function, from_protobuf, that deserializes a binary column of Protobuf data to a Spark struct using the schema registry directly.

from pyspark.sql.functions import col
from pyspark.sql.protobuf.functions import from_protobuf

schema_registry_options = {
    "schema.registry.subject": "<subject_name>",
    "schema.registry.address": "<registry_address>",
    "confluent.schema.registry.basic.auth.credentials.source": "USER_INFO",
    "confluent.schema.registry.basic.auth.user.info":
        "{}:{}".format("<schema_registry_api_key>", "<schema_registry_api_secret>"),
}

# from_protobuf returns a Column, so apply it within withColumn/select
spark_df = kafka_df.withColumn(
    "payload", from_protobuf(col("value"), options=schema_registry_options)
)

Materialization

For the materialization itself, we save the data in two configurable forms, an event log and a snapshot, by creating DLT pipelines in Databricks. The event log consists of all historical data, while the snapshot table represents the aggregated view of the events and the latest state. The DLT tables are then registered with Unity Catalog and are available in the lake.
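Independent of DLT, the snapshot's "latest state" semantics can be illustrated in plain Python: from a full event log, keep only the most recent event per entity key. The function and field names (`latest_state`, `id`, `updated_at`, `status`) are hypothetical, chosen for the sketch:

```python
# Hypothetical sketch of snapshot semantics: reduce an event log
# to one record per key, keeping the event with the latest timestamp.
def latest_state(events, key="id", ts="updated_at"):
    snapshot = {}
    for event in events:  # events may arrive in any order
        k = event[key]
        if k not in snapshot or event[ts] > snapshot[k][ts]:
            snapshot[k] = event
    return list(snapshot.values())

event_log = [
    {"id": "a", "updated_at": 1, "status": "pending"},
    {"id": "a", "updated_at": 2, "status": "settled"},
    {"id": "b", "updated_at": 1, "status": "pending"},
]
snapshot = latest_state(event_log)  # one row per id, reflecting the latest state
```

In the actual pipelines this reduction would run in Spark (for example, as a window-function dedupe over the event-log table), with the event log preserving full history for replay and audit.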

Summary

The democratization of data in the lake empowers EarnIn to make well-informed decisions, thereby enhancing our ability to better serve our community members. By integrating Delta Live Tables with the Confluent Schema Registry, we've established a self-serve, resilient materializer capability that allows us to have better control over data freshness in the lake.
Interested in being a part of a collaborative engineering culture? Come join us at EarnIn.
