How we Built a Generic Kafka Materializer Using Delta Live Tables

Author: Rayan Alatab - Senior Data Engineer

At EarnIn we leverage data to revolutionize the provision of earned wage access services and to build the best financial intelligence platform possible. With those end goals in mind, EarnIn’s Data Engineering team has been supporting the company wide initiative to design event driven services following industry standards. In particular, the team also adheres to the Cloud Events specification.

With the increased move to events being produced by services, comes the need to democratize the events in our data lake for data processing, analytics, and machine learning. We wanted to build a self-serve solution for stream materialization that EarnIn engineers could easily work with.

EarnIn’s existing infrastructure uses DataBricks. So the decision to adopt DLT (Delta Live Tables) was a natural one. However, the DLT product does not provide a built-in integration with a schema registry. This was a perfect opportunity to build a robust materialization capability that seamlessly integrates with DLT and a schema registry.

High-Level Design

The overall materializer flow is the following:

Fetch the latest schema from the Confluent schema registry
Consume the events from Kafka
Deserialize to a Spark Dataframe
Save the newly consumed data in the lake and register it in Unity Catalog
One of our requirements was to support both JSON and Protobuf formatted events.

Deserializing the Kafka payload to a Spark Schema was straightforward with Protobuf but proved to be challenging with JSON. While many solutions for JSON schemas require users to hardcode a Spark schema for deserialization, we aimed for a completely generic approach that would be transparent to engineers, relying solely on the schema stored in the registry.

The flow of a JSON schema topic

Fetching the schemas from the schema registry uses the Confluent Python SDK

from confluent_kafka.schema_registry import SchemaRegistryClient, RegisteredSchema

schema_registry_conf = {
   'url': "<schema_registry_url>",
   'basic.auth.user.info': "<schema_registry_api_key>:<schema_registry_api_secret>"
}

schema_registry_client = SchemaRegistryClient(schema_registry_conf)

json_schema = schema_registry_client.get_latest_version(topic_name).schema.schema_str

Fetch the events from Kafka

kafka_df = self.spark.readStream.format("kafka") \
   .option("subscribe", topic_name) \
   .option("includeHeaders", "true") \
   .option("kafka.group.id", <consumer_group_id>) \
   .option("kafka.bootstrap.servers", <bootstrap_servers>) \
   .option("startingOffsets", "earliest") \
   .load()

Convert Json Schema to Spark Schema

converter = JsonToSparkSchemaConverter() (check GitHub repo)

spark_schema = converter.convert(schema_str)

As mentioned previously, there was no generic solution to convert a JSON schema to Spark schema for deserialization. EarnIn has built a small library to convert JSON schemas to a Spark Schema which we plan to release as an open source project! Once we do, this post will be updated with the new repo link.

Deserialize JSON payload to a Spark Struct

spark_df = kafka_df.withColumn("payload", from_json(col("value").cast("string"), spark_schema))

The flow of a Protobuf schema topic

For Protobuf topics, Spark has a built-in method that deserializes a binary column of Protobuf format to a Spark struct.

schema_registry_options = {"schema.registry.subject":<subject_name>,
          "schema.registry.address":< registry_address>,
          "confluent.schema.registry.basic.auth.credentials.source": "USER_INFO",
          "confluent.schema.registry.basic.auth.user.info":    '{}:{}'.format(<schema_registry_api_key>, <schema_registry_api_secret>)
}

spark_df = from_protobuf(col("value"), options=schema_registry_options)

Materialization

For the materialization itself, we save the data in two configurable forms, an event log and a snapshot, by creating DLT pipelines in Databricks. The event log consists of all historical data while the snapshot table represents the aggregated view of the events and the latest state. The DLT tables are then registered with Unity Catalog and are available in the lake.

Summary

The democratization of data in the lake empowers EarnIn to make well-informed decisions, thereby enhancing our ability to better serve our community members. By integrating Delta Live Tables with the Confluent Schema Registry, we've established a self-serve, resilient materializer capability that allows us to have better control over data freshness in the lake.

Interested in being a part of a collaborative engineering culture? Come join us at EarnIn.

Author: Rayan Alatab - Senior Data Engineer

How we Built a Generic Kafka Materializer Using Delta Live Tables

How we Built a Generic Kafka Materializer Using Delta Live Tables

High-Level Design

The flow of a JSON schema topic

Fetching the schemas from the schema registry uses the Confluent Python SDK

Fetch the events from Kafka

Convert Json Schema to Spark Schema

Deserialize JSON payload to a Spark Struct

The flow of a Protobuf schema topic

Materialization

Summary

Popular articles

10 Best Cash Advance & Early Pay Apps for 2025

Best Ways to Borrow Money: Fastest & Cheapest

How To Make 100 Dollars a Day in 2025: Quick Start Guide

What is a Cash Advance And How Does It Work?

Popular articles

10 Best Cash Advance & Early Pay Apps for 2025

Best Ways to Borrow Money: Fastest & Cheapest

How To Make 100 Dollars a Day in 2025: Quick Start Guide

What is a Cash Advance And How Does It Work?