Real-Time Lakehouse on AWS: Kafka → Glue → Redshift Serverless in Under 4 Hours
A retail client was making inventory decisions on 12-hour-old data. Stockouts and overstocks were costing an estimated $2M per year. Their overnight batch ETL pipeline — solid, reliable, and fundamentally broken for the business — needed to become a real-time system. We built it in four hours using MSK Kafka, AWS Glue 4.0 streaming, Apache Iceberg on S3, and Redshift Serverless. Here's the complete playbook.
The Business Problem: 12 Hours Too Late
Our client runs 340 retail stores across South and Southeast Asia, plus an e-commerce channel processing ~1.1 billion events per day: POS transactions, inventory scans, web clickstreams, and IoT stock-level sensors. All of this data fed an overnight batch ETL pipeline. Every morning at 7am, the analytics team received "fresh" data that was between 8 and 18 hours old depending on the previous night's processing run.
Inventory planners were making replenishment decisions based on yesterday's stock levels. Regional managers were looking at morning dashboards to understand what happened the previous afternoon. The e-commerce merchandising team was adjusting promotions based on data that could be a full day stale during peak campaign periods.
The financial impact was quantified in an internal audit: $2.1M per year in attributable losses from stockouts and overstocks — situations where real-time data would have triggered a replenishment or markdown decision hours earlier. The cost of fixing this was a fraction of that number.
The constraint: the existing Redshift cluster and Glue ETL jobs couldn't be decommissioned overnight. The new real-time pipeline needed to run alongside the batch system during a parallel validation phase. That meant the new architecture had to be additive, not a replacement — at least initially.
Architecture Overview
The complete data flow from event source to dashboard looks like this:
┌─────────────────────────────────────────────────────────────────────────────────┐
│ Event Producers │
│ │
│ ┌────────────────┐ ┌──────────────────┐ ┌──────────────────────────────┐ │
│ │ IoT Sensors │ │ POS Systems │ │ Web / App Events │ │
│ │ (MQTT → SNS) │ │ (Kinesis Agent) │ │ (Kafka Producer SDK) │ │
│ └───────┬────────┘ └────────┬─────────┘ └──────────────┬───────────────┘ │
└──────────┼────────────────────┼──────────────────────────┼────────────────────┘
└────────────────────┴──────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────────────┐
│ Amazon MSK (Managed Streaming for Apache Kafka) │
│ 3 brokers · m5.large · 3 topics: inventory, transactions, clickstream │
│ Retention: 7 days · Replication factor: 3 · Auto-create topics: OFF │
└───────────────────────────────────────┬─────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────────────┐
│ AWS Glue 4.0 Streaming ETL Jobs (1 per topic, 3 total) │
│ DynamicFrame · checkpoints → S3 · micro-batch 60s · write Iceberg to S3 │
└───────────────────────────────────────┬─────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────────────┐
│ S3 Data Lake — Apache Iceberg Tables │
│ s3://retail-lakehouse-prod/iceberg/ │
│ Partitioned by: event_date / event_hour │
│ Schema registered in AWS Glue Data Catalog │
└───────────────────────────────────────┬─────────────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────────────────────────────┐
│ Redshift Serverless (Namespace: retail-prod) │
│ Workgroup: analytics-wg · External schema → Glue Data Catalog │
│ Materialized views refreshed every 5 min for dashboard queries │
└───────────┬─────────────────────────────────────────┬───────────────────────────┘
│ │
▼ ▼
┌──────────────────────┐ ┌────────────────────────┐
│ Grafana (JDBC) │ │ Amazon QuickSight │
│ Ops Dashboards │ │ Executive Reporting │
└──────────────────────┘ └────────────────────────┘
Each of the three Kafka topics maps to a Glue streaming job that writes Iceberg format to S3. Redshift Serverless reads these as external tables via the Glue Data Catalog — no data movement, no ETL into Redshift storage. Query latency benefits from Iceberg's file-level statistics and partition pruning, which means even full 90-day history queries stay fast.
Step 1 (30 min): MSK Cluster Setup
Provision an MSK cluster with three brokers using kafka.m5.large instances. We
chose m5.large rather than the smaller t3.small because the
producer throughput on inventory events peaks at ~120MB/s during store opening hours — the
t3.small throttles at ~20MB/s. Three brokers gives us a replication factor of
3 with no performance degradation.
Critical configuration decisions: disable auto.create.topics.enable (topics must
be explicitly created by the platform team — no accidental topic sprawl), set
log.retention.hours=168 (7 days, matching business recovery window), and configure
IAM authentication rather than SASL/SCRAM to stay in the AWS-native auth model.
For local development and schema validation, deploy the open-source Kafka UI container into the same VPC on an ECS Fargate task. This gives the data engineering team a browser-accessible console to inspect topic metadata, consumer group lag, and message schemas without granting direct MSK access.
Step 2 (45 min): AWS Glue 4.0 Streaming ETL Job
Create one Glue 4.0 streaming job per Kafka topic. The job reads from MSK, applies a schema, cleans and enriches the DynamicFrame, and writes to S3 in Iceberg format with a 60-second micro-batch trigger. The key parameters that matter for production throughput:
--enable-continuous-cloudwatch-log=true— essential for debugging streaming jobs--job-bookmark-option=job-bookmark-enable— enables checkpointingcheckpoint_location— must point to S3 (see gotchas section)write.target-file-size-bytes=134217728— prevents small file explosion (see gotchas)
#!/usr/bin/env python3
# AWS Glue 4.0 Streaming ETL Job — inventory events → S3 Iceberg
# File: glue_inventory_streaming.py
import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue import DynamicFrame
from pyspark.sql.functions import col, from_json, to_date, hour, current_timestamp
from pyspark.sql.types import StructType, StructField, StringType, LongType, DoubleType
args = getResolvedOptions(sys.argv, ['JOB_NAME', 'kafka_bootstrap', 'output_path', 'checkpoint_path'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)
# Configure Iceberg catalog — Glue Data Catalog as the metastore
spark.conf.set("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
spark.conf.set("spark.sql.catalog.glue_catalog", "org.apache.iceberg.spark.SparkCatalog")
spark.conf.set("spark.sql.catalog.glue_catalog.warehouse", args['output_path'])
spark.conf.set("spark.sql.catalog.glue_catalog.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog")
spark.conf.set("spark.sql.catalog.glue_catalog.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")
# Inventory event schema
INVENTORY_SCHEMA = StructType([
StructField("store_id", StringType(), False),
StructField("sku", StringType(), False),
StructField("quantity_on_hand", LongType(), True),
StructField("quantity_reserved", LongType(), True),
StructField("unit_cost", DoubleType(), True),
StructField("event_timestamp", LongType(), False), # epoch ms
StructField("event_type", StringType(), False),
])
# Read from MSK — auto offset reset to latest in production
kafka_df = spark.readStream \
.format("kafka") \
.option("kafka.bootstrap.servers", args['kafka_bootstrap']) \
.option("kafka.security.protocol", "SASL_SSL") \
.option("kafka.sasl.mechanism", "AWS_MSK_IAM") \
.option("kafka.sasl.jaas.config", "software.amazon.msk.auth.iam.IAMLoginModule required;") \
.option("subscribe", "retail.inventory.events") \
.option("startingOffsets", "latest") \
.option("maxOffsetsPerTrigger", 500000) \
.load()
# Parse JSON payload, add partition columns
parsed_df = kafka_df \
.select(from_json(col("value").cast("string"), INVENTORY_SCHEMA).alias("data")) \
.select("data.*") \
.withColumn("event_date", to_date(col("event_timestamp") / 1000).cast("date")) \
.withColumn("event_hour", hour((col("event_timestamp") / 1000).cast("timestamp"))) \
.withColumn("ingested_at", current_timestamp()) \
.filter(col("store_id").isNotNull() & col("sku").isNotNull())
# Write to Iceberg via foreachBatch — gives us UPSERT capability
def write_iceberg_batch(batch_df, batch_id):
batch_df.createOrReplaceTempView("inventory_batch")
spark.sql("""
MERGE INTO glue_catalog.retail_db.inventory_events t
USING inventory_batch s
ON t.store_id = s.store_id AND t.sku = s.sku AND t.event_timestamp = s.event_timestamp
WHEN MATCHED THEN UPDATE SET *
WHEN NOT MATCHED THEN INSERT *
""")
query = parsed_df.writeStream \
.outputMode("append") \
.option("checkpointLocation", args['checkpoint_path']) \
.trigger(processingTime="60 seconds") \
.foreachBatch(write_iceberg_batch) \
.start()
query.awaitTermination()
job.commit()
Step 3 (30 min): Iceberg Table Setup and Partition Strategy
The partition strategy is the most consequential decision in the entire architecture. Wrong partitioning means either too many small files (thousands of S3 PUTs per hour, degraded read performance, high S3 API costs) or too few files that are too large (poor partition pruning on time-range queries).
For 1B+ events/day, a two-level partition on event_date and event_hour
gives us 24 partition buckets per day. Each bucket receives ~42M events. With 128MB target
file sizes, that's roughly 3–5 Parquet files per partition — optimal for both write throughput
and read performance on Redshift Serverless.
Create the table in the Glue Data Catalog via Spark SQL (run as a one-time job before the streaming job starts):
-- Create the Iceberg table in Glue Data Catalog (run once)
-- Execute in a Glue interactive session or one-time Glue ETL job
CREATE DATABASE IF NOT EXISTS retail_db
LOCATION 's3://retail-lakehouse-prod/iceberg/';
CREATE TABLE IF NOT EXISTS glue_catalog.retail_db.inventory_events (
store_id STRING NOT NULL COMMENT 'Store identifier',
sku STRING NOT NULL COMMENT 'Product SKU',
quantity_on_hand BIGINT COMMENT 'Physical stock count',
quantity_reserved BIGINT COMMENT 'Units reserved for pending orders',
unit_cost DOUBLE COMMENT 'Unit cost in USD',
event_timestamp BIGINT NOT NULL COMMENT 'Event time epoch milliseconds',
event_type STRING NOT NULL COMMENT 'ADJUST | SALE | RECEIVE | TRANSFER',
ingested_at TIMESTAMP COMMENT 'Glue ingestion timestamp',
event_date DATE NOT NULL COMMENT 'Partition column: event date',
event_hour INT NOT NULL COMMENT 'Partition column: hour of day 0-23'
)
USING iceberg
PARTITIONED BY (event_date, event_hour)
TBLPROPERTIES (
'write.target-file-size-bytes' = '134217728', -- 128MB target
'write.parquet.compression-codec' = 'snappy',
'write.metadata.metrics.default' = 'truncate(16)',
'history.expire.max-snapshot-age-ms' = '604800000', -- 7 day snapshot retention
'write.merge.mode' = 'merge-on-read',
'schema.compatibility' = 'forward' -- allow new columns to be added
);
-- Register partition spec in Glue catalog (enables schema evolution)
ALTER TABLE glue_catalog.retail_db.inventory_events
SET TBLPROPERTIES ('schema.auto-update' = 'true');
Step 4 (45 min): Redshift Serverless and External Tables
Create a Redshift Serverless namespace and workgroup. Configure the workgroup with a base RPU (Redshift Processing Units) of 32 — enough for concurrent analytical queries without over- provisioning. The beauty of Serverless is that this scales automatically during peak query periods.
The external schema setup connects Redshift to the Glue Data Catalog, making the Iceberg tables directly queryable without any data ingestion into Redshift storage.
-- Step 4a: Create external schema pointing to Glue Data Catalog
CREATE EXTERNAL SCHEMA retail_iceberg
FROM DATA CATALOG
DATABASE 'retail_db'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftServerlessGlueRole'
CREATE EXTERNAL DATABASE IF NOT EXISTS;
-- Step 4b: Verify the external table is accessible
SELECT COUNT(*), MAX(event_timestamp)
FROM retail_iceberg.inventory_events
WHERE event_date = CURRENT_DATE;
-- Step 4c: Create materialized view for fast dashboard queries
-- (avoids full Iceberg scan on every dashboard refresh)
CREATE MATERIALIZED VIEW mv_inventory_realtime
AUTO REFRESH YES
AS
SELECT
i.store_id,
i.sku,
SUM(CASE WHEN i.event_type = 'RECEIVE' THEN i.quantity_on_hand ELSE 0 END) AS units_received,
SUM(CASE WHEN i.event_type = 'SALE' THEN i.quantity_on_hand ELSE 0 END) AS units_sold,
SUM(i.quantity_on_hand - i.quantity_reserved) AS net_available,
MAX(i.ingested_at) AS last_updated,
DATEDIFF(second, MAX(i.ingested_at)::TIMESTAMP, GETDATE()) AS lag_seconds
FROM retail_iceberg.inventory_events i
WHERE i.event_date >= DATEADD(day, -7, CURRENT_DATE)
GROUP BY 1, 2;
-- Step 4d: Schedule materialized view refresh every 5 minutes via Redshift Scheduler
CREATE SCHEDULE refresh_mv_inventory
AT 'cron(*/5 * * * ? *)';
CALL DBMS_SCHEDULER.CREATE_JOB(
job_name => 'mv_inventory_refresh',
job_command => 'REFRESH MATERIALIZED VIEW mv_inventory_realtime',
schedule_name => 'refresh_mv_inventory',
enabled => TRUE
);
-- Step 4e: Query for Grafana — events per minute with pipeline lag
SELECT
DATE_TRUNC('minute', ingested_at) AS minute_bucket,
COUNT(*) AS event_count,
AVG(DATEDIFF(second, (event_timestamp / 1000)::TIMESTAMP, ingested_at)) AS avg_lag_seconds
FROM retail_iceberg.inventory_events
WHERE event_date = CURRENT_DATE
AND event_hour = EXTRACT(hour FROM GETDATE())
GROUP BY 1
ORDER BY 1 DESC
LIMIT 60;
Step 5 (30 min): Grafana Dashboards
Connect Grafana to Redshift Serverless via the native Redshift datasource plugin using JDBC.
Use the workgroup endpoint URL from the Redshift Serverless console. Configure the connection
with an IAM user scoped to SELECT on the retail_iceberg schema only.
The four dashboard panels that matter most for the business:
- Events/sec (time series):
SELECT COUNT(*)/60 as eps, DATE_TRUNC('minute', ingested_at)from the materialized view — shows real-time throughput and detects producer failures - Category inventory breakdown (bar chart): joins inventory events to a product dimension table for category-level stock positions
- Geographic heatmap: store-level net_available stock mapped to a regional grid — planners can see regional stockout risk at a glance
- Pipeline lag monitor (stat panel): displays current
avg_lag_secondsfrom the materialized view — alerts Slack if lag exceeds 120 seconds
The pipeline lag panel is the most operationally important. When it spikes above 120 seconds, it means either the Glue streaming job has stalled, Kafka consumer lag is growing, or Redshift Serverless hit a cold-start cycle. This single number replaced a dozen separate CloudWatch alarms that the team was previously managing.
The 3 Gotchas Nobody Tells You
Gotcha 1: Glue Streaming Checkpoints Must Go to S3, Not EFS
Glue streaming jobs need a checkpoint location to track Kafka consumer offsets. The natural instinct is to mount an EFS file system — it looks like a local filesystem, which is what Spark's checkpoint mechanism expects. Don't. Under sustained load (we saw this at ~80K events/sec), the EFS NFS operations cause checkpoint writes to block, which causes the streaming query to stall while waiting for checkpoint confirmation, which causes memory pressure as the micro-batch buffers fill up. The result is OOM crashes that look like memory configuration problems but are actually I/O bottlenecks.
The fix is simple: set the checkpoint location to an S3 path and add
--conf spark.hadoop.fs.s3a.fast.upload=true to the Glue job parameters.
S3 checkpointing is async and non-blocking. We never had an OOM after switching.
checkpointLocation=s3://your-bucket/checkpoints/job-name/ — this is the only
safe checkpoint location for Glue streaming jobs at scale. Set it before your first
production test and never change it (changing the checkpoint path resets consumer offsets
and causes re-processing or data gaps depending on Kafka retention config).
Gotcha 2: Redshift Serverless Cold-Start Penalty
Redshift Serverless scales to zero when idle — great for cost, terrible for first-query latency. A cold-start query (first query after an idle period) can take 25–40 seconds to return results as the compute warms up. For a Grafana dashboard that auto-refreshes every 30 seconds, this means a morning dashboard load (after an overnight quiet period) appears broken.
The fix: schedule a lightweight "keep-warm" query every 4 hours using Amazon EventBridge +
a Lambda function that executes SELECT 1 against the Redshift Serverless
workgroup. This prevents full cold starts during business hours. During overnight hours
(midnight–6am), we allow it to go cold — nobody's querying dashboards at 3am, and the
cost savings during that window are real.
Gotcha 3: Iceberg Write Amplification From Small Files
Without explicit configuration, Iceberg's default target file size is 512MB. For a streaming job writing in 60-second micro-batches at 42M events/batch, the write throughput rarely fills a 512MB file before the trigger fires. The result: hundreds of tiny 2–5MB Parquet files accumulating per partition per hour. Within a week you have hundreds of thousands of S3 objects, S3 API costs balloon, and Iceberg metadata files grow large enough to slow down planning.
Set 'write.target-file-size-bytes'='134217728' (128MB) in the table properties.
At our event size (~850 bytes/event average), a 60-second micro-batch at 700K events/min
produces ~595MB of data — roughly 4–5 files at the 128MB target. Also schedule an Iceberg
table maintenance job (using Spark in a separate Glue job running every 6 hours) to compact
any residual small files and expire old snapshots.
Performance Results: Before and After
After two weeks of parallel running (validating that real-time output matched the overnight batch within acceptable tolerance), the client cut over dashboards and inventory systems to the real-time feed. The measured outcomes against the pre-migration baseline:
(was 12 hours)
(was 45s on 90-day data)
(vs $14.5K provisioned)
events (first 60 days)
The 47-second average pipeline lag is the end-to-end number from event generation at the edge (POS terminal or IoT sensor) to the value appearing in a Grafana dashboard panel. This includes Kafka producer latency (~50ms), MSK broker replication (~200ms), Glue micro-batch processing (60s trigger + ~15s processing), S3 write (~5s), Iceberg metadata update (~2s), and Redshift materialized view refresh (every 5 minutes, so dashboard queries hit the MV which is always fresh within 5 minutes, but the raw Iceberg query is available immediately).
The $8,200/month infrastructure cost covers: MSK cluster (~$1,100/mo for 3 × m5.large), three Glue DPU-hours/streaming jobs (~$2,800/mo at ~4 DPUs each running continuously), S3 storage and API costs (~$800/mo for 30 days of Iceberg data), and Redshift Serverless base RPU + query compute (~$3,500/mo average). This compares to $14,500/month for the equivalent provisioned ra3.xlplus Redshift cluster the client had previously scoped.
When NOT to Use This Stack
This architecture has real operational complexity. It's justified at 1B+ events/day with a compelling business case for sub-minute latency. It is not justified in every situation.
- Under 1M events/day: DynamoDB Streams + Lambda + Athena is dramatically simpler. No Kafka cluster to manage, no Glue streaming jobs, and the query performance on that data volume is entirely adequate. The MSK + Glue stack adds ~$2,500/month in fixed costs before you process a single event.
- Complex ML feature store requirements: If your real-time data needs to feed ML models for online inference (not just analytics), add Feast or SageMaker Feature Store to the pipeline instead of routing everything through Redshift. Redshift is a poor feature store — low-latency point lookups are not what it's optimized for.
- Multi-region active-active replication: This architecture is single-region. If your producers are globally distributed and you need cross-region event consistency, you need Kafka MirrorMaker 2 between MSK clusters. That doubles your operational complexity and your MSK cost. Consider whether Kinesis Data Streams (which has native cross-region replication) is a better fit for that requirement.
Building a real-time data platform on AWS?
Book a free 30-minute architecture review with our data engineering team. We'll review your event volumes, latency requirements, and budget constraints — and tell you honestly whether Kafka + Glue + Redshift is the right call or whether a simpler stack fits better.
Book Free Architecture Review →