Data Engineering

Real-Time Lakehouse on AWS: Kafka → Glue → Redshift Serverless in Under 4 Hours

Vikram Joshi · Data Engineering Lead, codetoday.io April 2025 11 min read

A retail client was making inventory decisions on 12-hour-old data. Stockouts and overstocks were costing an estimated $2M per year. Their overnight batch ETL pipeline — solid, reliable, and fundamentally broken for the business — needed to become a real-time system. We built it in four hours using MSK Kafka, AWS Glue 4.0 streaming, Apache Iceberg on S3, and Redshift Serverless. Here's the complete playbook.

The Business Problem: 12 Hours Too Late

Our client runs 340 retail stores across South and Southeast Asia, plus an e-commerce channel processing ~1.1 billion events per day: POS transactions, inventory scans, web clickstreams, and IoT stock-level sensors. All of this data fed an overnight batch ETL pipeline. Every morning at 7am, the analytics team received "fresh" data that was between 8 and 18 hours old depending on the previous night's processing run.

Inventory planners were making replenishment decisions based on yesterday's stock levels. Regional managers were looking at morning dashboards to understand what happened the previous afternoon. The e-commerce merchandising team was adjusting promotions based on data that could be a full day stale during peak campaign periods.

The financial impact was quantified in an internal audit: $2.1M per year in attributable losses from stockouts and overstocks — situations where real-time data would have triggered a replenishment or markdown decision hours earlier. The cost of fixing this was a fraction of that number.

The constraint: the existing Redshift cluster and Glue ETL jobs couldn't be decommissioned overnight. The new real-time pipeline needed to run alongside the batch system during a parallel validation phase. That meant the new architecture had to be additive, not a replacement — at least initially.

Architecture Overview

The complete data flow from event source to dashboard looks like this:


┌─────────────────────────────────────────────────────────────────────────────────┐
│                              Event Producers                                    │
│                                                                                 │
│  ┌────────────────┐  ┌──────────────────┐  ┌──────────────────────────────┐    │
│  │  IoT Sensors   │  │   POS Systems    │  │   Web / App Events           │    │
│  │  (MQTT → SNS)  │  │  (Kinesis Agent) │  │   (Kafka Producer SDK)       │    │
│  └───────┬────────┘  └────────┬─────────┘  └──────────────┬───────────────┘    │
└──────────┼────────────────────┼──────────────────────────┼────────────────────┘
           └────────────────────┴──────────────────────────┘
                                          │
                                          ▼
┌─────────────────────────────────────────────────────────────────────────────────┐
│              Amazon MSK (Managed Streaming for Apache Kafka)                    │
│    3 brokers · m5.large · 3 topics: inventory, transactions, clickstream        │
│    Retention: 7 days · Replication factor: 3 · Auto-create topics: OFF         │
└───────────────────────────────────────┬─────────────────────────────────────────┘
                                        │
                                        ▼
┌─────────────────────────────────────────────────────────────────────────────────┐
│           AWS Glue 4.0 Streaming ETL Jobs (1 per topic, 3 total)                │
│  DynamicFrame · checkpoints → S3 · micro-batch 60s · write Iceberg to S3       │
└───────────────────────────────────────┬─────────────────────────────────────────┘
                                        │
                                        ▼
┌─────────────────────────────────────────────────────────────────────────────────┐
│                  S3 Data Lake — Apache Iceberg Tables                           │
│   s3://retail-lakehouse-prod/iceberg/                                           │
│   Partitioned by: event_date / event_hour                                       │
│   Schema registered in AWS Glue Data Catalog                                   │
└───────────────────────────────────────┬─────────────────────────────────────────┘
                                        │
                                        ▼
┌─────────────────────────────────────────────────────────────────────────────────┐
│               Redshift Serverless (Namespace: retail-prod)                      │
│   Workgroup: analytics-wg · External schema → Glue Data Catalog                │
│   Materialized views refreshed every 5 min for dashboard queries               │
└───────────┬─────────────────────────────────────────┬───────────────────────────┘
            │                                         │
            ▼                                         ▼
  ┌──────────────────────┐               ┌────────────────────────┐
  │  Grafana (JDBC)      │               │  Amazon QuickSight     │
  │  Ops Dashboards      │               │  Executive Reporting   │
  └──────────────────────┘               └────────────────────────┘

Each of the three Kafka topics maps to a Glue streaming job that writes Iceberg format to S3. Redshift Serverless reads these as external tables via the Glue Data Catalog — no data movement, no ETL into Redshift storage. Query latency benefits from Iceberg's file-level statistics and partition pruning, which means even full 90-day history queries stay fast.

Step 1 (30 min): MSK Cluster Setup

Provision an MSK cluster with three brokers using kafka.m5.large instances. We chose m5.large rather than the smaller t3.small because the producer throughput on inventory events peaks at ~120MB/s during store opening hours — the t3.small throttles at ~20MB/s. Three brokers gives us a replication factor of 3 with no performance degradation.

Critical configuration decisions: disable auto.create.topics.enable (topics must be explicitly created by the platform team — no accidental topic sprawl), set log.retention.hours=168 (7 days, matching business recovery window), and configure IAM authentication rather than SASL/SCRAM to stay in the AWS-native auth model.

For local development and schema validation, deploy the open-source Kafka UI container into the same VPC on an ECS Fargate task. This gives the data engineering team a browser-accessible console to inspect topic metadata, consumer group lag, and message schemas without granting direct MSK access.

Step 2 (45 min): AWS Glue 4.0 Streaming ETL Job

Create one Glue 4.0 streaming job per Kafka topic. The job reads from MSK, applies a schema, cleans and enriches the DynamicFrame, and writes to S3 in Iceberg format with a 60-second micro-batch trigger. The key parameters that matter for production throughput:

  • --enable-continuous-cloudwatch-log=true — essential for debugging streaming jobs
  • --job-bookmark-option=job-bookmark-enable — enables checkpointing
  • checkpoint_location — must point to S3 (see gotchas section)
  • write.target-file-size-bytes=134217728 — prevents small file explosion (see gotchas)
#!/usr/bin/env python3
# AWS Glue 4.0 Streaming ETL Job — inventory events → S3 Iceberg
# File: glue_inventory_streaming.py

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue import DynamicFrame
from pyspark.sql.functions import col, from_json, to_date, hour, current_timestamp
from pyspark.sql.types import StructType, StructField, StringType, LongType, DoubleType

args = getResolvedOptions(sys.argv, ['JOB_NAME', 'kafka_bootstrap', 'output_path', 'checkpoint_path'])

sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# Configure Iceberg catalog — Glue Data Catalog as the metastore
spark.conf.set("spark.sql.extensions", "org.apache.iceberg.spark.extensions.IcebergSparkSessionExtensions")
spark.conf.set("spark.sql.catalog.glue_catalog", "org.apache.iceberg.spark.SparkCatalog")
spark.conf.set("spark.sql.catalog.glue_catalog.warehouse", args['output_path'])
spark.conf.set("spark.sql.catalog.glue_catalog.catalog-impl", "org.apache.iceberg.aws.glue.GlueCatalog")
spark.conf.set("spark.sql.catalog.glue_catalog.io-impl", "org.apache.iceberg.aws.s3.S3FileIO")

# Inventory event schema
INVENTORY_SCHEMA = StructType([
    StructField("store_id", StringType(), False),
    StructField("sku", StringType(), False),
    StructField("quantity_on_hand", LongType(), True),
    StructField("quantity_reserved", LongType(), True),
    StructField("unit_cost", DoubleType(), True),
    StructField("event_timestamp", LongType(), False),  # epoch ms
    StructField("event_type", StringType(), False),
])

# Read from MSK — auto offset reset to latest in production
kafka_df = spark.readStream \
    .format("kafka") \
    .option("kafka.bootstrap.servers", args['kafka_bootstrap']) \
    .option("kafka.security.protocol", "SASL_SSL") \
    .option("kafka.sasl.mechanism", "AWS_MSK_IAM") \
    .option("kafka.sasl.jaas.config", "software.amazon.msk.auth.iam.IAMLoginModule required;") \
    .option("subscribe", "retail.inventory.events") \
    .option("startingOffsets", "latest") \
    .option("maxOffsetsPerTrigger", 500000) \
    .load()

# Parse JSON payload, add partition columns
parsed_df = kafka_df \
    .select(from_json(col("value").cast("string"), INVENTORY_SCHEMA).alias("data")) \
    .select("data.*") \
    .withColumn("event_date", to_date(col("event_timestamp") / 1000).cast("date")) \
    .withColumn("event_hour", hour((col("event_timestamp") / 1000).cast("timestamp"))) \
    .withColumn("ingested_at", current_timestamp()) \
    .filter(col("store_id").isNotNull() & col("sku").isNotNull())

# Write to Iceberg via foreachBatch — gives us UPSERT capability
def write_iceberg_batch(batch_df, batch_id):
    batch_df.createOrReplaceTempView("inventory_batch")
    spark.sql("""
        MERGE INTO glue_catalog.retail_db.inventory_events t
        USING inventory_batch s
        ON t.store_id = s.store_id AND t.sku = s.sku AND t.event_timestamp = s.event_timestamp
        WHEN MATCHED THEN UPDATE SET *
        WHEN NOT MATCHED THEN INSERT *
    """)

query = parsed_df.writeStream \
    .outputMode("append") \
    .option("checkpointLocation", args['checkpoint_path']) \
    .trigger(processingTime="60 seconds") \
    .foreachBatch(write_iceberg_batch) \
    .start()

query.awaitTermination()
job.commit()

Step 3 (30 min): Iceberg Table Setup and Partition Strategy

The partition strategy is the most consequential decision in the entire architecture. Wrong partitioning means either too many small files (thousands of S3 PUTs per hour, degraded read performance, high S3 API costs) or too few files that are too large (poor partition pruning on time-range queries).

For 1B+ events/day, a two-level partition on event_date and event_hour gives us 24 partition buckets per day. Each bucket receives ~42M events. With 128MB target file sizes, that's roughly 3–5 Parquet files per partition — optimal for both write throughput and read performance on Redshift Serverless.

Create the table in the Glue Data Catalog via Spark SQL (run as a one-time job before the streaming job starts):

-- Create the Iceberg table in Glue Data Catalog (run once)
-- Execute in a Glue interactive session or one-time Glue ETL job

CREATE DATABASE IF NOT EXISTS retail_db
LOCATION 's3://retail-lakehouse-prod/iceberg/';

CREATE TABLE IF NOT EXISTS glue_catalog.retail_db.inventory_events (
    store_id       STRING     NOT NULL COMMENT 'Store identifier',
    sku            STRING     NOT NULL COMMENT 'Product SKU',
    quantity_on_hand  BIGINT  COMMENT 'Physical stock count',
    quantity_reserved BIGINT  COMMENT 'Units reserved for pending orders',
    unit_cost      DOUBLE     COMMENT 'Unit cost in USD',
    event_timestamp  BIGINT   NOT NULL COMMENT 'Event time epoch milliseconds',
    event_type     STRING     NOT NULL COMMENT 'ADJUST | SALE | RECEIVE | TRANSFER',
    ingested_at    TIMESTAMP  COMMENT 'Glue ingestion timestamp',
    event_date     DATE       NOT NULL COMMENT 'Partition column: event date',
    event_hour     INT        NOT NULL COMMENT 'Partition column: hour of day 0-23'
)
USING iceberg
PARTITIONED BY (event_date, event_hour)
TBLPROPERTIES (
    'write.target-file-size-bytes' = '134217728',     -- 128MB target
    'write.parquet.compression-codec' = 'snappy',
    'write.metadata.metrics.default' = 'truncate(16)',
    'history.expire.max-snapshot-age-ms' = '604800000',  -- 7 day snapshot retention
    'write.merge.mode' = 'merge-on-read',
    'schema.compatibility' = 'forward'                   -- allow new columns to be added
);

-- Register partition spec in Glue catalog (enables schema evolution)
ALTER TABLE glue_catalog.retail_db.inventory_events
SET TBLPROPERTIES ('schema.auto-update' = 'true');

Step 4 (45 min): Redshift Serverless and External Tables

Create a Redshift Serverless namespace and workgroup. Configure the workgroup with a base RPU (Redshift Processing Units) of 32 — enough for concurrent analytical queries without over- provisioning. The beauty of Serverless is that this scales automatically during peak query periods.

The external schema setup connects Redshift to the Glue Data Catalog, making the Iceberg tables directly queryable without any data ingestion into Redshift storage.

-- Step 4a: Create external schema pointing to Glue Data Catalog
CREATE EXTERNAL SCHEMA retail_iceberg
FROM DATA CATALOG
DATABASE 'retail_db'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftServerlessGlueRole'
CREATE EXTERNAL DATABASE IF NOT EXISTS;

-- Step 4b: Verify the external table is accessible
SELECT COUNT(*), MAX(event_timestamp)
FROM retail_iceberg.inventory_events
WHERE event_date = CURRENT_DATE;

-- Step 4c: Create materialized view for fast dashboard queries
-- (avoids full Iceberg scan on every dashboard refresh)
CREATE MATERIALIZED VIEW mv_inventory_realtime
AUTO REFRESH YES
AS
SELECT
    i.store_id,
    i.sku,
    SUM(CASE WHEN i.event_type = 'RECEIVE' THEN i.quantity_on_hand ELSE 0 END) AS units_received,
    SUM(CASE WHEN i.event_type = 'SALE'    THEN i.quantity_on_hand ELSE 0 END) AS units_sold,
    SUM(i.quantity_on_hand - i.quantity_reserved) AS net_available,
    MAX(i.ingested_at) AS last_updated,
    DATEDIFF(second, MAX(i.ingested_at)::TIMESTAMP, GETDATE()) AS lag_seconds
FROM retail_iceberg.inventory_events i
WHERE i.event_date >= DATEADD(day, -7, CURRENT_DATE)
GROUP BY 1, 2;

-- Step 4d: Schedule materialized view refresh every 5 minutes via Redshift Scheduler
CREATE SCHEDULE refresh_mv_inventory
  AT 'cron(*/5 * * * ? *)';

CALL DBMS_SCHEDULER.CREATE_JOB(
    job_name => 'mv_inventory_refresh',
    job_command => 'REFRESH MATERIALIZED VIEW mv_inventory_realtime',
    schedule_name => 'refresh_mv_inventory',
    enabled => TRUE
);

-- Step 4e: Query for Grafana — events per minute with pipeline lag
SELECT
    DATE_TRUNC('minute', ingested_at) AS minute_bucket,
    COUNT(*) AS event_count,
    AVG(DATEDIFF(second, (event_timestamp / 1000)::TIMESTAMP, ingested_at)) AS avg_lag_seconds
FROM retail_iceberg.inventory_events
WHERE event_date = CURRENT_DATE
  AND event_hour = EXTRACT(hour FROM GETDATE())
GROUP BY 1
ORDER BY 1 DESC
LIMIT 60;

Step 5 (30 min): Grafana Dashboards

Connect Grafana to Redshift Serverless via the native Redshift datasource plugin using JDBC. Use the workgroup endpoint URL from the Redshift Serverless console. Configure the connection with an IAM user scoped to SELECT on the retail_iceberg schema only.

The four dashboard panels that matter most for the business:

  • Events/sec (time series): SELECT COUNT(*)/60 as eps, DATE_TRUNC('minute', ingested_at) from the materialized view — shows real-time throughput and detects producer failures
  • Category inventory breakdown (bar chart): joins inventory events to a product dimension table for category-level stock positions
  • Geographic heatmap: store-level net_available stock mapped to a regional grid — planners can see regional stockout risk at a glance
  • Pipeline lag monitor (stat panel): displays current avg_lag_seconds from the materialized view — alerts Slack if lag exceeds 120 seconds

The pipeline lag panel is the most operationally important. When it spikes above 120 seconds, it means either the Glue streaming job has stalled, Kafka consumer lag is growing, or Redshift Serverless hit a cold-start cycle. This single number replaced a dozen separate CloudWatch alarms that the team was previously managing.

The 3 Gotchas Nobody Tells You

Gotcha 1: Glue Streaming Checkpoints Must Go to S3, Not EFS

Glue streaming jobs need a checkpoint location to track Kafka consumer offsets. The natural instinct is to mount an EFS file system — it looks like a local filesystem, which is what Spark's checkpoint mechanism expects. Don't. Under sustained load (we saw this at ~80K events/sec), the EFS NFS operations cause checkpoint writes to block, which causes the streaming query to stall while waiting for checkpoint confirmation, which causes memory pressure as the micro-batch buffers fill up. The result is OOM crashes that look like memory configuration problems but are actually I/O bottlenecks.

The fix is simple: set the checkpoint location to an S3 path and add --conf spark.hadoop.fs.s3a.fast.upload=true to the Glue job parameters. S3 checkpointing is async and non-blocking. We never had an OOM after switching.

⚠ Warning: EFS Checkpointing Will OOM Your Glue Streaming Job checkpointLocation=s3://your-bucket/checkpoints/job-name/ — this is the only safe checkpoint location for Glue streaming jobs at scale. Set it before your first production test and never change it (changing the checkpoint path resets consumer offsets and causes re-processing or data gaps depending on Kafka retention config).

Gotcha 2: Redshift Serverless Cold-Start Penalty

Redshift Serverless scales to zero when idle — great for cost, terrible for first-query latency. A cold-start query (first query after an idle period) can take 25–40 seconds to return results as the compute warms up. For a Grafana dashboard that auto-refreshes every 30 seconds, this means a morning dashboard load (after an overnight quiet period) appears broken.

The fix: schedule a lightweight "keep-warm" query every 4 hours using Amazon EventBridge + a Lambda function that executes SELECT 1 against the Redshift Serverless workgroup. This prevents full cold starts during business hours. During overnight hours (midnight–6am), we allow it to go cold — nobody's querying dashboards at 3am, and the cost savings during that window are real.

Gotcha 3: Iceberg Write Amplification From Small Files

Without explicit configuration, Iceberg's default target file size is 512MB. For a streaming job writing in 60-second micro-batches at 42M events/batch, the write throughput rarely fills a 512MB file before the trigger fires. The result: hundreds of tiny 2–5MB Parquet files accumulating per partition per hour. Within a week you have hundreds of thousands of S3 objects, S3 API costs balloon, and Iceberg metadata files grow large enough to slow down planning.

Set 'write.target-file-size-bytes'='134217728' (128MB) in the table properties. At our event size (~850 bytes/event average), a 60-second micro-batch at 700K events/min produces ~595MB of data — roughly 4–5 files at the 128MB target. Also schedule an Iceberg table maintenance job (using Spark in a separate Glue job running every 6 hours) to compact any residual small files and expire old snapshots.

Performance Results: Before and After

After two weeks of parallel running (validating that real-time output matched the overnight batch within acceptable tolerance), the client cut over dashboards and inventory systems to the real-time feed. The measured outcomes against the pre-migration baseline:

47s
Avg pipeline lag
(was 12 hours)
1.2s
P99 query latency
(was 45s on 90-day data)
$8.2K
Monthly infra cost
(vs $14.5K provisioned)
43%
Reduction in stockout
events (first 60 days)

The 47-second average pipeline lag is the end-to-end number from event generation at the edge (POS terminal or IoT sensor) to the value appearing in a Grafana dashboard panel. This includes Kafka producer latency (~50ms), MSK broker replication (~200ms), Glue micro-batch processing (60s trigger + ~15s processing), S3 write (~5s), Iceberg metadata update (~2s), and Redshift materialized view refresh (every 5 minutes, so dashboard queries hit the MV which is always fresh within 5 minutes, but the raw Iceberg query is available immediately).

The $8,200/month infrastructure cost covers: MSK cluster (~$1,100/mo for 3 × m5.large), three Glue DPU-hours/streaming jobs (~$2,800/mo at ~4 DPUs each running continuously), S3 storage and API costs (~$800/mo for 30 days of Iceberg data), and Redshift Serverless base RPU + query compute (~$3,500/mo average). This compares to $14,500/month for the equivalent provisioned ra3.xlplus Redshift cluster the client had previously scoped.

When NOT to Use This Stack

This architecture has real operational complexity. It's justified at 1B+ events/day with a compelling business case for sub-minute latency. It is not justified in every situation.

  • Under 1M events/day: DynamoDB Streams + Lambda + Athena is dramatically simpler. No Kafka cluster to manage, no Glue streaming jobs, and the query performance on that data volume is entirely adequate. The MSK + Glue stack adds ~$2,500/month in fixed costs before you process a single event.
  • Complex ML feature store requirements: If your real-time data needs to feed ML models for online inference (not just analytics), add Feast or SageMaker Feature Store to the pipeline instead of routing everything through Redshift. Redshift is a poor feature store — low-latency point lookups are not what it's optimized for.
  • Multi-region active-active replication: This architecture is single-region. If your producers are globally distributed and you need cross-region event consistency, you need Kafka MirrorMaker 2 between MSK clusters. That doubles your operational complexity and your MSK cost. Consider whether Kinesis Data Streams (which has native cross-region replication) is a better fit for that requirement.
💡 Tip: Validate With Batch Before Cutting Over Run the real-time pipeline in parallel with your existing batch pipeline for at least one full business cycle before cutting over dashboards and operational systems. Compare aggregate daily totals between the two systems — a systematic discrepancy (we found a 0.3% gap caused by timezone handling in the POS event producer) is far better discovered during parallel running than after a production cutover.
Apache Kafka Amazon MSK AWS Glue Apache Iceberg Redshift Serverless Lakehouse Real-Time Streaming Data Engineering FinOps

Building a real-time data platform on AWS?

Book a free 30-minute architecture review with our data engineering team. We'll review your event volumes, latency requirements, and budget constraints — and tell you honestly whether Kafka + Glue + Redshift is the right call or whether a simpler stack fits better.

Book Free Architecture Review →
// Related Articles
// DevOps
Kubernetes Cost Optimisation Checklist 2025
// FinOps
SageMaker Zombie Endpoints: The $700K/yr Cloud Cost Spiral
// Related Service
Data Engineering
We build production real-time lakehouses on Kafka, AWS Glue, and Redshift Serverless — end-to-end, governed, and cost-optimized.
Learn More