Designing a Real-Time Ad Click Aggregator

Using Flink, Kafka/Kinesis, and Understanding Lambda vs Kappa Architecture

Building the ads vertical at Bolt Food — where sponsored listings need real-time performance tracking — made this problem very concrete for me. When designing large-scale advertising systems, one of the most common problems is:

How do we aggregate billions of ad clicks in real time, accurately, and reliably?

This post walks through:

  • The system design of a real-time ad click aggregator
  • Why Kafka/Kinesis + Flink is a strong choice
  • What watermarks are (in simple terms)
  • Lambda vs Kappa architecture
  • How to discuss this confidently in system design interviews

1. The Problem

We need to:

  • Ingest millions of ad click events per second
  • Aggregate clicks per ad (per minute, hour, day)
  • Handle late and out-of-order events
  • Ensure billing accuracy (no double counting)
  • Provide real-time dashboards
  • Store historical analytics data

Each click event looks like:

{
  "ad_id": "123",
  "user_id": "abc",
  "timestamp": "2026-03-07T12:00:45Z"
}

2. High-Level Architecture

flowchart LR
    Users([Users]) --> CC[Click Collector API]
    CC --> Kafka[[Kafka / Kinesis]]
    Kafka --> Flink[Apache Flink]
    Flink --> Redis[(Redis / DynamoDB)]
    Flink --> S3[(S3 Parquet)]
    Redis --> Dashboard[Dashboard / API]

Components

1. Click Collector

  • Validates incoming click events
  • Assigns unique click IDs
  • Pushes events to Kafka/Kinesis

2. Kafka / Kinesis (Event Stream)

  • Durable, scalable message queue
  • Partitions by ad_id
  • Enables replay of historical events

3. Flink (Stream Processor)

  • Consumes click events
  • Uses event-time windows
  • Maintains state per ad_id
  • Emits aggregated counts
4. Storage

  • Redis / DynamoDB — real-time dashboard
  • S3 (Parquet) — long-term analytics
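
Before going deeper, here is the core aggregation in miniature. This is a plain-Python sketch of the logic the Flink job performs (not the Flink API): bucket each event into a one-minute tumbling window by its event timestamp, keyed by ad_id, and count.

```python
from collections import defaultdict
from datetime import datetime

WINDOW_SECONDS = 60  # one-minute tumbling windows

def window_start(ts_iso: str) -> int:
    """Epoch second at which this event's window begins."""
    ts = datetime.fromisoformat(ts_iso.replace("Z", "+00:00"))
    epoch = int(ts.timestamp())
    return epoch - epoch % WINDOW_SECONDS

def aggregate(events):
    """Count clicks per (ad_id, window_start)."""
    counts = defaultdict(int)
    for event in events:
        counts[(event["ad_id"], window_start(event["timestamp"]))] += 1
    return dict(counts)

events = [
    {"ad_id": "123", "timestamp": "2026-03-07T12:00:45Z"},
    {"ad_id": "123", "timestamp": "2026-03-07T12:00:59Z"},
    {"ad_id": "456", "timestamp": "2026-03-07T12:01:02Z"},
]
print(aggregate(events))  # the two "123" clicks share the 12:00 window
```

The hard parts of the real system — ordering, late arrivals, failure recovery — are everything *around* this loop, which is what the rest of this post covers.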

3. Why Kafka/Kinesis + Flink Is a Strong Choice

Horizontal Scalability

  • Kafka partitions scale ingestion.
  • Flink parallel operators scale processing.
  • Partitioning by ad_id keeps all of an ad's events (and its keyed state) in one place; salt hot keys if a single ad dominates traffic.
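
The key-to-partition mapping is just a stable hash. A sketch (Kafka's default partitioner actually uses murmur2; md5 here is only for illustration):

```python
import hashlib

NUM_PARTITIONS = 12

def partition_for(ad_id: str) -> int:
    # A stable hash guarantees every click for the same ad lands on the
    # same partition -- that is what makes per-ad ordering and keyed state work.
    digest = hashlib.md5(ad_id.encode()).hexdigest()
    return int(digest, 16) % NUM_PARTITIONS

print(partition_for("ad-123"), partition_for("ad-456"))
```

Note the caveat: one very hot ad still maps to a single partition. The usual fix is salting the key (ad_id plus a small suffix) and re-merging the partial counts downstream.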

Exactly-Once Processing

For billing, accuracy matters.

Flink:

  • Maintains state (via RocksDB)
  • Uses checkpointing
  • Recovers without double counting
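
The mechanics are easier to see in a toy model. The essential trick is that state and the input offset are snapshotted *together*, so after a crash the job resumes from an offset that matches the restored state and nothing is counted twice. (This is the idea behind Flink's checkpoints, not its actual implementation.)

```python
import copy

def run(events, checkpoint_every=2, crash_after=None):
    """Sum a stream of numbers, checkpointing (state, offset) atomically."""
    state = {"count": 0}
    offset = 0
    checkpoint = (copy.deepcopy(state), offset)

    processed = 0
    while offset < len(events):
        state["count"] += events[offset]
        offset += 1
        processed += 1
        if crash_after is not None and processed == crash_after:
            # Simulate a crash: throw away in-memory state, restore the
            # last checkpoint -- state and offset roll back in lockstep.
            state, offset = copy.deepcopy(checkpoint[0]), checkpoint[1]
            crash_after = None
        elif offset % checkpoint_every == 0:
            checkpoint = (copy.deepcopy(state), offset)
    return state["count"]

# With or without a mid-run crash, the final count is identical.
print(run([1] * 10), run([1] * 10, crash_after=5))
```

The events between the last checkpoint and the crash are replayed, but because the offset was rolled back with the state, replaying them is not double counting.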

Handling Late Events

Mobile clicks often arrive late: devices buffer events offline and flush them when connectivity returns.

Flink uses:

  • Event-time processing
  • Watermarks
  • Allowed lateness

Which brings us to an important concept.


4. What Are Watermarks? (Explained Simply)

Imagine you’re counting clicks for 12:00–12:01.

Some clicks arrive late.

You don’t want to:

  • Close too early (miss clicks)
  • Wait forever

A watermark is Flink saying:

“I believe I’ve seen all events up to time X.”

Once the watermark passes 12:01:

  • Flink closes that window
  • Emits the final count

Late events beyond allowed lateness:

  • Are dropped
  • Or handled separately

This balances accuracy and latency.
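
Here is the whole mechanism as a toy, assuming a bounded-out-of-orderness watermark (watermark = max event time seen minus a fixed delay) and timestamps in epoch seconds. Flink separates the watermark delay from allowed lateness; this sketch folds them into one knob.

```python
WINDOW = 60     # one-minute windows (12:00-12:01 and so on), in seconds
MAX_DELAY = 10  # how out-of-order we tolerate events being

def process(events):
    counts = {}    # open windows: window_start -> count
    emitted = {}   # finalized windows
    dropped = []   # events later than the allowed lateness
    max_ts = float("-inf")
    for ts in events:  # ts = event time in epoch seconds
        max_ts = max(max_ts, ts)
        watermark = max_ts - MAX_DELAY  # "I've seen everything up to here"
        start = ts - ts % WINDOW
        if start + WINDOW <= watermark:
            dropped.append(ts)  # window already closed: drop (or side-output)
            continue
        counts[start] = counts.get(start, 0) + 1
        # Finalize every window whose end the watermark has passed.
        for closed in [s for s in counts if s + WINDOW <= watermark]:
            emitted[closed] = counts.pop(closed)
    return emitted, counts, dropped

emitted, pending, dropped = process([0, 50, 70, 55, 130, 40])
print(emitted, pending, dropped)
```

In this run the event at t=55 shows up after the watermark has passed 60, so its window is already closed and it is dropped; the earlier events at 0 and 50 made it in before the window was finalized.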


5. Deduplication Strategy

Click systems must prevent fraud and duplicates.

Approaches

  • Assign unique click_id
  • Use Flink keyed state to track seen IDs
  • Use TTL to expire old IDs
  • Use idempotent writes in the sink
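
The keyed-state-plus-TTL idea looks like this in plain Python (Flink would keep this in RocksDB-backed keyed state with state TTL; the class and timings here are illustrative):

```python
class Deduplicator:
    """Track recently seen click_ids with a TTL, mimicking keyed state + TTL."""

    def __init__(self, ttl_seconds: float):
        self.ttl = ttl_seconds
        self.seen = {}  # click_id -> last-seen time

    def is_duplicate(self, click_id: str, now: float) -> bool:
        self._expire(now)
        if click_id in self.seen:
            return True
        self.seen[click_id] = now
        return False

    def _expire(self, now: float) -> None:
        # Drop IDs older than the TTL so state stays bounded.
        self.seen = {cid: t for cid, t in self.seen.items() if now - t < self.ttl}

dedup = Deduplicator(ttl_seconds=3600)
print(dedup.is_duplicate("click-1", now=0))     # False: first sighting
print(dedup.is_duplicate("click-1", now=10))    # True: retry/duplicate
print(dedup.is_duplicate("click-1", now=4000))  # False: TTL expired, treated as new
```

The TTL is the trade-off dial: long enough to catch realistic retries and replays, short enough that state doesn't grow without bound. Idempotent sink writes catch whatever slips past it.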

In interviews, say:

“We ensure idempotency at both the stream processing layer and the storage layer.”


6. Lambda vs Kappa Architecture

Now let’s zoom out architecturally.

Lambda Architecture

Two pipelines:

  • Batch Layer (Hadoop/Spark)
  • Speed Layer (Flink/Storm)
  • Serving Layer (merges both)

flowchart TD
    Source([Event Source]) --> Speed[Speed Layer - Flink]
    Source --> Batch[Batch Layer - Spark]
    Speed --> Serving[Serving Layer]
    Batch --> Serving
    Serving --> Query([Queries])

Flow:

  • Real-time updates from Speed Layer
  • Nightly batch recomputation fixes inaccuracies
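
The serving layer's merge is conceptually simple: trust the batch view up to its cutoff, and add only the speed layer's counts from after that cutoff. A sketch with illustrative data shapes:

```python
def merged_count(ad_id, batch_view, batch_cutoff, speed_view):
    """Batch view is authoritative up to batch_cutoff; speed covers the tail."""
    batch = batch_view.get(ad_id, 0)  # recomputed nightly, trusted
    speed = sum(count for (aid, ts), count in speed_view.items()
                if aid == ad_id and ts >= batch_cutoff)  # real-time tail only
    return batch + speed

batch_view = {"ad-123": 1_000_000}  # counts up to the last batch run
speed_view = {("ad-123", 90): 40, ("ad-123", 110): 7, ("ad-456", 110): 3}
print(merged_count("ad-123", batch_view, batch_cutoff=100, speed_view=speed_view))
```

Note how the speed-layer entry at t=90 is ignored: it is already inside the batch view's range, and counting it again is exactly the double-counting bug the cutoff exists to prevent.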

Pros:

  • High accuracy
  • Good for massive historical datasets

Cons:

  • Two codebases
  • Operational complexity

Kappa Architecture

Single stream pipeline. Everything flows through streaming.

Need to recompute? Replay from Kafka.

flowchart LR
    Source([Event Source]) --> Kafka[[Kafka]]
    Kafka --> Flink[Flink]
    Flink --> Store[(Storage)]
    Store --> Query([Queries])
    Kafka -. replay .-> Flink

Pros:

  • Much simpler
  • One processing path
  • Easier maintenance

Cons:

  • Heavy reprocessing can be expensive
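
"One processing path" is the whole pitch: the same aggregation code serves live traffic and reprocessing, and a recompute (after a bug fix or logic change) is just re-running that code over the retained Kafka log from offset zero. A minimal sketch:

```python
def count_clicks(stream):
    """The single pipeline: used unchanged for live events and for replays."""
    counts = {}
    for event in stream:
        counts[event["ad_id"]] = counts.get(event["ad_id"], 0) + 1
    return counts

log = [{"ad_id": "123"}, {"ad_id": "456"}, {"ad_id": "123"}]  # retained topic

live = count_clicks(iter(log))      # first pass, as events arrived
replayed = count_clicks(iter(log))  # replay from offset 0: identical results
print(live == replayed, live)
```

Contrast with Lambda, where the batch recomputation lives in a second codebase that must be kept logically identical to the streaming one by hand.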

7. Which One Would I Choose?

For a modern ad click system:

I would choose Kappa architecture.

Because:

  • Flink supports event-time + exactly-once
  • Kafka allows replay
  • Operational complexity is lower
  • Real-time is first-class

Lambda is useful when:

  • Historical recomputation is massive
  • Legacy batch systems already exist

8. Handling Failures

What if Flink crashes?

  • State is checkpointed
  • Job restarts from last checkpoint
  • Kafka offsets are restored

What if Redis slows down?

  • Flink applies backpressure
  • Autoscale Redis
  • Buffer and retry

In interviews:

“We rely on Flink’s checkpointing and Kafka durability to ensure fault tolerance and no data loss.”


9. Multi-Region Design

For global ads:

  • Regional Kafka clusters
  • Regional Flink jobs
  • Global aggregation pipeline
  • S3 as central analytics store

This reduces latency and improves resilience. For a deeper look at geo-partitioning and region-based ownership strategies, see the Planet-Scale Auction System post.


Final Thoughts

The real insight in ad click aggregation is that billing accuracy is a streaming problem. You can’t batch your way to real-time correctness when advertisers are paying per click and dashboards need to reflect reality within seconds.

The key architectural decision is Lambda vs Kappa. For most modern systems, Kappa wins — Flink’s event-time processing and exactly-once guarantees mean you no longer need a separate batch layer to “fix” the real-time path. One pipeline, one source of truth, and Kafka replay for reprocessing when you need it.

If you take one thing away: watermarks and exactly-once semantics aren’t academic concepts — they’re the difference between an aggregation system that works and one that silently loses money.