Build a real-time ad analysis system using Redpanda, Flink, and Pinot

ByDunith DhanushkaonNovember 16, 2023
Reference architecture for a real-time ad performance analytics platform

Behind every ad served to a user, impressions and click events follow. Ad impressions represent the number of times an ad is fetched and potentially viewed by a user. A click on an ad captures the user’s intention for conversion. The ability to analyze these events in real time is crucial to optimize ad campaigns, maximize ROI, and deliver hyper-personalized content to users.

The responsibility of an ad events processing system is to manage the flow of events, cleanse them, aggregate clicks and impressions, and provide this data in an accessible format for reporting and analytics to measure the overall performance of ad campaigns.

In this post, we present a reference architecture for a real-time ad performance analytics system using Redpanda, Apache Flink®, and Apache Pinot™ for unbeatable scalability, performance, and correctness of insights. (If you haven’t already, check out our previous reference architecture on how to build a demand-side platform for AdTech first.)

Ready? Let’s dig in.

3 keys to accurately measuring ad performance

When building a system for real-time ad performance tracking, you’ll need to meet the following objectives.

  1. Speed: Speed enables timely decision-making, allowing advertisers to make split-second decisions about ad placement, targeting, and bidding to capture opportunities and respond to market changes. Also, real-time performance analysis helps allocate budgets to the most effective campaigns and channels, ensuring that resources aren’t wasted on underperforming strategies. Furthermore, timely detection and prevention of ad frauds prevents ad budget wastage on non-human or low-quality traffic.

  2. Reliability: The system must be reliable in terms of data integrity. Ad events represent actual money paid by customers, so a loss of events means a loss of potential revenue.

  3. Accuracy: Correctness of ad performance metrics is also important for precisely billing customers and partners. Erroneously counted clicks result in overcharging advertisers and overreporting the success of ads. That is where exactly-once ad event processing is crucial.

Reference architecture for real-time ad performance analytics

To meet the above goals, we can design a system by leveraging three major technologies, including but not limited to:

  • Redpanda as the streaming data platform

  • Flink for streaming ETL

  • Pinot for serving OLAP queries on streaming ad events

To give you a better understanding of how they all connect, take a look at the diagram below.

Overall solution architecture of a real-time ad performance analytics platform
Overall solution architecture of a real-time ad performance analytics platform

Remember that the specific requirements and technologies may vary based on your organization's needs, but this should give you a good starting point. Now, let’s briefly cover each part of the diagram and the roles these three powerful technologies play.

Real-time ad event ingestion with Redpanda

Ad event ingestion is the first step of the solution, ensuring that every ad impression, click, and conversion is reliably captured, transported, and stored in a central location, enabling subsequent analysis. The ingestion process gathers data from diverse sources, including websites, mobile apps, and ad servers, in various formats such as JSON, XML, Avro, or log files.

Ingestion must happen in real time, allowing downstream consumers to process events and react as quickly as possible. So, the solution leverages a streaming data platform like Redpanda for streaming event ingestion over a data warehouse or a data lake. Redpanda is the first point of contact for all ad events entering the solution, ingesting raw events over different protocols, such as Avro, JSON, and Protobuf.

Ad events are captured and ingested into Redpanda.
Ad events are captured and ingested into Redpanda.

What value does Redpanda bring in this scenario? Well, Redpanda helps us achieve all three goals: speed, reliability, and accuracy. This is thanks to its scalable platform for ingesting ad events at higher throughputs, potentially going up to 1GB/sec workloads while proving lower tail latencies in the milliseconds range. Its novel design based on native Raft implementation also improves data safety, ensuring no single loss of events happens while in operation—giving you increased reliability and accuracy.

Take Redpanda for a spin!

Try the most simple, powerful streaming data platform—for free.

Streaming ETL and exactly-once processing with Flink

After raw ad events are ingested into a source topic in Redpanda, they’re still not quite ready for analysis. First, they must be cleansed, deduped, transformed, and enriched. Traditionally, this has been done by running periodic extract, transform, and load (ETL) jobs on raw data.

However, since the solution prioritizes generating real-time insights, it leverages a stream processing engine like Flink to perform streaming ETL while the ad events are ingested. Happily enough, Redpanda is Kafka API-compatible so you can use a Flink Kafka source for upstream consumption. Alternatively, you could use other stream processors compatible with Kafka APIs in the solutions, like Apache Beam.

Like Redpanda, Flink also meets all three of our initial requirements. Its stream processing model allows it to handle high-velocity data streams with low-latency processing, making it perfectly aligned with the demands of the advertising industry. Flink's ability to scale horizontally also makes it well-suited for handling the ever-growing volume of ad events.

Flink's stateful processing capabilities enable it to maintain context and analyze ad impressions over time, facilitating functions like user session tracking and conversion attribution. When it comes to reliability, Flink offers fault tolerance, ensuring that data isn’t lost in the event of failures, which is crucial in maintaining uninterrupted real-time analysis.

Furthermore, the exactly-once configuration in both Flink and Redpanda ensures that any events processed through Flink and sunk to Redpanda are done transactionally. When you use Flink’s KafkaConsumer with “read_committed” mode enabled, it’ll only read transactional messages, significantly enhancing the accuracy of generated metrics.

Serving faster OLAP queries with Pinot

Primary consumers of ad metrics computed by Pinot
Primary consumers of ad metrics computed by Pinot

Once the ad events are cleansed and processed, they’re ingested into Pinot, a real-time OLAP database capable of running analytical queries at millisecond latencies. Having Pinot as the serving layer feeds several downstream consumers, including real-time ad performance dashboards and data products, such as internal applications and APIs exposed to external partners.

Pinot is a streaming database capable of streaming event ingestion from data sources, like Apache Kafka®. And, anything compatible with Kafka is compatible with Redpanda, so integrating Redpanda with Pinot is pretty straightforward.

Pinot is designed to handle high-throughput data with low-latency queries, which is crucial in ad analysis where a massive volume of ad impressions needs to be processed quickly. This ensures that advertisers can react to changing conditions swiftly.

As the volume of ad impressions and data grows, Pinot can be easily scaled horizontally to handle the increased load, ensuring that the system remains responsive even as its demands increase.

When it comes to query flexibility, Pinot supports a variety of query types, including filtering, aggregation, and drill-downs. Advertisers and analysts can use these capabilities to dissect ad performance data from different angles, gaining deeper insights into user behavior and campaign effectiveness.

Reporting and ad-hoc SQL querying

The solution can benefit from a data warehouse for reporting and data analysis, which is not latency-sensitive. For example, these can be typical business intelligence (BI) workloads, such as producing daily reports and ad-hoc SQL querying, leveraged by analysts and data scientists.

You can use a compatible Kafka Connect sink connector with Redpanda to stream the processed data into various data warehouses, such as BigQuery, Snowflake, Redshift, etc.

Long-term storage

Raw data can be streamed into a data lakehouse solution for ML workloads, such as building and running ML models for improved ad targeting.

And that’s it! With this reference architecture, you can confidently lay the foundation for a real-time ad performance analytics system fueled by speed, reliability, and accuracy—the most crucial attributes in the AdTech industry.

Get started with Redpanda for lightning-fast ad analytics

Whether you’re an ad exchange, publisher, or advertiser—measuring ad performance is vital in making informed decisions about ongoing ad campaigns. Feel free to adopt this architecture and augment it with your own technology components.

With Redpanda in the mix, you can add more mileage to your journey in terms of performance, reliability, and cost-efficiency. Plus, since Redpanda is designed with developers in mind, you can count on a vastly more simplified experience when it comes to operations and maintenance.

To get started with Redpanda, browse the Redpanda blog for tutorials or dive into one of the many free courses at Redpanda University. If you have questions or want to chat with our engineers, join the Redpanda Community on Slack.

Let's keep in touch

Subscribe and never miss another blog post, announcement, or community event. We hate spam and will never sell your contact information.