Bringing Shadow Indexing, Wasm Data Transforms, and Protobuf to Redpanda

Blog

Announcement

The Redpanda platform is getting three new features.

ByVivek SaraswatonDecember 15, 2021

Bringing Shadow Indexing, Wasm Data Transforms, and Protobuf to Redpanda

‘Tis the season of giving, so we are excited to announce three new major features of the Redpanda platform this month: Shadow Indexing, support for Protobuf in the Schema Registry, and tech preview for Wasm Data Transforms. Each of these innovative features is the result of considerable development effort and each provides significant capabilities and benefits to developers building real-time applications.

Shadow Indexing tech preview: Unify real-time and historical data with zero code changes

Shadow Indexing is our tiered storage solution. It archives log segments to cloud object storage and replays those logs as a stream from the cloud if the data no longer exists in the cluster. This makes streaming truly global by moving the source of truth from individual clusters to the cloud, and unifies real-time and historical data without any code changes to your applications. After extensive work from our team, we are excited to announce the release of Shadow Indexing as a tech preview feature in Redpanda Cloud and Enterprise editions. This feature is not yet supported for production usage.

Shadow Indexing

The initial release of Shadow Indexing focuses on the use cases of data archival and Kafka topic recovery. Users can enable Shadow Indexing on a per-topic basis and automatically push data to either AWS S3 or GCP Cloud Storage. Users can also set topics to automatically recover from object storage if they are deleted from the cluster due to data loss or other issues. Note that this feature is enabled by default on our Redpanda Cloud platform for new topics created via the UI in order to ensure maximum durability for your data.

Enabling Shadow Indexing is simple. When creating a new topic using rpk, you can simply pass the following flags to enable remote reads/writes as well as topic recovery:

--redpanda.remote.write – Uploads data from Redpanda to cloud storage
--redpanda.remote.read – Fetches data from cloud storage to Redpanda
--redpanda.remote.recovery – Recovers a topic from cloud storage.

Adding Shadow Indexing when creating a topic is as simple as

--rpk topic create <topic-name> -c redpanda.remote.read=true -c redpanda.remote.write=true

or for an existing topic

--rpk topic alter-config <topic-name> --set redpanda.remote.read=true --set redpanda.remote.write=true

What makes Shadow Indexing truly unique compared to traditional tiered storage is the ability to replay log data natively from object storage—in the future, look for use cases like spinning up read-only Redpanda clusters that can independently retrieve data from object storage without impacting operational clusters or making code changes. An upcoming blog post will give a deep dive into our Shadow Indexing architecture and use cases.

Protobuf: More options for schema management

Earlier this year we released our Schema Registry with support for Apache Avro. The latest Redpanda release extends our registry with support for Protobuf, Google’s language- and platform-neutral mechanism for serializing structured data.

Both Avro and Protobuf are great options that support multiple languages, schema evolution, and excellent performance. While Avro has been a common schema choice within the Hadoop and Kafka communities for Java-based applications, Protobuf has gained popularity for applications leveraging Go, Python, Rust, and Ruby. And if you’re still looking for more schema choices, stay tuned for JSON support next year!

Here’s a quick example of how to get started with Protobuf in the Schema Registry:

Import a schema with a message called Simple and subject simple.proto:

curl -X POST -H "Content-Type: application/json" --data '{"schema":"syntax = \"proto3\"; message Simple { string id = 1; }","schemaType":"PROTOBUF"}' http://localhost:8081/subjects/simple.proto/versions

Import a schema, referencing the previous one:

curl -X POST -H "Content-Type: application/json" --data '{"schema":"syntax = \"proto3\"; import \"simple.proto\"; message Imported { Simple id = 1; }","schemaType":"PROTOBUF","references":[{"name":"simple.proto","subject":"simple.proto","version":1}]}' http://localhost:8081/subjects/imported/versions

Other commands from the blog post work similarly for Avro and Protobuf.

Data Transforms tech preview: WebAssembly (Wasm) support soon!

Coming later this month in our Redpanda Developer Edition is the tech preview of Redpanda Data Transforms, which enables users to create WebAssembly (Wasm) based scripts to perform simple data transformations on topics. This reduces “data ping-pong” by eliminating the need to send data out to a stream processor for common transformation tasks like data scrubbing, cleaning, normalizations, etc. For a more in-depth look at the technology behind data transforms, read our previous post on our Wasm engine architecture.

Pacemaker

The Data Transforms tech preview is built around our custom asynchronous transformation engine, a sidecar to the core Redpanda process that creates a transformed child topic with 1:1 partition matching from the parent topic. This is best used for stateful one-shot transformations that can sustain reasonably low latency (e.g. 10-50+ms). Common use cases for this involve mapping data; one we hear often from our customers is the ability to redact data (for example, remove PII such as names) from a produced topic before it gets sent to a consumer.

Getting started with Data Transforms is as easy as using rpk to load a Wasm script and attach it to a topic on your Redpanda cluster:

rpk wasm generate <project_name> --  Generates a sample nodejs project you can modify for the transformation
rpk wasm deploy <javascript_file.js> --name <coprocessor_uuid> -- Deploys coprocessor to the Wasm engine and will run the transform on all interested topics
rpk wasm remove <coprocessor_uuid> -- Deregisters and removes the transform from the cluster

This feature is in tech preview and thus not supported for production use. As we move to general availability and beyond, we plan to introduce an in-line transformation engine for ultra-low latency use cases, along with using code injection to hook into Redpanda for more strategic alterations to our distributed storage engine. Stay tuned for more architecture deep dives and feature details.

You can get started with Shadow Indexing and ProtoBuf support today by downloading the latest version of Redpanda or contact us to sign up for the Cloud. We’d love to hear your feedback on our Slack Community.