Intro
For some time now I’ve been working on and building around Ethereum infrastructure, and in that time I’ve come to realise how incredibly complex it is to scale, manage and even query the blockchain. This complexity is amplified by the sheer volume of transactions, the chain’s ever-growing state, and the demand for real-time access to data.
I’ve been contributing to the Cardinal-Flume project and wanted to write some notes about it here. However, to explain what Cardinal-Flume is and why it’s necessary, I first want to unpack the ecosystem it operates within, as it is part of a larger effort to solve longstanding issues with managing Ethereum nodes at scale. The architecture surrounding Cardinal-Flume isn’t isolated – it exists as part of Cardinal’s broader design. So this note is a walkthrough of our efforts over the years to solve these issues.
What I’ll be talking about:
- the problem with managing Ethereum nodes
- the EtherCattle initiative and its limitations
- introducing Cardinal and its architecture
- how it all works together
The problem with managing ethereum nodes
Ethereum nodes are the backbone of the network: they participate in consensus, validate transactions and blocks, store blockchain data, and expose JSON-RPC endpoints for dApps, wallets and many other services to interact with Ethereum.
Running ethereum nodes at scale can be difficult and expensive because:
- high resource usage: Ethereum nodes store hundreds of gigabytes to terabytes of blockchain data. They need to sync the entire chain, which consumes a lot of memory and disk space.
- scaling is hard and expensive: running a single node can work for small-scale applications, but for services that need heavy blockchain data access and querying (e.g. dApps, explorers) it becomes challenging, because you’ll need to scale horizontally by deploying multiple nodes. Each node syncs independently, often duplicating effort and resources.
- maintenance is cumbersome: nodes require software updates, restarts and resyncs, which can take hours or days. When nodes go down, resyncing adds hours of downtime and impacts the availability of services. This introduces operational overhead that teams must constantly manage.
Additionally, querying historical data is inefficient, as nodes are optimized for real-time state, not long-range block/log queries.
EtherCattle Initiative
To address the issues of scaling and managing eth nodes, EtherCattle was developed as a way to simplify node replication and reduce operational overhead. It introduced streaming replication, allowing a master node to continuously stream new blocks, logs, and transaction data to replica nodes in real time.
The system consisted of:
- a master node: a fully synced Ethereum client responsible for validating incoming blocks and transactions and maintaining peer-to-peer connections with the network
- a Kafka cluster: capturing write operations and state from the master node and streaming them to connected replicas in real time
- replica nodes: lightweight nodes that subscribed to Kafka topics, continuously receiving updates from the master. These replicas avoided syncing independently from the network, reducing overhead. Admins could run as many replica nodes as necessary to meet capacity requirements, distributing load across multiple nodes.
Essentially, EtherCattle required a master node (a Geth client) that synced with the network, validated incoming blocks & transactions and performed write operations. These operations were streamed over Kafka to replica nodes, which handled RPC requests. As the master wrote data to disk, it logged write operations to a Kafka topic. Replicas subscribed to Kafka topics, consuming these write operations and applying them locally to stay in sync. If the master ever failed, a replica could be promoted to master, ensuring uninterrupted node operation.
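A minimal sketch of the replica side of this loop, assuming the Kafka topic carries raw key/value database writes (the topic name, message layout, and LevelDB usage here are illustrative, not EtherCattle’s actual wire format):

```go
package main

import (
	"context"
	"log"

	"github.com/segmentio/kafka-go"
	"github.com/syndtr/goleveldb/leveldb"
)

func main() {
	// Hypothetical topic carrying the master's database write operations.
	reader := kafka.NewReader(kafka.ReaderConfig{
		Brokers: []string{"localhost:9092"},
		Topic:   "eth-master-writes", // illustrative name
		GroupID: "replica-1",
	})
	defer reader.Close()

	// Local database the replica keeps in sync with the master.
	db, err := leveldb.OpenFile("./replica-chaindata", nil)
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	for {
		msg, err := reader.ReadMessage(context.Background())
		if err != nil {
			log.Fatal(err)
		}
		// Apply the master's write verbatim. This only works if master and
		// replica share an identical schema, which is exactly the coupling
		// discussed in the limitations below.
		if err := db.Put(msg.Key, msg.Value, nil); err != nil {
			log.Fatal(err)
		}
	}
}
```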
By leveraging this architecture, EtherCattle aimed to help teams:
- eliminate redundant syncs: replica nodes mirrored the master’s state by consuming streamed data instead of syncing independently from the network.
- enable horizontal scaling: administrators could deploy replicas to handle growing query demands (eth_call, eth_getLogs), preventing the master node from being overloaded.
- reduce downtime and recovery: in case of master node failure, replicas could be promoted to master, minimizing downtime without resyncing the entire blockchain.
It addressed many of the pain points associated with managing Ethereum nodes. It automated replication, making it easier to scale node infrastructure horizontally without the operational overhead of running multiple independently synced nodes. Although it has now been deprecated to make way for Cardinal, you can still read more about its technical design and documentation here.
Limitations of EtherCattle
While EtherCattle introduced streaming replication and improved node scaling, its architecture revealed significant limitations over time. The primary issue stemmed from the tight coupling between master and replica nodes. EtherCattle relied on bit-for-bit streaming of database writes from the master to replicas, which required both to share identical database schemas. This rigid design limited EtherCattle’s flexibility, making it difficult to support blockchain clients with different database structures or to integrate networks beyond Ethereum. These other networks (EVM-based or derived) either developed their own clients or had forked from Geth long enough ago that maintaining compatibility required ongoing updates to match schema changes. This architectural constraint ultimately led to the development of Cardinal.
Some other limitations were:
- query performance: while EtherCattle distributed the workload across replicas, it couldn’t solve the inefficiency of historical queries like eth_getLogs. These queries required scanning through blocks one by one, making the system slower and less responsive when handling long-range data requests (see the sketch after this list).
- storage overhead: even though replica nodes avoided syncing directly from Ethereum, they still had to store the entire blockchain locally. This resulted in high disk usage across all nodes, contributing to redundant storage and inflated infrastructure costs.
- operational complexity: managing Kafka clusters, coordinating failovers, and scaling replicas introduced additional DevOps overhead, requiring teams to maintain both Ethereum clients and streaming infrastructure.
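To make the query-performance point concrete, here is the kind of long-range eth_getLogs request that a stock client handles poorly (the endpoint, block range, and contract address are placeholders):

```go
package main

import (
	"context"
	"log"
	"math/big"

	"github.com/ethereum/go-ethereum"
	"github.com/ethereum/go-ethereum/common"
	"github.com/ethereum/go-ethereum/ethclient"
)

func main() {
	client, err := ethclient.Dial("http://localhost:8545") // any JSON-RPC endpoint
	if err != nil {
		log.Fatal(err)
	}

	// A multi-million-block range: a client that walks blocks one by one
	// can take minutes to answer this, or reject it outright.
	query := ethereum.FilterQuery{
		FromBlock: big.NewInt(10_000_000),
		ToBlock:   big.NewInt(18_000_000),
		Addresses: []common.Address{
			common.HexToAddress("0x0000000000000000000000000000000000000000"), // placeholder contract
		},
	}
	logs, err := client.FilterLogs(context.Background(), query)
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("got %d logs", len(logs))
}
```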
Introducing Cardinal
To overcome the limitations of EtherCattle, Cardinal was developed as the next step in scaling and managing Ethereum nodes. By taking a modular approach, Cardinal separates the responsibilities of state execution, historical data storage, data streaming, and query handling into distinct components. This decoupling allows each part of the system to focus on a specific task, reducing operational overhead and improving efficiency. It also resolves EtherCattle’s schema-coupling issue and provides a more flexible framework for managing Ethereum nodes across different clients and networks.
A key element of Cardinal’s design was the introduction of an explicit communication layer that separated the master and replica servers. By abstracting how each component handled data storage, Cardinal made it easier to adapt various Ethereum clients as master nodes. This approach also enabled replicas to focus solely on serving RPC requests and storing data in a way optimized for querying, rather than being constrained by the requirements of participating in Ethereum’s peer-to-peer protocol. This architectural shift lowered costs by allowing replicas to store a leaner version of the Ethereum state trie and by streamlining Kafka usage for more efficient data transport.
Cardinal’s Architecture
At a systems level, Cardinal’s architecture shares similarities with EtherCattle:
Master nodes connect to peers, process blocks, establish consensus, and stream data through Kafka to replicas and Flume, which handle RPC requests. However, the internal design of each component reflects a significant departure from EtherCattle’s structure. The stack is divided into the following distinct components.
0. Master Nodes
Master nodes are a fundamental part of the Cardinal architecture. These are full Ethereum clients (based on plugeth) that connect to the Ethereum peer-to-peer network and handle critical blockchain operations like syncing blocks, validating transactions, and maintaining consensus. They are the source of chain data, ensuring the Cardinal stack operates on accurate and up-to-date information.
In Cardinal, masters stream validated blocks, logs, and transactions through Cardinal-Streams to Cardinal-EVM and Cardinal-Flume. By managing consensus, masters allow Cardinal-EVM to focus on state execution and Cardinal-Flume to handle historical data. This separation improves scalability and fault tolerance, ensuring the system remains reliable even if one master encounters an issue. While EtherCattle used the same codebase for both masters and replicas, Cardinal draws a hard line between the two.
1. Cardinal EVM (execution layer)
Cardinal-EVM forms the execution layer of the Cardinal architecture. It is a heavily stripped-down version of Geth, providing Ethereum Virtual Machine (EVM) functionality without the overhead of full node operations like peer-to-peer networking or consensus. Cardinal-EVM maintains a lightweight execution state focused on executing smart contract bytecode and handling state-related RPC queries.
Unlike traditional Ethereum clients, Cardinal-EVM does not handle transactions, signing, receipts, logs, uncles, or block production. These responsibilities are unnecessary for the EVM’s primary role of state execution. Cardinal-EVM keeps track of only the essential data required for its operation, including Ethereum account data, contract code and storage, recent block headers (typically the last 128 blocks), and mappings between block headers and block numbers.
In terms of interaction, Cardinal-EVM exposes a subset of Web3 RPC methods for state-related operations, such as eth_call, eth_estimateGas, and eth_getBalance. It maintains only the latest execution state (without storing block history, transactions, receipts, or logs) and relies on Cardinal-Flume for accessing historical blockchain data. This separation keeps Cardinal-EVM lightweight and efficient, focusing solely on state execution without being burdened by deep block storage or transaction management.
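Because these are standard Web3 methods, any stock Ethereum client library can talk to a Cardinal-EVM endpoint. A small sketch (the endpoint URL and account address are placeholders):

```go
package main

import (
	"context"
	"log"

	"github.com/ethereum/go-ethereum/common"
	"github.com/ethereum/go-ethereum/ethclient"
)

func main() {
	// Endpoint of a Cardinal-EVM replica; URL and port are placeholders.
	client, err := ethclient.Dial("http://cardinal-evm.internal:8545")
	if err != nil {
		log.Fatal(err)
	}

	addr := common.HexToAddress("0x0000000000000000000000000000000000000000")

	// eth_getBalance against the latest (shallow) state.
	balance, err := client.BalanceAt(context.Background(), addr, nil)
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("balance: %s wei", balance.String())

	// A request for an old block or historical logs would be answered by
	// Cardinal-Flume instead, since the EVM keeps only ~128 recent headers.
}
```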
2. Cardinal Streams (data transport layer)
Cardinal-Streams replaces EtherCattle’s raw database replication with a more resilient and flexible protocol for streaming blockchain data. In EtherCattle, replication relied on streaming bit-for-bit database writes, which tightly coupled master and replica nodes and made it difficult to adapt to schema changes or support different clients.
Cardinal-Streams solves this by serializing and streaming structured block, log, and transaction data from master nodes to Cardinal-EVM and Cardinal-Flume through Kafka or similar messaging systems. It operates as a service that listens for data from master nodes, serializes it into structured formats, and streams it to configured destinations. By functioning as a middleware layer, it abstracts the complexities of data replication, ensuring that data flows efficiently between the components of the Cardinal architecture. This decoupled approach allows each part of the stack to handle data independently without worrying about schema consistency.
To ensure reliability, Cardinal-Streams introduces out-of-order message handling, deduplication, and fault tolerance. If a message arrives late or multiple times, Streams can piece it together without disrupting data flow. This reduces replication errors and simplifies scaling across different Ethereum clients or networks.
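As a rough illustration of the difference from raw database replication, the sketch below serializes a structured block message and keys it for deduplication; the real Cardinal-Streams protocol, topic names, and message schema are its own and differ from this:

```go
package main

import (
	"context"
	"encoding/json"
	"fmt"
	"log"

	"github.com/segmentio/kafka-go"
)

// Illustrative structured message; not the actual Cardinal-Streams format.
type BlockMessage struct {
	Hash   string   `json:"hash"`
	Number uint64   `json:"number"`
	Logs   []string `json:"logs"`
}

func main() {
	writer := &kafka.Writer{
		Addr:  kafka.TCP("localhost:9092"),
		Topic: "cardinal-blocks", // placeholder topic name
	}
	defer writer.Close()

	msg := BlockMessage{Hash: "0xabc123", Number: 19_000_000}
	payload, err := json.Marshal(msg)
	if err != nil {
		log.Fatal(err)
	}

	// Keying by block hash lets consumers deduplicate repeated deliveries
	// and reassemble out-of-order pieces, in the spirit described above.
	err = writer.WriteMessages(context.Background(), kafka.Message{
		Key:   []byte(fmt.Sprintf("block-%s", msg.Hash)),
		Value: payload,
	})
	if err != nil {
		log.Fatal(err)
	}
}
```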
3. Cardinal Storage (persistence layer)
Cardinal-Storage manages the persistence of blockchain data streamed through Cardinal-Streams, handling the storage of blocks, logs, and state. It is designed to store recent blocks for fast access while maintaining archival records for long-term queries and handling chain reorganizations.
Cardinal-Storage uses BadgerDB, in-memory storage, and overlay databases as its backends, allowing flexibility based on the needs of the deployment. This design lets operators choose the right balance between performance, scalability, and simplicity. By offloading data persistence from Cardinal-EVM, Cardinal-Storage ensures the execution layer remains lightweight, while historical and block data are efficiently managed in the background. This separation improves performance and allows each part of the Cardinal stack to operate independently without being tightly coupled.
For deeper historical access, Cardinal-Storage can be built with archival storage mode, enabling full blockchain history storage for larger or older queries. Currently, this mode is only used on the Holesky network.
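For a feel of what a BadgerDB-backed persistence layer looks like, here is a minimal sketch of writing and reading a header record; the key layout and encoding are made up for illustration and are not Cardinal-Storage’s actual schema:

```go
package main

import (
	"encoding/binary"
	"log"

	badger "github.com/dgraph-io/badger/v4"
)

func main() {
	// Open a Badger database; path and key scheme are illustrative.
	db, err := badger.Open(badger.DefaultOptions("./cardinal-data"))
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// Number-prefixed key for a block header record.
	headerKey := func(n uint64) []byte {
		key := append([]byte("h/"), make([]byte, 8)...)
		binary.BigEndian.PutUint64(key[2:], n)
		return key
	}

	// Store an encoded header under its block number.
	err = db.Update(func(txn *badger.Txn) error {
		return txn.Set(headerKey(19_000_000), []byte("encoded-header-bytes"))
	})
	if err != nil {
		log.Fatal(err)
	}

	// Read it back.
	err = db.View(func(txn *badger.Txn) error {
		item, err := txn.Get(headerKey(19_000_000))
		if err != nil {
			return err
		}
		val, err := item.ValueCopy(nil)
		if err != nil {
			return err
		}
		log.Printf("header bytes: %d", len(val))
		return nil
	})
	if err != nil {
		log.Fatal(err)
	}
}
```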
4. Cardinal-RPC (query and API layer)
Cardinal-RPC is not a running service or an active part of the Cardinal system but a utility library used to develop Web3-compatible RPC servers that interact with Cardinal’s components. It provides tools for exposing Ethereum RPC endpoints (eth_call, eth_getLogs, etc.) in applications built on the Cardinal architecture. Using Go’s reflection capabilities, Cardinal-RPC automates the mapping of Go methods to JSON-RPC endpoints, reducing the complexity of creating custom RPC servers. Instead of handling serialization and HTTP requests manually, developers can focus on building interfaces that directly interact with Cardinal-EVM for state-related queries and Cardinal-Flume for historical data.
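Cardinal-RPC’s own API surface isn’t shown here; as an analogue, the sketch below uses go-ethereum’s rpc package, which follows the same reflection-based pattern of turning exported Go methods into JSON-RPC endpoints (the service, namespace, and method are hypothetical):

```go
package main

import (
	"log"
	"net/http"

	"github.com/ethereum/go-ethereum/rpc"
)

// ExampleService's exported methods are mapped to JSON-RPC endpoints by
// reflection: Balance is served as "demo_balance" under the namespace below.
type ExampleService struct{}

func (s *ExampleService) Balance(account string) (string, error) {
	// In a real server this would consult Cardinal-EVM's state.
	return "0x0", nil
}

func main() {
	server := rpc.NewServer()
	if err := server.RegisterName("demo", &ExampleService{}); err != nil {
		log.Fatal(err)
	}
	// The rpc server doubles as an http.Handler.
	log.Fatal(http.ListenAndServe(":8545", server))
}
```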
5. Cardinal Flume (historical data layer)
Log Flume (the original Flume)
Cardinal-Flume is an evolution of an earlier project called Flume. The goal of Flume was to address inefficiencies in eth_getLogs queries on Ethereum clients. Traditional clients like Geth often struggled with long-range queries, requiring minutes to return results and slowing down dApps and services that depended on frequent access to event data emitted by smart contracts.
Flume was designed to offload log indexing from Ethereum nodes and provide a faster, more efficient querying solution. Running alongside Ethereum clients, Flume ingested log data directly from master nodes (e.g., Geth) via Kafka or websockets. It captured finalized blocks in real time, indexing logs and block information for optimized retrieval. Flume used SQLite for its storage layer, a lightweight embedded database capable of managing large datasets (up to 4TB) with query speeds in low milliseconds.
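The speed-up comes from ordinary database indexing. As a rough illustration, an indexed SQLite lookup replaces the per-block scan; the table and column names below are hypothetical, not Flume’s actual schema:

```go
package main

import (
	"database/sql"
	"log"

	_ "github.com/mattn/go-sqlite3" // SQLite driver
)

func main() {
	// Hypothetical index database; Flume's real schema differs.
	db, err := sql.Open("sqlite3", "./flume.db")
	if err != nil {
		log.Fatal(err)
	}
	defer db.Close()

	// With an index on (address, block_number), this long-range log query
	// becomes a few B-tree lookups instead of a scan over millions of blocks.
	rows, err := db.Query(
		`SELECT block_number, transaction_hash, data
		   FROM event_logs
		  WHERE address = ? AND block_number BETWEEN ? AND ?`,
		"0x0000000000000000000000000000000000000000", 10_000_000, 18_000_000,
	)
	if err != nil {
		log.Fatal(err)
	}
	defer rows.Close()

	count := 0
	for rows.Next() {
		count++
	}
	log.Printf("matched %d logs", count)
}
```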
To make querying more efficient, Flume exposed its own set of custom RPC methods prefixed with flume_ (e.g., flume_erc20ByAccount, flume_getTransactionsByParticipant). These methods bypassed slower native Ethereum RPCs, allowing users to directly access indexed data. Additionally, a plugin framework allowed Flume to adapt to specific networks, supporting custom indexing requirements through modular extensions.
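Since these are plain JSON-RPC methods, they can be called with any generic RPC client. A sketch using go-ethereum’s rpc client (the endpoint is a placeholder, and the parameter and result shapes assumed here are illustrative):

```go
package main

import (
	"context"
	"encoding/json"
	"log"

	"github.com/ethereum/go-ethereum/rpc"
)

func main() {
	// Flume endpoint is a placeholder; the method name comes from Flume's
	// custom namespace, but the argument shape is assumed for illustration.
	client, err := rpc.Dial("http://flume.internal:8000")
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	var result json.RawMessage
	err = client.CallContext(context.Background(), &result,
		"flume_erc20ByAccount", "0x0000000000000000000000000000000000000000")
	if err != nil {
		log.Fatal(err)
	}
	log.Printf("raw response: %s", result)
}
```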
Over time, Flume evolved into a robust historical data indexer, broadening its scope to include blocks, transactions, and receipts. By the time it was integrated into Cardinal, Flume had become a standalone solution for managing Ethereum’s historical data, complementing EtherCattle’s master-replica architecture by offloading log and block queries.
Flume Reinvented as Cardinal-Flume
With the introduction of Cardinal, Flume was reinvented as Cardinal-Flume, seamlessly integrated into Cardinal’s modular architecture. Its role shifted from a standalone indexer to the dedicated historical data layer for Cardinal, managing logs, blocks, transactions, and receipts for archival and query purposes.
Cardinal-Flume receives data directly from Cardinal-EVM via Cardinal-Streams. As Cardinal-EVM prunes block data and state beyond the last 128 blocks, these are streamed to Flume for long-term storage and indexing. Similar to its earlier design, Cardinal-Flume uses SQLite as its primary database. It integrates with Cardinal-Storage to allow scalability and backend flexibility while maintaining performance.
Flume processes historical RPC queries (eth_getLogs, eth_getBlockByNumber), which are routed through Cardinal-RPC, ensuring that the execution layer (Cardinal-EVM) is not overloaded with archival requests. By offloading these tasks, Cardinal-Flume allows Cardinal-EVM to remain lightweight and focused solely on live state execution.
In Cardinal, Flume operates in two modes:
- Flume Light maintains a database of recent blocks (typically the last 30 minutes). This mode is optimized for frequent queries without requiring a full historical dataset.
- Flume Heavy stores the complete blockchain history, including logs, blocks, transactions, and receipts, ensuring that long-range queries can be satisfied without limitation.
In essence, Cardinal-Flume works as the archival engine for Cardinal, indexing and storing all data beyond the shallow state retained by Cardinal-EVM. Integrated tightly with Cardinal-Streams, Flume ensures reliable data flow from master nodes to storage, providing fast and scalable access to Ethereum’s historical data.
Cardinal-Types
Cardinal-Types is a utility library that provides shared data types, constants, and utility functions used across the Cardinal stack. While not a running component, it plays a crucial role in the development architecture, acting as a dependency for ensuring a standardized interface and improving code maintainability within the ecosystem. It is relied on for shared definitions, such as block structures, log schemas, and serialization utilities, making it an integral part of building and maintaining Cardinal.
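Illustratively, a shared-types library looks something like the sketch below; the actual Cardinal-Types definitions differ, but the role is the same: one set of structures that masters, Cardinal-EVM, Cardinal-Streams, and Cardinal-Flume all import.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Illustrative stand-ins for the kind of shared definitions such a library
// provides; these are not Cardinal-Types' actual types.
type (
	Hash    [32]byte
	Address [20]byte
)

// Log is the sort of structure a master emits, the streams layer transports,
// and the indexer stores, all referring to one shared definition.
type Log struct {
	Address     Address `json:"address"`
	Topics      []Hash  `json:"topics"`
	Data        []byte  `json:"data"`
	BlockNumber uint64  `json:"blockNumber"`
}

func main() {
	l := Log{BlockNumber: 19_000_000}
	out, _ := json.Marshal(l)
	fmt.Println(string(out))
}
```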
How Cardinal Works Together
The master nodes connect to the peer-to-peer network and validate blocks, transactions, and logs. These nodes stream the validated data through Cardinal-Streams, which transports it to both Cardinal-EVM and Cardinal-Flume. Cardinal-Streams ensures reliable delivery of structured data, eliminating the schema dependency that tightly coupled components in EtherCattle.
Cardinal-EVM receives execution-relevant data from Cardinal-Streams to update its lightweight execution state, which includes account balances, contract code, and storage. It handles real-time RPC queries, such as contract calls and balance checks, using its shallow state (typically retaining only the last 128 blocks). Since Cardinal-EVM does not store historical data, older blocks, transactions, and logs are managed by Cardinal-Flume, which archives them for long-term retrieval. This separation allows the EVM to remain lightweight and focused solely on live operations, while historical data is efficiently managed in the background.
Cardinal-Flume is responsible for archiving and indexing blockchain data, ensuring that older blocks, logs, and transactions remain accessible for long-range queries like eth_getLogs. Flume works in two modes: Flume Light, which keeps only recent data (e.g., the last 30 minutes) for short-term queries, and Flume Heavy, which maintains the complete blockchain history for deeper archival access. Whether Flume operates in light or heavy mode depends on how it is configured; Flume Light is often used when storage resources are constrained or for deployments prioritizing speed over comprehensive archival. Cardinal-Storage supports both Flume and the EVM by persisting indexed data and ensuring historical blockchain data is available when needed. Storage acts as a foundational layer, ensuring that Flume’s SQLite-based databases have a reliable and scalable backend.
This architecture resolves EtherCattle’s inefficiencies by decoupling execution from storage, enabling independent scaling of components, and providing robust solutions for real-time and historical data access.
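One way to picture the request path is a thin JSON-RPC router in front of the stack: state methods go to Cardinal-EVM, while historical and flume_ methods go to Cardinal-Flume. The sketch below is illustrative only; the endpoints, method lists, and routing rules of the stack’s actual proxy layer are more complete than this.

```go
package main

import (
	"bytes"
	"encoding/json"
	"io"
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
	"strings"
)

// Illustrative routing table: historical methods are served by Flume,
// everything else falls through to Cardinal-EVM.
var flumeMethods = map[string]bool{
	"eth_getLogs":               true,
	"eth_getBlockByNumber":      true,
	"eth_getTransactionReceipt": true,
}

func main() {
	evmURL, _ := url.Parse("http://cardinal-evm.internal:8545")     // placeholder
	flumeURL, _ := url.Parse("http://cardinal-flume.internal:8000") // placeholder
	evm := httputil.NewSingleHostReverseProxy(evmURL)
	flume := httputil.NewSingleHostReverseProxy(flumeURL)

	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		// Peek at the JSON-RPC method, then restore the body for proxying.
		body, _ := io.ReadAll(r.Body)
		r.Body = io.NopCloser(bytes.NewReader(body))

		var req struct {
			Method string `json:"method"`
		}
		_ = json.Unmarshal(body, &req)

		if flumeMethods[req.Method] || strings.HasPrefix(req.Method, "flume_") {
			flume.ServeHTTP(w, r)
			return
		}
		evm.ServeHTTP(w, r)
	})
	log.Fatal(http.ListenAndServe(":8545", nil))
}
```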
Conclusion
Cardinal addresses EtherCattle’s most pressing issues with a modular architecture that separates state execution, data streaming, and historical storage. This decoupling resolves the tight coupling between master and replica nodes, allowing Cardinal to operate independently of the schema constraints that plagued EtherCattle. By streaming structured data instead of raw database writes, Cardinal eliminates the inefficiencies of redundant storage and complex synchronization.
Through this design, Cardinal achieves faster access to data, with Cardinal-EVM handling real-time operations and Cardinal-Flume indexing and storing historical data for efficient long-range queries. Flume’s light and heavy modes ensure flexibility, enabling rapid data access where needed or full archival access for deep queries. These capabilities ensure that dApps and services can scale effectively, retrieving data quickly without burdening the execution layer.
Beyond EtherCattle’s limitations, Cardinal introduces scalability and fault tolerance. Each component—whether EVM, Flume, or Storage—operates independently, allowing for horizontal scaling to handle increased workloads. Master nodes can work in tandem, ensuring continuity even if one node encounters an issue, while Flume’s indexing ensures that historical data remains readily accessible without impacting live state operations.
- https://github.com/ethereum/go-ethereum/
- https://github.com/openrelayxyz/ethercattle-deployment
- https://ether-cattle-initiative.readthedocs.io/en/latest/topics/design/index.html
- https://github.com/openrelayxyz/cardinal-evm
- https://github.com/openrelayxyz/cardinal-streams
- https://github.com/openrelayxyz/cardinal-storage
- https://github.com/openrelayxyz/cardinal-flume
- https://github.com/openrelayxyz/log-flume
- https://github.com/openrelayxyz/cardinal-proxy
- https://github.com/openrelayxyz/cardinal-types
- https://blog.rivet.cloud/flume/
- https://blog.openrelay.xyz/ethercattle-initiative-request-for-testers/
- https://blog.openrelay.xyz/storing-data-in-ethereum/