Objective: This week we move from defining overall architectural styles to understanding the underlying mechanisms that allow geographically separated components to communicate reliably and coordinate their actions without a centralized clock. This session maps closely to the material in the van Steen & Tanenbaum Distributed Systems textbook.
📖 Reading Assignment: Before proceeding with this week’s exercises, please read Chapter 4 (Communication) and Chapter 6 (Coordination) in Distributed Systems (van Steen and Tanenbaum, Version 4.0.3x).
At the core of any distributed system is the ability to move data between nodes. While we have explored raw TCP/UDP sockets (Week 2/3), enterprise systems rely on higher-level abstractions.
The aim of RPC is to hide communication by making a remote function call look identical to a local function call.
The problem is partial failure: if a client calls charge_credit_card() and the connection drops, did the server execute the call but fail to send a reply, or did the server crash before execution? Dealing with at-least-once vs. at-most-once semantics introduces significant complexity.

Unlike RPC's synchronous, tightly coupled nature, Message-Oriented Middleware (MOM) allows asynchronous decoupling.
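To make the decoupling concrete, here is a toy in-process sketch using Python's standard queue module as a stand-in for a real broker (the message contents and thread setup are purely illustrative): the producer returns immediately after enqueueing, and the consumer processes the message whenever it is ready.

```python
import queue
import threading

jobs = queue.Queue()  # stand-in for the broker: producer and consumer never talk directly

def producer():
    # Fire-and-forget: put() returns immediately, no waiting for the consumer's reply
    jobs.put({"order_id": 42, "amount": 50})

def consumer():
    while True:
        msg = jobs.get()          # blocks until a message arrives, whenever that is
        print("processing", msg)  # e.g., charge the card, update the database
        jobs.task_done()

threading.Thread(target=consumer, daemon=True).start()
producer()
jobs.join()  # wait until the consumer has drained the queue
```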
If multiple independent computers are working on a shared problem, how do they agree on the state of the world?
In a single machine, there is one CPU clock. In a distributed system, every computer has its own physical quartz clock. Over time, these clocks drift apart due to microscopic hardware differences.
Many distributed systems require one node to act as a coordinator. If that coordinator crashes, the remaining nodes must elect a new one without human intervention.
Implementing consensus algorithms (like Paxos or Raft) from scratch is notoriously difficult. Modern cloud-native architectures rely on highly consistent coordination services like Apache ZooKeeper (or etcd).
ZooKeeper is a centralized open-source service for maintaining configuration information, naming, providing distributed synchronization, and offering group services over large clusters.
ZooKeeper stores coordination state in a hierarchical namespace of small data nodes called znodes. Instead of building the Bully algorithm, nodes simply use ZooKeeper's atomic znode creation:
1. Every node attempts to create an ephemeral znode at /cluster/leader.
2. Exactly one creation succeeds; that node becomes the leader. The losers set a watch on /cluster/leader and go to sleep.
3. If the leader crashes, its ephemeral znode is deleted automatically, the watchers wake up, and all nodes race to create /cluster/leader again.
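A minimal sketch of that recipe using the kazoo client (the ZooKeeper host name, the node identifier, and the helper names are assumptions for illustration, not the course's reference implementation):

```python
from kazoo.client import KazooClient
from kazoo.exceptions import NodeExistsError

zk = KazooClient(hosts="zookeeper:2181")  # assumed in-cluster service address
zk.start()

def try_to_lead(node_id: str) -> bool:
    """Race for leadership by atomically creating the ephemeral leader znode."""
    try:
        # ephemeral=True: the znode vanishes if this node's session dies,
        # which is exactly what triggers the re-election
        zk.create("/cluster/leader", node_id.encode(), ephemeral=True, makepath=True)
        return True
    except NodeExistsError:
        return False  # somebody else already won

def on_leader_change(data, stat):
    # The watch fires when /cluster/leader changes or disappears
    if stat is None and try_to_lead("node-42"):
        print("I am the new leader")

if try_to_lead("node-42"):
    print("I am the leader")
else:
    # Losers go to sleep until the leader's ephemeral znode vanishes
    zk.DataWatch("/cluster/leader", on_leader_change)
```

The ephemeral flag is what makes the recipe fault-tolerant: no custom heartbeat code is needed, because ZooKeeper's session expiry deletes the leader znode for you.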
Task: The ZooKeeper Homogeneous MapReduce Array

Move to the k8s_zk_template directory. We are rebuilding our Week 5 image-equalization engine to be completely fault-tolerant and homogeneous.
You must deploy a Kubernetes application where all Pods run the exact same app.py code. Upon startup, the pods connect to a central ZooKeeper service in the cluster.
- The pods use the kazoo library to elect a Master.
- The winning pod patches the label role: master onto itself.
- The internal Kubernetes Service http://master-service then routes user traffic to that specific pod!
- The Master publishes work as /jobs znodes. The ephemeral workers watch this path, claim specific chunk jobs, and return their histogram results.

Before tackling the real-world architectural problems below, ensure you understand the following challenging technical vocabulary implicitly relied upon in Distributed Systems:
Idempotency: an operation is idempotent if applying it many times has the same effect as applying it once (e.g., set_x(5) is idempotent, whereas add_to_x(1) is not). In distributed network architectures, designing API handlers to be idempotent (often by passing unique Transaction UUIDs) allows clients to safely retry failed network requests without causing duplicate actions (like double-charging a credit card).

The following problems bridge the theoretical concepts of Communication (Chapter 4) and Coordination (Chapter 6) into practical, real-world data center and internet-scale scenarios. Reference the van Steen & Tanenbaum (v4.0.3x) textbook for foundational models.
Reference: Chapter 4: Communication (RPC Semantics, Sec. 4.2.2, ~p. 182-185)
Scenario: An e-commerce backend in an AWS Data Center uses Remote Procedure Calls (RPC) to communicate between a CartService and a BillingService. When a customer checks out over the internet (often dropping cellular connection), the CartService issues an RPC process_payment(user_id, $50) to the BillingService. A timeout occurs and no response is returned.
Question: What is the difference between At-least-once and At-most-once semantics in this scenario? Which is safer for internet-scale financial transactions, and how would you implement it from a design perspective?
Worked Solution:
At-least-once: the CartService keeps retrying the RPC until it receives an ACK. If the network dropped the reply but not the request, the BillingService might execute the $50 charge multiple times.

At-most-once: the BillingService guarantees the charge executes no more than once; a lost reply means the client cannot tell whether the charge happened, but the customer is never double-charged. This is the safer semantic for internet-scale financial transactions.

Design: the CartService generates a unique Transaction ID (UUID) and passes it in the RPC: process_payment(user_id, $50, TXN_1234). We can now safely use an at-least-once retry loop. The BillingService coordinates by storing TXN_1234 in a database. If the RPC is retried, the BillingService sees the UUID, skips the credit card charge, and simply returns the cached success response.
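A toy sketch of this idempotent design (the in-memory dict stands in for the database, and charge_credit_card is a hypothetical stub; the real services would be separate processes communicating over RPC):

```python
import uuid

_processed = {}  # txn_id -> cached receipt; in production this is a database table

def charge_credit_card(user_id, amount):
    # Stand-in for the real, non-idempotent side effect
    return {"user": user_id, "amount": amount, "receipt": str(uuid.uuid4())}

def process_payment(user_id, amount, txn_id):
    """BillingService handler: retries carrying the same txn_id never charge twice."""
    if txn_id in _processed:
        return _processed[txn_id]        # duplicate retry: return the cached response
    receipt = charge_credit_card(user_id, amount)
    _processed[txn_id] = receipt
    return receipt

# CartService side: an at-least-once retry loop is now safe
txn = str(uuid.uuid4())
first = process_payment("alice", 50, txn)
retry = process_payment("alice", 50, txn)  # simulate a retry after a lost reply
assert first == retry                      # only one real charge happened
```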
Reference: Chapter 6: Coordination (Logical Clocks, Sec. 6.2, ~p. 306-310)

Scenario: You have a distributed NoSQL database spanning three separate data centers (Nodes A, B, and C). Each node has its own physical quartz clock, and these clocks suffer from drift.

Question: How do Lamport's logical clocks let the nodes agree on the order of events without needing to physically synchronize the server clocks using NTP?
Worked Solution:
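The essential rule: each node keeps an integer counter, increments it before every local event and every send, and on receiving a message sets it to max(own, received) + 1. Causally related events therefore get increasing timestamps regardless of how far the physical clocks have drifted. A minimal sketch (class and method names are illustrative):

```python
class LamportClock:
    """Illustrative logical clock for one node (A, B, or C)."""

    def __init__(self):
        self.counter = 0

    def local_event(self):
        self.counter += 1          # rule 1: increment before every local event
        return self.counter

    def send(self):
        self.counter += 1          # a send is also a local event
        return self.counter        # this timestamp travels with the message

    def receive(self, msg_timestamp):
        # rule 2: jump past the sender's timestamp, then increment
        self.counter = max(self.counter, msg_timestamp) + 1
        return self.counter

# Node A writes a record, then replicates it to Node B.
a, b = LamportClock(), LamportClock()
ts = a.send()                      # A's clock: 1
b_ts = b.receive(ts)               # B's clock jumps to 2, even if B's quartz clock lags
assert ts < b_ts                   # causally related events are ordered correctly
```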
Reference: Chapter 6: Coordination (Election Algorithms, Sec. 6.5, ~p. 339-344) & Chapter 4: Communication (Message-Oriented, Sec. 4.3, ~p. 192-205)
Scenario: Millions of IoT smart-thermostats (Edge devices) connect over the public internet to a massive Data Center cluster. They communicate purely via an asynchronous, topic-based Publish-Subscribe (Pub/Sub) message broker (like MQTT).
Within the Data Center, a cluster of 5 Analytics nodes processes this data. Only one node can act as the “Coordinator” responsible for writing data to the Master SQL Database. The current Coordinator node catches on fire and is destroyed.
Question: Work through the exact steps of how the remaining 4 nodes utilize an Election Algorithm to coordinate, and how they subsequently communicate the result to the asynchronous Pub/Sub broker to ensure IoT traffic is not lost during the outage.
Worked Solution (The Bully Algorithm + MOM):
1. Node 3 detects that the Coordinator (Node 5, the highest ID) has stopped responding and sends an ELECTION message to all nodes with a higher ID (Node 4 and Node 5).
2. Node 4 is alive and replies with an OK message, forcing Node 3 to stand down.
3. Node 4 now sends its own ELECTION message to higher nodes (Node 5). The destroyed Node 5 never answers.
4. Having received no replies, Node 4 declares itself the winner and broadcasts a COORDINATOR message to Nodes 1, 2, and 3. The cluster is re-coordinated.
5. MOM's role: because the Pub/Sub broker decouples the IoT thermostats from the Analytics cluster, the devices kept publishing during the outage and the broker buffered their messages. The new Coordinator subscribes to the thermostat/data topic and begins draining the buffered message queue, writing the backlog to the SQL database.
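A compact simulation of the winner-selection logic in those steps (the ELECTION/OK/COORDINATOR message exchange is collapsed into a loop, and the function and parameter names are purely illustrative):

```python
def bully_election(initiator, alive, all_ids):
    """Return the id that wins a Bully election started by `initiator`.

    alive:   ids of nodes still running
    all_ids: every id in the cluster, including the dead coordinator
    """
    candidate = initiator
    while True:
        higher = [n for n in all_ids if n > candidate]
        responders = [n for n in higher if n in alive]   # nodes that answer ELECTION with OK
        if not responders:
            return candidate            # nobody bigger answered: candidate becomes coordinator
        candidate = max(responders)     # the highest responder takes over the election

# Node 5 (the old Coordinator) is destroyed; Node 3 notices first.
assert bully_election(initiator=3, alive={1, 2, 3, 4}, all_ids={1, 2, 3, 4, 5}) == 4
```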
To bridge the gap between abstract theory and your practical Kubernetes exercises, let's explicitly compare the architectural topology you built in Week 5 (k8s_histogram_eq) against the highly resilient array you are deploying today in Week 9 (k8s_zk_template).

In Week 5, we approached Kubernetes with a traditional Service-Oriented mindset:
- One explicit master Deployment and one explicit worker Deployment.

Today, we are moving to a purely distributed Event-Based mindset powered by an external Coordination Engine (Apache ZooKeeper).
Every pod boots with identical code and uses the kazoo Python library to race for a ZooKeeper lock.
The single pod that wins the race assumes the Master role. It then relabels itself: it issues a call to the internal Kubernetes API to patch its own Pod label to role: master.
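A rough sketch of that self-patching step using the official kubernetes Python client (reading the pod name from the HOSTNAME environment variable and the namespace from POD_NAMESPACE are assumptions; the real app.py may obtain them differently):

```python
import os
from kubernetes import client, config

config.load_incluster_config()              # we are running inside the cluster
v1 = client.CoreV1Api()

pod_name = os.environ["HOSTNAME"]           # by default a pod's hostname is its own name
namespace = os.environ.get("POD_NAMESPACE", "default")

# Merge-patch only the label; the Service selector role=master now matches this pod
body = {"metadata": {"labels": {"role": "master"}}}
v1.patch_namespaced_pod(name=pod_name, namespace=namespace, body=body)
```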
Because our internal Kubernetes Service specifically targets role: master, the internal DNS seamlessly redirects all incoming user web-traffic strictly to whatever pod won the election! If that pod dies, another pod wins the election, changes its label, and the DNS dynamically swings to the new pod in milliseconds.

The Master then posts JSON jobs inside ZooKeeper (/jobs/<job_id>).
The identical “Worker” pods simply place a Watch on that directory. When a job drops, they race to grab it.

To visually understand the real-time asynchronous data flow of the kazoo ZooKeeper Watchers, native K8s log shipping, and the eventlet.tpool OS-thread offloading mechanism, reference the execution diagrams below:
sequenceDiagram
participant U as User (Frontend)
participant M as Master Pod (zk-app)
participant Zk as ZooKeeper
participant W as Worker Pods (1..4)
U->>M: POST /upload (File Payload)
M->>Zk: zk.retry(ensure_path, '/jobs/job_123')
M->>M: eventlet.tpool (OS Thread): extract_chunks()
rect rgb(30, 45, 60)
Note right of M: Map Phase (Histograms)
M->>Zk: zk.create('/jobs/job_123/histograms/task_0..4')
Zk-->>W: Kazoo ChildrenWatch Trigger
W->>Zk: zk.create('.../task_0/lock', ephemeral=True)
Note over W: Worker claims task!
W->>W: eventlet.tpool: process_histogram_task()
W->>Zk: zk.set(..., histogram_data)
W->>Zk: zk.create('.../task_0/done')
end
M->>Zk: Polling barrier on 'done' znodes
Note over M: Master awaits all workers
M->>M: eventlet.tpool: compute_global_cdf(histograms)
rect rgb(30, 45, 60)
Note right of M: Reduce Phase (Equalization)
M->>Zk: zk.create('/jobs/job_123/equalize/task_0..4', CDF_Payload)
Zk-->>W: Kazoo ChildrenWatch Trigger
W->>W: eventlet.tpool: apply_cdf_task(CDF_Payload)
W->>Zk: zk.create('.../task_0/done')
end
M->>M: eventlet.tpool: stitch_image()
M->>U: SocketIO.emit('job_status', {url: final_path})
M->>Zk: zk.delete('/jobs/job_123', recursive=True)
[!NOTE] Why eventlet.tpool? If massive numpy image-slicing executions (e.g., 75 MB .tiff chunks) run purely within Python's Eventlet green threads, they consume 100% of the single CPU process and starve the Kazoo ZooKeeper heartbeat thread. By explicitly wrapping the synchronous, CPU-bound bottlenecks in eventlet.tpool.execute(), the MapReduce processing is safely delegated to real OS threads while the asynchronous network socket hub stays available for continuous ZooKeeper heartbeats.
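Putting the two ideas together, here is an illustrative worker-side sketch (the znode paths and helper names mirror the diagram above but are assumptions, not the literal app.py code): the ChildrenWatch fires when the Master creates tasks, an ephemeral lock claims exactly one of them, and the numpy work is pushed onto a real OS thread via tpool.execute().

```python
import numpy as np
from eventlet import tpool
from kazoo.client import KazooClient
from kazoo.exceptions import NodeExistsError

zk = KazooClient(hosts="zookeeper:2181")
zk.start()
zk.ensure_path("/jobs/job_123/histograms")

def compute_histogram(chunk):
    # CPU-bound numpy work; runs on a real OS thread so the green-thread hub
    # (and the ZooKeeper heartbeat) never stalls
    hist, _ = np.histogram(chunk, bins=256, range=(0, 256))
    return hist

@zk.ChildrenWatch("/jobs/job_123/histograms")
def claim_tasks(children):
    for task in children:
        base = f"/jobs/job_123/histograms/{task}"
        try:
            # Ephemeral lock: exactly one worker wins each task, and the lock
            # vanishes automatically if the winner crashes mid-task
            zk.create(base + "/lock", ephemeral=True)
        except NodeExistsError:
            continue  # another worker already claimed this task
        chunk = np.frombuffer(zk.get(base)[0], dtype=np.uint8)
        hist = tpool.execute(compute_histogram, chunk)  # offload to an OS thread
        zk.set(base, hist.tobytes())   # publish the partial histogram
        zk.create(base + "/done")      # signal the Master's barrier
```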
Now that we have covered the theory of message queues and ZooKeeper-orchestrated consensus, you can explore the application code underpinning the MapReduce infrastructure.
Because GitHub does not allow you to download a single sub-directory directly from the web UI, you have two primary methods for grabbing the project folder and deploying it locally onto your Minikube cluster:
git clone https://github.com/robmarano/robmarano.github.io.git
cd robmarano.github.io/courses/ece465/2026/weeks/week_09/k8s_zk_template
Alternatively, extract just the k8s_zk_template directory into an isolated .zip archive (for example, by downloading the full repository ZIP and keeping only that folder).
Before downloading anything to your local machine, you can browse the key architectural code behind the MapReduce stack right here in your browser using GitHub's file viewer.
Click through the file links below to study the Python backend's kazoo execution model, or to see how the frontend builds its CSS Masonry grid while guarding against cross-site scripting (XSS):
- app.py (Master/Worker ZooKeeper Event Hub): handles the primary Socket.IO web hooks, the kazoo.recipe.election races, the ChildrenWatch event registrations, and the asynchronous tpool offloading.
- core/image_processing.py (MapReduce Data Pipeline): handles extracting .tiff image partitions, serializing JSON histogram chunks for the znode payloads, and computing the final equalization results.
- templates/index.html (WebSockets & XSS Grid UI): examine the DOM element instantiation (document.createElement) used to build live, secure, asynchronous terminal feeds.
- k8s/zookeeper.yaml & k8s/app.yaml: dive into the manifests deploying the isolated ZooKeeper node versus the 5 homogeneous MapReduce Python nodes.