Communication & RPC

In distributed systems, communication is the backbone that enables processes running on different machines to coordinate and share data. It involves mechanisms like message passing, sockets, and middleware to handle network delays, data serialization, and fault tolerance. Remote Procedure Call (RPC) is a higher-level abstraction that allows a program to execute procedures on a remote machine as if they were local, hiding the underlying complexity of networking.

1. How does message passing differ from shared memory in distributed systems, and what are the trade-offs?

Message Passing: Processes exchange information by sending/receiving messages via a network.

Advantages: Works naturally in physically distributed systems. Provides loose coupling (sender and receiver can run on different hardware/OS). Easier to implement security and fault tolerance at the protocol level.
Disadvantages: Network latency and bandwidth overhead. Possible issues with ordering, message loss, and duplication.
Use Case: Microservices communicating over REST/gRPC.

Shared Memory: Processes communicate by reading/writing to a common memory region.

Advantages: Very fast (no serialization/networking overhead). Suits tightly coupled setups such as multi-core machines or a single node.
Disadvantages: Needs synchronization primitives (locks, semaphores). Maintaining coherence across multiple physical nodes is hard.
Use Case: Multithreaded applications on a single machine.

Trade-Off:

Message passing scales better across machines, so it is the natural fit for distributed systems.
Shared memory is faster but limited to tightly coupled environments.
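
The contrast can be sketched on a single machine, with threads standing in for processes. This is only an analogy: the queue plays the role of the network, and the names (add_shared, worker, inbox) are made up for illustration.

```python
import queue
import threading

# Shared memory: both threads update one variable, so a lock is required.
counter = 0
counter_lock = threading.Lock()

def add_shared(n):
    global counter
    for _ in range(n):
        with counter_lock:  # synchronization primitive guards the shared state
            counter += 1

# Message passing: the worker owns its state and only receives messages.
inbox = queue.Queue()
totals = []

def worker():
    total = 0
    while True:
        msg = inbox.get()
        if msg is None:  # sentinel message: stop
            break
        total += msg
    totals.append(total)

threads = [threading.Thread(target=add_shared, args=(1000,)) for _ in range(2)]
consumer = threading.Thread(target=worker)
for t in threads + [consumer]:
    t.start()
for _ in range(2000):
    inbox.put(1)
inbox.put(None)
for t in threads + [consumer]:
    t.join()
```

The shared-memory half needs the lock to stay correct, while the message-passing half needs none because only the worker touches its total.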

2. Explain the concept of synchronous vs. asynchronous RPC and when each is used.

Synchronous RPC:

Client sends a request and waits (blocks) until the server replies.
Pros: Simple programming model, predictable behavior.
Cons: Client sits idle during delays and can block indefinitely unless a timeout is set.
Use Case: Authentication calls, database queries that must return before proceeding.

Asynchronous RPC:

Client sends a request and continues execution without waiting.
Response delivered later via callback, future/promise, or polling.
Pros: Improves concurrency, reduces client idle time.
Cons: More complex error handling, harder to reason about control flow.
Use Case: Long-running tasks (e.g., video transcoding, machine learning jobs).
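
A minimal sketch of the two calling styles, assuming a hypothetical remote_square procedure whose sleep stands in for network delay. A thread pool plays the role of the RPC runtime here; a real framework would supply its own futures or callbacks.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def remote_square(x):
    """Hypothetical remote procedure; the sleep stands in for network delay."""
    time.sleep(0.05)
    return x * x

# Synchronous style: the caller blocks until the reply arrives.
sync_result = remote_square(4)

# Asynchronous style: submit() returns a future immediately.
with ThreadPoolExecutor(max_workers=2) as pool:
    future = pool.submit(remote_square, 5)
    local_work = sum(range(10))                # caller keeps working meanwhile
    async_result = future.result(timeout=1.0)  # collect the reply later
```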

3. How does RPC handle heterogeneity in distributed systems?

Challenge: Different machines may have different OS, hardware, data representations (e.g., big-endian vs. little-endian).

Solutions in RPC:

Stubs: Auto-generated client and server-side code that marshal (pack) and unmarshal (unpack) parameters.
Serialization: Converts complex data into platform-independent format (JSON, XML, Protobuf, Avro).
Transport Layer Abstraction: RPC frameworks hide underlying protocols (TCP, UDP, HTTP/2).

Example: gRPC allows Python services to talk with Java or Go services using Protobuf, ensuring type safety and interoperability across heterogeneous platforms.
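
The byte-order problem can be illustrated with Python's struct module. This is a small sketch of why both sides must agree on a wire format, not a picture of how any particular framework encodes data.

```python
import struct

value = 1

big = struct.pack(">I", value)     # big-endian (network byte order)
little = struct.pack("<I", value)  # little-endian

# Decoding with the wrong byte order silently yields a different number.
misread = struct.unpack("<I", big)[0]

# Agreeing on one wire order makes the round trip platform-independent.
roundtrip = struct.unpack(">I", big)[0]
```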

4. Explain how at-most-once and exactly-once semantics are achieved in RPC.

At-Most-Once Semantics:

Ensures the request executes 0 or 1 times (never more than once).
Mechanisms: Unique request IDs, duplicate request suppression, caching responses.
Limitation: If a request is lost, it may not execute at all.
Use Case: File write operations where duplicate writes must be avoided.

Exactly-Once Semantics:

Ensures the request executes one and only one time, even with retries.
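Mechanisms: Persistent logging, acknowledgment tracking, idempotent operations. In practice this is usually at-least-once delivery (retries) combined with deduplication or idempotency, so that the effect is applied only once.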
Limitation: More expensive to implement due to state tracking.
Example: Banking systems (to prevent double debit/credit on retries).

Comparison:

At-most-once: Safer but may miss execution.
Exactly-once: Stronger guarantee but requires more coordination.

5. How does RPC manage failures such as network crashes or server crashes?

Types of Failures:

Client crash (the reply has no recipient, and the server may be left doing orphaned work).
Server crash (request received but not processed fully).
Network crash/partition (messages delayed, lost, or duplicated).

RPC Failure Handling Mechanisms:

Timeouts: Client detects failure when response not received in expected time.
Retries: Safe for idempotent operations (e.g., read requests).
Error codes/Exceptions: Client is explicitly informed of failure.
Circuit Breakers: Prevent cascading failures by halting requests to unhealthy services.
Failover: Clients redirect to backup servers.

Example: gRPC supports retry policies, deadlines, and backoff strategies to handle transient failures gracefully.
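
The circuit-breaker idea from the list above can be sketched as follows. The class, its parameters (threshold, cooldown) and the injected clock are illustrative assumptions, not the API of any specific library.

```python
class CircuitBreaker:
    """Minimal breaker: opens after `threshold` consecutive failures."""

    def __init__(self, threshold, cooldown, clock):
        self.threshold = threshold
        self.cooldown = cooldown
        self.clock = clock          # injectable time source (assumption)
        self.failures = 0
        self.opened_at = None

    def call(self, fn):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.cooldown:
                raise RuntimeError("circuit open: call rejected")
            self.opened_at = None   # cooldown over: allow a trial call
        try:
            result = fn()
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = self.clock()
            raise
        self.failures = 0
        return result

now = [0.0]
breaker = CircuitBreaker(threshold=2, cooldown=10.0, clock=lambda: now[0])

def failing():
    raise ConnectionError("server unreachable")

outcomes = []
for _ in range(3):
    try:
        breaker.call(failing)
    except ConnectionError:
        outcomes.append("failed")
    except RuntimeError:
        outcomes.append("rejected")

now[0] = 11.0                       # cooldown has elapsed
recovered = breaker.call(lambda: "ok")
```

The third call never reaches the unhealthy service: the breaker rejects it locally, which is what stops a failure from cascading.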

6. How does remote procedure call transparency impact distributed system design?

Transparency Goal: RPC hides distribution details, making remote calls look like local calls (location, replication, concurrency transparency).
Advantage: Developers can focus on business logic without handling networking details.
Disadvantage: Over-transparency hides latency, failures, and retries, causing unrealistic assumptions (e.g., thinking every call is instantaneous).
Example: In microservices, blindly chaining RPC calls may cause cascading failures if latency or partial failures are not considered.

7. Describe the difference between stub-based and message-based RPC implementations.

Stub-based RPC

Uses automatically generated client/server stubs.
Handles marshalling/unmarshalling (parameter packing).
Safer & developer-friendly (type-checking, code generation).
Example: gRPC, CORBA.

Message-based RPC

Developers explicitly send structured messages (e.g., JSON, XML).
More flexible but requires manual parsing and error handling.
Error-prone (no strict type enforcement).
Example: REST APIs over HTTP/JSON.
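
The difference can be sketched in-process, assuming a hypothetical JSON-speaking server function (server_dispatch) and a hand-written CalculatorStub that stands in for generated code.

```python
import json

def server_dispatch(raw):
    """Hypothetical server: parses a JSON message and runs the named method."""
    msg = json.loads(raw)
    if msg["method"] == "add":
        return json.dumps({"result": msg["params"]["a"] + msg["params"]["b"]})
    return json.dumps({"error": "unknown method"})

# Message-based: the caller builds and parses messages by hand.
raw_reply = server_dispatch(
    json.dumps({"method": "add", "params": {"a": 2, "b": 3}})
)
manual_result = json.loads(raw_reply)["result"]

# Stub-based: a (normally generated) stub hides the same steps
# behind an ordinary typed method.
class CalculatorStub:
    def add(self, a: int, b: int) -> int:
        request = json.dumps({"method": "add", "params": {"a": a, "b": b}})
        return json.loads(server_dispatch(request))["result"]

stub_result = CalculatorStub().add(2, 3)
```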

8. How do distributed systems ensure ordering and reliability in RPC calls?

Ordering:

Sequence numbers / logical clocks ensure dependent RPCs are applied in correct order.
Lamport clocks (with a tie-breaker such as process ID) can provide a total order; vector clocks capture causal (partial) order and detect concurrent events.

Reliability:

TCP ensures reliable, in-order delivery within a single connection, but not across reconnects.
RPC frameworks add ACKs, retries, and duplicate suppression on top (essential when running over UDP).

Example: In a stock trading platform, RPCs must maintain strict order (buy before sell) to prevent inconsistent states.
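
A minimal sketch of sequence-number ordering with duplicate suppression. OrderedReceiver is a hypothetical receiver-side buffer, and the buy/sell payloads echo the example above.

```python
class OrderedReceiver:
    """Applies messages strictly in sequence-number order."""

    def __init__(self):
        self.next_seq = 0
        self.pending = {}   # seq -> payload, held until its turn
        self.applied = []

    def receive(self, seq, payload):
        if seq < self.next_seq or seq in self.pending:
            return          # duplicate: ignore
        self.pending[seq] = payload
        # Drain every message that is now in order.
        while self.next_seq in self.pending:
            self.applied.append(self.pending.pop(self.next_seq))
            self.next_seq += 1

rx = OrderedReceiver()
# Messages arrive out of order, and one is duplicated.
for seq, op in [(1, "sell"), (0, "buy"), (1, "sell"), (2, "audit")]:
    rx.receive(seq, op)
```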

9. Explain how streaming RPC differs from traditional request-response RPC.

Traditional RPC: One request -> one response.

Streaming RPC: Supports multiple messages per call over a single channel.

Client-streaming: Many requests -> one response.
Server-streaming: One request -> many responses.
Bidirectional streaming: Both sides exchange multiple messages concurrently.

Use cases: Real-time chat apps, IoT telemetry, video streaming, financial market feeds.

Example: gRPC streaming suits live data pipelines, e.g., a service pushing a continuous feed of events to its consumers.
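
The four call shapes can be modeled in-process with plain functions and generators. This is an analogy only, with no network involved and made-up function names, but iterators are a natural way to represent a stream of messages in Python.

```python
def unary(request):
    """One request -> one response."""
    return request.upper()

def server_streaming(request):
    """One request -> many responses."""
    for word in request.split():
        yield word

def client_streaming(requests):
    """Many requests -> one response."""
    return sum(requests)

def bidirectional(requests):
    """A response is produced as each request arrives."""
    for r in requests:
        yield r * 2

unary_out = unary("ping")
server_out = list(server_streaming("a b c"))
client_out = client_streaming(iter([1, 2, 3]))
bidi_out = list(bidirectional(iter([1, 2, 3])))
```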

10. How can RPC be optimized for high-latency or unreliable networks?

Remote Procedure Calls (RPC) over high-latency or unreliable networks need optimizations to reduce delays, avoid repeated failures, and maintain consistency:

Batching and Aggregation

Send multiple requests together in a single RPC call instead of many small ones.
Reduces network round-trips.
Example: Sending multiple database updates in a batch rather than separate calls.

Asynchronous RPC

Client continues execution without waiting for the server’s immediate response.
Improves concurrency and throughput in slow networks.

Efficient Serialization & Compression

Use compact formats like Protocol Buffers (Protobuf) or Avro instead of verbose JSON/XML.
Compress large payloads to save bandwidth.

Idempotent Operations

Design RPCs so that retries do not cause duplicate side effects.
Example: “Transfer $100 if not already done” instead of blindly “Transfer $100.”

Timeouts and Retries with Backoff

Use exponential backoff (retry after increasing delays).
Prevents flooding the network with retries during outages.

Circuit Breakers & Fallbacks

Temporarily stop calling an unresponsive service to avoid cascading failures.
Provide fallback responses where possible (e.g., cached data).

Streaming RPC (instead of many calls)

Keep a persistent connection to stream multiple messages.
Avoids repeated connection setup overhead.
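
The "Timeouts and Retries with Backoff" item above can be sketched as follows. The helper names and the millisecond values are illustrative assumptions. Production code commonly adds random jitter to the delays, which is left out here to keep the example deterministic.

```python
def backoff_delays(base, factor, cap, attempts):
    """Delay before each retry: base * factor**n, capped at `cap`."""
    return [min(cap, base * factor ** n) for n in range(attempts)]

def call_with_retries(fn, delays, sleep):
    """Try fn once, then once more after each delay; re-raise the last error."""
    last_error = None
    for delay in [0] + delays:
        if delay:
            sleep(delay)
        try:
            return fn()
        except ConnectionError as err:
            last_error = err
    raise last_error

delays_ms = backoff_delays(base=100, factor=2, cap=1000, attempts=5)

slept = []               # records requested sleeps instead of really sleeping
attempts = {"n": 0}

def flaky():
    """Hypothetical remote call that fails twice, then succeeds."""
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise ConnectionError("transient failure")
    return "ok"

result = call_with_retries(flaky, delays_ms, sleep=slept.append)
```

Passing the sleep function in as a parameter keeps the retry loop testable; a real client would pass something like time.sleep with delays in seconds.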

11. How does asynchronous messaging in RPC help prevent blocking in distributed systems?

Asynchronous RPC allows the client to continue execution immediately after sending a request. The response is handled later via callbacks, polling, or event loops. This prevents blocking, improves system throughput, and supports high-concurrency applications.

Example: A web server using asynchronous RPC can continue handling other requests while waiting for a database response.

12. Explain how RPC frameworks ensure type safety across heterogeneous platforms.

RPC frameworks ensure interoperability and type safety by:

Interface Definition Languages (IDLs): Define service methods, parameter types, and return types in a platform-independent way. Examples: Protocol Buffers (Protobuf), Thrift, Avro.
Stub generation: IDL compilers generate client and server stubs in different programming languages.
Serialization formats: Encode structured data into a common wire format that both sides decode against the shared schema, so field types are interpreted consistently.
Cross-language compatibility: Ensures that an int32 defined in the IDL is interpreted consistently in Java, C++, Python, etc.

Example: gRPC with Protobuf allows a Python client to safely call a Java server method with strong type guarantees.

13. How do RPC systems implement load balancing for high availability?

RPC frameworks use multiple strategies:

Client-side load balancing: Client chooses from a pool of server endpoints (e.g., Netflix Ribbon).
Server-side load balancing: A proxy or dispatcher distributes incoming requests (e.g., Envoy in service mesh).
DNS-based load balancing: DNS returns multiple IPs for the same service, spreading requests across servers (and across regions when geo-aware DNS is used).
Dynamic load balancing: Uses real-time metrics (CPU, latency) to route traffic adaptively.

Example: Netflix has used client-side load balancing with Ribbon + Eureka to distribute microservice RPC calls evenly.
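
Client-side load balancing can be sketched as a round-robin picker that skips endpoints marked unhealthy. The class and the addresses are made up for illustration; this is not how Ribbon or any other library is implemented.

```python
import itertools

class RoundRobinBalancer:
    """Client-side balancer that cycles through healthy endpoints."""

    def __init__(self, endpoints):
        self.endpoints = list(endpoints)
        self.healthy = set(self.endpoints)
        self._cycle = itertools.cycle(self.endpoints)

    def mark_down(self, endpoint):
        self.healthy.discard(endpoint)

    def pick(self):
        # Look at each endpoint at most once per pick.
        for _ in range(len(self.endpoints)):
            candidate = next(self._cycle)
            if candidate in self.healthy:
                return candidate
        raise RuntimeError("no healthy endpoints")

lb = RoundRobinBalancer(["10.0.0.1:50051", "10.0.0.2:50051", "10.0.0.3:50051"])
first_round = [lb.pick() for _ in range(3)]
lb.mark_down("10.0.0.2:50051")
after_failure = [lb.pick() for _ in range(4)]
```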

14. How does RPC handle network partitions and ensure eventual consistency?

In case of partitions:

Retries for idempotent operations: Safe re-execution avoids lost updates.
Versioning & vector clocks: Track concurrent updates to resolve conflicts.
Quorum-based replication: Operations proceed only if a majority of replicas respond.
Eventual consistency: Updates propagate across partitions once connectivity is restored.

Example: Amazon's Dynamo key-value store (the design that preceded DynamoDB) uses vector clocks and quorum-style replication to reconcile divergent updates after partitions.
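
Vector-clock comparison, as listed above, can be sketched with dictionaries mapping node names to counters. The function names and node labels are illustrative assumptions.

```python
def compare(a, b):
    """Return 'before', 'after', 'equal', or 'concurrent' for clocks a and b."""
    nodes = set(a) | set(b)
    a_le_b = all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
    b_le_a = all(b.get(n, 0) <= a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"
    if b_le_a:
        return "after"
    return "concurrent"

def merge(a, b):
    """Element-wise maximum, used once a conflict has been resolved."""
    return {n: max(a.get(n, 0), b.get(n, 0)) for n in set(a) | set(b)}

v1 = {"A": 1}
v2 = {"A": 2}
# Two replicas accepted different writes while partitioned.
left = {"A": 2, "B": 1}
right = {"A": 2, "C": 1}
```

Here compare(v1, v2) reports a plain succession, while compare(left, right) reports a conflict that the application (or a rule such as last-writer-wins) has to resolve before the clocks are merged.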

15. How can RPC frameworks mitigate the problem of “head-of-line blocking”?

Head-of-line blocking occurs when a single slow request delays all subsequent RPC calls on the same connection. Mitigation techniques:

Multiplexing: Use multiple logical streams over one connection (e.g., HTTP/2, which gRPC uses).
Parallel RPC connections: Send independent requests concurrently.
Async non-blocking calls: Allow other operations to continue while waiting for a slow response.

Example: gRPC over HTTP/2 uses stream multiplexing to avoid application-level HOL blocking, although a lost TCP packet can still stall every stream on that connection.

16. Describe the role of marshalling and unmarshalling in RPC. Why is it critical?

Marshalling: Converts procedure parameters/objects into a standard transmittable format (byte stream).

Unmarshalling: Reconstructs the data back into usable objects on the receiver side.

Criticality:

Ensures data integrity and type safety across heterogeneous architectures.
Makes RPC calls platform-independent (e.g., Java -> C++).
Errors in marshalling cause corrupted data, crashes, or incorrect results.

Example: Protobuf marshals structured data into a binary format, which the receiver unmarshals back to typed objects.
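
A hand-rolled sketch of the idea using Python's struct module. The byte layout below is invented for illustration and is not Protobuf's wire format.

```python
import struct

HEADER = ">IH"  # big-endian: 4-byte user id, 2-byte name length

def marshal(user_id, name):
    """Encode (int, str) as a fixed header followed by UTF-8 name bytes."""
    encoded = name.encode("utf-8")
    return struct.pack(HEADER, user_id, len(encoded)) + encoded

def unmarshal(payload):
    """Decode the byte layout produced by marshal()."""
    user_id, length = struct.unpack_from(HEADER, payload, 0)
    offset = struct.calcsize(HEADER)
    name = payload[offset:offset + length].decode("utf-8")
    return user_id, name

wire = marshal(42, "Ada")
decoded = unmarshal(wire)
```

Both sides must agree on the header layout; if the receiver assumed a different field order or byte order, it would decode garbage without any error being raised.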

17. How does RPC handle exactly-once execution semantics in unreliable networks?

Achieving exactly-once semantics means making retries harmless:

Problem: Retransmissions after failures can cause duplicate executions.

Solution mechanisms:

Assign unique request IDs to each RPC.
Server maintains a log/cache of processed requests and their results.
On retries, server returns the cached response instead of re-executing.

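This gives exactly-once effects for both idempotent and non-idempotent operations, provided the log survives server crashes and is updated atomically with the operation.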

Example: Banking systems ensure that a “withdraw $100” RPC executes exactly once even if the client retries due to timeout.
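
The request-ID mechanism can be sketched with a simulated network that loses replies. Account, lossy_call and the ID "txn-001" are hypothetical, and the in-memory table would need to be durable storage for the guarantee to survive a server crash.

```python
class Account:
    """Hypothetical server-side account with a table of processed request IDs."""

    def __init__(self, balance):
        self.balance = balance
        self.processed = {}   # request_id -> reply (durable in a real system)

    def withdraw(self, request_id, amount):
        if request_id in self.processed:
            return self.processed[request_id]   # retry: do not debit again
        self.balance -= amount
        reply = {"status": "ok", "balance": self.balance}
        self.processed[request_id] = reply
        return reply

def lossy_call(account, request_id, amount, drop_reply):
    """Simulated network: the request always arrives, the reply may be lost."""
    reply = account.withdraw(request_id, amount)
    if drop_reply:
        raise TimeoutError("reply lost")
    return reply

account = Account(balance=500)
reply = None
for drop in (True, True, False):      # the first two replies are lost
    try:
        reply = lossy_call(account, "txn-001", 100, drop_reply=drop)
        break
    except TimeoutError:
        continue                      # retry with the SAME request ID
```

The server sees the request three times but debits the account once, because the retries reuse the original request ID.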

18. How does RPC differ from REST and gRPC in terms of communication style?

RPC (classical): Focuses on making a remote call look like a local function. Often synchronous, tightly coupled.
REST: Resource-oriented, stateless, uses HTTP verbs (GET, POST, etc.). Language-agnostic, but text payloads such as JSON are typically less efficient than binary RPC formats in high-throughput systems.
gRPC (modern RPC): Uses HTTP/2 + Protobuf, supports streaming and multiplexing, strongly typed stubs, high performance.

Example: REST is used in web APIs for portability, while gRPC is preferred in microservices for efficiency and type safety.

19. What are the security challenges in RPC and how are they addressed?

Challenges:

Eavesdropping -> Interception of sensitive RPC data.
Replay attacks -> Attacker replays old valid messages.
Spoofing -> Fake clients/servers impersonating real ones.

Solutions:

Transport encryption (TLS/SSL) -> Protects confidentiality and integrity.
Authentication & authorization (certificates, tokens, OAuth).
Nonces / sequence numbers -> Prevent replay attacks.

Example: gRPC can run over TLS for encryption and can be configured for mutual TLS (mTLS), so that both sides of a service-to-service RPC call authenticate each other.
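
Nonce-based replay protection can be sketched with the standard hmac module. The pre-shared key, message format and Verifier class are illustrative assumptions; real deployments would normally rely on TLS and would also need to bound the set of remembered nonces.

```python
import hashlib
import hmac
import secrets

SHARED_KEY = b"demo-key-not-for-production"   # assumption: pre-shared secret

def sign(nonce, body):
    """Authenticate the nonce together with the body."""
    return hmac.new(SHARED_KEY, nonce + body, hashlib.sha256).digest()

class Verifier:
    def __init__(self):
        self.seen_nonces = set()

    def accept(self, nonce, body, tag):
        if not hmac.compare_digest(tag, sign(nonce, body)):
            return False            # forged or tampered message
        if nonce in self.seen_nonces:
            return False            # replay of an old valid message
        self.seen_nonces.add(nonce)
        return True

nonce = secrets.token_bytes(16)
body = b"transfer:100"
tag = sign(nonce, body)

verifier = Verifier()
first = verifier.accept(nonce, body, tag)       # fresh, valid message
replayed = verifier.accept(nonce, body, tag)    # same bytes sent again
tampered = verifier.accept(secrets.token_bytes(16), b"transfer:900", tag)
```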

20. How do modern RPC frameworks support observability in microservices?

Distributed systems need end-to-end visibility of RPC calls. Frameworks integrate with observability tools to provide:

Tracing (e.g., OpenTelemetry, Jaeger) -> track RPC latency across services.
Metrics (e.g., Prometheus) -> measure request success/failure rates.
Logging -> capture detailed RPC execution info.

Example: gRPC supports interceptors, which tracing libraries use to propagate trace IDs across calls, making it possible to debug slow or failing RPC chains.
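
Trace-ID propagation can be sketched with a wrapper function playing the role of an interceptor. The names (with_tracing, the "trace-id" metadata key, the two handlers) are hypothetical and do not correspond to the gRPC or OpenTelemetry APIs.

```python
import uuid

def with_tracing(service_name, handler, spans):
    """Hypothetical interceptor: reuses the caller's trace ID or starts a new one."""
    def wrapped(request, metadata):
        metadata = dict(metadata)
        metadata.setdefault("trace-id", uuid.uuid4().hex)
        spans.append((metadata["trace-id"], service_name))
        return handler(request, metadata)
    return wrapped

spans = []   # stands in for a tracing backend

def inventory_handler(request, metadata):
    return f"stock for {request}"

inventory = with_tracing("inventory", inventory_handler, spans)

def order_handler(request, metadata):
    # Forward the metadata so the downstream call joins the same trace.
    return inventory(request, metadata)

orders = with_tracing("orders", order_handler, spans)
result = orders("sku-1", {})
```

Both recorded spans carry the same trace ID, which is what lets a tracing tool stitch the orders and inventory calls into one end-to-end view.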