8. Backend-to-Agent Communication Protocol Selection#

Date: 26/08/2025

Status#

Accepted

Context#

The SmartEM Decisions system requires a communication protocol for delivering real-time microscope control instructions from Kubernetes-hosted backend services to Windows workstation agents controlling cryo-electron microscopes.

Key Requirements#

  • Real-time delivery: Instructions must reach agents within seconds of generation

  • High throughput: Support high-frequency microscopy workflows

  • Fault tolerance: Connection failures must not result in lost instructions

  • Network compatibility: Must work within existing network infrastructure

  • Integration: Seamless integration with existing FastAPI and RabbitMQ architecture

Constraints#

  • Windows workstations have limited network connectivity

  • Agents cannot initiate connections to backend (firewall restrictions)

  • Instructions occur every 30-120 seconds per agent during data collection

  • Scientific reproducibility requires reliable instruction delivery and audit trails

Decision#

We will implement Server-Sent Events (SSE) for instruction streaming combined with HTTP POST acknowledgements.

Alternatives Considered#

1. WebSockets#

Pros: Bidirectional, low latency, excellent real-time performance Cons: Complex state management, firewall compatibility issues, unnecessary complexity for 30-120 second instruction intervals, fragile in environments with routine network maintenance Verdict: Unsuitable - over-engineered for microscope control timing patterns

2. HTTP Long-Polling#

Pros: Simple HTTP-based, good firewall compatibility, natural timeout handling Cons: Resource intensive with blocking connections, complex timeout management, potential connection exhaustion Verdict: Rejected - inefficient resource usage

3. gRPC Streaming#

Pros: Excellent performance, built-in streaming, strong typing Cons: HTTP/2 proxy compatibility issues, Protocol Buffers complexity, firewall restrictions, over-engineered for infrequent instructions Verdict: Rejected - unnecessary complexity

4. Message Queue Pull (Direct RabbitMQ)#

Pros: Native reliability, mature authentication, built-in failover Cons: Exposes message queue to restricted networks, complex credential management, conflicts with network isolation policies Verdict: Rejected - security boundary violations

5. File-Based Communication#

Pros: Simple implementation, natural persistence, no network dependencies Cons: Polling overhead, poor real-time performance, file locking complexity, inadequate for sub-second requirements Verdict: Rejected - insufficient performance

Consequences#

Positive#

  • Optimal performance: SSE provides real-time delivery with minimal latency for microscopy workflows

  • Reliable delivery: HTTP acknowledgements ensure instruction receipt confirmation

  • Fault tolerance: Automatic fallback to HTTP polling during connection issues

  • Network compatibility: HTTP/SSE protocols work within existing firewall configurations

  • Simple integration: Leverages existing FastAPI infrastructure with minimal changes

  • Audit compliance: HTTP-based acknowledgements provide full instruction traceability

Negative#

  • Connection management: SSE connections require careful lifecycle management with retry logic

  • Proxy sensitivity: Long-lived SSE connections may be affected by corporate proxies

  • Retry complexity: Exponential backoff logic must handle various failure scenarios

Polling Fallback Rejected#

Initial consideration included HTTP polling as a fallback mechanism, but this was rejected for the following reasons:

  • Complexity without benefit: Network issues affecting SSE typically also affect HTTP polling

  • Instruction frequency: 30-120 second intervals make polling overhead unnecessary

  • Robust retry sufficient: SSE with exponential backoff reconnection handles temporary failures

  • Failure correlation: Most network problems (DNS, firewall, proxy) impact both protocols equally

  • Maintenance burden: Dual code paths increase complexity without meaningful reliability improvement

Decision: Implement robust SSE with retry logic only, without polling fallback.

Implementation Requirements#

  • FastAPI SSE endpoint with connection lifecycle management and retry logic

  • HTTP acknowledgement endpoints for delivery confirmation

  • Exponential backoff reconnection for SSE failures

  • Integration with existing RabbitMQ event system

  • Database persistence for instruction state and delivery tracking