# 8. Backend-to-Agent Communication Protocol Selection

Date: 26/08/2025

## Status

Accepted

## Context

The SmartEM Decisions system requires a communication protocol for delivering real-time microscope control instructions from Kubernetes-hosted backend services to Windows workstation agents controlling cryo-electron microscopes.

### Key Requirements

- **Real-time delivery**: Instructions must reach agents within seconds of generation
- **High throughput**: Support high-frequency microscopy workflows
- **Fault tolerance**: Connection failures must not result in lost instructions
- **Network compatibility**: Must work within existing network infrastructure
- **Integration**: Seamless integration with existing FastAPI and RabbitMQ architecture

### Constraints

- Windows workstations have limited network connectivity
- Backend services cannot initiate connections to agents (firewall restrictions); agents must open outbound HTTP connections to the backend
- Instructions are issued every 30-120 seconds per agent during data collection
- Scientific reproducibility requires reliable instruction delivery and audit trails

## Decision

We will implement **Server-Sent Events (SSE) for instruction streaming combined with HTTP POST acknowledgements**.

## Alternatives Considered

### 1. WebSockets

**Pros**: Bidirectional, low latency, excellent real-time performance

**Cons**: Complex state management, firewall compatibility issues, unnecessary complexity for 30-120 second instruction intervals, fragile in environments with routine network maintenance

**Verdict**: Unsuitable - over-engineered for microscope control timing patterns

### 2. HTTP Long-Polling

**Pros**: Simple HTTP-based, good firewall compatibility, natural timeout handling

**Cons**: Resource intensive with blocking connections, complex timeout management, potential connection exhaustion

**Verdict**: Rejected - inefficient resource usage

### 3. gRPC Streaming

**Pros**: Excellent performance, built-in streaming, strong typing

**Cons**: HTTP/2 proxy compatibility issues, Protocol Buffers complexity, firewall restrictions, over-engineered for infrequent instructions

**Verdict**: Rejected - unnecessary complexity

### 4. Message Queue Pull (Direct RabbitMQ)

**Pros**: Native reliability, mature authentication, built-in failover

**Cons**: Exposes message queue to restricted networks, complex credential management, conflicts with network isolation policies

**Verdict**: Rejected - security boundary violations

### 5. File-Based Communication

**Pros**: Simple implementation, natural persistence, no network dependencies

**Cons**: Polling overhead, poor real-time performance, file locking complexity, inadequate for the real-time delivery requirement

**Verdict**: Rejected - insufficient performance

## Consequences

### Positive

- **Low-latency delivery**: SSE pushes each instruction to the agent as soon as it is generated, with no polling delay
- **Reliable delivery**: HTTP acknowledgements confirm receipt of each instruction
- **Fault tolerance**: Automatic SSE reconnection with exponential backoff recovers from transient connection failures
- **Network compatibility**: HTTP/SSE protocols work within existing firewall configurations
- **Simple integration**: Leverages existing FastAPI infrastructure with minimal changes
- **Audit compliance**: HTTP-based acknowledgements provide full instruction traceability

### Negative

- **Connection management**: SSE connections require careful lifecycle management with retry logic
- **Proxy sensitivity**: Long-lived SSE connections may be affected by corporate proxies
- **Retry complexity**: Exponential backoff logic must handle various failure scenarios

### Polling Fallback Rejected

HTTP polling was initially considered as a fallback mechanism, but it was rejected for the following reasons:

- **Complexity without benefit**: Network issues affecting SSE typically also affect HTTP polling
- **Instruction frequency**: 30-120 second intervals make polling overhead unnecessary
- **Robust retry sufficient**: SSE with exponential backoff reconnection handles temporary failures
- **Failure correlation**: Most network problems (DNS, firewall, proxy) impact both protocols equally
- **Maintenance burden**: Dual code paths increase complexity without meaningful reliability improvement

**Decision**: Implement robust SSE with retry logic only, without polling fallback.

### Implementation Requirements

- FastAPI SSE endpoint with connection lifecycle management and retry logic
- HTTP acknowledgement endpoints for delivery confirmation
- Exponential backoff reconnection for SSE failures
- Integration with existing RabbitMQ event system
- Database persistence for instruction state and delivery tracking
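
The exponential backoff reconnection listed in the implementation requirements could look roughly like the sketch below. It is an illustration only: the names `backoff_delays` and `consume_with_retry`, the default base, cap, and jitter values, and the `connect` callable are all assumptions, not part of the SmartEM codebase. `connect` stands in for whatever function opens the SSE stream and blocks while consuming it, raising `ConnectionError` when the stream drops.

```python
import random
import time


def backoff_delays(base=1.0, factor=2.0, cap=60.0, jitter=0.1, rng=None):
    """Yield reconnect delays in seconds: base, base*factor, ... capped at `cap`.

    Each delay is scaled by a random factor in [1 - jitter, 1 + jitter] so that
    many agents do not reconnect in lockstep after a shared outage.
    All default values here are illustrative assumptions.
    """
    rng = rng or random.Random()
    delay = base
    while True:
        yield min(delay, cap) * rng.uniform(1.0 - jitter, 1.0 + jitter)
        delay = min(delay * factor, cap)


def consume_with_retry(connect, sleep=time.sleep, max_failures=5, **backoff):
    """Call `connect()` until it returns, backing off after each failure.

    `connect` is a hypothetical callable that raises ConnectionError when the
    SSE stream drops. After `max_failures` consecutive failures the last
    ConnectionError is re-raised so the caller can alert an operator.
    """
    delays = backoff_delays(**backoff)
    failures = 0
    while True:
        try:
            return connect()
        except ConnectionError:
            failures += 1
            if failures >= max_failures:
                raise
            sleep(next(delays))
```

Injecting `sleep` keeps the retry loop testable without real waiting. A production agent would probably also reset the delay sequence once a connection has stayed healthy for some time; that detail is left out of this sketch.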
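
To make the pairing of SSE delivery with HTTP POST acknowledgements more concrete, the following sketch shows how an instruction might be framed as an SSE event on the backend, parsed on the agent, and turned into an acknowledgement body. The function names, the `instruction` event name, the payload fields (`action`, `x`), and the shape of the acknowledgement are hypothetical and chosen only for illustration. The parser handles only the subset of the SSE wire format used here (`id`, `event`, `data`, comment lines, and blank-line terminators).

```python
import json


def format_sse_event(event_id, event, payload):
    """Serialise one instruction as an SSE frame: id, event, data, blank line."""
    data = json.dumps(payload, separators=(",", ":"))  # compact, single line
    return f"id: {event_id}\nevent: {event}\ndata: {data}\n\n"


def parse_sse_events(lines):
    """Yield (event_id, event, payload) tuples from an iterable of text lines."""
    event_id, event, data = None, "message", []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            # A blank line dispatches the buffered event, if it has any data.
            if data:
                yield event_id, event, json.loads("\n".join(data))
            event, data = "message", []  # the last seen id persists
            continue
        if line.startswith(":"):
            continue  # comment line, often used as a keep-alive
        field, _, value = line.partition(":")
        if value.startswith(" "):
            value = value[1:]
        if field == "id":
            event_id = value
        elif field == "event":
            event = value
        elif field == "data":
            data.append(value)


def build_ack(agent_id, event_id, status="received"):
    """Build the JSON body an agent might POST back (hypothetical schema)."""
    return {"agent_id": agent_id, "instruction_id": event_id, "status": status}
```

Carrying the instruction identifier in the SSE `id` field means the agent always knows the last instruction it saw. On reconnect it could send that value in the standard `Last-Event-ID` request header, and the backend could use it, together with the persisted delivery state, to replay anything not yet acknowledged. Whether the backend replays in this way is an assumption of this sketch.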