SmartEM Agent: Design Specification#
Version: 2.0 Date: 28/10/2025 Status: Implemented
Executive Summary#
This document specifies the design for the SmartEM Agent, the real-time EPU filesystem monitoring service. The implementation addresses non-deterministic file ordering issues that previously caused GridSquare and FoilHole processing failures. The agent provides robust orphan handling for arbitrary entity relationship ordering, improved bursty write handling, and enhanced code maintainability whilst preserving the proven parser implementation.
Background#
Implementation Overview#
SmartEM Agent is a Windows-deployed service that monitors EPU (cryo-electron microscopy software) output directories
in real time, parsing XML metadata and synchronising entity data to the backend via REST API. The architecture
(src/smartem_agent/) comprises several key components:
Parser (fs_parser.py): XML parsing of EPU session manifests, atlas data, GridSquare metadata, FoilHole manifests, and micrograph metadata
Event Classifier (event_classifier.py): Classifies file events by entity type and assigns processing priority
Event Queue (event_queue.py): Priority queue for buffering classified events during bursty writes
Event Processor (event_processor.py): Coordinates parsing, parent checking, and orphan management
Orphan Manager (orphan_manager.py): Manages entities awaiting parent availability with event-driven resolution
Error Handler (error_handler.py): Categorises and handles transient vs permanent errors with retry logic
Watcher (fs_watcher.py): Filesystem monitoring using the watchdog library, orchestrating all components
Data Store (model/store.py): In-memory entity cache with optional persistent backend synchronisation via PersistentDataStore. Provides relationship tracking and natural ID lookups for deduplication
The agent operates in two modes:
Dry-run mode: In-memory data store only, no API persistence
Production mode: Persistent data store with REST API synchronisation and SSE instruction streaming
Deployment occurs as a Windows executable, with testing facilitated by fsrecorder playback simulation of EPU output patterns.
Historical Problem Statement#
The original implementation exhibited critical failures when processing EPU filesystem output, manifesting primarily in GridSquare and FoilHole entity processing during end-to-end tests. These failures stemmed from three core architectural deficiencies that have now been addressed:
1. Insufficient Orphan Handling (Critical Priority)#
EPU filesystem output exhibits completely non-deterministic ordering characteristics:
Child-before-parent sequences: FoilHole manifests can appear before their parent GridSquare metadata files; similarly, Micrographs can precede FoilHoles, and GridSquares can precede Atlas data.
Bursty file writes: EPU buffers and dumps hundreds of files simultaneously in unpredictable sequences.
No ordering guarantees: The filesystem provides no guarantees about write completion order, even for logically related entities.
The current orphan handling implementation (_process_orphaned_files method, lines 527-543 in fs_watcher.py)
contains several weaknesses:
Grid-level orphan processing only: Orphans are processed exclusively when new grids are detected. GridSquare and FoilHole orphans that arrive before their parents are not handled.
Single retry attempt: Orphaned files are processed once when a potential parent appears, with no mechanism for subsequent retry if the parent-child chain remains incomplete.
Path-based resolution: Relies on get_grid_by_path for orphan-to-grid matching, which fails for entities requiring UUID-based parent lookups (GridSquares needing Grid UUIDs, FoilHoles needing GridSquare UUIDs).
No persistent orphan state: Orphaned entities are stored only in the orphaned_files dictionary without timeout tracking or systematic retry scheduling.
This results in silent processing failures for entities arriving in unexpected orders, particularly evident in end-to-end test failures where GridSquares and FoilHoles fail to persist to the database.
2. Inadequate Bursty Write Handling#
EPU can generate hundreds of files within seconds during data collection bursts. The current rate-limiting approach
(log_interval parameter, default 10 seconds) provides only event batching for logging purposes, not processing
control. Issues include:
Serial processing bottleneck: All events process sequentially in on_any_event, creating backlog during bursts.
No backpressure mechanism: High-frequency events accumulate in the changed_files dictionary without processing limits.
Insufficient buffering strategy: Event accumulation occurs but processing remains unbounded and non-prioritised.
The optimal processing model (serial versus concurrent) requires investigation. Entity interdependencies suggest serial processing within entity hierarchies (Grid → GridSquare → FoilHole → Micrograph) may be necessary to maintain parent-child ordering guarantees, whilst cross-hierarchy processing could potentially parallelise.
3. Code Maintainability Concerns#
The current watcher implementation exhibits technical debt that impedes modification and testing:
Complex conditional nesting: The on_any_event method (lines 441-515) contains deeply nested match/case statements with interleaved file pattern matching, entity type determination, and orphan handling logic.
Unclear separation of concerns: Orphan management, event routing, file classification, and entity processing are intermingled within single methods.
Difficult testing: Tight coupling between filesystem events, parsing, and persistence makes unit testing challenging without extensive mocking.
Hardcoded patterns: File pattern matching relies on regex patterns embedded throughout the code rather than centralised configuration.
These maintainability issues compound the difficulty of implementing robust orphan handling, as the current structure resists modification without introducing regression risks.
Design Goals and Principles#
The SmartEM Agent v2 design adheres to the following principles:
Primary Goals#
Robust orphan handling: Support arbitrary entity arrival ordering without assumptions about parent-child sequence. No entity should fail to process due to parent unavailability at initial observation time.
Bursty write resilience: Handle high-frequency file creation bursts (hundreds of files within seconds) without processing failures, dropped events, or unbounded resource consumption.
Maintainability: Produce clear, testable code with well-separated concerns, enabling future modifications without architectural rework.
Reliability: Ensure eventual consistency between filesystem state and datastore state, with comprehensive error recovery mechanisms.
Design Principles#
No ordering assumptions: The system must not assume any specific file arrival sequence. Parents may arrive after children; siblings may arrive in arbitrary order.
Idempotency: Entity processing operations must be safely repeatable. Re-processing a file should produce identical datastore state.
Separation of concerns: File watching, event classification, parsing, orphan management, and persistence should occupy distinct, loosely coupled components.
Testability: Components should support unit testing without filesystem dependencies or extensive mocking.
Incremental processing: Large batches should process incrementally to prevent resource exhaustion and enable progress monitoring.
Observability: Processing state, orphan status, and retry attempts should be visible through structured logging and metrics.
Constraints#
Dual-ID system: Every entity maintains both a synthetic UUID (v4, agent-generated, database primary key) and a natural ID (EPU-generated, e.g. GridSquare “1”, FoilHole “2-3”). UUID generation occurs immediately when entity data is first created. Natural IDs are essential for parent-child relationship resolution and orphan matching. This dual-ID architecture is critical for the entire system and must be preserved.
Dual data store model: Maintain both in-memory and persistent stores. The in-memory store serves as cache for relationship lookups, development convenience, and architectural modularity.
Windows deployment compatibility: Solution must build to Windows executable for Win10/11 deployment.
Backwards compatibility preference: Avoid breaking changes where practical, though not a strict requirement as the system is not yet in production.
Parser preservation: fs_parser.py works reliably and should remain largely unchanged.
Proposed Architecture#
Dual-ID System: Foundation for Orphan Resolution#
Understanding the dual-ID system is critical for comprehending the orphan handling strategy.
Identity Management#
Every entity in the system maintains two identifiers:
Synthetic UUID (Primary Key):
Generated agent-side using UUID v4
Created immediately when entity data structure is instantiated
Database primary key and system-wide identifier
Used for all parent-child relationships in Data Store and backend database
Never appears in EPU files
Natural ID (EPU Identifier):
Originates from EPU software (e.g., GridSquare “1”, FoilHole “2-3”, Micrograph “2-3-1”)
Found in EPU XML files and directory names
Used for entity lookup during parent relationship resolution
Enables deduplication (same natural ID = same entity, reuse UUID)
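To make the convention concrete, here is a minimal sketch of an entity record carrying both identifiers. The class and field names are illustrative only; the agent's actual model classes live in model/store.py.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class ExampleFoilHole:
    """Hypothetical record illustrating the dual-ID convention (not the agent's real model class)."""

    natural_id: str                     # EPU-generated, e.g. "2-3", parsed from file paths/XML
    gridsquare_natural_id: str          # natural ID of the parent GridSquare, e.g. "2"
    uuid: str = field(default_factory=lambda: str(uuid.uuid4()))  # synthetic primary key, minted at creation
    gridsquare_uuid: str | None = None  # filled in once the parent GridSquare's UUID is known
```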
Parent Relationship Resolution#
When parsing a FoilHole file, the process works as follows:
Parser extracts natural IDs:
FoilHole natural ID: “2-3”
Parent GridSquare natural ID: “2” (extracted from file path or XML content)
UUID lookup for parent:
Call datastore.find_gridsquare_by_natural_id("2")
If found: retrieve parent's UUID → set foilhole.gridsquare_uuid = parent.uuid
If not found: orphan detected (parent doesn't exist yet)
UUID management for entity:
Call datastore.find_foilhole_by_natural_id("2-3")
If found (update scenario): reuse existing UUID
If not found (new entity): UUID already generated during entity creation
Persistence:
With parent UUID available: persist FoilHole with complete relationship data
Without parent UUID: register as orphan awaiting parent appearance
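A minimal sketch of this lookup sequence is shown below, under the following assumptions: the find_*_by_natural_id helpers are those named in the steps above, the register_orphan signature follows the Orphan Manager interface later in this document, EntityType is the entity-type enum used throughout, and persist_foilhole plus the attributes on the foilhole object are illustrative stand-ins rather than the agent's actual code.

```python
def process_foilhole(datastore, orphan_manager, foilhole) -> None:
    """Sketch of the parent-resolution steps above for a parsed FoilHole."""
    # Step 2: UUID lookup for the parent GridSquare by its natural ID
    parent = datastore.find_gridsquare_by_natural_id(foilhole.gridsquare_natural_id)
    if parent is None:
        # Step 4b: parent not seen yet, register as orphan awaiting the GridSquare
        orphan_manager.register_orphan(
            entity_data=foilhole,
            required_parent_type=EntityType.GRIDSQUARE,
            required_parent_natural_id=foilhole.gridsquare_natural_id,
            file_path=foilhole.file_path,
        )
        return

    # Step 3: deduplication by natural ID (update scenario reuses the existing UUID)
    existing = datastore.find_foilhole_by_natural_id(foilhole.natural_id)
    if existing is not None:
        foilhole.uuid = existing.uuid

    # Step 4a: complete the relationship and persist
    foilhole.gridsquare_uuid = parent.uuid
    datastore.persist_foilhole(foilhole)  # illustrative persistence call
```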
Orphan Detection Mechanism#
An entity becomes orphaned when:
Parser successfully extracts entity data from EPU file
Natural ID lookup for required parent returns None
Persistence attempt fails parent existence check
Example orphan scenarios:
FoilHole “2-3” parsed but GridSquare “2” doesn’t exist yet
Micrograph “2-3-1” parsed but FoilHole “2-3” doesn’t exist yet
GridSquare “5” parsed but parent Grid hasn’t been created yet (rare, handled by current implementation)
Orphan Resolution Strategy: Event-Driven Matching#
Key insight: Orphan resolution is event-driven, not periodic retry-based.
When a parent entity appears:
Process parent normally: Parse file, generate/retrieve UUID, persist to Data Store
Check orphan registry: Query for orphans requiring this parent’s natural ID
Resolve matching orphans: For each orphan:
Set orphan’s parent UUID (now available)
Persist orphan to Data Store
Remove from orphan registry
Example flow - FoilHole arrives before GridSquare:
FoilHole “2-3” file appears
Parser extracts: natural ID “2-3”, parent GridSquare natural ID “2”
Lookup GridSquare “2”: not found → orphan
Register orphan: {natural_id: "2-3", parent_natural_id: "2", entity_data: <FoilHole>}
(later) GridSquare "2" file appears
Parse and persist GridSquare “2” (gets UUID)
Check orphans: “Are any orphans waiting for GridSquare ‘2’?”
Find FoilHole “2-3” orphan
Set foilhole.gridsquare_uuid = gridsquare.uuid
Persist FoilHole "2-3"
Remove from orphan registry
This event-driven approach provides immediate orphan resolution with zero polling overhead.
High-Level Component Structure#
The v2 architecture introduces four new components whilst preserving the proven Parser and Data Store implementations:
Filesystem Watcher → Event Classifier → Event Queue → Event Processor → Parser / Orphan Manager → Data Store → Backend REST API
New components (Event Classifier, Event Queue, Event Processor, Orphan Manager) handle event buffering, classification, and orphan resolution. Existing components (Parser, Data Store) remain unchanged.
Component Descriptions#
Filesystem Watcher (Unchanged Externally)#
Continues using watchdog’s FileSystemEventHandler for monitoring file creation and modification events. Interface
remains compatible with existing deployment. Internal processing delegation changes to route events through new
components.
Responsibilities:
Monitor watch directory for file creation/modification events
Filter events by configured file patterns
Delegate event processing to Event Classifier
No changes to watchdog integration or SSE client management.
Event Classifier (New Component)#
Examines file paths and classifies events by entity type and processing priority.
Responsibilities:
Apply regex patterns to determine entity type (Grid, Atlas, GridSquare, FoilHole, Micrograph)
Extract natural IDs from file paths (GridSquare ID, FoilHole ID, etc.)
Assign processing priority based on entity hierarchy (Grid > Atlas > GridSquare > FoilHole > Micrograph)
Enqueue classified events to Event Queue
Interface:
@dataclass
class ClassifiedEvent:
    entity_type: EntityType  # Enum: GRID, ATLAS, GRIDSQUARE, FOILHOLE, MICROGRAPH
    file_path: Path
    natural_id: str | None  # Extracted ID where applicable
    priority: int  # Lower number = higher priority
    timestamp: float
    event_type: str  # 'created' or 'modified'
Processing Logic:
Match file path against known patterns from EpuParser
Extract identifiers using existing regex patterns
Priority assignment: Grid=0, Atlas=1, GridSquare=2, FoilHole=3, Micrograph=4
Unknown files are logged and discarded
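A minimal sketch of this classification logic follows, assuming the ClassifiedEvent dataclass and EntityType enum above. The filename patterns shown are illustrative placeholders, not the patterns actually embedded in EpuParser.

```python
import re
import time
from pathlib import Path

# Illustrative (pattern, entity type, priority) rules; the real patterns live in EpuParser
_RULES = [
    (re.compile(r"EpuSession\.dm$"), EntityType.GRID, 0),
    (re.compile(r"Atlas\.dm$"), EntityType.ATLAS, 1),
    (re.compile(r"GridSquare_(\d+)\.xml$"), EntityType.GRIDSQUARE, 2),
    (re.compile(r"FoilHole_(\d+-\d+)\.xml$"), EntityType.FOILHOLE, 3),
    (re.compile(r"FoilHole_(\d+-\d+-\d+)_Data.*\.xml$"), EntityType.MICROGRAPH, 4),
]


def classify(path: Path, event_type: str) -> ClassifiedEvent | None:
    """Return a ClassifiedEvent for recognised EPU files, or None for unknown files."""
    for pattern, entity_type, priority in _RULES:
        match = pattern.search(path.name)
        if match:
            return ClassifiedEvent(
                entity_type=entity_type,
                file_path=path,
                natural_id=match.group(1) if match.groups() else None,
                priority=priority,
                timestamp=time.time(),
                event_type=event_type,
            )
    return None  # caller logs and discards unknown files
```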
Event Queue (New Component)#
Buffered queue for classified events with priority ordering and batch processing support.
Responsibilities:
Buffer incoming classified events during bursty writes
Maintain priority ordering (higher-priority entities process first)
Provide batch retrieval for processing
Support backpressure via size limits
Track queue depth metrics for monitoring
Interface:
class EventQueue:
    def enqueue(self, event: ClassifiedEvent) -> None: ...
    def dequeue_batch(self, max_size: int = 50) -> list[ClassifiedEvent]: ...
    def size(self) -> int: ...
    def clear(self) -> None: ...
Configuration:
Maximum queue size: 1000 events (configurable)
Batch size: 50 events per processing cycle (configurable)
Priority queue implementation using heapq or queue.PriorityQueue
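A minimal heapq-based sketch of this interface is given below. The tie-breaking counter and the oldest-event eviction policy reflect the behaviour described in this document, but are illustrative implementation choices rather than the agent's actual code.

```python
import heapq
import itertools
import logging

logger = logging.getLogger(__name__)


class EventQueue:
    """Bounded priority queue for ClassifiedEvent objects; lower priority value dequeues first."""

    def __init__(self, max_size: int = 1000) -> None:
        self._heap: list[tuple[int, float, int, "ClassifiedEvent"]] = []
        self._max_size = max_size
        self._counter = itertools.count()  # tie-breaker so equal-priority events keep arrival order

    def enqueue(self, event: "ClassifiedEvent") -> None:
        if len(self._heap) >= self._max_size:
            # Size limit exceeded: warn and evict the oldest buffered event
            logger.warning("Event queue full (%d events); evicting oldest event", self._max_size)
            oldest = min(self._heap, key=lambda item: item[1])
            self._heap.remove(oldest)
            heapq.heapify(self._heap)
        heapq.heappush(self._heap, (event.priority, event.timestamp, next(self._counter), event))

    def dequeue_batch(self, max_size: int = 50) -> list["ClassifiedEvent"]:
        batch = []
        while self._heap and len(batch) < max_size:
            batch.append(heapq.heappop(self._heap)[-1])
        return batch

    def size(self) -> int:
        return len(self._heap)

    def clear(self) -> None:
        self._heap.clear()
```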
Event Processor (New Component - Core Orchestrator)#
Coordinates event processing, delegating to Parser and Orphan Manager based on parent availability.
Responsibilities:
Retrieve event batches from Event Queue
Delegate parsing to EpuParser
Check parent entity availability in Data Store
Route successful parses to Data Store for persistence
Route successful parses to Data Store for persistence
Route orphaned entities (missing parents) to Orphan Manager
Handle processing errors with logging and optional retry
Processing Flow:
For each event in batch:
  1. Parse file using EpuParser → entity data
  2. Check parent exists in Data Store:
     - If parent exists: persist entity via Data Store
     - If parent missing: register with Orphan Manager
  3. Trigger Orphan Manager to retry orphans (parents may now exist)
  4. Log processing outcome (success/orphan/error)
Interface:
class EventProcessor:
    def __init__(self, parser: EpuParser, datastore: DataStore, orphan_mgr: OrphanManager): ...
    def process_batch(self, events: list[ClassifiedEvent]) -> ProcessingStats: ...
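The following is a minimal sketch of process_batch following this flow. The parse and persist calls, the parent_natural_id attribute, the interpretation of has_parent's arguments, and the ProcessingStats fields are all assumptions for illustration; parent_type_of is a hypothetical helper sketched later under Orphan Detection.

```python
from dataclasses import dataclass


@dataclass
class ProcessingStats:
    persisted: int = 0
    orphaned: int = 0
    errors: int = 0


class EventProcessor:
    """Sketch of the batch orchestration described above; not the agent's actual implementation."""

    def __init__(self, parser, datastore, orphan_mgr) -> None:
        self.parser = parser
        self.datastore = datastore
        self.orphan_mgr = orphan_mgr

    def process_batch(self, events: list["ClassifiedEvent"]) -> ProcessingStats:
        stats = ProcessingStats()
        for event in events:
            try:
                entity = self.parser.parse(event.file_path)  # illustrative parser call
            except Exception:
                stats.errors += 1
                continue

            parent_type = parent_type_of(event.entity_type)  # None for root entities (Grids)
            if parent_type is None or self.datastore.has_parent(parent_type, entity.parent_natural_id):
                self.datastore.persist(entity)  # illustrative persistence call
                stats.persisted += 1
                # Event-driven resolution: this entity may be the parent some orphan is waiting for
                self.orphan_mgr.resolve_orphans_for(event.entity_type, event.natural_id)
            else:
                self.orphan_mgr.register_orphan(
                    entity_data=entity,
                    required_parent_type=parent_type,
                    required_parent_natural_id=entity.parent_natural_id,
                    file_path=event.file_path,
                )
                stats.orphaned += 1
        return stats
```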
Orphan Manager (New Component - Critical for Reliability)#
Manages entities awaiting parent availability with event-driven resolution.
Responsibilities:
Store orphaned entities indexed by required parent natural ID
Resolve orphans immediately when their parent entity appears
Detect and log permanently orphaned entities (timeout-based)
Provide orphan status visibility for debugging
Data Structures:
@dataclass
class OrphanedEntity:
    entity_type: EntityType
    entity_data: GridSquareData | FoilHoleData | MicrographData | ...
    required_parent_type: EntityType
    required_parent_natural_id: str  # Natural ID of parent (e.g., "2" for GridSquare)
    file_path: Path
    first_seen: float  # Timestamp for timeout detection
Orphan Storage:
# Index orphans by parent natural ID for fast lookup
orphans_by_parent: dict[tuple[EntityType, str], list[OrphanedEntity]]
# Key: (parent_entity_type, parent_natural_id)
# Value: List of orphans waiting for this parent
# Example: (GRIDSQUARE, "2") → [FoilHole_2-3, FoilHole_2-5, Micrograph_2-1-1]
Event-Driven Resolution Strategy:
When Event Processor successfully persists a parent entity (e.g., GridSquare “2”):
Trigger orphan check: Call orphan_manager.resolve_orphans_for(GRIDSQUARE, "2")
Lookup matching orphans: Retrieve all orphans waiting for (GRIDSQUARE, "2")
Resolve each orphan:
Lookup parent entity in Data Store by natural ID (now guaranteed to exist)
Extract parent’s UUID
Set orphan's parent UUID field (e.g., foilhole.gridsquare_uuid = gridsquare.uuid)
Persist orphan to Data Store
Remove from orphan registry
Log successful resolution
Cascading resolution: Resolving an orphan may enable resolution of its children
Example: Resolving GridSquare “2” may enable FoilHole “2-3”, which may enable Micrograph “2-3-1”
Event Processor recursively checks orphans after each successful persistence
Timeout Detection (safeguard for genuinely missing files):
Periodic background task (every 60 seconds) checks orphan timestamps
Orphans exceeding timeout threshold (e.g., 5 minutes) logged as permanently orphaned
Enables operator intervention for corrupted or incomplete EPU sessions
Interface:
class OrphanManager:
    def register_orphan(self, entity_data, required_parent_type, required_parent_natural_id, file_path): ...
    def resolve_orphans_for(self, parent_type: EntityType, parent_natural_id: str) -> int: ...
    def check_timeouts(self, max_age_seconds: float = 300) -> list[OrphanedEntity]: ...
    def get_orphan_stats(self) -> dict[EntityType, int]: ...
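A minimal sketch of this interface is shown below, assuming the OrphanedEntity dataclass and the orphans_by_parent index above. The datastore lookup and persistence calls, the generic parent_uuid assignment, and the entity_type and natural_id attributes on the stored entity data are illustrative stand-ins for the type-specific calls in the real implementation.

```python
import time
from collections import defaultdict


class OrphanManager:
    """Sketch of event-driven orphan storage and resolution; not the agent's actual implementation."""

    def __init__(self, datastore) -> None:
        self.datastore = datastore
        self.orphans_by_parent: dict[tuple["EntityType", str], list["OrphanedEntity"]] = defaultdict(list)

    def register_orphan(self, entity_data, required_parent_type, required_parent_natural_id, file_path) -> None:
        orphan = OrphanedEntity(
            entity_type=entity_data.entity_type,  # illustrative attribute on the parsed entity
            entity_data=entity_data,
            required_parent_type=required_parent_type,
            required_parent_natural_id=required_parent_natural_id,
            file_path=file_path,
            first_seen=time.time(),
        )
        self.orphans_by_parent[(required_parent_type, required_parent_natural_id)].append(orphan)

    def resolve_orphans_for(self, parent_type, parent_natural_id: str) -> int:
        waiting = self.orphans_by_parent.pop((parent_type, parent_natural_id), [])
        for orphan in waiting:
            parent = self.datastore.find_by_natural_id(parent_type, parent_natural_id)  # illustrative lookup
            orphan.entity_data.parent_uuid = parent.uuid  # e.g. foilhole.gridsquare_uuid in the real model
            self.datastore.persist(orphan.entity_data)    # illustrative persistence call
            # Cascading resolution: this orphan may itself be a parent another orphan is waiting for
            self.resolve_orphans_for(orphan.entity_type, orphan.entity_data.natural_id)
        return len(waiting)

    def check_timeouts(self, max_age_seconds: float = 300) -> list["OrphanedEntity"]:
        """Return (but never evict) orphans older than the threshold, for warning logs."""
        now = time.time()
        return [o for group in self.orphans_by_parent.values() for o in group
                if now - o.first_seen > max_age_seconds]

    def get_orphan_stats(self) -> dict:
        stats: dict = {}
        for group in self.orphans_by_parent.values():
            for orphan in group:
                stats[orphan.entity_type] = stats.get(orphan.entity_type, 0) + 1
        return stats
```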
Data Store (Minimal Changes)#
Existing InMemoryDataStore and PersistentDataStore implementations remain largely unchanged. Minor enhancements:
Additions:
has_parent(entity_type, parent_id) -> bool: Check parent existence without retrieving the full entity
Enhanced logging for API retry failures (already present but may need refinement)
No changes to UUID generation, relationship tracking, or API synchronisation logic.
Parser (No Changes)#
EpuParser remains unchanged. All existing parsing methods continue functioning as designed.
Data Flow#
The event processing flow proceeds through distinct stages:
Event Detection: Filesystem Watcher detects file creation/modification events
Classification: Event Classifier determines entity type and assigns priority
Buffering: Event Queue stores classified events with priority ordering
Batch Processing (periodic, e.g. every 100ms):
Event Processor dequeues batch (up to 50 events)
For each event:
Parser extracts entity data from file
Event Processor checks parent exists in Data Store
If parent exists: persist entity to Data Store → Backend API
If parent missing: register with Orphan Manager
Orphan Resolution (event-driven): After successfully persisting each entity:
Event Processor calls orphan_manager.resolve_orphans_for(entity_type, natural_id)
Orphan Manager finds all orphans waiting for this parent
Resolves each orphan by completing parent relationship and persisting
Cascading: Resolving an orphan triggers check for its own children orphans
Timeout detection: Periodic background task (every 60s) identifies orphans exceeding age threshold (5 minutes) for operator intervention.
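A minimal sketch of the loop that ties these stages together follows. The 100 ms interval, 50-event batch size, 60 s timeout check, and 5-minute age threshold follow the figures above; the shutdown event, logging calls, and component wiring are illustrative assumptions.

```python
import logging
import threading
import time

logger = logging.getLogger(__name__)


def processing_loop(queue, processor, orphan_mgr, stop: threading.Event,
                    interval: float = 0.1, batch_size: int = 50,
                    timeout_check_interval: float = 60.0) -> None:
    """Dequeue and process batches roughly every 100 ms; periodically warn about aged orphans."""
    last_timeout_check = time.monotonic()
    while not stop.is_set():
        batch = queue.dequeue_batch(max_size=batch_size)
        if batch:
            stats = processor.process_batch(batch)  # persistence triggers orphan resolution internally
            logger.debug("Processed batch: %s", stats)
        if time.monotonic() - last_timeout_check >= timeout_check_interval:
            for orphan in orphan_mgr.check_timeouts(max_age_seconds=300):
                # Warning only; orphans are never evicted (supports long-running sessions)
                logger.warning("Orphan exceeded age threshold: %s", orphan.file_path)
            last_timeout_check = time.monotonic()
        stop.wait(interval)  # sleep until the next cycle or shutdown
```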
Orphan Resolution Strategy (Detailed)#
The orphan resolution strategy is the core innovation in v2, addressing the primary failure mode of v1. This section provides detailed rationale and implementation specifics.
Why Orphans Occur#
Orphans arise from three filesystem characteristics:
Non-atomic writes: EPU writes files individually, not as atomic transactions. A GridSquare’s FoilHole manifests may complete writing before the GridSquare’s own metadata file finishes writing.
Buffered writes: Operating system write buffering and EPU’s internal buffering can reorder file completion relative to logical creation order.
Multi-directory writes: EPU writes to multiple directories (Metadata/, Images-Disc*/, etc.) concurrently, with no synchronisation guarantees across directory boundaries.
Orphan Detection#
An entity is orphaned if its required parent is not present in the Data Store at processing time. Parent requirements:
| Entity Type | Required Parent Type | Parent Identifier |
|---|---|---|
| GridSquare | Grid | Grid UUID (from path lookup) |
| FoilHole | GridSquare | GridSquare UUID (from ID) |
| Micrograph | FoilHole | FoilHole UUID (from ID) |
| Atlas | Grid | Grid UUID (from path lookup) |
| AtlasTile | Atlas | Atlas UUID |
Grid entities never orphan (they are root entities). Atlas and AtlasTile orphaning is theoretically possible but extremely rare in practice (Atlas.dm typically completes before GridSquare files).
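This parent-requirement table can be captured as a simple lookup. The sketch below assumes the EntityType enum used throughout this document (with an ATLASTILE member assumed for completeness); Grid, as the root entity, maps to no parent.

```python
# Illustrative child → required-parent mapping, mirroring the table above
PARENT_TYPE: dict["EntityType", "EntityType"] = {
    EntityType.GRIDSQUARE: EntityType.GRID,
    EntityType.FOILHOLE: EntityType.GRIDSQUARE,
    EntityType.MICROGRAPH: EntityType.FOILHOLE,
    EntityType.ATLAS: EntityType.GRID,
    EntityType.ATLASTILE: EntityType.ATLAS,  # ATLASTILE assumed to exist in the enum
}


def parent_type_of(entity_type: "EntityType") -> "EntityType | None":
    """Return the required parent type, or None for root entities (Grids)."""
    return PARENT_TYPE.get(entity_type)
```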
Event-Driven Resolution Rationale#
The v2 orphan resolution strategy is event-driven, not retry-based. This fundamental design choice provides several advantages:
Zero latency resolution: When a parent entity appears and is persisted, orphan resolution occurs immediately in the same processing cycle. No polling intervals, no exponential backoff delays.
No wasted cycles: Resolution attempts only occur when new parents arrive. Traditional periodic retry systems waste CPU checking orphans when no new parents exist.
Cascading resolution: Resolving an orphan (e.g., GridSquare “2”) immediately triggers resolution checks for its children (FoilHoles waiting for GridSquare “2”), enabling multi-level dependency chains to resolve in a single event processing cycle.
Deterministic behavior: Resolution timing depends solely on parent arrival, not on retry scheduling parameters. This simplifies testing and debugging.
Resolution Mechanism#
When Event Processor successfully persists any entity:
Extract entity identity: Determine entity type and natural ID (e.g., GRIDSQUARE, “2”)
Trigger resolution: Call orphan_manager.resolve_orphans_for(GRIDSQUARE, "2")
Orphan Manager executes:
Lookup orphans indexed by (GRIDSQUARE, "2")
For each matching orphan (e.g., FoilHole "2-3"):
Retrieve parent entity from Data Store by natural ID
Extract parent’s UUID
Complete orphan's parent relationship (set foilhole.gridsquare_uuid)
Persist orphan to Data Store
Remove from orphan registry
Recursively trigger resolution for this orphan’s natural ID (in case it has children)
This cascading resolution enables arbitrarily deep dependency chains to resolve in a single processing cycle.
Orphan Storage#
Orphans persist in memory within OrphanManager using a dictionary indexed by required parent:
orphans_by_parent: dict[tuple[EntityType, str], list[OrphanedEntity]]
# Key: (required_parent_type, required_parent_natural_id)
# Value: List of orphans waiting for this parent
# Example: (GRIDSQUARE, "2") → [FoilHole "2-3", FoilHole "2-5", Micrograph "2-1-1"]
This indexing structure enables:
O(1) lookup of all orphans waiting for a specific parent
Efficient bulk resolution when parent appears
Natural support for one-to-many parent-child relationships
Orphans do not persist across agent restarts. Upon restart, filesystem re-scanning will re-discover orphaned files, re-triggering orphan registration. This design trades restart recovery complexity for implementation simplicity, acceptable given infrequent agent restarts during active data collection.
Timeout Detection#
Event-driven resolution handles >99% of orphan scenarios. For genuinely missing parent files (corrupted EPU sessions, incomplete transfers):
Periodic background task (every 60 seconds) scans all orphans
Orphans exceeding age threshold (default: 5 minutes) logged as permanently orphaned
Logs include: entity type, natural ID, required parent, first seen timestamp, file path
Enables operator intervention for data recovery or session debugging
Key Design Decisions#
Decision 1: Serial vs Concurrent Event Processing#
Question: Should events process serially (one at a time) or concurrently (parallel processing)?
Analysis:
Serial Processing:
Pros: Maintains strict entity hierarchy ordering; simplifies parent dependency checking; avoids race conditions in Data Store access; straightforward implementation
Cons: Potentially slower throughput during extreme bursts (hundreds of files); single-threaded processing bottleneck
Concurrent Processing:
Pros: Higher throughput potential; better CPU utilisation on multi-core systems; reduced processing latency during bursts
Cons: Complex parent dependency coordination; requires Data Store locking/synchronisation; race conditions between orphan registration and parent creation; significantly increased implementation complexity
Decision: Serial processing within the Event Processor.
Rationale:
Entity interdependencies: Parent entities must exist before children can persist. Concurrent processing risks child processing completing before parent processing, artificially creating orphans.
Throughput sufficiency: Parsing and persisting a single entity takes <10ms in typical cases. Serial processing sustains ~100 entities/second, far exceeding typical EPU output rates (10-20 entities/second during active collection).
Simplicity: Serial processing eliminates concurrency bugs, Data Store locking requirements, and orphan registration race conditions.
Burst handling: Event Queue buffering and batch processing (50 events/batch) provide sufficient burst absorption. Processing 50 events serially takes ~500ms, acceptable latency.
Future consideration: If profiling reveals serial processing bottlenecks, investigate hierarchical parallelism (concurrent processing across independent Grid hierarchies, serial within each hierarchy).
Decision 2: Orphan Storage Location#
Question: Should orphans store in-memory, in persistent Data Store, or in separate storage?
Decision: In-memory storage within OrphanManager.
Rationale:
Lifecycle: Orphans are transient processing state, not domain entities. They should not pollute the Data Store entity collections.
Persistence unnecessary: Orphans resolve within seconds to minutes. Persistence across agent restarts provides minimal value given filesystem re-scanning automatically re-discovers orphaned files.
Simplicity: In-memory storage avoids Data Store schema changes, API synchronisation complexity, and database migrations.
Performance: In-memory lookups are faster than database queries, beneficial for high-frequency retry attempts.
Trade-off: Orphan state loss on agent restart. Acceptable given infrequent restarts during active data collection and automatic re-discovery upon restart.
Decision 3: Event Queue Implementation#
Question: What data structure and size limits should the Event Queue use?
Decision: Priority queue with 1000 event maximum size and 50 event batch size.
Rationale:
Priority queue: Ensures higher-priority entities (Grids, Atlases) process before lower-priority entities (Micrographs), reducing orphan occurrence by processing parents first when possible.
1000 event limit: Accommodates extreme bursts (EPU rarely exceeds 500 files/burst) whilst preventing unbounded memory growth. Exceeding limit triggers warning logging and oldest event eviction.
50 event batch size: Balances processing latency (500ms at 10ms/entity) with orphan resolution responsiveness. Smaller batches increase orphan resolution frequency but reduce throughput; larger batches improve throughput but delay orphan resolution (orphans resolve when their parent is persisted).
Configuration: All parameters (queue size, batch size, priorities) should be configurable via environment variables or configuration file for deployment-specific tuning.
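One way such configuration might be externalised is sketched below. The environment variable names, defaults, and the AgentConfig bundle are illustrative assumptions, not the agent's actual settings.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentConfig:
    """Illustrative configuration bundle for queue, batch, and orphan-timeout tuning."""

    max_queue_size: int = 1000
    batch_size: int = 50
    orphan_timeout_seconds: float = 300.0
    timeout_check_interval_seconds: float = 60.0

    @classmethod
    def from_env(cls) -> "AgentConfig":
        # Hypothetical variable names; fall back to the defaults above when unset
        return cls(
            max_queue_size=int(os.getenv("SMARTEM_MAX_QUEUE_SIZE", cls.max_queue_size)),
            batch_size=int(os.getenv("SMARTEM_BATCH_SIZE", cls.batch_size)),
            orphan_timeout_seconds=float(os.getenv("SMARTEM_ORPHAN_TIMEOUT", cls.orphan_timeout_seconds)),
            timeout_check_interval_seconds=float(
                os.getenv("SMARTEM_TIMEOUT_CHECK_INTERVAL", cls.timeout_check_interval_seconds)
            ),
        )
```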
Decision 4: Event-Driven vs Periodic Retry Orphan Resolution#
Question: Should orphan resolution use event-driven triggers or periodic retry with backoff?
Decision: Event-driven resolution triggered when parent entities are persisted.
Rationale:
Zero latency: Orphans resolve immediately when their parent appears, not on next retry cycle. Typical resolution time: <1ms after parent persistence.
No wasted cycles: Resolution attempts only occur when new parents arrive. Periodic retry wastes CPU checking orphans during quiet periods.
Cascading resolution: Resolving an orphan (e.g., GridSquare) immediately enables resolution of its children (FoilHoles), allowing multi-level dependency chains to resolve in a single processing cycle.
Deterministic testing: Resolution timing depends solely on parent arrival order, not on retry intervals or backoff parameters. Simplifies test case design and debugging.
Simpler implementation: No retry scheduling, no exponential backoff calculations, no retry count tracking.
Trade-off: Requires indexing orphans by required parent natural ID rather than simple list storage. The O(1) lookup performance benefit outweighs the marginal increase in storage complexity.
Timeout detection: Periodic background task (60s interval) identifies orphans exceeding age threshold (5 minutes) for genuinely missing parent files. Timeouts generate warning logs but do NOT evict orphans from memory, supporting long-running production sessions (48+ hours) where parent files may arrive after extended delays. Memory footprint for orphans is negligible (~750 bytes per orphan, ~1.9 MB for 2,500 orphans in extreme scenarios).
Implementation Strategy#
Phased Development Approach#
The implementation was completed in phases, initially developed in src/smartem_agent2/ alongside the original
implementation to enable iterative development. After thorough testing and validation, the new implementation replaced
the original in src/smartem_agent/.
Phase 1: Foundation Components (Week 1-2)#
Deliverables:
event_classifier.py: Event classification with entity type detection and priority assignment
event_queue.py: Priority queue implementation with configurable size limits
orphan_manager.py: Orphan storage and basic retry scheduling (without exponential backoff initially)
Unit tests for all three components using synthetic events (no filesystem dependencies)
Validation:
Unit test coverage >90% for new components
Event classification correctly identifies all EPU entity types from sample file paths
Event queue maintains priority ordering under random insertion
Orphan manager stores and retrieves orphans by entity type and natural ID
Phase 2: Event Processor and Integration (Week 3-4)#
Deliverables:
event_processor.py: Batch processing orchestration with Parser and Data Store integration
Integration of Event Classifier, Event Queue, Event Processor, and Orphan Manager
Adaptation of fs_watcher.py to delegate events to Event Classifier (minimal changes to existing class)
Integration tests using fsrecorder playback with small datasets (10-20 GridSquares)
Validation:
Event Processor correctly processes batches with known file sets
Orphan Manager successfully resolves GridSquare and FoilHole orphans in test datasets
Integration tests pass with zero entity loss and zero orphan leakage
Performance benchmarking: process 100 entities within 2 seconds (serial processing)
Phase 3: Transient Error Retry and Enhanced Observability (Week 5-6)#
Context: Production environment differs significantly from dev/test:
Production: Windows binary, live microscope, real-time file writes over 4-48 hours
Dev/Test: Linux interpreted Python, fsrecorder simulation with accelerated timing (100-1000x speed)
Primary goal: Address dev/test brittleness from accelerated playback whilst improving production reliability
Deliverables:
Exponential backoff retry for transient parser errors (corrupted XML during write, file locks)
Exponential backoff retry for transient API connection failures (network timeouts, service restarts)
Agent log streaming to backend (via SSE or dedicated endpoint) for centralized monitoring
Enhanced error categorization (transient vs permanent failures)
Structured metrics: processing latency percentiles (p50, p95, p99), retry distributions, error counts
Enhanced SSE instructions: detailed statistics, orphan inspection, error reporting
End-to-end tests using fsrecorder playback with multiple timing modes (--fast, --dev-mode, --exact)
Full-scale E2E test with bi37708-42 dataset (8,389 events)
Validation:
Exponential backoff correctly retries transient errors without blocking event processing
Permanent errors (corrupt files, missing parents after timeout) logged appropriately without retry
Orphan timeout detection generates warnings but does NOT evict orphans (supports 48h+ sessions)
End-to-end tests pass with all fsrecorder timing modes (100x fast, 1000x dev-mode, 1x exact)
Agent logs successfully stream to backend for centralized monitoring
Processing metrics visible in logs and SSE instruction responses
Phase 4: Performance Optimisation and Production Hardening (Week 7-8)#
Deliverables:
Performance profiling and optimisation based on fsrecorder playback results
Configuration externalisation (queue sizes, retry parameters, batch sizes)
Comprehensive error recovery testing (API failures, parser errors, corrupted files)
Documentation updates (README, API documentation, deployment guides)
Windows executable build and deployment testing
Validation:
Performance targets met: process 200 entities/second (serial), orphan resolution within 5 seconds for 95th percentile
Configuration changes apply without code modification
Error recovery scenarios handled gracefully without data loss
Windows executable deploys and operates correctly on Win10/11 test systems
Testing Strategy#
Testing employs a multi-layered approach progressing from unit tests to full end-to-end validation:
Unit Tests#
Each component (EventClassifier, EventQueue, OrphanManager, EventProcessor) receives dedicated unit test
coverage with mocked dependencies. Focus areas:
Event Classifier: File path pattern matching, entity type detection, priority assignment, natural ID extraction
Event Queue: Priority ordering, batch retrieval, size limit enforcement, edge cases (empty queue, single-element batches)
Orphan Manager: Orphan registration, event-driven resolution, timeout detection (warning-only, no eviction), orphan statistics by type and age
Event Processor: Batch processing, parser delegation, parent checking, orphan registration, Data Store interaction
Target: >90% code coverage across new components.
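As an illustration of the unit-test style described above, here is a minimal pytest case for Event Queue priority ordering, written against the ClassifiedEvent and EventQueue sketches given earlier; the file names used are illustrative.

```python
import time
from pathlib import Path


def test_dequeues_higher_priority_entities_first():
    """Grid events (priority 0) should dequeue ahead of Micrograph events (priority 4),
    regardless of insertion order."""
    queue = EventQueue(max_size=10)
    micrograph = ClassifiedEvent(
        entity_type=EntityType.MICROGRAPH, file_path=Path("FoilHole_2-3-1_Data.xml"),
        natural_id="2-3-1", priority=4, timestamp=time.time(), event_type="created",
    )
    grid = ClassifiedEvent(
        entity_type=EntityType.GRID, file_path=Path("EpuSession.dm"),
        natural_id=None, priority=0, timestamp=time.time(), event_type="created",
    )
    queue.enqueue(micrograph)  # child event arrives first
    queue.enqueue(grid)        # parent event arrives second
    batch = queue.dequeue_batch(max_size=2)
    assert [e.entity_type for e in batch] == [EntityType.GRID, EntityType.MICROGRAPH]
```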
Integration Tests#
Integration tests use fsrecorder playback to simulate EPU filesystem output patterns with controlled file ordering and multiple timing modes:
Timing modes (fsrecorder replay options):
--fast (100x speed, 1s max delays): Balanced mode for realistic testing, DEFAULT
--dev-mode (1000x speed, burst): Maximum acceleration for rapid iteration and stress testing
--exact (1x speed): Preserve original timing for debugging timing-dependent issues
File ordering scenarios:
Normal ordering: Parents before children (Grid → Atlas → GridSquare → FoilHole → Micrograph)
Reverse ordering: Children before parents (Micrograph before FoilHole, FoilHole before GridSquare, etc.)
Mixed ordering: Random interleaving of parent and child files
Bursty writes: Hundreds of files appearing simultaneously
Incomplete hierarchies: Missing parent files (Atlas absent, GridSquare metadata missing)
Validation:
Zero entity loss across all ordering scenarios and timing modes
Orphan resolution success rate >99% for complete hierarchies
Orphaned entities with missing parents log timeout warnings (no eviction)
Processing latency <5 seconds for 95th percentile entities (fast mode)
End-to-End Tests#
End-to-end tests use full-scale fsrecorder datasets (100+ GridSquares, 1000+ FoilHoles) with realistic EPU output patterns. Primary test dataset: bi37708-42 (8,389 events). Tests validate:
Complete entity persistence to backend database
SSE instruction handling during active data collection
Agent restart recovery (orphan re-discovery)
Transient error retry (parser errors with file locks, API connection failures)
Agent log streaming to backend for centralized monitoring
Windows executable deployment and operation (production environment)
Target: All existing end-to-end tests pass with v2 implementation, plus additional tests for:
Orphan timeout warning scenarios (no eviction)
Transient error recovery with exponential backoff
Long-running sessions (48+ hours simulated via extended playback)
All three fsrecorder timing modes (fast, dev-mode, exact)
Open Questions and Future Considerations#
Open Questions for Investigation#
1. Optimal Batch Size#
Question: Is 50 events/batch optimal for throughput and orphan resolution trade-off?
Investigation Plan: Benchmark processing latency and orphan resolution time across batch sizes: 10, 25, 50, 100, 200 events/batch using fsrecorder playback with large datasets. Measure:
Total processing time for complete dataset
95th percentile orphan resolution time
Memory consumption during processing
Decision Criteria: Select batch size minimising orphan resolution time whilst maintaining total processing time <10% above minimum observed (to avoid sacrificing throughput for marginal resolution improvements).
2. Persistent Orphan Storage#
Question: Should orphans persist across agent restarts for long-running orphan scenarios?
Current Decision: No persistent storage (in-memory only).
Reconsideration Triggers:
If operational data reveals orphans persisting beyond 5 minutes (indicating parent files genuinely missing rather than delayed)
If agent restarts during active data collection prove common in production
If filesystem re-scanning performance becomes problematic for large datasets
Implementation Path: If persistence becomes necessary, store orphans in SQLite database co-located with agent, avoiding backend API dependency for transient processing state.
3. Hierarchical Parallelism#
Question: Could concurrent processing across independent Grid hierarchies improve throughput without introducing parent-child race conditions?
Current Decision: Serial processing (see Decision 1).
Future Investigation: If performance profiling reveals serial processing bottlenecks (unlikely based on preliminary analysis), investigate:
Processing batches grouped by Grid UUID (parallel across Grids, serial within Grid)
Thread pool or async/await implementation for Grid-level concurrency
Lock-free Data Store design for concurrent Grid insertion
Decision Criteria: Pursue only if serial processing demonstrably fails to meet throughput requirements (>200 entities/second sustained).
Future Enhancements (Out of Scope for v2)#
The following features are explicitly deferred to future iterations:
Athena API Integration Modes#
Description: Three-mode operation for microscope control:
Dry-run mode: Data ingestion only, no microscope control
Semi-automatic mode: Athena API generates recommendations, operator approves
Fully automatic mode: Athena API directly controls microscope via SSE instructions
Deferral Rationale: Core orphan handling and event processing must stabilise before introducing complex microscope control workflows. Athena API integration builds upon reliable data ingestion foundation.
Data Intake Only Mode#
Description: Disable SSE instruction streaming, operate purely as data ingestion service for offline analysis.
Deferral Rationale: Current SSE implementation is stable and non-invasive. Disabling SSE provides minimal value until specific use cases emerge requiring intake-only operation.
Session Completion Detection#
Description: Automatically detect EPU session completion (no new files for threshold duration, session manifest indicates completion, etc.) and trigger finalisation workflows.
Deferral Rationale: Session completion detection requires understanding EPU’s session lifecycle signals, which remain partially undocumented. Reliable detection mechanisms require field observation across multiple sessions.
Multi-Agent Coordination#
Description: Support multiple agents monitoring different microscopes within single backend instance, with session-based isolation and cross-agent coordination.
Deferral Rationale: Multi-agent scenarios are not yet tested in production. Backend session isolation mechanisms require validation before agent-side coordination features become relevant.
Conclusion#
The SmartEM Agent design addresses critical orphan handling deficiencies through systematic component separation, intelligent error handling with retry logic, and robust event buffering. The implementation provides reliable entity processing across all tested scenarios.
Key innovations implemented:
Orphan Manager with event-driven resolution and timeout warnings
Event Queue priority-based buffering for bursty write handling
Event Processor orchestration with clear separation of concerns
Error Handler with categorisation and exponential backoff retry
Preservation of proven components (Parser, Data Store)
The implementation supports existing deployment patterns (Windows executable, fsrecorder testing), and provides clear extension points for future features (Athena API integration, session completion detection, multi-agent coordination).
The system has been validated with comprehensive testing demonstrating zero-entity-loss reliability across all test scenarios including various file ordering patterns and timing modes.