The original TwinSight proposal describes the ROS Adapter as a single container that connects directly to ROS 2 via DDS. This works for a demo with 3–5 robots on one network, but it breaks down in production for three fundamental reasons:
**Blocker.** Many robot manufacturers ship robots as sealed systems. They expose ROS 2 topics on the network but do not allow custom software installation. If the adapter must run ON the robot, you can't deploy it on half your fleet.

**Architecture gap.** DDS discovery uses multicast on the local subnet. Robots on different VLANs, buildings, or sites cannot be discovered by a single centralized adapter. The proposal ignores network topology entirely.

**Reliability risk.** One adapter container handling all robots means: if it crashes, the entire fleet goes dark; if it's CPU-bound processing 50 robots' telemetry, latency degrades for all; there is no horizontal scaling.

The entire architecture is built on one non-negotiable rule:
The robots run their standard ROS 2 stack — untouched, unmodified. They publish topics and accept commands via DDS on the local network, just as their manufacturer intended. Everything else happens outside the robot, on infrastructure you control.
This is possible because DDS is a network protocol. Any process on the same network can subscribe to a robot's topics without the robot knowing or caring. The robot doesn't need a "client" installed — it's already broadcasting. We just need to listen from the right place.
The solution splits the monolithic "ROS Adapter" into four distinct tiers, each with a clear job and independent scaling characteristics.
| Tier | What | Where It Runs | Scales By |
|---|---|---|---|
| TIER 1 | Robots — standard ROS 2 stack, untouched | On the physical robots | Adding more robots (each is independent) |
| TIER 2 | Edge Bridge — Zenoh bridge capturing DDS traffic | Small server/NUC on the robot's LAN at each site | One per site/network segment |
| TIER 3 | Adapter Pool — Zenoh Router + stateless adapter instances | Datacenter / cloud alongside TwinSight backend | Horizontal: add adapter instances per load |
| TIER 4 | TwinSight Platform — existing Kafka + microservices + UI | Datacenter / cloud | Existing scaling strategy (unchanged) |
Robots continue running their standard ROS 2 software stack. They publish topics (telemetry, sensor data, diagnostics) and accept commands (actions, services) via DDS on the local network. Nothing is installed, configured, or modified on the robot.
This approach works with any ROS 2 robot from any manufacturer. Whether it's a custom-built AGV, a commercial AMR from a vendor like MiR or Locus, or a simulated robot in Gazebo — if it speaks ROS 2 on a network, TwinSight can connect to it. You never need to ask a manufacturer for permission to install software. You never need to maintain adapter code on 50 different robots. The robots are treated as pure data sources and command sinks.
Requirements from the robot: Expose standard ROS 2 topics on the local network using any DDS implementation (Fast DDS, Cyclone DDS, Connext). Use ROS 2 namespaces (e.g., /robot_01/) for multi-robot disambiguation. That's it.
This is the tier that makes everything work. Each physical site where robots operate gets a small, dedicated piece of hardware running a Zenoh bridge. This bridge sits on the same LAN as the robots, listens to their DDS traffic, and forwards it over the WAN to the centralized adapter pool.
Eclipse Zenoh is a protocol designed for exactly this problem: moving robot data efficiently across networks. The zenoh-plugin-ros2dds (the successor to zenoh-bridge-dds) is a standalone process that:
1. Discovers all DDS participants on the local network automatically (via DDS SPDP/SEDP discovery)
2. Subscribes to configured ROS 2 topics
3. Translates DDS messages into Zenoh wire format (compressed, batched)
4. Forwards them over TCP/QUIC to a remote Zenoh router with built-in NAT traversal
DDS was designed for LANs — it uses multicast discovery, which doesn't cross subnets, VPNs, or the internet. Zenoh was designed for geo-distributed systems:
• NAT traversal — works through firewalls without port forwarding
• Bandwidth efficiency — compresses and batches messages (critical for WAN)
• Selective forwarding — only sends topics that someone is actually consuming
• QUIC transport — modern protocol, faster than TCP for lossy networks
• Bidirectional — commands flow back from cloud to robots on the same connection
The Zenoh bridge is lightweight. It doesn't process or transform data — it just forwards it. Hardware requirements are minimal:
| Fleet Size per Site | Hardware | Estimated Cost |
|---|---|---|
| 1–10 robots | Intel NUC / Raspberry Pi 5 / any x86 mini-PC | ~$150–$400 |
| 10–50 robots | Small server (4-core, 8GB RAM) | ~$400–$800 |
| 50+ robots | Dedicated edge server or 2x bridges with load sharing | ~$800–$2000 |
The edge bridge is stateless and replaceable. If it dies, you swap the hardware, run the same Docker container, and it auto-discovers all robots on the LAN again. No per-robot configuration. No state to lose. Recovery time: minutes.
```yaml
# docker-compose.edge-bridge.yml — deployed at each site
services:
  zenoh-bridge:
    image: eclipse/zenoh-bridge-ros2dds:latest
    network_mode: host   # CRITICAL: must see DDS multicast
    environment:
      - ZENOH_ROUTER=tcp/twinsight-cloud.example.com:7447
      - ROS_DOMAIN_ID=0
    volumes:
      - ./zenoh-bridge-config.json5:/config.json5
    restart: always
```
```json5
// zenoh-bridge-config.json5 — topic selection
// Only forward what TwinSight actually needs
{
  allowance: {
    pub: [
      "/*/cmd_vel",                  // teleoperation commands (cloud→robot)
    ],
    sub: [
      "/*/robot_state",              // state topics (robot→cloud)
      "/*/battery_state",
      "/*/odom",
      "/*/diagnostics",
      "/*/camera/image_compressed",
      "/*/scan",                     // LIDAR
    ]
  }
}
```
The `/*` prefix matches any robot namespace. The bridge auto-discovers all robots and forwards only the listed topics. `network_mode: host` is required so the container can see DDS multicast on the LAN.
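To make the selective-forwarding semantics concrete, here is a minimal sketch of an allowance check in Python. The `is_allowed` helper is hypothetical (not part of the bridge); real Zenoh key-expression matching is richer (`**` spans path segments), and `fnmatch` is only an approximation of the `/*` namespace wildcard.

```python
from fnmatch import fnmatch

# Subscription patterns mirroring the bridge config. The helper below is an
# illustrative assumption, not the bridge's actual matching implementation.
SUB_ALLOWANCE = [
    "/*/robot_state",
    "/*/battery_state",
    "/*/odom",
    "/*/diagnostics",
    "/*/camera/image_compressed",
    "/*/scan",
]

def is_allowed(topic: str, patterns=SUB_ALLOWANCE) -> bool:
    """Return True if a discovered topic matches any configured pattern."""
    return any(fnmatch(topic, pattern) for pattern in patterns)
```

Any topic not matching a pattern is simply never forwarded over the WAN, which is what keeps bandwidth usage proportional to what TwinSight actually consumes.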
This is where the original proposal's "ROS Adapter" logic lives — but redesigned as a horizontally scalable pool of stateless instances behind a Zenoh Router.
**Zenoh Router** (infrastructure): A single Zenoh Router (or an HA pair) acts as the entry point. All edge bridges connect here. The router handles topic routing, access control, and connection management. Think of it as a "switchboard" that knows which adapter instance handles which site.

**Adapter instances** (horizontally scalable): Stateless containers that subscribe to Zenoh topics (forwarded from edge bridges), perform the actual translation (Zenoh/ROS → JSON events), and publish to Kafka. Each instance handles a configured subset of robots (by site or robot range).

**Assignment Manager** (orchestration): A lightweight coordination service that assigns robots to adapter instances, detects instance failures, and rebalances. Similar to a Kafka consumer group coordinator. Can use Redis for state or run as part of the Zenoh Router config.

The adapter instance is the component that contains the actual translation logic. It's written in Python (with the Zenoh SDK) and is stateless: all state lives in Redis or Kafka.
Data routing decision: Structured telemetry (position, battery, state → small JSON) goes to Kafka. Binary blobs (camera images, LIDAR scans, point clouds → large binary) go directly to ReductStore. This prevents Kafka from being choked by multi-megabyte messages.
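The routing rule can be sketched as a small pure function. The function name and topic sets are illustrative assumptions (topic names taken from the bridge configuration), not the actual adapter API:

```python
# Illustrative sketch of the adapter's routing decision.
BLOB_TOPICS = {"scan", "camera/image_compressed"}   # large binary → ReductStore
STRUCTURED_TOPICS = {"robot_state", "battery_state", "odom", "diagnostics"}  # small JSON → Kafka

def route_sample(key_expr: str) -> str:
    """Decide the sink for a forwarded sample, e.g. 'robot_01/odom'."""
    robot_id, _, topic = key_expr.strip("/").partition("/")
    if topic in BLOB_TOPICS:
        return "reductstore"
    if topic in STRUCTURED_TOPICS:
        return "kafka"
    return "drop"  # unknown topics are ignored
```

Keeping this decision in one place means a new high-volume sensor topic only requires adding one entry to `BLOB_TOPICS`, never a Kafka broker reconfiguration.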
When an operator sends a command (start mission, teleoperation input), the flow reverses:
1. The backend publishes a `mission.start` event with the `robot_id` and mission parameters to Kafka.
2. The adapter responsible for that `robot_id` picks up the event.
3. The command is sent as a Zenoh message (a ROS 2 action goal or service call).
4. The Zenoh bridge at the robot's site converts it back to DDS and publishes on the LAN.
From the robot's perspective, it's receiving a normal ROS 2 message — it has no idea the command originated from a web browser.
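The adapter-side translation step can be sketched as follows. The key layout (`<robot_id>/cmd/<command>`) and event field names are hypothetical conventions for illustration; the actual Zenoh key scheme is defined by the bridge configuration.

```python
import json

def command_to_zenoh(event: dict) -> tuple[str, bytes]:
    """Map a Kafka command event onto a Zenoh key expression and payload.

    The "<robot_id>/cmd/<command>" layout is an assumed convention,
    not the bridge's actual key scheme.
    """
    robot_id = event["robot_id"]     # e.g. "robot_01"
    command = event["type"]          # e.g. "mission.start"
    key_expr = f"{robot_id}/cmd/{command}"
    payload = json.dumps(event.get("params", {})).encode()
    return key_expr, payload
```

The adapter would then publish `payload` on `key_expr` via its Zenoh session, and the edge bridge at the target site maps the key back to the corresponding DDS topic or action.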
The beauty of this design: the existing TwinSight architecture doesn't change at all. The adapter pool produces the exact same Kafka events the original proposal specified (robot.state.updated, robot.telemetry.pose, mission.progress, etc.). From the backend's perspective, it's still consuming events from Kafka and pushing commands to Kafka. It doesn't know or care that those events traveled through Zenoh bridges across the internet.
This means you can develop and test the TwinSight platform (backend + frontend) completely independently of the bridge infrastructure. Use a mock Kafka producer during development. Deploy the real bridge infrastructure only when connecting to actual robot sites. Swap bridge technologies in the future without touching a single line of platform code.
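A mock producer for platform development can be as small as this sketch (the class name and recorded event fields are illustrative; the event topics match the ones the proposal specifies):

```python
class MockKafkaProducer:
    """Drop-in stand-in for kafka-python's KafkaProducer during development.

    Records events in memory instead of sending them to a broker, so the
    backend and frontend can be exercised without any bridge infrastructure.
    """

    def __init__(self):
        self.sent = []

    def send(self, topic: str, value: dict):
        self.sent.append((topic, value))

# Emit the same event types the adapter pool would produce.
producer = MockKafkaProducer()
producer.send("robot.state.updated", {"robot_id": "robot_01", "state": "ACTIVE"})
producer.send("robot.telemetry.pose", {"robot_id": "robot_01", "x": 1.2, "y": 3.4})
```

Because the platform only sees Kafka events, swapping this mock for the real adapter pool requires no code changes on the backend side.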
The four-tier architecture supports multiple deployment configurations depending on the customer's environment:
Scenario: One warehouse, all robots on the same LAN, TwinSight running on a local server.
Deployment: Tier 2 + Tier 3 + Tier 4 all run on the same machine (or cluster). The Zenoh bridge runs in network_mode: host to see DDS traffic, and connects to the adapter pool via localhost. Effectively collapses Tiers 2–4 into a single Docker Compose deployment. Zero WAN traffic.
Scenario: Three warehouses in different cities, TwinSight in the cloud.
Deployment: Each warehouse gets a small edge server running the Zenoh bridge (Tier 2). The bridge connects over VPN or direct internet (Zenoh handles NAT) to the cloud-hosted Zenoh Router + Adapter Pool + TwinSight (Tiers 3 + 4). This is the primary production pattern for distributed fleets.
*Most common production pattern.*

Scenario: Low-latency teleoperation needed, but central monitoring is also required.
Deployment: Each site runs its own Zenoh bridge AND a local adapter instance (for low-latency commands and teleoperation). A second Zenoh connection forwards telemetry to the cloud for central monitoring, reporting, and ML. This gives you sub-10ms command latency locally while still having the centralized fleet overview.
*Best for teleoperation.*

Scenario: Military or high-security environment, no internet connection allowed.
Deployment: Everything runs on-premises on an isolated network. Tier 2 + 3 + 4 in a single rack. Zenoh bridges connect via internal LAN to the adapter pool. The architecture still provides multi-site support if sites are connected via a private WAN (no internet needed).
*Security-sensitive environments.*

| Fleet Scale | Edge Bridges | Adapter Instances | Zenoh Router | Notes |
|---|---|---|---|---|
| 1–20 robots, 1 site | 1 (can share with TwinSight host) | 1 | Embedded in adapter | Everything in one Docker Compose. Development / small production. |
| 20–100 robots, 1–3 sites | 1 per site | 2–4 (one per site + spare) | 1 dedicated | Standard production. Auto-assignment of robots to adapters by site. |
| 100–500 robots, 3–10 sites | 1–2 per site | 5–15 (auto-scaled) | HA pair (active/standby) | Kubernetes deployment. HPA (Horizontal Pod Autoscaler) on adapter pods based on message throughput. |
| 500+ robots, 10+ sites | 2+ per large site | 15+ (auto-scaled) | Zenoh router mesh | Multiple Zenoh routers with geographic routing. Kafka partitioning by site_id + robot_id. |
The bottleneck shifts as you scale: at 1–50 robots, the bottleneck is network bandwidth (especially for camera streams). At 50–200 robots, it's adapter CPU (deserializing and normalizing messages). At 200+ robots, it's Kafka throughput and TimescaleDB write speed. The four-tier architecture lets you address each bottleneck independently without redesigning the system.
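To see why bandwidth dominates first, here is a back-of-the-envelope calculation. The frame size and rate are assumed figures for illustration, not measurements:

```python
# Assumed figures for illustration only.
robots = 50
frame_kb = 100   # compressed camera frame, ~100 KB
fps = 15         # frames per second per robot

per_robot_mbps = frame_kb * 8 * fps / 1000   # KB → kilobits, then → megabits
fleet_mbps = per_robot_mbps * robots
print(f"{per_robot_mbps:.0f} Mbps per robot, {fleet_mbps:.0f} Mbps fleet-wide")
```

Under these assumptions a 50-robot site pushes hundreds of megabits of camera traffic alone, which is why selective topic forwarding and Zenoh's compression matter long before adapter CPU becomes the limit.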
Bridging robot networks to the internet introduces serious attack surface. Here's how to lock it down:
All Zenoh connections (edge → router → adapters) use TLS 1.3 with mutual TLS (mTLS). Each edge bridge has a unique certificate. The Zenoh router only accepts connections from known certificates. A compromised bridge can be revoked without affecting others.
The Zenoh router enforces ACLs per bridge. Site A's bridge can only publish/subscribe to topics for Site A's robots. Even if an attacker compromises Site A's bridge, they can't access Site B's robot data or send commands to Site B's robots.
Commands flowing from platform → robots are signed by the adapter with a per-adapter key. The edge bridge can optionally verify signatures before forwarding to DDS. This prevents command injection even if the Zenoh router is compromised.
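The sign/verify flow can be sketched with Python's standard `hmac` module. This assumes a symmetric per-adapter key for simplicity; the real design might well use asymmetric keys instead, so treat this only as an illustration of the mechanism:

```python
import hashlib
import hmac
import json

# Placeholder key for illustration — never hard-code secrets in production.
ADAPTER_KEY = b"per-adapter-secret"

def sign_command(command: dict, key: bytes = ADAPTER_KEY) -> str:
    """Produce an HMAC-SHA256 signature over a canonicalized command."""
    payload = json.dumps(command, sort_keys=True).encode()
    return hmac.new(key, payload, hashlib.sha256).hexdigest()

def verify_command(command: dict, signature: str, key: bytes = ADAPTER_KEY) -> bool:
    """Constant-time check that a command was signed with the adapter's key."""
    return hmac.compare_digest(sign_command(command, key), signature)
```

An edge bridge holding the adapter's verification key can reject any command whose signature doesn't match, so even a compromised router in the middle cannot inject valid commands.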
Robot LANs remain isolated — no direct internet access. The edge bridge is the only connection point, and it only forwards configured topics. DDS discovery traffic stays local. The bridge acts as a controlled, auditable gateway.
| What Fails | Impact | Recovery | Data Loss |
|---|---|---|---|
| Edge bridge crashes | That site's telemetry stops flowing. Robots continue operating autonomously. Platform marks robots as "stale". | Auto-restart (Docker restart policy). Auto-rediscovers DDS topics. Recovery: ~5–15 seconds. | Telemetry during outage is lost (acceptable — it's ephemeral). Robots buffer locally if configured. |
| WAN connection lost | Same as bridge crash from platform perspective. Edge bridge buffers locally (Zenoh supports message queuing). | Automatic reconnection when WAN recovers. Buffered messages are forwarded. | Minimal — Zenoh queue depth is configurable. Oldest messages dropped if queue fills. |
| Adapter instance crashes | Robots assigned to that instance go "stale" on the platform. | Assignment Manager detects failure (heartbeat timeout) and reassigns robots to surviving instances. K8s auto-replaces the pod. | Events during reassignment window (~5–10s) are buffered in Zenoh Router. |
| Zenoh Router crashes | All sites disconnected from platform. Robots unaffected (they operate autonomously). | HA standby router takes over (if deployed). Edge bridges auto-reconnect to backup. Recovery: ~10–30 seconds. | Buffered in edge bridges during outage. |
| Kafka is down | Adapter instances can't publish events. They buffer internally and retry. | Standard Kafka HA (replication factor ≥ 2). Adapters retry with exponential backoff. | None — adapters buffer until Kafka recovers. |
Robots never stop working because the platform is down. The architecture ensures that telemetry flow is best-effort (lost data during outages is acceptable — you can't control a robot based on 30-second-old data anyway), while commands are guaranteed-delivery (buffered and retried until confirmed). A robot that loses connection to the platform continues its current mission autonomously — it doesn't stop or enter an error state.
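The Assignment Manager's rebalancing step described above can be sketched as a pure function. The round-robin-by-sorted-id strategy is an assumption; the real service could use consistent hashing or site affinity instead:

```python
# Sketch of the Assignment Manager's rebalancing step (strategy assumed).
def assign_robots(robots: list[str], live_instances: list[str]) -> dict[str, str]:
    """Map each robot to a surviving adapter instance, round-robin."""
    if not live_instances:
        return {}  # no instances alive: robots go "stale" until one returns
    instances = sorted(live_instances)
    return {
        robot: instances[i % len(instances)]
        for i, robot in enumerate(sorted(robots))
    }
```

Recomputing the full mapping on every membership change keeps the service stateless; the trade-off is that more robots move between instances than a consistent-hashing scheme would move.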
| Phase | Deliverables | Duration | Dependencies |
|---|---|---|---|
| Phase 1: Single-site MVP | Zenoh bridge Docker image with ROS 2 topic configuration • Single adapter instance (Python + Zenoh SDK + Kafka producer) • Deserializer for core ROS 2 message types (Odometry, BatteryState, DiagnosticArray, sensor_msgs) • Kafka event schema definitions (matching TwinSight proposal) • Docker Compose for single-site deployment | 3–4 weeks | Kafka schema definitions from backend team |
| Phase 2: Multi-site + scaling | Zenoh Router deployment (standalone container) • Assignment Manager (robot → adapter instance mapping) • Multi-adapter instance support with health monitoring • Edge bridge deployment automation (Ansible/Terraform scripts for site setup) • mTLS certificate provisioning workflow | 2–3 weeks | Phase 1 complete, VPN/network access to test sites |
| Phase 3: Commands + teleoperation | Reverse command flow (Kafka → adapter → Zenoh → edge → DDS → robot) • ROS 2 action client support (for missions with feedback) • Teleoperation WebSocket → Zenoh path with latency optimization • Command signing and verification • Deadman switch integration | 3–4 weeks | Phase 2 complete, teleoperation UI from frontend team |
| Phase 4: Production hardening | ReductStore integration for binary blob recording • Kubernetes Helm charts for adapter pool + Zenoh router • HPA policies based on message throughput metrics • Prometheus metrics exporter for all bridge components • Failover testing and chaos engineering • Documentation: edge bridge deployment guide per site | 2–3 weeks | Phase 3 complete, K8s cluster available |
| Component | Technology | Why This One |
|---|---|---|
| Edge bridge | zenoh-plugin-ros2dds | Production-grade, no robot install needed, NAT traversal, selective topic forwarding, bidirectional |
| WAN transport | Zenoh protocol (TCP/QUIC) | Designed for geo-distributed robotics, compression, batching, handles unreliable networks |
| Central router | Zenoh Router | ACL support, connection management, HA pairing, topic routing |
| Adapter runtime | Python + zenoh-python + kafka-python | Matches TwinSight backend stack (FastAPI/Python), strong ROS 2 message deserialization ecosystem |
| Blob storage | ReductStore + ReductStore Agent | Time-indexed binary storage, native ROS 2 support, FIFO retention (as discussed in previous analysis) |
| Orchestration | Kubernetes (production) / Docker Compose (dev/small) | HPA for adapter auto-scaling, pod replacement on failure, Helm for reproducible deployments |
This architecture transforms the ROS Adapter from a monolithic single-point-of-failure into a production-grade, multi-site bridge infrastructure that scales from 5 robots in a lab to 500+ robots across continents. The core principle — zero footprint on the robot — removes the manufacturer approval bottleneck entirely. The Zenoh-based transport is purpose-built for this exact problem (robot-to-cloud communication), and the stateless adapter pool scales horizontally with standard Kubernetes tooling. Most importantly, the existing TwinSight backend architecture doesn't change — the adapter pool produces the same Kafka events, consumes the same command events, and remains invisible to the platform's microservices.