The original TwinSight proposal describes the ROS Adapter as a single container that connects directly to ROS 2 via DDS. This works for a demo with 3–5 robots on one network, but it breaks down in production for three fundamental reasons:
**Blocker.** Many robot manufacturers ship robots as sealed systems. They expose ROS 2 topics on the network but do not allow custom software installation. If the adapter must run ON the robot, you can't deploy it on half your fleet.

**Architecture gap.** DDS discovery uses multicast on the local subnet. Robots on different VLANs, buildings, or sites cannot be discovered by a single centralized adapter. The proposal ignores network topology entirely.

**Reliability risk.** One adapter container handling all robots means: if it crashes, the entire fleet goes dark; if it's CPU-bound processing 50 robots' telemetry, latency degrades for all; there is no horizontal scaling.

The entire architecture is built on one non-negotiable rule:
The robots run their standard ROS 2 stack — untouched, unmodified. They publish topics and accept commands via DDS on the local network, just as their manufacturer intended. Everything else happens outside the robot, on infrastructure you control.
This is possible because DDS is a network protocol. Any process on the same network can subscribe to a robot's topics without the robot knowing or caring. The robot doesn't need a "client" installed — it's already broadcasting. We just need to listen from the right place.
The solution splits the monolithic "ROS Adapter" into four distinct tiers, each with a clear job and independent scaling characteristics.
| Tier | What | Where It Runs | Scales By |
|---|---|---|---|
| TIER 1 | Robots — standard ROS 2 stack, untouched | On the physical robots | Adding more robots (each is independent) |
| TIER 2 | Edge Bridge — Zenoh bridge capturing DDS traffic | Small server/NUC on the robot's LAN at each site | One per site/network segment |
| TIER 3 | Adapter Pool — Zenoh Router + stateless adapter instances | Datacenter / cloud alongside TwinSight backend | Horizontal: add adapter instances per load |
| TIER 4 | TwinSight Platform — existing Kafka + microservices + UI | Datacenter / cloud | Existing scaling strategy (unchanged) |
Robots continue running their standard ROS 2 software stack. They publish topics (telemetry, sensor data, diagnostics) and accept commands (actions, services) via DDS on the local network. Nothing is installed, configured, or modified on the robot.
This approach works with any ROS 2 robot from any manufacturer. Whether it's a custom-built AGV, a commercial AMR from a vendor like MiR or Locus, or a simulated robot in Gazebo — if it speaks ROS 2 on a network, TwinSight can connect to it. You never need to ask a manufacturer for permission to install software. You never need to maintain adapter code on 50 different robots. The robots are treated as pure data sources and command sinks.
Requirements from the robot: Expose standard ROS 2 topics on the local network using any DDS implementation (Fast DDS, Cyclone DDS, Connext). Use ROS 2 namespaces (e.g., /robot_01/) for multi-robot disambiguation. That's it.
This is the tier that makes everything work. Each physical site where robots operate gets a small, dedicated piece of hardware running a Zenoh bridge. This bridge sits on the same LAN as the robots, listens to their DDS traffic, and forwards it over the WAN to the centralized adapter pool.
Eclipse Zenoh is a protocol designed for exactly this problem: moving robot data efficiently across networks. The zenoh-plugin-ros2dds (the successor to zenoh-bridge-dds) is a standalone process that:
1. Discovers all DDS participants on the local network automatically (via DDS SPDP/SEDP discovery)
2. Subscribes to configured ROS 2 topics
3. Translates DDS messages into Zenoh wire format (compressed, batched)
4. Forwards them over TCP/QUIC to a remote Zenoh router with built-in NAT traversal
DDS was designed for LANs — it uses multicast discovery, which doesn't cross subnets, VPNs, or the internet. Zenoh was designed for geo-distributed systems:
• NAT traversal — works through firewalls without port forwarding
• Bandwidth efficiency — compresses and batches messages (critical for WAN)
• Selective forwarding — only sends topics that someone is actually consuming
• QUIC transport — modern protocol, faster than TCP for lossy networks
• Bidirectional — commands flow back from cloud to robots on the same connection
The Zenoh bridge is lightweight. It doesn't process or transform data — it just forwards it. Hardware requirements are minimal:
| Fleet Size per Site | Hardware | Estimated Cost |
|---|---|---|
| 1–10 robots | Intel NUC / Raspberry Pi 5 / any x86 mini-PC | ~$150–$400 |
| 10–50 robots | Small server (4-core, 8GB RAM) | ~$400–$800 |
| 50+ robots | Dedicated edge server or 2x bridges with load sharing | ~$800–$2000 |
The edge bridge is stateless and replaceable. If it dies, you swap the hardware, run the same Docker container, and it auto-discovers all robots on the LAN again. No per-robot configuration. No state to lose. Recovery time: minutes.
```yaml
# docker-compose.edge-bridge.yml — deployed at each site
services:
  zenoh-bridge:
    image: eclipse/zenoh-bridge-ros2dds:latest
    network_mode: host   # CRITICAL: must see DDS multicast
    environment:
      - ZENOH_ROUTER=tcp/twinsight-cloud.example.com:7447
      - ROS_DOMAIN_ID=0
    volumes:
      - ./zenoh-bridge-config.json5:/config.json5
    restart: always
```
```json5
// zenoh-bridge-config.json5 — topic selection
// Only forward what TwinSight actually needs
{
  allowance: {
    pub: [
      "/*/cmd_vel",                  // teleoperation commands (cloud→robot)
    ],
    sub: [
      "/*/robot_state",              // state topics (robot→cloud)
      "/*/battery_state",
      "/*/odom",
      "/*/diagnostics",
      "/*/camera/image_compressed",
      "/*/scan",                     // LIDAR
    ]
  }
}
```
The `/*` prefix matches any robot namespace. The bridge auto-discovers all robots and forwards only the listed topics. `network_mode: host` is required so the container can see DDS multicast on the LAN.
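To make the selective-forwarding semantics concrete, here is a minimal sketch of an allowance check in Python. The `is_allowed` helper is hypothetical (not part of the bridge); real Zenoh key-expression matching is richer (`**` spans path segments), and `fnmatch` is only an approximation of the `/*` namespace wildcard.

```python
from fnmatch import fnmatch

# Subscription patterns mirroring the bridge config. The helper below is an
# illustrative assumption, not the bridge's actual matching implementation.
SUB_ALLOWANCE = [
    "/*/robot_state",
    "/*/battery_state",
    "/*/odom",
    "/*/diagnostics",
    "/*/camera/image_compressed",
    "/*/scan",
]

def is_allowed(topic: str, patterns=SUB_ALLOWANCE) -> bool:
    """Return True if a discovered topic matches any configured pattern."""
    return any(fnmatch(topic, pattern) for pattern in patterns)
```

Any topic not matching a pattern is simply never forwarded over the WAN, which is what keeps bandwidth usage proportional to what TwinSight actually consumes.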
This is where the original proposal's "ROS Adapter" logic lives — but redesigned as a horizontally scalable pool of stateless instances behind a Zenoh Router.
**Zenoh Router** (infrastructure): A single Zenoh Router (or an HA pair) acts as the entry point. All edge bridges connect here. The router handles topic routing, access control, and connection management. Think of it as a "switchboard" that knows which adapter instance handles which site.

**Adapter instances** (horizontally scalable): Stateless containers that subscribe to Zenoh topics (forwarded from edge bridges), perform the actual translation (Zenoh/ROS → JSON events), and publish to Kafka. Each instance handles a configured subset of robots (by site or robot range).

**Assignment Manager** (orchestration): A lightweight coordination service that assigns robots to adapter instances, detects instance failures, and rebalances. Similar to a Kafka consumer group coordinator. Can use Redis for state or run as part of the Zenoh Router config.

The adapter instance is the component that contains the actual translation logic. It's written in Python (with the Zenoh SDK) and is stateless: all state lives in Redis or Kafka.
Data routing decision: Structured telemetry (position, battery, state → small JSON) goes to Kafka. Binary blobs (camera images, LIDAR scans, point clouds → large binary) go directly to ReductStore. This prevents Kafka from being choked by multi-megabyte messages.
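The routing rule can be sketched as a small pure function. The function name and topic sets are illustrative assumptions (topic names taken from the bridge configuration), not the actual adapter API:

```python
# Illustrative sketch of the adapter's routing decision.
BLOB_TOPICS = {"scan", "camera/image_compressed"}   # large binary → ReductStore
STRUCTURED_TOPICS = {"robot_state", "battery_state", "odom", "diagnostics"}  # small JSON → Kafka

def route_sample(key_expr: str) -> str:
    """Decide the sink for a forwarded sample, e.g. 'robot_01/odom'."""
    robot_id, _, topic = key_expr.strip("/").partition("/")
    if topic in BLOB_TOPICS:
        return "reductstore"
    if topic in STRUCTURED_TOPICS:
        return "kafka"
    return "drop"  # unknown topics are ignored
```

Keeping this decision in one place means a new high-volume sensor topic only requires adding one entry to `BLOB_TOPICS`, never a Kafka broker reconfiguration.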
When an operator sends a command (start mission, teleoperation input), the flow reverses:
1. The backend publishes a `mission.start` event with the `robot_id` and mission parameters to Kafka.
2. The adapter responsible for that `robot_id` picks up the event.
3. The command is sent as a Zenoh message (a ROS 2 action goal or service call).
4. The Zenoh bridge at the robot's site converts it back to DDS and publishes on the LAN.
From the robot's perspective, it's receiving a normal ROS 2 message — it has no idea the command originated from a web browser.
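The adapter-side translation step can be sketched as follows. The key layout (`<robot_id>/cmd/<command>`) and event field names are hypothetical conventions for illustration; the actual Zenoh key scheme is defined by the bridge configuration.

```python
import json

def command_to_zenoh(event: dict) -> tuple[str, bytes]:
    """Map a Kafka command event onto a Zenoh key expression and payload.

    The "<robot_id>/cmd/<command>" layout is an assumed convention,
    not the bridge's actual key scheme.
    """
    robot_id = event["robot_id"]     # e.g. "robot_01"
    command = event["type"]          # e.g. "mission.start"
    key_expr = f"{robot_id}/cmd/{command}"
    payload = json.dumps(event.get("params", {})).encode()
    return key_expr, payload
```

The adapter would then publish `payload` on `key_expr` via its Zenoh session, and the edge bridge at the target site maps the key back to the corresponding DDS topic or action.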
The beauty of this design: the existing TwinSight architecture doesn't change at all. The adapter pool produces the exact same Kafka events the original proposal specified (robot.state.updated, robot.telemetry.pose, mission.progress, etc.). From the backend's perspective, it's still consuming events from Kafka and pushing commands to Kafka. It doesn't know or care that those events traveled through Zenoh bridges across the internet.
This means you can develop and test the TwinSight platform (backend + frontend) completely independently of the bridge infrastructure. Use a mock Kafka producer during development. Deploy the real bridge infrastructure only when connecting to actual robot sites. Swap bridge technologies in the future without touching a single line of platform code.
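A mock producer for platform development can be as small as this sketch (the class name and recorded event fields are illustrative; the event topics match the ones the proposal specifies):

```python
class MockKafkaProducer:
    """Drop-in stand-in for kafka-python's KafkaProducer during development.

    Records events in memory instead of sending them to a broker, so the
    backend and frontend can be exercised without any bridge infrastructure.
    """

    def __init__(self):
        self.sent = []

    def send(self, topic: str, value: dict):
        self.sent.append((topic, value))

# Emit the same event types the adapter pool would produce.
producer = MockKafkaProducer()
producer.send("robot.state.updated", {"robot_id": "robot_01", "state": "ACTIVE"})
producer.send("robot.telemetry.pose", {"robot_id": "robot_01", "x": 1.2, "y": 3.4})
```

Because the platform only sees Kafka events, swapping this mock for the real adapter pool requires no code changes on the backend side.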
The four-tier architecture supports multiple deployment configurations depending on the customer's environment:
Scenario: One warehouse, all robots on the same LAN, TwinSight running on a local server.
Deployment: Tier 2 + Tier 3 + Tier 4 all run on the same machine (or cluster). The Zenoh bridge runs in network_mode: host to see DDS traffic, and connects to the adapter pool via localhost. Effectively collapses Tiers 2–4 into a single Docker Compose deployment. Zero WAN traffic.
Scenario: Three warehouses in different cities, TwinSight in the cloud.
Deployment: Each warehouse gets a small edge server running the Zenoh bridge (Tier 2). The bridge connects over VPN or direct internet (Zenoh handles NAT) to the cloud-hosted Zenoh Router + Adapter Pool + TwinSight (Tiers 3 + 4). This is the primary production pattern for distributed fleets.
*Most common production pattern.*

Scenario: Low-latency teleoperation needed, but central monitoring is also required.
Deployment: Each site runs its own Zenoh bridge AND a local adapter instance (for low-latency commands and teleoperation). A second Zenoh connection forwards telemetry to the cloud for central monitoring, reporting, and ML. This gives you sub-10ms command latency locally while still having the centralized fleet overview.
*Best for teleoperation.*

Scenario: Military or high-security environment, no internet connection allowed.
Deployment: Everything runs on-premises on an isolated network. Tier 2 + 3 + 4 in a single rack. Zenoh bridges connect via internal LAN to the adapter pool. The architecture still provides multi-site support if sites are connected via a private WAN (no internet needed).
*Security-sensitive environments.*

| Fleet Scale | Edge Bridges | Adapter Instances | Zenoh Router | Notes |
|---|---|---|---|---|
| 1–20 robots, 1 site | 1 (can share with TwinSight host) | 1 | Embedded in adapter | Everything in one Docker Compose. Development / small production. |
| 20–100 robots, 1–3 sites | 1 per site | 2–4 (one per site + spare) | 1 dedicated | Standard production. Auto-assignment of robots to adapters by site. |
| 100–500 robots, 3–10 sites | 1–2 per site | 5–15 (auto-scaled) | HA pair (active/standby) | Kubernetes deployment. HPA (Horizontal Pod Autoscaler) on adapter pods based on message throughput. |
| 500+ robots, 10+ sites | 2+ per large site | 15+ (auto-scaled) | Zenoh router mesh | Multiple Zenoh routers with geographic routing. Kafka partitioning by site_id + robot_id. |
The bottleneck shifts as you scale: at 1–50 robots, the bottleneck is network bandwidth (especially for camera streams). At 50–200 robots, it's adapter CPU (deserializing and normalizing messages). At 200+ robots, it's Kafka throughput and TimescaleDB write speed. The four-tier architecture lets you address each bottleneck independently without redesigning the system.
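To see why bandwidth dominates first, here is a back-of-the-envelope calculation. The frame size and rate are assumed figures for illustration, not measurements:

```python
# Assumed figures for illustration only.
robots = 50
frame_kb = 100   # compressed camera frame, ~100 KB
fps = 15         # frames per second per robot

per_robot_mbps = frame_kb * 8 * fps / 1000   # KB → kilobits, then → megabits
fleet_mbps = per_robot_mbps * robots
print(f"{per_robot_mbps:.0f} Mbps per robot, {fleet_mbps:.0f} Mbps fleet-wide")
```

Under these assumptions a 50-robot site pushes hundreds of megabits of camera traffic alone, which is why selective topic forwarding and Zenoh's compression matter long before adapter CPU becomes the limit.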
Bridging robot networks to the internet introduces serious attack surface. Here's how to lock it down:
All Zenoh connections (edge → router → adapters) use TLS 1.3 with mutual TLS (mTLS). Each edge bridge has a unique certificate. The Zenoh router only accepts connections from known certificates. A compromised bridge can be revoked without affecting others.
The Zenoh router enforces ACLs per bridge. Site A's bridge can only publish/subscribe to topics for Site A's robots. Even if an attacker compromises Site A's bridge, they can't access Site B's robot data or send commands to Site B's robots.
Commands flowing from platform → robots are signed by the adapter with a per-adapter key. The edge bridge can optionally verify signatures before forwarding to DDS. This prevents command injection even if the Zenoh router is compromised.
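The sign/verify flow can be sketched with Python's standard `hmac` module. This assumes a symmetric per-adapter key for simplicity; the real design might well use asymmetric keys instead, so treat this only as an illustration of the mechanism:

```python
import hashlib
import hmac
import json

# Placeholder key for illustration — never hard-code secrets in production.
ADAPTER_KEY = b"per-adapter-secret"

def sign_command(command: dict, key: bytes = ADAPTER_KEY) -> str:
    """Produce an HMAC-SHA256 signature over a canonicalized command."""
    payload = json.dumps(command, sort_keys=True).encode()
    return hmac.new(key, payload, hashlib.sha256).hexdigest()

def verify_command(command: dict, signature: str, key: bytes = ADAPTER_KEY) -> bool:
    """Constant-time check that a command was signed with the adapter's key."""
    return hmac.compare_digest(sign_command(command, key), signature)
```

An edge bridge holding the adapter's verification key can reject any command whose signature doesn't match, so even a compromised router in the middle cannot inject valid commands.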
Robot LANs remain isolated — no direct internet access. The edge bridge is the only connection point, and it only forwards configured topics. DDS discovery traffic stays local. The bridge acts as a controlled, auditable gateway.
| What Fails | Impact | Recovery | Data Loss |
|---|---|---|---|
| Edge bridge crashes | That site's telemetry stops flowing. Robots continue operating autonomously. Platform marks robots as "stale". | Auto-restart (Docker restart policy). Auto-rediscovers DDS topics. Recovery: ~5–15 seconds. | Telemetry during outage is lost (acceptable — it's ephemeral). Robots buffer locally if configured. |
| WAN connection lost | Same as bridge crash from platform perspective. Edge bridge buffers locally (Zenoh supports message queuing). | Automatic reconnection when WAN recovers. Buffered messages are forwarded. | Minimal — Zenoh queue depth is configurable. Oldest messages dropped if queue fills. |
| Adapter instance crashes | Robots assigned to that instance go "stale" on the platform. | Assignment Manager detects failure (heartbeat timeout) and reassigns robots to surviving instances. K8s auto-replaces the pod. | Events during reassignment window (~5–10s) are buffered in Zenoh Router. |
| Zenoh Router crashes | All sites disconnected from platform. Robots unaffected (they operate autonomously). | HA standby router takes over (if deployed). Edge bridges auto-reconnect to backup. Recovery: ~10–30 seconds. | Buffered in edge bridges during outage. |
| Kafka is down | Adapter instances can't publish events. They buffer internally and retry. | Standard Kafka HA (replication factor ≥ 2). Adapters retry with exponential backoff. | None — adapters buffer until Kafka recovers. |
Robots never stop working because the platform is down. The architecture ensures that telemetry flow is best-effort (lost data during outages is acceptable — you can't control a robot based on 30-second-old data anyway), while commands are guaranteed-delivery (buffered and retried until confirmed). A robot that loses connection to the platform continues its current mission autonomously — it doesn't stop or enter an error state.
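The Assignment Manager's rebalancing step described above can be sketched as a pure function. The round-robin-by-sorted-id strategy is an assumption; the real service could use consistent hashing or site affinity instead:

```python
# Sketch of the Assignment Manager's rebalancing step (strategy assumed).
def assign_robots(robots: list[str], live_instances: list[str]) -> dict[str, str]:
    """Map each robot to a surviving adapter instance, round-robin."""
    if not live_instances:
        return {}  # no instances alive: robots go "stale" until one returns
    instances = sorted(live_instances)
    return {
        robot: instances[i % len(instances)]
        for i, robot in enumerate(sorted(robots))
    }
```

Recomputing the full mapping on every membership change keeps the service stateless; the trade-off is that more robots move between instances than a consistent-hashing scheme would move.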
| Phase | Deliverables | Duration | Dependencies |
|---|---|---|---|
| Phase 1: Single-site MVP | Zenoh bridge Docker image with ROS 2 topic configuration • Single adapter instance (Python + Zenoh SDK + Kafka producer) • Deserializer for core ROS 2 message types (Odometry, BatteryState, DiagnosticArray, sensor_msgs) • Kafka event schema definitions (matching TwinSight proposal) • Docker Compose for single-site deployment | 3–4 weeks | Kafka schema definitions from backend team |
| Phase 2: Multi-site + scaling | Zenoh Router deployment (standalone container) • Assignment Manager (robot → adapter instance mapping) • Multi-adapter instance support with health monitoring • Edge bridge deployment automation (Ansible/Terraform scripts for site setup) • mTLS certificate provisioning workflow | 2–3 weeks | Phase 1 complete, VPN/network access to test sites |
| Phase 3: Commands + teleoperation | Reverse command flow (Kafka → adapter → Zenoh → edge → DDS → robot) • ROS 2 action client support (for missions with feedback) • Teleoperation WebSocket → Zenoh path with latency optimization • Command signing and verification • Deadman switch integration | 3–4 weeks | Phase 2 complete, teleoperation UI from frontend team |
| Phase 4: Production hardening | ReductStore integration for binary blob recording • Kubernetes Helm charts for adapter pool + Zenoh router • HPA policies based on message throughput metrics • Prometheus metrics exporter for all bridge components • Failover testing and chaos engineering • Documentation: edge bridge deployment guide per site | 2–3 weeks | Phase 3 complete, K8s cluster available |
| Component | Technology | Why This One |
|---|---|---|
| Edge bridge | zenoh-plugin-ros2dds | Production-grade, no robot install needed, NAT traversal, selective topic forwarding, bidirectional |
| WAN transport | Zenoh protocol (TCP/QUIC) | Designed for geo-distributed robotics, compression, batching, handles unreliable networks |
| Central router | Zenoh Router | ACL support, connection management, HA pairing, topic routing |
| Adapter runtime | Python + zenoh-python + kafka-python | Matches TwinSight backend stack (FastAPI/Python), strong ROS 2 message deserialization ecosystem |
| Blob storage | ReductStore + ReductStore Agent | Time-indexed binary storage, native ROS 2 support, FIFO retention (as discussed in previous analysis) |
| Orchestration | Kubernetes (production) / Docker Compose (dev/small) | HPA for adapter auto-scaling, pod replacement on failure, Helm for reproducible deployments |
This architecture transforms the ROS Adapter from a monolithic single-point-of-failure into a production-grade, multi-site bridge infrastructure that scales from 5 robots in a lab to 500+ robots across continents. The core principle — zero footprint on the robot — removes the manufacturer approval bottleneck entirely. The Zenoh-based transport is purpose-built for this exact problem (robot-to-cloud communication), and the stateless adapter pool scales horizontally with standard Kubernetes tooling. Most importantly, the existing TwinSight backend architecture doesn't change — the adapter pool produces the same Kafka events, consumes the same command events, and remains invisible to the platform's microservices.