title: "062: Agent Resilience & Self-Healing"
type: ticket
status: in_progress
priority: critical
sprint: S001_base_and_mvp1
assignee: CM0 + CC0
estimated_hours: 3
last_updated: "2026-03-23"

062: Agent Resilience & Self-Healing

Problem

Three core agents (B0, CM0, CC0) break frequently and require Shai's manual intervention to recover. In today's session alone:

  1. Bus went down (FastMCP bound to localhost inside Docker)
  2. CM0 settings.json crashed (extraKnownMarketplaces schema error)
  3. CC0 didn't know its identity (no CLAUDE.md on Hetzner)
  4. CM0 couldn't find Node.js (bash PATH vs zsh PATH)
  5. Paperclip update/comment routes 404 (wrong API paths)
  6. Bus went down AGAIN (auth error after rebuild)

Root Causes

Solution: 3-Layer Resilience

Layer 1: Hetzner Health Monitor (CC0 deploys)

A cron script that runs every 5 minutes on Hetzner:

File: /opt/struxio/scripts/health_check.sh

#!/bin/bash
set -euo pipefail

LOG="/opt/struxio/logs/health_check.log"
TIMESTAMP=$(date '+%Y-%m-%d %H:%M:%S')

# Check Bus MCP
BUS_STATUS=$(curl -s -o /dev/null -w "%{http_code}" https://bus.struxio.ai/healthz 2>/dev/null || echo "000")
if [ "$BUS_STATUS" != "200" ]; then
    echo "[$TIMESTAMP] BUS DOWN (HTTP $BUS_STATUS) — restarting..." >> "$LOG"
    cd /opt/struxio/bus/deploy
    docker compose restart bus
    sleep 10
    BUS_RETRY=$(curl -s -o /dev/null -w "%{http_code}" https://bus.struxio.ai/healthz 2>/dev/null || echo "000")
    if [ "$BUS_RETRY" != "200" ]; then
        echo "[$TIMESTAMP] BUS RESTART FAILED (HTTP $BUS_RETRY) — full rebuild..." >> "$LOG"
        docker compose up -d --build bus
    else
        echo "[$TIMESTAMP] BUS recovered after restart" >> "$LOG"
    fi
else
    echo "[$TIMESTAMP] BUS OK" >> "$LOG"
fi

# Check Paperclip
PC_STATUS=$(curl -s -o /dev/null -w "%{http_code}" https://paperclip.struxio.ai 2>/dev/null || echo "000")
if [ "$PC_STATUS" != "200" ] && [ "$PC_STATUS" != "301" ] && [ "$PC_STATUS" != "302" ]; then
    echo "[$TIMESTAMP] PAPERCLIP DOWN (HTTP $PC_STATUS) — restarting..." >> "$LOG"
    cd /opt/struxio/bus/deploy
    docker compose restart paperclip
else
    echo "[$TIMESTAMP] PAPERCLIP OK" >> "$LOG"
fi

# Check Postgres
PG_STATUS=$(docker exec deploy-postgres-1 pg_isready -U bus 2>/dev/null && echo "OK" || echo "DOWN")
if [ "$PG_STATUS" != "OK" ]; then
    echo "[$TIMESTAMP] POSTGRES DOWN — restarting..." >> "$LOG"
    cd /opt/struxio/bus/deploy
    docker compose restart postgres
else
    echo "[$TIMESTAMP] POSTGRES OK" >> "$LOG"
fi

# Check Caddy
CADDY_STATUS=$(curl -s -o /dev/null -w "%{http_code}" http://127.0.0.1:80 2>/dev/null || echo "000")
if [ "$CADDY_STATUS" = "000" ]; then
    echo "[$TIMESTAMP] CADDY DOWN — restarting..." >> "$LOG"
    cd /opt/struxio/bus/deploy
    docker compose restart caddy
fi

# Trim log to last 1000 lines
tail -1000 "$LOG" > "$LOG.tmp" && mv "$LOG.tmp" "$LOG"

Cron: */5 * * * * struxio /opt/struxio/scripts/health_check.sh

Layer 2: Mac Startup Validation (CM0 runs)

File: /Volumes/CORSAIR/STRUXIO_HardDrive/STRUXIO_Workspace/STRUXIO_OS/scripts/mac_startup_check.sh

#!/bin/bash
# Run BEFORE starting Claude Code on Mac

echo "=== STRUXIO Mac Startup Check ==="

# 1. Check Node.js
if ! command -v node &>/dev/null; then
    echo "[FAIL] Node.js not in PATH"
    echo "Fix: export PATH=\"/opt/homebrew/bin:\$PATH\" or use zsh"
    exit 1
fi
echo "[OK] Node.js $(node --version)"

# 2. Check settings.json validity
SETTINGS="$HOME/.claude/settings.json"
if [ -f "$SETTINGS" ]; then
    if python3 -c "import json; json.load(open('$SETTINGS'))" 2>/dev/null; then
        echo "[OK] settings.json valid JSON"
    else
        echo "[FAIL] settings.json invalid — restoring from backup"
        cp /Volumes/CORSAIR/STRUXIO_HardDrive/STRUXIO_Workspace/STRUXIO_OS/01_state/GLOBAL_claude_settings.json "$SETTINGS"
        echo "[FIXED] settings.json restored"
    fi
else
    echo "[WARN] settings.json missing — creating from template"
    cp /Volumes/CORSAIR/STRUXIO_HardDrive/STRUXIO_Workspace/STRUXIO_OS/01_state/GLOBAL_claude_settings.json "$SETTINGS"
    echo "[FIXED] settings.json created"
fi

# 3. Check CORSAIR mounted
if [ ! -d "/Volumes/CORSAIR/STRUXIO_HardDrive/STRUXIO_Workspace" ]; then
    echo "[FAIL] CORSAIR drive not mounted"
    exit 1
fi
echo "[OK] CORSAIR workspace accessible"

# 4. Check Bus reachable
BUS=$(curl -s -o /dev/null -w "%{http_code}" https://bus.struxio.ai/healthz 2>/dev/null || echo "000")
if [ "$BUS" = "200" ]; then
    echo "[OK] Bus MCP Server reachable"
else
    echo "[WARN] Bus MCP Server unreachable (HTTP $BUS) — CM0 can work offline"
fi

echo "=== All checks passed — starting Claude Code ==="
cd /Volumes/CORSAIR/STRUXIO_HardDrive/STRUXIO_Workspace
claude --dangerously-skip-permissions

Layer 3: Docker Compose Hardening (CC0 applies)

Add to docker-compose.yml:

services:
  bus:
    restart: unless-stopped
    healthcheck:
      test: ["CMD", "curl", "-f", "http://127.0.0.1:8088/healthz"]
      interval: 30s
      timeout: 5s
      retries: 3
      start_period: 15s
    deploy:
      resources:
        limits:
          memory: 512M

  postgres:
    restart: unless-stopped
    # healthcheck already present

  paperclip:
    restart: unless-stopped
    healthcheck:
      test: ["CMD", "curl", "-f", "http://127.0.0.1:3100"]
      interval: 30s
      timeout: 5s
      retries: 3

  caddy:
    restart: unless-stopped

Agent Startup Scripts

CM0 startup (Mac):

# Save as alias in ~/.zshrc:
alias cm0='/Volumes/CORSAIR/STRUXIO_HardDrive/STRUXIO_Workspace/STRUXIO_OS/scripts/mac_startup_check.sh'

Then just type cm0 in terminal.

CC0 startup (Hetzner):

ssh struxio@89.167.96.154
cd /opt/struxio/bus && claude --dangerously-skip-permissions
# Type CC0

B0 startup (Desktop):

B0 starts automatically when Claude Desktop opens with the STRUXIO.ai project.
First action: read ACTIVATE_B0.md → heartbeat → poll.

Acceptance Criteria


STRUXIO.ai // Confidential & Proprietary // © 2026