Daemon/Boot/Deacon Watchdog Chain
> Autonomous health monitoring and recovery in Gas Town.
Overview
Gas Town uses a three-tier watchdog chain for autonomous health monitoring:
Daemon (Go process) ← Dumb transport, 3-min heartbeat
│
└─► Boot (AI agent) ← Intelligent triage, fresh each tick
│
└─► Deacon (AI agent) ← Continuous patrol, long-running
│
└─► Witnesses & Refineries ← Per-rig agents
Key insight: The daemon is mechanical (can't reason), but health decisions need
intelligence (is the agent stuck or just thinking?). Boot bridges this gap.
Design Rationale: Why Two Agents?
The Problem
The daemon needs to ensure the Deacon is healthy, but:
- Daemon can't reason - It's Go code following the ZFC principle (don't reason
- Waking costs context - Each time you spawn an AI agent, you consume context
- Observation requires intelligence - Distinguishing "agent composing large
The Solution: Boot as Triage
Boot is a narrow, ephemeral AI agent that:
- Runs fresh each daemon tick (no accumulated context debt)
- Makes a single decision: should Deacon wake?
- Exits immediately after deciding
This gives us intelligent triage without the cost of keeping a full AI running.
Why Not Merge Boot into Deacon?
We could have Deacon handle its own "should I be awake?" logic, but:
- Deacon can't observe itself - A hung Deacon can't detect it's hung
- Context accumulation - Deacon runs continuously; Boot restarts fresh
- Cost in idle towns - Boot only costs tokens when it runs; Deacon costs
Why Not Replace with Go Code?
The daemon could directly monitor agents without AI, but:
- Can't observe panes - Go code can't interpret tmux output semantically
- Can't distinguish stuck vs working - No reasoning about agent state
- Escalation is complex - When to notify? When to force-restart? AI handles
Session Ownership
| Agent | Session Name | Location | Lifecycle |
| Daemon | (Go process) | ~/gt/daemon/ | Persistent, auto-restart |
| Boot | gt-boot | ~/gt/deacon/dogs/boot/ | Ephemeral, fresh each tick |
| Deacon | hq-deacon | ~/gt/deacon/ | Long-running, handoff loop |
gt-boot, NOT hq-deacon. This prevents Boot
from conflicting with a running Deacon session.
Heartbeat Mechanics
Daemon Heartbeat (3 minutes)
The daemon runs a heartbeat tick every 3 minutes:
func (d *Daemon) heartbeatTick() {
d.ensureBootRunning() // 1. Spawn Boot for triage
d.checkDeaconHeartbeat() // 2. Belt-and-suspenders fallback
d.ensureWitnessesRunning() // 3. Witness health (checks tmux directly)
d.ensureRefineriesRunning() // 4. Refinery health (checks tmux directly)
d.triggerPendingSpawns() // 5. Bootstrap polecats
d.processLifecycleRequests() // 6. Cycle/restart requests
// Agent state derived from tmux, not recorded in beads (gt-zecmc)
}
Deacon Heartbeat (continuous)
The Deacon updates ~/gt/deacon/heartbeat.json at the start of each patrol cycle:
{
"timestamp": "2026-01-02T18:30:00Z",
"cycle": 42,
"last_action": "health-scan",
"healthy_agents": 3,
"unhealthy_agents": 0
}
Heartbeat Freshness
| Age | State | Boot Action |
| < 5 min | Fresh | Nothing (Deacon active) |
| 5-15 min | Stale | Nudge if pending mail |
| > 15 min | Very stale | Wake (Deacon may be stuck) |
Boot Decision Matrix
When Boot runs, it observes:
- Is Deacon session alive?
- How old is Deacon's heartbeat?
- Is there pending mail for Deacon?
- What's in Deacon's tmux pane?
Then decides:
| Condition | Action | Command |
| Session dead | START | Exit; daemon calls ensureDeaconRunning() |
| Heartbeat > 15 min | WAKE | gt nudge deacon "Boot wake: check your inbox" |
| Heartbeat 5-15 min + mail | NUDGE | gt nudge deacon "Boot check-in: pending work" |
| Heartbeat fresh | NOTHING | Exit silently |
Handoff Flow
Deacon Handoff
The Deacon runs continuous patrol cycles. After N cycles or high context:
End of patrol cycle:
│
├─ Squash wisp to digest (ephemeral → permanent)
├─ Write summary to molecule state
└─ gt handoff -s "Routine cycle" -m "Details"
│
└─ Creates mail for next session
Next daemon tick:
Daemon → ensureDeaconRunning()
│
└─ Spawns fresh Deacon in gt-deacon
│
└─ SessionStart hook: gt mail check --inject
│
└─ Previous handoff mail injected
│
└─ Deacon reads and continues
Boot Handoff (Rare)
Boot is ephemeral - it exits after each tick. No persistent handoff needed.
However, Boot uses a marker file to prevent double-spawning:
- Marker:
~/gt/deacon/dogs/boot/.boot-running(TTL: 5 minutes) - Status:
~/gt/deacon/dogs/boot/.boot-status.json(last action/result)
If the marker exists and is recent, daemon skips Boot spawn for that tick.
Degraded Mode
When tmux is unavailable, Gas Town enters degraded mode:
| Capability | Normal | Degraded |
| Boot runs | As AI in tmux | As Go code (mechanical) |
| Observe panes | Yes | No |
| Nudge agents | Yes | No |
| Start agents | tmux sessions | Direct spawn |
Degraded Boot triage is purely mechanical:
- Session dead → start
- Heartbeat stale → restart
- No reasoning, just thresholds
Fallback Chain
Multiple layers ensure recovery:
- Boot triage - Intelligent observation, first line
- Daemon checkDeaconHeartbeat() - Belt-and-suspenders if Boot fails
- Tmux-based discovery - Daemon checks tmux sessions directly (no bead state)
- Human escalation - Mail to overseer for unrecoverable states
State Files
| File | Purpose | Updated By |
deacon/heartbeat.json | Deacon freshness | Deacon (each cycle) |
deacon/dogs/boot/.boot-running | Boot in-progress marker | Boot spawn |
deacon/dogs/boot/.boot-status.json | Boot last action | Boot triage |
deacon/health-check-state.json | Agent health tracking | gt deacon health-check |
daemon/daemon.log | Daemon activity | Daemon |
daemon/daemon.pid | Daemon process ID | Daemon startup |
Debugging
# Check Deacon heartbeat
cat ~/gt/deacon/heartbeat.json | jq .
# Check Boot status
cat ~/gt/deacon/dogs/boot/.boot-status.json | jq .
# View daemon log
tail -f ~/gt/daemon/daemon.log
# Manual Boot run
gt boot triage
# Manual Deacon health check
gt deacon health-check
Common Issues
Boot Spawns in Wrong Session
Symptom: Boot runs inhq-deacon instead of gt-boot
Cause: Session name confusion in spawn code
Fix: Ensure gt boot triage specifies --session=gt-boot
Zombie Sessions Block Restart
Symptom: tmux session exists but Claude is dead Cause: Daemon checks session existence, not process health Fix: Kill zombie sessions before recreating:gt session kill hq-deacon
Status Shows Wrong State
Symptom:gt status shows wrong state for agents
Cause: Previously bead state and tmux state could diverge
Fix: As of gt-zecmc, status derives state from tmux directly (no bead state for
observable conditions like running/stopped). Non-observable states (stuck, awaiting-gate) are still stored in beads.
Design Decision: Keep Separation
The issue [gt-1847v] considered three options:
Option A: Keep Boot/Deacon Separation (CHOSEN)
- Boot is ephemeral, spawns fresh each heartbeat
- Boot runs in
gt-boot, exits after triage - Deacon runs in
hq-deacon, continuous patrol - Clear session boundaries, clear lifecycle
Option B: Merge Boot into Deacon (Rejected)
- Single
hq-deaconsession handles everything - Deacon checks "should I be awake?" internally
- Deacon can't observe itself (hung Deacon can't detect hang)
- Context accumulates even when idle (cost in quiet towns)
- No external watchdog means no recovery from Deacon failure
Option C: Replace with Go Watchdog (Rejected)
- Daemon directly monitors witness/refinery
- No Boot, no Deacon AI for health checks
- AI agents only for complex decisions
- Go code can't interpret tmux pane output semantically
- Can't distinguish "stuck" from "thinking deeply"
- Loses the intelligent triage that makes the system resilient
- Escalation decisions are nuanced (when to notify? force-restart?)
Implementation Fixes Needed
The separation is correct; these bugs need fixing:
- Session confusion (gt-sgzsb): Boot spawns in wrong session
- Zombie blocking (gt-j1i0r): Daemon can't kill zombie sessions
- ~~Status mismatch (gt-doih4): Bead vs tmux state divergence~~ → FIXED in gt-zecmc
- Ensure semantics (gt-ekc5u): Start should kill zombies first
Summary
The watchdog chain provides autonomous recovery:
- Daemon: Mechanical heartbeat, spawns Boot
- Boot: Intelligent triage, decides Deacon fate
- Deacon: Continuous patrol, monitors workers
Boot exists because the daemon can't reason and Deacon can't observe itself. The separation costs complexity but enables:
- Intelligent triage without constant AI cost
- Fresh context for each triage decision
- Graceful degradation when tmux unavailable
- Multiple fallback layers for reliability