# RFC-0009 — `system.metrics.subscribe` SSE Wire Format

- **Status:** **DRAFT — self-approvable per architect** (no enum mutation, no scope-vocabulary change). Version `v1.0`.
- **Author:** 🧠 Agentic Architect
- **Sprint:** 2 (KO 2026-04-22; deliverable D5 — closes the `not_implemented` gap on `system.metrics.subscribe`).
- **Audience:** Cloud Agents (LLMs); ☁️ cloudflare-native-edge (gateway SSE relay); 🦀 edge-kubelet-engineer (device-side cadence loop).
- **Depends on:** RFC-0001 v1.3 (`system.metrics` kind contract; projection rule), RFC-0003 v1.3 (`tools:call:read_only`), RFC-0005 (`sysecho.<node_id>.echo.invoke` projection precedent), RFC-0006 v1.0 (read_only enforcement mapping).
- **Does not modify:** any prior RFC. No enum is mutated. No new scope is coined.
- **Scope:** Frame schema, cadence, heartbeat, backpressure, and close-code semantics for the streaming `system.metrics.subscribe` tool.

---

## §1. Motivation

`system.metrics.snapshot` (one-shot) shipped in Sprint 1 and is green. `system.metrics.subscribe` currently returns `not_implemented`. This RFC specifies the SSE wire format precisely enough that the implementation PR can land without further architectural review.

---

## §2. Tool Identifier & Projection

Per RFC-0001 v1.3 §3 projection rule (`{kind_short}.{node_id}.{cap_id}.{verb}`), with RFC-0001 v1.3 §3.2 `kind_short` registry row `system.metrics → sys` (existing) — consistent with the RFC-0005 §2 `sysecho.<node_id>.echo.invoke` precedent:

```
sys.<node_id, 26-char ULID>.metrics.subscribe       (48 chars for this tool, ≤ 64) ✓
```

- `kind`: `system.metrics`  *(existing)*
- `kind_short`: `sys`        *(existing)*
- `cap_id`: `metrics`        *(existing)*
- `verb`: `subscribe`        *(existing in RFC-0001 verb enum)*
- `safety_class`: `read_only` *(existing; per RFC-0001 v1.1 locked decision #4)*

**Worst-case length arithmetic:** `3 (sys) + 1 + 26 (ULID) + 1 + 7 (metrics) + 1 + 9 (subscribe) = 48`, where each `+ 1` is a `.` separator. The 51-char figure in RFC-0001 v1.2 §3.1 reflects the longest `kind_short` in the registry, not this tool. Both are within the 64-char ceiling.
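The length arithmetic can be checked mechanically. The following is a minimal sketch, assuming a hypothetical helper `projectToolName` (not defined by any RFC) that joins the four segments with `.` and enforces the 64-char ceiling. The example `node_id` is the one used in the §5.1 wire sample.

```typescript
// Ceiling on projected tool names, as stated in this section.
const MAX_TOOL_NAME_LEN = 64;

// Hypothetical helper: applies the {kind_short}.{node_id}.{cap_id}.{verb} rule.
function projectToolName(
  kindShort: string,
  nodeId: string,
  capId: string,
  verb: string,
): string {
  const projected = [kindShort, nodeId, capId, verb].join(".");
  if (projected.length > MAX_TOOL_NAME_LEN) {
    throw new Error(
      `tool name is ${projected.length} chars, exceeds ${MAX_TOOL_NAME_LEN}`,
    );
  }
  return projected;
}

const subscribeToolName = projectToolName(
  "sys",
  "01hzx9k3m4p7q8r9s0t1v2w3xy", // 26-char ULID from the §5.1 example
  "metrics",
  "subscribe",
);
console.log(subscribeToolName.length); // 48
```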

---

## §3. Transport

- **Method:** `POST` to `/mcp/tools/call` (NOT `GET`). Chosen explicitly because (a) the dispatch route already exists for unary calls, (b) request body carries `arguments` cleanly, (c) avoids querystring length / encoding issues for future arg growth, (d) matches the gateway's existing auth middleware path.
- **Request `Accept`:** `text/event-stream` REQUIRED. If the header is absent or is `application/json`, the gateway MUST NOT fall back to one-shot snapshot semantics; it MUST return RFC-0001 §4 `E_BAD_REQUEST` (mixing modes is not supported in v1.0).
- **Response `Content-Type`:** `text/event-stream; charset=utf-8`.
- **Response status on open:** `200 OK`. Once the stream is open, all terminal conditions are signalled via SSE close codes (§7), not HTTP status.
- **Auth:** `Authorization: Bearer <agent_token>` per RFC-0003 v1.3 §V13.1; token MUST carry `tools:call:read_only` (mapped via RFC-0006 v1.0 §2). Cross-tenant check per RFC-0003 v1.3 §5.
- **Request body** (JSON, `additionalProperties: false`):

```json
{
  "tool": "sys.<node_id>.metrics.subscribe",
  "arguments": {
    "interval_ms": 5000
  }
}
```
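As an illustration of the transport rules above, a client might assemble the request as follows. This is a sketch under stated assumptions: `buildSubscribeRequest` and the `SubscribeRequestInit` shape are hypothetical names, the token value is a placeholder, and the `Content-Type: application/json` request header is assumed rather than mandated by this section.

```typescript
// Hypothetical shape; compatible with the options object accepted by fetch().
interface SubscribeRequestInit {
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

// Hypothetical helper: nodeId and agentToken are supplied by the caller.
function buildSubscribeRequest(
  nodeId: string,
  agentToken: string,
  intervalMs?: number,
): SubscribeRequestInit {
  // Omit interval_ms entirely when the caller wants the schema default.
  const args: { interval_ms?: number } = {};
  if (intervalMs !== undefined) {
    args.interval_ms = intervalMs;
  }
  return {
    method: "POST",
    headers: {
      "Accept": "text/event-stream", // REQUIRED by this section
      "Authorization": `Bearer ${agentToken}`,
      "Content-Type": "application/json", // assumption, not stated in the RFC
    },
    body: JSON.stringify({
      tool: `sys.${nodeId}.metrics.subscribe`,
      arguments: args,
    }),
  };
}

const subscribeInit = buildSubscribeRequest(
  "01hzx9k3m4p7q8r9s0t1v2w3xy",
  "example-token", // placeholder, not a real token
  5000,
);
console.log(subscribeInit.body);
```

The returned object can be passed as the second argument to `fetch` against the gateway's `/mcp/tools/call` route; the gateway origin is deployment-specific and is not shown here.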

---

## §4. Arguments Schema (Draft 2020-12)

`schema_ref`: `mcp://schemas/system.metrics.subscribe.input@1.0.0`

```json
{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "$id": "mcp://schemas/system.metrics.subscribe.input@1.0.0",
  "type": "object",
  "additionalProperties": false,
  "properties": {
    "interval_ms": {
      "type": "integer",
      "minimum": 1000,
      "maximum": 60000,
      "default": 5000,
      "description": "Sampling cadence on the device. Ceiling 60 s prevents unbounded long-poll abuse; floor 1 s prevents fanout-cost runaway."
    }
  }
}
```
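The schema's constraints are small enough to check by hand. Below is a minimal sketch of such a check, assuming a hypothetical `resolveSubscribeArgs` function. Note that JSON Schema's `default` keyword is only an annotation, so applying `5000` when `interval_ms` is omitted is an assumption about how the gateway treats it.

```typescript
const INTERVAL_MS_MIN = 1000;
const INTERVAL_MS_MAX = 60000;
const INTERVAL_MS_DEFAULT = 5000;

type ArgsResult =
  | { ok: true; intervalMs: number }
  | { ok: false; error: string };

// Hypothetical validator mirroring the input schema above.
function resolveSubscribeArgs(args: unknown): ArgsResult {
  if (typeof args !== "object" || args === null || Array.isArray(args)) {
    return { ok: false, error: "arguments must be an object" };
  }
  const record = args as Record<string, unknown>;
  // additionalProperties: false, so interval_ms is the only permitted key.
  for (const key of Object.keys(record)) {
    if (key !== "interval_ms") {
      return { ok: false, error: `unknown property: ${key}` };
    }
  }
  const raw = record["interval_ms"];
  if (raw === undefined) {
    // Assumption: the gateway applies the schema default when omitted.
    return { ok: true, intervalMs: INTERVAL_MS_DEFAULT };
  }
  if (typeof raw !== "number" || !Number.isInteger(raw)) {
    return { ok: false, error: "interval_ms must be an integer" };
  }
  if (raw < INTERVAL_MS_MIN || raw > INTERVAL_MS_MAX) {
    return { ok: false, error: "interval_ms out of range [1000, 60000]" };
  }
  return { ok: true, intervalMs: raw };
}

console.log(resolveSubscribeArgs({}));
console.log(resolveSubscribeArgs({ interval_ms: 500 }));
```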

---

## §5. SSE Event Types

Three event types only (closed set). Any other `event:` line is a protocol error and the client MUST close.

### §5.1 `event: metric` — telemetry frame

`schema_ref`: `mcp://schemas/system.metrics.subscribe.frame@1.0.0`

```json
{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "$id": "mcp://schemas/system.metrics.subscribe.frame@1.0.0",
  "type": "object",
  "additionalProperties": false,
  "required": ["ts_ms", "node_id", "cpu_pct", "mem_bytes", "mem_total_bytes",
               "disk_pct", "load_1m", "load_5m", "load_15m"],
  "properties": {
    "ts_ms":           { "type": "integer", "minimum": 1700000000000 },
    "node_id":         { "type": "string", "pattern": "^[0-9a-hjkmnp-tv-z]{26}$" },
    "cpu_pct":         { "type": "number",  "minimum": 0, "maximum": 100 },
    "mem_bytes":       { "type": "integer", "minimum": 0 },
    "mem_total_bytes": { "type": "integer", "minimum": 1 },
    "disk_pct":        { "type": "number",  "minimum": 0, "maximum": 100 },
    "load_1m":         { "type": "number",  "minimum": 0 },
    "load_5m":         { "type": "number",  "minimum": 0 },
    "load_15m":        { "type": "number",  "minimum": 0 }
  }
}
```

Wire form:

```
event: metric
data: {"ts_ms":1745300000000,"node_id":"01hzx9k3m4p7q8r9s0t1v2w3xy","cpu_pct":12.4,"mem_bytes":1234567890,"mem_total_bytes":8589934592,"disk_pct":31.2,"load_1m":0.42,"load_5m":0.39,"load_15m":0.35}
```
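A receiver can validate the `data:` payload of a `metric` event against the frame schema without a schema library. The following is a sketch only: `MetricFrame` and `isMetricFrame` are hypothetical names, and the checks are hand-transcribed from the schema above rather than generated from it.

```typescript
interface MetricFrame {
  ts_ms: number;
  node_id: string;
  cpu_pct: number;
  mem_bytes: number;
  mem_total_bytes: number;
  disk_pct: number;
  load_1m: number;
  load_5m: number;
  load_15m: number;
}

const NODE_ID_RE = /^[0-9a-hjkmnp-tv-z]{26}$/;
const FRAME_KEYS: readonly string[] = [
  "ts_ms", "node_id", "cpu_pct", "mem_bytes", "mem_total_bytes",
  "disk_pct", "load_1m", "load_5m", "load_15m",
];

function isMetricFrame(value: unknown): value is MetricFrame {
  if (typeof value !== "object" || value === null || Array.isArray(value)) {
    return false;
  }
  const v = value as Record<string, unknown>;
  const keys = Object.keys(v);
  // Every property is required and additionalProperties is false,
  // so the key set must match FRAME_KEYS exactly.
  if (keys.length !== FRAME_KEYS.length) return false;
  for (const k of FRAME_KEYS) {
    if (keys.indexOf(k) === -1) return false;
  }
  const isInt = (x: unknown, min: number): boolean =>
    typeof x === "number" && Number.isInteger(x) && x >= min;
  const inRange = (x: unknown, min: number, max: number): boolean =>
    typeof x === "number" && Number.isFinite(x) && x >= min && x <= max;
  const nodeId = v["node_id"];
  return (
    isInt(v["ts_ms"], 1700000000000) &&
    typeof nodeId === "string" && NODE_ID_RE.test(nodeId) &&
    inRange(v["cpu_pct"], 0, 100) &&
    isInt(v["mem_bytes"], 0) &&
    isInt(v["mem_total_bytes"], 1) &&
    inRange(v["disk_pct"], 0, 100) &&
    inRange(v["load_1m"], 0, Infinity) &&
    inRange(v["load_5m"], 0, Infinity) &&
    inRange(v["load_15m"], 0, Infinity)
  );
}

// The data payload from the wire-form example above.
const sampleFrameJson =
  '{"ts_ms":1745300000000,"node_id":"01hzx9k3m4p7q8r9s0t1v2w3xy","cpu_pct":12.4,' +
  '"mem_bytes":1234567890,"mem_total_bytes":8589934592,"disk_pct":31.2,' +
  '"load_1m":0.42,"load_5m":0.39,"load_15m":0.35}';
console.log(isMetricFrame(JSON.parse(sampleFrameJson))); // true
```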

### §5.2 `event: ping` — heartbeat

Cadence: **every 25 s, deterministic** (server clock; not tied to `interval_ms`). The payload is the empty JSON object `{}`. The heartbeat keeps connections through Cloudflare (CF) intermediaries alive and gives the client a liveness signal independent of metric cadence.

```
event: ping
data: {}
```

### §5.3 `event: close` — terminal frame

Carries the close code (§7) so HTTP-status-blind clients still observe the reason. Sent immediately before the server tears down the TCP/TLS connection.

```
event: close
data: {"code":1000,"reason":"normal"}
```

`reason` is a closed enum drawn from §7; `additionalProperties: false`.
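The closed event set can be enforced at the point where a client decodes one SSE event block. The sketch below assumes a hypothetical `parseSseBlock` function that receives one complete block (the lines between two blank lines); incremental buffering of the byte stream is omitted. An unknown `event:` value throws, which a caller would treat as the protocol error that requires closing the connection.

```typescript
type StreamEvent =
  | { type: "metric"; data: unknown }
  | { type: "ping" }
  | { type: "close"; code: number; reason: string };

// Hypothetical decoder for one complete SSE event block.
function parseSseBlock(block: string): StreamEvent {
  let eventType = "";
  const dataLines: string[] = [];
  for (const line of block.split(/\r?\n/)) {
    if (line.startsWith("event:")) {
      eventType = line.slice("event:".length).trim();
    } else if (line.startsWith("data:")) {
      dataLines.push(line.slice("data:".length).trim());
    }
  }
  // Every event type in this RFC carries a JSON payload on its data line.
  const payload: unknown = JSON.parse(dataLines.join("\n"));
  switch (eventType) {
    case "metric":
      return { type: "metric", data: payload };
    case "ping":
      return { type: "ping" };
    case "close": {
      if (typeof payload !== "object" || payload === null) {
        throw new Error("malformed close frame");
      }
      const p = payload as { code?: unknown; reason?: unknown };
      if (typeof p.code !== "number" || typeof p.reason !== "string") {
        throw new Error("malformed close frame");
      }
      return { type: "close", code: p.code, reason: p.reason };
    }
    default:
      // Closed set: anything else is a protocol error.
      throw new Error(`protocol error: unknown event type "${eventType}"`);
  }
}

console.log(parseSseBlock('event: close\ndata: {"code":1000,"reason":"normal"}'));
```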

---

## §6. Backpressure

- The gateway maintains a **3-frame send buffer per stream**.
- If a new frame is produced while all 3 buffered frames are still awaiting a TCP-level ACK from the client (i.e., the client has fallen **> 3 frames behind**), the server MUST emit `event: close` with `code=4413` and tear down. The close frame carries no retry guidance; to resume, the client MUST reconnect with an adjusted `interval_ms`.
- The 3-frame ceiling is independent of `interval_ms`; it is purely a buffer-depth invariant.
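The buffer-depth invariant can be modelled as a bounded queue. This is a minimal sketch, not the gateway's implementation: `StreamSendBuffer` and its methods are hypothetical names, and how the gateway learns that the oldest frame was delivered is left abstract behind `ackOldest`.

```typescript
const MAX_BUFFERED_FRAMES = 3;

interface ClosePayload {
  code: number;
  reason: string;
}

const CLOSE_BACKPRESSURE: ClosePayload = { code: 4413, reason: "backpressure" };

// Hypothetical per-stream send buffer.
class StreamSendBuffer {
  private frames: string[] = [];
  private overflowed = false;

  // Returns null when the frame was queued, or the close payload to emit
  // when queuing it would exceed the 3-frame ceiling.
  enqueue(frame: string): ClosePayload | null {
    if (this.overflowed) {
      return CLOSE_BACKPRESSURE;
    }
    if (this.frames.length >= MAX_BUFFERED_FRAMES) {
      this.overflowed = true;
      this.frames = []; // the stream is being torn down; drop pending frames
      return CLOSE_BACKPRESSURE;
    }
    this.frames.push(frame);
    return null;
  }

  // Assumed hook: called when the transport reports the oldest frame delivered.
  ackOldest(): void {
    this.frames.shift();
  }

  depth(): number {
    return this.frames.length;
  }
}

const sendBuffer = new StreamSendBuffer();
sendBuffer.enqueue("f1");
sendBuffer.enqueue("f2");
sendBuffer.enqueue("f3");
console.log(sendBuffer.enqueue("f4")); // overflow: close payload with code 4413
```

The ceiling is checked against queue depth only, which matches the statement that it is independent of `interval_ms`.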

---

## §7. Close Codes (closed enum)

| Code | reason            | When                                                                 |
|-----:|-------------------|----------------------------------------------------------------------|
| 1000 | `normal`          | Client closed; or server graceful shutdown.                          |
| 4408 | `idle_timeout`    | No client read progress for 90 s (independent of heartbeat success). |
| 4413 | `backpressure`    | §6 buffer overflow.                                                  |
| 4429 | `rate_limited`    | Tenant exceeded concurrent-stream quota (out-of-band; advisory v1.0).|
| 4503 | `device_offline`  | Underlying device DO disconnected mid-stream (`E_NODE_OFFLINE`).     |

Codes are SSE-application-layer (carried in the `event: close` JSON `code` field). They mirror but are NOT identical to WebSocket close codes; the 4xxx range is reserved for application use, consistent with RFC-0002 v2.2's WSS code numbering convention.
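Because both `code` and `reason` are closed enums, a receiver can reject any pairing that does not appear in the table. The sketch below transcribes the table into a lookup; `CLOSE_REASONS`, `isCloseCode`, and `isValidClosePayload` are hypothetical names chosen for illustration.

```typescript
// Transcribed from the close-code table above.
const CLOSE_REASONS = {
  1000: "normal",
  4408: "idle_timeout",
  4413: "backpressure",
  4429: "rate_limited",
  4503: "device_offline",
} as const;

type CloseCode = keyof typeof CLOSE_REASONS;

function isCloseCode(code: number): code is CloseCode {
  return Object.prototype.hasOwnProperty.call(CLOSE_REASONS, code);
}

// True only when the code is in the closed enum and the reason is its pair.
function isValidClosePayload(payload: { code: number; reason: string }): boolean {
  const code = payload.code;
  if (!isCloseCode(code)) {
    return false;
  }
  return CLOSE_REASONS[code] === payload.reason;
}

console.log(isValidClosePayload({ code: 4413, reason: "backpressure" })); // true
console.log(isValidClosePayload({ code: 4413, reason: "normal" })); // false
```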

---

## §8. Auth & safety_class

- `safety_class` of the projected tool: `read_only` — already clamped by RFC-0001 v1.3 §3 `system.metrics` allOf branch and confirmed by RFC-0001 v1.1 locked decision #4.
- Required caller scope: **`tools:call:read_only`** — already in RFC-0003 v1.3 §4 closed enum. Per RFC-0006 v1.0 §2, this is the scope for `(read_only, subscribe)`. **No new scope coined.**
- Phase A (per RFC-0006 §4.1) tolerates legacy hotfix tokens with no `scope` claim — same tolerance applies to this stream until 2026-04-29.

---

## §9. Test Fixtures

| # | scenario                                                          | expected wire trace                                                                                                 |
|---|-------------------------------------------------------------------|---------------------------------------------------------------------------------------------------------------------|
| 1 | Normal stream, `interval_ms=5000`, client reads promptly for 30 s.| ≥6 `event: metric` frames; ≥1 `event: ping` (at ~25 s); on client close → `event: close {"code":1000,"reason":"normal"}`. |
| 2 | Client opens stream and stops reading (TCP receive window stalls). | After 90 s of no read progress → `event: close {"code":4408,"reason":"idle_timeout"}`; connection torn down.       |
| 3 | Device DO disconnects mid-stream (kill switch).                    | In-flight frame may be partial; next gateway flush emits `event: close {"code":4503,"reason":"device_offline"}`.   |

---

## §10. Out of Scope (Explicit)

- **Per-tenant concurrent-stream quota** (drives 4429). Advisory only at v1.0; enforcement deferred to Sprint 3.
- **Streaming-cost dimension** orthogonal to `safety_class` — RESERVED per RFC-0001 v1.1 locked decision #4. Not opened here.
- **WebSocket-transport variant** of subscribe. SSE only at v1.0.
- **Last-Event-ID resume.** Out of scope; clients reconnect from current cadence.
- **Compression** (`Content-Encoding: gzip` on the SSE stream). Out of scope; CF edge handles transport-layer compression transparently.

---

## §11. Cross-References

- RFC-0001 v1.3 §2 (`system.metrics` `MetricsSample` shape — fields here match), §3 (projection), §3.1 (name budget).
- RFC-0003 v1.3 §4 (`tools:call:read_only` scope), §V13.1 (agent token).
- RFC-0005 §2 (projection precedent — `sysecho.<node_id>.echo.invoke`).
- RFC-0006 v1.0 §2 (`(read_only, call|subscribe) → tools:call:read_only`), §4.1 (Phase A tolerance window).

---

## §12. Handoff

- **Next persona:** ☁️ cloudflare-native-edge.
- **Artifacts:** (1) a Worker SSE handler at `/mcp/tools/call` that content-negotiates on `Accept: text/event-stream`; (2) a device-side cadence loop in the 🦀 edge-kubelet `prober` crate that emits frames matching the §5.1 schema; (3) Vitest cases per §9.
