# RFC-0016 — Runtime-Token Refresh & Rotation

- **Status:** **Ratified v1.2 — 2026-05-07 (multi-kid taxonomy amendment for RFC-0018; additive, backward-compatible).** CTO sign-off recorded; normative for W3 implementation. Open questions Q1/Q2/Q3 resolved per CTO directive (see Changelog and §2.2/§6). v1.0 body archived at `tracking/work/agentic-architect/archive/rfc-0016-v1.0.md`. Original draft archived at `tracking/work/agentic-architect/archive/rfc-0016-v0.1-draft.md`.
- **Date:** 2026-05-07
- **Author:** 🧠 Agentic Architect
- **Sprint:** Sprint 3 Extended — W2-3 follow-up; unblocks W3 core implementation.
- **Audience:** Cloud Agents (LLMs); 🛡️ devex-protocol-sec; 🦀 edge-kubelet-engineer; ☁️ cloudflare-native-edge; CTO chair.
- **Depends on:** RFC-0013 (PQC Hybrid signature), RFC-0014 (Protocol v2 Wire envelope), RFC-0015 v1.1 (Gateway DID, TTL caps, JWS wire format — incl. security erratum). Extends — does **not** modify — these ratified RFCs.
- **Out of scope:** Tenant-init refresh (24h TTL is single-use bootstrap, no refresh by design). Enroll-token refresh (1h ceremony, no refresh by design). Any change to the hybrid signature algorithm or DID method.

---

## §1. Motivation

RFC-0015 §3 caps the **runtime-token** at **15 min** (900 s). RFC-0015 §3.2 explicitly leaves the renewal mechanism unspecified.

Without a refresh mechanism, every connected edge device would be forcibly disconnected at least once every 15 minutes, requiring either:

- A full WSS reconnect + re-auth handshake (operationally expensive: O(N_devices) per 15 min, network-thrash on flaky links), or
- A fresh enrollment ceremony (operationally catastrophic: requires per-tenant enroll-token cache or human action).

Neither is acceptable. We need:

- **Short token TTL** for security (bounded blast radius on edge-key compromise — preserves RFC-0015 §3 rationale), AND
- **Long-lived session continuity** for UX and operational sanity.

The standard pattern — short-TTL access token plus refresh mechanism — is well-trodden in cloud auth. This RFC adapts it to the WSS-resident edge stack, with one critical constraint: **refresh MUST happen in-band on the existing WSS connection**, not via a side-channel HTTP call. A side-channel refresh would defeat the purpose by forcing the edge device to maintain a second authenticated path and re-prove identity to a path that lacks the WSS's TLS-pinned session context.

---

## §2. Refresh Model

### §2.1 In-Band Over Existing WSS (NORMATIVE)

The runtime-token refresh MUST occur as message exchange over the **already-established WSS connection** authenticated by the current (about-to-expire) runtime-token. There is no separate HTTP refresh endpoint and no separate authenticated channel.

Justification: the existing WSS connection is already TLS-pinned, already proves continuous possession of the device-leaf private key (via the connection-bound runtime-token), and already has an established envelope protocol (RFC-0014 v2). Introducing a side-channel would (a) double the auth surface, (b) require the device to hold a second credential, and (c) lose the "continuous possession" property, which is the strongest signal of session liveness.

### §2.2 Server-Initiated Sliding Window (NORMATIVE)

Refresh is **server-initiated by default**. The gateway pushes a `runtime_token_refresh` envelope (§4.1) to the device approximately **2 minutes before the current token's `exp`** (sliding window). This is "approximately" because the gateway MAY batch refresh emissions on its own ticker; the only hard requirement is:

```
push_at  ≤  current_token.exp − 60 s
push_at  ≥  current_token.exp − 300 s
```

(i.e. push fires somewhere in the 1-to-5-minute pre-expiry window, target ≈ 120 s).
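**Illustrative sketch (non-normative).** The two bounds above translate directly into a check a gateway ticker could run before emitting a push. The names `isPushWithinWindow`, `schedulePush`, and `tickerPeriodS` are hypothetical, and snapping to the latest ticker slot at or before the 120 s target is only one possible batching policy, not something this RFC mandates.

```typescript
// Non-normative sketch of the §2.2 push-window bounds (all names hypothetical).
const MIN_LEAD_S = 60; // push_at <= exp - 60 s
const MAX_LEAD_S = 300; // push_at >= exp - 300 s
const TARGET_LEAD_S = 120; // advisory target, about 2 minutes before exp

// True when pushAt (unix seconds) lies inside the permitted pre-expiry window.
function isPushWithinWindow(pushAt: number, exp: number): boolean {
  return pushAt >= exp - MAX_LEAD_S && pushAt <= exp - MIN_LEAD_S;
}

// Assumed batching policy: latest ticker slot at or before the target,
// clamped so the result always satisfies the two hard bounds.
function schedulePush(exp: number, tickerPeriodS: number): number {
  const target = exp - TARGET_LEAD_S;
  const slot = Math.floor(target / tickerPeriodS) * tickerPeriodS;
  return Math.min(Math.max(slot, exp - MAX_LEAD_S), exp - MIN_LEAD_S);
}
```

The clamp matters only when the ticker period is coarse enough that the nearest slot falls outside the window; with a period of a few tens of seconds the slot already satisfies both bounds.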

### §2.2a Client-Initiated Refresh (NORMATIVE)

Devices MAY also initiate a refresh by sending a `runtime_token_request` envelope (§4.5) to the gateway over the existing WSS connection. This is REQUIRED to support low-power device classes (AI glasses, sensors with sleep cycles) that cannot reliably depend on server-pushed refresh during deep-sleep wake transitions.

- Payload: `{ "current_jti": "<uuid>", "reason": "wakeup" | "low_power" | "preemptive" }` (closed enum on `reason`).
- The gateway response is the same `runtime_token_refresh` envelope (§4.1) used in the server-initiated path; the device validation flow (§2.3) is identical.
- The frequency cap defined in §6.4 (1 successful refresh per 5 minutes per device) applies to client-initiated AND server-initiated refreshes **combined** — a single counter per `sub`. A client request that would breach the cap MUST be rejected; the gateway MUST NOT silently coalesce.
- Rationale: low-power devices wake from deep sleep with a stale token and no in-flight server push; without an explicit request path they would always fall through to the §5 reconnect grace window (or worse, full enroll-token re-auth). Client-initiated refresh keeps the steady-state path sub-second on wakeup.

### §2.3 Device Validation & Atomic Swap (NORMATIVE)

On receipt of a `runtime_token_refresh` envelope, the device MUST:

1. Verify the new token's JWS per RFC-0015 §4.4 (`alg`, signature length, hybrid verify against the gateway public key currently bound to this connection's `kid`).
2. Verify token claims:
   - `iss == "did:web:api.aethermesh.app"` (RFC-0015 §2.3)
   - `sub == <this device's DID>` (no identity swap; see §3)
   - `kid == <kid currently in use on this connection>` (no key rotation mid-session; see §3)
   - `exp > now` AND `exp − iat ≤ 900 s` (RFC-0015 §3 cap)
   - `prev_jti == <jti of the token currently authenticating this connection>`
3. Atomically swap the token used for any future server-bound assertion that requires it. Swap MUST be all-or-nothing; partial state (some envelopes signed under old token, some under new) is forbidden.
4. Emit a `runtime_token_ack` envelope (§4.2) containing the new `jti` and a `swapped_at` timestamp.
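**Illustrative sketch (non-normative).** Steps 2 and 3 above can be expressed as a pure claim check followed by a whole-state replacement. All type and function names below are hypothetical. Step 1 (hybrid JWS verification per RFC-0015 §4.4) is assumed to have already succeeded and is not shown. Mapping an `iss` or TTL-cap failure to the `other` nack reason is an assumption of this sketch, since §4.3 defines no dedicated reason for either.

```typescript
// Non-normative sketch of the §2.3 claim checks and swap (names hypothetical).
type NackReason =
  | "verify_fail" | "exp_in_past" | "kid_mismatch"
  | "sub_mismatch" | "prev_jti_mismatch" | "other";

interface RuntimeClaims {
  iss: string; sub: string; kid: string;
  iat: number; exp: number; jti: string; prev_jti?: string;
}

interface ConnectionState { deviceDid: string; bindingKid: string; currentJti: string; }

const ISSUER = "did:web:api.aethermesh.app";
const MAX_TTL_S = 900; // RFC-0015 §3 cap

// Returns null when every step-2 check passes, otherwise a §4.3 nack reason.
function checkRefreshClaims(c: RuntimeClaims, conn: ConnectionState, now: number): NackReason | null {
  if (c.iss !== ISSUER) return "other"; // assumed mapping
  if (c.sub !== conn.deviceDid) return "sub_mismatch";
  if (c.kid !== conn.bindingKid) return "kid_mismatch";
  if (c.exp <= now) return "exp_in_past";
  if (c.exp - c.iat > MAX_TTL_S) return "other"; // assumed mapping
  if (c.prev_jti !== conn.currentJti) return "prev_jti_mismatch";
  return null;
}

// Step 3: build the complete replacement state first, then let the caller
// replace its single reference, so no reader observes a half-updated state.
function swapToken(conn: ConnectionState, c: RuntimeClaims): ConnectionState {
  return { deviceDid: conn.deviceDid, bindingKid: conn.bindingKid, currentJti: c.jti };
}
```

Returning a new state object rather than mutating fields in place is one way to keep the swap all-or-nothing; a real device would also need to swap the stored token bytes under the same discipline.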

### §2.4 Failure Handling (NORMATIVE)

- **Device fails to ack within 30 s** of the server's push: the gateway MUST treat the connection as unhealthy and drop it. The device, on detecting close, falls back to §5 reconnect.
- **Device emits `runtime_token_nack`** (§4.3): the gateway MUST log, MUST wait 5 s, MUST mint a fresh refresh token, and MUST push exactly one retry. On a second consecutive nack from the same device on the same connection, the gateway MUST drop the connection.
- **Gateway fails to push refresh** (e.g. transient PoP-side error, ticker miss, network blip): the device's runtime-token will expire normally. The device MUST then attempt a WSS reconnect using its most-recent token, which the gateway accepts for up to 120 s past `exp` under the §5.1 grace window. On reconnect, the gateway MUST issue a fresh runtime-token immediately as part of the connection-establishment exchange.
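**Illustrative sketch (non-normative).** The first two failure rules above amount to a small per-connection tracker on the gateway. `RefreshHealth` and its methods are hypothetical names. Treating an intervening ack as ending a nack streak is this sketch's reading of "consecutive".

```typescript
// Non-normative sketch of the gateway-side §2.4 rules (names hypothetical).
const ACK_TIMEOUT_MS = 30000; // device must ack within 30 s of the push
const NACK_RETRY_DELAY_MS = 5000; // wait before the single retry

interface NackAction { kind: "retry" | "drop"; delayMs: number; }

class RefreshHealth {
  private consecutiveNacks = 0;

  // An ack ends any nack streak (assumed reading of "consecutive").
  onAck(): void {
    this.consecutiveNacks = 0;
  }

  // First nack: one retry after 5 s, carrying a freshly minted token.
  // Second consecutive nack: drop the connection.
  onNack(): NackAction {
    this.consecutiveNacks += 1;
    if (this.consecutiveNacks >= 2) return { kind: "drop", delayMs: 0 };
    return { kind: "retry", delayMs: NACK_RETRY_DELAY_MS };
  }

  // True once the 30 s ack deadline has passed without an ack.
  isAckOverdue(pushedAtMs: number, nowMs: number): boolean {
    return nowMs - pushedAtMs > ACK_TIMEOUT_MS;
  }
}
```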

---

## §3. Refresh Token Format

The refresh token IS itself a runtime-token. There is no separate "refresh-token tier"; we re-use the runtime-token format (RFC-0015 §3, §4) with additional constraints:

| Field        | Constraint                                                                              |
|--------------|------------------------------------------------------------------------------------------|
| `alg`        | `Ed25519+ML-DSA-65` (RFC-0015 §4.1). Device MUST reject any other value (see §6).        |
| `iss`        | `did:web:api.aethermesh.app` (RFC-0015 §2.3).                                            |
| `kid`        | MUST equal the `kid` already in use on the current WSS connection. **No key rotation mid-session.** Key rotation requires a fresh connection. |
| `sub`        | MUST equal the device's DID (current connection's `sub`). **No identity swap.**          |
| `iat`        | Mint time, unix seconds.                                                                 |
| `exp`        | `iat + ≤ 900` (RFC-0015 §3 cap re-asserted).                                             |
| `jti`        | Fresh UUID-v4 per refresh.                                                               |
| `prev_jti`   | **NEW (this RFC).** UUID of the runtime-token being replaced. Audit-trail anchor; lets verifiers reconstruct the refresh chain for a session. |

`prev_jti` is the only structural addition vs. RFC-0015 §4. It is OPTIONAL on the very first runtime-token of a session (no predecessor); REQUIRED on every refresh.

---

## §4. WSS Envelope Spec

This RFC defines four new envelope types layered on the RFC-0014 v2 wire: `runtime_token_refresh` (server→device), `runtime_token_ack` (device→server), `runtime_token_nack` (device→server), and `runtime_token_request` (device→server, client-initiated refresh per §2.2a). Envelope framing, header set, and signature placement are unchanged from RFC-0014; only the `type` strings and `payload` shapes are new.

### §4.1 `runtime_token_refresh` (server → device)

```json
{
  "type": "runtime_token_refresh",
  "payload": {
    "token": "<jws compact serialization, RFC-0015 §4>",
    "expires_at": 1735689600,
    "prev_jti": "9f1c4b2a-..."
  }
}
```

- `token`: the new runtime-token, full JWS per RFC-0015 §4.
- `expires_at`: unix seconds; MUST equal `token`'s `exp` claim. Convenience field for devices that defer JWS parse.
- `prev_jti`: MUST equal the `jti` of the runtime-token currently authenticating this connection.

### §4.2 `runtime_token_ack` (device → server)

```json
{
  "type": "runtime_token_ack",
  "payload": {
    "jti": "<new_jti>",
    "swapped_at": 1735688700
  }
}
```

- `jti`: MUST equal the `jti` claim of the just-accepted token.
- `swapped_at`: unix seconds at which the device completed atomic swap.

### §4.3 `runtime_token_nack` (device → server)

```json
{
  "type": "runtime_token_nack",
  "payload": {
    "jti": "<new_jti>",
    "reason": "verify_fail",
    "error": "E_RUNTIME_REFRESH_VERIFY_FAIL"
  }
}
```

- `reason`: closed enum: `verify_fail` | `exp_in_past` | `kid_mismatch` | `sub_mismatch` | `prev_jti_mismatch` | `other`.
- `error`: structured error code from the platform error taxonomy, prefixed `E_RUNTIME_REFRESH_*`.

### §4.4 Envelope Constraints (NORMATIVE)

- All four envelopes (`runtime_token_refresh`, `runtime_token_ack`, `runtime_token_nack`, `runtime_token_request`) MUST be carried inside the standard RFC-0014 v2 envelope; the outer envelope is signed/authenticated by whichever runtime-token is current at the time of emission. (For `runtime_token_ack`/`nack`: the OLD token may still be the connection-binding token at the moment of emission, since the swap is logically atomic but the wire is sequential. Implementations MAY emit the ack/nack under either old or new token; gateway MUST accept both.)
- All four envelope types MUST be rejected on any subprotocol other than v2.
- `additionalProperties: false` on every payload schema. Unknown fields are a hard reject.
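**Illustrative sketch (non-normative).** The subprotocol and unknown-field rules above can be enforced with a key-set comparison ahead of any field-level validation. `PAYLOAD_KEYS` and `isAcceptableEnvelope` are hypothetical names. The sketch assumes that every payload field shown in §4.1 through §4.5 is required and that the negotiated subprotocol is exposed to the handler as the string `"v2"`.

```typescript
// Non-normative sketch of the §4.4 shape checks (names hypothetical).
const PAYLOAD_KEYS: { [type: string]: string[] } = {
  runtime_token_refresh: ["token", "expires_at", "prev_jti"],
  runtime_token_ack: ["jti", "swapped_at"],
  runtime_token_nack: ["jti", "reason", "error"],
  runtime_token_request: ["current_jti", "reason"],
};

function isAcceptableEnvelope(
  subprotocol: string,
  type: string,
  payload: { [key: string]: unknown },
): boolean {
  // All four envelope types are rejected on any subprotocol other than v2.
  if (subprotocol !== "v2") return false;
  // Own-property lookup so names such as "constructor" are not matched.
  const allowed = Object.prototype.hasOwnProperty.call(PAYLOAD_KEYS, type)
    ? PAYLOAD_KEYS[type]
    : undefined;
  if (allowed === undefined) return false;
  // additionalProperties: false, so the key set must match exactly.
  const present = Object.keys(payload);
  return present.length === allowed.length && present.every((k) => allowed.indexOf(k) !== -1);
}
```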

### §4.5 `runtime_token_request` (device → server, client-initiated)

```json
{
  "type": "runtime_token_request",
  "payload": {
    "current_jti": "9f1c4b2a-...",
    "reason": "wakeup"
  }
}
```

- `current_jti`: MUST equal the `jti` of the runtime-token currently authenticating this connection.
- `reason`: closed enum: `wakeup` | `low_power` | `preemptive`.
- Gateway response: a `runtime_token_refresh` envelope (§4.1) on success, or connection drop on cap breach / replay (§6.4).
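**Illustrative sketch (non-normative).** A gateway handler might narrow an untyped request payload to the closed shape above before consulting the §6.4 cap. `parseTokenRequest` and `connectionJti` are hypothetical names; cap and replay handling are deliberately left out.

```typescript
// Non-normative sketch of gateway-side parsing for runtime_token_request
// (names hypothetical). Cap and replay handling per §6.4 are not shown.
type RequestReason = "wakeup" | "low_power" | "preemptive";
const REQUEST_REASONS: RequestReason[] = ["wakeup", "low_power", "preemptive"];

interface RuntimeTokenRequest { current_jti: string; reason: RequestReason; }

// Returns the typed payload, or null when the request must be refused.
function parseTokenRequest(
  payload: { [key: string]: unknown },
  connectionJti: string,
): RuntimeTokenRequest | null {
  // Exactly two keys, both checked below, so no unknown field can slip through.
  if (Object.keys(payload).length !== 2) return null;
  const jti = payload["current_jti"];
  const reason = payload["reason"];
  if (typeof jti !== "string" || typeof reason !== "string") return null;
  // current_jti must name the token currently authenticating the connection.
  if (jti !== connectionJti) return null;
  // Closed enum: accept only the three listed reasons.
  for (const r of REQUEST_REASONS) {
    if (r === reason) return { current_jti: jti, reason: r };
  }
  return null;
}
```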

---

## §5. Edge Reconnect Fallback

This section defines the recovery path when in-band refresh fails (§2.4 third bullet) and the device's runtime-token expires before a successful refresh.

### §5.1 Grace Window (NORMATIVE)

If the device's runtime-token has expired by **no more than 120 s** AND the device's most-recent token chains via `prev_jti` to a `jti` the gateway has in its audit log for the same `(sub, kid)` tuple, the gateway MUST accept the expired token for the limited purpose of re-establishing a WSS connection and immediately minting a fresh runtime-token.

```
grace_accept := (now − token.exp) ≤ 120 s
             AND prev_jti chain intact in audit log
             AND sub matches device DID
             AND kid matches a currently-published gateway key
```

The grace-accepted connection MUST receive a fresh runtime-token within the connection-establishment exchange before any other envelope is processed.
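**Illustrative sketch (non-normative).** The `grace_accept` predicate above maps onto a single boolean function. `GraceCheck` and its field names are hypothetical; in particular, `chainIntact` stands for the result of the `prev_jti` audit-log lookup, which is not shown.

```typescript
// Non-normative sketch of the §5.1 grace_accept predicate (names hypothetical).
const GRACE_S = 120;

interface GraceCheck {
  now: number; // gateway clock, unix seconds
  tokenExp: number; // exp claim of the presented token
  tokenSub: string;
  tokenKid: string;
  deviceDid: string; // DID the reconnecting device is proving
  chainIntact: boolean; // result of the prev_jti audit-log lookup
  publishedKids: string[]; // kids currently published by the gateway
}

function graceAccept(g: GraceCheck): boolean {
  return (
    g.now - g.tokenExp <= GRACE_S &&
    g.chainIntact &&
    g.tokenSub === g.deviceDid &&
    g.publishedKids.indexOf(g.tokenKid) !== -1
  );
}
```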

### §5.2 Beyond Grace — Enroll-Token Re-Auth

If the grace window is exceeded, the device MUST fall back to enroll-token re-authentication (RFC-0007). This is a full re-handshake but **does not require human re-enrollment**: the enroll-token is per-tenant cached on the device under existing RFC-0007 mechanics. From the operator's perspective, the device simply takes longer to reconnect; no ceremony is needed.

### §5.3 What the Grace Window Is NOT

The grace window is **not** a general TTL extension. It applies only at the moment of WSS reconnection, only for devices proving prior session continuity via `prev_jti`. The 15-min TTL cap in RFC-0015 §3 is unchanged for in-session use.

### §5.4 Post-Reconnect Proactive Refresh

After reconnection (whether via grace window per §5.1 or full enroll-token re-auth per §5.2), a device MAY also use the client-initiated `runtime_token_request` envelope (§2.2a, §4.5) to refresh proactively before the freshly-issued token's `exp` approaches. This is purely a device-side optimization for sleep-cycle planning; the §6.4 frequency cap still applies.

---

## §6. Security Considerations

> **Forward-ref note (v1.1):** Hardened in v1.1 erratum — 3 security amendments (A-1 KV→D1 fail-closed read fallback, A-2 gateway-side `sub`/`kid` identity binding pre-write, A-3 inbound device envelope `alg` check) plus C-2 lifecycle schema resolution (`swap_status` column + idempotent re-issue rule §6.7). See Changelog. Source: `tracking/work/devex-protocol-sec/rfc-0016-security-review.md`.

### §6.1 Replay

Each refresh token has a unique `jti` (UUID-v4). Replay defense uses a two-tier store (per CTO Q3 ratification):

- **D1 (source of truth, audit):** the gateway MUST persist every refresh event in the `runtime_token_audit` table with columns `jti PK, device_id, tenant_id, issued_at, expires_at, prev_jti, swap_status, swap_status_updated_at, created_at`. The table MUST be indexed on `(device_id, created_at)` and `(tenant_id, created_at)` to support cross-tenant forensic queries (e.g. "all anomalous refresh events in the last 24 h").
  - `swap_status TEXT NOT NULL DEFAULT 'pending'` — closed enum: `'pending' | 'acked' | 'nacked' | 'timed_out'`. Lifecycle:
    - **Insert (mint):** row inserted with `swap_status = 'pending'`, `swap_status_updated_at = NULL`.
    - **On `runtime_token_ack` received:** `UPDATE swap_status = 'acked', swap_status_updated_at = now()` WHERE `jti` matches.
    - **On `runtime_token_nack` received:** `UPDATE swap_status = 'nacked', swap_status_updated_at = now()` WHERE `jti` matches.
    - **On 30 s ack-timeout (per §2.4 / §4):** `UPDATE swap_status = 'timed_out', swap_status_updated_at = now()`. Gateway also drops the connection per existing §2.4 spec.
  - `swap_status_updated_at INTEGER` — unix seconds; null until first transition out of `'pending'`.
  - `swap_status` is **server-internal**; it is not transmitted in any envelope. §4 envelope shapes are unaffected.
- **KV (hot revocation cache):** the gateway MUST mirror a per-device "last 10 `jti`" set into KV for sub-millisecond replay rejection at the WSS edge. Key format: `revoked:device:<sub>:jti:<uuid>`. Per-entry TTL: 1 hour.
- **Write ordering on refresh:** the gateway MUST write to D1 first; if the D1 write fails, the refresh MUST be rejected (fail closed). Only after a successful D1 write does the gateway perform a best-effort KV write; KV write failures are logged but do not block the refresh.
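**Illustrative sketch (non-normative).** The write ordering above can be summarized as "D1 decides, KV follows". In the sketch below, `AuditStore` stands in for D1 and `HotCache` for KV; both interfaces are hypothetical, do not mirror any real binding API, and are synchronous only to keep the example short. The sketch also assumes `device_id` carries the same value as the token's `sub`.

```typescript
// Non-normative sketch of the refresh write ordering (names hypothetical).
interface StoreResult { ok: boolean; error?: string; }
interface AuditRow { jti: string; device_id: string; prev_jti: string | null; swap_status: "pending"; }
interface AuditStore { insert(row: AuditRow): StoreResult; } // stand-in for D1
interface HotCache { put(key: string, ttlSeconds: number): StoreResult; } // stand-in for KV

const KV_ENTRY_TTL_S = 3600; // per-entry TTL: 1 hour

function recordRefresh(
  d1: AuditStore,
  kv: HotCache,
  row: AuditRow,
  log: (msg: string) => void,
): "accepted" | "rejected" {
  // D1 first: a failed audit write fails the refresh closed.
  if (!d1.insert(row).ok) return "rejected";
  // KV second, best effort: log the failure but let the refresh proceed.
  const key = "revoked:device:" + row.device_id + ":jti:" + row.jti;
  const cached = kv.put(key, KV_ENTRY_TTL_S);
  if (!cached.ok) log("KV mirror failed for jti " + row.jti);
  return "accepted";
}
```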

Any inbound `runtime_token_ack` whose `jti` matches an already-acked entry (in either store) MUST be rejected as a replay; the gateway MUST log and SHOULD drop the connection.

**KV read-path fail behavior (v1.1 amendment A-1):** If a KV read returns an error or times out, the gateway MUST fall back to D1 before proceeding. If D1 is also unavailable, the gateway MUST reject the refresh with `E_RUNTIME_REFRESH_STORE_UNAVAILABLE` (fail closed). The gateway MUST NOT allow a refresh without a successful replay check against at least one store.
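**Illustrative sketch (non-normative).** The A-1 read path is a two-step fallback that never answers "fresh" unless some store actually responded. `ReplayLookup` and `checkReplay` are hypothetical names; a lookup with `ok: false` models a store error or timeout, and the lookups are synchronous only for brevity.

```typescript
// Non-normative sketch of the A-1 read path (names hypothetical).
interface ReplayLookup { ok: boolean; seen: boolean; }
type ReplayVerdict = "fresh" | "replay" | "E_RUNTIME_REFRESH_STORE_UNAVAILABLE";

function checkReplay(
  kvLookup: () => ReplayLookup,
  d1Lookup: () => ReplayLookup,
): ReplayVerdict {
  const fromKv = kvLookup();
  if (fromKv.ok) return fromKv.seen ? "replay" : "fresh";
  // KV errored or timed out: fall back to D1 before proceeding.
  const fromD1 = d1Lookup();
  if (fromD1.ok) return fromD1.seen ? "replay" : "fresh";
  // Neither store answered: fail closed.
  return "E_RUNTIME_REFRESH_STORE_UNAVAILABLE";
}
```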

### §6.2 Token Theft

The 15-min cap on the underlying runtime-token (RFC-0015 §3) bounds the blast radius of a stolen token to ≤ 15 min, unchanged by this RFC. To receive the next refresh, a thief would also need to possess the live WSS connection (TLS-pinned, server-side connection state); without it, the stolen token expires and the thief cannot bootstrap a refresh chain on a new connection (no `prev_jti` chain on the gateway side).

### §6.3 Downgrade Attack

The device MUST reject any `runtime_token_refresh` whose `alg` header is not exactly `Ed25519+ML-DSA-65`. In particular: `alg=none`, `alg=Ed25519`, `alg=ML-DSA-65` MUST all be rejected (per RFC-0015 §6.3). This RFC introduces no new algorithm and creates no new downgrade surface beyond what RFC-0015 §6.3 already governs.

**Gateway inbound enforcement (v1.1 amendment A-3):** The gateway MUST apply the RFC-0015 §6.3 `alg` check to every inbound device envelope (`runtime_token_request`, `runtime_token_ack`, `runtime_token_nack`). An outer envelope authenticated by a token whose `alg` is not `Ed25519+ML-DSA-65` MUST be rejected and the connection dropped.

### §6.4 Refresh Frequency Cap

The gateway MUST enforce a maximum of **1 successful refresh per 5 minutes per device** (per `sub` claim, keyed in KV). The frequency cap MUST be enforced at per-device granularity to prevent a single misbehaving device from exhausting its tenant's refresh budget; per-tenant aggregate caps are out of scope for this RFC. Blast-radius rationale: a compromised or buggy device is contained at the device boundary, while well-behaved sibling devices on the same tenant remain unaffected. Cross-tenant aggregate detection, if needed, belongs in telemetry/anomaly tooling, not the refresh path.

The cap counter is **shared** across server-initiated refreshes (§2.2) and client-initiated `runtime_token_request` events (§2.2a) — a single counter per `sub`. Refresh requests above this rate indicate either a misbehaving device or a session-stealing attacker triggering rapid refreshes. On exceeding the cap, the gateway MUST treat the connection as compromised: drop the connection, do not honor further refreshes for that device for at least 60 s, and emit a security telemetry event.
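**Illustrative sketch (non-normative).** The shared per-`sub` counter and the 60 s lockout can be modeled as follows. `RefreshCap` and `admit` are hypothetical names, and an in-memory table stands in for the KV-keyed counter. The sketch assumes a refresh is counted at the moment it is admitted and that §6.7 idempotent retries bypass this check entirely.

```typescript
// Non-normative sketch of the per-device cap (names hypothetical).
const CAP_WINDOW_S = 300; // 1 successful refresh per 5 minutes
const LOCKOUT_S = 60; // minimum refusal period after a breach

class RefreshCap {
  // Prototype-free tables so arbitrary sub strings are safe as keys.
  private lastRefresh: { [sub: string]: number } = Object.create(null);
  private lockedUntil: { [sub: string]: number } = Object.create(null);

  // "allow": mint and push. "drop": close the connection and emit telemetry.
  // The same call serves server-initiated and client-initiated refreshes.
  admit(sub: string, now: number): "allow" | "drop" {
    const locked = this.lockedUntil[sub];
    if (locked !== undefined && now < locked) return "drop";
    const last = this.lastRefresh[sub];
    if (last !== undefined && now - last < CAP_WINDOW_S) {
      this.lockedUntil[sub] = now + LOCKOUT_S;
      return "drop";
    }
    this.lastRefresh[sub] = now;
    return "allow";
  }
}
```

Keying both tables by `sub` keeps a misbehaving device from affecting its siblings on the same tenant, which is the containment property this section asks for.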

### §6.5 Connection-Bound Identity Invariants

Per §3, refresh MUST NOT change `kid` or `sub`. This prevents: (a) mid-session key rotation (which would obscure compromise forensics), (b) identity-swap attacks (a compromised gateway path attempting to bind a different device to an existing connection).

#### §6.5.1 Identity Binding (v1.1 amendment A-2; v1.2 multi-kid amendment)

The gateway MUST assert `minted_token.sub == connection.authenticated_sub` AND `minted_token.kid == connection.binding_kid` **before writing to D1 or transmitting the refresh envelope**. On mismatch, the gateway MUST abort the refresh, drop the connection, and emit CRITICAL telemetry `E_RUNTIME_REFRESH_IDENTITY_MISMATCH`.

The `kid` claim MUST match the following union regex (v1.2, accommodating RFC-0018 §3 cross-region kid taxonomy):

```
^(gw-sig-1|gw-sig\.[a-z0-9]+\.edge-signer\.\d+)$
```

- Branch 1 (`gw-sig-1`): legacy issuer-global single-kid form. Aliased to `gw-sig.global.edge-signer.1` for the 90-day RFC-0018 transition window (see §6.5.1.1).
- Branch 2 (`gw-sig.<region>.edge-signer.<n>`): new per-region taxonomy per RFC-0018 §3. `<region>` is a lowercased PoP/region code (e.g. `iad`, `nrt`, `fra`); `<n>` is a monotonic per-region rotation counter aligned with the SignerDO instance per RFC-0020 v2 §4.0.

Verifiers (gateway, edge device, audit replay tooling) MUST reject any token whose `kid` does not match this union with `E_RUNTIME_REFRESH_VERIFY_FAIL` (sub-reason `kid_mismatch`).
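**Illustrative sketch (non-normative).** The union regex and the pre-write binding assertion above combine into one gateway-side gate. `isWellFormedKid`, `checkIdentityBinding`, and the two interfaces are hypothetical names; checking the `kid` format before the binding is an ordering choice of this sketch.

```typescript
// Non-normative sketch of the §6.5.1 gateway-side checks (names hypothetical).
const KID_PATTERN = /^(gw-sig-1|gw-sig\.[a-z0-9]+\.edge-signer\.\d+)$/;

function isWellFormedKid(kid: string): boolean {
  return KID_PATTERN.test(kid);
}

interface Binding { authenticatedSub: string; bindingKid: string; }
interface MintedToken { sub: string; kid: string; }

// Returns null when the minted token may be written to D1 and transmitted,
// otherwise the error code to raise before either step happens.
function checkIdentityBinding(t: MintedToken, conn: Binding): string | null {
  if (!isWellFormedKid(t.kid)) return "E_RUNTIME_REFRESH_VERIFY_FAIL";
  if (t.sub !== conn.authenticatedSub || t.kid !== conn.bindingKid) {
    return "E_RUNTIME_REFRESH_IDENTITY_MISMATCH";
  }
  return null;
}
```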

#### §6.5.1.1 Verifier behavior under multi-kid (v1.2)

- Verifiers MUST resolve `kid` → public key via JWKS lookup against the issuer DID document (per RFC-0015 §4.4). Verifiers MUST NOT hardcode `gw-sig-1` (or any other specific `kid`) as the assumed signing key.
- **Transition window** (90 days from RFC-0018 ratification): a verifier seeing `kid == "gw-sig-1"` MUST treat it as an alias of `gw-sig.global.edge-signer.1`. The JWKS published at `did:web:api.aethermesh.app` MUST contain both entries with **identical public-key material** for the duration of the window (publisher-side guarantee; see RFC-0018 §3).
- **Post-transition** (T+90d from RFC-0018 ratification): `gw-sig-1` MUST NOT appear in the published JWKS. Any token presenting `kid == "gw-sig-1"` after this cutover MUST be rejected with new error code `E_KID_RETIRED` (registered in §6.8) and the connection dropped. The `swap_status` audit row reason MUST record `kid_retired`.
- The cutover date is computed deterministically from the RFC-0018 ratification timestamp; gateway MUST NOT continue serving the alias past T+90d even if the JWKS publisher lags.
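**Illustrative sketch (non-normative).** Alias resolution and the deterministic cutover can be computed from the verifier's own clock, independent of what the JWKS publisher currently serves. `resolveKid` and `rfc0018RatifiedAt` are hypothetical names; the ratification timestamp is a parameter because this RFC does not state its value. Treating the exact instant T+90d as still inside the transition window is an assumption.

```typescript
// Non-normative sketch of legacy-alias handling (names hypothetical).
const LEGACY_KID = "gw-sig-1";
const ALIAS_TARGET = "gw-sig.global.edge-signer.1";
const TRANSITION_S = 90 * 24 * 60 * 60; // 90 days in seconds

interface KidResolution { resolvedKid: string | null; error: string | null; }

function resolveKid(kid: string, now: number, rfc0018RatifiedAt: number): KidResolution {
  // Regional kids resolve to themselves; only the legacy form is aliased.
  if (kid !== LEGACY_KID) return { resolvedKid: kid, error: null };
  // Cutover is derived from the ratification timestamp, not from JWKS contents.
  const cutover = rfc0018RatifiedAt + TRANSITION_S;
  if (now <= cutover) return { resolvedKid: ALIAS_TARGET, error: null };
  return { resolvedKid: null, error: "E_KID_RETIRED" };
}
```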

#### §6.5.1.2 Per-region kid lifecycle (v1.2)

- Each region's signing kid (`gw-sig.<region>.edge-signer.<n>`) rotates **independently** per RFC-0020 v2 §6 (dual-kid overlap window: old + new published simultaneously for ≥ 24 h before old retirement).
- Verifiers MUST tolerate JWKS responses containing N regional kids simultaneously, where steady-state N is estimated at **5–20** entries (1 per active region × ≤ 2 during overlap windows, plus the legacy alias during the §6.5.1.1 transition).
- Verifiers MUST NOT cache JWKS beyond the `Cache-Control: max-age` advertised by the issuer (RFC-0015 §4.4); regional rotations within an overlap window will not be observed otherwise.
- A connection's `binding_kid` (per §6.5.1) MUST remain stable for the lifetime of that WSS connection; cross-regional re-binding mid-session is forbidden under §3 / §6.5. Region migration requires a fresh connection.

**Cross-references:** RFC-0018 §3 (kid regex normative source), RFC-0020 v2 §4.0 (SignerDO instance per regional kid), RFC-0020 v2 §6 (dual-kid overlap window).

### §6.6 Audit Trail

The `prev_jti` chain (§3) MUST be persisted on the gateway side with sufficient retention to reconstruct any session's refresh chain for forensics. Per CTO ratification (was §8 Q3), audit storage uses **D1 as source of truth** (`runtime_token_audit` table per §6.1) with **KV as a 1-hour hot revocation cache**. D1 write failures MUST cause the refresh to fail closed. KV write failures are best-effort logged but do not block the refresh.

The `swap_status` column (§6.1) makes the lifecycle of every audit row explicit, so post-incident chain walks can distinguish a successfully-completed refresh (`acked`) from a wedged or failed one (`pending`, `timed_out`, `nacked`). Audit history MUST be preserved across idempotent re-issues per §6.7 — i.e. the original row's `swap_status` is not overwritten by the successor row.

### §6.7 Idempotent Re-issue on Retried `prev_jti` (resolves C-2)

If the gateway receives a `runtime_token_request` with `current_jti = X`, AND it observes (via D1 lookup) that an existing audit row for `jti = X` has `swap_status IN ('pending', 'timed_out')`, the gateway MUST treat this as a retry of an unsuccessful prior refresh. Behavior:

- If the most recent successor row (where `prev_jti = X`) has `swap_status = 'pending'` AND was minted within the last 60 s: re-transmit the same `runtime_token_refresh` envelope (same JWS bytes — DO NOT mint a new token). The KV cache holds the JWS for the 60 s window.
- If the successor row is older than 60 s OR has `swap_status IN ('nacked', 'timed_out')`: mint a new refresh token, insert a new audit row with fresh `jti`, set `prev_jti = X`. The original audit row's `swap_status` remains as-was (audit history preserved).
- This idempotency window is per-device (`sub` claim) and capped at 1 retry; second retry within 60 s → `E_RUNTIME_REFRESH_RETRY_LIMIT` and connection drop.
- The idempotency window does NOT consume the §6.4 frequency cap counter (a retry is not a new request).

### §6.8 Error Codes (v1.1)

The following error codes are introduced by this RFC and MUST be added to the platform `TENANT_ERROR_CODES` taxonomy (`errors.ts`) during W3-6 implementation. These are contract surface; the implementation extension is downstream.

| Code | Source | Trigger |
|------|--------|---------|
| `E_RUNTIME_REFRESH_STORE_UNAVAILABLE` | §6.1 (A-1) | Both KV and D1 replay-check stores unavailable; refresh rejected fail-closed. |
| `E_RUNTIME_REFRESH_IDENTITY_MISMATCH` | §6.5.1 (A-2) | Minted token `sub`/`kid` does not match WSS connection binding; abort + drop + CRITICAL telemetry. |
| `E_RUNTIME_REFRESH_RETRY_LIMIT` | §6.7 (C-2) | Second retry of the same `prev_jti` within the 60 s idempotency window; connection drop. |
| `E_KID_RETIRED` | §6.5.1.1 (v1.2) | Token presents `kid == "gw-sig-1"` after T+90d RFC-0018 transition cutover; reject + drop connection. |

Pre-existing error codes referenced elsewhere in this RFC (`E_RUNTIME_REFRESH_VERIFY_FAIL`, etc., per §4.3) are unchanged.

---

## §7. Migration

### §7.1 Pre-RFC-0016 State

Before this RFC is ratified, runtime-tokens have **no refresh mechanism**. Current development environments use a long TTL (TBD; likely 1 h to 24 h) as an operational placeholder. This is incompatible with RFC-0015 §3's 15-min cap.

### §7.2 Cutover

Once this RFC is ratified and the gateway W3 implementation lands:

- The gateway MUST start emitting `runtime_token_refresh` envelopes for any connection that successfully negotiated **subprotocol v2** (RFC-0014).
- v1-subprotocol connections (legacy) are unaffected by this RFC and continue under their pre-existing (long-TTL) regime until v1 deprecation.
- Edge devices on v2 MUST implement the §2.3 validation and §4.2 ack flow before the gateway flips to the 15-min cap.

### §7.3 Sequencing Constraint

This RFC MUST be ratified, AND the edge device implementation of §2.3 / §4 MUST be in production, **before** the gateway begins enforcing the RFC-0015 §3 15-min cap on v2 runtime-tokens. Otherwise every v2 device disconnects every 15 minutes — the disaster scenario this RFC exists to prevent.

---

## §8. Open Questions

- **Q-residual (advisory) — Client-initiated `reason` taxonomy.** §2.2a defines a closed enum `wakeup | low_power | preemptive`. As new device classes onboard (industrial actuators, in-vehicle compute, surgical robotics), this taxonomy may need expansion. Non-blocking; revisit when a concrete new class requires a `reason` value not expressible under the current set.

**Open question count: 1.**

*Resolved during v1.0 ratification:*

- **(was Q1) Refresh frequency cap scope** — **RESOLVED**: per-device only. See §6.4 (blast-radius rationale: contain misbehaving device at device boundary, do not exhaust tenant-wide budget).
- **(was Q2) Client-initiated refresh** — **RESOLVED**: MUST SUPPORT. Promoted to normative §2.2a with new `runtime_token_request` envelope (§4.5). Required for low-power device classes (AI glasses, sleep-cycle sensors).
- **(was Q3) Audit storage backend** — **RESOLVED**: D1 as source of truth (`runtime_token_audit` table) + KV as 1-hour hot revocation cache. D1 write failures fail the refresh closed; KV writes are best-effort. See §6.1 and §6.6.

---

## §9. References

- **RFC-0007** — Public enrollment (enroll-token re-auth fallback path, §5.2 here).
- **RFC-0013** — Post-Quantum Cryptography (hybrid signature construction; algorithm identifier).
- **RFC-0014** — Protocol v2 Wire (envelope framing this RFC extends).
- **RFC-0015 v1.1** — Gateway Issuer DID, Token TTL Caps, and Hybrid JWS Signature Wire Format (incl. security erratum). **Parent context.** Specifies the 15-min runtime-token cap (§3) that this RFC's refresh mechanism is required to make operationally viable; specifies the JWS verification path (§4.4) that §2.3 here invokes; specifies the `iss` / `kid` claim semantics that §3 here constrains.
- **W3C DID Core** — referenced via RFC-0015.
- **RFC 7515 (JWS)** — referenced via RFC-0015 §4.

---

## Changelog

- **2026-05-07 v1.2: §6.5.1 amended for RFC-0018 multi-kid taxonomy. Backward-compatible via `gw-sig-1` alias for 90 days post-RFC-0018 ratification.** Added §6.5.1.1 (verifier behavior under multi-kid: JWKS-resolved, alias during transition, `E_KID_RETIRED` post-cutover) and §6.5.1.2 (per-region kid lifecycle, N=5–20 steady-state, dual-kid overlap per RFC-0020 v2 §6). Replaced implicit single-`kid` assumption with explicit union regex `^(gw-sig-1|gw-sig\.[a-z0-9]+\.edge-signer\.\d+)$`. New error code `E_KID_RETIRED` registered in §6.8. No changes to envelope shapes, signature algorithms, DID methods, identity-binding invariants (§6.5), or any other §6 subsection. Cross-links: RFC-0018 §3, RFC-0020 v2 §4.0 / §6.
- **2026-05-07 (domain rebase, CEO directive)** — prod issuer DID rebased to `did:web:api.aethermesh.app` (`.dev` domain not procured; CEO directive). Existing `did:web:api.aethermesh.dev` was never DNS-resolvable in prod — no live verifiers affected. §2.3 and §3 `iss` references rewritten; constraint shape unchanged.
- **v1.1 — 2026-05-05 (security erratum + lifecycle schema, fast-tracked per CTO)** — Applied 3 CRITICAL amendments from devex-protocol-sec review (A-1 KV→D1 fallback fail-closed, A-2 sub/kid identity binding pre-write, A-3 alg check on every device envelope). Resolved C-2 D1 wedge: added `swap_status` lifecycle column to `runtime_token_audit` (`pending | acked | nacked | timed_out`) + `swap_status_updated_at` timestamp + idempotent-reissue rule (§6.7) for retried `prev_jti`. New error codes `E_RUNTIME_REFRESH_STORE_UNAVAILABLE`, `E_RUNTIME_REFRESH_IDENTITY_MISMATCH`, `E_RUNTIME_REFRESH_RETRY_LIMIT` registered in §6.8. No changes to envelope shapes, signature algorithms, DID methods, or any other ratified RFC. v1.0 body archived at `tracking/work/agentic-architect/archive/rfc-0016-v1.0.md`. Source: `tracking/work/devex-protocol-sec/rfc-0016-security-review.md`.
- **v1.0 — 2026-05-05 (Ratified, CTO sign-off)** — Open questions resolved per CTO directive: Q1 frequency cap scope = per-device (blast-radius containment); Q2 client-initiated refresh added as normative §2.2a (low-power device support, AI glasses use case) with new `runtime_token_request` envelope (§4.5); Q3 audit storage = D1 source-of-truth (`runtime_token_audit` table) + KV hot cache (forensic queryability, fail-closed on D1 write failure). Envelope count 3 → 4. §5 reconnect path now references post-reconnect proactive refresh (§5.4). RFC-0015 reference bumped v1.0 → v1.1. One residual advisory question remains (§8): client-initiated `reason` taxonomy may need expansion as new device classes onboard. Original draft archived at `tracking/work/agentic-architect/archive/rfc-0016-v0.1-draft.md`.
- **v0.1 DRAFT (2026-05-05):** Initial draft per CTO directive following RFC-0015 v1.0 ratification (which deferred Q5 — runtime-token refresh — to this RFC). Specifies in-band server-initiated sliding-window refresh over the existing WSS connection, three new v2 envelope types (`runtime_token_refresh` / `runtime_token_ack` / `runtime_token_nack`), `prev_jti` audit chain, 120 s reconnect grace window, 5-min per-device refresh frequency cap, last-10 `jti` replay cache. No new signature algorithms; no new DID methods. Awaiting CTO + 🛡️ devex-protocol-sec review gate. **MUST ratify before W3 lights up the RFC-0015 §3 15-min runtime-token TTL.**
