KTP-Recovery: Disaster Recovery & Resilience¶

"Everything fails, all the time. The goal is not to prevent failure, but to recover without compromising security."

At a Glance¶

Property	Value
Status	Draft
Version	0.1
Dependencies	KTP-Core, KTP-Emergency
Required By	KTP-Audit, KTP-Conformance

The Problem¶

Complex systems are prone to failure—hardware crashes, network partitions, and cyberattacks. In a trust protocol, a failure cannot simply mean "downtime"; it must not result in a security breach. A system that "fails open" (allowing access when it shouldn't) is catastrophic.

The Solution: Resilient Recovery¶

KTP-Recovery defines the protocols for handling failures at every level, from a single node crash to a total zone outage. It prioritizes Security Preservation over Availability.

Recovery Principles¶

Fail Closed: When in doubt, deny. Availability loss is recoverable; a security breach is not.
No Single Point of Failure: Redundancy at every layer (Threshold Cryptography, Distributed Sensors).
Defense in Depth: Backup the backups. Verify the verifications.
Known-Good State: Restore to a verified clean state, not just the "last" state.

Failure Modes & Response¶

Oracle Node FailureMesh PartitionTotal Oracle Loss

Scenario: One node in the Oracle Mesh goes offline.

Impact: Minimal. Threshold cryptography (e.g., 3-of-5) allows operations to continue.
Response:
1. Isolate: Remove node from signing rotation.
2. Alert: Notify operations.
3. Restore: Re-sync state from healthy nodes.

graph LR
    A[Node 1] --- B[Node 2]
    B --- C[Node 3]
    C --- D[Node 4]
    D --- A
    E[Node 5 (Failed)] -.->|Isolated| A

Scenario: Network split divides the Oracle Mesh; no quorum exists.

Impact: Critical. No new Trust Proofs can be issued.
Response:
1. Cache Mode: PEPs honor existing proofs for <5 mins.
2. Conservative Mode: Deny new sessions; allow low-risk actions.
3. Fail-Closed: After 30 mins, deny all actions.

Scenario: All Oracle nodes are unreachable.

Impact: Catastrophic. Zone is effectively offline.
Response:
1. Emergency Mode: Activate pre-configured "Break Glass" policies.
2. Manual Override: Human operators must physically intervene to restore root keys.

Recovery Objectives (RTO/RPO)¶

Component	Recovery Time (RTO)	Data Loss (RPO)
Single Oracle	< 15 mins	0 (Real-time)
Oracle Mesh	< 1 hour	< 1 min
Flight Recorder	< 4 hours	0 (Real-time)
Full Zone	< 8 hours	< 15 mins

Related Specifications

KTP-Core: Baseline trust physics and \(A \leq E\).
KTP-Emergency: Break-glass escalation and emergency modes.
KTP-Audit: Flight Recorder audit trails for recovery actions.
KTP-Conformance: Recovery expectations tied to compliance tiers.

Official RFC Document¶

View Complete RFC Text (ktp-recovery.txt)

Kinetic Trust Protocol                                      C. Perkins
Specification Draft                                           NMCITRA
Version: 0.1                                             November 2025


        Kinetic Trust Protocol (KTP) - Recovery Specification

Abstract

   This document specifies disaster recovery, backup, restoration, and
   failure handling procedures for Kinetic Trust Protocol (KTP)
   deployments. It addresses Oracle mesh failures, zone recovery,
   federation partition handling, and data restoration procedures.

   Security systems must be resilient. This specification ensures KTP
   deployments can recover from failures without compromising security
   properties.

Status of This Memo

   This document is a draft specification developed by the New Mexico
   Cyber Intelligence & Threat Response Alliance (NMCITRA).

Copyright Notice

   Copyright (c) 2025 Chris Perkins / NMCITRA. Licensed under Apache
   License, Version 2.0.

Table of Contents

   1.  Introduction . . . . . . . . . . . . . . . . . . . . . . . . .  2
   2.  Recovery Principles  . . . . . . . . . . . . . . . . . . . . .  3
   3.  Failure Modes  . . . . . . . . . . . . . . . . . . . . . . . .  4
       3.1.  Oracle Node Failure  . . . . . . . . . . . . . . . . . .  4
       3.2.  Oracle Mesh Partition  . . . . . . . . . . . . . . . . .  6
       3.3.  Total Oracle Loss  . . . . . . . . . . . . . . . . . . .  8
       3.4.  Flight Recorder Failure  . . . . . . . . . . . . . . . . 10
       3.5.  Sensor Aggregator Failure  . . . . . . . . . . . . . . . 12
       3.6.  Federation Gateway Failure . . . . . . . . . . . . . . . 13
       3.7.  Zone-Wide Failure  . . . . . . . . . . . . . . . . . . . 14
   4.  Backup Procedures  . . . . . . . . . . . . . . . . . . . . . . 16
       4.1.  What to Back Up  . . . . . . . . . . . . . . . . . . . . 16
       4.2.  Backup Frequency . . . . . . . . . . . . . . . . . . . . 18
       4.3.  Backup Security  . . . . . . . . . . . . . . . . . . . . 19
       4.4.  Backup Verification  . . . . . . . . . . . . . . . . . . 20
   5.  Restoration Procedures . . . . . . . . . . . . . . . . . . . . 21
       5.1.  Oracle Restoration . . . . . . . . . . . . . . . . . . . 21
       5.2.  Key Recovery . . . . . . . . . . . . . . . . . . . . . . 23
       5.3.  State Reconstruction . . . . . . . . . . . . . . . . . . 25
       5.4.  Trajectory Recovery  . . . . . . . . . . . . . . . . . . 27
   6.  Graceful Degradation . . . . . . . . . . . . . . . . . . . . . 28
       6.1.  Degradation Levels . . . . . . . . . . . . . . . . . . . 28
       6.2.  Automatic Failover . . . . . . . . . . . . . . . . . . . 30
       6.3.  Manual Intervention  . . . . . . . . . . . . . . . . . . 31
   7.  Testing and Validation . . . . . . . . . . . . . . . . . . . . 32
   8.  Security Considerations  . . . . . . . . . . . . . . . . . . . 34
   9.  References . . . . . . . . . . . . . . . . . . . . . . . . . . 35
   Appendix A.  Recovery Runbooks . . . . . . . . . . . . . . . . . . 36
   Authors' Addresses . . . . . . . . . . . . . . . . . . . . . . . . 42

1.  Introduction

   "Everything fails, all the time." - Werner Vogels

   KTP systems protect critical operations. When components fail, the
   system must either continue operating safely or recover quickly. This
   specification defines how.

   Recovery objectives:

   1. SECURITY PRESERVATION
      Recovery must not compromise security properties. "Fail secure"
      takes precedence over "fail available."

   2. DATA INTEGRITY
      Recovered state must be consistent and verifiable. No silent data
      corruption or loss.

   3. MINIMAL DISRUPTION
      Recovery should be as fast as possible while meeting objectives
      1 and 2.

   4. AUDITABILITY
      All recovery actions must be logged and attributable.

2.  Recovery Principles

   PRINCIPLE 1: FAIL CLOSED
   When in doubt, deny. A system that fails open is worse than a system
   that fails closed. Availability loss is recoverable; security breach
   may not be.

   PRINCIPLE 2: NO SINGLE POINT OF FAILURE
   Every critical component should have redundancy. Threshold
   cryptography for Oracles. Multiple Flight Recorders. Distributed
   sensors.

   PRINCIPLE 3: DEFENSE IN DEPTH FOR RECOVERY
   Backup your backups. Verify your verifications. Test your tests.
   Recovery procedures themselves can fail.

   PRINCIPLE 4: SECURITY DURING RECOVERY
   Recovery mode is an attractive attack window. Maintain
   authentication, authorization, and audit during recovery.

   PRINCIPLE 5: KNOWN-GOOD STATE
   Recovery should restore to a known-good state, not just "last state."
   Compromised state should not be restored.

   Recovery Time Objectives (RTO):

   +-------------------------------------------------------------------+
   | Component              | Level 1    | Level 2    | Level 3       |
   +-------------------------------------------------------------------+
   | Single Oracle node     | 1 hour     | 15 minutes | 5 minutes     |
   | Oracle mesh (quorum)   | 4 hours    | 1 hour     | 15 minutes    |
   | Flight Recorder        | 24 hours   | 4 hours    | 1 hour        |
   | Federation gateway     | 4 hours    | 1 hour     | 15 minutes    |
   | Full zone              | 24 hours   | 8 hours    | 2 hours       |
   +-------------------------------------------------------------------+

   Recovery Point Objectives (RPO):

   +-------------------------------------------------------------------+
   | Data Type              | Level 1    | Level 2    | Level 3       |
   +-------------------------------------------------------------------+
   | Trust Score state      | 1 hour     | 15 minutes | 1 minute      |
   | Trajectory records     | 24 hours   | 1 hour     | 15 minutes    |
   | Flight Recorder        | 1 hour     | 15 minutes | 0 (real-time) |
   | Configuration          | 24 hours   | 4 hours    | 1 hour        |
   +-------------------------------------------------------------------+

3.  Failure Modes

3.1.  Oracle Node Failure

   Definition: Single Oracle node becomes unavailable while remaining
   nodes continue operating.

3.1.1.  Detection

   -  Heartbeat failure (3 consecutive missed, 30 seconds)
   -  Threshold signing failure (node doesn't contribute)
   -  Health check endpoint returns error or timeout
   -  Network unreachability

3.1.2.  Impact Assessment

   Impact depends on threshold configuration:

   -  5-of-7 threshold: 2 nodes can fail, still operational
   -  3-of-5 threshold: 2 nodes can fail, still operational
   -  2-of-3 threshold: 1 node can fail, still operational

   Single node failure with adequate threshold:
   -  No Trust Proof issuance interruption
   -  Slight latency increase (fewer signing participants)
   -  Reduced fault tolerance margin

3.1.3.  Automatic Response

   1.  DETECT: Monitoring detects node failure
   2.  ISOLATE: Remove node from signing rotation
   3.  ALERT: Notify operations team
   4.  CONTINUE: Remaining nodes continue operation
   5.  REDISTRIBUTE: Rebalance load across healthy nodes

3.1.4.  Manual Recovery

   1.  DIAGNOSE: Determine failure cause
       -  Hardware failure -> Replace hardware
       -  Software crash -> Analyze, restart or redeploy
       -  Network issue -> Resolve network problem
       -  Compromise suspected -> Initiate incident response

   2.  RESTORE: Return node to operation
       -  Restart service if software issue
       -  Redeploy to new hardware if hardware issue
       -  Re-sync state from healthy nodes

   3.  VERIFY: Confirm proper operation
       -  Health check passes
       -  Participates in threshold signing
       -  State matches other nodes

   4.  REINTEGRATE: Add back to rotation
       -  Gradually increase traffic
       -  Monitor for issues
       -  Restore full participation

3.1.5.  Key Share Considerations

   Oracle node has threshold key share. If node is compromised:

   -  DO NOT restore node with same key share
   -  Initiate key rotation procedure (KTP-CRYPTO Section 8.4)
   -  Generate new key shares for all nodes
   -  Old key enters grace period, then expires

   If node failure is NOT compromise:
   -  Key share can be restored from backup
   -  Or regenerated from other shares (if supported)
   -  HSM attestation should verify integrity

3.2.  Oracle Mesh Partition

   Definition: Oracle mesh splits into disconnected groups, neither
   having quorum.

3.2.1.  Detection

   -  Threshold signing fails (insufficient participants)
   -  Nodes report partial connectivity
   -  Network monitoring shows partition

3.2.2.  Impact

   CRITICAL: Trust Proof issuance stops zone-wide

   -  No new Trust Proofs can be issued
   -  Existing proofs continue until expiration
   -  PEPs enter degraded mode (cache or fail-closed)
   -  Operations requiring new proofs fail

3.2.3.  Degradation Behavior

   During partition, PEPs should:

   1.  CACHE MODE (short partition, <5 minutes)
       -  Continue honoring cached Trust Proofs
       -  Allow actions within cached proof validity
       -  Queue proof refresh requests

   2.  CONSERVATIVE MODE (medium partition, 5-30 minutes)
       -  Honor cached proofs for existing sessions
       -  Deny new sessions requiring fresh proofs
       -  Allow only low-risk actions

   3.  FAIL-CLOSED MODE (long partition, >30 minutes)
       -  Deny all actions requiring Trust Proofs
       -  Allow only pre-authorized emergency actions
       -  Alert administrators

3.2.4.  Resolution

   1.  IDENTIFY partition cause
       -  Network failure between sites
       -  BGP issue, firewall misconfiguration
       -  DDoS attack on network links

   2.  RESOLVE network issue
       -  Work with network team
       -  Activate backup network paths
       -  Engage ISP if external

   3.  RECONCILE state after healing
       -  Nodes compare state
       -  Resolve any conflicts (rare, but possible)
       -  Resume threshold signing

3.2.5.  Prevention

   -  Geographic diversity but network redundancy
   -  Multiple network paths between Oracle sites
   -  Monitoring of network health
   -  Regular partition testing

3.3.  Total Oracle Loss

   Definition: All Oracle nodes fail or are unreachable. No threshold
   signing is possible.

   This is the most severe failure mode.

3.3.1.  Detection

   -  All health checks fail
   -  No nodes respond to signing requests
   -  Monitoring shows all nodes down

3.3.2.  Impact

   CRITICAL: Zone is effectively offline for new trust decisions

   -  No new Trust Proofs
   -  Existing proofs expire within seconds
   -  All enforced actions eventually blocked
   -  Zone enters emergency mode

3.3.3.  Emergency Mode Operation

   When total Oracle loss detected:

   1.  PEPs switch to EMERGENCY MODE
       -  Pre-configured emergency policy takes effect
       -  Only explicitly allowed actions permitted
       -  All other actions denied

   2.  Emergency policy should define:
       -  Life-safety actions always allowed
       -  Critical infrastructure maintenance allowed
       -  All other actions denied
       -  Aggressive alerting

   3.  Human intervention REQUIRED
       -  Automated recovery cannot restore from total loss
       -  Key ceremony may be required
       -  Business continuity procedures activated

3.3.4.  Recovery from Total Loss

   SCENARIO A: Infrastructure failure (all nodes down but intact)

   1.  Restore infrastructure (network, power, hardware)
   2.  Restart Oracle nodes
   3.  Nodes recover state from local storage
   4.  Verify threshold signing works
   5.  Exit emergency mode

   SCENARIO B: Data loss (nodes destroyed)

   1.  Provision new Oracle infrastructure
   2.  Restore from backup:
       -  Configuration from backup
       -  State from backup
       -  Key shares from secure backup (if available)
   3.  If key shares not recoverable:
       -  Initiate key recovery ceremony
       -  Recover from trustee-held shares
   4.  Verify and reintegrate

   SCENARIO C: Suspected compromise (all nodes potentially hostile)

   1.  DO NOT restore from current backups (may be compromised)
   2.  Activate incident response
   3.  Rebuild from known-good golden images
   4.  Key ceremony to generate new keys
   5.  Re-enroll all agents (may be required)
   6.  Forensic investigation of compromise

3.4.  Flight Recorder Failure

   Definition: Flight Recorder becomes unavailable or corrupted.

3.4.1.  Detection

   -  Write failures from components
   -  Read failures for queries
   -  Integrity check failures
   -  Storage exhaustion

3.4.2.  Impact

   MEDIUM-HIGH: Audit trail may have gaps

   -  New decisions may not be recorded
   -  Historical queries fail
   -  Compliance impact
   -  Forensic capability degraded

   Note: Flight Recorder failure does NOT stop Trust Proof issuance.
   Operations can continue, but without audit.

3.4.3.  Degradation Behavior

   1.  LOCAL BUFFERING
       -  Components buffer audit records locally
       -  Typical buffer: 1 hour of records
       -  Buffer overflow: oldest records dropped (with count)

   2.  SECONDARY FLIGHT RECORDER
       -  If configured, switch to secondary
       -  Primary failure logged

   3.  ALERTING
       -  Immediate alert on Flight Recorder failure
       -  Escalating alerts as buffer fills

3.4.4.  Recovery

   1.  Restore Flight Recorder service
   2.  Flush buffered records from components
   3.  Verify chain integrity
       -  Chain will have gap marker if records lost
       -  Gap is itself recorded and auditable
   4.  Investigate cause of failure

3.4.5.  Chain Gap Handling

   If records were lost during failure:

   {
     "record_type": "chain_gap",
     "gap_start": "2025-11-25T10:00:00Z",
     "gap_end": "2025-11-25T10:15:00Z",
     "estimated_records_lost": 1500,
     "cause": "flight_recorder_failure",
     "recovery_action": "restored_from_backup",
     "signature": "..."
   }

   Gap records are signed by Oracle and become part of permanent audit
   trail.

3.5.  Sensor Aggregator Failure

   Definition: Sensor aggregator becomes unavailable.

3.5.1.  Impact

   MEDIUM: Context Tensor updates degraded

   -  Sensors cannot deliver readings
   -  Context Tensor uses stale data
   -  Risk Factor calculation affected

3.5.2.  Degradation Behavior

   1.  SENSOR BUFFERING
       -  Sensors buffer readings locally
       -  Typical buffer: 5 minutes

   2.  STALE DATA MARKING
       -  Oracle marks Context Tensor dimensions as stale
       -  Stale dimensions may increase Risk Factor
       -  Or use conservative defaults

   3.  FAILOVER
       -  Sensors switch to backup aggregator (if configured)

3.5.3.  Recovery

   1.  Restore aggregator service
   2.  Sensors flush buffered readings
   3.  Context Tensor returns to real-time
   4.  Trust Scores normalize

3.6.  Federation Gateway Failure

   Definition: Federation gateway becomes unavailable.

3.6.1.  Impact

   MEDIUM: Cross-zone trust affected

   -  Cannot accept foreign Trust Proofs
   -  Cannot issue cross-zone attestations
   -  Federation heartbeat fails
   -  Partner zones see this zone as degraded

3.6.2.  Degradation Behavior

   1.  CACHED FOREIGN PROOFS
       -  Honor cached foreign proofs until expiration
       -  No new foreign proofs accepted

   2.  LOCAL OPERATION
       -  Zone operates independently
       -  Local agents unaffected
       -  Cross-zone agents cannot operate

   3.  PARTNER NOTIFICATION
       -  Heartbeat failure notifies partners
       -  Partners reduce trust factor for this zone

3.6.3.  Recovery

   1.  Restore federation gateway
   2.  Re-establish federation connections
   3.  Resume heartbeat
   4.  Trust factor gradually recovers

3.7.  Zone-Wide Failure

   Definition: Entire zone becomes unavailable (natural disaster,
   massive infrastructure failure, coordinated attack).

3.7.1.  Impact

   CRITICAL: All zone operations cease

   -  All agents in zone cannot operate
   -  All protected resources inaccessible
   -  Federation partners lose connection

3.7.2.  Recovery Options

   OPTION A: Restore in place
   -  Rebuild infrastructure at same location
   -  Restore from backups
   -  Resume operations

   OPTION B: Failover to DR site
   -  Activate disaster recovery site
   -  Restore from replicated data
   -  Update DNS/routing to DR site
   -  Resume operations at DR

   OPTION C: Migrate to partner zone (temporary)
   -  Federated partner accepts refugees
   -  Agents operate with reduced trust (foreign zone penalty)
   -  Temporary until primary zone restored

3.7.3.  DR Site Requirements

   For Level 3 deployments, DR site MUST:

   -  Be geographically separate (>100 miles recommended)
   -  Have independent power and network
   -  Maintain synchronized state (RPO per Section 2)
   -  Have HSMs with key shares
   -  Be tested quarterly

4.  Backup Procedures

4.1.  What to Back Up

4.1.1.  Critical (MUST back up)

   ORACLE SIGNING KEY SHARES
   -  Threshold key shares for Trust Proof signing
   -  Backup encrypted to recovery keys
   -  Stored with trustees (not with Oracle)
   -  Recovery requires M-of-N trustees

   ZONE CONFIGURATION
   -  Security policies, thresholds, weights
   -  Soul constraints
   -  Trust Tier boundaries
   -  Federation agreements

   FLIGHT RECORDER DATA
   -  All audit records
   -  Chain hashes
   -  External anchor references

4.1.2.  Important (SHOULD back up)

   AGENT REGISTRY
   -  Registered agents
   -  Public keys
   -  Lineage information
   -  Current E_base (can be recalculated)

   TRAJECTORY CHAINS
   -  Agent transaction history
   -  Proof of Resilience records
   -  (Large; may use differential backups)

   SENSOR CONFIGURATION
   -  Sensor registrations
   -  Aggregator mappings
   -  Baseline calibrations

4.1.3.  Reconstructible (MAY back up)

   TRUST SCORES
   -  Current E_trust values
   -  Can be recalculated from E_base and Context Tensor

   CONTEXT TENSOR
   -  Current sensor readings
   -  Will be refreshed by live sensors

   CACHED TRUST PROOFS
   -  Short-lived, will be re-issued

4.2.  Backup Frequency

   +-------------------------------------------------------------------+
   | Data Type              | Backup Frequency | Retention            |
   +-------------------------------------------------------------------+
   | Key shares             | On change only   | Forever              |
   | Configuration          | On change + daily| 1 year               |
   | Flight Recorder        | Continuous/hourly| Per policy (7 years) |
   | Agent Registry         | Daily            | 90 days              |
   | Trajectory Chains      | Daily incremental| 1 year full, 7 incr  |
   | Sensor Config          | Daily            | 90 days              |
   +-------------------------------------------------------------------+

4.3.  Backup Security

   ENCRYPTION
   -  All backups encrypted at rest
   -  Encryption key separate from backed-up keys
   -  Key shares backed up to different location than data

   ACCESS CONTROL
   -  Backup access requires authentication
   -  Restore requires multi-person authorization (Level 2+)
   -  All access logged

   INTEGRITY
   -  Backups include cryptographic checksums
   -  Verify integrity before restore
   -  Detect tampering

   ISOLATION
   -  Backups stored separately from production
   -  Air-gapped for highest security (Level 3)
   -  Ransomware cannot reach backups

4.4.  Backup Verification

   VERIFICATION SCHEDULE
   -  Automated integrity check: Daily
   -  Restore test to isolated environment: Monthly
   -  Full DR test: Quarterly (Level 2+), Monthly (Level 3)

   VERIFICATION PROCEDURES
   1.  Verify backup file integrity (checksums)
   2.  Verify backup completeness (all expected files)
   3.  Restore to isolated environment
   4.  Verify restored system functions
   5.  Verify restored data matches source
   6.  Document results

5.  Restoration Procedures

5.1.  Oracle Restoration

5.1.1.  Single Node Restoration

   Prerequisites:
   -  Healthy Oracle mesh (other nodes operational)
   -  Backup of node configuration
   -  Key share (from backup or ceremony)

   Procedure:

   1.  PROVISION infrastructure
       -  Deploy new VM/hardware
       -  Install Oracle software
       -  Configure network

   2.  RESTORE configuration
       -  Apply configuration from backup
       -  Verify configuration matches zone

   3.  RESTORE key share
       -  If HSM available: Import key share
       -  If HSM destroyed: Recover from trustee backup

   4.  SYNC state
       -  Connect to healthy nodes
       -  Sync current Trust Score state
       -  Sync recent trajectory updates

   5.  VERIFY operation
       -  Run health checks
       -  Participate in test signing
       -  Verify state matches other nodes

   6.  REINTEGRATE
       -  Add to load balancer
       -  Enable full traffic

5.1.2.  Full Mesh Restoration

   Prerequisites:
   -  New infrastructure for all nodes
   -  Configuration backups
   -  Key shares (from trustees or ceremony)
   -  State backups

   Procedure:

   1.  PROVISION all infrastructure
       -  Deploy all Oracle nodes
       -  Configure network between nodes

   2.  RESTORE configuration to all nodes
       -  Same configuration on all nodes

   3.  KEY RECOVERY CEREMONY
       -  Convene required trustees
       -  Recover threshold key shares
       -  Install shares in HSMs

   4.  RESTORE state
       -  Restore Agent Registry from backup
       -  Restore trajectory data from backup
       -  Restore last known Trust Scores

   5.  VERIFY threshold signing
       -  Attempt to sign test message
       -  Verify k-of-n nodes can sign

   6.  RESUME operations
       -  Enable PEP connections
       -  Issue Trust Proofs
       -  Monitor closely

5.2.  Key Recovery

5.2.1.  Key Recovery Prerequisites

   Key recovery requires:

   -  M-of-N trustees (typically 3-of-5 or 5-of-7)
   -  Trustees have recovery key shares
   -  Secure environment for ceremony
   -  New HSMs to receive recovered keys

   Trustees should be:
   -  Geographically distributed
   -  Organizationally independent
   -  Personally reliable
   -  Reachable in emergency

5.2.2.  Key Recovery Ceremony

   PHASE 1: CONVENE
   1.  Incident commander declares key recovery needed
   2.  Contact trustees (secure channel)
   3.  Schedule ceremony (virtual or in-person)
   4.  Prepare secure environment

   PHASE 2: AUTHENTICATE
   1.  Verify trustee identity (multi-factor)
   2.  Verify ceremony authorization
   3.  Record ceremony (video + transcript)
   4.  Witnesses present (if required)

   PHASE 3: RECOVER
   1.  Each trustee decrypts their recovery share
   2.  Shares combined in secure environment
   3.  Master key material reconstructed
   4.  New threshold shares generated
   5.  Shares installed in HSMs
   6.  Master key material destroyed (never stored)

   PHASE 4: VERIFY
   1.  Test threshold signing
   2.  Verify all nodes can participate
   3.  Issue test Trust Proof
   4.  Verify signature validates

   PHASE 5: DOCUMENT
   1.  Record ceremony completion
   2.  Update key inventory
   3.  Notify stakeholders
   4.  Archive ceremony recording (secure)

5.2.3.  Key Recovery Security

   -  Recovery environment: Air-gapped if possible
   -  No photography except official recording
   -  All attendees logged
   -  Ceremony room swept for bugs (Level 3)
   -  Shares transmitted encrypted, never in clear

5.3.  State Reconstruction

   If state backups are unavailable or suspect:

5.3.1.  Agent Registry Reconstruction

   If backup unavailable:
   1.  Agents must re-register
   2.  Identity proofing re-verified
   3.  Sponsor relationships re-established
   4.  E_base starts from initial value (not historical)

   Impact: Agents lose accumulated trust. Significant operational
   disruption. Use backup restoration if at all possible.

5.3.2.  Trust Score Reconstruction

   Trust Scores can be recalculated if:
   -  E_base known (from trajectory or backup)
   -  Context Tensor available (from sensors)

   E_trust = E_base x Context_modifier x Risk_factor

   If E_base unknown, must use default for lineage type.

5.3.3.  Trajectory Reconstruction

   Trajectory chains are cryptographically linked. If chain is broken:

   1.  Recover as much chain as possible from backup
   2.  Mark gap in chain
   3.  Continue new records after gap
   4.  Agent's E_base calculation notes gap

   Gaps in trajectory reduce trust (unverifiable history).

5.4.  Trajectory Recovery

   Trajectory data may be large. Recovery strategies:

   FULL RESTORE
   -  Restore complete trajectory from backup
   -  Slowest but most complete
   -  Use for complete zone recovery

   POINT-IN-TIME RESTORE
   -  Restore trajectory to specific point
   -  Useful if recent data corrupted
   -  Records after restore point lost

   DIFFERENTIAL RESTORE
   -  Restore base + incremental backups
   -  Faster than full restore
   -  Standard approach for routine recovery

6.  Graceful Degradation

6.1.  Degradation Levels

   KTP defines four degradation levels:

   LEVEL 0: NORMAL
   -  All components operational
   -  Full functionality
   -  Standard Trust Score calculation

   LEVEL 1: DEGRADED
   -  Some redundancy lost
   -  Full functionality maintained
   -  Increased monitoring
   -  Example: 1 Oracle node down (but quorum intact)

   LEVEL 2: IMPAIRED
   -  Some functionality reduced
   -  Core security maintained
   -  Non-critical features disabled
   -  Example: Sensor aggregator down (stale Context Tensor)

   LEVEL 3: EMERGENCY
   -  Critical functionality only
   -  Fail-closed for non-essential
   -  Human intervention required
   -  Example: Oracle mesh partitioned

   LEVEL 4: OFFLINE
   -  Zone non-operational
   -  All requests denied
   -  Recovery in progress
   -  Example: Total Oracle loss

6.2.  Automatic Failover

6.2.1.  Oracle Failover

   -  Automatic within threshold (node failure)
   -  Automatic leader election if needed
   -  No manual intervention for k-of-n failures

6.2.2.  Flight Recorder Failover

   -  Automatic switch to secondary (if configured)
   -  Local buffering during transition
   -  Alert on primary failure

6.2.3.  Sensor Aggregator Failover

   -  Sensors automatically retry backup aggregator
   -  Oracle uses stale data with marking
   -  Automatic when aggregator returns

6.2.4.  Federation Failover

   -  Automatic failover to backup gateway
   -  Partners notified via heartbeat
   -  Automatic reconnection on recovery

6.3.  Manual Intervention

   Some situations require manual intervention:

   -  Total Oracle loss
   -  Suspected compromise
   -  Key recovery
   -  DR site activation
   -  Configuration rollback

   Manual procedures are documented in Appendix A.

7.  Testing and Validation

7.1.  Recovery Testing Requirements

   +-------------------------------------------------------------------+
   | Test Type              | Level 1    | Level 2    | Level 3       |
   +-------------------------------------------------------------------+
   | Backup verification    | Monthly    | Weekly     | Daily         |
   | Single node recovery   | Quarterly  | Monthly    | Monthly       |
   | Mesh partition test    | Annually   | Quarterly  | Monthly       |
   | Full DR test           | Annually   | Quarterly  | Monthly       |
   | Key recovery drill     | Annually   | Semi-annual| Quarterly     |
   +-------------------------------------------------------------------+

7.2.  Test Procedures

   SINGLE NODE RECOVERY TEST
   1.  Take one Oracle node offline (planned)
   2.  Verify mesh continues operating
   3.  Restore node from backup
   4.  Verify node rejoins mesh
   5.  Document time and issues

   PARTITION TEST
   1.  Simulate network partition (firewall rules)
   2.  Verify PEPs enter degraded mode
   3.  Verify no Trust Proofs issued during partition
   4.  Heal partition
   5.  Verify normal operation resumes

   FULL DR TEST
   1.  Activate DR site
   2.  Restore all components from backup
   3.  Verify full functionality
   4.  Measure RTO and RPO achieved
   5.  Fail back to primary

8.  Security Considerations

8.1.  Recovery as Attack Vector

   Recovery procedures can be exploited:

   -  Attacker triggers failure to invoke recovery
   -  Attacker substitutes malicious backup
   -  Attacker compromises recovery process
   -  Attacker uses recovery mode to bypass controls

   Mitigations:
   -  Authenticate all recovery actions
   -  Verify backup integrity before restore
   -  Maintain audit during recovery
   -  Time-limit recovery mode

8.2.  Backup Security

   Backups are high-value targets:

   -  Contain all secrets (encrypted, but still)
   -  May reveal system architecture
   -  May enable offline attacks

   Mitigations:
   -  Encrypt all backups
   -  Separate backup encryption keys
   -  Access control on backups
   -  Monitor backup access

8.3.  Key Recovery Security

   Key recovery is highest-risk operation:

   -  Reconstitutes master key material
   -  Single point where key exists in full
   -  Attractive target for advanced attackers

   Mitigations:
   -  Ceremony with witnesses
   -  Secure environment
   -  Immediate destruction after use
   -  M-of-N trustee requirement

9.  References

   [KTP-CRYPTO]
              Perkins, C., "Kinetic Trust Protocol - Cryptographic
              Specification", NMCITRA, November 2025.

   [KTP-AUDIT]
              Perkins, C., "Kinetic Trust Protocol - Audit
              Specification", NMCITRA, November 2025.

   [NIST-CP]  National Institute of Standards and Technology,
              "Contingency Planning Guide for Federal Information
              Systems", SP 800-34 Rev. 1, May 2010.

Appendix A.  Recovery Runbooks

A.1.  Runbook: Single Oracle Node Recovery

   TRIGGER: Oracle node health check fails for 5 minutes

   STEPS:

   1.  Verify failure (not monitoring false positive)
       $ ktp-cli oracle status --node oracle-1

   2.  Check if quorum maintained
       $ ktp-cli oracle mesh-status
       Expected: "Mesh operational, X of Y nodes healthy"

   3.  Attempt restart
       $ systemctl restart ktp-oracle
       Wait 60 seconds, check status

   4.  If restart fails, check logs
       $ journalctl -u ktp-oracle -n 100

   5.  If hardware issue, provision new node
       $ terraform apply -target=oracle-node-1

   6.  Restore configuration
       $ ktp-backup restore --target oracle-1 --type config

   7.  Restore key share (requires HSM access)
       $ ktp-hsm import-share --node oracle-1

   8.  Verify and reintegrate
       $ ktp-cli oracle join-mesh --node oracle-1
       $ ktp-cli oracle verify-signing --node oracle-1

   ESCALATION: If not resolved in 30 minutes, page on-call lead

A.2.  Runbook: Oracle Mesh Partition

   TRIGGER: Trust Proof issuance fails, partition detected

   STEPS:

   1.  Verify partition
       $ ktp-cli oracle mesh-status
       Expected: "Mesh partitioned, no quorum"

   2.  Identify partition boundaries
       $ ktp-cli oracle connectivity-matrix

   3.  Check network connectivity
       $ for node in oracle-{1..5}; do
           ping -c 1 $node && echo "$node reachable"
         done

   4.  If network issue, engage network team
       -  Check firewall rules
       -  Check BGP status
       -  Check physical connectivity

   5.  If single site isolated, verify other sites operational
       $ ktp-cli oracle site-status

   6.  When partition heals, verify mesh reforms
       $ ktp-cli oracle mesh-status
       Expected: "Mesh operational"

   7.  Check for state conflicts
       $ ktp-cli oracle state-consistency-check

   8.  Resume normal operations
       $ ktp-cli zone set-degradation-level 0

   ESCALATION: Immediate page to incident commander

A.3.  Runbook: Total Oracle Loss Recovery

   TRIGGER: All Oracle nodes unreachable

   STEPS:

   1.  Declare incident
       -  Page incident commander
       -  Activate incident response team
       -  Begin incident timeline

   2.  Assess situation
       -  Infrastructure failure vs. attack
       -  Data center status
       -  Network status

   3.  If infrastructure failure:
       a.  Restore infrastructure
       b.  Restore from backup (see A.4)

   4.  If suspected attack:
       a.  DO NOT restore from recent backups
       b.  Engage security team
       c.  Preserve evidence
       d.  Rebuild from golden images
       e.  Key recovery ceremony required

   5.  Activate DR site (if available)
       $ ktp-dr activate --site dr-west

   6.  Update DNS/routing to DR
       $ ktp-dns failover --to dr-west

   7.  Verify DR operational
       $ ktp-cli oracle mesh-status --site dr-west

   8.  Communicate status to stakeholders

   ESCALATION: This IS the escalation

A.4.  Runbook: Restore from Backup

   TRIGGER: Recovery requires backup restoration

   STEPS:

   1.  Identify backup to restore
       $ ktp-backup list --type full
       Select most recent known-good backup

   2.  Verify backup integrity
       $ ktp-backup verify --backup-id <id>
       MUST pass before proceeding

   3.  Provision target infrastructure
       $ terraform apply

   4.  Restore configuration
       $ ktp-backup restore --backup-id <id> --type config

   5.  Restore state data
       $ ktp-backup restore --backup-id <id> --type state

   6.  Restore trajectories (may take time)
       $ ktp-backup restore --backup-id <id> --type trajectory

   7.  Key recovery (if needed)
       -  Contact trustees
       -  Schedule ceremony
       -  Execute per Section 5.2.2

   8.  Verify system operational
       $ ktp-cli oracle mesh-status
       $ ktp-cli oracle test-sign

   9.  Gradually restore traffic
       $ ktp-cli zone set-degradation-level 1
       Monitor for 15 minutes
       $ ktp-cli zone set-degradation-level 0

A.5.  Runbook: Key Recovery Ceremony

   TRIGGER: Key shares unrecoverable from backup

   PREPARATION:
   1.  Incident commander authorizes ceremony
   2.  Contact M trustees (need M of N)
   3.  Schedule ceremony time (within RTO)
   4.  Prepare secure environment:
       -  Air-gapped machine
       -  New HSMs
       -  Recording equipment
       -  Witness(es)

   CEREMONY:
   1.  Verify all participants' identity
   2.  Begin recording
   3.  Read ceremony authorization into record
   4.  Each trustee:
       a.  Connects to ceremony machine (isolated)
       b.  Decrypts their recovery share
       c.  Inputs share to recovery software
       d.  Disconnects
   5.  Recovery software reconstructs master key
   6.  Generate new threshold shares
   7.  Install shares in HSMs
   8.  Test threshold signing
   9.  Securely destroy master key material
   10. End recording

   POST-CEREMONY:
   1.  Verify all Oracle nodes operational
   2.  Generate new trustee recovery shares
   3.  Distribute to trustees
   4.  Archive ceremony recording
   5.  Update key inventory

Authors' Addresses

   Chris Perkins
   New Mexico Cyber Intelligence & Threat Response Alliance (NMCITRA)

   Email: cperkins@nmcitra.org