KTP-Recovery: Disaster Recovery & Resilience
"Everything fails, all the time. The goal is not to prevent failure, but to recover without compromising security."
At a Glance
| Property | Value |
|---|---|
| Status | Draft |
| Version | 0.1 |
| Dependencies | KTP-Core, KTP-Emergency |
| Required By | KTP-Audit, KTP-Conformance |
The Problem
Complex systems fail: hardware crashes, networks partition, and attackers strike. In a trust protocol, a failure may cost availability, but it must never result in a security breach. A system that "fails open" (granting access when it should not) is catastrophic.
The Solution: Resilient Recovery
KTP-Recovery defines the protocols for handling failures at every level, from a single node crash to a total zone outage. It prioritizes Security Preservation over Availability.
Recovery Principles
- Fail Closed: When in doubt, deny. Availability loss is recoverable; a security breach is not.
- No Single Point of Failure: Redundancy at every layer (Threshold Cryptography, Distributed Sensors).
- Defense in Depth: Back up the backups. Verify the verifications.
- Known-Good State: Restore to a verified clean state, not just the "last" state.
Failure Modes & Response
Scenario: One node in the Oracle Mesh goes offline.
- Impact: Minimal. Threshold cryptography (e.g., 3-of-5) allows operations to continue.
- Response:
- Isolate: Remove node from signing rotation.
- Alert: Notify operations.
- Restore: Re-sync state from healthy nodes.
graph LR
A[Node 1] --- B[Node 2]
B --- C[Node 3]
C --- D[Node 4]
D --- A
E["Node 5 (Failed)"] -.->|Isolated| A
Scenario: Network split divides the Oracle Mesh; no quorum exists.
- Impact: Critical. No new Trust Proofs can be issued.
- Response:
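- Cache Mode (partition under 5 minutes): PEPs continue honoring cached Trust Proofs.
- Conservative Mode (5 to 30 minutes): Deny new sessions; allow only low-risk actions on existing sessions.
- Fail-Closed (over 30 minutes): Deny all actions except pre-authorized emergency actions.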
Scenario: All Oracle nodes are unreachable.
- Impact: Catastrophic. Zone is effectively offline.
- Response:
- Emergency Mode: Activate pre-configured "Break Glass" policies.
- Manual Override: Human operators must physically intervene to restore root keys.
Recovery Objectives (RTO/RPO)
| Component | Recovery Time (RTO) | Data Loss (RPO) |
|---|---|---|
| Single Oracle | < 15 mins | 0 (Real-time) |
| Oracle Mesh | < 1 hour | < 1 min |
| Flight Recorder | < 4 hours | 0 (Real-time) |
| Full Zone | < 8 hours | < 15 mins |
Related Specifications
- KTP-Core: Baseline trust physics and \(A \leq E\).
- KTP-Emergency: Break-glass escalation and emergency modes.
- KTP-Audit: Flight Recorder audit trails for recovery actions.
- KTP-Conformance: Recovery expectations tied to compliance tiers.
Official RFC Document
View Complete RFC Text (ktp-recovery.txt)
Kinetic Trust Protocol C. Perkins
Specification Draft NMCITRA
Version: 0.1 November 2025
Kinetic Trust Protocol (KTP) - Recovery Specification
Abstract
This document specifies disaster recovery, backup, restoration, and
failure handling procedures for Kinetic Trust Protocol (KTP)
deployments. It addresses Oracle mesh failures, zone recovery,
federation partition handling, and data restoration procedures.
Security systems must be resilient. This specification ensures KTP
deployments can recover from failures without compromising security
properties.
Status of This Memo
This document is a draft specification developed by the New Mexico
Cyber Intelligence & Threat Response Alliance (NMCITRA).
Copyright Notice
Copyright (c) 2025 Chris Perkins / NMCITRA. Licensed under Apache
License, Version 2.0.
Table of Contents
1. Introduction
2. Recovery Principles
3. Failure Modes
3.1. Oracle Node Failure
3.2. Oracle Mesh Partition
3.3. Total Oracle Loss
3.4. Flight Recorder Failure
3.5. Sensor Aggregator Failure
3.6. Federation Gateway Failure
3.7. Zone-Wide Failure
4. Backup Procedures
4.1. What to Back Up
4.2. Backup Frequency
4.3. Backup Security
4.4. Backup Verification
5. Restoration Procedures
5.1. Oracle Restoration
5.2. Key Recovery
5.3. State Reconstruction
5.4. Trajectory Recovery
6. Graceful Degradation
6.1. Degradation Levels
6.2. Automatic Failover
6.3. Manual Intervention
7. Testing and Validation
8. Security Considerations
9. References
Appendix A. Recovery Runbooks
Authors' Addresses
1. Introduction
"Everything fails, all the time." - Werner Vogels
KTP systems protect critical operations. When components fail, the
system must either continue operating safely or recover quickly. This
specification defines how.
Recovery objectives:
1. SECURITY PRESERVATION
Recovery must not compromise security properties. "Fail secure"
takes precedence over "fail available."
2. DATA INTEGRITY
Recovered state must be consistent and verifiable. No silent data
corruption or loss.
3. MINIMAL DISRUPTION
Recovery should be as fast as possible while meeting objectives
1 and 2.
4. AUDITABILITY
All recovery actions must be logged and attributable.
2. Recovery Principles
PRINCIPLE 1: FAIL CLOSED
When in doubt, deny. A system that fails open is worse than a system
that fails closed. Availability loss is recoverable; security breach
may not be.
PRINCIPLE 2: NO SINGLE POINT OF FAILURE
Every critical component should have redundancy. Threshold
cryptography for Oracles. Multiple Flight Recorders. Distributed
sensors.
PRINCIPLE 3: DEFENSE IN DEPTH FOR RECOVERY
Back up your backups. Verify your verifications. Test your tests.
Recovery procedures themselves can fail.
PRINCIPLE 4: SECURITY DURING RECOVERY
Recovery mode is an attractive attack window. Maintain
authentication, authorization, and audit during recovery.
PRINCIPLE 5: KNOWN-GOOD STATE
Recovery should restore to a known-good state, not just "last state."
Compromised state should not be restored.
Recovery Time Objectives (RTO):
+----------------------+----------+------------+------------+
| Component            | Level 1  | Level 2    | Level 3    |
+----------------------+----------+------------+------------+
| Single Oracle node   | 1 hour   | 15 minutes | 5 minutes  |
| Oracle mesh (quorum) | 4 hours  | 1 hour     | 15 minutes |
| Flight Recorder      | 24 hours | 4 hours    | 1 hour     |
| Federation gateway   | 4 hours  | 1 hour     | 15 minutes |
| Full zone            | 24 hours | 8 hours    | 2 hours    |
+----------------------+----------+------------+------------+
Recovery Point Objectives (RPO):
+--------------------+----------+------------+---------------+
| Data Type          | Level 1  | Level 2    | Level 3       |
+--------------------+----------+------------+---------------+
| Trust Score state  | 1 hour   | 15 minutes | 1 minute      |
| Trajectory records | 24 hours | 1 hour     | 15 minutes    |
| Flight Recorder    | 1 hour   | 15 minutes | 0 (real-time) |
| Configuration      | 24 hours | 4 hours    | 1 hour        |
+--------------------+----------+------------+---------------+
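The RTO table lends itself to a simple lookup when scoring a recovery drill. The following is a minimal sketch, not part of the specification: the `RTO_MINUTES` dictionary, its key names, and the `meets_rto` helper are hypothetical, and the values are transcribed from the RTO table in this section.

```python
# Illustrative only: RTO values (in minutes) transcribed from the RTO table.
# The dictionary layout, key names, and function name are assumptions.
RTO_MINUTES = {
    "single_oracle_node": {1: 60, 2: 15, 3: 5},
    "oracle_mesh": {1: 240, 2: 60, 3: 15},
    "flight_recorder": {1: 1440, 2: 240, 3: 60},
    "federation_gateway": {1: 240, 2: 60, 3: 15},
    "full_zone": {1: 1440, 2: 480, 3: 120},
}


def meets_rto(component: str, level: int, measured_minutes: float) -> bool:
    """Return True if a measured recovery time is within the objective."""
    return measured_minutes <= RTO_MINUTES[component][level]
```

A drill that restored the Oracle mesh in 45 minutes would pass at Level 2 (`meets_rto("oracle_mesh", 2, 45)`) but fail at Level 3.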
3. Failure Modes
3.1. Oracle Node Failure
Definition: Single Oracle node becomes unavailable while remaining
nodes continue operating.
3.1.1. Detection
- Heartbeat failure (3 consecutive missed, 30 seconds)
- Threshold signing failure (node doesn't contribute)
- Health check endpoint returns error or timeout
- Network unreachability
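The heartbeat criterion can be sketched as follows. This is an illustration under stated assumptions: the specification gives "3 consecutive missed, 30 seconds", which is read here as a 10-second heartbeat interval; the class and constant names are hypothetical.

```python
from dataclasses import dataclass

HEARTBEAT_INTERVAL_S = 10.0  # assumption: 3 missed beats x 10 s = 30 s
MISSED_LIMIT = 3             # from the spec: 3 consecutive missed heartbeats


@dataclass
class HeartbeatMonitor:
    """Tracks one Oracle node; timestamps are seconds on any monotonic clock."""

    last_seen: float

    def record(self, now: float) -> None:
        """Call whenever a heartbeat arrives from the node."""
        self.last_seen = now

    def missed(self, now: float) -> int:
        """Number of whole heartbeat intervals elapsed with no heartbeat."""
        return int((now - self.last_seen) // HEARTBEAT_INTERVAL_S)

    def is_failed(self, now: float) -> bool:
        """True once the configured number of heartbeats has been missed."""
        return self.missed(now) >= MISSED_LIMIT
```

In practice this signal would be combined with the other detection sources listed above before a node is isolated, to reduce false positives.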
3.1.2. Impact Assessment
Impact depends on threshold configuration:
- 5-of-7 threshold: 2 nodes can fail, still operational
- 3-of-5 threshold: 2 nodes can fail, still operational
- 2-of-3 threshold: 1 node can fail, still operational
Single node failure with adequate threshold:
- No Trust Proof issuance interruption
- Slight latency increase (fewer signing participants)
- Reduced fault tolerance margin
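The arithmetic behind these figures is that a k-of-n threshold tolerates n - k node failures. A minimal sketch (function names are illustrative, not part of KTP):

```python
def tolerable_failures(k: int, n: int) -> int:
    """Nodes that can fail while a k-of-n threshold can still sign."""
    if not 1 <= k <= n:
        raise ValueError("threshold must satisfy 1 <= k <= n")
    return n - k


def can_sign(k: int, healthy_nodes: int) -> bool:
    """Threshold signing works while at least k nodes are healthy."""
    return healthy_nodes >= k
```

For example, `tolerable_failures(3, 5)` is 2, so a 3-of-5 mesh with one node down still has a margin of one further failure.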
3.1.3. Automatic Response
1. DETECT: Monitoring detects node failure
2. ISOLATE: Remove node from signing rotation
3. ALERT: Notify operations team
4. CONTINUE: Remaining nodes continue operation
5. REDISTRIBUTE: Rebalance load across healthy nodes
3.1.4. Manual Recovery
1. DIAGNOSE: Determine failure cause
- Hardware failure -> Replace hardware
- Software crash -> Analyze, restart or redeploy
- Network issue -> Resolve network problem
- Compromise suspected -> Initiate incident response
2. RESTORE: Return node to operation
- Restart service if software issue
- Redeploy to new hardware if hardware issue
- Re-sync state from healthy nodes
3. VERIFY: Confirm proper operation
- Health check passes
- Participates in threshold signing
- State matches other nodes
4. REINTEGRATE: Add back to rotation
- Gradually increase traffic
- Monitor for issues
- Restore full participation
3.1.5. Key Share Considerations
Each Oracle node holds a threshold key share. If the node is compromised:
- DO NOT restore node with same key share
- Initiate key rotation procedure (KTP-CRYPTO Section 8.4)
- Generate new key shares for all nodes
- Old key enters grace period, then expires
If node failure is NOT compromise:
- Key share can be restored from backup
- Or regenerated from other shares (if supported)
- HSM attestation should verify integrity
3.2. Oracle Mesh Partition
Definition: Oracle mesh splits into disconnected groups, none of
which has quorum.
3.2.1. Detection
- Threshold signing fails (insufficient participants)
- Nodes report partial connectivity
- Network monitoring shows partition
3.2.2. Impact
CRITICAL: Trust Proof issuance stops zone-wide
- No new Trust Proofs can be issued
- Existing proofs continue until expiration
- PEPs enter degraded mode (cache or fail-closed)
- Operations requiring new proofs fail
3.2.3. Degradation Behavior
During partition, PEPs should:
1. CACHE MODE (short partition, <5 minutes)
- Continue honoring cached Trust Proofs
- Allow actions within cached proof validity
- Queue proof refresh requests
2. CONSERVATIVE MODE (medium partition, 5-30 minutes)
- Honor cached proofs for existing sessions
- Deny new sessions requiring fresh proofs
- Allow only low-risk actions
3. FAIL-CLOSED MODE (long partition, >30 minutes)
- Deny all actions requiring Trust Proofs
- Allow only pre-authorized emergency actions
- Alert administrators
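The three modes above can be sketched as a function of partition duration. This is an illustration only: the enum, the function names, and the boolean inputs are hypothetical, and treating exactly 30 minutes as still conservative is an assumption, since the specification gives the ranges as "5-30" and ">30".

```python
from enum import Enum


class PepMode(Enum):
    CACHE = "cache"
    CONSERVATIVE = "conservative"
    FAIL_CLOSED = "fail_closed"


def pep_mode(partition_minutes: float) -> PepMode:
    """Map partition duration to the degradation mode from Section 3.2.3."""
    if partition_minutes < 5:
        return PepMode.CACHE
    if partition_minutes <= 30:  # assumption: exactly 30 min is still conservative
        return PepMode.CONSERVATIVE
    return PepMode.FAIL_CLOSED


def allow_action(mode: PepMode, *, proof_valid: bool, new_session: bool,
                 low_risk: bool, emergency_preauthorized: bool) -> bool:
    """Decide one request; every flag is a hypothetical input supplied by the PEP."""
    if mode is PepMode.CACHE:
        # Honor cached proofs while they remain valid.
        return proof_valid
    if mode is PepMode.CONSERVATIVE:
        # Existing sessions only, and only low-risk actions.
        return proof_valid and not new_session and low_risk
    # Fail-closed: only pre-authorized emergency actions pass.
    return emergency_preauthorized
```

Note that the default outcome in every branch is denial unless a condition is positively met, which keeps the sketch aligned with Principle 1 (fail closed).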
3.2.4. Resolution
1. IDENTIFY partition cause
- Network failure between sites
- BGP issue, firewall misconfiguration
- DDoS attack on network links
2. RESOLVE network issue
- Work with network team
- Activate backup network paths
- Engage ISP if external
3. RECONCILE state after healing
- Nodes compare state
- Resolve any conflicts (rare, but possible)
- Resume threshold signing
3.2.5. Prevention
- Geographic diversity with network redundancy
- Multiple network paths between Oracle sites
- Monitoring of network health
- Regular partition testing
3.3. Total Oracle Loss
Definition: All Oracle nodes fail or are unreachable. No threshold
signing is possible.
This is the most severe failure mode.
3.3.1. Detection
- All health checks fail
- No nodes respond to signing requests
- Monitoring shows all nodes down
3.3.2. Impact
CRITICAL: Zone is effectively offline for new trust decisions
- No new Trust Proofs
- Existing proofs expire within seconds
- All enforced actions eventually blocked
- Zone enters emergency mode
3.3.3. Emergency Mode Operation
When total Oracle loss detected:
1. PEPs switch to EMERGENCY MODE
- Pre-configured emergency policy takes effect
- Only explicitly allowed actions permitted
- All other actions denied
2. Emergency policy should define:
- Life-safety actions always allowed
- Critical infrastructure maintenance allowed
- All other actions denied
- Aggressive alerting
3. Human intervention REQUIRED
- Automated recovery cannot restore from total loss
- Key ceremony may be required
- Business continuity procedures activated
3.3.4. Recovery from Total Loss
SCENARIO A: Infrastructure failure (all nodes down but intact)
1. Restore infrastructure (network, power, hardware)
2. Restart Oracle nodes
3. Nodes recover state from local storage
4. Verify threshold signing works
5. Exit emergency mode
SCENARIO B: Data loss (nodes destroyed)
1. Provision new Oracle infrastructure
2. Restore from backup:
- Configuration from backup
- State from backup
- Key shares from secure backup (if available)
3. If key shares not recoverable:
- Initiate key recovery ceremony
- Recover from trustee-held shares
4. Verify and reintegrate
SCENARIO C: Suspected compromise (all nodes potentially hostile)
1. DO NOT restore from current backups (may be compromised)
2. Activate incident response
3. Rebuild from known-good golden images
4. Key ceremony to generate new keys
5. Re-enroll all agents (may be required)
6. Forensic investigation of compromise
3.4. Flight Recorder Failure
Definition: Flight Recorder becomes unavailable or corrupted.
3.4.1. Detection
- Write failures from components
- Read failures for queries
- Integrity check failures
- Storage exhaustion
3.4.2. Impact
MEDIUM-HIGH: Audit trail may have gaps
- New decisions may not be recorded
- Historical queries fail
- Compliance impact
- Forensic capability degraded
Note: Flight Recorder failure does NOT stop Trust Proof issuance.
Operations can continue, but without audit.
3.4.3. Degradation Behavior
1. LOCAL BUFFERING
- Components buffer audit records locally
- Typical buffer: 1 hour of records
- Buffer overflow: oldest records dropped (with count)
2. SECONDARY FLIGHT RECORDER
- If configured, switch to secondary
- Primary failure logged
3. ALERTING
- Immediate alert on Flight Recorder failure
- Escalating alerts as buffer fills
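The local buffering behavior (drop the oldest record on overflow and keep a count) can be sketched as below. The class name and its capacity parameter are hypothetical; the specification sizes the buffer by time ("1 hour of records"), whereas this sketch uses a record count for simplicity.

```python
from collections import deque


class AuditBuffer:
    """Bounded local buffer that drops the oldest record and counts drops."""

    def __init__(self, capacity: int) -> None:
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.dropped = 0  # feeds the lost-record estimate in a later gap record
        self._records = deque()

    def append(self, record) -> None:
        if len(self._records) >= self.capacity:
            self._records.popleft()  # oldest record is lost
            self.dropped += 1
        self._records.append(record)

    def flush(self) -> list:
        """Return buffered records in order and empty the buffer."""
        records = list(self._records)
        self._records.clear()
        return records
```

The `dropped` counter is deliberately not reset by `flush()`, so the caller can report it when the Flight Recorder returns.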
3.4.4. Recovery
1. Restore Flight Recorder service
2. Flush buffered records from components
3. Verify chain integrity
- Chain will have gap marker if records lost
- Gap is itself recorded and auditable
4. Investigate cause of failure
3.4.5. Chain Gap Handling
If records were lost during failure:
{
"record_type": "chain_gap",
"gap_start": "2025-11-25T10:00:00Z",
"gap_end": "2025-11-25T10:15:00Z",
"estimated_records_lost": 1500,
"cause": "flight_recorder_failure",
"recovery_action": "restored_from_backup",
"signature": "..."
}
Gap records are signed by Oracle and become part of permanent audit
trail.
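A gap record with the fields shown above could be assembled as in the following sketch. The helper names are hypothetical, and the canonical serialization (sorted keys, compact separators) is an assumption; the specification does not state how the Oracle serializes a record before signing. Signing itself is left out because it is performed by the Oracle.

```python
import json
from datetime import datetime, timezone


def _utc_iso(ts: datetime) -> str:
    """Format a timezone-aware datetime like the example record."""
    return ts.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def build_gap_record(gap_start: datetime, gap_end: datetime, records_lost: int,
                     cause: str, recovery_action: str) -> dict:
    """Build an unsigned chain_gap record; the Oracle adds "signature" later."""
    if gap_end < gap_start:
        raise ValueError("gap_end precedes gap_start")
    return {
        "record_type": "chain_gap",
        "gap_start": _utc_iso(gap_start),
        "gap_end": _utc_iso(gap_end),
        "estimated_records_lost": records_lost,
        "cause": cause,
        "recovery_action": recovery_action,
    }


def canonical_bytes(record: dict) -> bytes:
    """Deterministic bytes to sign (assumed format), excluding any signature."""
    unsigned = {k: v for k, v in record.items() if k != "signature"}
    return json.dumps(unsigned, sort_keys=True, separators=(",", ":")).encode("utf-8")
```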
3.5. Sensor Aggregator Failure
Definition: Sensor aggregator becomes unavailable.
3.5.1. Impact
MEDIUM: Context Tensor updates degraded
- Sensors cannot deliver readings
- Context Tensor uses stale data
- Risk Factor calculation affected
3.5.2. Degradation Behavior
1. SENSOR BUFFERING
- Sensors buffer readings locally
- Typical buffer: 5 minutes
2. STALE DATA MARKING
- Oracle marks Context Tensor dimensions as stale
- Stale dimensions may increase Risk Factor
- Or use conservative defaults
3. FAILOVER
- Sensors switch to backup aggregator (if configured)
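Stale-data marking with conservative defaults might look like the sketch below. Everything here is an assumption made for illustration: the dimension names, the 300-second staleness limit, and the representation of the Context Tensor as a flat dictionary are not defined by this specification.

```python
def apply_stale_policy(readings: dict, last_update: dict, now: float,
                       conservative_defaults: dict, max_age_s: float = 300.0):
    """Return (effective_readings, stale_names).

    A dimension is stale when its last update is older than max_age_s
    (hypothetical limit). Stale dimensions are replaced by the caller's
    conservative defaults rather than trusted at their old values.
    """
    stale = {name for name, ts in last_update.items() if now - ts > max_age_s}
    effective = {
        name: (conservative_defaults[name] if name in stale else value)
        for name, value in readings.items()
    }
    return effective, stale
```

Returning the set of stale names alongside the adjusted readings lets the Oracle record which dimensions were substituted.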
3.5.3. Recovery
1. Restore aggregator service
2. Sensors flush buffered readings
3. Context Tensor returns to real-time
4. Trust Scores normalize
3.6. Federation Gateway Failure
Definition: Federation gateway becomes unavailable.
3.6.1. Impact
MEDIUM: Cross-zone trust affected
- Cannot accept foreign Trust Proofs
- Cannot issue cross-zone attestations
- Federation heartbeat fails
- Partner zones see this zone as degraded
3.6.2. Degradation Behavior
1. CACHED FOREIGN PROOFS
- Honor cached foreign proofs until expiration
- No new foreign proofs accepted
2. LOCAL OPERATION
- Zone operates independently
- Local agents unaffected
- Cross-zone agents cannot operate
3. PARTNER NOTIFICATION
- Heartbeat failure notifies partners
- Partners reduce trust factor for this zone
3.6.3. Recovery
1. Restore federation gateway
2. Re-establish federation connections
3. Resume heartbeat
4. Trust factor gradually recovers
3.7. Zone-Wide Failure
Definition: Entire zone becomes unavailable (natural disaster,
massive infrastructure failure, coordinated attack).
3.7.1. Impact
CRITICAL: All zone operations cease
- All agents in zone cannot operate
- All protected resources inaccessible
- Federation partners lose connection
3.7.2. Recovery Options
OPTION A: Restore in place
- Rebuild infrastructure at same location
- Restore from backups
- Resume operations
OPTION B: Failover to DR site
- Activate disaster recovery site
- Restore from replicated data
- Update DNS/routing to DR site
- Resume operations at DR
OPTION C: Migrate to partner zone (temporary)
- Federated partner accepts displaced agents
- Agents operate with reduced trust (foreign zone penalty)
- Temporary until primary zone restored
3.7.3. DR Site Requirements
For Level 3 deployments, DR site MUST:
- Be geographically separate (>100 miles recommended)
- Have independent power and network
- Maintain synchronized state (RPO per Section 2)
- Have HSMs with key shares
- Be tested quarterly
4. Backup Procedures
4.1. What to Back Up
4.1.1. Critical (MUST back up)
ORACLE SIGNING KEY SHARES
- Threshold key shares for Trust Proof signing
- Backup encrypted to recovery keys
- Stored with trustees (not with Oracle)
- Recovery requires M-of-N trustees
ZONE CONFIGURATION
- Security policies, thresholds, weights
- Soul constraints
- Trust Tier boundaries
- Federation agreements
FLIGHT RECORDER DATA
- All audit records
- Chain hashes
- External anchor references
4.1.2. Important (SHOULD back up)
AGENT REGISTRY
- Registered agents
- Public keys
- Lineage information
- Current E_base (can be recalculated)
TRAJECTORY CHAINS
- Agent transaction history
- Proof of Resilience records
- (Large; may use differential backups)
SENSOR CONFIGURATION
- Sensor registrations
- Aggregator mappings
- Baseline calibrations
4.1.3. Reconstructible (MAY back up)
TRUST SCORES
- Current E_trust values
- Can be recalculated from E_base and Context Tensor
CONTEXT TENSOR
- Current sensor readings
- Will be refreshed by live sensors
CACHED TRUST PROOFS
- Short-lived, will be re-issued
4.2. Backup Frequency
+-------------------+-------------------+----------------------+
| Data Type         | Backup Frequency  | Retention            |
+-------------------+-------------------+----------------------+
| Key shares        | On change only    | Forever              |
| Configuration     | On change + daily | 1 year               |
| Flight Recorder   | Continuous/hourly | Per policy (7 years) |
| Agent Registry    | Daily             | 90 days              |
| Trajectory Chains | Daily incremental | 1 year full, 7 incr  |
| Sensor Config     | Daily             | 90 days              |
+-------------------+-------------------+----------------------+
4.3. Backup Security
ENCRYPTION
- All backups encrypted at rest
- Encryption key separate from backed-up keys
- Key shares backed up to different location than data
ACCESS CONTROL
- Backup access requires authentication
- Restore requires multi-person authorization (Level 2+)
- All access logged
INTEGRITY
- Backups include cryptographic checksums
- Verify integrity before restore
- Detect tampering
ISOLATION
- Backups stored separately from production
- Air-gapped for highest security (Level 3)
- Ransomware cannot reach backups
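The integrity check before restore can be sketched with a streamed SHA-256 digest. This is a minimal illustration, not the KTP backup format: the function names are hypothetical, and the expected digest is assumed to come from a trusted source (for example a signed manifest stored apart from the backup), since a checksum stored next to the data would not detect deliberate tampering.

```python
import hashlib
import hmac


def sha256_stream(stream, chunk_size: int = 1024 * 1024) -> str:
    """Hash a binary file-like object in chunks so large backups fit in memory."""
    digest = hashlib.sha256()
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        digest.update(chunk)
    return digest.hexdigest()


def verify_backup(stream, expected_hex: str) -> bool:
    """Compare the computed digest with the trusted one in constant time."""
    return hmac.compare_digest(sha256_stream(stream), expected_hex.lower())
```

A restore procedure would call `verify_backup(open(path, "rb"), expected)` and abort if it returns `False`.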
4.4. Backup Verification
VERIFICATION SCHEDULE
- Automated integrity check: Daily
- Restore test to isolated environment: Monthly
- Full DR test: Annually (Level 1), Quarterly (Level 2), Monthly (Level 3)
VERIFICATION PROCEDURES
1. Verify backup file integrity (checksums)
2. Verify backup completeness (all expected files)
3. Restore to isolated environment
4. Verify restored system functions
5. Verify restored data matches source
6. Document results
5. Restoration Procedures
5.1. Oracle Restoration
5.1.1. Single Node Restoration
Prerequisites:
- Healthy Oracle mesh (other nodes operational)
- Backup of node configuration
- Key share (from backup or ceremony)
Procedure:
1. PROVISION infrastructure
- Deploy new VM/hardware
- Install Oracle software
- Configure network
2. RESTORE configuration
- Apply configuration from backup
- Verify configuration matches zone
3. RESTORE key share
- If HSM available: Import key share
- If HSM destroyed: Recover from trustee backup
4. SYNC state
- Connect to healthy nodes
- Sync current Trust Score state
- Sync recent trajectory updates
5. VERIFY operation
- Run health checks
- Participate in test signing
- Verify state matches other nodes
6. REINTEGRATE
- Add to load balancer
- Enable full traffic
5.1.2. Full Mesh Restoration
Prerequisites:
- New infrastructure for all nodes
- Configuration backups
- Key shares (from trustees or ceremony)
- State backups
Procedure:
1. PROVISION all infrastructure
- Deploy all Oracle nodes
- Configure network between nodes
2. RESTORE configuration to all nodes
- Same configuration on all nodes
3. KEY RECOVERY CEREMONY
- Convene required trustees
- Recover threshold key shares
- Install shares in HSMs
4. RESTORE state
- Restore Agent Registry from backup
- Restore trajectory data from backup
- Restore last known Trust Scores
5. VERIFY threshold signing
- Attempt to sign test message
- Verify k-of-n nodes can sign
6. RESUME operations
- Enable PEP connections
- Issue Trust Proofs
- Monitor closely
5.2. Key Recovery
5.2.1. Key Recovery Prerequisites
Key recovery requires:
- M-of-N trustees (typically 3-of-5 or 5-of-7)
- Trustees have recovery key shares
- Secure environment for ceremony
- New HSMs to receive recovered keys
Trustees should be:
- Geographically distributed
- Organizationally independent
- Personally reliable
- Reachable in emergency
5.2.2. Key Recovery Ceremony
PHASE 1: CONVENE
1. Incident commander declares key recovery needed
2. Contact trustees (secure channel)
3. Schedule ceremony (virtual or in-person)
4. Prepare secure environment
PHASE 2: AUTHENTICATE
1. Verify trustee identity (multi-factor)
2. Verify ceremony authorization
3. Record ceremony (video + transcript)
4. Witnesses present (if required)
PHASE 3: RECOVER
1. Each trustee decrypts their recovery share
2. Shares combined in secure environment
3. Master key material reconstructed
4. New threshold shares generated
5. Shares installed in HSMs
6. Master key material destroyed (never stored)
PHASE 4: VERIFY
1. Test threshold signing
2. Verify all nodes can participate
3. Issue test Trust Proof
4. Verify signature validates
PHASE 5: DOCUMENT
1. Record ceremony completion
2. Update key inventory
3. Notify stakeholders
4. Archive ceremony recording (secure)
5.2.3. Key Recovery Security
- Recovery environment: Air-gapped if possible
- No photography except official recording
- All attendees logged
- Ceremony room swept for bugs (Level 3)
- Shares transmitted encrypted, never in clear
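The M-of-N recovery in Phase 3 can be illustrated with Shamir's secret sharing. This specification does not name the sharing scheme, so the choice of Shamir, the prime field, and the function names below are assumptions for illustration only. A real ceremony would run inside HSMs or audited tooling, never ad hoc code.

```python
import secrets

# Demo field: the Mersenne prime 2**127 - 1. The secret must be smaller.
PRIME = 2**127 - 1


def split_secret(secret: int, m: int, n: int) -> list[tuple[int, int]]:
    """Split secret into n shares; any m of them reconstruct it."""
    if not 1 <= m <= n:
        raise ValueError("need 1 <= m <= n")
    if not 0 <= secret < PRIME:
        raise ValueError("secret out of range for the demo field")
    # Random polynomial of degree m-1 whose constant term is the secret.
    coeffs = [secret] + [secrets.randbelow(PRIME) for _ in range(m - 1)]
    shares = []
    for x in range(1, n + 1):
        y = 0
        for c in reversed(coeffs):  # Horner evaluation at x
            y = (y * x + c) % PRIME
        shares.append((x, y))
    return shares


def recover_secret(shares: list[tuple[int, int]]) -> int:
    """Lagrange interpolation at x = 0 over the prime field."""
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = (num * -xj) % PRIME
                den = (den * (xi - xj)) % PRIME
        secret = (secret + yi * num * pow(den, -1, PRIME)) % PRIME
    return secret
```

With `shares = split_secret(key, 3, 5)`, any three trustees' shares passed to `recover_secret` yield `key`. The reconstructed value corresponds to the master key material that the ceremony destroys after new threshold shares are generated.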
5.3. State Reconstruction
If state backups are unavailable or suspect:
5.3.1. Agent Registry Reconstruction
If backup unavailable:
1. Agents must re-register
2. Identity proofing re-verified
3. Sponsor relationships re-established
4. E_base starts from initial value (not historical)
Impact: Agents lose accumulated trust. Significant operational
disruption. Use backup restoration if at all possible.
5.3.2. Trust Score Reconstruction
Trust Scores can be recalculated if:
- E_base known (from trajectory or backup)
- Context Tensor available (from sensors)
E_trust = E_base x Context_modifier x Risk_factor
If E_base unknown, must use default for lineage type.
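The recalculation can be sketched directly from the formula above. The function name and the `lineage_default` parameter are hypothetical; the actual default values per lineage type are not given in this document and must be supplied by the caller.

```python
from typing import Optional


def reconstruct_e_trust(e_base: Optional[float], context_modifier: float,
                        risk_factor: float, lineage_default: float) -> float:
    """E_trust = E_base x Context_modifier x Risk_factor.

    If E_base is unknown (None), fall back to the default for the agent's
    lineage type, which the caller supplies.
    """
    base = lineage_default if e_base is None else e_base
    return base * context_modifier * risk_factor
```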
5.3.3. Trajectory Reconstruction
Trajectory chains are cryptographically linked. If chain is broken:
1. Recover as much chain as possible from backup
2. Mark gap in chain
3. Continue new records after gap
4. Agent's E_base calculation notes gap
Gaps in trajectory reduce trust (unverifiable history).
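Locating where a chain is broken can be sketched with a simple hash chain. The record layout below (`prev_hash`, `payload`, `hash`), the genesis value, and the use of SHA-256 are assumptions for illustration; the actual trajectory record format is defined elsewhere in KTP.

```python
import hashlib

GENESIS = "0" * 64  # assumed placeholder for the first record's predecessor


def record_hash(prev_hash: str, payload: str) -> str:
    """Hash of a record, binding it to its predecessor."""
    return hashlib.sha256((prev_hash + payload).encode("utf-8")).hexdigest()


def append_record(chain: list, payload: str) -> None:
    """Append a new record linked to the current chain tip."""
    prev = chain[-1]["hash"] if chain else GENESIS
    chain.append({"prev_hash": prev, "payload": payload,
                  "hash": record_hash(prev, payload)})


def first_break(chain: list):
    """Index of the first record that fails verification, or None if intact."""
    prev = GENESIS
    for i, rec in enumerate(chain):
        linked = rec["prev_hash"] == prev
        intact = rec["hash"] == record_hash(rec["prev_hash"], rec["payload"])
        if not (linked and intact):
            return i
        prev = rec["hash"]
    return None
```

Records before the returned index can be kept as recovered history; the gap is marked at that position and new records continue after it.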
5.4. Trajectory Recovery
Trajectory data may be large. Recovery strategies:
FULL RESTORE
- Restore complete trajectory from backup
- Slowest but most complete
- Use for complete zone recovery
POINT-IN-TIME RESTORE
- Restore trajectory to specific point
- Useful if recent data corrupted
- Records after restore point lost
DIFFERENTIAL RESTORE
- Restore base + incremental backups
- Faster than full restore
- Standard approach for routine recovery
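Selecting which backups to apply for a base-plus-incremental restore can be sketched as follows. The tuple representation `(timestamp, kind)` and the function name are hypothetical; the same selection also serves a point-in-time restore by choosing an earlier target.

```python
def restore_plan(backups, target):
    """Return the backups to apply, in order, to reach `target`.

    backups: iterable of (timestamp, kind) with kind "full" or "incremental".
    Picks the latest full backup at or before target, then every later
    incremental up to and including target.
    """
    eligible = sorted(b for b in backups if b[0] <= target)
    fulls = [b for b in eligible if b[1] == "full"]
    if not fulls:
        raise LookupError("no full backup at or before the target time")
    base = fulls[-1]
    increments = [b for b in eligible
                  if b[1] == "incremental" and b[0] > base[0]]
    return [base] + increments
```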
6. Graceful Degradation
6.1. Degradation Levels
KTP defines five degradation levels:
LEVEL 0: NORMAL
- All components operational
- Full functionality
- Standard Trust Score calculation
LEVEL 1: DEGRADED
- Some redundancy lost
- Full functionality maintained
- Increased monitoring
- Example: 1 Oracle node down (but quorum intact)
LEVEL 2: IMPAIRED
- Some functionality reduced
- Core security maintained
- Non-critical features disabled
- Example: Sensor aggregator down (stale Context Tensor)
LEVEL 3: EMERGENCY
- Critical functionality only
- Fail-closed for non-essential
- Human intervention required
- Example: Oracle mesh partitioned
LEVEL 4: OFFLINE
- Zone non-operational
- All requests denied
- Recovery in progress
- Example: Total Oracle loss
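The levels and their examples can be sketched as an enumeration with a classifier. The classifier inputs and its precedence order are assumptions derived only from the examples listed for each level; a real deployment would consider more signals.

```python
from enum import IntEnum


class DegradationLevel(IntEnum):
    NORMAL = 0
    DEGRADED = 1
    IMPAIRED = 2
    EMERGENCY = 3
    OFFLINE = 4


def classify(reachable_oracles: int, total_oracles: int, threshold: int,
             aggregator_up: bool = True) -> DegradationLevel:
    """Pick the most severe level that applies (hypothetical rule set)."""
    if reachable_oracles == 0:
        return DegradationLevel.OFFLINE    # total Oracle loss
    if reachable_oracles < threshold:
        return DegradationLevel.EMERGENCY  # no quorum, e.g. mesh partitioned
    if not aggregator_up:
        return DegradationLevel.IMPAIRED   # stale Context Tensor
    if reachable_oracles < total_oracles:
        return DegradationLevel.DEGRADED   # redundancy lost, quorum intact
    return DegradationLevel.NORMAL
```

Because `IntEnum` values compare as integers, callers can write checks such as `level >= DegradationLevel.EMERGENCY` to gate human-intervention alerts.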
6.2. Automatic Failover
6.2.1. Oracle Failover
- Automatic within threshold (node failure)
- Automatic leader election if needed
- No manual intervention while at least k of n nodes remain healthy
6.2.2. Flight Recorder Failover
- Automatic switch to secondary (if configured)
- Local buffering during transition
- Alert on primary failure
6.2.3. Sensor Aggregator Failover
- Sensors automatically retry backup aggregator
- Oracle uses stale data with marking
- Automatic recovery when aggregator returns
6.2.4. Federation Failover
- Automatic failover to backup gateway
- Partners notified via heartbeat
- Automatic reconnection on recovery
6.3. Manual Intervention
Some situations require manual intervention:
- Total Oracle loss
- Suspected compromise
- Key recovery
- DR site activation
- Configuration rollback
Manual procedures are documented in Appendix A.
7. Testing and Validation
7.1. Recovery Testing Requirements
+----------------------+-----------+-------------+-----------+
| Test Type            | Level 1   | Level 2     | Level 3   |
+----------------------+-----------+-------------+-----------+
| Backup verification  | Monthly   | Weekly      | Daily     |
| Single node recovery | Quarterly | Monthly     | Monthly   |
| Mesh partition test  | Annually  | Quarterly   | Monthly   |
| Full DR test         | Annually  | Quarterly   | Monthly   |
| Key recovery drill   | Annually  | Semi-annual | Quarterly |
+----------------------+-----------+-------------+-----------+
7.2. Test Procedures
SINGLE NODE RECOVERY TEST
1. Take one Oracle node offline (planned)
2. Verify mesh continues operating
3. Restore node from backup
4. Verify node rejoins mesh
5. Document time and issues
PARTITION TEST
1. Simulate network partition (firewall rules)
2. Verify PEPs enter degraded mode
3. Verify no Trust Proofs issued during partition
4. Heal partition
5. Verify normal operation resumes
FULL DR TEST
1. Activate DR site
2. Restore all components from backup
3. Verify full functionality
4. Measure RTO and RPO achieved
5. Fail back to primary
8. Security Considerations
8.1. Recovery as Attack Vector
Recovery procedures can be exploited:
- Attacker triggers failure to invoke recovery
- Attacker substitutes malicious backup
- Attacker compromises recovery process
- Attacker uses recovery mode to bypass controls
Mitigations:
- Authenticate all recovery actions
- Verify backup integrity before restore
- Maintain audit during recovery
- Time-limit recovery mode
8.2. Backup Security
Backups are high-value targets:
- Contain all secrets (encrypted, but still sensitive)
- May reveal system architecture
- May enable offline attacks
Mitigations:
- Encrypt all backups
- Separate backup encryption keys
- Access control on backups
- Monitor backup access
8.3. Key Recovery Security
Key recovery is highest-risk operation:
- Reconstitutes master key material
- Single point where key exists in full
- Attractive target for advanced attackers
Mitigations:
- Ceremony with witnesses
- Secure environment
- Immediate destruction after use
- M-of-N trustee requirement
9. References
[KTP-CRYPTO]
Perkins, C., "Kinetic Trust Protocol - Cryptographic
Specification", NMCITRA, November 2025.
[KTP-AUDIT]
Perkins, C., "Kinetic Trust Protocol - Audit
Specification", NMCITRA, November 2025.
[NIST-CP] National Institute of Standards and Technology,
"Contingency Planning Guide for Federal Information
Systems", SP 800-34 Rev. 1, May 2010.
Appendix A. Recovery Runbooks
A.1. Runbook: Single Oracle Node Recovery
TRIGGER: Oracle node health check fails for 5 minutes
STEPS:
1. Verify failure (not monitoring false positive)
$ ktp-cli oracle status --node oracle-1
2. Check if quorum maintained
$ ktp-cli oracle mesh-status
Expected: "Mesh operational, X of Y nodes healthy"
3. Attempt restart
$ systemctl restart ktp-oracle
Wait 60 seconds, check status
4. If restart fails, check logs
$ journalctl -u ktp-oracle -n 100
5. If hardware issue, provision new node
$ terraform apply -target=oracle-node-1
6. Restore configuration
$ ktp-backup restore --target oracle-1 --type config
7. Restore key share (requires HSM access)
$ ktp-hsm import-share --node oracle-1
8. Verify and reintegrate
$ ktp-cli oracle join-mesh --node oracle-1
$ ktp-cli oracle verify-signing --node oracle-1
ESCALATION: If not resolved in 30 minutes, page on-call lead
A.2. Runbook: Oracle Mesh Partition
TRIGGER: Trust Proof issuance fails, partition detected
STEPS:
1. Verify partition
$ ktp-cli oracle mesh-status
Expected: "Mesh partitioned, no quorum"
2. Identify partition boundaries
$ ktp-cli oracle connectivity-matrix
3. Check network connectivity
$ for node in oracle-{1..5}; do
    ping -c 1 "$node" >/dev/null && echo "$node reachable" || echo "$node UNREACHABLE"
  done
4. If network issue, engage network team
- Check firewall rules
- Check BGP status
- Check physical connectivity
5. If single site isolated, verify other sites operational
$ ktp-cli oracle site-status
6. When partition heals, verify mesh reforms
$ ktp-cli oracle mesh-status
Expected: "Mesh operational"
7. Check for state conflicts
$ ktp-cli oracle state-consistency-check
8. Resume normal operations
$ ktp-cli zone set-degradation-level 0
ESCALATION: Immediate page to incident commander
A.3. Runbook: Total Oracle Loss Recovery
TRIGGER: All Oracle nodes unreachable
STEPS:
1. Declare incident
- Page incident commander
- Activate incident response team
- Begin incident timeline
2. Assess situation
- Infrastructure failure vs. attack
- Data center status
- Network status
3. If infrastructure failure:
a. Restore infrastructure
b. Restore from backup (see A.4)
4. If suspected attack:
a. DO NOT restore from recent backups
b. Engage security team
c. Preserve evidence
d. Rebuild from golden images
e. Key recovery ceremony required
5. Activate DR site (if available)
$ ktp-dr activate --site dr-west
6. Update DNS/routing to DR
$ ktp-dns failover --to dr-west
7. Verify DR operational
$ ktp-cli oracle mesh-status --site dr-west
8. Communicate status to stakeholders
ESCALATION: This IS the escalation
A.4. Runbook: Restore from Backup
TRIGGER: Recovery requires backup restoration
STEPS:
1. Identify backup to restore
$ ktp-backup list --type full
Select most recent known-good backup
2. Verify backup integrity
$ ktp-backup verify --backup-id <id>
MUST pass before proceeding
3. Provision target infrastructure
$ terraform apply
4. Restore configuration
$ ktp-backup restore --backup-id <id> --type config
5. Restore state data
$ ktp-backup restore --backup-id <id> --type state
6. Restore trajectories (may take time)
$ ktp-backup restore --backup-id <id> --type trajectory
7. Key recovery (if needed)
- Contact trustees
- Schedule ceremony
- Execute per Section 5.2.2
8. Verify system operational
$ ktp-cli oracle mesh-status
$ ktp-cli oracle test-sign
9. Gradually restore traffic
$ ktp-cli zone set-degradation-level 1
Monitor for 15 minutes
$ ktp-cli zone set-degradation-level 0
A.5. Runbook: Key Recovery Ceremony
TRIGGER: Key shares unrecoverable from backup
PREPARATION:
1. Incident commander authorizes ceremony
2. Contact trustees (at least M of N must attend)
3. Schedule ceremony time (within RTO)
4. Prepare secure environment:
- Air-gapped machine
- New HSMs
- Recording equipment
- Witness(es)
CEREMONY:
1. Verify all participants' identity
2. Begin recording
3. Read ceremony authorization into record
4. Each trustee:
a. Connects to ceremony machine (isolated)
b. Decrypts their recovery share
c. Inputs share to recovery software
d. Disconnects
5. Recovery software reconstructs master key
6. Generate new threshold shares
7. Install shares in HSMs
8. Test threshold signing
9. Securely destroy master key material
10. End recording
POST-CEREMONY:
1. Verify all Oracle nodes operational
2. Generate new trustee recovery shares
3. Distribute to trustees
4. Archive ceremony recording
5. Update key inventory
Authors' Addresses
Chris Perkins
New Mexico Cyber Intelligence & Threat Response Alliance (NMCITRA)
Email: cperkins@nmcitra.org