Live migration + cross-platform portability · v0.1.0-alpha

Live-migrate running pods.
Move them across platforms entirely.

One checkpoint/restore engine, two goals: live-migrate a running pod between nodes with its TCP connections intact, and move containers between hosting platforms — AWS ECS ↔ EC2/Docker, EKS ↔ EC2 — or restore from a full backup. Live migration has real proof today (a held connection survived byte-exact, seq_delta=0, at Sprint 46); cross-platform portability and backup/restore are architected, not yet built. v0.1.0-alpha, CRD-only.

Get started · See the proof · Star on GitHub →

kubectl apply

$ kubectl apply -f podmigration.yaml
# kind: PodMigration · tcpPreservationMode: BestEffort (default None)
✓ Migration complete · ~13 s (cold-path proof run)

Kubernetes Pods Made Immortal®

Live migration + cross-platform portability, one engine

seq_delta=0 — first proven at Sprint 46 (single held connection); full multi-flow bar ahead

What is proven, on what substrate, and how far the scope goes. Each row states its own CNI, workload, migration mode, and the exact conditions of the run.

Across proof runs so far, we’ve achieved live migration on Postgres, Redis, and Nginx workloads in some cases — not yet consistently across all cases — and we’re working toward a long-term goal of eight-nines (99.999999%) durability for live transfers. See the matrix below for exactly what’s proven where.

A live TCP connection survived a migration byte-exact — 1/1 byte-exact, seq_delta 0/0 — first proven Sprint 46

seq_delta=0 (held TCP): first proven Sprint 46, single connection · full multi-flow bar still ahead · 0 RST

PodMotion proof matrix: substrate/CNI, workload, migration mode, and status
Substrate / CNI	Workload	Mode	Status & evidence
Proven — current multi-node cluster
Multi-node cluster, worker node A → worker node B.	Redis: single held TCP connection under load	Strict	Proven byte-exact (Sprint 46). 1/1 connections verified byte-exact (seq_delta 0/0), freeze 602ms. A single documented run, not a statistical corpus; the broader multi-flow bar is still ahead. Source pod never deleted before TrafficVerified. See docs/proof/sprint46-proof.md.
Same multi-node cluster, worker node A → worker node B.	Postgres: continuous write load, ReadWriteMany re-attach	None	Proven (Sprint 46). Shared ReadWriteMany volume re-attached on the destination; row counter advanced 8 → 16 with 0 write errors. Exercises ReadWriteMany re-attach only; it does not exercise or imply live block-storage migration, which remains deferred to v1.2. See docs/proof/sprint46-proof.md.
AWS EC2, kubeadm cluster (3 nodes), Longhorn CSI on all nodes.	Postgres on a Longhorn-backed PVC: a written value read back post-migration	None	Proven in a live run (AWS EC2 substrate). Migration reached phase=Complete; a value written before the move was read back unchanged after. A single verified in-run result, not a committed proof doc; it implies no GA availability and doesn't extend to the Ceph RBD or snapshot-rsync tiers, which remain unexercised stubs. See docs/proof/milestone-aws-ec2-longhorn-postgres-migration-proof.md.
In supported scope — not yet proven
Cilium	Live migration	—	In supported CNI scope per ADR-0318 (Accepted 2026-07-10, superseding ADR-0082 §2 and ADR-0313) — but cutover-gated, not yet reachable: the live-migration reachability proof is gated behind the storage-cutover work. No live-migration proof yet.
Calico (Tigera Operator, alternative CNI candidate in the substrate)	—	—	Planned, not yet proven. Sprint 48 holds a compatibility plan only; no passing live-migration proof.
Historical single-host kind baselines
Single-host two-node kind cluster (worker → control-plane) — kindnet CNI (host-routed, no VXLAN overlay), amd64 / Ubuntu 24.04 / kernel 6.17	Redis	Cold (None)	Sprint 31, NP-AMD64-BASELINE · 2026-06-15. amd64 cold migration proven (tcpPreservationMode=None). 22/22 gates, freeze=199ms, source pod deleted post-TrafficVerified.
Single-host two-node kind cluster, kindnet CNI (host-routed, no VXLAN overlay), amd64 / kernel 6.17 / containerd 1.7.24 / Kubernetes v1.32.0	redis:7-alpine	Cold	Sprint 33 · 2026-06-16. End-to-end cold-path live migration, n=1, ~13s wall-clock, 22/22 gates. Phase=Complete, TrafficVerified=True reached before the source pod was deleted. Cold-path scope only: zero-TCP-sequence-delta (seq_delta) is NOT claimed here.

Apache-2.0 · CRIU 4.2 · KEP-2008 aligned · CNCF-track*
PodMotion is not a CNCF project. ‘CNCF-track’ describes our intended trajectory only; we are targeting CNCF sandbox submission, gated on ≥2 maintainers from ≥2 organizations. See GOVERNANCE.md.

* Not yet a CNCF project — CNCF sandbox submission planned (target Q3 2026 · M29).

Pre-1.0. Single-cluster. arm64 validated (earlier PoC, Sprint 30). amd64 cold migration proven (Sprint 31, NP-AMD64-BASELINE; Sprint 32, cold-path CRIU restore baseline). Strict-mode TCP preservation first proven at Sprint 46 — a single held connection survived migration byte-exact (seq_delta 0/0) on a real multi-node cluster (no CNI or architecture recorded in the committed proof); broader multi-flow and per-CNI validation is still ahead.

Security note: installing the node-agent grants node-root-equivalent privilege (privileged container + read-write containerd socket + SYS_ADMIN/SYS_PTRACE in the host PID namespace). This is a disclosed alpha trade-off required for CRIU checkpoint/restore — full posture and the ADR-0148 hardening plan in the security policy.

Bounded freeze, never a restart

PodMotion uses CRIU to checkpoint a running pod and restore it on another node, completing end-to-end in about 13 seconds in our cold-path proof run. The source pod is never deleted until the destination is verified serving (ADR-0012) — if anything fails, the migration rolls back and your pod keeps running where it was. This is real live migration with a verified traffic gate, not a stop-and-restart dressed up as zero downtime.
Proof substrate: amd64 / Linux 6.17 / containerd 1.7.24 / Kubernetes v1.32.0 / kindnet (host-routed, no VXLAN overlay).

How it works →

In-memory state restored intact across nodes

No application changes required. PodMotion moves the running pod and restores it on the destination node with in-memory state carried across the node boundary — the Sprint 34 regression run restored an in-memory counter (gcb7:counter=42) intact on a kind cluster (Kubernetes v1.32.0). PodMotion does not depend on preserving the original pod IP: the destination CNI may assign a different primary address, and the pod stays reachable at whatever address the CNI assigns.
Live TCP-connection survival with seq_delta=0 — containers and clients both unaware the pod moved — was first proven at Sprint 46, where a single held connection survived a migration byte-exact (seq_delta 0/0, Strict mode) on a real multi-node cluster. The full guarantee — many concurrent flows, multi-container, on declared CNIs — is not yet proven.
Proven end-to-end on a single declared substrate: amd64 / Linux 6.17 / kindnet (host-routed). Flannel VXLAN and cloud CNIs are on the roadmap, not proof.
TCP preservation is opt-in via tcpPreservationMode (default: None; BestEffort or Strict opt-in). Strict mode was first proven at Sprint 46 for a single held connection (seq_delta 0/0) on a real multi-node cluster; broader multi-flow validation is not yet proven. Source pod is never deleted before TrafficVerified (ADR-0012).

See the proof →

GitOps-safe

PodMotion never mutates your Deployment, StatefulSet, or ReplicaSet spec. It works below the controller layer using an admission webhook, a scheduler plugin, and an NRI restore hook, so Argo CD and Flux see no drift. In the narrow case where a GitOps controller would race the migration, PodMotion sets an annotation that pauses reconciliation for the freeze window only, then removes it: no manual step, no spec change (ADR-0046). Declare the migration in Git as a PodMigration resource; Argo CD or Flux applies it like any other manifest.

Architecture overview →

Roadmap — Not Available in v0.1.0-alpha

Volumes follow the pod

Storage migration (persistent volumes / PVCs) is deferred to v1.2 and is NOT available in v0.1.0-alpha. Pods with PVCs can still migrate if the volume is RWX and re-attachable on the destination node; the admission webhook blocks configurations that would not transfer safely. A four-tier strategy (Longhorn dual-engine, Ceph RBD rbd migrate, snapshot-clone-rsync, and RWX re-attach) is designed and scoped for v1.2.

One tier has moved past design in development: the Longhorn-backed path reached a verified in-run migration on our KVM cluster. A Postgres pod on a Longhorn PVC migrated to phase=Complete / TrafficVerified=True, with the destination independently confirmed accepting real writes at multiple checks roughly 45 seconds apart after restore. That is a single verified dev-cluster run, not a committed proof doc: it implies no GA or mainline availability, and it does not carry to the Ceph RBD or snapshot-rsync tiers, which remain unexercised stubs. Storage migration stays deferred to v1.2 and is not part of the shipped alpha.

NOT available in current alpha (deferred to v1.2). Longhorn tier verified in a development run only.

Storage tiers →

Move containers across platforms

PodMotion's checkpoint/restore engine is built toward two co-equal goals: live migration between Kubernetes nodes, and moving containers between different hosting platforms and back — for example AWS ECS and EC2/Docker, or EKS and EC2. Live migration has real proof today; cross-platform portability is architected but not yet implemented in v0.1.0-alpha.

See the roadmap →

Full pod backup, ready for restore

The same checkpoint/restore engine also targets full pod backup — a checkpoint you can hold onto and restore later, not just move immediately to another node. This is not yet available in v0.1.0-alpha.

See the roadmap →

Platform / SRE

What is actually proven today?

On a real 3-node KVM cluster running Cilium (v1.17.6, cluster-pool IPAM): a live-held Redis TCP connection survived a node-to-node migration byte-exact (seq_delta 0/0, Strict mode — one held connection; multi-flow not yet proven), and, in a separate workload, a Postgres pod under continuous write load re-attached to its shared ReadWriteMany volume on the destination node and migrated cleanly in None mode (row counter advancing 8 to 16, zero post-migration write errors). Earlier, on a two-node kind cluster (Kubernetes v1.32.0), a full end-to-end migration completed in about 13 seconds with in-memory state intact and 22/22 assertions passing.

See the proof →

App developer

Will my app break during a migration?

You change nothing in your app. PodMotion does not rely on keeping the original pod IP — the destination CNI may assign a different primary address, and the pod stays reachable at whatever address it assigns. Whether a live connection survives the move is a separate question: connection survival was first proven at Sprint 46 for a single held connection (seq_delta 0/0, Strict mode) on a real multi-node Cilium cluster, with the multi-flow guarantee not yet proven.

How it works →

DevOps / GitOps

Is it GitOps-compatible?

Yes. A migration is a standard Kubernetes resource — apply it from Git, Argo CD, or Flux.

Install with Helm →

Engineering leadership

Is this production-ready?

Not yet. Pre-1.0 (v0.1.0-alpha), single-cluster, end-to-end proven on amd64 / Linux 6.17 / kindnet — with a public, milestone-by-milestone path to v1.0.

See the roadmap →

12-Phase Migration State Machine

How a migration works

Every PodMotion migration runs a 12-phase migration state machine. Each phase writes a Kubernetes status condition before it acts — crash-safe and auditable. The 12 steps below trace the migration pipeline; TCP-preservation phases run when tcpPreservationMode is opted in (default None; BestEffort or Strict). For exactly what is proven, on which substrate, in which mode, and how far the scope goes, see the compatibility matrix at the top of the page.

In plain terms: PodMotion picks up a running app from one machine and sets it down on another, restored on the destination node — no stop-and-restart. Keeping the same pod IP across that move depends on your CNI and cloud network, so it is not something every provider can support.

Pre-freeze (steps 1–5) · FinalFreezeStateCapture (step 6) · Restoration & verify (steps 7–12)

SocketInventory

Validating

DestinationPrewarm

PreCopyMemory

ZeroWindowArm

FinalFreezeStateCapture

OverlayHandoff

Restore + SocketReattach

DisengageHold

TCPVerifying

ServiceVerifying

CutoverComplete → Complete

Safety invariant (ADR-0012)

The source pod is never deleted until the destination is verified serving (ADR-0012).

Failure mode

If any phase before Cutover fails, PodMotion rolls back — the destination is discarded and your pod keeps running on its original node.
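
The ordering and rollback rule above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not PodMotion's controller code: the phase names are copied from the list on this page, while `run_migration`, the `steps` callables, and the log strings are hypothetical names invented for the example. It also assumes the source pod is removed only at the final cutover step.

```python
from typing import Callable, Dict, List, Tuple

# Phase names as listed on this page, in pipeline order.
PHASES: List[str] = [
    "SocketInventory", "Validating", "DestinationPrewarm", "PreCopyMemory",
    "ZeroWindowArm", "FinalFreezeStateCapture", "OverlayHandoff",
    "Restore + SocketReattach", "DisengageHold", "TCPVerifying",
    "ServiceVerifying", "CutoverComplete",
]


def run_migration(steps: Dict[str, Callable[[], bool]]) -> Tuple[str, List[str]]:
    """Run the pre-cutover phases in order and return (outcome, audit log).

    `steps` maps a phase name to a hypothetical callable returning True on
    success. A phase with no entry is treated as succeeding.
    """
    log: List[str] = []
    for phase in PHASES[:-1]:
        # Record the condition before acting, so a crash leaves an audit trail.
        log.append(f"condition:{phase}=InProgress")
        succeeded = steps.get(phase, lambda: True)()
        if not succeeded:
            # Failure before cutover: discard the destination, keep the source.
            log.append("rollback:destination-discarded")
            log.append("source:still-running")
            return "RolledBack", log
        log.append(f"condition:{phase}=True")
    # Every verification phase passed; only now is the source pod removed.
    log.append("condition:CutoverComplete=True")
    log.append("source:deleted")
    return "Complete", log
```

Calling `run_migration({"TCPVerifying": lambda: False})` returns `"RolledBack"` with no `source:deleted` entry in the log, which is the safety invariant in miniature.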

Proof Results

The proof: a running, stateful pod moved across nodes and restored on the destination node

The full proof record — what was proven, on which substrate, in which migration mode, and how far the scope goes — lives in the compatibility matrix at the top of the page. Each row states its own CNI, workload, mode, and the exact conditions of the run, with a link to the committed proof doc.

Broader coverage — many concurrent flows, multi-container, N-flow Service-routed transparency, and additional CNIs — remains on the roadmap, not yet proven. We publish limits, not just wins.

Economics

The restart penalty is a hidden tax on every Kubernetes cluster you run.

Kubernetes clusters average 8–13% CPU utilization. The other 87–92% is reserved headroom that exists because moving a running application has always required restarting it first — and that restart is the problem. PodMotion removes it.

Server capacity actually in use

8–13%

For every $100 you spend on cloud compute, roughly $87–$92 sits in reserve — paid for, idle, waiting for a peak that rarely arrives.

CAST AI, 4,000-cluster study (2025–2026)

Organizations where overprovisioning is the #1 cost driver

70%

The root cause is structural, not a configuration problem. You cannot consolidate servers if moving an app requires restarting it.

CNCF FinOps Microsurvey, Dec 2023

Average cost of a major infrastructure incident

$794K

Typical organization: 25 high-priority incidents per year. Industry research attributes 79% of production incidents to recent system changes (Komodor 2025) — including routine maintenance events. PodMotion's thesis: fewer restart-driven change events means fewer incident triggers. This figure is the single input used in the incident-cost row of the savings model below — it appears here as a shared reference metric and in that row as a per-incident unit cost; it is not two independent data points.

PagerDuty State of Digital Operations 2024, n=500 IT leaders

Cloud discount available for interruptible capacity

60–90%

Cloud providers sell spare capacity at steep discounts. Most enterprises cannot use it for anything important, because the provider can reclaim it on as little as 30 seconds’ notice — which means a restart.

AWS / GCP / Azure published spot pricing, June 2026 (varies by instance type and region)

This problem was already solved once — for the previous generation of infrastructure

In the early 2000s, physical data centers ran at 5–15% utilization for the same reason Kubernetes clusters do today: moving a running workload meant shutting it down first. Once live server migration arrived, teams went from one workload per server to ten or fifteen, maintenance windows shrank from weekends to minutes, and infrastructure costs fell — without touching application code. Container infrastructure is at the same inflection point, and PodMotion applies the same capability class to the current generation. The three cost mechanisms in the savings model below — compute, vendor licensing, and maintenance — all follow from a single change: applications no longer have to restart to move.

Projected annual savings — planning model

Modeled projection — not a PodMotion-measured outcome

Mid-size: 100–200 servers, standard cloud on-demand pricing, 3-person infrastructure team, four maintenance cycles per year. Enterprise: 500–2,000 servers, mixed workload, commercial database licensing applicable, 8+ person infrastructure team. All figures are modeled projections from publicly available industry research (CAST AI 2025–2026, PagerDuty 2024, Komodor 2025, AWS/GCP/Azure published pricing June 2026, Fairwinds 2026 upgrade ops data). Not measured outcomes from PodMotion deployments.

Savings category	Mid-size / yr	Enterprise / yr
Cloud compute — server consolidation	$52K–$201K	$525K–$2.1M
Cloud compute — discounted spare capacity	$84K–$168K	$420K–$840K
Monitoring and observability vendor fees	$38K–$116K	$190K–$580K
Database vendor licensing (where applicable)	$0–$340K	$1.1M–$1.8M
Engineering labor — maintenance windows	$54K–$135K	$270K–$540K
Incident cost reduction	$159K–$397K	$794K–$1.98M
Projected total (non-cumulative categories)	$387K–$1.36M	$3.3M–$7.8M

How to read the total: The total row is the straight sum of all six rows above, so it includes database licensing and both compute rows. Database licensing applies only to organizations with per-processor commercial contracts, and it overlaps with the compute row when the same nodes are decommissioned. Spot/preemptible compute and on-demand compute are alternative scenarios, not additive; weigh them separately based on your actual workload split. Subtract the rows that do not apply to your environment before using the total.
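
A short worked check shows which rows feed the total. The figures are copied from the table above, in thousands of US dollars; the dictionary keys and the `total` helper are illustrative names only. Summing all six rows gives 387 to 1,357 for mid-size and 3,299 to 7,840 for enterprise, which matches the published total row after rounding.

```python
# (low, high) per row, in thousands of US dollars, copied from the table above.
MID_SIZE = {
    "compute_consolidation": (52, 201),
    "compute_spot": (84, 168),
    "monitoring": (38, 116),
    "db_licensing": (0, 340),
    "maintenance_labor": (54, 135),
    "incident_cost": (159, 397),
}
ENTERPRISE = {
    "compute_consolidation": (525, 2100),
    "compute_spot": (420, 840),
    "monitoring": (190, 580),
    "db_licensing": (1100, 1800),
    "maintenance_labor": (270, 540),
    "incident_cost": (794, 1980),
}


def total(rows, exclude=()):
    """Sum the low and high ends of every row not named in `exclude`."""
    kept = [rng for name, rng in rows.items() if name not in exclude]
    return sum(lo for lo, _ in kept), sum(hi for _, hi in kept)


print(total(MID_SIZE))    # (387, 1357)
print(total(ENTERPRISE))  # (3299, 7840)

# One possible reading for an on-demand-only shop with no per-processor
# database contract: drop the spot row and the licensing row.
skip = ("compute_spot", "db_licensing")
print(total(MID_SIZE, skip))    # (303, 849)
print(total(ENTERPRISE, skip))  # (1779, 5200)
```

The second pair of figures is only one scenario; swap the excluded rows to match your own workload split.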

Disclaimer. All financial figures on this page are modeled projections for planning purposes only, derived from publicly available industry research (CAST AI 2025–2026, CNCF FinOps Microsurvey Dec 2023, PagerDuty State of Digital Operations 2024, Komodor 2025 Enterprise Kubernetes Report, and published AWS/Datadog/Dynatrace/Sysdig/Oracle pricing as of June 2026). Actual savings depend on workload profile, cluster topology, instance mix, existing licensing agreements, and operational context.

Alpha software — single declared substrate. PodMotion v0.1.0-alpha's current end-to-end proof runs on a standing 3-node KVM/kubeadm cluster: amd64, Kubernetes v1.35.x (containerd), Cilium CNI v1.17.6 (geneve tunnel), CRIU 4.2. That proof covers a single held TCP connection; broad multi-flow, multi-container coverage is not yet proven. Additional CNIs and cloud environments are on the roadmap. No production PodMotion deployments exist; no measured production savings data exists. Storage migration (PVCs) is deferred to v1.2.

Get started

Install the operator, run your first migration, verify it. Three steps.

New to this? PodMotion moves a running container from one machine to another while keeping every open network connection alive — without restarting it or dropping in-flight requests. The steps below use Kubernetes terms, but that one idea is the whole point.

Prerequisites: Requires CRIU 4.2 and Linux kernel 5.15+ (migration runs validated on kernel 6.17). Validated on Kubernetes v1.32.0 and v1.35.x. Use amd64 nodes: amd64 is the primary validated substrate, while arm64 is the earlier baseline whose production validation is deferred to post-v1.0. Single-cluster only in v0.x.

Kernel prerequisites: Live process migration requires direct kernel access. The per-node agent DaemonSet uses hostPID, SYS_PTRACE, and CHECKPOINT_RESTORE to drive CRIU, plus NET_ADMIN for the eBPF TCP relay. Store sensitive configuration in Kubernetes Secrets — not environment variables — so checkpoint images contain no plaintext credentials. See SECURITY.md for hardening guidance.

Warning: by default, gRPC on port :9090 is plaintext unless TLS certificates are configured. See Security Architecture to enable mTLS before deploying to non-isolated environments.

Install PodMotion

Install from local chart (not yet published to a public repository).

helm install podmotion ./charts/podmotion -n podmotion-system --create-namespace

Note: install.yaml is not published for v0.1.0-alpha, so the release-manifest command below will not work yet. Clone the repo and install from the local chart instead: ./charts/podmotion

kubectl apply -f https://github.com/podmotion-io/podmotion/releases/latest/download/install.yaml

Installs the operator, the per-node agent DaemonSet, five CRDs, and the admission webhooks.

Migrate a pod

kubectl apply -f - <<EOF
apiVersion: migration.podmotion.io/v1alpha1
kind: PodMigration
metadata:
  name: migrate-my-pod
  namespace: default
spec:
  podName: my-pod
  podNamespace: default
  targetNodeName: worker-2      # optional: pin destination node
  tcpPreservationMode: None
EOF

Applies a PodMigration CR — the CRD-only interface in v0.1.0-alpha. The operator picks it up and drives the 22-value MigrationPhase state machine to Complete. Set tcpPreservationMode: Strict to preserve long-lived TCP sessions across the migration (opt-in).

Evaluate safely first: set spec.dryRun: true in your PodMigration CR to run a DryRunEstimate — PodMotion will report the estimated freeze window without moving the pod.
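
If you generate manifests from a script rather than writing YAML by hand, the same resource can be built as plain data. The sketch below is illustrative only: the field names and the three mode values come from this page, while the `pod_migration` helper is a hypothetical name, and it assumes the resource lives in the same namespace as the pod, as in the examples here.

```python
import json

# Mode values named on this page: None is the default, the others are opt-in.
ALLOWED_MODES = {"None", "BestEffort", "Strict"}


def pod_migration(name, pod_name, namespace, mode="None",
                  target_node=None, dry_run=False):
    """Return a PodMigration manifest as a dict (hypothetical helper)."""
    if mode not in ALLOWED_MODES:
        raise ValueError(f"unknown tcpPreservationMode: {mode!r}")
    spec = {
        "podName": pod_name,
        "podNamespace": namespace,
        "tcpPreservationMode": mode,
    }
    if target_node is not None:
        spec["targetNodeName"] = target_node  # optional: pin destination node
    if dry_run:
        spec["dryRun"] = True  # estimate the freeze window without moving the pod
    return {
        "apiVersion": "migration.podmotion.io/v1alpha1",
        "kind": "PodMigration",
        "metadata": {"name": name, "namespace": namespace},
        "spec": spec,
    }


if __name__ == "__main__":
    manifest = pod_migration("migrate-my-pod", "my-pod", "default",
                             target_node="worker-2", dry_run=True)
    print(json.dumps(manifest, indent=2))
```

kubectl accepts JSON as well as YAML, so the printed manifest can be piped to `kubectl apply -f -`.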

Verify it worked

kubectl get podmigration migrate-my-pod
kubectl describe podmigration migrate-my-pod

Confirm that the pod now runs on the destination node. seq_delta=0 (zero TCP sequence delta) with connection parity was first proven at Sprint 46 for a single held connection on a real multi-node cluster; the full multi-flow guarantee is not yet proven. The describe output shows the status.conditions relevant to the migration mode. In Strict mode the aggregate TCP result is TCPVerified (True only when its sub-conditions TCPSequenceContinuityVerified and ConnectionCountParity are both True); application reachability is reported by ServiceVerified. In the default None mode, TCP/socket conditions are skipped entirely. Run kubectl describe podmigration <name> to see the exact set for your migration.
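
The aggregation rule for the conditions can be expressed as a small check over `status.conditions`. This is a sketch under assumptions: it takes the conditions as the usual list of objects with `type` and `status` fields, the function names are hypothetical, and BestEffort is left out because this page does not say which conditions it reports.

```python
from typing import Dict, List


def condition_status(conditions: List[Dict[str, str]]) -> Dict[str, bool]:
    """Map each condition type to True/False from a status.conditions list."""
    return {c["type"]: c.get("status") == "True" for c in conditions}


def tcp_verified(status: Dict[str, bool]) -> bool:
    """Aggregate TCP result: True only when both sub-conditions are True."""
    return (status.get("TCPSequenceContinuityVerified", False)
            and status.get("ConnectionCountParity", False))


def migration_ok(mode: str, conditions: List[Dict[str, str]]) -> bool:
    """Decide whether a finished migration met the bar for its mode."""
    status = condition_status(conditions)
    service_ok = status.get("ServiceVerified", False)
    if mode == "None":
        # TCP/socket conditions are skipped; only reachability counts.
        return service_ok
    if mode == "Strict":
        return tcp_verified(status) and service_ok
    raise ValueError(f"mode not covered by this sketch: {mode!r}")
```

Feeding it the `conditions` array from `kubectl get podmigration <name> -o json` is one way to turn the describe output into a pass/fail gate in a pipeline.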

Read the docs: Full reference documentation →
The CRD interface: PodMigration CR examples and status patterns →
Understand the architecture: Operator, agents, CRDs, and the 22-value MigrationPhase state machine →

The CRD interface

Every migration is a PodMigration custom resource. Start with dryRun: true for a freeze-window estimate, then apply without it to run the migration and describe the resource to see the completed phase and status conditions. seq_delta=0 (zero TCP sequence delta) was first proven at Sprint 46 for a single held connection on a real multi-node cluster; the full multi-flow guarantee is not yet proven. No new tooling required.

apiVersion: migration.podmotion.io/v1alpha1
kind: PodMigration
metadata:
  name: estimate-api-server-001
  namespace: production
spec:
  podName: api-server-7d9f8b-xkq2p
  podNamespace: production
  targetNodeName: worker-node-3
  tcpPreservationMode: None   # default; opt-in to Strict for long-lived TCP sessions
  dryRun: true

Start with dryRun: true to get a freeze-window estimate without moving the pod. The controller runs the estimation pipeline — dirty page sampling, storage tier detection, freeze budget projection — and writes the result to status.dryRunEstimate. The source pod is never touched. Use this as the safe evaluator entry point before committing to a live migration.

$ kubectl apply -f estimate-api-server-001.yaml
podmigration.migration.podmotion.io/estimate-api-server-001 created

$ kubectl get podmigration estimate-api-server-001 -w
NAME                        PHASE            AGE
estimate-api-server-001     Pending          1s
estimate-api-server-001     DryRunComplete   4s

$ kubectl get podmigration estimate-api-server-001 -o jsonpath='{.status.dryRunEstimate}'
{
  "estimatedFreezeDurationMs": 312,
  "dirtyPageRateMBps": "18.4",
  "storageTier": 0,
  "migratable": true
}
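
One way to use the estimate is as a gate before the live run. The sketch below assumes the JSON shape shown in the sample output above and a freeze budget that you choose yourself; `freeze_within_budget` and `budget_ms` are hypothetical names, not part of PodMotion.

```python
import json

# Sample estimate, copied from the output shown above.
SAMPLE = (
    '{"estimatedFreezeDurationMs": 312, "dirtyPageRateMBps": "18.4", '
    '"storageTier": 0, "migratable": true}'
)


def freeze_within_budget(estimate_json: str, budget_ms: int) -> bool:
    """Return True if the pod is migratable and the estimate fits the budget."""
    estimate = json.loads(estimate_json)
    if not estimate.get("migratable", False):
        return False
    return int(estimate["estimatedFreezeDurationMs"]) <= budget_ms


if __name__ == "__main__":
    # With a 500 ms budget the 312 ms estimate passes; with 300 ms it does not.
    print(freeze_within_budget(SAMPLE, 500))
    print(freeze_within_budget(SAMPLE, 300))
```

Note that the sample reports `dirtyPageRateMBps` as a string, so convert it with `float()` before doing arithmetic on it.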

apiVersion: migration.podmotion.io/v1alpha1
kind: PodMigration
metadata:
  name: migrate-api-server-001
  namespace: production
spec:
  podName: api-server-7d9f8b-xkq2p
  podNamespace: production
  targetNodeName: worker-node-3
  tcpPreservationMode: None   # default; use Strict for long-lived TCP sessions

Trigger a live migration by applying a PodMigration custom resource. Set tcpPreservationMode: Strict only for workloads with long-lived TCP sessions that must survive the cutover — this enables hard rollback if TCPSequenceContinuityVerified=False (ADR-0019 Amendment A). The default (None) is appropriate for most stateless and short-connection workloads. The source pod is never deleted before TrafficVerified (ADR-0012).

$ kubectl apply -f migrate-api-server-001.yaml
podmigration.migration.podmotion.io/migrate-api-server-001 created

kubectl describe podmigration migrate-api-server-001

Inspect the full PodMigration status block, phase conditions, freeze window duration, and seq_delta proof. All migration observability lives in the CR — no separate tooling required.

$ kubectl describe podmigration migrate-api-server-001
Name:         migrate-api-server-001
Namespace:    production
Labels:       <none>
API Version:  migration.podmotion.io/v1alpha1
Kind:         PodMigration

Spec:
  Pod Name:                api-server-7d9f8b-xkq2p
  Pod Namespace:           production
  Target Node Name:        worker-node-3
  TCP Preservation Mode:   None

Status:
  Phase:                   Complete
  Source Node Name:        worker-node-1
  Destination Node Name:   worker-node-3
  Pre Copy Stats:
    Rounds Completed:      3
    Freeze Duration Ms:    298

  Conditions:
    Type                             Status   Reason
    ----                             ------   ------
    SocketInventoryComplete          True     PhaseComplete
    OverlayReady                     True     OverlayHandoffComplete
    RelayArmed                       True     RelayProgramsActive
    TCPSequenceContinuityVerified    True     SeqDeltaZero
    SocketsLive                      True
    ProcessResumedAfterSocketsLive   True
    TrafficVerified                  True     PhaseComplete

The PodMigration CR is the complete control plane — compatible with kubectl, Argo CD, Flux, and any Kubernetes-native pipeline.

A kubectl plugin is planned for a future release. In v0.1.0-alpha, all migrations are managed via the PodMigration custom resource.

Community

Join the project

Apache-2.0, vendor-neutral, CNCF-track*. Built in the open.

Current status: v0.1.0-alpha · single-cluster · amd64 validated (arm64 post-v1.0) · one maintainer · pre-CNCF-sandbox

Path to v1.0 readiness: ~45% · M22 of M30

See the full milestones for gate conditions and the validation footprint.

GitHub

Browse source, open issues, and follow the project. All development happens in the open under Apache-2.0.

Star on GitHub

Contributing

Read the contributor guide to set up a local cluster, run the integration harness, and submit a pull request.

Read CONTRIBUTING

Security Disclosure

Report security vulnerabilities to security@podmotion.io. We acknowledge within 72 hours. See our Security Architecture page and SECURITY.md for scope and coordinated disclosure policy. Note: inter-component gRPC on :9090 defaults to plaintext — enable mTLS before production use.

SECURITY.md

Discussions

Ask questions, share use cases, and propose ideas in GitHub Discussions. This is the primary support channel for the project. A real-time channel (Discord or Slack) is planned but not yet available.

Open Discussions

Code of Conduct

PodMotion adopts the Contributor Covenant. All community spaces — GitHub Issues, Discussions, PRs, and events — are covered. Violations may be reported to the maintainer.

Read CODE_OF_CONDUCT.md

Governance commitment

We will not file for CNCF sandbox until there are ≥2 maintainers from ≥2 organizations. PodMotion is a vendor-neutral project with a documented maintainer-diversity plan.

Read GOVERNANCE.md

Support is community-based via GitHub Discussions and Issues; commercial support is not currently available.

Explicitly out of scope

Windows nodes: out of scope · GPU workloads: deferred · Multi-cluster migration: long-term roadmap

* Not yet a CNCF project — CNCF sandbox submission planned (target Q3 2026 · M-GOV-3), gated on ≥2 maintainers from ≥2 organizations.

Open Source · Apache-2.0

About the project

PodMotion was born from a gap in the Kubernetes ecosystem. When a pod crashes or gets evicted, it takes its state with it: its memory, its open files, and every TCP connection its clients were holding. The scheduler starts a new replica somewhere else, and those connections get a RST. Every retry, every reconnect, every client-side timeout is a consequence of one assumption baked into the platform: that a pod's running state is disposable.

This project started as a proof of concept: could CRIU actually move a running, stateful process across nodes and restore it in working order? Preserving the original pod IP was a deliberate non-goal, because not every cloud provider or CNI can hand one node the IP block that belongs to another. The answer is yes. It was proven end-to-end in about 13 seconds on our two-node kind proof substrate (kind is Kubernetes-in-Docker, a local test-cluster tool; the cluster ran stock upstream Kubernetes v1.32), with an in-memory counter restored intact across nodes and the source pod never deleted until traffic is verified (ADR-0012). Live TCP-connection survival (seq_delta=0, no RST, no reconnect) was first proven at Sprint 46, where a single held connection survived a migration byte-exact (seq_delta 0/0, Strict mode) on a real multi-node cluster; the full guarantee across many concurrent flows and declared CNIs is not yet proven. No application changes required.

What started as a proof of concept has grown into a real system in progress: a 12-phase migration state machine, five CRDs, a scheduler plugin, a CNI plugin, and a growing open-source project. PodMotion's checkpoint/restore engine is built toward two co-equal goals: live migration between nodes, proven first, and moving containers between different hosting platforms and back (for example AWS ECS and EC2/Docker, or EKS and EC2) plus full pod backup that's ready for restore. The migration goal has real proof today; the portability and backup goals are architected but not yet implemented in v0.1.0-alpha. Built in the open. Apache-2.0. Founded and built by a team with deep roots in Kubernetes, federal systems, and enterprise transformation.

Chad N. Ingle

Founder & CTO

Principal DevOps Architect, Effectual

28+ years in platform engineering, cloud architecture, and enterprise transformation. Principal DevOps Architect at Effectual, co-founder of Blue Butterfly AI, founder of OAN Ministries. Builds net-new products focused on rebuilding security from the ground up — the hands-on, people-first builder behind PodMotion.

Salle Ingle

Founder

Chief AI Strategist

27+ years in technology leadership, including Principal Architect at McKinsey & Company leading generative-AI workflow teams for Fortune 500s. Expert in enterprise multi-cloud and on-premise architecture across data centers, AWS, Azure, and GCP. Founder of Blue Butterfly AI; drives PodMotion's cloud-certification strategy and enterprise go-to-market.

Jonathan Baier

Co-Founder & Chief AI Scientist

Active Contributor & Agentic AI Lead

20+ years leading AI, cloud, and engineering transformations — Moody's, ADP, and his firm Cyberify Services. Published Kubernetes/GenAI author with four US patents; PodMotion's agentic AI lead, working with Chad since 2012.

David King

Co-Founder & CISO

NAVWAR Branch Lead, US Space Force

Decades of DoD high-security systems experience as NAVWAR Branch Lead, US Space Force, on the next-generation GPS Command & Control system (OCX). At PodMotion he's the compliance and security skeptic DoD and high-assurance customers expect.

Johnathan Knotts

PodMotion Lead Architect

AWS DevOps Engineer, ispace, inc.

AWS DevOps Engineer with a background in RAN/O-RAN network integrations at DISH Network and systems engineering at wi-fiber. BS in Electronic Systems and Engineering Technology, Texas A&M University. Brings hands-on network engineering and customer-facing technical delivery experience to PodMotion's architecture.

Alexander Einbinder

Virtualization Lead Architect

Senior Ground Operations Engineer, ispace, inc.

Two decades of IT infrastructure and mission-critical operations experience, currently steering ground control operations at ispace, inc. Deep proficiency in Linux-based platforms, vSphere/vSAN virtualization, and observability tooling (Grafana, InfluxDB, Yamcs). Brings redundant-systems and resilient-infrastructure design experience to PodMotion's architecture.

Engineering Transparency

Known Limitations (v0.1.0-alpha)

These are the honest boundaries of what v0.1.0-alpha does and does not do. Naming them precisely is engineering discipline — CNCF TAG reviewers treat explicit scope boundaries as a credibility signal, not a weakness.

Single declared substrate
The authoritative current end-to-end proof ran on amd64 / Linux 6.17 / containerd 1.7.24 / Kubernetes v1.32.0 / kindnet (host-routed, no VXLAN overlay) — kind profile podmotion-s33-cold (Sprint 33 + 34). An earlier arm64/k3s PoC exists but is historical, not the current proof. Live TCP preservation (Strict mode) was first proven separately at Sprint 46 for a single held connection (seq_delta 0/0) on a real multi-node cluster; broader multi-flow, multi-CNI validation is not yet proven.
Single-container pod only
Multi-container consistent checkpoint (M22) is in progress. v0.1.0-alpha supports pods with exactly one container; multi-container pods are out of scope until M22 lands.
Single TCP flow only
conntrack/NAT migration (M23) is not yet built. The eBPF relay preserves one active TCP flow per migration. Multi-flow support is planned for a future milestone.
CPU-feature portability
CRIU checkpoints are not CPU-feature portable across microarchitecture generations. A checkpoint taken on a node with AVX-512 (Skylake) can crash with SIGILL (illegal instruction) when restored on a node without it (Haswell). A fix is targeted for Sprint 33+ (ADR-0153).
No production deployments
v0.1.0-alpha has not been run in a production environment. Most proof runs are on kind clusters; the Sprint 46 live-TCP proof ran on a real multi-node cluster. Production readiness is a future milestone, not a current claim.
kindnet substrate only
The current end-to-end proof ran on kindnet (host-routed, no VXLAN overlay). Flannel VXLAN and other CNIs (Calico, Cilium, Weave) are on the roadmap but not yet proven. No authoritative end-to-end proof file exists for any CNI other than kindnet.
Storage migration deferred
Persistent volume migration is deferred to v1.2. v0.1.0-alpha does not migrate PVCs or storage state; workloads relying on local storage are out of scope. One tier has gone beyond the design stage in development: a Longhorn-backed Postgres PVC reached a verified in-run migration (phase=Complete / TrafficVerified=True) on the KVM cluster. That is a single dev-cluster run, not a committed proof and not part of the shipped alpha; the Ceph RBD and snapshot-rsync tiers remain unexercised stubs.
Alpha security posture
The node-agent DaemonSet runs privileged: true with hostPID: true, a read-write mount of the containerd Unix socket, and no seccomp profile. Required capabilities include SYS_PTRACE, SYS_ADMIN, NET_ADMIN, DAC_OVERRIDE, CHOWN, SETUID, SETGID, and CHECKPOINT_RESTORE. This amounts to node-root access, which the current implementation needs for live CRIU checkpoint-restore and the eBPF relay.
Checkpoints are written in plaintext to tmpfs at /run/podmotion/checkpoints/<ref>/ (mode 0700, wiped on node reboot). AES-256-GCM checkpoint encryption is not implemented in v0.1.0-alpha: ADR-0029 was Accepted but never wired in (dead code) and is superseded by ADR-0160 (Draft), which defines a CheckpointCrypto provider interface (none/aes-gcm/aws-kms/gcp-kms/azure-kms/vault).
mTLS between operator and node-agent is opt-in via Helm values. In v0.1.0-alpha the implementation fails open: enabling mTLS does not guarantee channel enforcement. The channel is plaintext by default.
Structured audit logging is not implemented in v0.1.0-alpha. The audit/ directory is a stub. PhaseTransition events are emitted; audit records are not.
ADR-0148 (Accepted, unimplemented) defines the post-1.0 hardening roadmap: capability drop to the minimum CRIU/eBPF set (drop ALL; add SYS_PTRACE, SYS_ADMIN, NET_ADMIN, BPF, PERFMON), read-only containerd socket steady-state, hostPID:false via NRI-supplied PID, a trace-derived deny-by-default seccomp profile, and readOnlyRootFilesystem. gRPC mTLS enforcement is a separate workstream (ADR-0159/ADR-0161-C4), not ADR-0148. The controller-manager is already hardened: runAsNonRoot, seccomp: RuntimeDefault, readOnlyRootFilesystem, drop: ALL.
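Several of the scope limits listed above (exactly one container, no PVC migration, matching CPU features) are easy to check before a migration is even attempted. The following is a minimal sketch of such a preflight check, not PodMotion's own code: the function names `parse_cpu_flags` and `preflight` are hypothetical, the pod is assumed to be a plain dict shaped like a Kubernetes Pod manifest (`spec.containers`, `spec.volumes[].persistentVolumeClaim`), and the CPU comparison assumes x86-style `flags` lines from /proc/cpuinfo. Comparing the full flag sets is deliberately conservative; it may reject a pod whose process never uses the missing feature.

```python
from __future__ import annotations

from typing import Iterable


def parse_cpu_flags(cpuinfo_text: str) -> set[str]:
    """Return the feature flags from /proc/cpuinfo-style text.

    Assumes the x86 layout, where each processor entry has a 'flags' line.
    Only the first such line is read.
    """
    for line in cpuinfo_text.splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() == "flags":
            return set(value.split())
    return set()


def preflight(
    pod: dict,
    source_flags: Iterable[str],
    dest_flags: Iterable[str],
) -> list[str]:
    """Return the reasons a pod falls outside the alpha scope (empty if none).

    Hypothetical helper for illustration; it only inspects spec.containers
    and spec.volumes and does not look at init or ephemeral containers.
    """
    problems: list[str] = []
    spec = pod.get("spec", {})

    # Limit: single-container pods only.
    containers = spec.get("containers", [])
    if len(containers) != 1:
        problems.append(f"expected exactly 1 container, found {len(containers)}")

    # Limit: PVCs and storage state are not migrated.
    pvc_volumes = [
        volume.get("name", "?")
        for volume in spec.get("volumes", [])
        if "persistentVolumeClaim" in volume
    ]
    if pvc_volumes:
        problems.append("PVC-backed volumes are not migrated: " + ", ".join(pvc_volumes))

    # Limit: checkpoints are not portable to a CPU lacking source features.
    missing = sorted(set(source_flags) - set(dest_flags))
    if missing:
        problems.append("destination CPU lacks: " + ", ".join(missing))

    return problems


if __name__ == "__main__":
    example_pod = {"spec": {"containers": [{"name": "redis"}]}}
    source = parse_cpu_flags("flags\t\t: fpu sse2 avx2 avx512f\n")
    destination = parse_cpu_flags("flags\t\t: fpu sse2 avx2\n")
    for reason in preflight(example_pod, source, destination):
        print(reason)
```

An empty result means the pod passes these three checks only; it says nothing about the other limits above, such as the single-TCP-flow restriction or the CNI in use.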

Live-migrate running pods. Move them across platforms entirely.

Bounded freeze, never a restart

In-memory state restored intact across nodes

GitOps-safe

Volumes follow the pod

Move containers across platforms

Full pod backup, ready for restore

How a migration works

The proof: a running, stateful pod moved across nodes and restored on the destination node

The restart penalty is a hidden tax on every Kubernetes cluster you run.

This problem was already solved once — for the previous generation of infrastructure

Projected annual savings — planning model

Get started

Install PodMotion

Migrate a pod

Verify it worked

The CRD interface

Join the project

GitHub

Contributing

Security Disclosure

Discussions

Code of Conduct

About the project

Known Limitations (v0.1.0-alpha)
