# Immich Postgres recovery (checkpoint corruption)
## What happened
Postgres in the `immich` namespace was failing to start with:

```text
invalid xl_info in checkpoint record
PANIC: could not locate a valid checkpoint record
startup process was terminated by signal 6: Aborted
```
This indicates WAL/checkpoint corruption: the database was not shut down cleanly, e.g. the pod was killed mid-write, or the RWO volume was attached to more than one node (a Multi-Attach scenario).
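To confirm the failure mode before recovering, inspect the previous run of the crashed container; the `postgres` container name matches the commands used later in this doc:

```bash
# Tail the logs from the container's previous (crashed) run
kubectl logs -n immich deploy/immich-postgres -c postgres --previous --tail=20
```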
## Volume health (Longhorn)
- `immich-postgres-pvc` should show `robustness: healthy`. If it were degraded, I/O errors could cause or worsen corruption. Check with (the sketch after this list shows how to resolve `<postgres-pv-name>`):

  ```bash
  kubectl get volumes.longhorn.io -n longhorn-system | grep <postgres-pv-name>
  ```

- The `immich-library` volume (200Gi) is separate; if it is degraded, fix it in the Longhorn UI or by waiting for/triggering a replica rebuild, but that does not block Postgres recovery.
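To fill in `<postgres-pv-name>`, resolve the PV backing the PVC first; this assumes the Longhorn Volume CR exposes `status.robustness`, as current Longhorn releases do:

```bash
# Longhorn volume names match the bound PV name; resolve it from the PVC
PV=$(kubectl get pvc immich-postgres-pvc -n immich -o jsonpath='{.spec.volumeName}')
kubectl get volumes.longhorn.io -n longhorn-system "$PV" -o jsonpath='{.status.robustness}{"\n"}'
```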
## Recovery options
### Option A: Start fresh (delete DB)
If you don't need the existing data:
- Scale Postgres to 0 and delete the PVC (ensure no other pods use the PVC, e.g. delete any log-reader job first):

  ```bash
  kubectl scale deploy immich-postgres -n immich --replicas=0
  kubectl delete pvc immich-postgres-pvc -n immich
  ```

- Recreate the PVC: Flux will do it on reconcile (see the reconcile sketch below), or apply `apps/base/immich/postgres.yaml` directly.
- Scale Postgres back to 1. It will init a new, empty DB.
- Create extensions (run from a node that can reach the cluster, e.g. plumbus):

  ```bash
  kubectl exec -n immich deploy/immich-postgres -c postgres -- \
    psql -U immich -d immich -c \
    "CREATE EXTENSION IF NOT EXISTS vectors; CREATE EXTENSION IF NOT EXISTS cube; CREATE EXTENSION IF NOT EXISTS earthdistance;"
  ```
Immich server will run migrations on first connect. You'll need to set up Immich again (admin user, etc.).
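If Flux doesn't recreate the PVC on its own, a manual reconcile can be triggered, and `\dx` verifies the extensions afterwards. The Kustomization name `apps` below is an assumption about this repo's Flux layout:

```bash
# Force Flux to re-apply manifests (Kustomization name "apps" is assumed)
flux reconcile kustomization apps --with-source
# Confirm the new PVC is bound
kubectl get pvc immich-postgres-pvc -n immich
# After the extensions step, verify they are installed
kubectl exec -n immich deploy/immich-postgres -c postgres -- \
  psql -U immich -d immich -c '\dx'
```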
### Option B: Restore from backup (preferred if you have data)
If you have a backup of the Immich Postgres data:
- **Kopia (or similar):** restore the `immich-postgres-pvc` volume (or the `pgdata` directory) from a snapshot taken while the DB was healthy, then scale Postgres back to 1 and start Immich (a restore sketch follows this list).
- **Immich export:** if you previously used Immich's backup/export feature, bring Postgres up with a fresh data dir first (see Option A), then import the export.
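A minimal Kopia restore sketch, assuming snapshots were taken of the pgdata path and the PVC is re-attached and mounted at `/mnt/restore` on the node; the source path, mount point, and snapshot ID placeholder are all assumptions:

```bash
# List snapshots of the (assumed) pgdata source path
kopia snapshot list /var/lib/postgresql/data
# Restore a chosen snapshot into the re-attached volume's mount
kopia snapshot restore <snapshot-id> /mnt/restore/pgdata
```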
### Option C: Last resort (pg_resetwal)
**Warning:** `pg_resetwal` forces Postgres to create a new checkpoint and can lose recent transactions. Use it only if you have no usable backup and accept possible data loss.
- Scale Postgres down so the PVC is free:

  ```bash
  kubectl scale deploy immich-postgres -n immich --replicas=0
  ```

- Wait until the postgres pod is gone:

  ```bash
  kubectl get pods -n immich -l app=immich-postgres
  ```

- Run the one-off reset job on the same node the PVC was attached to, e.g. `rex` (a sketch of the job manifest follows this list):

  ```bash
  # Edit pg-resetwal-job.yaml and set nodeName to the node that had the postgres pod (e.g. rex)
  kubectl apply -f apps/base/immich/pg-resetwal-job.yaml -n immich
  kubectl wait job/immich-pg-resetwal -n immich --for=condition=complete --timeout=120s
  kubectl logs job/immich-pg-resetwal -n immich
  ```

- Delete the job and scale Postgres back up:

  ```bash
  kubectl delete job immich-pg-resetwal -n immich
  kubectl scale deploy immich-postgres -n immich --replicas=1
  ```

- Recreate Immich extensions if needed (once Postgres is Ready):

  ```bash
  kubectl exec -n immich deploy/immich-postgres -c postgres -- psql -U immich -d immich -c \
    "CREATE EXTENSION IF NOT EXISTS vectors; CREATE EXTENSION IF NOT EXISTS cube; CREATE EXTENSION IF NOT EXISTS earthdistance;"
  ```
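For reference, a sketch of what such a one-off job might contain; the real manifest lives at `apps/base/immich/pg-resetwal-job.yaml`, and the image, data path, and uid below are assumptions that must match the actual Postgres deployment:

```bash
kubectl apply -n immich -f - <<'EOF'
apiVersion: batch/v1
kind: Job
metadata:
  name: immich-pg-resetwal
spec:
  template:
    spec:
      nodeName: rex                  # node where the PVC was last attached
      restartPolicy: Never
      containers:
        - name: pg-resetwal
          # Assumed: must be the same image/major version as the running Postgres
          image: tensorchord/pgvecto-rs:pg14-v0.2.0
          command: ["pg_resetwal", "--force", "/var/lib/postgresql/data"]
          securityContext:
            runAsUser: 999           # postgres uid in the stock image (assumed)
          volumeMounts:
            - name: pgdata
              mountPath: /var/lib/postgresql/data
      volumes:
        - name: pgdata
          persistentVolumeClaim:
            claimName: immich-postgres-pvc
EOF
```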
If Postgres still fails after pg_resetwal, the data directory may be too damaged; then restoring from backup or re-deploying Immich with a new PVC (and re-importing assets) is the only path.
## Preventing recurrence
- Immich Postgres and the server are already pinned off the `oracle` node (nodeAffinity), and the server uses `strategy: Recreate` with `replicas: 1` to avoid RWO Multi-Attach.
- Ensure Postgres is never scheduled on a node that might lose connectivity or be terminated abruptly; keep backups (e.g. Kopia) of the `immich-postgres-pvc`, or regular `pg_dump`/Immich exports (a `pg_dump` sketch follows this list).
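A minimal sketch of the `pg_dump` variant, runnable from any machine with cluster access; the output filename is arbitrary, and the custom format allows selective `pg_restore` later:

```bash
# Dump the immich database to a local archive (pg_dump writes to stdout by default)
kubectl exec -n immich deploy/immich-postgres -c postgres -- \
  pg_dump -U immich -d immich --format=custom > "immich-$(date +%F).dump"
```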