# Immich Postgres recovery (checkpoint corruption)
## What happened
Postgres in the `immich` namespace was failing to start with:

```text
invalid xl_info in checkpoint record
PANIC: could not locate a valid checkpoint record
startup process was terminated by signal 6: Aborted
```
This indicates WAL/checkpoint corruption: the database was not shut down cleanly, e.g. the pod was killed mid-write, or the RWO volume was attached to more than one node (a Multi-Attach scenario).
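To confirm the failure mode before recovering, inspect the previous run of the crashed container; the `postgres` container name matches the commands used later in this doc:

```bash
# Tail the logs from the container's previous (crashed) run
kubectl logs -n immich deploy/immich-postgres -c postgres --previous --tail=20
```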
## Volume health (Longhorn)
- `immich-postgres-pvc` should show `robustness: healthy`. If it were degraded, I/O errors could cause or worsen corruption. Check with (the sketch after this list shows how to resolve `<postgres-pv-name>`):

  ```bash
  kubectl get volumes.longhorn.io -n longhorn-system | grep <postgres-pv-name>
  ```

- The `immich-library` volume (200Gi) is separate; if it is degraded, fix it in the Longhorn UI or by waiting for/triggering a replica rebuild, but that does not block Postgres recovery.
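To fill in `<postgres-pv-name>`, resolve the PV backing the PVC first; this assumes the Longhorn Volume CR exposes `status.robustness`, as current Longhorn releases do:

```bash
# Longhorn volume names match the bound PV name; resolve it from the PVC
PV=$(kubectl get pvc immich-postgres-pvc -n immich -o jsonpath='{.spec.volumeName}')
kubectl get volumes.longhorn.io -n longhorn-system "$PV" -o jsonpath='{.status.robustness}{"\n"}'
```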
## Recovery options
### Option A: Start fresh (delete DB)
If you don't need the existing data:
- Scale Postgres to 0 and delete the PVC (ensure no other pods use the PVC, e.g. delete any log-reader job first):

  ```bash
  kubectl scale deploy immich-postgres -n immich --replicas=0
  kubectl delete pvc immich-postgres-pvc -n immich
  ```

- Recreate the PVC: Flux will do it on reconcile (see the reconcile sketch below), or apply `apps/base/immich/postgres.yaml` directly.
- Scale Postgres back to 1. It will init a new, empty DB.
- Create extensions (run from a node that can reach the cluster, e.g. plumbus):

  ```bash
  kubectl exec -n immich deploy/immich-postgres -c postgres -- \
    psql -U immich -d immich -c \
    "CREATE EXTENSION IF NOT EXISTS vectors; CREATE EXTENSION IF NOT EXISTS cube; CREATE EXTENSION IF NOT EXISTS earthdistance;"
  ```
Immich server will run migrations on first connect. You'll need to set up Immich again (admin user, etc.).
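If Flux doesn't recreate the PVC on its own, a manual reconcile can be triggered, and `\dx` verifies the extensions afterwards. The Kustomization name `apps` below is an assumption about this repo's Flux layout:

```bash
# Force Flux to re-apply manifests (Kustomization name "apps" is assumed)
flux reconcile kustomization apps --with-source
# Confirm the new PVC is bound
kubectl get pvc immich-postgres-pvc -n immich
# After the extensions step, verify they are installed
kubectl exec -n immich deploy/immich-postgres -c postgres -- \
  psql -U immich -d immich -c '\dx'
```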
### Option B: Restore from backup (preferred if you have data)
If you have a backup of the Immich Postgres data:
- **Kopia (or similar):** restore the `immich-postgres-pvc` volume (or the `pgdata` directory) from a snapshot taken while the DB was healthy, then scale Postgres back to 1 and start Immich (a restore sketch follows this list).
- **Immich export:** if you previously used Immich's backup/export feature, bring Postgres up with a fresh data dir first (see Option A), then import the export.
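A minimal Kopia restore sketch, assuming snapshots were taken of the pgdata path and the PVC is re-attached and mounted at `/mnt/restore` on the node; the source path, mount point, and snapshot ID placeholder are all assumptions:

```bash
# List snapshots of the (assumed) pgdata source path
kopia snapshot list /var/lib/postgresql/data
# Restore a chosen snapshot into the re-attached volume's mount
kopia snapshot restore <snapshot-id> /mnt/restore/pgdata
```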
### Option C: Last resort (pg_resetwal)
**Warning:** `pg_resetwal` forces Postgres to create a new checkpoint and can lose recent transactions. Use it only if you have no usable backup and accept possible data loss.
- Scale Postgres down so the PVC is free:

  ```bash
  kubectl scale deploy immich-postgres -n immich --replicas=0
  ```

- Wait until the postgres pod is gone:

  ```bash
  kubectl get pods -n immich -l app=immich-postgres
  ```

- Run the one-off reset job on the same node the PVC was attached to, e.g. `rex` (a sketch of the job manifest follows this list):

  ```bash
  # Edit pg-resetwal-job.yaml and set nodeName to the node that had the postgres pod (e.g. rex)
  kubectl apply -f apps/base/immich/pg-resetwal-job.yaml -n immich
  kubectl wait job/immich-pg-resetwal -n immich --for=condition=complete --timeout=120s
  kubectl logs job/immich-pg-resetwal -n immich
  ```

- Delete the job and scale Postgres back up:

  ```bash
  kubectl delete job immich-pg-resetwal -n immich
  kubectl scale deploy immich-postgres -n immich --replicas=1
  ```

- Recreate Immich extensions if needed (once Postgres is Ready):

  ```bash
  kubectl exec -n immich deploy/immich-postgres -c postgres -- psql -U immich -d immich -c \
    "CREATE EXTENSION IF NOT EXISTS vectors; CREATE EXTENSION IF NOT EXISTS cube; CREATE EXTENSION IF NOT EXISTS earthdistance;"
  ```
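For reference, a sketch of what such a one-off job might contain; the real manifest lives at `apps/base/immich/pg-resetwal-job.yaml`, and the image, data path, and uid below are assumptions that must match the actual Postgres deployment:

```bash
kubectl apply -n immich -f - <<'EOF'
apiVersion: batch/v1
kind: Job
metadata:
  name: immich-pg-resetwal
spec:
  template:
    spec:
      nodeName: rex                  # node where the PVC was last attached
      restartPolicy: Never
      containers:
        - name: pg-resetwal
          # Assumed: must be the same image/major version as the running Postgres
          image: tensorchord/pgvecto-rs:pg14-v0.2.0
          command: ["pg_resetwal", "--force", "/var/lib/postgresql/data"]
          securityContext:
            runAsUser: 999           # postgres uid in the stock image (assumed)
          volumeMounts:
            - name: pgdata
              mountPath: /var/lib/postgresql/data
      volumes:
        - name: pgdata
          persistentVolumeClaim:
            claimName: immich-postgres-pvc
EOF
```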
If Postgres still fails after pg_resetwal, the data directory may be too damaged; then restoring from backup or re-deploying Immich with a new PVC (and re-importing assets) is the only path.
## Preventing recurrence
- Immich Postgres and the server are already pinned off the `oracle` node (nodeAffinity), and the server uses `strategy: Recreate` with `replicas: 1` to avoid RWO Multi-Attach.
- Ensure Postgres is never scheduled on a node that might lose connectivity or be terminated abruptly; keep backups (e.g. Kopia) of the `immich-postgres-pvc`, or regular `pg_dump`/Immich exports (a `pg_dump` sketch follows this list).
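A minimal sketch of the `pg_dump` variant, runnable from any machine with cluster access; the output filename is arbitrary, and the custom format allows selective `pg_restore` later:

```bash
# Dump the immich database to a local archive (pg_dump writes to stdout by default)
kubectl exec -n immich deploy/immich-postgres -c postgres -- \
  pg_dump -U immich -d immich --format=custom > "immich-$(date +%F).dump"
```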