WAL Logs are the transaction logs of PostgreSQL. They contain every query the is executed. That way Postgres is able to verify that the database is in a consistent state after a crash.
If you decide to use WAL Archiving as a Backup-Strategy you also have the possibility to restore the database to any point in time (if you have the neede WAL-Logs), hence the name Point in Time Recovery.
Enabling WAL Archiving
First we need to configure postgres to backup the wal-log-files. The simplest solution is to write them to a NFS-Share.
Therefor we need to edit the postgres-config.
vim /var/lib/pgsql/data/postgresql.conf
# Enable WAL-Log Creation
wal_level = replica
# Enable WAL Archiving
archive_mode = on
This will create Archive-Logs, but we still need a way to back them up. Postgres does not care how you do it and you are free to do what ever works for you.
vim /var/lib/pgsql/data/postgresql.conf
archive_command = 'test ! -f /mnt/postgres-backup/%f && cp %p /mnt/postgres-backup/%f'
Create BaseBackup
Next thing is creating a BaseBackup. This is simply a tar of the /var/lib/pgsql/data
-directory.
When recovering this will generate a base to replay the WAL-Logs.
# -D: Directory to put the BaseBackup
# -R: Create Recovery.conf
# -F: Format the Backup as tar
# -z: use gzip compression for tar files
# -P: Show Progress
# -c: Create a Checkpoint now
$ pg_basebackup -D /mnt/postgres-backup/basebackup/$(date -Iseconds) -R -F tar -z -P -c fast
Restart Postgres
Now restart your Instance to activate the configuration changes.
systemctl restart postgresql
Checking WAL Archiving
WAL Logs will be archived every time Postgres switches to a new logfile. Depending on your configuration and the load on your instance this could take quite a long time.
If you don’t want to wait you can trigger it buy running the following SQL-Command.
SELECT pg_switch_wal();
Now you should see, that the WAL-Logs are copied to the NFS-Share.
If you don’t see any files, have a look in /var/lib/pgsql/data/log/
.
If your error message is something along the lines of “permissions denied”, it could be an SELinux issue.
You can also monitor the archiving process using the “pg_stat_archiver” view.
postgres=# SELECT * FROM "pg_stat_archiver";
archived_count | last_archived_wal | last_archived_time | failed_count | last_failed_wal | last_failed_time | stats_reset
----------------+--------------------------+-------------------------------+--------------+-----------------+------------------+-------------------------------
17 | 000000010000000000000087 | 2024-08-10 06:00:51.242255+02 | 0 | | | 2023-08-04 09:25:25.872096+02
(1 row)
Additional BaseBackups
Right now in case of recovery you need to replay all the logs since the last basebackup.
If you have a lot of WAL-Logs this can take some time.
You can create additional basebackups to speed things up.
pg_basebackup -D /mnt/postgres-backup/basebackup/$(date -Iseconds) -R -F tar -z -P -c fast
If you look at your WAL directory again, you will see a .backup
file in there. This file does not hold database-data but information about the basebackup like a timestamp and the start and end WAL-Log file. If you want to recover the last basebackup you can delete all WAL Files that are older than the .backup
file, but don’t delete them right away! There is a tool for that.
Clean up old WAL-Logs
If you want to delete wal-logs older than your last basebackup, you can do that with pg_archivecleanup
.
Simply specify the name of the .backup
-file in your wal-archive and Postgres will take care of the rest. That way you will not accidentally delete logs that are still needed.
But keep in mind. As soon as you delete the WAL-Logs you can only go back to the basebackup and not to any timestamp.
pg_archivecleanup 00000001000000000000007B.00000028.backup
Disaster Strikes!
Now we can try to recover our database from our backups.
First we need to stop the database.
systemctl stop postgresql
Next we will delete our Database-Directory. You could also use another server and copy the backup files.
sudo rm -rf /var/lib/pgsql/data
Starting the Recovery
First we will restore our basebackup.
sudo cp -r /mnt/postgres/basebackup/<basebackup> /var/lib/pgsql/data
sudo tar -xzf /var/lib/pgsql/data/base.tar.gz
sudo mv /var/lib/pgsql/data/pg_wal.tar.gz /var/lib/pgsql/data/pg_wal/
sudo tar -xzf /var/lib/data/pg_wal/pg_wal.tar.gz
sudo rm -r /var/lib/data/base.tar.gz
sudo rm -r /var/lib/data/pg_wal/pg_wal.tar.gz
sudo chown postgres:postgres -R /var/lib/pgsql
sudo chmod 750 /var/lib/pgsql/data
The next step is to modify the new postgresql.conf
and setup the recovery options.
vim /var/lib/pgsql/data/postgresql.conf
restore_command = 'cp /mnt/postgres-backup/WAL/%f %p'
# optional
# If you only want to recovery up to a certain time, enter a timestamp here
recovery_target_time = '2024-08-10 20:15:00'
In addition to that create a recovery.signal
in the data-directory as will. This will tell Postgres to use the recovery settings and not start as normal.
touch /var/lib/pgsql/data/recovery.signal
After that we can start the database and it should recover from it’s backups
systemctl start postgresql
If you have a look in /var/lib/pgsql/data/log/
you will see the recovery in progress.
As long as revovery is in progress you can only do read-only queries.
If you are missing WAL-Logs because they were not backed up when the database nuked (or in this case go deleted) postgres will be stuck in recovery since it is not successfuly finished.
You can check this with this query.
SELECT pg_is_in_recovery();
It will return true
if it is in recovery and otherwise false
.
The logs will tell you why.
If you are missing WAL-Logs, you have to options. Find the missing logs and place them in the backup-directory or quit the recovery and accept the inherent dataloss.
If you decied to exit the recovery mode because you know that the missing logs are gone and not recoverable, you can promote the instance to a read-write instance.
SELECT pg_promote();
Now your postgres instance is back up and running as usual.