Recovering my Matrix Synapse db
Yesterday, I woke up to find my Element desktop client couldn't connect. After ruling out a firewall issue and reproducing the problem from multiple clients and IPs, I discovered that Synapse could not connect to its database.
journalctl logs:
Jan 08 17:28:52 justmatrix matrix-synapse[343655]: File "/opt/venvs/matrix-synapse/lib/python3.11/site-packages/psycopg2/__init__.py", line 122, in connect
Jan 08 17:28:52 justmatrix matrix-synapse[343655]: conn = _connect(dsn, connection_factory=connection_factory, **kwasync)
Jan 08 17:28:52 justmatrix matrix-synapse[343655]: psycopg2.OperationalError: connection to server at "X.X.X.X", port 5432 failed: Connection refused
Jan 08 17:28:52 justmatrix matrix-synapse[343655]: Is the server running on that host and accepting TCP/IP connections?
Synapse and PostgreSQL are on separate VMs. After checking they could still talk to each other, I noticed postgres wasn’t running and it wouldn’t stay running if I tried to start it.
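For anyone retracing these steps, the checks boil down to something like this (DB_HOST is a placeholder, and the PostgreSQL unit name varies by distro and major version):
# from the Synapse VM: is the postgres port even reachable?
$ nc -vz DB_HOST 5432
# on the database VM: is postgres up, and what is it complaining about?
$ sudo systemctl status postgresql
$ sudo journalctl -u postgresql --since today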
Then I happened to run df and I saw the problem:
$ df -h
Filesystem Size Used Avail Use% Mounted on
udev 2.0G 0 2.0G 0% /dev
tmpfs 393M 448K 393M 1% /run
/dev/vda1 50G 48G 340K 100% /
tmpfs 2.0G 0 2.0G 0% /dev/shm
tmpfs 5.0M 0 5.0M 0% /run/lock
/dev/vda15 124M 12M 113M 10% /boot/efi
tmpfs 393M 0 393M 0% /run/user/1000
The disk was full.
Panic soon set in when I realized my only backup was from almost a year ago. 11 months to be exact. And it was only 10GB then! The database was now 46GB. With only two human users.
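If you want to check the size yourself: du on the data directory works even with postgres down, and pg_database_size works once it's back up (synapsedb being the database name that shows up later in this post):
$ sudo du -sh /var/lib/postgresql
$ sudo -u postgres psql -c "SELECT pg_size_pretty(pg_database_size('synapsedb'));"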
Recover
Restoring that backup was a last resort so I put that idea aside for now. My first instinct was to see if I could just expand that volume. These are DreamCompute OpenStack VMs and I didn’t find a quick way to do it. I was pressed for time so I moved on to the next idea, which was to add a new volume. I could then temporarily move the database to this new volume and try to trim it down from there.
Rough steps were as follows:
- Create new volume. Make sure it is larger than the full volume. Ask me how I know. (For this step and the next, see the CLI sketch after this list.)
- Attach volume to instance.
- Create partition.
$ sudo fdisk /dev/vdb
This is an interactive utility. You can look up how to use it.
- Format disk partition.
$ sudo mkfs -t ext4 /dev/vdb1
Filesystem must be compatible with the old volume.
- Mount disk partition.
$ sudo mount -t auto /dev/vdb1 /mnt/temp
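For the create and attach steps, the OpenStack CLI equivalent looks roughly like this (the size, volume name, and server name here are placeholders, not my actual values):
$ openstack volume create --size 100 temp-volume
$ openstack server add volume my-db-server temp-volume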
Now that the new volume is ready for use, it’s time to move the database and attempt a cleanup. I give credit to this Stack Exchange1 user for these next steps, which I’ll reproduce here:
$ service postgresql stop
$ mv /var/lib/postgresql/9.5/main /mnt/bigdisk
$ ln -sr /mnt/bigdisk/main /var/lib/postgresql/9.5
$ service postgresql start
$ vacuumdb --all --full
$ service postgresql stop
$ rm /var/lib/postgresql/9.5/main
$ mv /mnt/bigdisk/main /var/lib/postgresql/9.5
$ service postgresql start
Reminder to always adapt commands from the internet so that they’re applicable to your system.
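Adapted to the setup above, that would look something like this (assuming a PostgreSQL 15 cluster; substitute your actual major version and data directory):
$ sudo service postgresql stop
$ sudo mv /var/lib/postgresql/15/main /mnt/temp
$ sudo ln -sr /mnt/temp/main /var/lib/postgresql/15
$ sudo service postgresql start
$ sudo -u postgres vacuumdb --all --full
$ sudo service postgresql stop
$ sudo rm /var/lib/postgresql/15/main
$ sudo mv /mnt/temp/main /var/lib/postgresql/15
$ sudo service postgresql start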
Clean db
The next phase was to clean up the database. I do have retention enabled, but it’s clear more is needed to keep the size in check. The Synapse docs have some guidance on database maintenance.23 I also came across this post4 that explains how the database can grow unwieldy. I used this tool5 to automate some of that advice. Afterward, I executed the REINDEX and VACUUM FULL commands on the postgres database.
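Concretely, that looks something like this from a psql session (synapsedb being my database name). Note that VACUUM FULL rewrites tables, so it temporarily needs extra free space, which is part of why doing it while the data lived on the roomier temporary volume made sense:
$ sudo -u postgres psql synapsedb
synapsedb=# REINDEX DATABASE synapsedb;
synapsedb=# VACUUM FULL;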
After moving the database back to the original volume, the disk is now at 81%:
$ df -h
Filesystem Size Used Avail Use% Mounted on
udev 2.0G 0 2.0G 0% /dev
tmpfs 393M 472K 393M 1% /run
/dev/vda1 50G 38G 9.2G 81% /
tmpfs 2.0G 1.1M 2.0G 1% /dev/shm
tmpfs 5.0M 0 5.0M 0% /run/lock
/dev/vda15 124M 12M 113M 10% /boot/efi
/dev/vdb1 98G 24K 93G 1% /mnt/temp
tmpfs 393M 0 393M 0% /run/user/1000
I would like to shrink it some more, but this is enough breathing room for now.
Action items
I spent about a full work day on this recovery, so here are some action items for myself:
- Set up some monitoring (uptime, disk, cpu).
- Take more regular backups.
- Regular database maintenance via script (a rough sketch of both follows this list).
- Reevaluate retention policy.
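For the backup and maintenance items, I’m picturing something along these lines in the postgres user’s crontab (database name, paths, and schedule are placeholders, not a finished setup):
# weekly compressed dump of the Synapse database (% must be escaped in crontabs)
0 3 * * 0 pg_dump -Fc -f /var/backups/synapsedb-$(date +\%F).dump synapsedb
# weekly routine vacuum and analyze (not FULL) to keep bloat and planner stats in check
0 4 * * 0 vacuumdb --all --analyze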
Do you self-host Synapse? Let me know how you take care of your database.
Update (2025-01-28)
It kept growing again so I looked for more answers. I ran this query to list the largest tables:
synapsedb=# SELECT nspname || '.' || relname AS "relation",
pg_size_pretty(pg_total_relation_size(c.oid)) AS "total_size"
FROM pg_class c
LEFT JOIN pg_namespace n ON (n.oid = c.relnamespace)
WHERE nspname NOT IN ('pg_catalog', 'information_schema')
AND c.relkind <> 'i'
AND nspname !~ '^pg_toast'
ORDER BY pg_total_relation_size(c.oid) DESC
LIMIT 20;
relation | total_size
----------------------------------------------+------------
public.state_groups_state | 52 GB
public.device_lists_changes_in_room | 2266 MB
public.received_transactions | 578 MB
public.current_state_delta_stream | 179 MB
public.event_json | 164 MB
public.device_lists_remote_cache | 150 MB
public.e2e_cross_signing_keys | 135 MB
public.device_lists_stream | 105 MB
public.events | 70 MB
public.event_auth | 54 MB
public.event_edges | 53 MB
public.cache_invalidation_stream_by_instance | 51 MB
public.room_memberships | 30 MB
public.event_search | 29 MB
public.event_auth_chain_links | 26 MB
public.state_group_edges | 25 MB
public.state_events | 18 MB
public.event_auth_chains | 16 MB
public.current_state_events | 13 MB
public.state_groups | 13 MB
(20 rows)
Then I searched the Synapse GitHub issues for state_groups_state and found this open issue. That seemed like exactly what I was dealing with, so I decided to try the manual cleanup steps given by csett86. I was disappointed to see it only gave me back a measly 5GB:
relation | total_size
----------------------------------------------+------------
public.state_groups_state | 47 GB
Next, I came across the Rust compressor tool.
This one made a huge difference.
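For reference, getting a binary is the usual Rust routine (assuming a recent Rust toolchain; the debug path below is simply where cargo build drops it):
$ git clone https://github.com/matrix-org/rust-synapse-compress-state
$ cd rust-synapse-compress-state
$ cargo build
# if the top-level build doesn't produce target/debug/synapse_auto_compressor,
# cargo build -p synapse_auto_compressor targets that crate directly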
I had some trouble with the database connection-string syntax, but this finally worked:
~/rust-synapse-compress-state/target/debug$ ./synapse_auto_compressor -p "user=DB_USER password=DB_PASS host=DB_HOST dbname=DB_NAME" -c 500 -n 100
Once that finished, I did a reindex and full vacuum and the new size was:
relation | total_size
----------------------------------------------+------------
public.state_groups_state | 20 GB
Altogether, the end result looks like this:
$ df -h
Filesystem Size Used Avail Use% Mounted on
udev 2.0G 0 2.0G 0% /dev
tmpfs 393M 468K 393M 1% /run
/dev/vda1 50G 25G 23G 52% /
tmpfs 2.0G 1.1M 2.0G 1% /dev/shm
tmpfs 5.0M 0 5.0M 0% /run/lock
/dev/vda15 124M 12M 113M 10% /boot/efi
/dev/vdb1 196G 28K 186G 1% /mnt/temp
tmpfs 393M 0 393M 0% /run/user/1000
Much better compared to the first cleanup attempt. I found an open issue to integrate the compressor tool into Synapse, so that’s another one I’ll be watching.