Recovering my Matrix Synapse db
Yesterday, I woke up to find my Element desktop client couldn't connect. After ruling out a firewall issue and reproducing the problem from multiple clients and IPs, I discovered that Synapse could not connect to its database.
journalctl logs:
Jan 08 17:28:52 justmatrix matrix-synapse[343655]: File "/opt/venvs/matrix-synapse/lib/python3.11/site-packages/psycopg2/__init__.py", line 122, in connect
Jan 08 17:28:52 justmatrix matrix-synapse[343655]: conn = _connect(dsn, connection_factory=connection_factory, **kwasync)
Jan 08 17:28:52 justmatrix matrix-synapse[343655]: psycopg2.OperationalError: connection to server at "X.X.X.X", port 5432 failed: Connection refused
Jan 08 17:28:52 justmatrix matrix-synapse[343655]: Is the server running on that host and accepting TCP/IP connections?
Synapse and PostgreSQL are on separate VMs. After checking they could still talk to each other, I noticed postgres wasn’t running and it wouldn’t stay running if I tried to start it.
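For anyone retracing these steps, the checks boil down to something like this (DB_HOST is a placeholder, and the PostgreSQL unit name varies by distro and major version):
# from the Synapse VM: is the postgres port even reachable?
$ nc -vz DB_HOST 5432
# on the database VM: is postgres up, and what is it complaining about?
$ sudo systemctl status postgresql
$ sudo journalctl -u postgresql --since today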
Then I happened to run df and I saw the problem:
$ df -h
Filesystem Size Used Avail Use% Mounted on
udev 2.0G 0 2.0G 0% /dev
tmpfs 393M 448K 393M 1% /run
/dev/vda1 50G 48G 340K 100% /
tmpfs 2.0G 0 2.0G 0% /dev/shm
tmpfs 5.0M 0 5.0M 0% /run/lock
/dev/vda15 124M 12M 113M 10% /boot/efi
tmpfs 393M 0 393M 0% /run/user/1000
The disk was full.
Panic soon set in when I realized my only backup was from almost a year ago. 11 months to be exact. And it was only 10GB then! The database was now 46GB. With only two human users.
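If you want to check the size yourself: du on the data directory works even with postgres down, and pg_database_size works once it's back up (synapsedb being the database name that shows up later in this post):
$ sudo du -sh /var/lib/postgresql
$ sudo -u postgres psql -c "SELECT pg_size_pretty(pg_database_size('synapsedb'));"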
Recover
Restoring that backup was a last resort so I put that idea aside for now. My first instinct was to see if I could just expand that volume. These are DreamCompute OpenStack VMs and I didn’t find a quick way to do it. I was pressed for time so I moved on to the next idea, which was to add a new volume. I could then temporarily move the database to this new volume and try to trim it down from there.
Rough steps were as follows:
- Create new volume. Make sure it is larger than the full volume. Ask me how I know. (For this step and the next, see the CLI sketch after this list.)
- Attach volume to instance.
- Create partition.
$ sudo fdisk /dev/vdb
This is an interactive utility. You can look up how to use it.
- Format disk partition.
$ sudo mkfs -t ext4 /dev/vdb1
Filesystem must be compatible with the old volume.
- Mount disk partition.
$ sudo mount -t auto /dev/vdb1 /mnt/temp
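For the create and attach steps, the OpenStack CLI equivalent looks roughly like this (the size, volume name, and server name here are placeholders, not my actual values):
$ openstack volume create --size 100 temp-volume
$ openstack server add volume my-db-server temp-volume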
Now that the new volume is ready for use, it’s time to move the database and attempt a cleanup. I give credit to this Stack Exchange1 user for these next steps, which I’ll reproduce here:
$ service postgresql stop
$ mv /var/lib/postgresql/9.5/main /mnt/bigdisk
$ ln -sr /mnt/bigdisk/main /var/lib/postgresql/9.5
$ service postgresql start
$ vacuumdb --all --full
$ service postgresql stop
$ rm /var/lib/postgresql/9.5/main
$ mv /mnt/bigdisk/main /var/lib/postgresql/9.5
$ service postgresql start
Reminder to always adapt commands from the internet so that they’re applicable to your system.
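Adapted to the setup above, that would look something like this (assuming a PostgreSQL 15 cluster; substitute your actual major version and data directory):
$ sudo service postgresql stop
$ sudo mv /var/lib/postgresql/15/main /mnt/temp
$ sudo ln -sr /mnt/temp/main /var/lib/postgresql/15
$ sudo service postgresql start
$ sudo -u postgres vacuumdb --all --full
$ sudo service postgresql stop
$ sudo rm /var/lib/postgresql/15/main
$ sudo mv /mnt/temp/main /var/lib/postgresql/15
$ sudo service postgresql start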
Clean db
The next phase was to clean up the database. I do have retention enabled, but it’s clear more is needed to keep the size in check. The Synapse docs have some guidance on database maintenance.23 I also came across this post4 that explains how the database can grow unwieldy. I used this tool5 to automate some of that advice. Afterward, I executed the REINDEX and VACUUM FULL commands on the postgres database.
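Concretely, that looks something like this from a psql session (synapsedb being my database name). Note that VACUUM FULL rewrites tables, so it temporarily needs extra free space, which is part of why doing it while the data lived on the roomier temporary volume made sense:
$ sudo -u postgres psql synapsedb
synapsedb=# REINDEX DATABASE synapsedb;
synapsedb=# VACUUM FULL;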
After moving the database back to the original volume, the disk is now at 81%:
$ df -h
Filesystem Size Used Avail Use% Mounted on
udev 2.0G 0 2.0G 0% /dev
tmpfs 393M 472K 393M 1% /run
/dev/vda1 50G 38G 9.2G 81% /
tmpfs 2.0G 1.1M 2.0G 1% /dev/shm
tmpfs 5.0M 0 5.0M 0% /run/lock
/dev/vda15 124M 12M 113M 10% /boot/efi
/dev/vdb1 98G 24K 93G 1% /mnt/temp
tmpfs 393M 0 393M 0% /run/user/1000
I would like to shrink it some more, but this is enough breathing room for now.
Action items
I spent about a full work day on this recovery, so here are some action items for myself:
- Set up some monitoring (uptime, disk, cpu).
- Take more regular backups.
- Regular database maintenance via script (a rough sketch of both follows this list).
- Reevaluate retention policy.
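For the backup and maintenance items, I’m picturing something along these lines in the postgres user’s crontab (database name, paths, and schedule are placeholders, not a finished setup):
# weekly compressed dump of the Synapse database (% must be escaped in crontabs)
0 3 * * 0 pg_dump -Fc -f /var/backups/synapsedb-$(date +\%F).dump synapsedb
# weekly routine vacuum and analyze (not FULL) to keep bloat and planner stats in check
0 4 * * 0 vacuumdb --all --analyze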
Do you self-host Synapse? Let me know how you take care of your database.
Update (2025-01-28)
It kept growing again so I looked for more answers. I ran this query to list the largest tables:
synapsedb=# SELECT nspname || '.' || relname AS "relation",
pg_size_pretty(pg_total_relation_size(c.oid)) AS "total_size"
FROM pg_class c
LEFT JOIN pg_namespace n ON (n.oid = c.relnamespace)
WHERE nspname NOT IN ('pg_catalog', 'information_schema')
AND c.relkind <> 'i'
AND nspname !~ '^pg_toast'
ORDER BY pg_total_relation_size(c.oid) DESC
LIMIT 20;
relation | total_size
----------------------------------------------+------------
public.state_groups_state | 52 GB
public.device_lists_changes_in_room | 2266 MB
public.received_transactions | 578 MB
public.current_state_delta_stream | 179 MB
public.event_json | 164 MB
public.device_lists_remote_cache | 150 MB
public.e2e_cross_signing_keys | 135 MB
public.device_lists_stream | 105 MB
public.events | 70 MB
public.event_auth | 54 MB
public.event_edges | 53 MB
public.cache_invalidation_stream_by_instance | 51 MB
public.room_memberships | 30 MB
public.event_search | 29 MB
public.event_auth_chain_links | 26 MB
public.state_group_edges | 25 MB
public.state_events | 18 MB
public.event_auth_chains | 16 MB
public.current_state_events | 13 MB
public.state_groups | 13 MB
(20 rows)
Then I searched the Synapse GitHub issues for state_groups_state and found this open issue. That seemed like exactly what I was dealing with, so I decided to try the manual cleanup steps given by csett86. I was disappointed to see it only gave me back a measly 5GB:
relation | total_size
----------------------------------------------+------------
public.state_groups_state | 47 GB
Next, I came across the Rust compressor tool.
This one made a huge difference.
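For reference, getting a binary is the usual Rust routine (assuming a recent Rust toolchain; the debug path below is simply where cargo build drops it):
$ git clone https://github.com/matrix-org/rust-synapse-compress-state
$ cd rust-synapse-compress-state
$ cargo build
# if the top-level build doesn't produce target/debug/synapse_auto_compressor,
# cargo build -p synapse_auto_compressor targets that crate directly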
I had some trouble with the database connection-string syntax, but this finally worked:
~/rust-synapse-compress-state/target/debug$ ./synapse_auto_compressor -p "user=DB_USER password=DB_PASS host=DB_HOST dbname=DB_NAME" -c 500 -n 100
Once that finished, I did a reindex and full vacuum and the new size was:
relation | total_size
----------------------------------------------+------------
public.state_groups_state | 20 GB
Altogether, the end result looks like this:
$ df -h
Filesystem Size Used Avail Use% Mounted on
udev 2.0G 0 2.0G 0% /dev
tmpfs 393M 468K 393M 1% /run
/dev/vda1 50G 25G 23G 52% /
tmpfs 2.0G 1.1M 2.0G 1% /dev/shm
tmpfs 5.0M 0 5.0M 0% /run/lock
/dev/vda15 124M 12M 113M 10% /boot/efi
/dev/vdb1 196G 28K 186G 1% /mnt/temp
tmpfs 393M 0 393M 0% /run/user/1000
Much better compared to the first cleanup attempt. I found an open issue to integrate the compressor tool into Synapse, so that’s another one I’ll be watching.