XenForo scale-out for high availability

stromb0li

Active member
Hey XenForo,

Is there any documentation on how to scale out XenForo for high availability? Database and cache connectivity are pretty straightforward, but I'm more curious how files (e.g. forum attachments or gallery images) can stay in sync between deployments, both the XenForo PHP files and the data persisted to disk.

Thank you!
 
Would it not be as simple as running a single load balancer, two web servers, and an HA DB? Or am I missing something here?
Yes (ignoring that a single load balancer is itself a SPOF and therefore doesn't provide true HA anyway).

There are basically three types of "data":
  1. Database (MySQL / MariaDB, Percona)
    This is rather easy to scale for HA, as there are "standard" approaches to do so (Galera, active-active semi-sync replication, active-passive replication, etc.)
  2. File-based data in the virtual filesystems data / internal_data
    This is also not that complicated, as it can be put on some kind of POSIX-compliant shared filesystem (CIFS/SMB, NFS, GlusterFS, CephFS, etc.) or can use a de facto "standard" S3-compatible object storage service (Amazon S3, DigitalOcean Spaces, Cloudflare R2, MinIO, etc.); see the config sketch after this list.
  3. Code
    This is where things get a bit complicated.

    XenForo of course needs its PHP code files, which are static - unless add-on ZIP upload is enabled.
    If it is, XenForo writes to its root filesystem, so those changes need to be kept in sync (and OpCache invalidated).

    But there is not only static code; there is also cached code (compiled from templates, etc.) - this also needs to be kept in sync, as it can change at any time (if a template is modified, or an ad, a widget, a navigation entry, ...).

    While it is possible to also keep this on a shared filesystem (again => CIFS/SMB, NFS, GlusterFS, CephFS, etc.), this is certainly not ideal for performance reasons.

    Keeping code local and in sync is doable, but AFAIK there is no easy "standard" way to do it, so one has to build some bits oneself.
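
For illustration, a minimal config.php sketch for the object-storage route mentioned in point 2, along the lines of XenForo's S3 file-storage guide (assuming the AWS SDK and the league/flysystem-aws-s3-v3 adapter are installed; keys, bucket and endpoint are placeholders):
Code:
$s3 = function()
{
    return new \Aws\S3\S3Client([
        'credentials' => [
            'key' => 'YOUR-ACCESS-KEY',    // placeholder
            'secret' => 'YOUR-SECRET-KEY'  // placeholder
        ],
        'region' => 'us-east-1',
        'version' => 'latest',
        'endpoint' => 'https://s3.us-east-1.amazonaws.com' // or your R2/MinIO endpoint
    ]);
};

// Route the internal_data virtual filesystem (attachments etc.) to the bucket
$config['fsAdapters']['internal-data'] = function() use ($s3)
{
    return new \League\Flysystem\AwsS3v3\AwsS3Adapter($s3(), 'your-bucket', 'internal_data');
};

// Route public data (avatars etc.) to the bucket and serve it straight from there
$config['fsAdapters']['data'] = function() use ($s3)
{
    return new \League\Flysystem\AwsS3v3\AwsS3Adapter($s3(), 'your-bucket', 'data');
};
$config['externalDataUrl'] = function($externalPath, $canonical)
{
    return 'https://your-bucket.s3.us-east-1.amazonaws.com/data/' . $externalPath;
};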
 
Really appreciate the thorough write-up. #3 is my exact concern. The database is easy and I can mount a share for persistent data, but I'm not sure about cached code, plugin updates, etc. AFAIK, XenForo in Docker isn't officially supported either.
 
I use MySQL Cluster (specifically the ndbcluster storage engine). It lets you do tens of millions of queries per second (including writes), it's ACID-compliant, and it's designed for zero downtime for end users (even during server maintenance/upgrades/backups, or unexpected events like servers being unplugged). It was originally designed for telco use, for call logging and call routing in phone systems where you can't take the database down for any reason. It does have high resource requirements though... for example, I have my servers interconnected via 54Gbit InfiniBand (the bottleneck when pushing around that many queries is communication between the nodes).

To get that kind of throughput, by default all data and indexes are stored in-memory, with everything stored on at least two independent nodes (which is why it survives things like nodes being unplugged or taken down for maintenance); keeping everything in-memory is also what makes that throughput possible. All nodes are fully write-capable. In my case, I have 8 physical servers with 1TB of RAM each, of which 256GB per server is allocated to the data node. So I have 2TB of RAM allocated to the databases/indexes, then cut that in half because everything lives on 2 different physical servers for redundancy... so 1TB "usable". I'm currently using 83% of that (so I have 174GB available for use). You can also increase/decrease that on the fly.

Code:
-- NDB Cluster -- Management Client --
ndb_mgm> all report mem
Connected to Management Server at: localhost:1186
Node 11: Data usage is 83%(6433357 32K pages of total 7707291)
Node 11: Index usage is 34%(681317 32K pages of total 1955251)
Node 12: Data usage is 83%(6433121 32K pages of total 7707285)
Node 12: Index usage is 34%(681323 32K pages of total 1955487)
Node 13: Data usage is 83%(6436872 32K pages of total 7707208)
Node 13: Index usage is 34%(681400 32K pages of total 1951736)
Node 14: Data usage is 83%(6440679 32K pages of total 7707135)
Node 14: Index usage is 34%(681473 32K pages of total 1947929)
Node 15: Data usage is 83%(6433374 32K pages of total 7707236)
Node 15: Index usage is 34%(681372 32K pages of total 1955234)
Node 16: Data usage is 83%(6433148 32K pages of total 7707302)
Node 16: Index usage is 34%(681306 32K pages of total 1955460)
Node 17: Data usage is 83%(6436924 32K pages of total 7707216)
Node 17: Index usage is 34%(681392 32K pages of total 1951684)
Node 18: Data usage is 83%(6440450 32K pages of total 7707239)
Node 18: Index usage is 34%(681369 32K pages of total 1948158)

MySQL Cluster has different types of nodes... NDB = data node, MGM = management node, API = API node (the "normal" mysqld that handles connections to the data nodes for querying data).

Code:
ndb_mgm> show
Cluster Configuration
---------------------
[ndbd(NDB)]    8 node(s)
id=11    @192.168.10.20  (mysql-5.7.33 ndb-7.6.17, Nodegroup: 0)
id=12    @192.168.10.21  (mysql-5.7.33 ndb-7.6.17, Nodegroup: 1, *)
id=13    @192.168.10.22  (mysql-5.7.33 ndb-7.6.17, Nodegroup: 2)
id=14    @192.168.10.23  (mysql-5.7.33 ndb-7.6.17, Nodegroup: 3)
id=15    @192.168.10.24  (mysql-5.7.33 ndb-7.6.17, Nodegroup: 0)
id=16    @192.168.10.25  (mysql-5.7.33 ndb-7.6.17, Nodegroup: 1)
id=17    @192.168.10.26  (mysql-5.7.33 ndb-7.6.17, Nodegroup: 2)
id=18    @192.168.10.27  (mysql-5.7.33 ndb-7.6.17, Nodegroup: 3)

[ndb_mgmd(MGM)]    2 node(s)
id=1    @192.168.10.20  (mysql-5.7.33 ndb-7.6.17)
id=2    @192.168.10.21  (mysql-5.7.33 ndb-7.6.17)

[mysqld(API)]    33 node(s)
id=21    @192.168.10.20  (mysql-5.7.33 ndb-7.6.17)
id=22    @192.168.10.21  (mysql-5.7.33 ndb-7.6.17)
id=23    @192.168.10.22  (mysql-5.7.33 ndb-7.6.17)
id=24    @192.168.10.23  (mysql-5.7.33 ndb-7.6.17)
id=25    @192.168.10.24  (mysql-5.7.33 ndb-7.6.17)
id=26    @192.168.10.25  (mysql-5.7.33 ndb-7.6.17)
id=27    @192.168.10.26  (mysql-5.7.33 ndb-7.6.17)
id=28    @192.168.10.27  (mysql-5.7.33 ndb-7.6.17)
id=31    @192.168.10.20  (mysql-5.7.33 ndb-7.6.17)
id=32    @192.168.10.21  (mysql-5.7.33 ndb-7.6.17)
id=33    @192.168.10.22  (mysql-5.7.33 ndb-7.6.17)
id=34    @192.168.10.23  (mysql-5.7.33 ndb-7.6.17)
id=35    @192.168.10.24  (mysql-5.7.33 ndb-7.6.17)
id=36    @192.168.10.25  (mysql-5.7.33 ndb-7.6.17)
id=37    @192.168.10.26  (mysql-5.7.33 ndb-7.6.17)
id=38    @192.168.10.27  (mysql-5.7.33 ndb-7.6.17)
id=41    @192.168.10.20  (mysql-5.7.33 ndb-7.6.17)
id=42    @192.168.10.21  (mysql-5.7.33 ndb-7.6.17)
id=43    @192.168.10.22  (mysql-5.7.33 ndb-7.6.17)
id=44    @192.168.10.23  (mysql-5.7.33 ndb-7.6.17)
id=45    @192.168.10.24  (mysql-5.7.33 ndb-7.6.17)
id=46    @192.168.10.25  (mysql-5.7.33 ndb-7.6.17)
id=47    @192.168.10.26  (mysql-5.7.33 ndb-7.6.17)
id=48    @192.168.10.27  (mysql-5.7.33 ndb-7.6.17)
id=51    @192.168.10.20  (mysql-5.7.33 ndb-7.6.17)
id=52    @192.168.10.21  (mysql-5.7.33 ndb-7.6.17)
id=53    @192.168.10.22  (mysql-5.7.33 ndb-7.6.17)
id=54    @192.168.10.23  (mysql-5.7.33 ndb-7.6.17)
id=55    @192.168.10.24  (mysql-5.7.33 ndb-7.6.17)
id=56    @192.168.10.25  (mysql-5.7.33 ndb-7.6.17)
id=57    @192.168.10.26  (mysql-5.7.33 ndb-7.6.17)
id=58    @192.168.10.27  (mysql-5.7.33 ndb-7.6.17)
id=59 (not connected, accepting connect from any host)

I haven't gotten around to upgrading to MySQL 8 yet, because it always scares me to touch things when everything runs great. The last time I attempted to go to 8.0 (in 2021), I ran into a MySQL bug that forced a rollback to 7.6; before that, when I tried in 2020, it was a different bug.

One of these days when I'm feeling brave, I'll try it again... hah

For storage of things that change often but aren't accessed as often as XenForo's PHP files (user-uploaded content like avatars and attachments), I used to use GlusterFS, but I've since moved most of that to Cloudflare R2.

For things that don't change often (PHP files, template/phrase edits, etc.), I use csync2 to keep all the servers synced (having the files locally is faster than networked filesystems like Gluster). The csync2 command is triggered automatically by the Filesystem add-on as needed, whenever something in the code cache or the add-on development process changes a file.
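
As a rough illustration of what such a trigger can look like (a hypothetical hook, not the actual Filesystem add-on code; csync2's own "action" blocks would handle the opcache side on the receiving servers):
Code:
<?php
// Hypothetical post-write hook: after a compiled template or add-on file
// changes locally, push it to the peer servers and drop the stale opcode.
function syncCodeFile(string $path): void
{
    // csync2 -x: find dirty files and synchronize them to all peers
    exec('csync2 -x 2>&1', $output, $rc);
    if ($rc !== 0) {
        error_log("csync2 sync failed: " . implode("\n", $output));
    }

    // Invalidate the locally cached bytecode for the changed file; the
    // receiving nodes need the same, e.g. via a csync2 action { exec ...; }
    if (function_exists('opcache_invalidate')) {
        opcache_invalidate($path, true); // true = force, ignore revalidate timestamps
    }
}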

In my setup, all servers can do any task... all run Nginx, PHP-FPM, memcached, etc. Since MySQL Cluster is fully write-capable on any node, all the web servers simply talk to MySQL at localhost, even though they are physically different servers.
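
For illustration, that means each web node's config.php just points at its own local API node (credentials and database name here are placeholders):
Code:
// config.php on each web node: talk to the local mysqld (API node),
// which handles the communication with the NDB data nodes.
$config['db']['host'] = '127.0.0.1';
$config['db']['port'] = 3306;
$config['db']['username'] = 'xf_user';   // placeholder
$config['db']['password'] = '********';  // placeholder
$config['db']['dbname'] = 'xenforo';     // placeholder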
 
Those are generally good answers, but to add to them:
1. Stay away from NFS if you can; it is the most easily available option, but it has rather poor reliability and limited POSIX compatibility.
2. If you use Redis, run replicas colocated with each XF instance; the difference between 0ms and 2ms latency (i.e. within a LAN) was quite significant for us, at least. You then typically make that local copy a replica in a Redis Sentinel setup. And because direct Sentinel support in PHP is terrible, you'll want the write config pointing at something else that tracks the current master (we use HAProxy for that); a sketch follows below.
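
For example, a minimal sketch of the XenForo side of such a setup (the HAProxy frontend port is a made-up placeholder, and whether reads can be split off to the colocated replica depends on the Redis cache provider in use):
Code:
// config.php sketch: XenForo's cache talks to a local endpoint; HAProxy
// listens on 127.0.0.1:6390 (hypothetical port) and forwards to whichever
// Redis node Sentinel currently reports as master, so PHP itself never
// needs to speak the Sentinel protocol.
$config['cache']['enabled'] = true;
$config['cache']['provider'] = 'Redis';
$config['cache']['config'] = [
    'host' => '127.0.0.1',
    'port' => 6390
];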

As far as filesystems are concerned, I'd recommend CephFS, but it is in the "nice when you have it" category, as getting a performant Ceph cluster up and running isn't really for the faint of heart (and often not a good idea economically speaking). But if you have one, it will do the job very well.

AFAIK, XenForo in Docker isn't officially supported either.
Minor bit but that doesn't matter much in practice.
You just package it like any other PHP app, and volume-mount data and internal_data. Works just fine for us in k8s:
Code:
NAME                             READY   STATUS    RESTARTS   AGE
redis-haproxy-565447b7c9-9prdc   1/1     Running   0          41d
redis-haproxy-565447b7c9-kcmjx   1/1     Running   0          41d
redis-haproxy-565447b7c9-v5l4q   1/1     Running   0          12d
xenforo-5d5cd68cb7-6lzb7         4/4     Running   0          3d9h
xenforo-5d5cd68cb7-7q2gq         4/4     Running   0          7d7h
xenforo-5d5cd68cb7-lgxq2         4/4     Running   0          7d7h
xenforo-5d5cd68cb7-q6ksg         4/4     Running   0          7d7h
xenforo-5d5cd68cb7-qj7dd         4/4     Running   0          7d7h
xenforo-5d5cd68cb7-qwmfs         4/4     Running   0          7d7h
xenforo-5d5cd68cb7-s7gbs         4/4     Running   0          7d7h
xenforo-5d5cd68cb7-tlgpv         4/4     Running   0          7d7h
xenforo-cron-68d574d76b-5x6gb    1/1     Running   0          41d
 