Behind the scenes: How we host Ars Technica

Ironically, AWS does not have a good track record with outages.
Mostly depends on your region. Anyone using their us-east-1 zone is going to have a terrible time, and that's the one you usually hear about when people say "AWS outages". Overall they're very much reliable if you use other regions.

As for the cost... idk. They're (really) expensive BW-wise, but everything else is alright enough if you manage to run on spot instances (and realistically, there's no reason you wouldn't be able to offload some ~80% of your compute on it).
Sure, on-prem is (much!) cheaper, but it also comes with many hassles, especially for large web platforms. Managing 1 on-prem server is fine, but managing 20, 50, 100, ... quickly gets extraordinarily tedious, and the generic fleet management tools on the market are essentially all painful at best. And you quickly miss a service discovery system, an IAM system, ... etc.*

Overall, unless your web platform has massive bandwidth requirements specifically, eventually it's probably not that bad of a deal if you do it correctly settings-wise.
In our case, we do have specifically massive bandwidth reqs (last I checked their calculator, it quoted somewhere between $50-80k per month depending on user traffic sources for BW alone), so we don't use AWS, but I wouldn't be so quick to judge people that do.
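(For scale: assuming the standard ~$0.09/GB internet egress tier, which is an assumption on my part since the exact rate varies by region and volume, $50k/month works out to $50,000 ÷ $0.09/GB ≈ 555 TB of egress per month.)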

--
*: All these exist in various forms as selfhostable options, but they're equally painful to deploy, maintain, backup, test, ... by yourself. It's doable, but you have to be ok with learning things the hard way all the time.
 
Mostly depends on your region. Anyone using their us-east-1 zone is going to have a terrible time, and that's the one you usually hear about when people say "AWS outages". Overall they're very much reliable if you use other regions.

AWS, the "butter face" of hosting?

As for the cost... idk. They're (really) expensive BW-wise, but everything else is alright enough if you manage to run on spot instances (and realistically, there's no reason you wouldn't be able to offload some ~80% of your compute on it).
Sure, on-prem is (much!) cheaper, but it also comes with many hassles, especially for large web platforms. Managing 1 on-prem server is fine, but managing 20, 50, 100, ... quickly gets extraordinarily tedious, and the generic fleet management tools on the market are essentially all painful at best. And you quickly miss a service discovery system, an IAM system, ... etc.*

Overall, unless your web platform has massive bandwidth requirements specifically, eventually it's probably not that bad of a deal if you do it correctly settings-wise.

I've lost count of how many sites without massive bandwidth requirements have slashed their web infrastructure costs by up to 90% by moving off AWS. Higher costs are not only associated with their BW.

The settings that were incorrect: the decision maker took the "safe route" rather than the better route. It's not much different from back in IBM's heyday, when businesses would spend far more on big iron hardware than they needed to. The saying was "nobody gets fired for choosing IBM". Instead of investigating other options, people would automatically go with IBM.

Throw in AWS's learning curve compared to many of their competitors and you have additional hidden costs adding up. The AWS user interface is unintuitive, with a maze of options and features scattered haphazardly.

In our case, we do have specifically massive bandwidth reqs (last I checked their calculator, it quoted somewhere between $50-80k per month depending on user traffic sources for BW alone), so we don't use AWS, but I wouldn't be so quick to judge people that do.

--
*: All these exist in various forms as selfhostable options, but they're equally painful to deploy, maintain, backup, test, ... by yourself. It's doable, but you have to be ok with learning things the hard way all the time.

What good is the ability to scale up quickly (which many others can provide) if it costs you dearly? AWS doesn't have a monopoly on scaling, and others can do it without the massive increase in costs.

To me, it's not a black-and-white choice between AWS and self-hosted. There are plenty of "cloud" providers who can provide the essential needs of a scalable operation.

I applaud Xenforo for going with someone other than AWS. It's a win, win, win: the hosting requirements are met, the costs are lower for Xenforo, helping the company, and costs are lower for their cloud clients.
 
I applaud Xenforo for going with someone other than AWS. It's a win, win, win: the hosting requirements are met, the costs are lower for Xenforo, helping the company, and costs are lower for their cloud clients.
After I was corrected on what XF Cloud runs on (it's Vultr, apologies again) I thought I'd check various sites out of curiosity, eg BBC, Zen (ISP), Facebook, etc. When I checked xenforo.com, it's on Cloudflare, so I'm wondering why the difference. Do you have any ideas?

For giggles, I also checked aws.com and surprise surprise it runs on AWS lol.
 
Well, I compared the website mentioned by the OP with one that I manage:

Arstechnica: https://www.similarweb.com/website/arstechnica.com/#overview
[Screenshot: SimilarWeb traffic overview for arstechnica.com]

The one I manage: (regional content)
[Screenshot: SimilarWeb traffic overview for the site I manage]

I don't know how much they are spending on servers but we spend like $100 (Dedicated Server) + $25 (Cloudflare) + $5 (GDrive for Backups) = $130/month

CF stats are as follows (custom addons for performance ;)): it's the weekend and vacation season so it's about 91K, otherwise it's mostly above 100K.
[Screenshot: Cloudflare traffic stats]
 
I applaud Xenforo for going with someone other than AWS. It's a win, win, win: the hosting requirements are met, the costs are lower for Xenforo, helping the company, and costs are lower for their cloud clients.

Our provider was also chosen after many, many weeks of trying various other providers, pushing servers to the limit and even breaking them once or twice, and weighing the availability, the response times and quality of support, the communication, the API endpoints, the security options, the ethos of the company, etc.

We hope that our reputation for uncompromising quality in our products is evident, and this was one of the main factors, if not the main one, in how we built XenForo Cloud and in the ultimate choice we went with, as we felt they matched our company and requirements best.
 
After I was corrected on what XF Cloud runs on (it's Vultr, apologies again) I thought I'd check various sites out of curiosity, eg BBC, Zen (ISP), Facebook, etc. When I checked xenforo.com, it's on Cloudflare, so I'm wondering why the difference. Do you have any ideas?

For giggles, I also checked aws.com and surprise surprise it runs on AWS lol.

Cloudflare is basically a security front end to a site, not the actual site. The outside world sees Cloudflare, while on the back end Cloudflare is sending the request to the actual web server and returning the reply. This way it can mitigate attacks, and also reduce actual server load by serving cached items from its own server nodes.
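If you want to picture what that looks like, here's a minimal Worker-style sketch of the proxy-and-cache flow (purely illustrative on my part; this isn't what Cloudflare's edge actually runs internally):

[CODE=typescript]
export default {
  async fetch(request: Request, env: unknown, ctx: ExecutionContext): Promise<Response> {
    const cache = caches.default;

    // Serve from the edge cache when we already hold a copy...
    let response = await cache.match(request);
    if (!response) {
      // ...otherwise pass the request through to the actual web server,
      // and keep a copy of the reply for the next visitor.
      response = await fetch(request);
      if (request.method === "GET" && response.ok) {
        ctx.waitUntil(cache.put(request, response.clone()));
      }
    }
    return response;
  },
};
[/CODE]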
 
XenForo uses Cloudflare protection - one benefit of which is that it masks the original IP of the server.

Cloudflare is basically a security front end to a site, not the actual site. The outside world sees Cloudflare, while on the back end Cloudflare is sending the request to the actual web server and returning the reply. This way it can mitigate attacks, and also reduce actual server load by serving cached items from its own server nodes.

Yup, I thought they also did hosting, hence my question. A quick check on their website fixed that notion lol.

So, of course, this explains why xenforo.com appears to be hosted by them. Presumably they're actually on Vultr, given what Slavik said:

Our provider was also chosen after many, many weeks of trying various other providers, pushing servers to the limit and even breaking them once or twice, and weighing the availability, the response times and quality of support, the communication, the API endpoints, the security options, the ethos of the company, etc.

We hope that our reputation for uncompromising quality in our products is evident, and this was one of the main factors, if not the main one, in how we built XenForo Cloud and in the ultimate choice we went with, as we felt they matched our company and requirements best.
 
AWS, the "butter face" of hosting?
Not entirely sure what you imply here to be honest; at least not from how I understand the meaning of "butter face" :confused:

I've lost count of how many sites without massive bandwidth requirements have slashed their web infrastructure costs by up to 90% by moving off AWS. Higher costs are not only associated with their BW.
Yes, you can definitely ruin yourself (or your company) in many creative ways with only a few checkboxes; I was merely pointing out that most of those ways are misconfiguration-driven. And we can agree that a lot of that is to blame on their counter-intuitive UI and on how many easy-to-click checkboxes incur serious fees.
But my point was: with bandwidth costs it doesn't matter if you're an expert; you're getting a bad deal and that's it.

Now, should 90% of websites just use some simple shared hosting service for a while before blindly diving into managing infra sold at a premium with API-driven management as a core feature? Yes.
Many times yes.

Throw in AWS's learning curve compared to many of their competitors and you have additional hidden costs adding up. The AWS user interface is unintuitive, with a maze of options and features scattered haphazardly.
Depends on who you see as competitors, I'd say. If you mean that it's more complex than most/all "smaller" hosting services, then yes, definitely.
But is it more complex than Azure, GCP, and the like? Not really, tbh. And those are "[AWS's] competitors".

Though yes, their UI sucks; that I would not disagree with.
However, I'd note that many (if not most) extensive users of their services eventually gravitate towards the CLI and other management tools à la Terraform and friends, which means they don't really use the UI in the first place.
Does it suck for everyone else not as deeply invested in cloud configuration memes? Yeah. Does that explain why they have no incentive to improve it? Yeah, also.
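For the "API-driven" part, here's the kind of thing heavy users do instead of clicking through the console (AWS SDK for JavaScript v3; a trivial read-only example, and in practice most of this gets wrapped up in Terraform/CloudFormation anyway):

[CODE=typescript]
import { EC2Client, DescribeInstancesCommand } from "@aws-sdk/client-ec2";

// Enumerate every instance in a region without ever opening the console UI.
const ec2 = new EC2Client({ region: "eu-west-1" });
const result = await ec2.send(new DescribeInstancesCommand({}));

for (const reservation of result.Reservations ?? []) {
  for (const instance of reservation.Instances ?? []) {
    console.log(instance.InstanceId, instance.State?.Name);
  }
}
[/CODE]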

What good is the ability to scale up quickly (which many others can provide) if it costs you dearly? AWS doesn't have a monopoly on scaling, and others can do it without the massive increase in costs.
The raw cost is only relevant relative to per-user revenue, though. In the case of many web platforms (of which many fall under a broad definition of "e-commerce"), the marginal extra infra cost per user is completely irrelevant even with the "AWS tax" on the actual cost of service, given that people don't really browse those websites for entertainment, but with the intent to spend.

That said, I agree that in many cases (and maybe I sound like a shill here so let me reiterate that we don't run on AWS, nor would I recommend it in general) it doesn't make sense compared to other options.

As for a monopoly on scaling: indeed they're not the only ones, but I also didn't suggest that this was a strength of theirs. (And I don't know of very many web platforms that would actually need more than a couple dozen servers to run, at most.)
Their strength is instead very much the breadth of services they offer and integrate together and allow you to control fully via APIs.
They might suck at the depth part of it (many of these services are bare-minimum functionality to be marketable), but they are still there and working. And that breadth is very much unique.

I don't know how much they are spending on servers but we spend like $100 (Dedicated Server) + $25 (Cloudflare) + $5 (GDrive for Backups) = $130/month
Idk how much they are spending, but I'd expect at least one or two extra 0s on their monthly bill, going by the screenshots they shared.
That said, your case is also peculiar. I don't know what the site is for, but the numbers suggest a rather media-lightweight site with very good cacheability (and to be fair, AT does look very cacheable from here).

At least in our case, we definitely couldn't run that cheap, and while we don't run on CF (so I can't exactly match the CF dashboard style), our costs are around $3000/month after many very friendly discounts all the way through.
[Screenshots: bandwidth graphs]
(Ignore the unclear "bytes_out" label on the middle graph: the data it pulls is in bytes but multiplied by 8 to match the plot's axis, which is in bits/s, while the monthly sum in the top right is correctly in bytes, i.e. ×1.)

Yup, I thought they also did hosting, hence my question. A quick check on their website fixed that notion lol.
They also do hosting, in some limited fashion. Rather than you managing a server directly, they will run your code on their own servers (the product is called Cloudflare Workers). But for something like XF they definitely aren't using that bit, at least not for the core of XF.

Also, in the end, whichever host XF Cloud might have picked would be just fine. Quality of service is achieved through redundancy and fallback plans, rather than by betting that paying 30% extra (or however much) means your provider will never fail you.
 
They also do hosting, in some limited fashion. Rather than you managing a server directly, they will run your code on their own servers (the product is called Cloudflare Workers). But for something like XF they definitely aren't using that bit, at least not for the core of XF.

Also, in the end, whichever host XF Cloud might have picked would be just fine. Quality of service is achieved through redundancy and fallback plans, rather than by betting that paying 30% extra (or however much) means your provider will never fail you.
Yeah, I saw that and figured it wouldn't be what XF wanted.

I've had some big differences with XF staff, but one thing I can vouch for is the product quality. The self-hosted option was great and so is their hosted option, which hasn't had any downtime since I took it out last year. I'm also happy to take the recent announcements by Chris D and Kier at face value.
 
I've had some big differences with XF staff, but one thing I can vouch for is the product quality. The self-hosted option was great and so is their hosted option, which hasn't had any downtime since I took it out last year. I'm also happy to take the recent announcements by Chris D and Kier at face value.
Yup, they're definitely doing a great job with their cloud offering, as far as I can tell.
After a couple of months running it, I'd say a good part of that is down to XF itself being very stable software, but that's also only one part of it all.

Also, I was backreading that thread (the one I think you're referring to, anyway) just yesterday, and to be honest I couldn't really see ourselves asking for much more from XF from a technical standpoint (besides a few bits like reworking the dated CSRF token model, which makes caching a royal pain in the a**, and the rather questionable handling of animated image processing).
 
Not entirely sure what you imply here to be honest; at least not from how I understand the meaning of "butter face" :confused:

The meaning was: despite its positive points, there are glaring flaws.

I don't think it's a fair assessment to praise availability if you're going to leave out the regional centers serving more than half the USA's population.

Their own SLA points out their lack of confidence in availability; they don't even give three nines for instance-level uptime. Industry standard is five nines outside of major cloud providers such as Google, Azure, AWS, etc. - they each boast of their infrastructure but fail to offer five-nines SLAs.

Vultr... 100% instance level SLA.
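For reference, the annualized downtime each of those figures allows:
99.9% ("three nines") ≈ 8h 46m per year
99.99% ≈ 52m 36s per year
99.999% ("five nines") ≈ 5m 15s per year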
 
Idk how much they are spending, but I'd expect at least one or two extra 0s on their monthly bill, going by the screenshots they shared.
That said, your case is also peculiar. I don't know what the site is for, but the numbers suggest a rather media-lightweight site with very good cacheability (and to be fair, AT does look very cacheable from here).
The major chunk of content that we serve is images, as per CF; users upload lots of images and screenshots.

[Screenshot: Cloudflare content-type breakdown for the last 24 hrs]

Note: CF converts all the uploaded media to webp.

As for cacheability, I have written a custom addon that configures it for edge and browser: user vs guest cache. It's still in the testing phase.
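The core of it boils down to one decision at the edge. A stripped-down sketch (xf_session / xf_user are XenForo's default cookie names as far as I know, and the prefix can differ per install; the rest is illustrative):

[CODE=typescript]
// Only GET requests carrying no XenForo session cookie may be served
// from the guest cache; everyone else goes straight to the origin.
function mayServeFromGuestCache(request: Request): boolean {
  const cookies = request.headers.get("Cookie") ?? "";
  return request.method === "GET" && !/\bxf_(session|user)=/.test(cookies);
}
[/CODE]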
 
I don't think it's a fair assessment to praise availability if you're going to leave out the regional centers serving more than half the USA's population.
🤷 You're not wrong, but having 70-something AZs (no idea how many they have these days) and being judged on exactly 1 that sucks (because it has something like 10x the population and probably more stuff going on in it than half the internet) is also... not very fair, I'd say?

The glaring flaw is rather that when you use "global" network-related AWS services (like CloudFront, Route 53, ACM, ...), their config must reside in us-east-1 (e.g. an ACM certificate attached to a CloudFront distribution has to live there), so when it's broken you can't modify a few things relating to those.
Is it acceptable that it's ever broken? Probably not. Is it relevant to anyone not running their compute there (as nothing forces you to do so) that it's 3 instead of 5 nines? Not really either.
When I worked at a place very invested in AWS, the bit about handling cross-region auth to make network changes there was easily 10,000x more annoying than the 2 times in 3 years that it had an issue while I was trying to change something...

Their own SLA points out their lack of confidence in availability; they don't even give three nines for instance-level uptime. Industry standard is five nines outside of major cloud providers such as Google, Azure, AWS, etc. - they each boast of their infrastructure but fail to offer five-nines SLAs.

Vultr... 100% instance level SLA.
Yeah, so idk the exact details of everyone's SLAs, and those are always complex to compare, so I'd refrain from doing so.

What I'll say however is that "5 nines" being standard in IT is:
a. Very much news to me
b. 100% a promise of indemnification where applied, rather than any guarantee of reliability

You might decide to commit to it and pay up whenever someone pipes up, and that's very honorable, but no one sane should actually count on it.

Heck, if you ever connect a machine/VM to the internet, you're already missing it, because (honest) ISPs don't even try to promise that kind of QoS guarantee (since even if they were perfect, their partners aren't). Sometimes a subsea cable just gets wrecked and NA<>Asia routing goes to **** for the whole internet. Happens more often than you'd think. Or coordinated sabotage in France cuts a couple of major backbone cables all at once, causing widespread issues across all of Europe. Those are the kinds of things that happen yearly or more, take longer than "5m15s" to fix (a 99.999% annualized SLA allows 0.001% × 31,536,000 s ≈ 315 s ≈ 5m15s of downtime per year), and mean it's literally not possible to uphold.

And if you mean non-network-related SLAs, I hope you use future storage technology. Single SSDs/HDDs easily incur 5m15s per year of stall due to some defect or bug in their firmware/controller. And redundant storage "devices", which are usually network-attached (and are a good thing, I'm not criticizing those), are also subject to software bugs of their own - unless your distributed storage implementation happens to be bug-free and never crashes while a customer is pointed at a given instance.

Now, is that still a much better commitment from Vultr? Sure. And I don't doubt that they honor their side of it. But I would still very much not equate an SLA with effective reliability (and again, I'm sure Vultr's reliability is great if they were willing to commit to it contractually).

As of cacheability, I have written a custom addon that configures it for edge and browser. User vs Guest Cache. Its still in testing phase.
Well, I'll definitely keep an eye out for it then.
When I originally made a thread about it, tested a few things, and read other suggestions, it seemed quite nontrivial to make it work as it ideally should.
DigitalPoint ended up biting that bullet with their CF addon afaik, and while they originally handwaved the difficulty away, they stuck with it and reported quite a few bugs as a result, so hopefully we will eventually see these annoying bits dealt with.
 
When I originally made a thread about it, tested a few things, and read other suggestions, it seemed quite nontrivial to make it work as it ideally should.
DigitalPoint ended up biting that bullet with their CF addon afaik, and while they originally handwaved the difficulty away, they stuck with it and reported quite a few bugs as a result, so hopefully we will eventually see these annoying bits dealt with.
There's definitely a lot of gotchas involved in making guest page caching work right. I ended up sorting them all out and now we are left with a few minor ones (it works a lot better than even the XenForo built-in guest page caching system). There are always going to be some drawbacks; the best you can do is minimize them so the upside outweighs the downside:

[Screenshot: guest page cache performance stats]

You can see Cloudflare edge guest page caching live on my sites (for example https://iolabs.io).
 
When I originally made a thread about it, tested a few things, and read other suggestions, it seemed quite nontrivial to make it work as it ideally should.
I created a testing sheet of all the bugs found and issues reported by users and fixed them all. And yet one very funny edge case was left. 🥹

As for the CSRF issue, we have most of the pages where it exists hidden from guests, so it's kinda irrelevant to us.
So far we have no issues related to login/register/forgot-password or any other caching-related issues that I am aware of. (Users start sending a lot of emails if there is even a minor issue, and the Twitter feed gets ruined.)

At the end of the day this guest page caching is very useful as it decreases TTFB a lot; below are results for a thread. We don't have much traffic from Asia, so response times would be higher there.
[Screenshots: TTFB test results by region]

PS: That addon may never be released, as I coded it specifically for a third party per their needs.
 
There are always going to be some drawbacks; the best you can do is minimize them so the upside outweighs the downside […]
At the moment, for sure yeah. I’d certainly not expect it to work perfectly anyway.

However, if we consider the fact that we have to regularly refresh a CSRF token anyway, especially for guests, that's somewhat of an avenue that could be used to work around the guest user counts and thread views, for example (in principle at least).

As for the theme bit, while that one is certainly trickier and honestly probably not worth working around for guests, it’s still very doable in theory. For example making the selected theme show up as a request cookie of its own, and having one’s edge then convert that into a header (idk if you can easily do that transformation with CF specifically, but given it is a pretty simple operation in principle I’m going to guess that it’s possible), and varying on it in the response. Though I’d certainly not bother with it anyway.
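A rough Worker-style sketch of that idea, in case anyone wants to play with it (the style_id cookie name and everything else here are made up for illustration):

[CODE=typescript]
// Fold a (hypothetical) theme cookie into the cache key so each theme gets
// its own cached copy of a page, instead of bypassing the cache entirely.
async function fetchVaryingOnTheme(request: Request, ctx: ExecutionContext): Promise<Response> {
  const theme = /\bstyle_id=(\d+)/.exec(request.headers.get("Cookie") ?? "")?.[1] ?? "default";

  // Cloudflare's cache API keys on the request URL, so encode the theme into it.
  const keyUrl = new URL(request.url);
  keyUrl.searchParams.set("__theme", theme);
  const cacheKey = new Request(keyUrl.toString(), request);

  const cache = caches.default;
  let response = await cache.match(cacheKey);
  if (!response) {
    response = await fetch(request); // the origin still sees the clean URL
    if (response.ok) ctx.waitUntil(cache.put(cacheKey, response.clone()));
  }
  return response;
}
[/CODE]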

But anyway, I still need to download your plugin and investigate what in it specifically deals with guest caching to adapt it at some point to our infra… unless XF ends up making it simpler in its core 🫡

I created a testing sheet of all the bugs found and issues reported by users and fixed them all. And yet one very funny edge case was left. 🥹
Good luck with that last one then! If the sheet itself isn’t too sensitive, do you plan on sharing it somewhere?

As for the CSRF issue, we have most of the pages where it exists hidden from guests, so it's kinda irrelevant to us.
So far we have no issues related to login/register/forgot-password or any other caching-related issues that I am aware of. (Users start sending a lot of emails if there is even a minor issue, and the Twitter feed gets ruined.)
Ah makes sense then.

At the end of the day this guest page caching is very useful as it decreases TTFB a lot
It’d definitely make a massive difference for most installations indeed. Idk if XF Cloud is sharding their customers by location and it’s a coincidence, or if they have some particular trick, but it’d likely serve them quite well too!
 
At the moment, for sure yeah. I’d certainly not expect it to work perfectly anyway.

However, if we consider the fact that we have to regularly refresh a CSRF token anyway, especially for guests, that's somewhat of an avenue that could be used to work around the guest user counts and thread views, for example (in principle at least).
Ya, I had the same thought, but (hopefully) we will do away with CSRF tokens completely soon. All modern browsers support the Sec-Fetch-Site header these days. I have a different addon that does some PWA stuff that uses Sec-Fetch-Site for a different reason and just uses CSRF as a fallback (which honestly is what XenForo should be doing at this point). My point is that browsers have done away with the need for CSRF tokens, so hopefully XenForo uses that functionality before too long.

So the "right" long-term way of doing it would be to have a callback to the site for every page view, to keep not just thread views up to date but also guests showing in online users. At that point you start to diminish the benefits of guest page caching, because you need to spin up sessions for every page view on the server side.

My point is that I'm assuming the need to fetch a CSRF token at all will go away at some point, so piggybacking it probably isn't the best long-term solution.

As for the theme bit, while that one is certainly trickier and honestly probably not worth working around for guests, it’s still very doable in theory. For example making the selected theme show up as a request cookie of its own, and having one’s edge then convert that into a header (idk if you can easily do that transformation with CF specifically, but given it is a pretty simple operation in principle I’m going to guess that it’s possible), and varying on it in the response. Though I’d certainly not bother with it anyway.
Yep, it's doable in theory, but the effort required and the kludginess of doing it weren't worth it. The site will still work if a guest picks a non-default theme; it just won't be served from cache. Realistically, the benefit of caching non-default themes would be minimal anyway, because one can assume most guests aren't using a non-default theme, and for the ones that are, the cache hit rate is going to be fairly low anyway because, well... most guests aren't using them.
 
At that point you start to diminish the benefits of guest page caching, because you need to spin up sessions for every page view on the server side.
Hmm, tbh I think it's a fine tradeoff in that case. For the user it'd be transparent and wouldn't actually slow anything down (realistically the LP time should be higher than firing a non-awaited POST). It means less compute saving on the XF side though, but no more than currently, so I would take that drawback happily, personally. But I see how it's a bit of a question mark.

My point is that I'm assuming the need to fetch a CSRF token at all will go away at some point, so piggybacking it probably isn't the best long-term solution.
Yup technically it’s far past being necessary, but I wonder how easy it will be for the XF team to rip out tbh; I haven’t looked that extensively but it seemed pretty entrenched when I looked at the source code of XF.

Realistically, the benefit of caching non-default themes would be minimal anyway, because one can assume most guests aren't using a non-default theme
And yep, totally agree.
 
Hmm, tbh I think it's a fine tradeoff in that case. For the user it'd be transparent and wouldn't actually slow anything down (realistically the LP time should be higher than firing a non-awaited POST). It means less compute saving on the XF side though, but no more than currently, so I would take that drawback happily, personally. But I see how it's a bit of a question mark.
Yep, I'm going to add support for it to my addon where users can toggle it on/off (backhauling a request to the server even for guests), but I'm waiting for Cloudflare Snippets to be available. That will let me do some trickery where we fork the request inside Cloudflare: the guest requests the page, Cloudflare serves them the cached page, and then Cloudflare makes a separate request on its end to the origin server to tell the server that a request was made. Basically, do it in a way where the end user's browser isn't the one making the request to the origin server. Much fancier than the user's browser making a request to the origin. ;)
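In Worker terms the fork looks roughly like this (just a sketch of the idea; the /guest-view-ping endpoint is made up for the example):

[CODE=typescript]
export default {
  async fetch(request: Request, env: unknown, ctx: ExecutionContext): Promise<Response> {
    const cached = await caches.default.match(request);
    if (cached) {
      // Serve the guest straight from cache, and separately tell the origin
      // a view happened. The visitor's browser never sees this second request.
      ctx.waitUntil(
        fetch(new URL("/guest-view-ping", request.url).toString(), {
          method: "POST",
          body: JSON.stringify({ url: request.url }),
        })
      );
      return cached;
    }
    return fetch(request);
  },
};
[/CODE]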

Yup technically it’s far past being necessary, but I wonder how easy it will be for the XF team to rip out tbh; I haven’t looked that extensively but it seemed pretty entrenched when I looked at the source code of XF.
You wouldn't want to completely do away with it, because you'd need to support really old browsers too. The nice thing is you can tell whether a browser supports it on the server side, because the request header is always there if it does. Because of that, it's really easy to shim in: if the header exists, the CSRF check method uses that header; if it doesn't, use the existing CSRF checking code. Same with sending the CSRF token: if the header is there, you don't need to send the CSRF cookie; if it isn't, fall back to the existing code.
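Sketched out, with the legacy check left as a placeholder for whatever the existing token comparison does:

[CODE=typescript]
// Placeholder for the existing CSRF token comparison code.
declare function legacyCsrfTokenCheck(request: Request): boolean;

// If the browser sent Sec-Fetch-Site, trust it; older browsers never send
// the header, so they transparently fall through to the old token check.
function passesCsrfCheck(request: Request): boolean {
  const site = request.headers.get("Sec-Fetch-Site");
  if (site !== null) {
    // "same-origin" covers form posts / fetches from your own pages;
    // "none" covers direct navigation (address bar, bookmarks).
    return site === "same-origin" || site === "none";
  }
  return legacyCsrfTokenCheck(request);
}
[/CODE]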
 