XF.com Downtime?

It generally should have cleared quickly. I only discovered late last night what was really going on, and it ended up being a really frustrating situation because it was a) out of my control and b) would have been really easy to prevent (but because of (a), I couldn't do anything). The downtime ended up being inevitable once we wanted to change DNS settings.

We had been using DNS from the registrar (Namecheap/Enom) and it had actually been like that since day 1. We have had a few issues with it due to DDoSes directed at others, but for the most part it has been fine. However, yesterday morning, there were some sporadic issues that led to DNS resolution failing for some people. That was the final straw and led to us moving DNS (to Route53; if you're accessing us now, you're using it).

Here's the issue though. As soon as we switched off the Namecheap DNS v1 servers (even if we were moving back to their DNS v2 servers), DNS v1 stopped resolving our domain. The old servers appear to be intended to redirect queries to the new canonical servers, but that doesn't appear to work correctly. This means that if someone is still hitting the old DNS server, the domain will fail to resolve until their local (intermediate) DNS server updates and realizes that the canonical records are elsewhere. That seemed to happen very quickly on a number of servers, but others appear to still be holding out now (>24 hours later). Because of the locality of DNS and intermediate servers ignoring TTLs, it's somewhat hard to know how many people are still affected but it should hopefully be very few; in terms of raw traffic, there hasn't been any noticeable difference.
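
If you're curious whether your own resolver has caught up, something like this rough sketch will show which nameservers each resolver is still handing out. It uses the third-party dnspython package (2.x, where the call is resolve rather than query), and the domain and resolver addresses are just placeholders, so adjust accordingly:

```python
# Rough sketch, not anything official: compare what different resolvers
# return for a domain's NS records, to see whether a stale intermediate
# server is still answering from old delegation data.
import dns.resolver  # third-party: dnspython 2.x

DOMAIN = "example.com"          # placeholder domain
RESOLVERS = {
    "google": "8.8.8.8",        # public resolver, usually picks up new NS quickly
    "local":  "192.0.2.53",     # placeholder for a local/ISP resolver
}

for name, ip in RESOLVERS.items():
    res = dns.resolver.Resolver(configure=False)
    res.nameservers = [ip]
    try:
        answer = res.resolve(DOMAIN, "NS", lifetime=5.0)
        servers = sorted(r.to_text() for r in answer)
        print(f"{name:>6}: NS = {servers}")
    except Exception as exc:
        print(f"{name:>6}: lookup failed ({exc})")
```

Run against a handful of resolvers, it makes it fairly obvious which ones are still holding on to the old delegation.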

If we had been on DNS v2, none of this should have been an issue, because it would have still resolved the domain and allowed a graceful transition. When I realized what was going on, I contacted the DNS provider about this, but it seems there was nothing they could do about it. I contemplated going back (which should have resolved the issue immediately), but we had already gone through (hopefully) most of the "pain" of the transition, and if I had undone it, that would have ended up causing the whole issue again at some later point.

Ridiculously frustrating situation. :(
 
(to Route53; if you're accessing us now, you're using it)

Yay!

Because of the locality of DNS and intermediate servers ignoring TTLs, it's somewhat hard to know how many people are still affected but it should hopefully be very few; in terms of raw traffic, there hasn't been any noticeable difference.

From my last activity at 12:30 AM (PDT) yesterday, I personally wasn't able to access the site again until about 5:20 PM, as I posted earlier, which would have been 12:20 AM for you guys (usually 1:00 AM when it's 5:00 PM here, but you guys haven't gone into DST yet, haha). Then, about 30 minutes later, it went down for me again. I tried flushing my DNS, but nothing seemed to happen. However, about 10 minutes later I was able to access the site again without further issues, so I'm not sure if it was a coincidence, if the flush took a while to take effect, or if it was something you did, as I did see you online between 6:00 PM and 7:00 PM (likely one of the latter two, I'll bet).
 
Been there... done that, got the same frustration hat on and the t-shirt to match!

Glad it's all resolved, I feel your pain @Mike!
 
DNS can be a bit tricky sometimes.
But you're in good hands with Route 53; I have been using it since the start and have never had problems. It's also easy to do failover, and the health checks are good.
The only thing I miss is DNSSEC!
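
For anyone curious what the failover setup looks like in practice, here's a rough sketch using boto3 (the zone ID, domain, and IP are placeholders, and the TTL/thresholds are just examples, not a copy of my actual config):

```python
# Rough sketch: a Route 53 health check plus the PRIMARY half of a failover
# record pair. A matching SECONDARY record would point at the backup server.
import uuid
import boto3

route53 = boto3.client("route53")

HOSTED_ZONE_ID = "Z0000000000000"   # placeholder hosted zone ID
DOMAIN = "www.example.com"          # placeholder record name
PRIMARY_IP = "203.0.113.10"         # placeholder primary server

# Health check that polls the primary server over HTTPS.
hc = route53.create_health_check(
    CallerReference=str(uuid.uuid4()),
    HealthCheckConfig={
        "IPAddress": PRIMARY_IP,
        "Port": 443,
        "Type": "HTTPS",
        "ResourcePath": "/",
        "RequestInterval": 30,
        "FailureThreshold": 3,
    },
)

# PRIMARY failover record tied to that health check.
route53.change_resource_record_sets(
    HostedZoneId=HOSTED_ZONE_ID,
    ChangeBatch={
        "Changes": [{
            "Action": "UPSERT",
            "ResourceRecordSet": {
                "Name": DOMAIN,
                "Type": "A",
                "SetIdentifier": "primary",
                "Failover": "PRIMARY",
                "TTL": 60,
                "ResourceRecords": [{"Value": PRIMARY_IP}],
                "HealthCheckId": hc["HealthCheck"]["Id"],
            },
        }]
    },
)
```

If the health check goes unhealthy, Route 53 starts answering with the SECONDARY record instead, which is basically all the failover there is to set up.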
 
Haven't tried Amazon's Route 53 myself, but I would assume it's pretty good being on AWS.

If anyone is looking at DNS options, check out CloudFlare's DNS service... it's really nice even if you don't use CloudFlare's normal services (front-end proxy, cache, etc.). Not only is it free, it's also better than some of the enterprise DNS services I paid for in the past. CloudFlare has been rolling out a lot of new data centers lately as well, so right now DNS is served from the data center closest to the end user, out of 32 locations around the world.

lol... okay, I sound like I own CloudFlare or something. hah... just think they have good service, that's all. :)

Side note - you can compare the latency of CloudFlare's DNS vs. Amazon's Route 53 from a few different locations around the world:

http://tools.maxcdn.com/ping?d1=NS-1022.AWSDNS-63.NET&d2=DNS2.CLOUDFLARE.COM

Oh yeah, they also have an API to manage DNS if you want (I use it for monitoring/auto-failover). https://www.cloudflare.com/docs/client-api.html
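
In case it helps anyone, it's a single JSON endpoint where the 'a' parameter picks the action. This is a rough sketch from memory (the parameters and response shape may be off, so double-check the linked docs; the key, email, and zone are placeholders) that just lists a zone's records:

```python
# Rough sketch from memory of the legacy CloudFlare client API (see the docs
# linked above) -- one JSON endpoint, with 'a' selecting the action.
import requests

API_URL = "https://www.cloudflare.com/api_json.html"

resp = requests.post(API_URL, data={
    "a": "rec_load_all",         # action: list all DNS records for a zone
    "tkn": "YOUR_API_KEY",       # placeholder API key
    "email": "you@example.com",  # placeholder account email
    "z": "example.com",          # placeholder zone name
}, timeout=10)
resp.raise_for_status()

# Response shape from memory; iterate defensively in case it differs.
for rec in resp.json().get("response", {}).get("recs", {}).get("objs", []):
    print(rec.get("name"), rec.get("type"), rec.get("content"))
```

For auto-failover I basically poll my servers and fire the record-edit action the same way when one stops responding.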

Haha, okay, now I'm really done. :)
 