XF 1.4 Image Proxy crashing PHP-FPM after moving to SSL

DeltaHF

Well-known member
I just moved my site to SSL last night, and now the Image Proxy will crash PHP-FPM if it has trouble pulling images from problematic origin servers. I've had trouble with it before (and it's always been a bit temperamental for me), but the impact is now much more severe after the SSL conversion. This is a very large high-traffic forum (10.2 million posts, ~300k daily page views) which has some extremely image-heavy threads, so we push the Image Proxy hard.

PHP-FPM has crashed about 8 times in the past 12 hours. There are no errors in the PHP-FPM error log. The server runs a LEMP web stack configured by @eva2000's Centminmod. Any help or advice would be greatly appreciated.

NewRelic graphs from two of the most severe incidents:
[NewRelic screenshots attached]

My php-fpm.conf:
Code:
log_level = debug

pid = /var/run/php-fpm/php-fpm.pid
error_log = /var/log/php-fpm/www-error.log
emergency_restart_threshold = 10
emergency_restart_interval = 1m
process_control_timeout = 10s

[www]
user = nginx
group = nginx
listen = 127.0.0.1:9000
listen.allowed_clients = 127.0.0.1
;listen.backlog = -1
;listen = /tmp/php5-fpm.sock
;listen.owner = nobody
;listen.group = nobody
;listen.mode = 0666

pm = static
pm.max_children = 8
; Default Value: min_spare_servers + (max_spare_servers - min_spare_servers) / 2
pm.start_servers = 8
pm.min_spare_servers = 8
pm.max_spare_servers = 8
pm.max_requests = 100

; PHP 5.3.9 setting
; The number of seconds after which an idle process will be killed.
; Note: Used only when pm is set to 'ondemand'
; Default Value: 10s
pm.process_idle_timeout = 10s;

rlimit_files = 65536
rlimit_core = 0

; The timeout for serving a single request after which the worker process will
; be killed. This option should be used when the 'max_execution_time' ini option
; does not stop script execution for some reason. A value of '0' means 'off'.
; Available units: s(econds)(default), m(inutes), h(ours), or d(ays)
; Default Value: 0
;request_terminate_timeout = 0
;request_slowlog_timeout = 0
slowlog = /var/log/php-fpm/www-slow.log

pm.status_path = /phpstatus
ping.path = /phpping
ping.response = pong

; Limits the extensions of the main script FPM will allow to parse. This can
; prevent configuration mistakes on the web server side. You should only limit
; FPM to .php extensions to prevent malicious users to use other extensions to
; execute php code.
; Note: set an empty value to allow all extensions.
; Default Value: .php
security.limit_extensions = .php .php3 .php4 .php5

; catch_workers_output = yes
php_admin_value[error_log] = /var/log/php-fpm/www-php.error.log

PHP-FPM Status (about 10 minutes after a crash and manual restart):
Code:
pool: www
process manager: static
start time: 26/Feb/2015:19:48:23 +0000
start since: 428
accepted conn: 9846
listen queue: 0
max listen queue: 129
listen queue len: 128
idle processes: 7
active processes: 1
total processes: 8
max active processes: 8
max children reached: 0
slow requests: 0

I have also reduced the timeout in /library/XenForo/Model/ImageProxy.php from 10 seconds to 3, starting at line 426:
Code:
$response = XenForo_Helper_Http::getClient($requestUrl, array(
    'output_stream' => $streamUri,
    'timeout' => 3
))->setHeaders('Accept-encoding', 'identity')->request('GET');
 

Admittedly I haven't really gotten into the nuances of PHP-FPM configuration, but since you're setting the process manager (pm) to static, pm.max_children = 8 means you only ever have 8 worker processes, so you can only serve 8 requests concurrently. If you have 8 cores and the workers are always CPU bound, that makes sense. But here you're really just network bound, so if all 8 workers are waiting on remote connections, you could end up serving nothing while the server is under effectively no load. Normally this wouldn't happen, but if someone embeds a number of images in a post from the same bad host, the browser will try to load them simultaneously (up to a point) and that could trigger the issue. It may be worth trying the "dynamic" pm setting to see if that makes a difference.
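Purely as an illustration of what "dynamic" looks like in the pool config (the numbers here are placeholders, not a tuned recommendation for this server):
Code:
pm = dynamic
; hard ceiling on concurrent worker processes
pm.max_children = 16
; workers forked at startup
pm.start_servers = 8
; FPM keeps the idle worker count between these two values,
; forking or killing children as the load changes
pm.min_spare_servers = 4
pm.max_spare_servers = 12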

That's not a crash though, and presumably it should recover. Does it just stop serving entirely until you intervene? You may want to increase your log_level to see if that gives any more information.
 
Thanks for your input, Mike; I think you are correct.

I have doubled max_children, start_servers, min_spare_servers, and max_spare_servers to 16 and set the max_requests to 500 (100 is apparently far too low). This is running on an 8-core (Xeon E5-2680 v2 @ 2.80GHz) Linode.
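For reference, that makes the relevant pool lines read roughly as follows (reconstructed from the change just described, still on the static process manager):
Code:
pm = static
; with pm = static only max_children and max_requests take effect here;
; the spare-server settings are ignored
pm.max_children = 16
pm.start_servers = 16
pm.min_spare_servers = 16
pm.max_spare_servers = 16
pm.max_requests = 500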

So far the server has been performing well with no crashes. Let's hope it stays that way. :)

EDIT: Nope, it didn't. Just crashed again, same way. It does eventually come back after 15 minutes or so of downtime.
 
If you up the log_level to debug, does that give any more details?
Yep, it sure does:
Code:
server reached pm.max_children setting (16), consider raising it
:) I'm going to switch to dynamic and increase the max_children to 32 and see how it goes. Oddly, most of the timeouts seem to come during the late night when traffic is low. I suppose a large forum with heavy Image Proxy use just needs a lot more PHP children than most configurations call for. I've got 16GB of RAM.
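As a rough memory sanity check on that (the per-worker figure below is an assumption, not something measured on this server):
Code:
; assuming each PHP-FPM worker peaks at roughly 60 MB of RAM:
;   32 children x 60 MB ≈ 2 GB
; which leaves most of the 16 GB free for MySQL, nginx and the OS
pm.max_children = 32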

The next question: why did this problem suddenly become so severe after switching to HTTPS?

My theory: I enabled SPDY/3.1 at the same time, which supports multiplexing. With regular HTTP, most browsers would only make 2-6 connections at a time, effectively throttling their requests to proxy.php and limiting the impact of a slow origin server. With SPDY, browsers are making many more simultaneous requests to the proxy, so a sluggish origin server can quickly tie up a large number of PHP processes for a single visitor loading an image-heavy thread. Perhaps this is something to keep in mind for other large sites making the move to SSL.

Use this config:
Code:
pm = dynamic
pm.max_children = 16
pm.start_servers = 6
pm.min_spare_servers = 2
pm.max_spare_servers = 10
pm.max_requests = 500
Thank you, Roldan. Now that I have confirmation the server is running out of processes, I'm going to try those settings out with 32 max_children instead.
 
I'm trying the same thing you did (switching to dynamic, max_children to 32) after experiencing a similar issue. How has everything been working out for you?
 
It's been going well. I have since moved from Linode to a dedicated server with 32GB of RAM. Here are my current PHP-FPM settings:

Code:
pm = ondemand
pm.max_children = 50
pm.start_servers = 20
pm.min_spare_servers = 5
pm.max_spare_servers = 35
pm.max_requests = 5000
 
These don't do anything with ondemand:

Code:
pm.start_servers = 20
pm.min_spare_servers = 5
pm.max_spare_servers = 35
pm.max_requests = 5000

Per the stock PHP-FPM config comments:

Code:
;  ondemand - no children are created at startup. Children will be forked when
;             new requests will connect. The following parameter are used:
;             pm.max_children           - the maximum number of children that
;                                         can be alive at the same time.
;             pm.process_idle_timeout   - The number of seconds after which
;                                         an idle process will be killed.
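In other words, with ondemand the pool effectively reduces to something like this (pm.process_idle_timeout is shown at its default value; it is not in the config posted above):
Code:
pm = ondemand
; hard cap on workers alive at the same time
pm.max_children = 50
; idle workers are killed after this long (FPM default)
pm.process_idle_timeout = 10s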
 
The revival of this thread got me to look back into why I was using "ondemand" in the first place, because according to my server setup notes I had decided to go with "dynamic" early last year. o_O I think I changed it to run some real-world benchmarks and simply forgot to revert it.

Anyway, I changed it back to "dynamic" again to see what would happen. The answer appears to be nothing, according to my NewRelic PHP graphs over the past 24 hours. Having said that, page loads do feel faster, though that may just be a placebo effect. My site serves between 100k and 250k pageviews per day across WordPress and XenForo.
 
For you to see any difference, you would have to be spawning processes non-stop over a very short period of time, and even then your hardware would probably handle it with no issue at all.

Ondemand will be able to handle your traffic, and everyone else's here, just fine. I like to keep server configuration as simple as possible these days.
 