
Watching nginx upstreams with collectd

Already happy with nginx in front of Apache for a number of sites, I decided it was time to start testing nginx/fastcgi on my personal server (the serial crash test dummy of my web operations). The only problem: I have yet to find a sensible method of grabbing useful runtime information from the PHP fastcgi process itself, and if you can’t sensibly watch it, you can’t sensibly deploy it.

So for now, instead of watching the PHP fastcgi process directly, I’m tracking its performance and usage from nginx’s perspective. You can log all kinds of data about upstream performance with nginx:

log_format upstream '$remote_addr - - [$time_local] "$request" $status '
    'upstream $upstream_response_time request $request_time '
    '[for $host via $upstream_addr]';
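
Each request handled by an upstream then produces a line along these lines (the values here are purely illustrative); the number after “upstream” is what the collectd match further down picks out:

203.0.113.5 - - [21/May/2009:14:30:02 +1000] "GET /index.php HTTP/1.1" 200 upstream 0.087 request 0.092 [for example.com via 127.0.0.1:9000]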

Then we log to a central upstream.log file from every location block that includes a fastcgi_pass directive. For example:

location ~ \.php$ {
    include  fastcgi_params;
    access_log  /var/log/nginx/upstream.log  upstream;
    fastcgi_pass  fcgi_php;
    fastcgi_param  SCRIPT_FILENAME  $wordpress_root$fastcgi_script_name;
}
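
(fcgi_php above is the name of an upstream group, defined at the http level, which points at the PHP fastcgi process. The address and port in this sketch are just an assumption about where the process is listening; adjust them to wherever yours actually binds.)

upstream fcgi_php {
    server  127.0.0.1:9000;
}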

Now we know how many requests the PHP fastcgi process is handling, and how quickly it’s doing so. collectd’s tail plugin can watch this log file…

<Plugin tail>
  <File "/var/log/nginx/upstream.log">
    Instance "nginx"
    <Match>
      Regex ".*"
      DSType "CounterInc"
      Type "counter"
      Instance "requests"
    </Match>
    <Match>
      Regex " upstream ([0-9.]*) "
      DSType "GaugeAverage"
      Type "delay"
      Instance "upstream"
    </Match>
  </File>
</Plugin>

… and turn it into something readable. First, the number of requests per second (which I only started watching at 14:30 this afternoon), then the delay for each request:

[Graph: nginx upstream requests per second]

[Graph: nginx upstream response time]

(Relatively boring statistics here, as it’s only monitoring the dynamic processing of my personal sites.)
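
The same log line also carries $request_time (the total time nginx spent on the request, including its own work), so a third Match block inside the same File section would graph that alongside the upstream figure. A sketch only – this isn’t part of the config above:

    <Match>
      Regex " request ([0-9.]*) "
      DSType "GaugeAverage"
      Type "delay"
      Instance "request"
    </Match>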

Combining nginx’s flexible logging and collectd’s tail plugin makes it pretty easy to watch the usage and performance of whatever you’re running behind nginx, even if you can’t instrument the application itself.
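
The same approach applies to proxied backends: $upstream_response_time and $request_time are populated for proxy_pass just as they are for fastcgi_pass, so a location in front of, say, an Apache backend could log to the same file (the backend name here is purely illustrative):

location / {
    access_log  /var/log/nginx/upstream.log  upstream;
    proxy_pass  http://apache_backend;
}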

… and thus far, I’m pretty happy with the performance, reliability and resource usage of nginx in front of PHP in fastcgi mode. :-)


16 Comments

  1. I’ve been impressed by nginx so far as well, but it does not have the ability to spawn cgi-processes on its own afaik – so I relied on lighttpd’s spawn-fcgi (nowadays external) helper, but I am seriously considering proxying to apache. Some PHP stuff (pecl uploadprogress anyone) more or less only works with apache. So here is my question: Memory consumption issues aside, would you rather recommend apache, or running fcgi processes + opcode optimizers (apc, etc) on production sites? What’s your take?

    • Either way, you should always use an opcode cache (APC appears to be the most stable and attentively maintained). :-) And yes, I’m using spawn-fcgi to manage the PHP fastcgi processes.

      If you don’t depend on Apache features, I would now recommend going down the fastcgi route. I haven’t done it for so long because Apache has been something of a familiar security blanket. :-)
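
      A rough sketch of the sort of spawn-fcgi invocation I mean, assuming php-cgi bound to a local port (paths, user and child count will vary by system):

      spawn-fcgi -a 127.0.0.1 -p 9000 -C 4 -u www-data -g www-data -f /usr/bin/php-cgi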

  2. Cherokee would keep up ok, but then the next version would come out and break your config file. Nobody using php-fpm? It takes a bit more work patching php and applying suhosin etc instead of letting a distro do it. Hopefully it will be packaged or part of standard php one day. Curious about collectd memory & cpu, being C based, vs one of the usual suspects (cacti, munin etc) – every byte counts on a vps. Do you notice it?

      • Wow, is it *that* Jeff Waugh complaining about the IO of something capable of networked aggregation? :)

        We run a few servers for non-commercial FLOSS related projects, and having the less powerful system handle collectd streams (and hopefully logs in the future, when we do VPN as well) helps immensely. The other way around might be a dedicated spindle for logs/rrds and backups.

  3. I’ve got some good words to say about collectd. I peeked into its source and found very clean C code indeed. The architecture is quite smart, too: agents only collect data from the local system and send UDP packets to the servers with the rrdtool plugin enabled. That way I assume there is very little chance for a remote vulnerability in the agents.

    collectd gathers all data through its native C plugins. Unlike munin, they don’t fork other processes and therefore consume very few system resources.

    However, I don’t like collectd’s web front-end (collection3). It’s very basic, difficult to configure, and I really hope something better comes along in the future.

    Jeff Waugh: what’s wrong with I/O in collectd? Maybe a cache misconfiguration in the rrdtool plugin?

    • @Astro: Say you’ve got 150 rrd files per server. Say you’ve got 100 servers. Say you want them all reporting collectd data to a single server. Now you’ve got one machine essentially doing random writes over 15000 disk locations every 10 seconds.

      (Sure, there are controls for this if you want to risk your stats to memory, which is not always a bad idea, but still… it’s something the server admin needs to consider.)
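
      The controls in question are the rrdtool plugin’s write-cache settings – something along these lines, with the timeouts picked to taste (the values and path here are only an example):

      <Plugin rrdtool>
        DataDir "/var/lib/collectd/rrd"
        CacheTimeout 120
        CacheFlush 900
      </Plugin>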

      • Then you probably don’t need 150 rrds per server anyways.

        OTOH one might also be interested in Clustrx, an emerging aggregator capable of handling thousands of nodes with dozens of parameters each and resolution down to a second. It is being developed for HPC environments and should be available this year.

  4. Hey Jeff,

    This is the only snippet of information I can find about doing collectd monitoring of PHP-FPM, or any fastcgi backend.

    I’m curious, since this is nearly two years old, have you improved your approach any? Any additional datapoints which you’re monitoring?