Service Degradation with FastCGI Cache

Tags: System Architecture | Published: 2014-11-29 07:31 | Author: 老王

Source: http://www.blogread.cn/it/

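In nature, many creatures react in astonishing ways when their lives are at stake. The best-known example is the gecko: in a moment of danger, rather than sit and wait for death, it sheds its tail and sacrifices part of itself for a chance to survive. Internet projects face life-or-death tests too, such as a sudden traffic surge or a database outage. Without a sensible degradation plan at that point, the outcome is inevitably a dead end.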

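Any problem loses its meaning once it is detached from its actual circumstances, so before going further, here is the background of the case. It is a read-heavy PHP website that used to sit behind a CDN and ran very stably. Later, many requirements emphasizing real-time behavior were added, so the CDN was removed, and system stability suffered as a result. Because of heavy historical baggage, abandoning the old architecture entirely was not realistic. The solution had to be as transparent as possible and could not disrupt the existing architecture. In the end I chose to implement service degradation through FastCGI Cache.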

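Many people have already written about FastCGI Cache, for example 超群 and 莿鸟栖草堂, so I will not repeat the concepts. Here is what is different about this setup: although a cache is used, under normal conditions it is always bypassed for the sake of real-time data, and it is only consulted when something goes wrong. The architecture diagram is as follows: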

[Figure: Degradation architecture diagram]

The key to the implementation is to handle errors through error_page and perform the service degradation there:

limit_conn_zone $server_name zone=perserver:1m;

error_page 500 502 503 504 = @degradation;

fastcgi_cache_path /tmp
                   levels=1:2
                   keys_zone=degradation:100m
                   inactive=10d
                   max_size=10g;

upstream php {
    server 127.0.0.1:9000;
    server 127.0.0.1:9001;
}

server {
    listen 80;

    limit_conn perserver 1000;

    server_name *.xip.io;

    root /usr/local/www;

    index index.html index.htm index.php;

    location / {
        try_files $uri $uri/ /index.php$is_args$args;
    }

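    location ~ \.php$ {
        # $request_uri is the original client URI, so normal and degraded
        # requests share the same cache key.
        set $cache_key $request_method://$host$request_uri;

        # Skip cache lookups by default so normal traffic always reaches
        # PHP-FPM. Responses can still be saved, since fastcgi_no_cache
        # is not set.
        set $cache_bypass "1";
        # Only a degraded request (degradation=on) may be served from cache.
        if ($arg_degradation = "on") {
            set $cache_bypass "0";
        }

        try_files $uri =404;

        include fastcgi.conf;
        fastcgi_pass php;
        # Hand error responses from PHP-FPM over to error_page.
        fastcgi_intercept_errors on;
        fastcgi_next_upstream error timeout;
        fastcgi_cache degradation;
        fastcgi_cache_lock on;
        fastcgi_cache_lock_timeout 1s;
        fastcgi_cache_valid 200 301 302 10h;
        fastcgi_cache_min_uses 10;
        fastcgi_cache_use_stale error
                                timeout
                                invalid_header
                                updating
                                http_500
                                http_503;
        fastcgi_cache_key $cache_key;
        fastcgi_cache_bypass $cache_bypass;

        # Debug headers: cache status and upstream response time.
        add_header X-Cache-Status $upstream_cache_status;
        add_header X-Response-Time $upstream_response_time;
    }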

    location @degradation {
        # Retry the original request with degradation=on so the cache is consulted.
        rewrite . $request_uri?degradation=on last;
    }
}

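A quick tip on the side: the server name uses xip.io, a wildcard DNS service, so there is no need to edit hosts, which makes debugging convenient.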

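The code uses only features that ship with Nginx by default, so we can treat it as a generic version. Compared against the goals in the architecture diagram, however, it lacks one thing: it cannot activate the cache globally. How can that be done? The simplest approach is to judge whether the system is healthy by the number of errors per unit of time: set a threshold, and once it is exceeded, activate the cache globally. With Lua we can build a customized version: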

lua_shared_dict fault 1m;

limit_conn_zone $server_name zone=perserver:1m;

error_page 500 502 503 504 = @degradation;

fastcgi_cache_path /tmp
                   levels=1:2
                   keys_zone=degradation:100m
                   inactive=10d
                   max_size=10g;

upstream php {
    server 127.0.0.1:9000;
    server 127.0.0.1:9001;
}

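init_by_lua '
    -- Build the counter key for the minute that contains the timestamp
    -- (defaults to the current time).
    get_fault_key = function(timestamp)
        if not timestamp then
            timestamp = ngx.time()
        end

        return os.date("fault:minute:%M", timestamp)
    end

    -- Read the number of faults recorded for that minute (0 if none).
    get_fault_num = function(timestamp)
        local fault = ngx.shared.fault
        local key = get_fault_key(timestamp)

        return tonumber(fault:get(key)) or 0
    end

    -- Count one fault. If the counter does not exist yet, create it
    -- with a 600-second expiry.
    incr_fault_num = function(timestamp)
        local fault = ngx.shared.fault
        local key = get_fault_key(timestamp)

        if not fault:incr(key, 1) then
            fault:set(key, 1, 600)
        end
    end
';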

server {
    listen 80;

    limit_conn perserver 1000;

    server_name *.xip.io;

    root /usr/local/www;

    index index.html index.htm index.php;

    location / {
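        rewrite_by_lua '
            -- Already a degraded request: nothing more to decide.
            if ngx.var.arg_degradation then
                return ngx.exit(ngx.OK)
            end

            local ok = true

            -- Check the current and the previous minute against the threshold.
            for i = 0, 1 do
                local num = get_fault_num(ngx.time() - i * 60)
                if num > 1000 then
                    ok = false
                    break
                end
            end

            -- Too many faults: mark the request as degraded so the cache is used.
            if not ok then
                local query = "degradation=on"
                if ngx.var.args then
                    ngx.var.args = ngx.var.args .. "&" .. query
                else
                    ngx.var.args = query
                end
            end
        ';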

        try_files $uri $uri/ /index.php$is_args$args;
    }

    location ~ \.php$ {
        set $cache_key $request_method://$host$request_uri;

        set $cache_bypass "1";
        if ($arg_degradation = "on") {
            set $cache_bypass "0";
        }

        try_files $uri =404;

        include fastcgi.conf;
        fastcgi_pass php;
        fastcgi_intercept_errors on;
        fastcgi_next_upstream error timeout;
        fastcgi_cache degradation;
        fastcgi_cache_lock on;
        fastcgi_cache_lock_timeout 1s;
        fastcgi_cache_valid 200 301 302 10h;
        fastcgi_cache_min_uses 10;
        fastcgi_cache_use_stale error
                                timeout
                                invalid_header
                                updating
                                http_500
                                http_503;
        fastcgi_cache_key $cache_key;
        fastcgi_cache_bypass $cache_bypass;

        add_header X-Cache-Status $upstream_cache_status;
        add_header X-Response-Time $upstream_response_time;
    }

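    location @degradation {
        content_by_lua '
            -- The degraded retry failed as well: give up to avoid a loop.
            if ngx.var.arg_degradation then
                return ngx.exit(ngx.HTTP_INTERNAL_SERVER_ERROR)
            end

            -- Replay the original request as a subrequest with degradation=on.
            local res = ngx.location.capture(
                ngx.var.request_uri, {args = "degradation=on"}
            )

            -- Relay the subrequest response to the client.
            ngx.status = res.status
            for name, value in pairs(res.header) do
                ngx.header[name] = value
            end
            ngx.print(res.body)

            -- Record the fault for the current minute.
            incr_fault_num()
        ';
    }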
}

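Note: in the real-world case the logic for deriving the cache key is somewhat more complicated; it is kept simple here for the sake of brevity.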

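When the system is healthy it runs in dynamic mode, and pages are rendered by PHP-FPM. When the system is failing, the global cache is activated and it runs in static mode, with pages served from the cache. In testing, when the system switched from healthy to failing, RPS jumped from one thousand to ten thousand because PHP-FPM was taken out of the path. It reminds me of watching Saint Seiya as a kid: every time Phoenix Ikki was knocked down by an enemy, he would get back up and unleash even greater power.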

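One more thing to note: if a large number of cache entries expire during a failure, they have to be rebuilt, so Nginx will still talk to PHP-FPM, which may hurt performance. There is no particularly good solution for this. If your Nginx version is recent enough (1.5.7 or later), consider enabling fastcgi_cache_revalidate. With it enabled, once PHP-FPM determines that the system is in a failing state, it can simply return 304 to renew the cached entry.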
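As a rough illustration of the fastcgi_cache_revalidate idea, the fragment below adds the directive to the PHP location. It is a minimal sketch, not a tested configuration: it assumes Nginx 1.5.7 or later and assumes the PHP application sends validators (such as Last-Modified) and answers conditional requests with 304 when it considers the system unhealthy.

```nginx
location ~ \.php$ {
    # ... existing fastcgi_* directives from the configuration above ...

    # Assumption: Nginx 1.5.7 or later. Expired entries are refreshed with
    # conditional requests (If-Modified-Since / If-None-Match), so the
    # backend can answer 304 instead of rendering the page again.
    fastcgi_cache_revalidate on;
}
```

The 304 decision itself lives in the PHP application and is not shown here.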

…

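Is service degradation through FastCGI Cache a perfect solution? No! It is even a bit ugly. For example, with multiple servers it produces a lot of redundant cache data, and disk I/O also needs attention. It is not perfect, but it is simple, which fits my usual way of solving problems: first relieve the problem with a crude solution, then build a proper architecture to solve it for good. Later I will consider replacing FastCGI Cache with Memcached plus consistent hashing to build a relatively complete service degradation solution.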
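To make the Memcached idea a little more concrete, here is one possible shape it could take. This is only a sketch under several assumptions, not the author's eventual design: the pool name memcached_pool and the server addresses are placeholders, the application is assumed to write rendered pages into Memcached under the same key, and the consistent parameter of the upstream hash directive requires Nginx 1.7.2 or later.

```nginx
# Hypothetical pool of Memcached nodes; addresses are placeholders.
upstream memcached_pool {
    # Consistent (ketama) hashing on the cache key, so adding or removing
    # a node remaps only a small share of the keys.
    hash $memcached_key consistent;
    server 127.0.0.1:11211;
    server 127.0.0.1:11212;
}

server {
    # ... listen, server_name, root and PHP locations as in the configuration above ...

    location @degradation {
        # Assumption: the application stores rendered pages in Memcached
        # under this same key; Nginx only reads them.
        set $memcached_key $request_method://$host$request_uri;
        memcached_pass memcached_pool;
        default_type text/html;
    }
}
```

Because every web server hashes a key to the same Memcached node, the cache is shared rather than duplicated per server, which addresses the redundancy mentioned above.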
