生产问题（十四）K8S抢占CPU导致数据库链接池打爆

发布时间：2024年01月24日

一、引言

????????线上一天出现了两次数据库连接失败的大量报错，一开始以为是数据库的问题，但是想了想如果是数据库的问题，应该会有大量的应用问题

????????具体分析之后，发现其实是容器cpu出现了Throttled，导致大量线程阻塞

?二、分析

1、堆栈

? ? ? ? 既然出现了报错，又没有发布，先看看堆栈里面报错的地方

? ? ? ? 追踪到最底层的堆栈，显示的是数据库链接池耗尽[size:100; busy:1; idle:0; lastwait:44]，100个链接只有4个在处理，没有一个空闲的，那么很明显其他的链接所在的线程都被阻塞了

2、源码

? ? ? ? 虽然说不太可能，但是还是要看看源码。大多数同学都觉得框架里面是不可能有错的，所以也不会去看源码，基本上都是去其他方向分析，这个在99%的情况下没有问题，但是作者不止一次遇到框架本身是有问题的，只是遇到了一些极端情况。比如：

Mybatis拼接sql出错及源码解析_<if test="month!=null">-CSDN博客

生产问题（九）Mysql8.0 ddl问题_mysql8 default_authentication_plugin-CSDN博客

生产问题（十一）日志JavaAgent-NoClassDefFoundError_-javaagent 找不到报错-CSDN博客

生产问题（十三）谷歌Protobuf误修改系统全局时区-CSDN博客

? ? ? ? 这些问题，如果不往框架的方向上走是找不到问题原因的，那么言归正传，我们看源码，这是tomcat进行线程链接的地方，超时的话就会报

throw new PoolExhaustedException("[" + Thread.currentThread().getName()+"] " +
    "Timeout: Pool empty. Unable to fetch a connection in " + (maxWait / 1000) +
    " seconds, none available[size:"+size.get() +"; busy:"+busy.size()+"; idle:"+idle.size()+"; lastwait:"+timetowait+"].");

? ? ? ? 和lastwait:44的现象也对的上，给我们进一步指明了方向，有很多占据数据库链接的线程不做事，其他很多线程在等

private PooledConnection borrowConnection(int wait, String username, String password) throws SQLException {

        if (isClosed()) {
            throw new SQLException("Connection pool closed.");
        } //end if

        //get the current time stamp
        long now = System.currentTimeMillis();
        //see if there is one available immediately
        PooledConnection con = idle.poll();

        while (true) {
            if (con!=null) {
                //configure the connection and return it
                PooledConnection result = borrowConnection(now, con, username, password);
                if (result!=null) return result;
            }

            //if we get here, see if we need to create one
            //this is not 100% accurate since it doesn't use a shared
            //atomic variable - a connection can become idle while we are creating
            //a new connection
            if (size.get() < getPoolProperties().getMaxActive()) {
                //atomic duplicate check
                if (size.addAndGet(1) > getPoolProperties().getMaxActive()) {
                    //if we got here, two threads passed through the first if
                    size.decrementAndGet();
                } else {
                    //create a connection, we're below the limit
                    return createConnection(now, con, username, password);
                }
            } //end if

            //calculate wait time for this iteration
            long maxWait = wait;
            //if the passed in wait time is -1, means we should use the pool property value
            if (wait==-1) {
                maxWait = (getPoolProperties().getMaxWait()<=0)?Long.MAX_VALUE:getPoolProperties().getMaxWait();
            }

            long timetowait = Math.max(0, maxWait - (System.currentTimeMillis() - now));
            waitcount.incrementAndGet();
            try {
                //retrieve an existing connection
                con = idle.poll(timetowait, TimeUnit.MILLISECONDS);
            } catch (InterruptedException ex) {
                if (getPoolProperties().getPropagateInterruptState()) {
                    Thread.currentThread().interrupt();
                }
                SQLException sx = new SQLException("Pool wait interrupted.");
                sx.initCause(ex);
                throw sx;
            } finally {
                waitcount.decrementAndGet();
            }
            if (maxWait==0 && con == null) { //no wait, return one if we have one
                if (jmxPool!=null) {
                    jmxPool.notify(org.apache.tomcat.jdbc.pool.jmx.ConnectionPool.POOL_EMPTY, "Pool empty - no wait.");
                }
                throw new PoolExhaustedException("[" + Thread.currentThread().getName()+"] " +
                        "NoWait: Pool empty. Unable to fetch a connection, none available["+busy.size()+" in use].");
            }
            //we didn't get a connection, lets see if we timed out
            if (con == null) {
                if ((System.currentTimeMillis() - now) >= maxWait) {
                    if (jmxPool!=null) {
                        jmxPool.notify(org.apache.tomcat.jdbc.pool.jmx.ConnectionPool.POOL_EMPTY, "Pool empty - timeout.");
                    }
                    throw new PoolExhaustedException("[" + Thread.currentThread().getName()+"] " +
                        "Timeout: Pool empty. Unable to fetch a connection in " + (maxWait / 1000) +
                        " seconds, none available[size:"+size.get() +"; busy:"+busy.size()+"; idle:"+idle.size()+"; lastwait:"+timetowait+"].");
                } else {
                    //no timeout, lets try again
                    continue;
                }
            }
        } //while
    }

3、监控

? ? ? ? 既然分析出有很多占据数据库链接的线程不做事，其他线程在等，那么在cpu、线程方面就应该有所体现，接下来看监控。

? ? ? ? 监控显示对应问题的时间点发生了CPU?Throttled，他会限制应用程序或进程的CPU使用率，一度达到十几秒，等待io的task线程激增，很明显这就是具体的原因，接下来需要找寻的是，为什么会产生Throttled。

? ? ? ? jvm的内存和线程激增也侧面验证了cpu阻塞，大量的线程处理不了任务

三、cpu抢占

? ? ? ? 通过上述分析，接下来需要找寻为什么会产生Throttled？

? ? ? ? 一般来说有两种原因：

1、瞬时流量激增

? ? ? ? 这种一般就是入口流量突然增加，已经分配的cpu处理不过来，需要按照K8S的动态分配继续申请cpu，但是来不及

? ? ? ? 看看请求数量，整体变化不大，但是监控采集是一分钟一次，所以这个方向不能被排除，但是没有其他方面的证据协助分析

2、宿主机cpu被抢占

? ? ? ? 这段监控是下午出问题时候的，上午找不到是哪个宿主机了

? ? ? ? 可以看到这个宿主机不太对，他这个0.7是每个cpu维度的，对应时间从0.3到0.7，相当于原来用三个，突然用到了7个，所以宿主机争抢的概率还是大一些????????

? ? ? ? 再看看具体每个容器，明显黄色容器的cpu有一段飙升，倒数第三个是出问题的机器? ? ? ?，上面的几个容器都有增加，一方面也代表了cpu被其他容器抢占的可能?

四、解决

? ? ? ? 由于瞬时流量激增、宿主机cpu被抢占两个方向的解决方案是完全不同的，所以我们需要先快速处理。

1、升配、增加机器数量

????????这样瞬时流量就会大幅减小，由于流量小了，cpu被抢占也就不会有太大影响

2、设置cpu set模式

? ? ? ? K8S给予linux的cgroups默认的一般都是share模式，就是给你分这么多cpu但是后续可以被抢占。

? ? ? ? 设置set模式，就是不给别人抢占，当然你忙的时候也抢不了别人

? ? ? ? 作者认为对于核心应用来说隔离是必须的，如果他隔离的情况下处理不了就应该升配或者扩容，而不是抢别人的

3、增加秒级监控

? ? ? ? 一分钟一次的监控属实是坑，基本上辅助不了确定的方向，只能是结合监控数据去猜一下，之前出现过很多问题也因为监控确定不了具体原因，只是锁定几个方向

? ? ? ? 但是怎么增加，增加有没有影响都是件麻烦事，因为框架组那边很直白的说了一分钟一次对于目前的监控都有压力，秒级的要自己暴露出prometheus规范的指标，通过加env的方式暴露出来，框架的监控agent会去采集。

????????那他为什么不去全部都加一下呢，还是内部有一些资源或者风险存在的，这就很顶了。

五、总结

? ? ? ? 这一次的问题被锁定在了瞬时流量激增、宿主机cpu被抢占两个方向，宿主机cpu被抢占概率大一些，监控分钟级别的采集不能确定唯一方向，只能沿着两个方向一起解决。

? ? ? ? 这也是个好事，起码知道了监控的机制，后续需要改进的方向。

文章来源:https://blog.csdn.net/m0_69270256/article/details/135814318
本文来自互联网用户投稿，该文观点仅代表作者本人，不代表本站立场。本站仅提供信息存储空间服务，不拥有所有权，不承担相关法律责任。如若内容造成侵权/违法违规/事实不符，请联系我的编程经验分享网邮箱：chenni525@qq.com进行投诉反馈，一经查实，立即删除！