
performance - High load average, low CPU usage - why? - Server Fault

We're seeing huge performance problems on a web application and we're trying to find the bottleneck. I am not a sysadmin so there is some stuff I don't quite get. Some basic investigation shows the CPU to be idle, lots of memory to be available, no swapping, no I/O, but a high average load.

The software stack on this server looks like this:

- Solaris 10
- Java 1.6
- WebLogic 10.3.5 (8 domains)

The applications running on this server talk with an Oracle database on a different server.

This server has 32GB of RAM and 10 CPUs (I think).

Running prstat -Z gives something like this:

   PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/NLWP
  3836 ducm0101 2119M 2074M cpu348  58    0   8:41:56 0.5% java/225
 24196 ducm0101 1974M 1910M sleep   59    0   4:04:33 0.4% java/209
  6765 ducm0102 1580M 1513M cpu330   1    0   1:21:48 0.1% java/291
 16922 ducm0102 2115M 1961M sleep   58    0   6:37:08 0.0% java/193
 18048 root     3048K 2440K sleep   59    0   0:06:02 0.0% sa_comm/4
 26619 ducm0101 2588M 2368M sleep   59    0   8:21:17 0.0% java/231
 19904 ducm0104 1713M 1390M sleep   59    0   1:15:29 0.0% java/151
 27809 ducm0102 1547M 1426M sleep   59    0   0:38:19 0.0% java/186
  2409 root       15M   11M sleep   59    0   0:00:00 0.0% pkgserv/3
 27204 root       58M   54M sleep   59    0   9:11:38 0.0% stat_daemon/1
 27256 root       12M 8312K sleep   59    0   7:16:40 0.0% kux_vmstat/1
 29367 root      297M  286M sleep   59    0  11:02:13 0.0% dsmc/2
 22128 root       13M 6768K sleep   59    0   0:10:51 0.0% sendmail/1
 22133 smmsp      13M 1144K sleep   59    0   0:01:22 0.0% sendmail/1
 22003 root     5896K  240K sleep   59    0   0:00:01 0.0% automountd/2
 22074 root     4776K 1992K sleep   59    0   0:00:19 0.0% sshd/1
 22005 root     6184K 2728K sleep   59    0   0:00:31 0.0% automountd/2
 27201 root     6248K  344K sleep   59    0   0:00:01 0.0% mount_stat/1
 20964 root     2912K  160K sleep   59    0   0:00:01 0.0% ttymon/1
 20947 root     1784K  864K sleep   59    0   0:02:22 0.0% utmpd/1
 20900 root     3048K  608K sleep   59    0   0:00:03 0.0% ttymon/1
 20979 root       77M   18M sleep   59    0   0:14:13 0.0% inetd/4
 20849 daemon   2856K  864K sleep   59    0   0:00:03 0.0% lockd/2
 17794 root       80M 1232K sleep   59    0   0:06:19 0.0% svc.startd/12
 17645 root     3080K  728K sleep   59    0   0:00:12 0.0% init/1
 17849 root       13M 6800K sleep   59    0   0:13:04 0.0% svc.configd/15
 20213 root       84M   81M sleep   59    0   0:47:17 0.0% nscd/46
 20871 root     2568K  600K sleep   59    0   0:00:04 0.0% sac/1
  3683 ducm0101 1904K 1640K sleep   56    0   0:00:00 0.0% startWebLogic.s/1
 23937 ducm0101 1904K 1640K sleep   59    0   0:00:00 0.0% startWebLogic.s/1
 20766 daemon   5328K 1536K sleep   59    0   0:00:36 0.0% nfsmapid/3
 20141 daemon   5968K 3520K sleep   59    0   0:01:14 0.0% kcfd/4
 20093 ducm0101 2000K  376K sleep   59    0   0:00:01 0.0% pfksh/1
 20797 daemon   3256K  240K sleep   59    0   0:00:01 0.0% statd/1
  6181 root     4864K 2872K sleep   59    0   0:01:34 0.0% syslogd/17
  7220 ducm0104 1268M 1101M sleep   59    0   0:36:35 0.0% java/138
 27597 ducm0102 1904K 1640K sleep   59    0   0:00:00 0.0% startWebLogic.s/1
 27867 root       37M 4568K sleep   59    0   0:13:56 0.0% kcawd/7
 12685 ducm0101 4080K  208K sleep   59    0   0:00:01 0.0% vncconfig/1

ZONEID    NPROC  SWAP   RSS MEMORY      TIME  CPU ZONE
    42      135   22G   19G    59%  87:27:59 1.2% dsuniucm01

Total: 135 processes, 3167 lwps, load averages: 54.48, 62.50, 63.11

I understand that the CPU is mostly idle, yet the load average is high, which seems quite strange to me. Memory doesn't seem to be a problem.

Running vmstat 15 gives something like this:

 kthr      memory            page            disk          faults      cpu
 r b w   swap  free  re  mf pi po fr de sr s0 s1 s4 sd   in   sy   cs us sy id
 0 0 0 32531400 105702272 317 1052 126 0 0 0 0 13 13 -0 8 9602 107680 10964 1 1 98
 0 0 0 15053368 95930224 411 2323 0 0 0 0 0 0  0  0  0 23207 47679 29958 3 2 95
 0 0 0 14498568 95801960 3072 3583 0 2 2 0 0 3 3  0 21 22648 66367 28587 4 4 92
 0 0 0 14343008 95656752 3080 2857 0 0 0 0 0 3 3  0 18 22338 44374 29085 3 4 94
 0 0 0 14646016 95485472 1726 3306 0 0 0 0 0 0 0  0  0 24702 47499 33034 3 3 94

I understand from this that the CPU is mostly idle, no processes are waiting in the run queue to be executed, and little swapping is happening.

Running iostat 15 gives this:

   tty        sd0           sd1           sd4           ssd0           cpu
 tin tout kps tps serv  kps tps serv  kps tps serv  kps tps serv   us sy wt id
   0  676 324  13    8  322  13    8    0   0    0  159   8    0    1  1  0 98
   1 1385   0   0    0    0   0    0    0   0    0    0   0    0    3  4  0 94
   0  584  89   6   24   89   6   25    0   0    0  332  19    0    2  1  0 97
   0  296   0   0    0    0   0    0    0   0    0    0   0    0    2  2  0 97
   1 1290  43   5   24   43   5   22    0   0    0  297  20    1    3  3  0 94

Running netstat -i 15 gives the following:

    input   aggr26    output       input  (Total)    output
packets errs  packets errs  colls  packets errs  packets errs  colls
1500233798 0     1489316495 0     0      3608008314 0     3586173708 0     0
10646   0     10234   0     0      26206   0     25382   0     0
11227   0     10670   0     0      28562   0     27448   0     0
10353   0     9998    0     0      29117   0     28418   0     0
11443   0     12003   0     0      30385   0     31494   0     0
 

When you say 'high load average', I assume you mean the 'load averages' figures that prstat shows at the bottom of its output:

Total: 135 processes, 3167 lwps, load averages: 54.48, 62.50, 63.11

These numbers look similar to the ones that top provides and probably mean the average run-queue size, i.e. the number of runnable processes. This isn't the percentage of processor time being used but how many 'things' are competing for CPU time to run. Admittedly, these do look quite high, but it all depends on the app that you are running; the processes may not actually be doing much once they get their slot. See here for a nice explanation regarding top.
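To see the distinction between load and CPU usage from inside a running JVM, the standard java.lang.management API exposes both the 1-minute system load average and the core count. This is a minimal sketch; the interpretation (load divided by cores) is the rule of thumb discussed in this thread, not a JDK feature:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.OperatingSystemMXBean;

public class LoadCheck {
    // Load average counts runnable/queued threads, not a CPU percentage,
    // so it has to be judged relative to the number of cores.
    public static double normalizedLoad(double loadAvg, int cores) {
        return loadAvg / cores;
    }

    public static void main(String[] args) {
        OperatingSystemMXBean os = ManagementFactory.getOperatingSystemMXBean();
        double load = os.getSystemLoadAverage(); // -1.0 if unavailable on this platform
        int cores = os.getAvailableProcessors();
        if (load >= 0) {
            System.out.printf("load=%.2f cores=%d normalized=%.2f%n",
                    load, cores, normalizedLoad(load, cores));
        }
    }
}
```

With the figures from the prstat output above (load 54.48 on roughly 10 CPUs), the normalized value would be well above 1.0, which is what makes these numbers look alarming at first glance.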

I'm not familiar with WebLogic, but I have noticed that, generally, with Apache Tomcat many Java threads can be spawned simultaneously for what appears to be not many requests. This could be what is causing those high load-average numbers. Make sure that you are using connection pooling where appropriate to connect to the backend, and consider upping the number of idle threads that are available to your app to handle connections (not sure how you do this on WebLogic; Tomcat has a per-connector thread pool or a general executor thread pool). If you don't do this then brand-new threads may be spawned to process each request.
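In application servers the pooling is configured rather than coded, but the underlying idea can be sketched with the standard java.util.concurrent API (available since Java 5, so also on the Java 1.6 stack here). The pool size of 50 is purely illustrative:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class RequestPool {
    // Illustrative cap; real values come from the container's configuration.
    private static final int WORKERS = 50;

    // A bounded pool reuses worker threads instead of spawning
    // a brand-new thread for every incoming request.
    private final ExecutorService pool = Executors.newFixedThreadPool(WORKERS);

    public void handle(Runnable request) {
        pool.submit(request);
    }

    public void shutdown() throws InterruptedException {
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
    }
}
```

The point is that the number of live threads (and hence the run-queue contribution) stays bounded no matter how many requests arrive, whereas thread-per-request creation grows without limit under load.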

As to performance, you need to nail down what part of your app is suffering. Is it the processing happening on the WebLogic/Java side of things, the database access, DNS lookups (if they're being done for some reason...), network issues, or something at the OS level?

99% of the time it will be your code and how it talks to the database that is holding things up. Then it will be configuration of the web app. Past this point you will be working on squeezing the last milliseconds out of your app or looking at providing higher concurrency with the same hardware. For this finer grained performance tuning you need metrics.

For Java I'd suggest installing Java Melody. It can provide a lot of info about what your program is doing and help narrow down where it is spending time. I've only used it with Tomcat, but it should work fine with any Java EE container/servlet engine.

There are a number of ways you can tune Java, so take a look at its performance guidelines (I'm sure you probably have) and make sure you're setting a heap size etc. suitable for your program. Java Melody can help you track how much of Java's heap you're consuming, as well as how hard the garbage collector is working and how often it is pausing your program to clear objects.
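For a quick look at heap sizing without installing anything, the Runtime API (a real, standard JDK API) reports the configured ceiling and current usage; this sketch just prints the three figures:

```java
public class HeapInfo {
    public static void main(String[] args) {
        Runtime rt = Runtime.getRuntime();
        long maxMb   = rt.maxMemory()   / (1024 * 1024); // ceiling set by -Xmx
        long totalMb = rt.totalMemory() / (1024 * 1024); // currently committed heap
        long freeMb  = rt.freeMemory()  / (1024 * 1024); // free within the committed heap
        long usedMb  = totalMb - freeMb;
        System.out.printf("heap used=%dMB committed=%dMB max=%dMB%n",
                usedMb, totalMb, maxMb);
    }
}
```

If `used` is persistently close to `max`, the garbage collector is likely running hard and frequently, which shows up as the pauses mentioned above.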

I hope that has been helpful. If you provide any more information, I may be able to update this answer and hone it more towards your needs.

 

With some further investigation, it appears that the performance problem is mostly due to a high number of network calls between two systems (Oracle SSXA and UCM). The calls are quick but numerous and serialized, hence the low CPU usage (mostly waiting for I/O), the high load average (many calls waiting to be processed), and especially the long response times (by accumulation of many small response times).
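The accumulation effect can be made concrete with some back-of-the-envelope arithmetic. The 5 ms per-call latency below is a hypothetical figure (the real SSXA/UCM numbers aren't given); the model of perfectly parallel workers is an idealization:

```java
public class RemoteCalls {
    // Hypothetical per-call network latency in milliseconds.
    static final double CALL_MS = 5.0;

    // Serialized calls: total time is the sum of individual latencies.
    static double serializedMs(int calls) {
        return calls * CALL_MS;
    }

    // Idealized parallel execution across nWorkers: calls overlap
    // instead of queueing one behind another.
    static double parallelMs(int calls, int nWorkers) {
        return Math.ceil((double) calls / nWorkers) * CALL_MS;
    }

    public static void main(String[] args) {
        System.out.printf("1000 serialized calls: %.0f ms%n", serializedMs(1000));
        System.out.printf("1000 calls on 20 workers: %.0f ms%n", parallelMs(1000, 20));
    }
}
```

A thousand 5 ms calls made one after another cost five full seconds of wall-clock time while the CPU sits idle waiting on the network, which matches the symptoms in the question: low CPU, high load, long response times.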

 

(2) How should the load average on the first line ("load average: 14.04, 14.02, 14.00") be understood?

This article on load average uses a nice traffic analogy and is the best one I've found so far: Understanding Linux CPU Load - when should you be worried?. In your case, as people have pointed out:

On a multi-processor system, the load is relative to the number of processor cores available. The "100% utilization" mark is 1.00 on a single-core system, 2.00 on a dual-core, 4.00 on a quad-core, etc.

So, with a load average of 14.00 and 24 cores, your server is far from being overloaded.

 

References:

http://www.linuxjournal.com/article/9001

http://blog.scoutapp.com/articles/2009/07/31/understanding-load-averages

 
