The first thing to analyze is why the restarts happen at all. Service invocation exceptions might still be tolerable, but automatic restarts are completely unacceptable. So the first step is to work out which causes could make OC4J restart on its own.
Search with the core keywords: oc4j restart Internal Server Error.
We encountered similar problems. Turned out we had to tweak memory
settings. It appeared that the OC4J instance restarted because of
memory problems (leaky garbage collection something).
OC4J restarts are usually related to memory problems: memory leaks, out-of-memory errors, or trouble during garbage collection. It is worth adding a SAR shell script on the Linux host to track, around the restart window, whether memory usage stays normal or whether memory hitting 100% crashed the JVM.
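For reference, such a script could be as simple as the sketch below; the log directory, interval and sample count are my own illustrative choices, not values from the actual environment, and it assumes the sysstat package (sar) is installed:

# Sample memory and CPU every 60 seconds for one hour per run,
# appending to dated files so the restart window can be checked afterwards.
LOGDIR=/var/log/oc4j-monitor            # illustrative path
mkdir -p "$LOGDIR"
sar -r 60 60 >> "$LOGDIR/mem-$(date +%Y%m%d).log" &   # memory utilization
sar -u 60 60 >> "$LOGDIR/cpu-$(date +%Y%m%d).log" &   # CPU utilization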
OK, there may be a reason for opmn to "think" your instance is
dead. One reason is that the instance is so busy that it can't
respond to opmn's requests.
What does it mean for the process to be "dead"? At this point it was not yet clear to me.
We've had a problem with the OC4J restarting at random times due to
the opmn ping timeout. I recently made a change to our opmn.xml
file adding in the following values as suggested in this Oracle
article - Agile PLM OC4J Intermittently Restarting After "OC4J Ping
Attempt Timed Out" Due to Unresponsive Process [ID 572136.1]
<data id="reverseping-failed-ping-limit" value="5"/>
<ping timeout="60" interval="60"/>
So a ping timeout can trigger a restart. Looking back at our earlier logs, they do contain the line "OC4J ping attempt timed out".
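To see how often this actually happens, the OPMN logs can be searched for that message; the path below is the usual Oracle AS layout, so treat it as an assumption about the concrete installation:

# Count the ping-timeout events recorded by OPMN (log path is an assumption)
grep -ci "ping attempt timed out" $ORACLE_HOME/opmn/logs/*.log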
I am getting the same issue with AS 10.1.3.3. Basically, OPMN is
trying to do a ping and then does not get a response in time. Then
it forcefully restarts the container.
As Steve said, when OPMN restarts the OC4J instance it is
unreachable by a simple ping. Unreachable can have many reasons:
- The simplest one is that you have installed AS on a DHCP host and forgot to adjust the hostname - ip number relationship. On a Windows environment you need to configure your system accordingly. The installation guide has a complete description for this.
- Another reason is that the instance itself is massively busy and can't respond to the ping in time.
As I said earlier in the thread, you need to look through the
various log files for any hints/indications as to why the OC4J
instance is not responding to the OPMN health checks.
OPMN will kill and restart an OC4J process if the OC4J process
does not seem to be alive any longer. The measures of "aliveness"
are based on OC4J responding to client requests (ie sending back
responses to clients), OC4J sending regular heartbeat notifications
to OPMN and OC4J directly responding to requests from OPMN for a
health check. If some/all of those fail in different combinations,
then OPMN will assume the OC4J process is hung, it will kill the
underlying OS process, and start a fresh one.
Therefore you need to look into why the OC4J instance is not
responding.
In short: OPMN sends heartbeat/health checks to OC4J to monitor its health. If it gets no response, OPMN concludes the OC4J process has hung and automatically restarts it.
The question then becomes: why would the OC4J process hang and be unable to handle the heartbeat checks that OPMN sends it?
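Before digging deeper, a quick sanity check I find useful (not something taken from the quoted threads) is to ask OPMN itself how it currently sees the container; a freshly reset uptime is a strong hint that OPMN has just killed and restarted it:

# Long status listing of all OPMN-managed processes: PID, memory, uptime, port
opmnctl status -l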
Continuing to analyze why OC4J hangs, I found a summary page of known OC4J issues; the link is:
https://www.mylendlease.com/904unix_rn/relnotes.904/relnotes/oc4j.htm#1041435
That page mentions the following:
Process Ping Failed: OC4J~<instance name>~default_island~1 (opmnid)
The line above indicates that the memory and CPU resources of the
current host are probably not sufficient to perform the operation
within the currently specified ping timeout interval (used by OPMN
to determine OC4J "responsiveness").
Our logs contain exactly this line, which again suggests that insufficient CPU or memory resources caused OC4J to hang and stop responding to OPMN's health checks. The concrete fix given there is to modify the opmn.xml configuration file as follows:
<ping timeout="60" interval="60"/>
<data id="reverseping-failed-ping-limit" value="5"/>
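For completeness, after editing opmn.xml the change still has to be picked up by OPMN; a typical sequence looks like the sketch below, where the process-type value is an assumed instance name, not the real one from our system:

opmnctl reload                               # make the OPMN daemon re-read opmn.xml
opmnctl restartproc process-type=home        # bounce the OC4J instance ("home" is an assumed name)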
We adjusted the configuration file according to this fix, but the problem remained. This shows once more that when a ready-made "solution" turns up in a search, our usual step-by-step way of narrowing down a problem breaks down, or we fail to stick with it. With the shortcut exhausted, we had to come back to the essence of the problem: memory or CPU saturation is making the OC4J process unresponsive, so the task now becomes monitoring memory and CPU.
While monitoring memory and CPU everything looked normal, which was the first oddity: it simply did not add up. It then occurred to us that we should not be monitoring the physical host's memory, but the JVM's own memory usage. So we monitored the JVM memory with jstat (a minimal jstat invocation is sketched after the reading list below), and sure enough the heap space kept growing towards 100%. By this point a lot of new material had to be learned. To stress it once more: solving a hard, core problem is itself a process of continuous learning, and problem-driven learning is the best kind of learning. To do technical work you need real interest in technology and the willingness to attack hard problems head-on, even losing sleep over them. The material studied includes:
JVM parameter tuning: http://lijunjie.iteye.com/blog/278923
Introduction to JVM monitoring tools: http://dolphin-ygj.iteye.com/blog/366216
JVM memory analysis and resolving out-of-memory errors: http://www.cnblogs.com/flynewton/archive/2010/09/03/1817057.html
The JVM memory model and garbage collection: http://www.zhixing123.cn/jsp/10563.html
JVM startup parameters analyzed by example: http://zhaohe162.blog.163.com/blog/static/38216797201201321317290/
Investigating JVM memory leaks: http://blog.csdn.net/winniepu/article/details/4934764
The JVM jstat command in detail: http://blog.csdn.net/fenglibing/article/details/6411951
A summary of JVM tuning: http://unixboy.iteye.com/blog/174173
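As promised above, the jstat monitoring itself is a one-liner once the JVM process id is known; the way the PID is looked up below is only an illustration:

# Find the OC4J JVM process id (the grep pattern is an illustrative assumption)
PID=$(ps -ef | grep java | grep oc4j | grep -v grep | awk '{print $2}' | head -1)
# Print heap, perm and GC statistics as percentages every 5 seconds;
# an old-generation (O) column stuck near 100% plus a climbing FGC count is the pattern we saw.
jstat -gcutil "$PID" 5000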
The point of all this reading was to understand in detail the JVM memory model and garbage collection mechanism, how to monitor JVM memory, and how to tune JVM parameters. Coming back to the original problem, its focus had now shifted to two issues: the heap space that keeps growing, and full GC running too frequently. Let us look at the frequent full GC problem first:
GC strategy and memory allocation: http://www.189works.com/article-68355-1.html
Frequent full GC turns out to have only three causes: the old space is too small, the perm space is too small, or something explicitly calls System.gc(), for example the periodic trigger built into RMI. By adjusting the JVM parameters we ruled out insufficient old and perm space one after the other, which left explicit System.gc() calls, such as the ones scheduled by RMI, as the only remaining cause.
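For the record, "ruling out" old and perm space shortage here simply meant giving both generations explicit, generous sizes and re-checking; the numbers below are examples only, not the values actually used:

# Illustrative sizing options: fix the total heap and enlarge the permanent generation
JVM_OPTS="-Xms1024m -Xmx1024m -XX:PermSize=256m -XX:MaxPermSize=256m"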
Further reading showed that Sun's RMI implementation can indeed cause frequent full GCs, roughly once a minute, which hurts performance when traffic is heavy. The fix is to suppress explicit GC calls by adding the startup parameter -XX:+DisableExplicitGC. After readjusting the JVM parameters and analyzing again, the frequent full GC problem was gone.
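The relevant startup options, written out as a sketch: -XX:+DisableExplicitGC makes the JVM ignore System.gc() calls; the sun.rmi.dgc properties are an alternative way to stretch the RMI-triggered collection to once an hour, mentioned here as an option rather than something we configured:

# Ignore explicit System.gc() calls (including those issued by RMI distributed GC)
JVM_OPTS="$JVM_OPTS -XX:+DisableExplicitGC"
# Alternative: keep explicit GC but let RMI trigger it only once per hour
# JVM_OPTS="$JVM_OPTS -Dsun.rmi.dgc.client.gcInterval=3600000 -Dsun.rmi.dgc.server.gcInterval=3600000"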
The remaining problem is that the heap space still keeps growing, or is never released. While working on it we found that on our Unix environment the jmap command simply could not be used to inspect how memory is distributed inside the process. The alternative was to use the jconsole tool, or to export the information of every thread in the process with a thread dump. The articles consulted are:
Baidu Baike article on thread dump: http://baike.baidu.com/view/5111187.htm
How to take and read a thread dump: http://frankzhao.blog.51cto.com/273790/395861
After taking thread dumps of the process with these methods, the next step is to analyze CPU and memory usage further in order to find the root cause of the problem.
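A minimal sketch of taking such a dump on Unix; kill -3 always works and writes the dump to the JVM's stdout/console log, while jstack depends on the JDK shipped with the server:

# PID is the OC4J JVM process id (see the jstat sketch earlier)
kill -3 "$PID"                                          # thread dump goes to the JVM's stdout/console log
jstack "$PID" > /tmp/oc4j-threads-$(date +%H%M%S).txt   # if jstack is available on this JDK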
Truly solving a problem involves hypotheses and verification, elimination, following leads, boundaries and branches, gradually narrowing and advancing the scope, and extending and digging deeper into the problem. As long as the method is right, the problem can eventually be solved.