Fix Memory Leaks in Java Production Applications - Dynatrace APM Blog
诊断识别解决 Java 内存泄漏
Adding more memory to your JVMs (Java Virtual Machines) might be a temporary solution to fixing memory leaks in Java applications, but it for sure won’t fix the root cause of the issue. Instead of crashing once per day it may just crash every other day. “Preventive” restarts are also just another desperate measure to minimize downtime – but – let’s be frank: this is not how production issues should be solved.
One of our customers – a large online retail store – ran into such an issue. They run one of their online gift card self-service interfaces on two JVMs. Especially during peak holiday seasons – when users are activating their gift cards or checking the balance – crashes due to OOM (Out Of Memory) were more frequent which caused bad user experience. The first “measure” they took was to double the JVM Heap Size. This didn’t solve the problem as JVMs were still crashing, so they followed the memory diagnostics approach for production as explained in Java Memory Leaks to identify and fix the root cause of the problem.
Before we walk through the individual steps, let’s look at the memory graph that shows the problems they had in December during the peak of the holiday season. The problem persisted even after increasing the memory. They could fix the problem after identifying the real root cause and applying specific configuration changes to a 3rd party software component:
Step 1: Identify a Java Memory Leak
The first step is to monitor the JVM/CLR Memory Metrics such as Heap Space. This will tell us whether there is a potential memory leak. In this case we see memory usage constantly growing resulting in an eventual runtime crash when the memory limit is reached.
Step 2: Identify problematic Java Objects
The out-of-memory exception automatically triggers a full memory dump that allows for analysis of which objects consumed the heap and are most likely to be the root cause of the out-of-memory crash. Looking at the objects that consumed most of the heap below indicates that they are related to a 3rd party logging API used by the application.
A closer look at an instance of the VPReportEntry4 shows that it contains 5 Strings – with one consuming 23KB (as compared to several bytes of other string objects).This also explains the high GC Size of the String class in the overall Heap Dump.
Following the referrer chain further up reveals the complete picture. The EventQueue keeps LogEvents in an Array which itself keeps VPReportEntrys in an Array. All of these objects seem to be kept in memory as the objects are being added to these arrays but never removed and therefore not garbage collected:
Step 3: Who allocates these objects?
Analyzing object allocation allows us to figure out which part of the code is creating these objects and adding them to the queue. Creating what is called a “Selective Memory Dump” when the application reached 75% Heap Utilization showed the customer that the ReportWriter.report method allocated these entries and that they have been “living” on the heap for quite a while.
Step 4: Why are these objects not removed from the Heap?
The premise of the 3rd party logging framework is that log entries will be created by the application and written in batches at certain times by sending these log entries to a remote logging service using JMS. The memory behavior indicates that – even though these log entries might be sent to the service, these objects are not always removed from the EventQueue leading to the out-of-memory exception.
Further analysis revealed that the background batch writer thread calls a logBatch method which loops through the event queue (calling EventQueue.next) to send current log events in the queue. The question is whether as many messages were taken out of the queue (using next) vs put into the queue (using add) and whether the batch job is really called frequently enough to keep up with the incoming event entries. The following chart shows the method executions of add, as well as the call to logBatch highlighting that logBatch is actually not called frequently enough and therefore not calling next to remove messages from the queue:
Step 5: Fixing the Java Memory Leak problem
After providing this information to the 3rd party provider and discussing with them the number of log entries and their system environment the conclusion was that our customer used a special logging mode that was not supposed to be used in high-load production environments. It is like running with DEBUG log level in a high load or production environment. This overwhelmed the remote logging service and this is why the batch logging thread was stopped and log events remained in the EventQueue until the out of memory occurred.
After making the recommended changes the system could again run with the previous heap memory size without experiencing any out-of-memory exceptions.
They still use the same dashboards they have built to troubleshoot this issue, to monitor for any future excessive logging problems.
Conclusion
Adding additional memory to crashing JVMs is most often not a temporary fix. If you have a real Java memory leak it will just take longer until the Java runtime crashes. It will even incur more overhead due to garbage collection when using larger heaps. The real answer to this is to use the simple approach explained here. Look at the memory metrics to identify whether you have a leak or not. Then identify which objects are causing the issue and why they are not collected by the GC. Working with engineers or 3rd party providers (as in this case) will help you find a permanent solution that allows you to run the system without impacting end users and without additional resource requirements.
Next Steps
If you want to learn more about Java Memory Management or general Application Performance Best Practices check out our free online Java Enterprise Performance Book. Existing customers of our APM Solution may also want to check out additional best practices on our APM Community.