/ October 2011 ~ Java EE Support Patterns

10.30.2011

OutOfMemoryError – Oracle Service Bus Alerts

This case study describes the complete steps from root cause analysis to resolution of a sudden OutOfMemoryError problem experienced with Oracle Service Bus 2.6 running on Weblogic Server 9.2.
This case study will also demonstrate how you combine Thread Dump analysis and Solaris PRSTAT to pinpoint culprit Thread(s) consuming large amount of memory on your application Java Heap.

Environment specifications

-         Java EE server: Oracle Service Bus 2.6 & Weblogic Server 9.2
-         OS: Solaris 10
-         JDK: Sun HotSpot 1.5 SR7
-         RDBMS: Oracle 10g
-         Platform type: Enterprise Service Bus

Troubleshooting tools

-         Quest Foglight for Java EE
-         Java VM Thread Dump (for stuck Thread analysis)
-         Solaris PRSTAT (for Thread CPU contributors breakdown)

Problem overview

-         Problem type: OutOfMemoryError

An OutOfMemoryError problem was observed in our Oracle Service Bus Weblogic logs along with major performance degradation.

Preventive restart was then implemented since a Java Heap memory leak was initially suspected which did not prove to be efficient.

Gathering and validation of facts

As usual, a Java EE problem investigation requires gathering of technical and non technical facts so we can either derived other facts and/or conclude on the root cause. Before applying a corrective measure, the facts below were verified in order to conclude on the root cause:

·        What is the client impact? HIGH
·        Recent change of the affected platform? No
·        Any recent traffic increase to the affected platform? No
·        Since how long this problem has been observed?  Since several months with no identified root cause
·        Is the OutOfMemoryError happening suddenly or over time? It was observed that the Java Heap is suddenly increasing and can then take several hours before failing with a OutOfMemoryError
·        Did a restart of the Weblogic server resolve the problem? No, preventive restart was attempted which did not prove to be efficient

-         Conclusion #1: The sudden increase of the Java Heap could indicate long running Thread(s) consuming a large amount of memory in a short amount of time

-         Conclusion #2: A Java Heap leak cannot be ruled out at this point

Java Heap monitoring using Foglight

Quest Foglight monitoring tool was used to monitor the trend of the Oracle Service Bus Java Heap for each of the managed servers. You can see below a sudden increase of the Java Heap OldGen space can be observed.

This type of memory problem normally does not indicate a linear or long term memory leak but rather rogue Thread(s) consuming large amount of memory in a short amount of time. In this scenario, my recommendation to you is to first perform Thread Dump analysis before jumping too quickly to JVM Heap Dump analysis.


Thread Dump analysis

Thread Dump snapshots were generated during the sudden Java Heap increase and analysis did reveal a primary culprit as you can see below.

Analysis of the different Thread Dump snapshots did reveal a very long elapsed time of this Thread along with a perfect correlation with our Java Heap increase. As you can see, this Thread is triggered by the ALSB alerting layer and is involved in read/write of the Weblogic internal diagnostic File Store.

"pool-2-thread-1" prio=10 tid=0x01952650 nid=0x4c in Object.wait() [0x537fe000..0x537ff8f0]
        at java.lang.Object.wait(Native Method)
        at java.lang.Object.wait(Object.java:474)
        at weblogic.common.CompletionRequest.getResult(CompletionRequest.java:109)
        - locked <0xf06ca710> (a weblogic.common.CompletionRequest)
        at weblogic.diagnostics.archive.wlstore.PersistentStoreDataArchive.readRecord(PersistentStoreDataArchive.java:657)
        at weblogic.diagnostics.archive.wlstore.PersistentStoreDataArchive.readRecord(PersistentStoreDataArchive.java:630)
        at weblogic.diagnostics.archive.wlstore.PersistentRecordIterator.fill(PersistentRecordIterator.java:87)
        at weblogic.diagnostics.archive.RecordIterator.fetchMore(RecordIterator.java:157)
        at weblogic.diagnostics.archive.RecordIterator.hasNext(RecordIterator.java:130)
        at com.bea.wli.monitoring.alert.AlertLoggerImpl.getAlerts(AlertLoggerImpl.java:157)
        at com.bea.wli.monitoring.alert.AlertLoggerImpl.updateAlertEvalTime(AlertLoggerImpl.java:140)
        at com.bea.wli.monitoring.alert.AlertLog.updateAlertEvalTime(AlertLog.java:248)
        at com.bea.wli.monitoring.alert.AlertManager._evaluateSingleRule(AlertManager.java:992)
        at com.bea.wli.monitoring.alert.AlertManager.intervalCompleted(AlertManager.java:809)
        at com.bea.wli.monitoring.alert.AlertEvalTask.execute(AlertEvalTask.java:65)
        at com.bea.wli.monitoring.alert.AlertEvalTask.run(AlertEvalTask.java:37)
        at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:417)
        at java.util.concurrent.FutureTask$Sync.innerRun(FutureTask.java:269)
        at java.util.concurrent.FutureTask.run(FutureTask.java:123)
        at java.util.concurrent.ThreadPoolExecutor$Worker.runTask(ThreadPoolExecutor.java:650)
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:675)
        at java.lang.Thread.run(Thread.java:595)

Solaris PRSTAT

Solaris PRSTAT was also captured in order to understand the amount of CPU consumed and determine if this Thread was stuck or performing a lot of processing.

As you can see below, Thread id #76 was identified as the primary CPU contributor with 8.6 %.

prstat –L <Java Pid>

  PID USERNAME  SIZE   RSS STATE  PRI NICE      TIME  CPU PROCESS/LWPID
  9116 bea      3109M 2592M sleep   59    0  21:52:59 8.6% java/76
  9116 bea      3109M 2592M sleep   57    0   4:28:05 0.3% java/40
  9116 bea      3109M 2592M sleep   59    0   6:52:02 0.3% java/10774
  9116 bea      3109M 2592M sleep   59    0   6:50:00 0.3% java/84
  9116 bea      3109M 2592M sleep   58    0   4:27:20 0.2% java/43
  9116 bea      3109M 2592M sleep   59    0   7:39:42 0.2% java/41287
  9116 bea      3109M 2592M sleep   59    0   3:41:44 0.2% java/30503
  9116 bea      3109M 2592M sleep   59    0   5:48:32 0.2% java/36116
  9116 bea      3109M 2592M sleep   59    0   6:15:52 0.2% java/36118
  9116 bea      3109M 2592M sleep   59    0   2:44:02 0.2% java/36128
  9116 bea      3109M 2592M sleep   59    0   5:53:50 0.1% java/36111
  9116 bea      3109M 2592M sleep   59    0   4:27:55 0.1% java/55
  9116 bea      3109M 2592M sleep   59    0   9:51:19 0.1% java/23479
  9116 bea      3109M 2592M sleep   59    0   4:57:33 0.1% java/36569
  9116 bea      3109M 2592M sleep   59    0   9:51:08 0.1% java/23477
  9116 bea      3109M 2592M sleep   59    0  10:15:13 0.1% java/4339
  9116 bea      3109M 2592M sleep   59    0  10:13:37 0.1% java/4331
  9116 bea      3109M 2592M sleep   59    0   4:58:37 0.1% java/36571
  9116 bea      3109M 2592M sleep   59    0   3:13:46 0.1% java/41285
  9116 bea      3109M 2592M sleep   59    0   4:27:32 0.1% java/48
  9116 bea      3109M 2592M sleep   59    0   5:25:28 0.1% java/30733

Thread Dump and PRSTAT correlation approach

In order to correlate the PRSTAT data please follow the steps below:

-         Identify the primary culprit / contributor Thread(s) and locate the Thread Id (DECIMAL format) from the last column
-         Convert the decimal value to HEXA since Thread Dump native Thread id is in HEXA format
-         Search from your captured Thread Dump data and locate the Thread native Id >> nid=<Thread Id HEXA>

For our case study:

-         Thread Id #76 identified as the primary culprit
-         Thread Id #76 was converted to HEXA format >> 0x4c
-         Thread Dump data Thread native Id >> nid=0x4c

Such correlation and finding did confirm that this rogue Thread was actively processing (reading/writing to the File Store) and allocating a large amount of memory on the Java Heap in a short amount of time.

Root cause

The above analysis along with Mark Smith Oracle Blog article finding did confirm the root cause.

Oracle Service Bus allows Alert actions to be configured within the message flow (pipeline alerts). These pipeline alert actions generate alerts based on message context in a pipeline, to send to an alert destination. Such alert destination is the actual Weblogic diagnostic File Store which means this structure will grow over time depending of the volume of Alerts that your OSB application is generating.

It is located under >> //domain_name/servers/server_name/data/store/diagnostics/

In our case, the File Store size was around 800 MB.

Such increase of the diagnostic File Store size over time is leading to an increase of elapsed time of the Thread involved in read/write operations; allocating large amount of memory on the Java Heap. Such memory cannot be garbage collected until Thread completion which is leading to OutOfMemoryError and performance degradation.

Solution

2 actions were required to resolve this problem:

-         Reset the diagnostic File Store by renaming the existing data file and forcing Weblogic to create a fresh one
-         Review and reduce the Oracle Service Bus alerting level to the minimum acceptable level

The reset of the diagnostic File Store did bring some immediate relief and by ensuring short and optimal diagnostic File Store operations and not causing too much impact on the Java Heap.

The level of OSB alerting is still in review and will be reduced shortly in order to prevent this problem at the source.

Regular monitoring and purging of the diagnostic File Store will also be implemented in order to prevent such performance degradation going forward; as per Oracle recommendations.

Conclusion and recommendations

I hope this article has helped you understand how to identify and resolve this Oracle service bus problem and appreciate how powerful Thread Dump analysis and Solaris PRSTAT tool can be to pinpoint this type of OutOfMemoryError problem.

Please do not hesitate to post any comment or question.

10.29.2011

net framework 4 - DebugDiag tutorial

This case study describes the complete root cause analysis and resolution of a Thread hang problem affecting an ASP.NET 4 Web application.

As a Java Thread Dump analysis expert, my first natural question to approach this problem was: how can I generate a .NET Thread Dump in order to pinpoint our slowdown condition? The answer was found with the Windows dump file and their latest Debug Diagnostic Tool 1.2.

This article will demonstrate how you can take advantage of the new Microsoft Debug Diagnostic Tool 1.2 to investigate and troubleshoot hang Thread problems. For people familiar with Java EE, you will definitely notice similarities vs. Java VM Thread Dump and Heap Dump analysis.

Environment specifications (case study)

-         Web and App server: ASP.NET framework 4 & IIS 7
-         OS: Windows Server 8 Service Pack 2 - 64-bit
-         Platform type: Web application

Monitoring and Troubleshooting tools

** DebugDiag 1.2 can be downloaded for free from Microsoft **

-         Compuware Server Vantage (IIS connection monitoring)
-         Debug Diagnostic Tool 1.2

Problem overview

-         Problem type: Major slowdown and hang Threads was observed from the .net framework 4 servers affecting the overall Web application response time

Such problem was observed especially during high load along with an increase of the active IIS connection count.

Gathering and validation of facts

Similar to Java  EE problems, ASP.NET framework 4 and IIS performance problems require gathering of technical and non technical facts so we can either derived other facts and/or conclude on the root cause. Before applying a corrective measure, the facts below were verified in order to conclude on the root cause:

·        What is the client impact? HIGH; major Web application slowdown
·        Recent change of the affected platform? No
·        Any recent traffic increase to the affected platform? No
·        What is the health of the IIS server? Overall IIS health is OK but an increase of IIS active connections is observed; ultimately causing performance degradation
·        Did a restart of the IIS and application pool resolve the problem? No, this was used as a mitigation strategy only but problem is re-occurring quickly following IIS and application pool recycle

-         Conclusion #1: The problem appears to be isolated at the .NET framework 4 application layer due to an unidentified .NET Thread contention

Diagnostic Debug Tool overview

The Debug Diagnostic Tool (DebugDiag) is designed to assist in troubleshooting issues such as hangs, slow performance, memory leaks or fragmentation, and crashes in any user-mode process. The tool includes additional debugging scripts focused on Internet Information Services (IIS) applications, web data access components, COM+ and related Microsoft technologies.

This is the tool that our team used to successfully pinpoint the source of slowdown and Thread hang condition.

Step #1 – Create a dump file of your affected w3wp process

In order to troubleshoot .NET Thread hangs problems, you first need to generate a dump file of your affected .NET application pool(s) process. This can be done by following the steps below:

-         Start Windows Task Manager by performing a right-click an empty area of the task bar, and then click Start Task Manage
-         Select the Processes tab.
-         Identify and Right-click the name of the process that you want, and then click Create Dump File. If you are prompted for an administrator password or confirmation, type your password or click Continue.
-         A dump file for the process is created in the following folder Drive:\Users\UserName\AppData\Local\Temp
When you receive a message that states that the dump file was successfully created, click OK.

This dump file contains a lot of information on your application pool and ASP.NET runtime environment including memory footprint, Threads, attached client connections etc.

Is it now time to analyze the dump in order to understand your Thread hang condition.

Step #2 – Start the Debug Diagnostic Tool

Start the Debug Diagnostic Tool and load your w3wp dump file.




Step #4 – Launch the Hang Analyzer

Start the Hang Analyzer and wait for completion of the analysis. Please be patient as this process may take several minutes depending of the size of your dump file…




Step #5 – Review the report analysis overview

Once the Hang Analyzer is done, it will automatically open the analysis report in your browser.

This report is quite detailed and will provide you with:

-         An analysis summary (blocked Threads etc.) along with recommendations
-         A .NET Threads summary; including TOP 5 Thread CPU contributors
-         A HttpContext report with Thread Id correlation
-         A detailed .NET Thread Dump (similar to Java VM Thread Dump)
-         A detailed HTTP report (list of all active and stuck HTTP connections)



Step #6 – Review the blocked Thread summary

Now please review your blocked Threads summary.

As you can see below from our case study, the blocked Threads originate mainly from our remote SOSS service. SOSS is a distributed cache service that our application is using and invoking remotely via a WinSock (Socket).


Step #7 – Blocked Threads deep dive analysis

At this point, it is now time to review a few samples of your blocked Threads in order to pinpoint the source of contention. Simply select a Thread id from the summary or scroll down the report to look at the raw .NET Thread Stack Trace data.

This is quite similar to Java Thread Dump analysis but the Debug Diagnostic Tool includes more detail such as description of the hanging call etc. Always start your Stack Trace (.NET Call stack) from bottom up. The first line will always give you the current Thread hang operation and should give you a better indication of the culprit of your problem (hanging Web Service provider, hanging database call etc.).

As you can see for our case study, the source of Thread contention was identified and was caused by a slow response time of our SOSS service (.Sosslib_invoke(UInt32, Void*, SByte*, UInt32, UInt32*)).


Conclusion

I hope this simple tutorial has helped you understand and appreciate the powerful Debug Diagnostic Tool and how you can leverage it to analyze .NET framework 4 hang Thread patterns.

Please don’t hesitate to post any comment or question.

10.24.2011

verbosegc output tutorial - Java 7

Verbose output for the Java VM and garbage collection process has been around for quite a long time but I often get questions on how to read and interpret the output.

This short article will provide with a detailed tutorial on how to enable and read Java 7 HotSpot VM verbose gc output data.

I recommend that you compile and run the sample program on your end as well.

Tutorial specifications and tools

·        OS: Windows 7 - 64-bit
·        Java VM: Sun Java 7 HotSpot (build 21.0-b17)
·        IDE: Eclipse Java EE IDE for Web Developer v4.1


Tutorial video available

We also recently created a Java verbose GC tutorial video explaining this analysis process.

Step #1 – Compile our sample Java program

We created a sample Java program in order to load the Java Heap and trigger an explicit GC in order to generate some interesting verbose GC output. This program is simply loading about 3 million instances of java.lang.String in a static Map data structure and triggers an explicit GC (via System.gc()) followed by the removal of 2 million instances along with a second explicit GC before exiting.

package org.ph.javaee.tools.jdk7;

import java.util.Map;
import java.util.HashMap;

/**
 * JavaHeapVerboseGCTest
 * @author Pierre-Hugues Charbonneau
 *
 */
public class JavaHeapVerboseGCTest {
      
       private static Map<String, String> mapContainer = new HashMap<String, String>();
      
       /**
        * @param args
        */
       public static void main(String[] args) {
            
             System.out.println("Java 7 HotSpot Verbose GC Test Program v1.0");
             System.out.println("Author: Pierre-Hugues Charbonneau");
             System.out.println("http://javaeesupportpatterns.blogspot.com/");
            
             String stringDataPrefix = "stringDataPrefix";
       
             // Load Java Heap with 3 M java.lang.String instances
             for (int i=0; i<3000000; i++) {
                    String newStringData = stringDataPrefix + i;
                    mapContainer.put(newStringData, newStringData);
             }
            
             System.out.println("MAP size: "+mapContainer.size());
             System.gc(); // Explicit GC!
            
             // Remove 2 M out of 3 M
             for (int i=0; i<2000000; i++) {
                    String newStringData = stringDataPrefix + i;
                    mapContainer.remove(newStringData);
             }
            
             System.out.println("MAP size: "+mapContainer.size());
             System.gc();
             System.out.println("End of program!");      
            
       }
}


Step #2 – Enable verbose GC via the JVM start-up arguments


The next step is to enable the verbose GC via the JVM start-up arguments and specify a name and location for our GC log file.




Step #3 – Execute our sample Java program


At this point, it is now time to execute our sample program and generate the JVM verbose GC output.



Verbose GC output high level analysis

It is now time to review the generated GC output.

First, let’s start with the raw data. As you can see below, the GC output is divided into 3 main sections:

·        5 Minor collections (YoungGen space collections) identified as PSYoungGen
·        2 Major collections (triggered by System.gc()) identified as Full GC (System)
·        A detailed Java Heap breakdown of each memory space

0.437: [GC [PSYoungGen: 262208K->43632K(305856K)]
       262208K->137900K(1004928K), 0.1396040 secs]
       [Times: user=0.45 sys=0.01, real=0.14 secs]
       
0.785: [GC [PSYoungGen: 305840K->43640K(305856K)]
       400108K->291080K(1004928K), 0.2197630 secs]
       [Times: user=0.56 sys=0.03, real=0.22 secs]

1.100: [GC [PSYoungGen: 164752K->43632K(305856K)]
       412192K->340488K(1004928K), 0.0878209 secs]
       [Times: user=0.37 sys=0.00, real=0.09 secs]

1.188: [Full GC (System) [PSYoungGen: 43632K->0K(305856K)]
       [PSOldGen: 296856K->340433K(699072K)]
       340488K->340433K(1004928K)
       [PSPermGen: 1554K->1554K(16384K)], 0.4053311 secs]
       [Times: user=0.41 sys=0.00, real=0.40 secs]
      
1.883: [GC [PSYoungGen: 262208K->16K(305856K)]
       602641K->340449K(1004928K), 0.0326756 secs]
       [Times: user=0.09 sys=0.00, real=0.03 secs]

2.004: [GC [PSYoungGen: 92122K->0K(305856K)]
       432556K->340433K(1004928K), 0.0161477 secs]
       [Times: user=0.06 sys=0.00, real=0.02 secs]

2.020: [Full GC (System) [PSYoungGen: 0K->0K(305856K)]
       [PSOldGen: 340433K->125968K(699072K)]
       340433K->125968K(1004928K)
       [PSPermGen: 1555K->1555K(16384K)], 0.2302415 secs]
       [Times: user=0.23 sys=0.00, real=0.23 secs]
      
Heap
 PSYoungGen      total 305856K, used 5244K [0x3dac0000, 0x53010000, 0x53010000)
  eden space 262208K, 2% used [0x3dac0000,0x3dfdf168,0x4dad0000)
  from space 43648K, 0% used [0x4dad0000,0x4dad0000,0x50570000)
  to   space 43648K, 0% used [0x50570000,0x50570000,0x53010000)
 PSOldGen        total 699072K, used 125968K [0x13010000, 0x3dac0000, 0x3dac0000)
  object space 699072K, 18% used [0x13010000,0x1ab140a8,0x3dac0000)
 PSPermGen       total 16384K, used 1560K [0x0f010000, 0x10010000, 0x13010000)
  object space 16384K, 9% used [0x0f010000,0x0f1960b0,0x10010000)

Verbose GC data interpretation and sequence

As you can see from the verbose GC output, the OldGen space was at 340 MB after the initial loading of 3 M String instances in our HashMap. It did go down to 126 MB following the removal of 2 M String instances.

Now find below explanation and snapshots on how you can read the GC output data in more detail for each Java Heap space.

## YoungGen space analysis



## OldGen space analysis




## PermGen space analysis





## Java Heap breakdown analysis




Conclusion

I hope this sample Java program and verbose GC output analysis has helped you understand how to read and interpret this critical data. Please do not hesitate to post any comment or question. My next article will cover the verbose GC for the IBM Java JRE.