Mapping An LWP Using Excessive CPU Time To a Java Thread

Introduction

It happens. End users are complaining about a slow system or a request that never returned. You log into a production middleware server and notice that a java process is consuming an entire CPU, or worse, every available CPU on the box. It has happened again: a thread has gone into an infinite loop on your production system. It was a routine problem with COBOL programs running in CICS regions on mainframes (this still happens routinely today, depending on the company), I watched it happen many times in Tuxedo domains running on Unix machines, and the J2EE platform is not immune to it either.

Generally, if an LWP, or Light Weight Process, is observed consuming an entire CPU's worth of processor time (e.g., 25% on a four-way box), the thread is either stuck in an infinite loop or in some type of deep, possibly infinite, recursion. First, let me propose an assumption that I have largely found to be true. In a business-application context, say a web front end to a trading system or other financial application, long-running CPU-intensive activities are very rare. If you encounter something (that proves to be a single request) that runs for an extended period of time, a relative concept, consuming an entire CPU or nearly so, something is amiss. I posed an infinite loop or deep recursion as the two most likely scenarios, and in our business setting an infinite loop is the more likely of the two. There will be exceptions, of course; weird behavior inside an XML DOM parser sets the stage for infinite recursion problems. But my experience has been that it is usually an infinite loop when you encounter an LWP consuming an entire CPU in a business application. Again, this could be perfectly normal in another setting.

Threads, Threads Everywhere. And, Not a Needle In Sight

While writing this article I went off on a tangent discussing the relationship between processes, LWPs, Java threads, native threads, and kernel threads. I removed those thoughts from this article and put them in their own, which can be found here. For the purposes of our discussion, we are assuming a modern JVM on Solaris, Linux, or AIX (the only Unixes on which I have experience running Java, whether it's the Sun or IBM JVM), where there is a 1-1 correspondence between Java threads, native threads (provided by a threading library in user space), Light Weight Processes (LWPs), and kernel threads. On Solaris 9 and above, the default threading model guarantees this relationship. On a Linux 2.6 kernel using the NPTL, there is a 1-1 correspondence between these concepts. On an AIX 5.3 system, this relationship will also exist as long as the AIXTHREAD_SCOPE=S environment variable is set; this is the default for the Java 1.5 (and above) IBM JVMs.

I am oversimplifying things here. For each of these operating systems, we are discussing the relationships between user threads (or LWPs) and kernel threads. All modern JVM implementations on these platforms support a 1-1 mapping between Java threads and native threads (user-space threads). Back in 1999, Java 1.1 running on a Solaris 2.5.1 system would map N Java threads to one native thread (and one LWP). The JVM would then schedule its threads onto the time slices that the single native thread was given by the kernel. This was not a very scalable design, and things changed dramatically in later versions of Java. Once the Java-thread-to-native-thread level of the chain (Java threads <-> native/user-space threads <-> LWPs <-> kernel threads) was fixed at 1-1, the relationships among the remaining parts were determined by the threading library and the operating system.

It is important to understand that the relationships between these four concepts–Java Threads, Native Threads, LWP, & Kernel Threads–are dependent upon the JVM implementation, the native thread library, and the Operating System. I’ll save the rest of this discussion for the Thread article.

There is a great deal of debate and philosophical difference revolving around thread implementations. That is far beyond the scope of this article, but I will say I am a big fan of the Solaris way of doing things.

The Problem

So, we have a java process on a production machine, and top, topas, or prstat is reporting that it is using far more CPU than it normally does. In the absence of a deeper knowledge of multithreading and JVM internals, we might end our investigation here. But we know that there are many threads running in our application server process, so to troubleshoot the problem effectively, we have to know which thread is actually using the CPU time. In particular, we need to know which Java thread is using the excessive CPU time and what code it is running.

Put in more concrete terms, we need to identify the LWP (identified by an LWPID) that is reported as using excessive CPU, then map that LWPID to the NID (Native Thread ID) associated with a Java thread. How to find the LWPID of an LWP using excessive CPU and map it to a Java thread is the focus of this article. More importantly, we're going to map that LWPID to a section of application Java code, or perhaps an internal JVM function (e.g., garbage collection).

The Solution

To determine which thread is using the CPU, there are three steps:

1. Identify which LWP is using excessive CPU time.
2. Map the LWP’s LWPID to a Java thread’s NID (Native Thread ID)
3. Identify what that Java thread is doing.
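On Linux, the first two steps can be condensed into a small shell function. This is my own sketch, not a standard tool: the name find_busy_nid is hypothetical, and it leans on the ps thread options and hex-NID convention covered later in the article.

```shell
# find_busy_nid: for a given Java pid, print the "nid=0x..." token to
# search for in a thread dump. Covers steps 1 and 2 only; the function
# name and flag choices are mine.
find_busy_nid() {
    pid="$1"
    # Step 1: the LWP with the highest %cpu (first column of this format)
    lwp=$(ps -Lo %cpu,lwp -p "$pid" --no-headers | sort -rn | head -n 1 | awk '{print $2}')
    # Step 2: thread dumps list native thread IDs in hexadecimal
    printf 'nid=0x%x\n' "$lwp"
}
```

Step 3 remains manual: take a thread dump and search it for the printed token.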

Step #1

The first step differs depending upon which operating system you're running. I will briefly review how to do this for Solaris, Linux, and AIX, and then present an example using Linux (something I have ready access to). For the moment, I'm assuming that there is only one thread of interest.

Solaris

On Solaris systems, finding an LWP that is using a large amount of CPU is very easy: prstat. Just run the following command against the Java pid you are investigating:

prstat -p pid -L

By default, LWPs are ranked in order of decreasing CPU utilization. So, the one at the top is the one you are interested in.

AIX

On AIX, the goal is the same, but there isn't a prstat tool that presents the information we need as succinctly. Here, the ps command will be relied upon to report the CPU utilization of each LWP/thread. Run the following command to dump the needed information:

ps -mp pid -o THREAD

An example of this command's output is given here. The information you need is in the sixth column. The first data line actually reports CPU information for the entire process; this is not what you want. Running this output through a sort command with the '-k' and '-r' flags orders the data by decreasing CPU utilization, which is what we really want. Furthermore, grep can be used to remove the full-process data line (look for the string "java", perhaps).

This could serve as the basis of a prstat-like script that generates the data we want to see. Regardless, we can find the LWPID we need with this information.
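As a sketch of what that script's core could look like: the sort_threads name is mine, and the filter patterns assume the header line contains "USER" and the full-process line is the one showing the command name "java".

```shell
# sort_threads: read `ps -mp <pid> -o THREAD` output on stdin and emit
# the per-thread lines ordered by decreasing CPU (the sixth column),
# dropping the header and the whole-process summary line.
sort_threads() {
    grep -v -e 'USER' -e 'java' | sort -rn -k 6,6
}

# On a live AIX box (hypothetical pid):
#   ps -mp 31548 -o THREAD | sort_threads | head
```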

Linux

Linux, like AIX, is missing a prstat-like tool that can break down CPU utilization by LWP and present it as conveniently as top does. But there is still the venerable ps command. On Linux, ps can report per-LWP CPU utilization with the following:

ps -Lo pid,ruser,vsz,rss,state,priority,nice,time,%cpu,comm,lwp,psr

There is a similar problem in that the data is not sorted by decreasing CPU usage by default. I originally tried to use the "--sort=-pcpu" flag, but that didn't seem to work on my system; it always ordered from least CPU used to most. So, I used the tac command to reverse the data.

Likewise, this is the basis of a handy script–an exercise for the reader.
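A minimal version of that script might look like the following, using the sort-then-tac approach just described; the lwp_cpu name and the --no-headers/head additions are mine.

```shell
# lwp_cpu: per-LWP CPU usage for a given pid, busiest LWPs first.
# %cpu is the ninth column in the format list, hence `sort -k 9`;
# `tac` flips the ascending numeric sort into descending order.
lwp_cpu() {
    ps -Lo pid,ruser,vsz,rss,state,priority,nice,time,%cpu,comm,lwp,psr \
        -p "$1" --no-headers | sort -n -k 9,9 | tac | head
}
```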

Step #2

We now have an offending LWPID. It should be noted that this is a decimal number. The next step is to grab a thread dump and match the LWPID with an NID belonging to a Java thread listed in the dump. The NIDs are listed in hexadecimal. So, starting up Calculator in scientific mode or using:

echo "ibase=10; obase=16; a_number" | bc

will give you the hex representation you need.

Using a method outlined in this article, create a thread dump. Search through the thread dump for your hex-formatted LWPID. Note that, depending on the vendor, the hex digits A-F could be represented as upper-case or lower-case letters. Assuming the simplest case, you will find a match like "nid=0xNNNN" for the LWPID you found in Step #1 (and converted here in Step #2).

This is the Java Thread corresponding to the LWP using excessive CPU.
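From the shell, printf can do the same base conversion as bc, and grep -i sidesteps the upper/lower-case question; the LWPID and dump file name below are purely illustrative.

```shell
lwpid=31760                      # hypothetical LWPID from Step #1
printf 'nid=0x%x\n' "$lwpid"     # prints the search token: nid=0x7c10
# Then search whatever file holds the thread dump, e.g.:
#   grep -i 'nid=0x7c10' out
```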

Step #3

At this point, you know which Java thread is using excessive CPU, and you are staring at a stack trace that explains what is happening. Ideally, the stack trace references Java classes with which you are familiar. Based upon this information, you can direct the issue to the appropriate development group or vendor. If it falls to you to interpret what is happening in this thread, that will be the focus of another article on my part. Until then, I wish you luck.

An Example

This article, as they so often do, will end with an example. The example uses a simple program I created called ManyThreads.java, which spawns a bunch of threads, gets most of them stuck in deadlocks, and sends one into an infinite loop. We will then identify which of these threads is the culprit.

I am running this example on a Linux 2.6.25 kernel using Sun Java 1.5.0_16.

I am using this script to execute the program. I am running it with nice because the java process spawns over 200 LWPs, and that was causing problems on my system.

Once the program is run, we can see an issue with CPU in top:

top – 18:57:13 up 5 days, 19:45, 1 user, load average: 1.00, 1.00, 1.00
Tasks: 126 total, 2 running, 123 sleeping, 1 stopped, 0 zombie
Cpu(s): 0.3%us, 0.3%sy, 99.0%ni, 0.0%id, 0.0%wa, 0.0%hi, 0.3%si, 0.0%st
Mem: 1025980k total, 1007256k used, 18724k free, 72496k buffers
Swap: 2031608k total, 88k used, 2031520k free, 760336k cached

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
31548 a_user 30 10 350m 15m 7112 S 99.0 1.6 317:08.76 java
2353 root 20 0 193m 9364 4620 S 0.3 0.9 24:33.73 Xorg
2452 gdm 20 0 13828 4900 2484 S 0.3 0.5 2:10.00 pulseaudio
32101 a_user 20 0 2528 964 744 R 0.3 0.1 0:00.01 top
1 root 20 0 1948 740 536 S 0.0 0.1 0:00.84 init
2 root 15 -5 0 0 0 S 0.0 0.0 0:00.01 kthreadd
3 root RT -5 0 0 0 S 0.0 0.0 0:00.00 migration/0
4 root 15 -5 0 0 0 S 0.0 0.0 0:24.33 ksoftirqd/0
5 root RT -5 0 0 0 S 0.0 0.0 0:03.27 watchdog/0
6 root 15 -5 0 0 0 S 0.0 0.0 0:03.48 events/0
7 root 15 -5 0 0 0 S 0.0 0.0 0:00.00 khelper
57 root 15 -5 0 0 0 S 0.0 0.0 0:09.78 kblockd/0
59 root 15 -5 0 0 0 S 0.0 0.0 0:00.00 kacpid
60 root 15 -5 0 0 0 S 0.0 0.0 0:00.00 kacpi_notify
130 root 15 -5 0 0 0 S 0.0 0.0 0:00.00 cqueue
132 root 15 -5 0 0 0 S 0.0 0.0 0:00.08 ksuspend_usbd
137 root 15 -5 0 0 0 S 0.0 0.0 0:00.22 khubd
140 root 15 -5 0 0 0 S 0.0 0.0 0:00.00 kseriod
181 root 15 -5 0 0 0 S 0.0 0.0 0:13.07 kswapd0
222 root 15 -5 0 0 0 S 0.0 0.0 0:00.00 aio/0
366 root 15 -5 0 0 0 S 0.0 0.0 0:00.00 kpsmoused
398 root 15 -5 0 0 0 S 0.0 0.0 1:09.65 ata/0

The formatting isn’t perfect in this medium, but you can see that the java process at the top is using 99% of the CPU. So, our next step is to get a listing of CPU utilization by LWP within this process:

31548 a_user 359088 16348 R 30 10 05:21:14 98.8 java 31760 0
31548 a_user 359088 16348 S 30 10 00:00:00 0.0 java 31759 0
31548 a_user 359088 16348 S 30 10 00:00:00 0.0 java 31758 0
31548 a_user 359088 16348 S 30 10 00:00:00 0.0 java 31757 0
31548 a_user 359088 16348 S 30 10 00:00:00 0.0 java 31756 0
31548 a_user 359088 16348 S 30 10 00:00:00 0.0 java 31755 0
31548 a_user 359088 16348 S 30 10 00:00:00 0.0 java 31754 0
31548 a_user 359088 16348 S 30 10 00:00:00 0.0 java 31753 0
31548 a_user 359088 16348 S 30 10 00:00:00 0.0 java 31752 0
31548 a_user 359088 16348 S 30 10 00:00:00 0.0 java 31751 0
31548 a_user 359088 16348 S 30 10 00:00:00 0.0 java 31750 0
31548 a_user 359088 16348 S 30 10 00:00:00 0.0 java 31749 0
31548 a_user 359088 16348 S 30 10 00:00:00 0.0 java 31748 0
31548 a_user 359088 16348 S 30 10 00:00:00 0.0 java 31747 0
31548 a_user 359088 16348 S 30 10 00:00:00 0.0 java 31746 0
31548 a_user 359088 16348 S 30 10 00:00:00 0.0 java 31745 0
31548 a_user 359088 16348 S 30 10 00:00:00 0.0 java 31744 0
31548 a_user 359088 16348 S 30 10 00:00:00 0.0 java 31743 0
31548 a_user 359088 16348 S 30 10 00:00:00 0.0 java 31742 0
31548 a_user 359088 16348 S 30 10 00:00:00 0.0 java 31741 0
31548 a_user 359088 16348 S 30 10 00:00:00 0.0 java 31740 0
31548 a_user 359088 16348 S 30 10 00:00:00 0.0 java 31739 0
31548 a_user 359088 16348 S 30 10 00:00:00 0.0 java 31738 0
31548 a_user 359088 16348 S 30 10 00:00:00 0.0 java 31737 0
31548 a_user 359088 16348 S 30 10 00:00:00 0.0 java 31736 0
31548 a_user 359088 16348 S 30 10 00:00:00 0.0 java 31735 0
31548 a_user 359088 16348 S 30 10 00:00:00 0.0 java 31734 0
31548 a_user 359088 16348 S 30 10 00:00:00 0.0 java 31733 0

Neither top nor this ps output shows it, but there are actually 217 LWPs in this java process. And now we know exactly which LWP is using the CPU: LWPID 31760. In hex, this is 7c10. Now we need to generate a thread dump. In this example, standard out is being redirected to a file in the working directory called out, so we can easily load this file into your favorite text editor and search for '7c10' or '7C10'.

I found a match corresponding to the following stack trace:

“Thread-200” prio=1 tid=0x09e20890 nid=0x7c10 runnable [0xa2647000..0xa2647da0]
at Thread3.run(ManyThreads.java:37)

A quick look in ManyThreads.java shows line thirty seven as:

while(true);//tight loop:)…

I submit to the reader that it is obvious why this thread is burning CPU time.

This is a relatively simple example. It is possible that multiple threads could be using excessive CPU time; it is also possible that the thread in question is actually a JVM system daemon thread doing something like garbage collection. Troubleshooting such an issue will be the subject of another article.
