Performance Tuning Methodology

I’m taking a brief excursion from my usual identity and API-centric posts to answer a question about performance tuning that someone asked me earlier this year. In a previous incarnation of my career, I focused on performance tuning and diagnostics, especially of Java systems. However, the same principles apply to just about any running system. This post explores how to approach load testing and performance tuning for nearly anything.

Approach

Performance Testing (and Tuning) is generally best done with an approach similar to Black Box Testing, which involves testing a system without any specific knowledge of its internals, or at least approaching the problem in that fashion. In all likelihood, you are going to have a fairly detailed idea of what your system’s architecture and internals look like. This may not be true of SaaS, PaaS, or third-party vendor applications where you don’t have source code access.

This approach works with everything (including vendor and SaaS apps). Note that most SaaS providers (and third-party hosting providers) don’t want you load testing their systems unless specific provisions have been made from a legal, operational support, and system capacity standpoint.

So, we are not examining source code at this point; that comes later. We are interested in the overall behavior of the system (including the code running in it). I was at a client site a few years ago, working directly with an application development team doing this type of work; the lead developer had trouble understanding why my performance tuning methodology didn’t start with reviewing code.

This approach to performance tuning entails:

  1. Understand/document what needs to be tested.
  2. Define desired targets in terms of throughput or other system metrics.
  3. Create load test scripts.
  4. Produce load against system.
  5. Observe behavior of the end-to-end system (i.e., response time).
  6. Record CPU utilization, memory utilization, network I/O, disk I/O, and throughput during these tests (a simple monitoring sketch follows this list).
  7. Identify where bottlenecks are.
  8. Fix the most prominent bottleneck (even if you see more than one bottleneck, only fix one at a time).
  9. Repeat steps 4–8 until the desired Service Level Agreement (SLA) has been met.
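
For step 6, operating system tools and an APM or monitoring product are the usual way to capture these metrics. Purely as a minimal illustration of the idea, the sketch below (plain Java, using the standard java.lang.management MXBeans) samples a few JVM-visible numbers and prints them as CSV; the five-second interval and the choice of metrics are arbitrary for the example.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.lang.management.MemoryUsage;

// Minimal sketch: periodically sample JVM-visible metrics during a load test
// and print them as CSV (timestamp, heap used, system load average, live
// threads, total GC count). Real tests would also capture OS, network, and
// disk metrics with external tooling.
public class MetricsSampler {

    public static void main(String[] args) throws InterruptedException {
        System.out.println("timestamp,heapUsedBytes,systemLoadAvg,threadCount,gcCount");
        while (true) { // run until the test ends; stop the process to finish
            MemoryUsage heap = ManagementFactory.getMemoryMXBean().getHeapMemoryUsage();
            double loadAvg = ManagementFactory.getOperatingSystemMXBean().getSystemLoadAverage();
            int threads = ManagementFactory.getThreadMXBean().getThreadCount();
            long gcCount = ManagementFactory.getGarbageCollectorMXBeans().stream()
                    .mapToLong(GarbageCollectorMXBean::getCollectionCount)
                    .sum();
            System.out.printf("%d,%d,%.2f,%d,%d%n",
                    System.currentTimeMillis(), heap.getUsed(), loadAvg, threads, gcCount);
            Thread.sleep(5_000); // sample every 5 seconds (arbitrary interval)
        }
    }
}
```

A sampler like this would run inside or alongside the application under test, and its output would be lined up against the load test timeline when analyzing a run.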

Step #8 is the hard part.

Given my background in middleware, I tend to start towards the bottom of the technology stack and work my way up until I find the problem. Eventually, I get to the application code. Sometimes that proves more efficient; sometimes starting at the application code and working your way down is more efficient. One of my peers approaches this activity in the exact opposite fashion from me; he starts by looking at the code and then works his way down into the system until he finds the problem. His skill set includes being very proficient with Java debuggers, which lets him find problems efficiently that way. Use the approach that comes most naturally to you. Not everyone has the mindset or patience for this activity; your organization should employ someone who does.

When a bottleneck or error is encountered, start at the place where the issue is observed. Then, move deeper into the system as needed to identify the root cause. Maintaining discipline in this activity tends to pay off in the long run.

I described this methodology as similar to black box testing, but there is an important difference. The testing team is aware of the system internals (to some level) and often has access to the source code, but approaches testing as though it has neither. When problems start surfacing, the team uses its knowledge of the system (system internals, application code, etc.) to troubleshoot the problem. It is likely that the team members applying this knowledge to troubleshoot problems are different from the team members running the load tests.

What is Being Load Tested?

It’s necessary to understand what is being load tested. In fact, the more familiar you are with the system you are trying to tune, the more successful (and, most likely, the easier) your tuning efforts will be.

  • Get an accurate system architecture diagram. If there isn’t one, make one. Don’t have the information? Begin a walk-about around your IT department, interview people, and find the information. It’s incredible how difficult this part can be.
  • Identify what technologies and system components are used: web servers, application servers, databases, identity stack components, load balancers, firewalls, other components of the security apparatus, etc, etc.
  • Understand the network path that traffic will take. Understand the physical and virtual (SDN) network components involved.
  • Is this system running on-premise or in the cloud?
  • Is virtualization involved? Understand which VMs are attached to which physical hardware (ESX hosts, etc).
  • If in the cloud, is it an IaaS, PaaS, or SaaS solution? If the whole system is SaaS, there is probably nothing to do. If parts of the whole are SaaS, then contract limitations regarding load testing activities almost certainly apply.
  • Is containerization in some form used? Docker, etc? Does this container layer impact the CPU resources allocated to the system layer?

Create a document that captures all of this information before you begin creating load test scripts (or running load tests).

Human Actors

The following people (or roles that one or more people may fill) are needed for the load testing phase of a project. Some of these roles may be filled by the same person. Some may not be relevant to all situations. Sometimes, more than one person is needed to bring all the necessary knowledge to the table for one of these roles, and additional resources could be needed along the way.

Developer: wrote the application code. Is familiar with the application code base and can troubleshoot problems related to said code base as issues arise during the performance test cycle.

Load Test Engineer: builds load test, runs load test, and coordinates load testing activities.

System Administrator: monitors VM health and system resource utilization. Of course, this assumes that on-premise VMs or a cloud IaaS solution is being used, rather than a PaaS or SaaS solution.

Middleware Administrator: monitors application server health and resource utilization.

Database Administrator: monitors database server health and resource utilization.

Network Engineer: monitors network health and resource utilization.

Sign Off Manager: the manager responsible for making the final call regarding whether an application should be allowed past the load test gate on the way to production. This should not be the manager whose bonus is tied to getting any particular application into production.

Prep Work

Before load testing begins, some information must be gathered.

This includes:

  • the information described in the “What is Being Load Tested?” section above.
  • defined SLAs (success criteria) for the performance of the application. These can include transactions per second (TPS), concurrent transactions, concurrent users, or other relevant criteria (a small sketch of recording such targets appears below). In large organizations, this is probably a formalized process complete with questionnaires for the development team.
  • common, real-world transactions that should be reflected in the load test.

That last one is the real trick. If this is a brand new system, then you are essentially guessing. If this is an existing system receiving a point upgrade or minor update, then there is real-world data that can be used to figure out the relative usage of various transactions and build the load test from it.
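
To make the SLA bullet concrete, it helps to write the success criteria down as checkable numbers rather than prose. The sketch below is purely illustrative; the field names and the idea of a PerformanceSla record are my own placeholders, not a standard:

```java
// Hypothetical, minimal representation of load test success criteria.
// The names and values are examples only; real SLAs come from the business
// and operations teams.
public record PerformanceSla(
        double targetTps,            // sustained transactions per second
        int concurrentUsers,         // concurrent logged-in users to simulate
        long maxP95LatencyMillis,    // 95th percentile response time ceiling
        double maxErrorRate) {       // e.g. 0.001 means 0.1% of requests may fail

    // Returns true if a completed test run met every target.
    public boolean isMetBy(double observedTps, long observedP95Millis, double observedErrorRate) {
        return observedTps >= targetTps
                && observedP95Millis <= maxP95LatencyMillis
                && observedErrorRate <= maxErrorRate;
    }
}
```

Captured this way, step 9 of the methodology (repeat until the SLA is met) becomes a yes/no check instead of a debate.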

The Process

The basic idea is to generate load, identify bottlenecks, eliminate bottlenecks, and repeat until the desired SLA is met at the desired load. Getting through even one iteration of this cycle can take anywhere from hours to days (if not longer for new or complex systems).

The steps described in the “Prep Work” section above must be completed before you begin. If you don’t have this information, success will be elusive. Or, you will simply be tracking it down a little ways down the road.

You need to produce load. Choose your load testing tools. Ideally, your organization would already have tooling and hardware in place. Use these tools to record a load test that mimics expected production usage.
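
Purpose-built tools (JMeter, Gatling, LoadRunner, k6, and so on) are the normal way to record and replay these scripts. Purely as a sketch of what one scripted transaction can look like, here is a plain-Java version that assumes a hypothetical /login and /account/summary endpoint on the system under test; the URLs, payload, and class name are placeholders:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

// Sketch of a single scripted user transaction: log in, then fetch a page,
// returning the elapsed time in milliseconds. The endpoints and payload are
// hypothetical stand-ins for whatever the recorded production flow is.
public class SampleTransaction {

    private static final HttpClient CLIENT = HttpClient.newBuilder()
            .connectTimeout(Duration.ofSeconds(5))
            .build();

    public static long execute(String baseUrl, String user, String password) throws Exception {
        long start = System.nanoTime();

        HttpRequest login = HttpRequest.newBuilder(URI.create(baseUrl + "/login"))
                .timeout(Duration.ofSeconds(10))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(
                        "{\"user\":\"" + user + "\",\"password\":\"" + password + "\"}"))
                .build();
        HttpResponse<String> loginResponse = CLIENT.send(login, HttpResponse.BodyHandlers.ofString());
        if (loginResponse.statusCode() != 200) {
            throw new IllegalStateException("login failed: " + loginResponse.statusCode());
        }

        HttpRequest summary = HttpRequest.newBuilder(URI.create(baseUrl + "/account/summary"))
                .timeout(Duration.ofSeconds(10))
                .GET()
                .build();
        HttpResponse<String> summaryResponse = CLIENT.send(summary, HttpResponse.BodyHandlers.ofString());
        if (summaryResponse.statusCode() != 200) {
            throw new IllegalStateException("summary failed: " + summaryResponse.statusCode());
        }

        return (System.nanoTime() - start) / 1_000_000; // elapsed milliseconds
    }
}
```

A real script would add think time, parameterized test accounts, and response validation beyond a status code check.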

Run load tests that produce a sufficient amount of traffic to meet the defined SLA. You will approach this target incrementally.

  • Start with a fraction of desired load (1–2 TPS). Run for 10–15 minutes.
  • Look for errors.
  • Monitor metrics mentioned earlier.
  • Resolve any issues that are observed. This part of the process can take a little bit of time.
  • Increase load until the desired level is reached (a sketch of this ramp-up loop follows this list).
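
A rough sketch of that ramp-up loop, with the scripted transaction reduced to a Runnable placeholder, might look like the following; the stage TPS values and ten-minute stage length are illustrative, not recommendations:

```java
import java.util.List;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicLong;

// Sketch of the incremental ramp described above: run a scripted transaction
// at each target TPS for a fixed stage length, and stop ramping as soon as
// errors appear. Stage values and durations are placeholders.
public class RampDriver {

    public static void main(String[] args) throws InterruptedException {
        // Placeholder for the real scripted transaction (an HTTP flow, etc.).
        Runnable transaction = () -> { /* call the system under test here */ };

        for (int tps : List.of(2, 5, 10, 25, 50)) {
            long errors = runStage(transaction, tps, 10);
            System.out.printf("stage at %d TPS finished with %d errors%n", tps, errors);
            if (errors > 0) {
                System.out.println("Stopping ramp; investigate and fix before increasing load.");
                break;
            }
        }
    }

    // Fires the transaction at roughly the requested TPS for the given number
    // of minutes and returns how many executions threw an exception.
    static long runStage(Runnable transaction, int tps, int minutes) throws InterruptedException {
        AtomicLong errors = new AtomicLong();
        ScheduledExecutorService scheduler = Executors.newScheduledThreadPool(tps);
        scheduler.scheduleAtFixedRate(() -> {
            try {
                transaction.run();
            } catch (Exception e) {
                errors.incrementAndGet();
            }
        }, 0, 1_000_000 / tps, TimeUnit.MICROSECONDS);

        TimeUnit.MINUTES.sleep(minutes);
        scheduler.shutdownNow();
        scheduler.awaitTermination(30, TimeUnit.SECONDS);
        return errors.get();
    }
}
```

A commercial or open source load tool handles the pacing, think time, and reporting for you; the point here is only that each stage runs long enough to surface errors before the load goes up again.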

Most of the time when I have done this, the first several iterations of running a small amount of load encounter various issues, ranging from identity and access problems with test accounts to network connectivity. The test engineer must get the test to a point where it can run at 1–2 TPS without any errors occurring. For a brand new system, this can be challenging. Keep at it and you will get there. But be realistic; I’ve seen this take anywhere from a couple of days to 2–6 weeks, depending on the complexity of the system.

Once errors related to configuration and application bugs have been eliminated, you can begin increasing load. At some point, you will hit your first load-related issue.

Unfortunately, I cannot tell you how to resolve those issues. These could be at any layer/tier of the application. It could be in system configuration, middleware configuration, system resources, application code, or other pieces. As a first step, narrow down where the problem is.

  • Is it in the database?
  • Is it threads in an application server?
  • Do one or more servers have insufficient memory?

Having a good understanding of the software the application is built on top of is a necessary precondition to doing this successfully. It doesn’t have to be the load test engineer who has these skill sets. Other Subject Matter Experts (SMEs) can be brought in as necessary to monitor and troubleshoot problems.
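
On the Java side, one low-cost way to narrow down where the time is going is to capture thread dumps while the system is under load (jstack or kill -3 against the JVM process does the same job from the command line). A minimal in-process sketch looks like this:

```java
import java.lang.management.ManagementFactory;
import java.lang.management.ThreadInfo;

// Sketch: dump the state and stack of every live thread in this JVM.
// Under load, many threads blocked at the same point (a JDBC call, a
// synchronized block, a remote call) is a strong hint about the bottleneck.
public class ThreadDumper {

    public static void main(String[] args) {
        ThreadInfo[] threads = ManagementFactory.getThreadMXBean()
                .dumpAllThreads(true, true); // include lock and monitor info
        for (ThreadInfo info : threads) {
            System.out.print(info); // ThreadInfo.toString() prints name, state, and top stack frames
        }
    }
}
```

Several dumps taken a few seconds apart are more useful than one; a thread sitting in the same frame across consecutive dumps is worth a closer look.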

Once the current issue is resolved, run the load test again. Keep repeating this cycle until the system can run a sustained load test for at least several hours (4–6) at the desired load level.

As issues are encountered, it can be difficult to separate the cause and effect. Which observation is the root cause and which is a side effect? This comes with experience.

Keep a detailed log of configuration changes that are made when troubleshooting this process. It seems obvious, but sometimes people forget to be diligent about this and then have no idea what fixed it. Only change one thing at a time! Even if it seems like a small detail, ONLY APPLY ONE CHANGE AT A TIME.

Additional Thoughts

  • Apply basic tuning (vendor recommendations, lessons learned from previous deployments, etc) to systems before starting the load test cycle described in this document. For example, JEE datasources need to have more than one connection.
  • Matching desired TPS or concurrent users is typically not enough. The test needs to have sufficient unique users to log into the system to mimic what would actually happen in production.
  • Disable auto-scaling of VMs, container images, and virtual CPUs during load tests.
  • Ideally, the load test environment would exactly mirror what will be used in production. However, it may not be clear ahead of time what production will need, or funding may not be available for a full replica. In the latter case, most organizations build half, a quarter, or one stripe of production, which can then be grown as necessary.
  • Every system has a bottleneck if you increase traffic sufficiently. The trick is to keep addressing the next bottleneck only until you’ve reached the desired SLA at the required load.
  • If using Java technology, remember, background/concurrent garbage collectors aren’t much use if there is only one core allocated to the VM.
  • Have sufficient data in the database during load tests to mimic real-life conditions. The size of the data in the database usually has a large impact on system performance. For example, query performance may change dramatically with real production data volumes; sizing the boxes isn’t enough, the data should be sized similarly too. A query can behave very differently under low load against a small data set than under high load against a large one.
  • Clocks need to be synchronized with a common time server.
  • Logs across systems should be written to a common location that is easily searchable — think Splunk.
  • Have a way to record network packet captures (either through root access and software on the box to record traffic off the NIC, or sniffers on span ports on switches) — a lot of performance troubleshooting involves the network, even when the root cause isn’t a network issue. You’ll need this capability set up for each node of the system.
  • Be aware of random flukes and external factors you are not expecting that can impact the results of load tests. If something weird happens in one iteration, make sure you can reproduce it.
  • Automate database population and cleanup between load test runs.
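
For that last point, a hedged sketch of automating population and cleanup with plain JDBC follows. The table name, columns, row count, and environment variables are hypothetical placeholders; the real script would load whatever data shape and volume the earlier point about realistic data calls for:

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.sql.Statement;

// Sketch: reset and repopulate a test table between load test runs so every
// run starts from the same, realistically sized data set. The table, columns,
// row count, and JDBC connection details are placeholders.
public class TestDataLoader {

    public static void main(String[] args) throws SQLException {
        String jdbcUrl = System.getenv("LOADTEST_JDBC_URL"); // e.g. jdbc:postgresql://host/db
        try (Connection conn = DriverManager.getConnection(jdbcUrl,
                System.getenv("LOADTEST_DB_USER"), System.getenv("LOADTEST_DB_PASSWORD"))) {
            conn.setAutoCommit(false);

            // Cleanup: remove data left over from the previous run.
            try (Statement cleanup = conn.createStatement()) {
                cleanup.executeUpdate("DELETE FROM accounts WHERE created_by = 'loadtest'");
            }

            // Population: insert enough rows to approximate production volume.
            String insert = "INSERT INTO accounts (username, balance, created_by) VALUES (?, ?, 'loadtest')";
            try (PreparedStatement ps = conn.prepareStatement(insert)) {
                for (int i = 0; i < 1_000_000; i++) {
                    ps.setString(1, "loadtest-user-" + i);
                    ps.setLong(2, 10_000);
                    ps.addBatch();
                    if (i % 10_000 == 0) {
                        ps.executeBatch(); // flush periodically to keep memory bounded
                    }
                }
                ps.executeBatch();
            }
            conn.commit();
        }
    }
}
```

Wiring something like this into the test pipeline, before each run, keeps iterations comparable with one another.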

Summary

Before new systems or changes to existing systems go live, performance tuning needs to be done. Performance tuning is a process. It is a journey, not a destination. You could always spend more time to make the system a little more efficient. But is that needed? This is where defining the required SLA (or target performance level) ahead of time pays off. It tells you when the system is not just good, but good enough.

Image: Kogod Courtyard — looking up — Smithsonian American Art Museum / Tim Evanson