Our aim is to carry out performance testing efficiently and at lower cost. Where applicable, we use open-source JMeter as the load testing tool. However, like many open-source tools, it lacks some features out of the box, such as built-in real-time monitoring and configurable metrics and graphs for test reports. To address this, JMeter provides the Backend Listener, which can send metrics to a backend. Together with Grafana and InfluxDB, it gives us a powerful option to monitor the performance of tests in real-time. This setup offers the following benefits:
- Live real-time results,
- Visualized key metrics help understand what is occurring during the test,
- Real-time results can be shared with a client. Monitoring results together with the DevOps team makes it easier to analyze the behavior of the system and its infrastructure collaboratively, and speeds up identifying a problem, making a decision, and acting on it,
- We can trace the overall as well as individual performance metrics in real-time for the relevant transactions (a transaction is a set of requests that emulates a single business action) and requests,
- Graphs can be adjusted by adding the necessary metrics to get a better view and to correlate several parameters for more effective analysis,
- Metrics can be adjusted for the specific needs of the project. For example, we might want to separate response times for assets, web requests, and API method invocations to identify slow areas and locate the bottleneck,
- The time period of the report is set by a filter. It is a good idea to exclude warm-up and cool-down periods from the measured time to get a realistic picture of system behavior at a specific load volume. When the load increases, the system needs some time to warm up: it populates pools and caches at different system layers, compiles classes, and adjusts resources to the load volume, which makes data access faster at runtime. The first request made to an application is often significantly slower than the average response time over the lifetime of the process. To get more accurate results, we warm up the system with a preliminary test at a lower load volume. Additionally, when running a load test, we increase the number of users gradually and exclude the warm-up time from the measured time,
- Ability to compare 2 or more load tests. We can trace the impact of change and optimization between different builds by comparing metrics between them,
- Monitoring data can be stored in the same backend as the JMeter results.
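The exclusion of warm-up and cool-down time described in the list above is normally done with the dashboard's time filter. As an illustration only, the same idea can be sketched on raw samples. The `Sample` type, the function names, and all numbers below are hypothetical and are not part of JMeter or Grafana.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    timestamp: float   # seconds since test start (assumed unit)
    elapsed_ms: float  # response time in milliseconds
    success: bool


def measured_window(samples, warmup_s, cooldown_s):
    """Keep only samples between the end of warm-up and the start of cool-down."""
    if not samples:
        return []
    start = min(s.timestamp for s in samples) + warmup_s
    end = max(s.timestamp for s in samples) - cooldown_s
    return [s for s in samples if start <= s.timestamp <= end]


def mean_elapsed(samples):
    """Average response time in ms, or None when there are no samples."""
    if not samples:
        return None
    return sum(s.elapsed_ms for s in samples) / len(samples)


# Made-up data: a slow first request, a steady state, then a cool-down phase.
samples = [
    Sample(0, 2500, True),
    Sample(30, 900, True),
    Sample(60, 200, True),
    Sample(90, 220, True),
    Sample(120, 210, True),
    Sample(150, 230, True),
    Sample(180, 400, True),
]
steady = measured_window(samples, warmup_s=60, cooldown_s=30)
print("all samples:", mean_elapsed(samples))
print("steady state:", mean_elapsed(steady))
```

With this made-up data, the slow early requests inflate the overall average, while the trimmed window reflects only the steady-state behavior.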
Metrics to monitor in real-time
The chosen set of metrics should be sufficient to understand what is happening with the application, but not so large that it causes overhead on the load generators and reduces their performance.
- The Summary Grafana panel displays common metrics: Current Active Threads, Current Overall Transaction Throughput, Transaction Success Rate, and Error Rate.
- The summary graph displays the Overall metrics: Throughput and 90th Percentile Response time correlated with Active sessions (Load volumes).
- Throughput of Successful transactions vs Errors correlated with Active sessions
Having access to a transaction log and error messages, we can get additional detail on errors and provide a client team with our observations in real-time.
For example, in the graph above, errors started to appear at 14:44 when 4,000 users were running. The error rate reached 51.27% at the 4,000-user level. The previous level, before the system broke down, was 3,000 users. Most of the errors were HTTP 500 (Internal Server Error), SSL handshake exceptions (remote host closed connection during handshake), and HTTP 404 (Not Found).
Monitoring the test together with the DevOps team, we discovered that the CPU was maxed out around the time the system failed to serve requests. By the end, 100% of memory was consumed, and the percentage of time spent in GC was high, also reaching 100%. API Gateway is serverless, but it has Elasticsearch behind it, which runs on a server with limited resources (8 GB of RAM). API Gateway suddenly became unresponsive at 14:44 because the Elasticsearch side was overloaded.
It appears that the application broke at 580 transactions per second. In this example, the bottleneck was detected in real-time and the stress test was stopped, saving time, resources, and associated expenses.
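To make the summary metrics concrete, here is a minimal sketch of how throughput, success rate, and error rate relate to raw counts, and how failures can be grouped by type. The function names are hypothetical, and the counts are made up; they were chosen only so that the error rate comes out at the 51.27% quoted above, and are not the actual test data.

```python
from collections import Counter


def summary(total, failed, duration_s):
    """Return throughput (per second), success rate, and error rate (percent)."""
    if total <= 0 or duration_s <= 0:
        raise ValueError("total and duration_s must be positive")
    error_rate = 100.0 * failed / total
    return {
        "throughput_per_s": total / duration_s,
        "success_rate_pct": 100.0 - error_rate,
        "error_rate_pct": error_rate,
    }


def error_breakdown(failure_codes):
    """Count failures per response code or exception name, most common first."""
    return Counter(failure_codes).most_common()


# Hypothetical counts, not taken from the test described in the text.
stats = summary(total=10_000, failed=5_127, duration_s=20)
print(stats)
print(error_breakdown(["500", "500", "SSLHandshakeException", "404", "500"]))
```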
- Response Time vs Throughput
Using the following graph, we were able to determine the load volume at which the system broke down, as well as the load volume it supported, its capacity, and its responsiveness. The overall throughput rose to 599 UT/s with 4,000 users until the system broke down at 14:44. The 90th percentile transaction response time stayed within 1 sec for successful transactions until the breakpoint, then increased to 18 sec. Overall response times went up to 3 min at the end.
The throughput of successful transactions was 460.53 UT/s with a 90th percentile response time of 1 sec during the 10-minute interval before the system broke down, which corresponds to 1,657,908 user sessions per hour.
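The hourly figure follows directly from the per-second throughput. A short check of the arithmetic, assuming (as the text implies) that one successful transaction counts as one user session:

```python
SECONDS_PER_HOUR = 3600


def per_hour(rate_per_second):
    """Convert a per-second rate into a per-hour rate."""
    return rate_per_second * SECONDS_PER_HOUR


# 460.53 UT/s is the successful-transaction throughput quoted in the text.
print(f"{per_hour(460.53):,.0f}")  # prints 1,657,908
```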
- Server-side statistics
Working collaboratively with the client’s team, we request monitoring of key server statistics during testing. Typical metrics we need to monitor include:
o CPU utilization, CPU context switches,
o Memory consumption, amount of free memory, heap memory/GC,
o Page/Swap file utilization,
o Disk Usage, Disk time, amount of free disk space,
o Network bandwidth utilization,
o Any load balancing stats available (if clustering is used),
o If applicable, DB statistics, such as the longest and most frequently executed queries, timed events, latencies, throughput, hit ratio, memory and activity, error metrics, query effectiveness, DB lock contention, I/O latency, time working vs time waiting, etc.,
o Application server statistics, such as the slowest and most frequently called methods.
Once server statistics are provided, we examine them to see whether any metric correlates with the observed capacity, scalability, and response time behavior. Using real-time results and with the assistance of the DevOps team, we identify problems quickly during the test. Once we see a problem, we need to find the cause. This is where the server and network key performance indicators really come into play.
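One simple way to check whether a server metric moves together with response time is a Pearson correlation coefficient over samples taken at the same moments. The sketch below is only an illustration: the `pearson` helper and the data series are hypothetical, and a high coefficient is a hint about where to look, not proof of a cause.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length series; None if either is constant."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return None  # correlation is undefined for a constant series
    return cov / sqrt(var_x * var_y)


# Made-up series sampled at the same timestamps during a test.
cpu_pct = [20, 35, 50, 70, 90, 100]
p90_sec = [0.4, 0.5, 0.6, 0.8, 1.0, 18.0]
print(pearson(cpu_pct, p90_sec))
```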
- The Transaction Metrics Overview reveals the slowest, most frequently called, and failing transactions in real-time.
- The Request Metrics Overview helps drill down to the requests that might be causing performance degradation.
- Throughput, Response Time and Error rate for an individual transaction and a request:
- Key metrics for each time interval or a load volume
For a stress test, the steadily increasing load profile raises the number of users every 10 minutes throughout the run until the system becomes saturated or begins to perform unacceptably. In this case, it is worth checking metrics for each 10-minute interval rather than overall metrics to understand what load volume the system can handle, and at what level of quality and performance. The performance monitoring tool allows us to calculate and display metrics for specific time intervals and output a table with key metrics for each time period.
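The per-interval table can be sketched as follows, assuming samples are available as `(timestamp_s, elapsed_ms, success)` tuples. The function names, the tuple layout, and the nearest-rank percentile method are assumptions made for illustration; the monitoring tool may calculate percentiles differently.

```python
from collections import defaultdict
from math import ceil


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list of numbers."""
    ordered = sorted(values)
    rank = max(1, ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def interval_metrics(samples, interval_s=600):
    """Group samples into fixed intervals and compute key metrics for each one."""
    buckets = defaultdict(list)
    for timestamp_s, elapsed_ms, success in samples:
        buckets[int(timestamp_s // interval_s)].append((elapsed_ms, success))
    rows = []
    for index in sorted(buckets):
        entries = buckets[index]
        ok = [elapsed for elapsed, success in entries if success]
        rows.append({
            "interval_start_s": index * interval_s,
            # Divides by the full interval length, so a partial last interval reads low.
            "throughput_per_s": len(entries) / interval_s,
            # Response time is reported for successful transactions only.
            "p90_ms": percentile(ok, 90) if ok else None,
            "error_rate_pct": 100.0 * (len(entries) - len(ok)) / len(entries),
        })
    return rows


# Made-up samples: ten successes in the first interval, one success and one failure in the second.
samples = [(t, 100 * (t + 1), True) for t in range(10)]
samples += [(600, 2000, True), (601, 5000, False)]
for row in interval_metrics(samples):
    print(row)
```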
When a test is complete, we also supplement the final report with the information from the JMeter Dashboard report to provide customers with a comprehensive analysis of the data.
This approach has already proved valuable for some of our clients, allowing them to react promptly and eliminate performance issues effectively.