How to troubleshoot High CPU utilization in Linux?


High CPU utilization can have many causes. Let's begin with some troubleshooting steps to find the reason behind it.


There are two scenarios for high CPU utilization:

1. The system is using a lot of CPU right now, or

2. You have to find the reason for high CPU utilization at some point in the past (x days and y hours ago).

Let's assume we have to check CPU utilization that is high right now.

Run the top command and sort the view by CPU utilization, from high to low.

You can find a complete list of top command options here.

[root@TechArticles:~]# top
top - 23:10:40 up 19:45,  0 users,  load average: 3.88, 1.96, 0.77
Tasks:  49 total,   5 running,  44 sleeping,   0 stopped,   0 zombie
%Cpu(s): 49.9 us,  0.2 sy,  0.0 ni, 49.3 id,  0.0 wa,  0.0 hi,  0.6 si,  0.0 st
MiB Mem :  25177.0 total,  23979.4 free,    228.1 used,    969.5 buff/cache
MiB Swap:   7168.0 total,   7168.0 free,      0.0 used.  24230.5 avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
 237837 jay       20   0    6020   3208   1364 R 100.0   0.0   3:09.29 bash
 237838 jay       20   0    6020   3208   1364 R 100.0   0.0   3:09.29 bash
 237839 jay       20   0    6020   3208   1364 R 100.0   0.0   3:09.29 bash
 237840 jay       20   0    6020   3208   1364 R 100.0   0.0   3:09.29 bash
      1 root      20   0  167996  12156   9604 S   0.3   0.0   2:00.19 systemd
     21 root      20   0  439208 266272 264812 S   0.3   1.0   2:29.85 systemd-journal
    398 dbus      20   0    4856   2836   2548 S   0.3   0.0   0:21.79 dbus-broker
     30 root      20   0  173356  24428  18204 S   0.0   0.1   0:05.82 php-fpm

Press "P" (Shift+p) to sort the view from high to low CPU utilization.

As you can see, user jay is running four bash processes, each consuming about 100% of one CPU core. By default, top reports per-process %CPU relative to a single core, so the overall CPU utilization is lower: %Cpu(s) shows 49.9 us, and the load average is 3.88.

Now that you have found one reason for the high CPU utilization, you can stop here and inform the customer that this user is running bash processes, which is why CPU utilization on the server is currently high.
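
If you want to attach this finding to a ticket, the busy rows can be pulled out of the top output. The following is only a minimal sketch: busy_procs is a hypothetical helper name, and the column positions assume the default top layout shown above (%CPU in column 9, COMMAND in column 12).

```shell
#!/bin/sh
# busy_procs is a hypothetical helper, not a standard command.
# It reads top output on stdin and prints PID, USER, %CPU and COMMAND
# for every process at or above the given %CPU threshold.
# Assumption: default top column layout, as in the output above.
busy_procs() {
    awk -v min="$1" '
        $1 == "PID" { in_table = 1; next }   # the process table starts after this header
        in_table && $9 + 0 >= min { print $1, $2, $9, $12 }
    '
}

# Demo on rows copied from the output above.
busy_procs 50 <<'EOF'
    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
 237837 jay       20   0    6020   3208   1364 R 100.0   0.0   3:09.29 bash
      1 root      20   0  167996  12156   9604 S   0.3   0.0   2:00.19 systemd
EOF
# prints: 237837 jay 100.0 bash
```

On a live system, `top -b -n 1 | busy_procs 50` feeds one batch-mode snapshot of top into the helper.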

To troubleshoot further, follow the steps below:

First, check the current load average. If the load average exceeds the number of CPU cores and %Cpu(s) is also close to 100%, the server is clearly under heavy load.

[root@TechArticles:~]# top
top - 23:30:09 up 20:04,  0 users,  load average: 8.39, 7.52, 4.91
Tasks:  66 total,   9 running,  57 sleeping,   0 stopped,   0 zombie
%Cpu(s): 99.5 us,  0.3 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.2 si,  0.0 st
MiB Mem :  25177.0 total,  23456.8 free,    632.6 used,   1087.6 buff/cache
MiB Swap:   7168.0 total,   7168.0 free,      0.0 used.  23797.8 avail Mem

    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
 237839 jay       20   0    6020   3208   1364 R  99.7   0.0  22:35.74 bash
 238864 jay       20   0    6020   1844      0 R  99.3   0.0   7:55.92 bash
 237837 jay       20   0    6020   3208   1364 R  99.0   0.0  22:35.11 bash
 237840 jay       20   0    6020   3208   1364 R  99.0   0.0  22:35.43 bash
 238862 jay       20   0    6020   1844      0 R  99.0   0.0   7:55.69 bash
 238863 jay       20   0    6020   1844      0 R  99.0   0.0   7:54.89 bash
 237838 jay       20   0    6020   3208   1364 R  98.3   0.0  22:35.12 bash
 238865 jay       20   0    6020   3208   1364 R  97.4   0.0   7:55.44 bash
 239293 mysql     20   0 2434840 413420  35944 S   1.0   1.6   0:04.67 mysqld
      1 root      20   0  168152  12240   9604 S   0.7   0.0   2:02.34 systemd
     21 root      20   0  447400 273640 272168 S   0.3   1.1   2:33.38 systemd-journal
    398 dbus      20   0    4980   2852   2548 S   0.3   0.0   0:22.06 dbus-broker
 239803 root      20   0    7872   3784   3184 R   0.3   0.0   0:00.03 top

According to the top command, the current load average is 8.39 and %Cpu(s) is 99.5 us. Let's find the number of CPU cores on the system.

[root@TechArticles:~]# nproc
8
[root@TechArticles:~]#

About nproc: nproc is a Linux/Unix utility that prints the number of processing units available to the current process. It counts logical CPUs, so the number can include physical cores and hyperthreads. The command simply prints the number to standard output and exits.

As per the above output, the system has 8 cores and the load average is 8.39, which is on the higher side.
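
The comparison between load average and core count can also be written as a quick calculation. This is a minimal sketch, not a definitive check: load_per_core is a hypothetical helper name, and the live part assumes a Linux host where /proc/loadavg and nproc are available. A value above 1.00 suggests that more work is queued than the cores can run at once.

```shell
#!/bin/sh
# load_per_core is a hypothetical helper: load average divided by CPU count.
load_per_core() {
    awk -v load="$1" -v cores="$2" 'BEGIN { printf "%.2f\n", load / cores }'
}

# Values taken from the top and nproc output above.
load_per_core 8.39 8
# prints: 1.05

# Live values (assumption: Linux with /proc/loadavg and nproc present).
if [ -r /proc/loadavg ] && command -v nproc >/dev/null 2>&1; then
    read -r one_min _ < /proc/loadavg
    echo "current 1-minute load per core: $(load_per_core "$one_min" "$(nproc)")"
fi
```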

Are you going to advise the customer to upgrade the CPU at this point? Wait, it's too soon to suggest adding CPU.

Let's troubleshoot further before jumping to conclusions.

We will use the sar command to check historical CPU utilization.

Read more here about the sar command and its uses.

[root@TechArticles:~]# sar -u -1
Linux 4.18.0-372.9.1.el8.x86_64 (TechArticles)  03/19/2023   _x86_64_        (8 CPU)
09:03:48        CPU     %user     %nice   %system   %iowait    %steal     %idle
09:10:00        all      0.16      0.00      0.18      0.02      0.00     99.64
09:20:03        all      0.09      0.00      0.12      0.06      0.00     99.73
09:30:02        all      0.11      0.00      0.14      0.15      0.00     99.60
09:40:03        all      0.07      0.00      0.10      0.02      0.00     99.80
09:50:00        all      0.08      0.10      0.14      0.02      0.00     99.67
10:00:03        all      0.08      0.00      0.10      0.01      0.00     99.80
10:10:04        all      0.08      0.00      0.11      0.01      0.00     99.80
10:20:02        all      0.09      0.85      2.37      0.01      0.00     96.68
10:30:04        all      0.09      0.03      0.17      0.01      0.00     99.71
10:40:01        all      0.08      0.00      0.11      0.01      0.00     99.80
10:50:04        all      0.09      0.00      0.12      0.01      0.00     99.79
11:00:01        all      0.10      0.00      0.12      0.01      0.00     99.77
11:10:03        all      0.10      0.00      0.13      0.01      0.00     99.77
11:20:00        all      0.10      0.00      0.12      0.01      0.00     99.77
11:30:03        all      0.12      0.00      0.15      0.01      0.00     99.72
[...]
20:30:03        CPU     %user     %nice   %system   %iowait    %steal     %idle
20:40:01        all      0.11      0.00      0.15      0.01      0.00     99.73
20:50:04        all      0.13      0.00      0.16      0.01      0.00     99.70
21:00:01        all      0.12      0.00      0.15      0.01      0.00     99.72
21:10:03        all      0.15      0.00      0.18      0.01      0.00     99.66
22:07:30        all      0.15      0.00      0.20      0.19      0.00     99.46
22:10:00        all      0.17      0.00      0.20      0.01      0.00     99.61
22:20:01        all      0.08      0.00      0.11      0.04      0.00     99.77
22:30:01        all      0.08      0.00      0.10      0.01      0.00     99.81
22:40:03        all      0.07      0.00      0.10      0.00      0.00     99.82
22:50:00        all      0.07      0.01      0.11      0.00      0.00     99.81
23:00:03        all      0.07      0.00      0.11      0.02      0.00     99.80
23:10:00        all     12.63      0.00      0.29      0.00      0.00     87.08
23:20:03        all     49.76      0.00      0.90      0.02      0.00     49.32
23:30:01        all     88.77      0.00      0.69      0.00      0.00     10.54
23:40:04        all     89.52      0.00      0.45      0.01      0.00     10.03
23:50:01        all     62.54      0.00      0.45      0.03      0.00     36.98
Average:        all      4.21      0.01      0.20      0.01      0.00     95.56

As per the above sar report, idle CPU stayed above 96% in every interval shown until 23:00 and dropped below 90% only from the 23:10 sample onward, which matches the time the user's bash processes were running.

Let's check more historical data to see whether CPU utilization goes high daily around this time, or at any other time.

[root@TechArticles:~]# sar -u -3
Linux 4.18.0-372.9.1.el8.x86_64 (TechArticles)  03/18/2023   _x86_64_        (8 CPU)
09:03:48        CPU     %user     %nice   %system   %iowait    %steal     %idle
09:10:00        all      0.16      0.00      0.18      0.02      0.00     99.64
09:20:03        all      0.09      0.00      0.12      0.06      0.00     99.73
09:30:02        all      0.11      0.00      0.14      0.15      0.00     99.60
09:40:03        all      0.07      0.00      0.10      0.02      0.00     99.80
09:50:00        all      0.08      0.10      0.14      0.02      0.00     99.67
10:00:03        all      0.08      0.00      0.10      0.01      0.00     99.80
10:10:04        all      0.08      0.00      0.11      0.01      0.00     99.80
10:20:02        all      0.09      0.85      2.37      0.01      0.00     96.68
10:30:04        all      0.09      0.03      0.17      0.01      0.00     99.71
[...]
13:40:00        CPU     %user     %nice   %system   %iowait    %steal     %idle
13:50:01        all      0.14      0.00      0.19      0.01      0.00     99.67
14:00:00        all      0.18      0.00      0.21      0.00      0.00     99.60
14:10:01        all      0.29      0.01      0.32      0.02      0.00     99.36
14:20:01        all      0.23      0.00      0.37      0.02      0.00     99.37
14:30:01        all      0.24      0.00      0.35      0.01      0.00     99.41
14:40:01        all      0.32      0.00      0.58      0.01      0.00     99.10
14:50:00        all      0.34      0.00      0.78      0.01      0.00     98.87
15:00:02        all      0.36      0.00      0.65      0.01      0.00     98.98
22:13:00        all      0.56      0.00      1.60      0.11      0.00     97.74
22:20:00        all      0.10      0.00      0.12      0.02      0.00     99.76
22:30:00        all      0.05      0.00      0.08      0.02      0.00     99.85
22:40:02        all      0.05      0.00      0.07      0.01      0.00     99.87
22:50:00        all      0.04      0.00      0.06      0.00      0.00     99.90
23:00:03        all      0.04      0.01      0.07      0.00      0.00     99.88
23:10:00        all      0.05      0.00      0.05      0.00      0.00     99.90
23:20:03        all      0.04      0.00      0.04      0.00      0.00     99.92
23:30:00        all      0.05      0.00      0.04      0.01      0.00     99.90
23:40:02        all      0.04      0.00      0.03      0.05      0.00     99.88
Average:        all      0.12      0.02      0.24      0.01      0.00     99.61
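
Reading every day's report by eye gets tedious, so the lowest %idle per day can be extracted instead. The following is a sketch under assumptions: lowest_idle is a hypothetical helper name, it treats the last column of each sar -u data row as %idle, and the loop assumes sysstat has kept data files for the last three days.

```shell
#!/bin/sh
# lowest_idle is a hypothetical helper. It reads `sar -u` output on stdin
# and prints the timestamp and value of the lowest %idle it finds.
lowest_idle() {
    awk '
        /^Average/ || /%idle/ || NF < 8 { next }     # skip summary, headers, banner, blanks
        min == "" || $NF + 0 < min { min = $NF + 0; at = $1 }
        END { if (min != "") printf "%s %.2f\n", at, min }
    '
}

# Check the last three days, if sar is installed and has data for them.
if command -v sar >/dev/null 2>&1; then
    for days_ago in 1 2 3; do
        result=$(sar -u -"$days_ago" 2>/dev/null | lowest_idle) || result=""
        echo "day -$days_ago: ${result:-no data}"
    done
fi
```

A day whose lowest %idle is still in the high nineties had no CPU pressure worth investigating.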

Let's refine the command to report only the intervals where idle CPU drops below a certain threshold (95% in this example).

[root@TechArticles:~]# sar -u -1 | egrep -v "Average" | awk 'NR==3||$8<95'

Linux 4.18.0-372.9.1.el8.x86_64 (TechArticles)  03/18/2023   _x86_64_        (8 CPU)

09:03:48        CPU     %user     %nice   %system   %iowait    %steal     %idle
23:10:00        all     12.63      0.00      0.29      0.00      0.00     87.08
23:20:03        all     49.76      0.00      0.90      0.02      0.00     49.32
23:30:01        all     88.77      0.00      0.69      0.00      0.00     10.54
23:40:04        all     89.52      0.00      0.45      0.01      0.00     10.03
23:50:01        all     62.54      0.00      0.45      0.03      0.00     36.98
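
A slightly stricter variant of this filter is sketched below. sar_idle_below is a hypothetical helper name; it assumes the data rows carry "all" in the CPU column and %idle in the last column, and it also checks the third field in case sar prints timestamps with AM/PM.

```shell
#!/bin/sh
# sar_idle_below is a hypothetical helper. It reads `sar -u` output on stdin
# and keeps the column header plus the rows whose %idle is below the limit.
sar_idle_below() {
    awk -v limit="$1" '
        /%idle/ && !header_done { print; header_done = 1; next }   # print the header once
        /^Average/ { next }                                        # skip the summary row
        ($2 == "all" || $3 == "all") && $NF + 0 < limit { print }   # $3 covers AM/PM timestamps
    '
}

# Usage on a host with sysstat data:
#   sar -u -1 | sar_idle_below 95
```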

I checked older data as well but did not find any intervals where idle CPU dropped below 87%. So, at this point, we can recommend that the customer check the script, as we did not find any instances of high CPU utilization other than today's.


What if we found high CPU utilization in historical data?

If we do find high utilization in the historical data, we can troubleshoot further to find the reason behind it.

For this tutorial I am going to use the recap tool. It must already be installed and configured to capture resource utilization before the problem occurs.

recap tool: recap is a system status reporting tool, a script that generates reports of various information about the server.

Installation on RHEL/CentOS

recap is available from the EPEL repository.

# yum install recap
# recap -V
2.1.0

If the tool is installed and enabled to capture historical data, you can easily find the reason for high resource utilization.

Let's look at the types of data that recap provides. By default, recap keeps its settings in the /etc/recap.conf file and its logs in the /var/log/recap directory. recap can be customised to meet your needs.

[root@TechArticles:/var/log/recap]# ls -ltr
total 100
drwxr-xr-x 2 root root 4096 Sep 21 20:22 snapshots
drwxr-xr-x 2 root root 4096 Sep 21 20:22 backups
-rw-r--r-- 1 root root 7262 Mar 20 00:54 ps_20230320-005439.log
-rw-r--r-- 1 root root 7094 Mar 20 00:54 resources_20230320-005439.log
-rw-r--r-- 1 root root 6034 Mar 20 00:54 netstat_20230320-005439.log
-rw-r--r-- 1 root root 8231 Mar 20 15:51 recap.log
[root@TechArticles:/var/log/recap]# 

As per the above listing, recap captures ps output, resource usage, and netstat output.

To identify the cause, look at the ps and resources logs for the dates and times when utilization was high. They contain several reports, including the "Top 10 cpu utilising processes".
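
To find the right log for a given time, the file names can be matched by date and hour. This is a minimal sketch: recap_logs_at is a hypothetical helper name, and it relies only on the kind_YYYYMMDD-HHMMSS.log naming visible in the listing above, which may differ on other recap versions or configurations.

```shell
#!/bin/sh
# recap_logs_at is a hypothetical helper. It lists recap logs of one kind
# (ps, resources, netstat) captured on a given date and hour.
# Assumption: files are named <kind>_YYYYMMDD-HHMMSS.log as in the listing above.
recap_logs_at() {
    kind=$1
    day=$2
    hour=$3
    dir=${4:-/var/log/recap}
    for f in "$dir/${kind}_${day}-${hour}"*.log; do
        if [ -e "$f" ]; then
            echo "$f"
        fi
    done
}

# Example: ps logs captured on 2023-03-20 between 00:00 and 00:59.
recap_logs_at ps 20230320 00
```

The matching files can then be opened with less or searched with grep for the process names seen in top.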

Based on the current and historical logs, you will be able to offer suggestions and solutions to resolve the problem.

Please note: There are many tools on the market, both free and paid, for capturing historical logs. The recap tool is licensed under the GNU General Public License, version 2.0, and is completely free.


High CPU utilisation can have a variety of causes. Let's look at a few more scenarios.

(a) Heavy backups taken by the backup team typically cause this on weekends or outside business hours.

(b) Use top to determine which processes are using the most CPU time, then take a snapshot of those processes. Send the snapshot to the user and ask them to end any unnecessary process.

(c) If those processes are backups, alert the backup team and ask them to reduce CPU usage by stopping some backups or lowering the backup priority.

(d) On occasion, CPU utilisation spikes during peak hours (business hours) and returns to normal within seconds or a few minutes, but the monitoring team has already raised a ticket. In that case, capture a snapshot of the peak, attach it to the ticket, and then close the ticket.

(e) If spare processors or lightly loaded CPUs are available, heavy application processes that run continuously (i.e., business applications) can be moved to those CPUs.

(f) If additional CPUs are not available, ask the data centre staff or CPU vendor to purchase new CPUs, with business approval, and move some processes to the new CPUs.
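
For items (c) and (e), one possible approach is to lower the priority of a heavy process with renice, or to restrict it to specific cores with taskset. This is a sketch, not the only way to do it: the function names, the default niceness of 10, and the core list are assumptions for illustration, and taskset (part of util-linux) may not be installed on every system.

```shell
#!/bin/sh
# Hypothetical helpers; names and default values are assumptions.

# Lower the priority of a running process (higher niceness = lower priority).
# $1 = PID, $2 = niceness (defaults to 10).
lower_priority() {
    renice -n "${2:-10}" -p "$1"
}

# Restrict a running process to a list of CPU cores, e.g. "0-3" or "6,7".
# $1 = PID, $2 = core list.
pin_to_cores() {
    taskset -cp "$2" "$1"
}

# Example calls (PID 12345 is a placeholder, not taken from the output above):
#   lower_priority 12345
#   pin_to_cores 12345 6,7
```

Agree such changes with the process owner first, since a lower priority or fewer cores will make that job run longer.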

In the real world, a wide variety of problems may arise. I hope this article helps you troubleshoot issues with high CPU utilisation.

Was this article useful to you? If you did not find it helpful, or if you see any outdated information, a problem, or a typo, post your thoughts or recommendations in the comments section to help improve this article.

Jay

I love keeping up with the latest tech trends and emerging technologies like Linux, Azure, AWS, GCP, and other cutting-edge systems. With experience working with various technology tools and platforms, I enjoy sharing my knowledge through writing. I have a talent for simplifying complex technical concepts to make my articles accessible to all readers. Always looking for fresh ideas, I enjoy the challenge of presenting technical information in engaging ways. My ultimate aim is to help readers stay informed and empowered on their tech journeys.
