High CPU utilization can have many causes. Let’s begin with some troubleshooting steps to find the reason.
There are two scenarios for high CPU utilization:
1. The system is showing high CPU utilization right now, or
2. You have to find the reason for high CPU utilization during a past period (x days and y hours ago).
Let’s assume we have to check the current high CPU utilization.
Run the top command and sort the view by CPU utilization, from high to low.
You can find a complete list of top command options here.
[root@TechArticles:~]# top
top - 23:10:40 up 19:45,  0 users,  load average: 3.88, 1.96, 0.77
Tasks:  49 total,   5 running,  44 sleeping,   0 stopped,   0 zombie
%Cpu(s): 49.9 us,  0.2 sy,  0.0 ni, 49.3 id,  0.0 wa,  0.0 hi,  0.6 si,  0.0 st
MiB Mem :  25177.0 total,  23979.4 free,    228.1 used,    969.5 buff/cache
MiB Swap:   7168.0 total,   7168.0 free,      0.0 used.  24230.5 avail Mem
    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
 237837 jay       20   0    6020   3208   1364 R 100.0   0.0   3:09.29 bash
 237838 jay       20   0    6020   3208   1364 R 100.0   0.0   3:09.29 bash
 237839 jay       20   0    6020   3208   1364 R 100.0   0.0   3:09.29 bash
 237840 jay       20   0    6020   3208   1364 R 100.0   0.0   3:09.29 bash
      1 root      20   0  167996  12156   9604 S   0.3   0.0   2:00.19 systemd
     21 root      20   0  439208 266272 264812 S   0.3   1.0   2:29.85 systemd-journal
    398 dbus      20   0    4856   2836   2548 S   0.3   0.0   0:21.79 dbus-broker
     30 root      20   0  173356  24428  18204 S   0.0   0.1   0:05.82 php-fpm
Press “P” (Shift+p) to sort the view from high to low CPU utilization.
As you can see, user jay is running four bash processes, each of which consumes 100% of one CPU. The per-process %CPU column is relative to a single CPU, so the overall utilization shown on the %Cpu(s) line is only 49.9% us, and the load average is 3.88.
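To record the overall figure without reading it off the screen, the summary line can be parsed from top's batch output. The following is a minimal sketch; the function name cpu_busy_from_top is made up for this example, and it assumes the "%Cpu(s): ... id" summary format shown in this article.

```shell
# Read top output on stdin and print overall CPU busy % (100 - idle).
# Assumes the "%Cpu(s): ... id" summary line format shown in this article.
cpu_busy_from_top() {
    awk '/^%Cpu/ && match($0, /[0-9.]+ +id/) {
        # substr() yields e.g. "49.3 id"; adding 0 keeps the leading number
        idle = substr($0, RSTART, RLENGTH) + 0
        printf "%.1f\n", 100 - idle
    }'
}

# Live usage (batch mode, a single iteration):
#   top -b -n 1 | cpu_busy_from_top
```

For the first top output above, this prints 50.7, which is the 49.9 us, 0.2 sy and 0.6 si values added together.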
Now that you have found one reason for high CPU utilization, you can stop here and inform the customer that this user is executing a bash command, which is why current CPU utilization is high on the server.
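If you want to attach a per-user summary to the ticket, one possible approach is to total the %CPU column by user. This is only a sketch: cpu_by_user is a hypothetical helper name, and it expects "user pcpu" pairs on stdin. Note that the pcpu value reported by ps is averaged over the lifetime of the process, so it can differ from the instantaneous value that top shows.

```shell
# Sum %CPU per user from "user pcpu" lines on stdin, highest total first.
cpu_by_user() {
    awk '{ cpu[$1] += $2 }
         END { for (u in cpu) printf "%s %.1f\n", u, cpu[u] }' |
        LC_ALL=C sort -k2,2nr
}

# Live usage (empty headers suppress the ps header line):
#   ps -e -o user= -o pcpu= | cpu_by_user | head -n 5
```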
To continue troubleshooting, follow the steps below:
To find out more, check the current load average. If the load average exceeds the number of available CPUs and %Cpu(s) is also close to 100%, the server is unquestionably under load.
[root@TechArticles:~]# top
top - 23:30:09 up 20:04,  0 users,  load average: 8.39, 7.52, 4.91
Tasks:  66 total,   9 running,  57 sleeping,   0 stopped,   0 zombie
%Cpu(s): 99.5 us,  0.3 sy,  0.0 ni,  0.0 id,  0.0 wa,  0.0 hi,  0.2 si,  0.0 st
MiB Mem :  25177.0 total,  23456.8 free,    632.6 used,   1087.6 buff/cache
MiB Swap:   7168.0 total,   7168.0 free,      0.0 used.  23797.8 avail Mem
    PID USER      PR  NI    VIRT    RES    SHR S  %CPU  %MEM     TIME+ COMMAND
 237839 jay       20   0    6020   3208   1364 R  99.7   0.0  22:35.74 bash
 238864 jay       20   0    6020   1844      0 R  99.3   0.0   7:55.92 bash
 237837 jay       20   0    6020   3208   1364 R  99.0   0.0  22:35.11 bash
 237840 jay       20   0    6020   3208   1364 R  99.0   0.0  22:35.43 bash
 238862 jay       20   0    6020   1844      0 R  99.0   0.0   7:55.69 bash
 238863 jay       20   0    6020   1844      0 R  99.0   0.0   7:54.89 bash
 237838 jay       20   0    6020   3208   1364 R  98.3   0.0  22:35.12 bash
 238865 jay       20   0    6020   3208   1364 R  97.4   0.0   7:55.44 bash
 239293 mysql     20   0 2434840 413420  35944 S   1.0   1.6   0:04.67 mysqld
      1 root      20   0  168152  12240   9604 S   0.7   0.0   2:02.34 systemd
     21 root      20   0  447400 273640 272168 S   0.3   1.1   2:33.38 systemd-journal
    398 dbus      20   0    4980   2852   2548 S   0.3   0.0   0:22.06 dbus-broker
 239803 root      20   0    7872   3784   3184 R   0.3   0.0   0:00.03 top
According to the top command, the current load average is 8.39, and %Cpu(s) is 99.5% us. Let’s find the number of CPUs available on the system.
[root@TechArticles:~]# nproc
8
[root@TechArticles:~]#
About nproc:
The nproc command is a Linux/Unix utility that is used to display the number of processing units available on the system. This can include physical CPUs, cores, and/or hyperthreads. The command simply prints the number of processing units to standard output and exits.
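Because nproc can count hyperthreads as well as cores, it may be worth checking how many physical cores sit behind that number. One way, sketched below, is to multiply two fields from lscpu output. The helper name physical_cores is hypothetical, and the sketch assumes lscpu prints "Socket(s):" and "Core(s) per socket:" lines, which is common on x86 systems but not guaranteed on every architecture.

```shell
# Read lscpu output on stdin and print sockets x cores-per-socket.
# Prints 0 if either field is missing from the input.
physical_cores() {
    awk -F: '
        /^Socket\(s\)/          { sockets = $2 + 0 }
        /^Core\(s\) per socket/ { cores = $2 + 0 }
        END { print sockets * cores }'
}

# Live usage:
#   lscpu | physical_cores
```

If this prints a smaller number than nproc, the difference usually comes from multiple threads per core.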
According to the above output, the system has 8 CPUs and the load average is 8.39, which is on the high side.
Are you going to advise the customer to upgrade the CPU at this point? Wait, it’s too soon to recommend adding CPU capacity.
Let’s troubleshoot further before drawing any conclusions.
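The comparison between load average and CPU count can be expressed as a simple ratio. The sketch below uses a made-up helper name, load_ratio, and treats a ratio above 1 as "above capacity"; that cut-off is a rule of thumb rather than a hard limit, since the Linux load average also counts tasks in uninterruptible sleep.

```shell
# Print the load-per-CPU ratio and a rough verdict.
# $1 = load average, $2 = number of processing units (for example from nproc).
load_ratio() {
    awk -v load="$1" -v cpus="$2" 'BEGIN {
        if (cpus <= 0) exit 1               # avoid dividing by zero
        ratio = load / cpus
        verdict = (ratio > 1) ? "above capacity" : "within capacity"
        printf "%.2f %s\n", ratio, verdict
    }'
}

# Live usage on Linux:
#   load_ratio "$(cut -d ' ' -f1 /proc/loadavg)" "$(nproc)"
```

With the values from this article, load_ratio 8.39 8 reports a ratio of 1.05, slightly above capacity.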
We will use the sar command to check historical CPU utilization.
Read more about the sar command and its uses here.
[root@TechArticles:~]# sar -u -1
Linux 4.18.0-372.9.1.el8.x86_64 (TechArticles)  03/19/2023  _x86_64_  (8 CPU)
09:03:48     CPU     %user     %nice   %system   %iowait    %steal     %idle
09:10:00     all      0.16      0.00      0.18      0.02      0.00     99.64
09:20:03     all      0.09      0.00      0.12      0.06      0.00     99.73
09:30:02     all      0.11      0.00      0.14      0.15      0.00     99.60
09:40:03     all      0.07      0.00      0.10      0.02      0.00     99.80
09:50:00     all      0.08      0.10      0.14      0.02      0.00     99.67
10:00:03     all      0.08      0.00      0.10      0.01      0.00     99.80
10:10:04     all      0.08      0.00      0.11      0.01      0.00     99.80
10:20:02     all      0.09      0.85      2.37      0.01      0.00     96.68
10:30:04     all      0.09      0.03      0.17      0.01      0.00     99.71
10:40:01     all      0.08      0.00      0.11      0.01      0.00     99.80
10:50:04     all      0.09      0.00      0.12      0.01      0.00     99.79
11:00:01     all      0.10      0.00      0.12      0.01      0.00     99.77
11:10:03     all      0.10      0.00      0.13      0.01      0.00     99.77
11:20:00     all      0.10      0.00      0.12      0.01      0.00     99.77
11:30:03     all      0.12      0.00      0.15      0.01      0.00     99.72
[...]
20:30:03     CPU     %user     %nice   %system   %iowait    %steal     %idle
20:40:01     all      0.11      0.00      0.15      0.01      0.00     99.73
20:50:04     all      0.13      0.00      0.16      0.01      0.00     99.70
21:00:01     all      0.12      0.00      0.15      0.01      0.00     99.72
21:10:03     all      0.15      0.00      0.18      0.01      0.00     99.66
22:07:30     all      0.15      0.00      0.20      0.19      0.00     99.46
22:10:00     all      0.17      0.00      0.20      0.01      0.00     99.61
22:20:01     all      0.08      0.00      0.11      0.04      0.00     99.77
22:30:01     all      0.08      0.00      0.10      0.01      0.00     99.81
22:40:03     all      0.07      0.00      0.10      0.00      0.00     99.82
22:50:00     all      0.07      0.01      0.11      0.00      0.00     99.81
23:00:03     all      0.07      0.00      0.11      0.02      0.00     99.80
23:10:00     all     12.63      0.00      0.29      0.00      0.00     87.08
23:20:03     all     49.76      0.00      0.90      0.02      0.00     49.32
23:30:01     all     88.77      0.00      0.69      0.00      0.00     10.54
23:40:04     all     89.52      0.00      0.45      0.01      0.00     10.03
23:50:01     all     62.54      0.00      0.45      0.03      0.00     36.98
Average:     all      4.21      0.01      0.20      0.01      0.00     95.56
According to the above sar report, idle CPU dropped below 88% only after 23:00 (87.08% in the 23:10 sample and lower after that), because a user was running a bash script during that time.
Let’s check more historical data to see whether CPU utilization goes high every day at this time, or at other times as well.
[root@TechArticles:~]# sar -u -3
Linux 4.18.0-372.9.1.el8.x86_64 (TechArticles)  03/18/2023  _x86_64_  (8 CPU)
09:03:48     CPU     %user     %nice   %system   %iowait    %steal     %idle
09:10:00     all      0.16      0.00      0.18      0.02      0.00     99.64
09:20:03     all      0.09      0.00      0.12      0.06      0.00     99.73
09:30:02     all      0.11      0.00      0.14      0.15      0.00     99.60
09:40:03     all      0.07      0.00      0.10      0.02      0.00     99.80
09:50:00     all      0.08      0.10      0.14      0.02      0.00     99.67
10:00:03     all      0.08      0.00      0.10      0.01      0.00     99.80
10:10:04     all      0.08      0.00      0.11      0.01      0.00     99.80
10:20:02     all      0.09      0.85      2.37      0.01      0.00     96.68
10:30:04     all      0.09      0.03      0.17      0.01      0.00     99.71
[...]
13:40:00     CPU     %user     %nice   %system   %iowait    %steal     %idle
13:50:01     all      0.14      0.00      0.19      0.01      0.00     99.67
14:00:00     all      0.18      0.00      0.21      0.00      0.00     99.60
14:10:01     all      0.29      0.01      0.32      0.02      0.00     99.36
14:20:01     all      0.23      0.00      0.37      0.02      0.00     99.37
14:30:01     all      0.24      0.00      0.35      0.01      0.00     99.41
14:40:01     all      0.32      0.00      0.58      0.01      0.00     99.10
14:50:00     all      0.34      0.00      0.78      0.01      0.00     98.87
15:00:02     all      0.36      0.00      0.65      0.01      0.00     98.98
22:13:00     all      0.56      0.00      1.60      0.11      0.00     97.74
22:20:00     all      0.10      0.00      0.12      0.02      0.00     99.76
22:30:00     all      0.05      0.00      0.08      0.02      0.00     99.85
22:40:02     all      0.05      0.00      0.07      0.01      0.00     99.87
22:50:00     all      0.04      0.00      0.06      0.00      0.00     99.90
23:00:03     all      0.04      0.01      0.07      0.00      0.00     99.88
23:10:00     all      0.05      0.00      0.05      0.00      0.00     99.90
23:20:03     all      0.04      0.00      0.04      0.00      0.00     99.92
23:30:00     all      0.05      0.00      0.04      0.01      0.00     99.90
23:40:02     all      0.04      0.00      0.03      0.05      0.00     99.88
Average:     all      0.12      0.02      0.24      0.01      0.00     99.61
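Checking one day at a time gets tedious, so a small helper can reduce each daily report to a single number. The sketch below is one possible approach: avg_busy is a hypothetical name, and the loop in the comment assumes the daily data files live under /var/log/sa, which is the usual location on RHEL-family systems but may differ elsewhere.

```shell
# Read "sar -u" output on stdin and print the day's average busy %
# (100 minus the %idle value on the "Average:" line).
avg_busy() {
    awk '$1 == "Average:" && $2 == "all" { printf "%.2f\n", 100 - $NF }'
}

# Live usage, one line per stored day (path is an assumption):
#   for f in /var/log/sa/sa[0-9][0-9]; do
#       printf '%s ' "$f"
#       sar -u -f "$f" | avg_busy
#   done
```

A day whose average stands out from the rest is a good candidate for the closer, per-interval inspection shown in this article.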
Let’s refine the command to show only the samples where idle CPU drops below a certain threshold (95% here).
[root@TechArticles:~]# sar -u -1 | grep -v "Average" | awk 'NR==3||$8<95'
Linux 4.18.0-372.9.1.el8.x86_64 (TechArticles)  03/19/2023  _x86_64_  (8 CPU)
09:03:48     CPU     %user     %nice   %system   %iowait    %steal     %idle
23:10:00     all     12.63      0.00      0.29      0.00      0.00     87.08
23:20:03     all     49.76      0.00      0.90      0.02      0.00     49.32
23:30:01     all     88.77      0.00      0.69      0.00      0.00     10.54
23:40:04     all     89.52      0.00      0.45      0.01      0.00     10.03
23:50:01     all     62.54      0.00      0.45      0.03      0.00     36.98
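The one-liner above relies on %idle being the eighth field. A slightly more defensive variant, sketched here, reads the last field instead and only looks at the "all" data rows, so header and Average lines are ignored. The function name low_idle is made up for this example, and it assumes 24-hour timestamps like the ones in the output above.

```shell
# Read "sar -u" output on stdin and print "time idle" for every sample
# whose %idle (last column) is below the threshold given as $1.
low_idle() {
    awk -v limit="$1" '
        $1 != "Average:" && $2 == "all" && $NF + 0 < limit + 0 {
            print $1, $NF
        }'
}

# Live usage:
#   sar -u -1 | low_idle 95
```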
I checked older data as well but did not find any other samples where idle CPU dropped to 87% or lower. So, at this point, we can recommend that the customer check the script, as we did not find any instances of high CPU utilization other than today’s.
What if we found high CPU utilization in historical data?
If we find high utilization in the historical data, we can troubleshoot further to find the reason behind it.
For this tutorial, I am going to use the recap tool. Make sure recap is already installed and configured to capture resource utilization.
recap tool:
recap is a system status reporting tool: a reporting script that generates reports of various information about the server.
Installation in RHEL/CentOS
recap is available from the EPEL repository.
# yum install recap
# recap -V
2.1.0
If the above tool is installed and enabled to capture historical data, you can easily find the reason for high resource utilization.
Let’s look at the types of data that recap makes available. By default, recap keeps its settings in the /etc/recap.conf file and its logs in the /var/log/recap directory. recap can be customised to meet your needs.
[root@TechArticles:/var/log/recap]# ls -ltr
total 100
drwxr-xr-x 2 root root 4096 Sep 21 20:22 snapshots
drwxr-xr-x 2 root root 4096 Sep 21 20:22 backups
-rw-r--r-- 1 root root 7262 Mar 20 00:54 ps_20230320-005439.log
-rw-r--r-- 1 root root 7094 Mar 20 00:54 resources_20230320-005439.log
-rw-r--r-- 1 root root 6034 Mar 20 00:54 netstat_20230320-005439.log
-rw-r--r-- 1 root root 8231 Mar 20 15:51 recap.log
[root@TechArticles:/var/log/recap]#
As shown above, recap captures ps logs, resources logs, and netstat logs.
To identify the cause, look at the ps and resources logs for the dates and times in question. They contain several reports, including the “Top 10 cpu utilising processes”.
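When the log directory holds many captures, it helps to narrow the list down to the date and hour you are investigating. The sketch below is based only on the file naming visible in the listing above (ps_YYYYMMDD-HHMMSS.log and resources_YYYYMMDD-HHMMSS.log); the function name recap_logs_for is hypothetical, and other recap versions or configurations may name files differently.

```shell
# Read file names on stdin and print the ps and resources logs that match
# a date ($1, as YYYYMMDD) and, optionally, an hour ($2, as HH).
recap_logs_for() {
    grep -E "^(ps|resources)_$1-$2[0-9]*\.log$"
}

# Live usage (directory taken from this article):
#   ls /var/log/recap | recap_logs_for 20230320 00
```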
You will be able to offer suggestions and solutions to resolve the problem based on all the current logs and history logs.
Please note: there are many tools on the market, both free and paid, for capturing historical logs. The recap tool is licensed under the GNU General Public License, version 2.0, and is free to use.
There are a variety of causes for high CPU utilisation. Let’s examine a few more scenarios.
(a) Backup teams often run heavy backups on weekends or outside business hours, so you will typically see high CPU utilisation at those times.
(b) Use # top to determine which processes are using the most CPU time, then take a snapshot of those processes. Send the snapshot to the user and ask them to end any unnecessary processes.
(c) If those processes are backups, alert the backup team and ask them to reduce CPU usage by stopping some backups or lowering the backup priority.
(d) On occasion, CPU utilisation will peak during business hours and then return to normal within seconds or a few minutes, but the monitoring team may already have raised a ticket. In that case, take a snapshot of the peak, attach it to the ticket, and then close the ticket.
(e) If heavy application processes (for example, business applications) run continuously and there are spare or lightly loaded CPUs available, those processes can be moved to the other CPUs.
(f) If no additional CPUs are available, ask the data centre staff or the hardware vendor for additional CPUs, with business approval, and move some processes to the new CPUs.
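For items (b) and (c), the list of heavy processes can be built from ps output and then handed to renice to lower their priority. The following is a minimal sketch under stated assumptions: pids_over is a hypothetical helper, the 90% limit is only an example, and you should review the PID list before changing anything on a production server.

```shell
# Read "pid pcpu command" lines on stdin and print the PIDs of processes
# whose %CPU is at or above the limit given as $1.
pids_over() {
    awk -v limit="$1" '$2 + 0 >= limit + 0 { print $1 }'
}

# Live usage: review the list first, then lower the priority.
# Changing the priority of other users' processes typically needs root.
#   ps -e -o pid= -o pcpu= -o comm= | pids_over 90
#   ps -e -o pid= -o pcpu= -o comm= | pids_over 90 | xargs -r renice -n 10 -p
```

For item (e), the util-linux taskset command can restrict a running process to specific CPUs, for example taskset -cp 6,7 followed by the PID; which CPUs to choose depends on your system.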
While working in the real world, a wide variety of problems may arise. I’m hoping this article will help you troubleshoot issues with high CPU utilisation.