Soft lockup detection in linux through dmesg logs parsing and sending telemetry by adityagarg0911 · Pull Request #3573 · Azure/WALinuxAgent

adityagarg0911 · 2026-03-05T03:35:52Z

Description

Add kernel soft lockup monitoring to the Azure Linux Agent. This new feature periodically parses dmesg output to detect CPU soft lockup events (BUG: soft lockup - CPU#N stuck for Xs!), aggregates them by CPU, and reports summarized telemetry to Azure. This helps detect and diagnose VM health issues caused by CPUs being stuck in kernel code.

Changes:

- New module kernel_event_monitor.py, with a MonitorKernelSoftLockup periodic operation that:
  - Parses dmesg for soft lockup events using regex
  - Aggregates events by CPU ID (count, max stuck time, last kernel timestamp)
  - Reports via telemetry (WALAEventOperation.KernelSoftLockup)
  - Persists watermark to disk to avoid duplicate reporting across agent restarts
  - Detects reboots via boot ID to reset watermark
- monitor.py: Conditionally adds MonitorKernelSoftLockup to the monitor thread based on config
- conf.py: New config options Monitor.KernelSoftLockup (enable/disable) and Monitor.KernelSoftLockupPeriod (check interval, minimum 300s)
- waagent.conf: Added default config entries (Monitor.KernelSoftLockup=y, Monitor.KernelSoftLockupPeriod=21600)
- event.py: New WALAEventOperation.KernelSoftLockup operation type
- test_kernel_event_monitor.py: 16 unit tests covering regex, parsing, aggregation, reporting, state persistence, dmesg error handling, and end-to-end operation
- test_monitor.py: Updated to verify MonitorKernelSoftLockup is included/excluded based on config flag
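
For readers unfamiliar with the approach, the parsing and per-CPU aggregation steps listed above can be sketched roughly as follows. This is a minimal illustration, not the code in kernel_event_monitor.py: the regex, the function name `aggregate_soft_lockups`, the dictionary keys, and the sample dmesg lines are all assumptions. Only the message shape (BUG: soft lockup - CPU#N stuck for Xs!) is taken from the description.

```python
import re

# Hypothetical pattern: an optional "[ seconds.micros]" dmesg timestamp,
# followed anywhere on the line by the soft lockup message.
SOFT_LOCKUP_RE = re.compile(
    r"^(?:\[\s*(?P<ts>\d+\.\d+)\]\s*)?.*"
    r"BUG: soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s!"
)


def aggregate_soft_lockups(dmesg_output):
    """Return {cpu_id: {"count", "max_stuck_seconds", "last_timestamp"}}."""
    per_cpu = {}
    for line in dmesg_output.splitlines():
        match = SOFT_LOCKUP_RE.search(line)
        if match is None:
            continue  # not a soft lockup line
        cpu = int(match.group("cpu"))
        stuck = int(match.group("secs"))
        # The timestamp group is None when dmesg timestamps are disabled.
        ts = float(match.group("ts")) if match.group("ts") else None
        agg = per_cpu.setdefault(
            cpu, {"count": 0, "max_stuck_seconds": 0, "last_timestamp": None}
        )
        agg["count"] += 1
        agg["max_stuck_seconds"] = max(agg["max_stuck_seconds"], stuck)
        if ts is not None:
            agg["last_timestamp"] = ts
    return per_cpu


# Illustrative input only; not captured from a real machine.
SAMPLE = """\
[  100.000001] watchdog: BUG: soft lockup - CPU#2 stuck for 22s! [kworker/2:1:123]
[  128.500000] watchdog: BUG: soft lockup - CPU#2 stuck for 48s! [kworker/2:1:123]
[  130.250000] watchdog: BUG: soft lockup - CPU#0 stuck for 23s! [swapper/0:0]
[  131.000000] eth0: link up
"""

summary = aggregate_soft_lockups(SAMPLE)
```

With the sample above, CPU 2 ends up with two events and a maximum stuck time of 48 seconds, CPU 0 with one event, and the unrelated `eth0` line is ignored.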

Issue #

PR information

Ensure development PR is based on the develop branch.
If applicable, the PR references the bug/issue that it fixes in the description.
New Unit tests were added for the changes made

Quality of Code and Contribution Guidelines

I have read the contribution guidelines.

Distro maintenance information, if applicable

This is a contribution from a distro maintainer
The changes in this PR have been taken as a downstream patch (Note: it is not recommended to patch the agent without upstream review and approval)


Copilot

Pull request overview

Adds kernel soft lockup monitoring to WALinuxAgent by periodically parsing dmesg, aggregating detected “BUG: soft lockup” events per CPU, and emitting summarized telemetry to Azure, with configuration gates and unit tests.

Changes:

Introduces MonitorKernelSoftLockup periodic operation that parses dmesg, aggregates soft lockup events, sends telemetry, and persists a watermark across restarts (with boot-id reset handling).
Wires the new monitor into the GA monitor thread behind a config flag and adds new config options + defaults.
Adds unit tests for parsing/aggregation/reporting/state persistence and updates monitor-thread tests to cover enable/disable behavior.
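
The watermark and boot-id reset mentioned above can be sketched as follows. This is a hedged illustration under stated assumptions, not the PR's implementation: the state file layout (`boot_id` and `watermark` keys) and the helper names are hypothetical. The boot ID path is the standard Linux location. The reset matters because dmesg timestamps count seconds since boot, so a watermark saved before a reboot would otherwise hide new events.

```python
import json

# Standard Linux location of the per-boot random UUID.
BOOT_ID_PATH = "/proc/sys/kernel/random/boot_id"


def read_boot_id(path=BOOT_ID_PATH):
    """Return the current boot ID as a stripped string."""
    with open(path) as f:
        return f.read().strip()


def load_watermark(state_path, current_boot_id):
    """Return the last reported kernel timestamp for this boot.

    Falls back to 0.0 when the state file is missing, unreadable,
    malformed, or was written during a previous boot.
    """
    try:
        with open(state_path) as f:
            state = json.load(f)
    except (IOError, OSError, ValueError):
        return 0.0
    if not isinstance(state, dict):
        return 0.0
    if state.get("boot_id") != current_boot_id:
        return 0.0  # reboot detected: kernel timestamps restarted
    try:
        return float(state.get("watermark", 0.0))
    except (TypeError, ValueError):
        return 0.0


def events_after_watermark(events, watermark):
    """Keep only events whose kernel timestamp is newer than the watermark."""
    return [e for e in events if e["timestamp"] > watermark]
```

A caller would report the filtered events and then save the highest timestamp seen, together with the current boot ID, as the new watermark.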

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 4 comments.


| File | Description |
| --- | --- |
| `azurelinuxagent/ga/kernel_event_monitor.py` | New periodic operation for parsing `dmesg`, aggregating soft lockups, telemetry emission, and watermark persistence. |
| `azurelinuxagent/ga/monitor.py` | Conditionally adds `MonitorKernelSoftLockup` to the monitor thread’s periodic operations. |
| `azurelinuxagent/common/conf.py` | Adds config accessors for enabling soft lockup monitoring and setting its period. |
| `azurelinuxagent/common/event.py` | Adds `WALAEventOperation.KernelSoftLockup` operation name. |
| `config/waagent.conf` | Adds default config entries to enable soft lockup monitoring and set the default period. |
| `tests/ga/test_kernel_event_monitor.py` | New unit tests covering regex, parsing, aggregation, reporting, dmesg error handling, and persistence. |
| `tests/ga/test_monitor.py` | Extends monitor-thread tests to validate inclusion/exclusion of the new periodic operation based on config. |


Copilot · 2026-03-26T19:41:59Z

azurelinuxagent/ga/kernel_event_monitor.py

+            agg["last_timestamp"] = kernel_timestamp
+
+        if not found_timestamp:
+            logger.warn(


When dmesg timestamps are missing (e.g. CONFIG_PRINTK_TIME=n or printk.time=0), this code will emit a logger.warn(...) on every execution of the operation. This can create recurring log noise on affected systems; the codebase commonly uses logger.periodic_warn(...) for this kind of persistent condition. Consider switching to periodic_warn (e.g., hourly) or only logging once per boot.

Suggested change

-            logger.warn(
+            logger.periodic_warn(
+                logger.EVERY_HOUR,
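
To illustrate the idea behind rate-limited warnings for a persistent condition, here is a small standalone sketch. It is not the agent's `logger.periodic_warn` implementation; the class `PeriodicWarner`, its parameters, and the injected clock are hypothetical and exist only to show the technique of suppressing repeats of the same message within a period.

```python
import time


class PeriodicWarner(object):
    """Emit a given message at most once per period (hypothetical helper)."""

    def __init__(self, period_seconds, emit, clock=time.time):
        self._period = period_seconds
        self._emit = emit      # callable that actually writes the warning
        self._clock = clock    # injectable for testing
        self._last_emitted = {}

    def warn(self, message):
        """Return True if the message was emitted, False if suppressed."""
        now = self._clock()
        last = self._last_emitted.get(message)
        if last is not None and now - last < self._period:
            return False
        self._last_emitted[message] = now
        self._emit(message)
        return True
```

With a one-hour period, a monitor that runs every five minutes on a system without dmesg timestamps would log the warning once per hour instead of on every run.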

Copilot · 2026-03-26T19:41:59Z

tests/ga/test_monitor.py

+            with patch("azurelinuxagent.ga.monitor.MonitorHandler.stopped", side_effect=[False, True, False, True, False, True, False, True]):
                with patch("time.sleep"):


This test hardcodes a side_effect list for MonitorHandler.stopped that must match the exact number of loop iterations (2 calls per run * 4 config combinations). This is brittle: adding another config combination or an extra stopped() check in the implementation will cause unrelated test failures. Consider generating the side_effect programmatically from the number of runs (e.g., [False, True] * num_runs) or using a callable side_effect that toggles per run.
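
The programmatic alternative suggested above could look roughly like the sketch below. The helper name `stopped_side_effect` and the `CONFIG_COMBINATIONS` list are hypothetical stand-ins for whatever the test iterates over; the sketch only assumes, as the comment states, that each run calls `stopped()` twice.

```python
try:
    from unittest.mock import MagicMock  # Python 3
except ImportError:
    from mock import MagicMock  # Python 2, assuming the "mock" backport is installed


def stopped_side_effect(num_runs):
    """One [False, True] pair per run: loop once, then report stopped."""
    return [False, True] * num_runs


# Hypothetical stand-in for the four config combinations in the test.
CONFIG_COMBINATIONS = [(True, True), (True, False), (False, True), (False, False)]

stopped = MagicMock(side_effect=stopped_side_effect(len(CONFIG_COMBINATIONS)))

# Simulate what the monitor loop would observe across all runs.
observed = [stopped() for _ in range(2 * len(CONFIG_COMBINATIONS))]
```

In the real test the list would be passed to `patch(..., side_effect=stopped_side_effect(num_runs))`, so adding a config combination only changes `num_runs`. It does not remove the dependency on two `stopped()` calls per run; a callable `side_effect` would be needed for that.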

Copilot · 2026-03-26T19:42:00Z

azurelinuxagent/ga/kernel_event_monitor.py

+        try:
+            return run_command(['dmesg'], track_process=False, timeout=self._DMESG_TIMEOUT)
+        except Exception as e:
+            logger.warn("KernelSoftLockup: Failed to read dmesg output: {0}".format(ustr(e)))


The _DMESG_TIMEOUT passed to run_command() will not actually be enforced on Python 2: shellutil.run_command explicitly ignores the timeout parameter on Py2. If the agent still supports running under Python 2, consider implementing an explicit timeout/kill for the dmesg subprocess (or document/remove the timeout to avoid a false sense of safety).
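
One way to enforce a timeout without relying on the `timeout` argument of `subprocess` (which does not exist on Python 2) is a watchdog timer that kills the child process. The sketch below is an illustration under assumptions, not part of `shellutil`: `run_with_timeout` and its return shape are hypothetical, and it uses only `subprocess.Popen` and `threading.Timer`, which are available on both Python 2 and 3.

```python
import subprocess
import threading


def run_with_timeout(command, timeout_seconds):
    """Run command and kill it if it runs longer than timeout_seconds.

    Returns (returncode, stdout_bytes, timed_out).
    """
    process = subprocess.Popen(
        command, stdout=subprocess.PIPE, stderr=subprocess.PIPE
    )
    timed_out = []

    def kill_if_running():
        # Only kill (and flag a timeout) if the process has not exited yet.
        if process.poll() is None:
            timed_out.append(True)
            try:
                process.kill()
            except OSError:
                pass  # process exited between poll() and kill()

    timer = threading.Timer(timeout_seconds, kill_if_running)
    timer.start()
    try:
        stdout, _ = process.communicate()
    finally:
        timer.cancel()
    return process.returncode, stdout, bool(timed_out)
```

A caller could use something like `run_with_timeout(['dmesg'], timeout_seconds)` and treat `timed_out` as a failure to read the kernel log.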

Copilot · 2026-03-26T19:42:00Z

azurelinuxagent/ga/kernel_event_monitor.py

+            with open(self._state_file_path, 'w') as f:
+                json.dump(state, f)
+        except Exception as e:
+            logger.warn("KernelSoftLockup: Failed to save state: {0}".format(ustr(e)))


_save_state() writes JSON directly to the final state file. If the agent is interrupted mid-write, the file can be left truncated/corrupt; on next start _load_state() will reset the watermark to 0, which can cause duplicate reporting (the opposite of what the watermark is trying to prevent). Consider writing to a temporary file and os.rename() to the final path (similar to EventLogger.save_event) to make the update atomic.

Suggested change

-            with open(self._state_file_path, 'w') as f:
-                json.dump(state, f)
-        except Exception as e:
-            logger.warn("KernelSoftLockup: Failed to save state: {0}".format(ustr(e)))
+            tmp_path = self._state_file_path + ".tmp"
+            with open(tmp_path, 'w') as f:
+                json.dump(state, f)
+                f.flush()
+                os.fsync(f.fileno())
+            os.rename(tmp_path, self._state_file_path)
+        except Exception as e:
+            logger.warn("KernelSoftLockup: Failed to save state: {0}".format(ustr(e)))
+            try:
+                if 'tmp_path' in locals() and os.path.exists(tmp_path):
+                    os.remove(tmp_path)
+            except Exception:
+                # Best-effort cleanup of temp file; ignore cleanup errors
+                pass

Aditya Garg added 9 commits March 3, 2026 11:45

Add soft lockup detection through dmesg logs parsing and telemetry re…

6e54b5e

…porting. Also add unit test cases for this feature.

Merge branch 'develop' of https://github.com/Azure/WALinuxAgent into …

c37aede

…gargaditya/kernel_soft_lockup_detection

Fix errors in unit testcases

1f6d661

Fix unit test failing in python 3.5

45e8f98

Fix unit test failing in python 2.6

a66b170

Fix unit test failing in python 2.6/2.7

8a270c8

Remove unused import

6a7d16a

Cosmetic changes

408656b

Move class initialization inside if condition

e679fb5

adityagarg0911 marked this pull request as ready for review March 5, 2026 15:52

adityagarg0911 requested review from ZhidongPeng, maddieford, nagworld9 and narrieta as code owners March 5, 2026 15:52

nagworld9 reviewed Mar 17, 2026


Replace logger.periodic_warn with logger.warn

d6c8531

narrieta assigned Copilot Mar 26, 2026

Merge branch 'develop' into gargaditya/kernel_soft_lockup_detection

b60a60e

Copilot AI review requested due to automatic review settings March 26, 2026 19:36

Copilot started reviewing on behalf of narrieta March 26, 2026 19:37

Copilot AI reviewed Mar 26, 2026


adityagarg0911 wants to merge 11 commits into Azure:develop from adityagarg0911:gargaditya/kernel_soft_lockup_detection

5 participants
