Soft lockup detection in linux through dmesg logs parsing and sending telemetry #3573

Open
adityagarg0911 wants to merge 11 commits into Azure:develop from adityagarg0911:gargaditya/kernel_soft_lockup_detection

Conversation

adityagarg0911 commented Mar 5, 2026

Description

Add kernel soft lockup monitoring to the Azure Linux Agent. This feature periodically parses dmesg output to detect CPU soft lockup events (BUG: soft lockup - CPU#N stuck for Xs!), aggregates them by CPU, and reports summarized telemetry to Azure. This helps detect and diagnose VM health issues caused by CPUs being stuck in kernel code.

Changes:

  • New module kernel_event_monitor.py — MonitorKernelSoftLockup periodic operation that:
    • Parses dmesg for soft lockup events using regex
    • Aggregates events by CPU ID (count, max stuck time, last kernel timestamp)
    • Reports via telemetry (WALAEventOperation.KernelSoftLockup)
    • Persists watermark to disk to avoid duplicate reporting across agent restarts
    • Detects reboots via boot ID to reset watermark
  • monitor.py — Conditionally adds MonitorKernelSoftLockup to the monitor thread based on config
  • conf.py — New config options: Monitor.KernelSoftLockup (enable/disable) and Monitor.KernelSoftLockupPeriod (check interval, minimum 300s)
  • waagent.conf — Added default config entries (Monitor.KernelSoftLockup=y, Monitor.KernelSoftLockupPeriod=21600)
  • event.py — New WALAEventOperation.KernelSoftLockup operation type
  • test_kernel_event_monitor.py — 16 unit tests covering regex, parsing, aggregation, reporting, state persistence, dmesg error handling, and end-to-end operation
  • test_monitor.py — Updated to verify MonitorKernelSoftLockup is included/excluded based on config flag
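
For readers who want a concrete picture of the parse-and-aggregate step listed above, here is a minimal sketch. It is not the code in kernel_event_monitor.py: the regex, the function name `aggregate_soft_lockups`, the `watermark` parameter, and the `count` / `max_stuck_seconds` keys are assumptions made for illustration, and the optional prefix assumes dmesg's usual `[seconds.micros]` timestamp format.

```python
import re

# Assumed pattern: the regex actually used by the PR is not shown here.
# The timestamp prefix is optional because printk timestamps can be disabled.
SOFT_LOCKUP_RE = re.compile(
    r"^(?:\[\s*(?P<ts>\d+\.\d+)\]\s*)?.*"
    r"BUG: soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s!"
)


def aggregate_soft_lockups(dmesg_output, watermark=0.0):
    """Group soft lockup lines by CPU ID.

    Lines whose kernel timestamp is at or below `watermark` are treated as
    already reported and skipped. Lines without a timestamp are always counted.
    """
    per_cpu = {}
    for line in dmesg_output.splitlines():
        match = SOFT_LOCKUP_RE.search(line)
        if match is None:
            continue
        ts = float(match.group("ts")) if match.group("ts") else None
        if ts is not None and ts <= watermark:
            continue
        cpu = int(match.group("cpu"))
        stuck = int(match.group("secs"))
        agg = per_cpu.setdefault(
            cpu, {"count": 0, "max_stuck_seconds": 0, "last_timestamp": None}
        )
        agg["count"] += 1
        agg["max_stuck_seconds"] = max(agg["max_stuck_seconds"], stuck)
        if ts is not None:
            agg["last_timestamp"] = ts
    return per_cpu
```

Keeping one summary record per CPU, rather than one telemetry event per log line, is what bounds the telemetry volume when a stuck CPU logs the same message repeatedly.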

Issue #

PR information

  • Ensure development PR is based on the develop branch.
  • If applicable, the PR references the bug/issue that it fixes in the description.
  • New Unit tests were added for the changes made

Quality of Code and Contribution Guidelines


Distro maintenance information, if applicable

  • This is a contribution from a distro maintainer
  • The changes in this PR have been taken as a downstream patch (Note: it is not recommended to patch the agent without upstream review and approval)

@adityagarg0911 adityagarg0911 marked this pull request as ready for review March 5, 2026 15:52
Copilot AI review requested due to automatic review settings March 26, 2026 19:36

Copilot AI left a comment

Pull request overview

Adds kernel soft lockup monitoring to WALinuxAgent by periodically parsing dmesg, aggregating detected “BUG: soft lockup” events per CPU, and emitting summarized telemetry to Azure, with configuration gates and unit tests.

Changes:

  • Introduces MonitorKernelSoftLockup periodic operation that parses dmesg, aggregates soft lockup events, sends telemetry, and persists a watermark across restarts (with boot-id reset handling).
  • Wires the new monitor into the GA monitor thread behind a config flag and adds new config options + defaults.
  • Adds unit tests for parsing/aggregation/reporting/state persistence and updates monitor-thread tests to cover enable/disable behavior.
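
The watermark and boot-id reset described above can be pictured roughly as follows. This is a sketch under stated assumptions, not the PR's implementation: the function names, the JSON layout (`boot_id`, `watermark`), and the choice to fall back to 0.0 on any unreadable state are all hypothetical. The only external fact relied on is that Linux exposes a per-boot identifier at /proc/sys/kernel/random/boot_id.

```python
import json


def read_boot_id(path="/proc/sys/kernel/random/boot_id"):
    """Return the kernel's per-boot UUID (Linux only)."""
    with open(path) as f:
        return f.read().strip()


def load_watermark(state_path, current_boot_id):
    """Return the saved watermark, or 0.0 when it cannot be trusted.

    The watermark is a kernel timestamp, and kernel timestamps restart near
    zero on every boot, so state saved under a different boot ID is discarded.
    """
    try:
        with open(state_path) as f:
            state = json.load(f)
    except (IOError, OSError, ValueError):
        return 0.0  # missing, unreadable, or corrupt state file
    if not isinstance(state, dict) or state.get("boot_id") != current_boot_id:
        return 0.0
    value = state.get("watermark", 0.0)
    return float(value) if isinstance(value, (int, float)) else 0.0


def save_watermark(state_path, boot_id, watermark):
    """Persist the watermark together with the boot it belongs to."""
    with open(state_path, "w") as f:
        json.dump({"boot_id": boot_id, "watermark": watermark}, f)
```

Passing the boot ID in as an argument, instead of reading it inside `load_watermark`, keeps the logic testable on machines without /proc.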

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| azurelinuxagent/ga/kernel_event_monitor.py | New periodic operation for parsing dmesg, aggregating soft lockups, telemetry emission, and watermark persistence. |
| azurelinuxagent/ga/monitor.py | Conditionally adds MonitorKernelSoftLockup to the monitor thread’s periodic operations. |
| azurelinuxagent/common/conf.py | Adds config accessors for enabling soft lockup monitoring and setting its period. |
| azurelinuxagent/common/event.py | Adds WALAEventOperation.KernelSoftLockup operation name. |
| config/waagent.conf | Adds default config entries to enable soft lockup monitoring and set the default period. |
| tests/ga/test_kernel_event_monitor.py | New unit tests covering regex, parsing, aggregation, reporting, dmesg error handling, and persistence. |
| tests/ga/test_monitor.py | Extends monitor-thread tests to validate inclusion/exclusion of the new periodic operation based on config. |

agg["last_timestamp"] = kernel_timestamp

if not found_timestamp:
    logger.warn(

Copilot AI Mar 26, 2026

When dmesg timestamps are missing (e.g. CONFIG_PRINTK_TIME=n or printk.time=0), this code will emit a logger.warn(...) on every execution of the operation. This can create recurring log noise on affected systems; the codebase commonly uses logger.periodic_warn(...) for this kind of persistent condition. Consider switching to periodic_warn (e.g., hourly) or only logging once per boot.

Suggested change
- logger.warn(
+ logger.periodic_warn(
+     logger.EVERY_HOUR,
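
To illustrate the throttling idea behind this suggestion, here is a small standalone sketch. It is not the agent's `logger.periodic_warn` implementation: the `PeriodicWarner` class, its `emit` and `clock` parameters, and keying on the message text are assumptions chosen to keep the example self-contained.

```python
import time
from datetime import timedelta


class PeriodicWarner(object):
    """Emit a given warning at most once per period (hypothetical helper)."""

    def __init__(self, emit, clock=time.time):
        self._emit = emit            # callable that actually writes the log line
        self._clock = clock          # injectable so tests can control time
        self._last_emitted = {}      # message -> time it was last emitted

    def warn(self, period, message):
        """Emit `message` unless it was emitted less than `period` ago.

        Returns True when the message was emitted, False when suppressed.
        """
        now = self._clock()
        last = self._last_emitted.get(message)
        if last is not None and now - last < period.total_seconds():
            return False
        self._last_emitted[message] = now
        self._emit(message)
        return True
```

With a one-hour period, a condition that holds for the life of the VM (such as timestamps being disabled) produces one line per hour instead of one line per monitoring cycle.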

Comment on lines +65 to 66
with patch("azurelinuxagent.ga.monitor.MonitorHandler.stopped", side_effect=[False, True, False, True, False, True, False, True]):
    with patch("time.sleep"):

Copilot AI Mar 26, 2026

This test hardcodes a side_effect list for MonitorHandler.stopped that must match the exact number of loop iterations (2 calls per run * 4 config combinations). This is brittle: adding another config combination or an extra stopped() check in the implementation will cause unrelated test failures. Consider generating the side_effect programmatically from the number of runs (e.g., [False, True] * num_runs) or using a callable side_effect that toggles per run.
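
A sketch of the programmatic approach, assuming Python 3's `unittest.mock`. The loop function and the two boolean flags are hypothetical stand-ins; only the `[False, True] * num_runs` pattern comes from the comment above.

```python
from itertools import product
from unittest.mock import MagicMock

# One run of the loop is assumed to call stopped() twice:
# False to enter the loop body once, then True to exit.
STOPPED_CALLS_PER_RUN = [False, True]


def stopped_side_effect(num_runs):
    """Build the side_effect list from the number of runs instead of by hand."""
    return STOPPED_CALLS_PER_RUN * num_runs


def run_monitor_loop(handler):
    """Hypothetical stand-in for the monitor thread's loop."""
    iterations = 0
    while not handler.stopped():
        iterations += 1
    return iterations


# Hypothetical: two boolean config flags, giving four combinations.
config_combinations = list(product([True, False], repeat=2))

handler = MagicMock()
handler.stopped.side_effect = stopped_side_effect(len(config_combinations))
for _ in config_combinations:
    run_monitor_loop(handler)
```

Adding a fifth configuration then only changes `config_combinations`; the list length follows automatically. An extra `stopped()` check inside the loop would still require updating `STOPPED_CALLS_PER_RUN`, but in one place.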

Comment on lines +159 to +162
try:
    return run_command(['dmesg'], track_process=False, timeout=self._DMESG_TIMEOUT)
except Exception as e:
    logger.warn("KernelSoftLockup: Failed to read dmesg output: {0}".format(ustr(e)))

Copilot AI Mar 26, 2026

The _DMESG_TIMEOUT passed to run_command() will not actually be enforced on Python 2: shellutil.run_command explicitly ignores the timeout parameter on Py2. If the agent still supports running under Python 2, consider implementing an explicit timeout/kill for the dmesg subprocess (or document/remove the timeout to avoid a false sense of safety).
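
One possible shape for an explicit timeout that does not depend on `subprocess`'s `timeout` argument is a `threading.Timer` that kills the child. This is an illustrative sketch only: `run_with_timeout` is a hypothetical helper, not part of shellutil, and it uses only `subprocess.Popen`, `communicate()`, and `kill()`.

```python
import subprocess
import threading


def run_with_timeout(command, timeout_seconds):
    """Run `command` and return its stdout as text.

    The child is killed if it is still running after `timeout_seconds`.
    Raises RuntimeError on timeout or on a non-zero exit code.
    """
    process = subprocess.Popen(
        command, stdout=subprocess.PIPE, stderr=subprocess.PIPE
    )
    timed_out = threading.Event()

    def _kill():
        timed_out.set()
        try:
            process.kill()
        except OSError:
            pass  # the process exited just before the timer fired

    timer = threading.Timer(timeout_seconds, _kill)
    timer.start()
    try:
        stdout, _ = process.communicate()
    finally:
        timer.cancel()

    if timed_out.is_set():
        raise RuntimeError(
            "command timed out after {0}s: {1}".format(timeout_seconds, command)
        )
    if process.returncode != 0:
        raise RuntimeError(
            "command failed with exit code {0}: {1}".format(process.returncode, command)
        )
    return stdout.decode("utf-8", "replace")
```

A call such as `run_with_timeout(['dmesg'], 30)` would then bound how long the monitor thread can block, at the cost of one short-lived timer thread per invocation.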

Comment on lines +147 to +150
    with open(self._state_file_path, 'w') as f:
        json.dump(state, f)
except Exception as e:
    logger.warn("KernelSoftLockup: Failed to save state: {0}".format(ustr(e)))

Copilot AI Mar 26, 2026

_save_state() writes JSON directly to the final state file. If the agent is interrupted mid-write, the file can be left truncated/corrupt; on next start _load_state() will reset the watermark to 0, which can cause duplicate reporting (the opposite of what the watermark is trying to prevent). Consider writing to a temporary file and os.rename() to the final path (similar to EventLogger.save_event) to make the update atomic.

Suggested change
-     with open(self._state_file_path, 'w') as f:
-         json.dump(state, f)
- except Exception as e:
-     logger.warn("KernelSoftLockup: Failed to save state: {0}".format(ustr(e)))
+     tmp_path = self._state_file_path + ".tmp"
+     with open(tmp_path, 'w') as f:
+         json.dump(state, f)
+         f.flush()
+         os.fsync(f.fileno())
+     os.rename(tmp_path, self._state_file_path)
+ except Exception as e:
+     logger.warn("KernelSoftLockup: Failed to save state: {0}".format(ustr(e)))
+     try:
+         if 'tmp_path' in locals() and os.path.exists(tmp_path):
+             os.remove(tmp_path)
+     except Exception:
+         # Best-effort cleanup of temp file; ignore cleanup errors
+         pass