Soft lockup detection in linux through dmesg logs parsing and sending telemetry #3573
adityagarg0911 wants to merge 11 commits into Azure:develop
…porting. Also add unit test cases for this feature.
…gargaditya/kernel_soft_lockup_detection
Pull request overview
Adds kernel soft lockup monitoring to WALinuxAgent by periodically parsing dmesg, aggregating detected “BUG: soft lockup” events per CPU, and emitting summarized telemetry to Azure, with configuration gates and unit tests.
Changes:
- Introduces the MonitorKernelSoftLockup periodic operation that parses dmesg, aggregates soft lockup events, sends telemetry, and persists a watermark across restarts (with boot-id reset handling).
- Wires the new monitor into the GA monitor thread behind a config flag and adds new config options + defaults.
- Adds unit tests for parsing/aggregation/reporting/state persistence and updates monitor-thread tests to cover enable/disable behavior.
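
The watermark behaviour listed above can be illustrated with a minimal sketch. It is not the PR's code: the function name `filter_new_events` and the state layout (`boot_id`, `last_timestamp`) are assumptions made for illustration. The idea is that dmesg timestamps restart from zero on every boot, so a watermark saved under a different boot id has to be discarded.

```python
def filter_new_events(events, state, current_boot_id):
    """Return (new_events, new_state).

    events: list of (kernel_timestamp_seconds, cpu) tuples parsed from dmesg.
    state: dict with 'boot_id' and 'last_timestamp' keys (hypothetical layout).
    """
    # A watermark recorded during a previous boot cannot be compared with
    # timestamps from the current boot, so start again from zero.
    if state.get("boot_id") != current_boot_id:
        watermark = 0.0
    else:
        watermark = float(state.get("last_timestamp", 0.0))

    # Only events strictly newer than the watermark are reported.
    new_events = [event for event in events if event[0] > watermark]
    if new_events:
        watermark = max(event[0] for event in new_events)

    return new_events, {"boot_id": current_boot_id, "last_timestamp": watermark}
```

On Linux the current boot id is typically read from /proc/sys/kernel/random/boot_id; the sketch takes it as a parameter so the logic can be tested without touching the filesystem.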
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| azurelinuxagent/ga/kernel_event_monitor.py | New periodic operation for parsing dmesg, aggregating soft lockups, telemetry emission, and watermark persistence. |
| azurelinuxagent/ga/monitor.py | Conditionally adds MonitorKernelSoftLockup to the monitor thread’s periodic operations. |
| azurelinuxagent/common/conf.py | Adds config accessors for enabling soft lockup monitoring and setting its period. |
| azurelinuxagent/common/event.py | Adds WALAEventOperation.KernelSoftLockup operation name. |
| config/waagent.conf | Adds default config entries to enable soft lockup monitoring and set the default period. |
| tests/ga/test_kernel_event_monitor.py | New unit tests covering regex, parsing, aggregation, reporting, dmesg error handling, and persistence. |
| tests/ga/test_monitor.py | Extends monitor-thread tests to validate inclusion/exclusion of the new periodic operation based on config. |
    agg["last_timestamp"] = kernel_timestamp

    if not found_timestamp:
        logger.warn(
When dmesg timestamps are missing (e.g. CONFIG_PRINTK_TIME=n or printk.time=0), this code will emit a logger.warn(...) on every execution of the operation. This can create recurring log noise on affected systems; the codebase commonly uses logger.periodic_warn(...) for this kind of persistent condition. Consider switching to periodic_warn (e.g., hourly) or only logging once per boot.
Suggested change:

    - logger.warn(
    + logger.periodic_warn(
    +     logger.EVERY_HOUR,
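
The idea behind a periodic warning can be shown with a small self-contained sketch. This is not WALinuxAgent's logger implementation; the class name `PeriodicWarner` and its `emit`/`clock` parameters are hypothetical and exist only to make the rate limiting easy to see and test.

```python
import time


class PeriodicWarner(object):
    """Emit a given message at most once per period (illustrative only)."""

    def __init__(self, emit, clock=time.time):
        self._emit = emit            # callable that actually writes the message
        self._clock = clock          # injectable clock, useful in tests
        self._last_emitted = {}      # message -> time it was last emitted

    def warn(self, period_seconds, message):
        now = self._clock()
        last = self._last_emitted.get(message)
        # Suppress the message if it was already emitted within the period.
        if last is not None and now - last < period_seconds:
            return False
        self._last_emitted[message] = now
        self._emit(message)
        return True
```

With a one-hour period, a system that never has dmesg timestamps would log the warning once per hour instead of once per monitoring cycle.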
    with patch("azurelinuxagent.ga.monitor.MonitorHandler.stopped", side_effect=[False, True, False, True, False, True, False, True]):
        with patch("time.sleep"):
This test hardcodes a side_effect list for MonitorHandler.stopped that must match the exact number of loop iterations (2 calls per run * 4 config combinations). This is brittle: adding another config combination or an extra stopped() check in the implementation will cause unrelated test failures. Consider generating the side_effect programmatically from the number of runs (e.g., [False, True] * num_runs) or using a callable side_effect that toggles per run.
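
A minimal sketch of the programmatic approach, assuming Python 3's `unittest.mock`. `FakeHandler` and `run_all` are hypothetical stand-ins, not the agent's `MonitorHandler` or its tests; they only show how the `side_effect` list can be derived from the number of runs.

```python
from unittest.mock import patch


class FakeHandler(object):
    """Hypothetical stand-in for a handler whose loop polls stopped()."""

    def stopped(self):
        return False

    def run(self):
        iterations = 0
        while not self.stopped():
            iterations += 1
        return iterations


def run_all(num_runs):
    # One [False, True] pair per run: the loop body executes once, then exits.
    side_effect = [False, True] * num_runs
    with patch.object(FakeHandler, "stopped", side_effect=side_effect) as stopped:
        results = [FakeHandler().run() for _ in range(num_runs)]
    return results, stopped.call_count
```

Adding a fifth configuration combination then only requires changing `num_runs`, rather than hand-editing a literal list of booleans.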
    try:
        return run_command(['dmesg'], track_process=False, timeout=self._DMESG_TIMEOUT)
    except Exception as e:
        logger.warn("KernelSoftLockup: Failed to read dmesg output: {0}".format(ustr(e)))
The _DMESG_TIMEOUT passed to run_command() will not actually be enforced on Python 2: shellutil.run_command explicitly ignores the timeout parameter on Py2. If the agent still supports running under Python 2, consider implementing an explicit timeout/kill for the dmesg subprocess (or document/remove the timeout to avoid a false sense of safety).
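
One way to enforce a timeout without relying on the `timeout` argument of `subprocess` (which Python 2 lacks) is a watchdog timer that kills the child. The sketch below is an assumption about how that could look, not the agent's `shellutil` code; `run_with_timeout` is a hypothetical helper name.

```python
import subprocess
import sys
import threading


def run_with_timeout(command, timeout_seconds):
    """Run command and return (stdout_bytes, timed_out)."""
    process = subprocess.Popen(command, stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    timed_out = threading.Event()

    def kill_process():
        try:
            process.kill()
        except OSError:
            return  # the process had already exited on its own
        timed_out.set()

    timer = threading.Timer(timeout_seconds, kill_process)
    timer.start()
    try:
        stdout, _ = process.communicate()
    finally:
        timer.cancel()
        timer.join()  # make sure kill_process is not still running
    return stdout, timed_out.is_set()


# Example: a child that would sleep for 30 seconds is killed after half a second.
_, example_timed_out = run_with_timeout(
    [sys.executable, "-c", "import time; time.sleep(30)"], 0.5
)
```

The helper reports the timeout to the caller instead of raising, so the monitoring operation can log it and try again on its next period.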
        with open(self._state_file_path, 'w') as f:
            json.dump(state, f)
    except Exception as e:
        logger.warn("KernelSoftLockup: Failed to save state: {0}".format(ustr(e)))
_save_state() writes JSON directly to the final state file. If the agent is interrupted mid-write, the file can be left truncated/corrupt; on next start _load_state() will reset the watermark to 0, which can cause duplicate reporting (the opposite of what the watermark is trying to prevent). Consider writing to a temporary file and os.rename() to the final path (similar to EventLogger.save_event) to make the update atomic.
Suggested change:

    -     with open(self._state_file_path, 'w') as f:
    -         json.dump(state, f)
    - except Exception as e:
    -     logger.warn("KernelSoftLockup: Failed to save state: {0}".format(ustr(e)))
    +     tmp_path = self._state_file_path + ".tmp"
    +     with open(tmp_path, 'w') as f:
    +         json.dump(state, f)
    +         f.flush()
    +         os.fsync(f.fileno())
    +     os.rename(tmp_path, self._state_file_path)
    + except Exception as e:
    +     logger.warn("KernelSoftLockup: Failed to save state: {0}".format(ustr(e)))
    +     try:
    +         if 'tmp_path' in locals() and os.path.exists(tmp_path):
    +             os.remove(tmp_path)
    +     except Exception:
    +         # Best-effort cleanup of temp file; ignore cleanup errors
    +         pass
Description
Add kernel soft lockup monitoring to the Azure Linux Agent. This new feature periodically parses dmesg output to detect CPU soft lockup events ("BUG: soft lockup - CPU#N stuck for Xs!"), aggregates them by CPU, and reports summarized telemetry to Azure. This helps detect and diagnose VM health issues caused by CPUs being stuck in kernel code.
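
The parsing and per-CPU aggregation described here can be sketched roughly as follows. The regular expression, the function name `aggregate_soft_lockups`, and the aggregate fields `count` and `max_stuck_seconds` are assumptions for illustration, not the code in this PR. The sketch also assumes the usual dmesg layout, where an optional `[seconds.microseconds]` timestamp precedes the message.

```python
import re

# Assumed line shape:
#   "[  123.456789] watchdog: BUG: soft lockup - CPU#3 stuck for 22s! [proc:42]"
# The leading timestamp is optional; it is absent when printk timestamps are off.
SOFT_LOCKUP_RE = re.compile(
    r"^(?:\[\s*(?P<ts>\d+\.\d+)\]\s*)?.*"
    r"BUG: soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s!"
)


def aggregate_soft_lockups(dmesg_output):
    """Return {cpu: {"count", "max_stuck_seconds", "last_timestamp"}}."""
    per_cpu = {}
    for line in dmesg_output.splitlines():
        match = SOFT_LOCKUP_RE.match(line)
        if match is None:
            continue  # not a soft lockup line
        cpu = int(match.group("cpu"))
        agg = per_cpu.setdefault(
            cpu, {"count": 0, "max_stuck_seconds": 0, "last_timestamp": None}
        )
        agg["count"] += 1
        agg["max_stuck_seconds"] = max(agg["max_stuck_seconds"], int(match.group("secs")))
        if match.group("ts") is not None:
            agg["last_timestamp"] = float(match.group("ts"))
    return per_cpu
```

Reporting one summary per CPU rather than one event per log line keeps telemetry volume bounded when a stuck CPU repeats the same message every few seconds.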
Changes:
Issue #