This repository keeps a reusable skill for UI parsing and desktop operations: `skills/ui-element-ops`.
- Parse UI screenshots into structured elements (`type`, `bbox`, `text`, `clickable`)
- Find / wait for elements
- Click, type, key, hotkey
- Take screenshots
- Calibrate coordinates for DPI / multi-display / window offsets
- On headless systems, screenshot parsing still works if you already have image files.
- On headless systems, `list`/`find`/`wait`/`calibrate` can still run on existing `*.elements.json` files.
- On headless systems, interactive actions do not work: `click`, `click-xy`, `type`, `key`, `hotkey`, `screenshot`, `screen-info`.
- Interactive actions require an active GUI desktop session and OS permissions (Accessibility / Screen Recording where applicable).
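On a headless machine, finding an element in a cached parse result amounts to filtering the parsed JSON. A minimal sketch, assuming each `*.elements.json` file holds a list of objects with the `type`/`bbox`/`text`/`clickable` fields listed above (the exact schema may differ; check a real output file, and note the sample data below is invented for illustration):

```python
import json
from pathlib import Path

def find_elements(elements_path, text=None, clickable=None):
    """Filter parsed UI elements by substring match on text and/or clickability."""
    elements = json.loads(Path(elements_path).read_text())
    hits = []
    for el in elements:
        if text is not None and text.lower() not in el.get("text", "").lower():
            continue
        if clickable is not None and el.get("clickable") != clickable:
            continue
        hits.append(el)
    return hits

# Inline stand-in data (schema assumed, not taken from a real parse):
sample = [
    {"type": "button", "bbox": [10, 10, 90, 40], "text": "OK", "clickable": True},
    {"type": "text", "bbox": [10, 60, 200, 80], "text": "Status: OK", "clickable": False},
]
Path("sample.elements.json").write_text(json.dumps(sample))
buttons = find_elements("sample.elements.json", text="ok", clickable=True)
```

In an interactive session, the center of a hit's `bbox` would presumably be the coordinate to pass to `click-xy`, after calibration.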
- Bootstrap environment: `skills/ui-element-ops/scripts/bootstrap_omniparser_env.sh "$PWD"`
- Parse an image: `skills/ui-element-ops/scripts/run_parse_ui.sh /abs/path/to/screen.png`
- Operate UI: `python3 skills/ui-element-ops/scripts/operate_ui.py --help`
- Capture + parse with randomized names: `skills/ui-element-ops/scripts/capture_and_parse.sh`
- `run_parse_ui.sh` and `capture_and_parse.sh` are compute-heavy (OCR + detection + captioning).
- On CPU-only machines, one run can take tens of seconds with high CPU/RAM usage.
- If you already have a screenshot file, parse it directly instead of capturing again: `skills/ui-element-ops/scripts/run_parse_ui.sh /abs/path/to/screen.png`
- Avoid tight loops; increase polling intervals for repeated tasks.
- Prefer lower-frequency parsing and reuse the latest `*.elements.json` when possible.
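Since `capture_and_parse.sh` writes outputs under randomized names, reusing the latest result means picking the newest `*.elements.json` by modification time and only re-parsing when it is stale. A sketch of that policy (the staleness threshold and directory layout are assumptions, not part of the skill):

```python
import os
import time
from pathlib import Path

def newest_elements_json(directory):
    """Return the most recently modified *.elements.json, or None if absent."""
    candidates = sorted(
        Path(directory).glob("*.elements.json"),
        key=lambda p: p.stat().st_mtime,
    )
    return candidates[-1] if candidates else None

def fresh_or_none(directory, max_age_s=30.0):
    """Reuse the newest parse result unless it is older than max_age_s."""
    latest = newest_elements_json(directory)
    if latest is None or time.time() - latest.stat().st_mtime > max_age_s:
        return None  # caller should run capture_and_parse.sh, then retry
    return latest

# Demo with stand-in files instead of real parse outputs:
demo = Path("demo_elements")
demo.mkdir(exist_ok=True)
(demo / "aaa.elements.json").write_text("[]")
(demo / "zzz.elements.json").write_text("[]")
os.utime(demo / "aaa.elements.json", (1_000, 1_000))  # force an old mtime
latest = newest_elements_json(demo)
```

A polling loop built on `fresh_or_none` can then sleep for a few seconds between checks instead of re-running the heavy parser back-to-back.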
- `skills/ui-element-ops/SKILL.md`
- `skills/ui-element-ops/agents/openai.yaml`
- `skills/ui-element-ops/scripts/parse_ui.py`
- `skills/ui-element-ops/scripts/operate_ui.py`