murongg/ui-element-ops

ui-element-ops

This repository contains a reusable skill for UI parsing and desktop operations:

  • skills/ui-element-ops

What It Can Do

  • Parse UI screenshots into structured elements (type, bbox, text, clickable)
  • Find / wait for elements
  • Click, type, key, hotkey
  • Take screenshots
  • Calibrate coordinates for DPI / multi-display / window offsets
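
As a rough illustration of what a parsed element and a coordinate calibration could look like, here is a minimal sketch. The field names follow the list above, but the `UIElement` class, the `(x1, y1, x2, y2)` pixel layout of `bbox`, and the `to_screen_point` helper are assumptions made for this example; they are not taken from `parse_ui.py` or `operate_ui.py`.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class UIElement:
    # Hypothetical container: field names mirror the README (type, bbox, text,
    # clickable); the real JSON schema written by the skill may differ.
    type: str
    bbox: Tuple[float, float, float, float]  # assumed (x1, y1, x2, y2) in screenshot pixels
    text: str
    clickable: bool


def bbox_center(bbox):
    """Return the center of an (x1, y1, x2, y2) box."""
    x1, y1, x2, y2 = bbox
    return ((x1 + x2) / 2.0, (y1 + y2) / 2.0)


def to_screen_point(element, scale=1.0, offset=(0, 0)):
    """Map an element's center from screenshot pixels to screen coordinates.

    scale:  screenshot pixels per screen point (for example 2.0 on a 2x HiDPI display).
    offset: top-left corner of the captured window or display, in screen points.
    """
    cx, cy = bbox_center(element.bbox)
    return (round(offset[0] + cx / scale), round(offset[1] + cy / scale))


button = UIElement(type="button", bbox=(200, 100, 400, 160), text="Save", clickable=True)
print(to_screen_point(button, scale=2.0, offset=(50, 30)))  # prints (200, 95)
```

The division by `scale` handles DPI differences, and `offset` covers window or secondary-display origins; a real calibration would measure both rather than hard-code them.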

GUI / Headless Notes

  • On headless systems, screenshot parsing still works if you already have image files.
  • On headless systems, list / find / wait / calibrate can still run against an existing *.elements.json.
  • On headless systems, interactive actions do not work: click, click-xy, type, key, hotkey, screenshot, screen-info.
  • Interactive actions require an active GUI desktop session and OS permissions (Accessibility / Screen Recording where applicable).
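
A caller could guard interactive actions with a check like the one sketched below. The action names come from the notes above; the `has_gui_session` and `can_run` helpers are hypothetical, and the `DISPLAY` / `WAYLAND_DISPLAY` test is only a common Linux heuristic, not the skill's own detection logic. It also says nothing about OS permissions.

```python
import os
import sys

# Action names as listed in the notes above.
INTERACTIVE_ACTIONS = {"click", "click-xy", "type", "key", "hotkey", "screenshot", "screen-info"}
FILE_BASED_ACTIONS = {"list", "find", "wait", "calibrate"}


def has_gui_session(env=None, platform=None):
    """Best-effort guess at whether a GUI desktop session is available."""
    env = os.environ if env is None else env
    platform = sys.platform if platform is None else platform
    if platform.startswith("linux"):
        # X11 sets DISPLAY, Wayland sets WAYLAND_DISPLAY.
        return bool(env.get("DISPLAY") or env.get("WAYLAND_DISPLAY"))
    # No equally simple check on other platforms; assume a GUI is present
    # and let the action itself report a failure.
    return True


def can_run(action, env=None, platform=None):
    """Return True if the action is expected to work in this environment."""
    if action in FILE_BASED_ACTIONS:
        return True  # works on an existing *.elements.json, even headless
    if action in INTERACTIVE_ACTIONS:
        return has_gui_session(env, platform)
    raise ValueError(f"unknown action: {action}")


print(can_run("find", env={}, platform="linux"))   # prints True
print(can_run("click", env={}, platform="linux"))  # prints False
```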

Quick Start

  1. Bootstrap environment:
     skills/ui-element-ops/scripts/bootstrap_omniparser_env.sh "$PWD"
  2. Parse an image:
     skills/ui-element-ops/scripts/run_parse_ui.sh /abs/path/to/screen.png
  3. Operate UI:
     python3 skills/ui-element-ops/scripts/operate_ui.py --help
  4. Capture + parse with randomized names:
     skills/ui-element-ops/scripts/capture_and_parse.sh
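
If the parse step needs to be driven from Python rather than a shell, a thin wrapper along these lines might be enough. It uses only the invocation shown in the Quick Start (script path plus image path) and assumes a POSIX system; the `build_parse_command` and `parse_image` names, the absolute-path check, and the default timeout are assumptions for this sketch.

```python
import subprocess
from pathlib import Path

# Script path as given in the Quick Start, relative to the repository root.
PARSE_SCRIPT = "skills/ui-element-ops/scripts/run_parse_ui.sh"


def build_parse_command(image_path, repo_root="."):
    """Build the argv list for parsing one image."""
    image = Path(image_path)
    if not image.is_absolute():
        # Assumption: the Quick Start uses an absolute path, so require one here.
        raise ValueError("pass an absolute image path")
    return [str(Path(repo_root) / PARSE_SCRIPT), str(image)]


def parse_image(image_path, repo_root=".", timeout=300):
    """Run the parse script; raises CalledProcessError on a non-zero exit."""
    command = build_parse_command(image_path, repo_root)
    return subprocess.run(command, check=True, timeout=timeout)


print(build_parse_command("/abs/path/to/screen.png"))
```

Passing the command as a list avoids shell quoting problems with image paths that contain spaces.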

Performance Notes

  • run_parse_ui.sh and capture_and_parse.sh are compute-heavy (OCR + detection + captioning).
  • On CPU-only machines, a single run can take tens of seconds and use substantial CPU and RAM.
  • If you already have a screenshot file, parse it directly instead of capturing again:
skills/ui-element-ops/scripts/run_parse_ui.sh /abs/path/to/screen.png
  • Avoid tight loops; increase polling intervals for repeated tasks.
  • Prefer lower-frequency parsing and reuse the latest *.elements.json when possible.
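
The last two points can be sketched as a helper that reuses the newest parse result and a polling loop with an explicit interval. Both functions are hypothetical illustrations and not part of the skill; they assume only that parse results are files matching `*.elements.json` in a known directory.

```python
import glob
import os
import time


def latest_elements_file(directory):
    """Return the most recently modified *.elements.json in directory, or None."""
    candidates = glob.glob(os.path.join(directory, "*.elements.json"))
    return max(candidates, key=os.path.getmtime) if candidates else None


def poll_until(check, interval=5.0, timeout=60.0, sleep=time.sleep, clock=time.monotonic):
    """Call check() every `interval` seconds until it returns a truthy value.

    Returns the truthy result, or None once another wait would pass the timeout.
    `sleep` and `clock` are injectable so the loop can be tested without waiting.
    """
    deadline = clock() + timeout
    while True:
        result = check()
        if result:
            return result
        if clock() + interval > deadline:
            return None
        sleep(interval)
```

A caller might pass `lambda: latest_elements_file(out_dir)` as `check`, so an expensive parse is triggered only when no recent result exists.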

Main Files

  • skills/ui-element-ops/SKILL.md
  • skills/ui-element-ops/agents/openai.yaml
  • skills/ui-element-ops/scripts/parse_ui.py
  • skills/ui-element-ops/scripts/operate_ui.py

About

A skill for parsing UI screenshots into structured elements and automating desktop actions with find/wait/click/type and coordinate calibration.
