Poor LLM Localization for Native OS UI Elements

Modern multimodal LLMs are strong at visual perception but poor at localizing UI elements in native OS applications. Accessibility trees are brittle, non-deterministic, and often stripped by developers, which makes RPA unreliable.

Updated: 5/22/2026
SoMatic addresses this with a fine-tuned YOLO model that detects UI elements from vision alone and draws a labeled bounding box around each one. This enables Set-of-Marks prompting for native OS applications (Windows, macOS, Linux), reportedly achieving ~20% higher accuracy than raw models. The framework maps each bounding box ID to the element's coordinates, so agents can reference elements by label instead of by pixel coordinates.
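
The label-to-coordinate mapping can be illustrated with a minimal sketch. This is not SoMatic's actual API: the names `Mark`, `assign_marks`, and `resolve_click` are hypothetical, the reading-order numbering is an assumption, and the detections are hard-coded stand-ins for what a detector would return.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Mark:
    """One detected UI element with its numeric Set-of-Marks label (hypothetical type)."""
    label: int
    box: tuple[int, int, int, int]  # (x1, y1, x2, y2) in screen pixels

    @property
    def center(self) -> tuple[int, int]:
        x1, y1, x2, y2 = self.box
        return ((x1 + x2) // 2, (y1 + y2) // 2)


def assign_marks(boxes: list[tuple[int, int, int, int]]) -> dict[int, Mark]:
    """Number boxes by top edge, then left edge (an assumed ordering)."""
    ordered = sorted(boxes, key=lambda b: (b[1], b[0]))
    return {i: Mark(i, b) for i, b in enumerate(ordered, start=1)}


def resolve_click(marks: dict[int, Mark], label: int) -> tuple[int, int]:
    """Translate a label chosen by the agent into a pixel click point."""
    if label not in marks:
        raise KeyError(f"unknown mark label: {label}")
    return marks[label].center


# Hard-coded stand-in for detector output; a real pipeline would get
# these boxes from the vision model.
detections = [(200, 10, 260, 40), (10, 10, 110, 40), (10, 60, 310, 100)]
marks = assign_marks(detections)

# The agent answers with a label ("click element 2"); the framework
# resolves that label to screen coordinates.
click_point = resolve_click(marks, 2)
```

With this arrangement the model only has to pick a label that is visibly drawn on the screenshot, and the pixel arithmetic stays in deterministic code.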
