PROBLEM
Poor LLM Localization for Native OS UI Elements
Modern multimodal LLMs handle general visual perception well but localize UI elements in native OS applications poorly. Accessibility trees are brittle, non-deterministic, and often stripped by developers, which makes RPA that depends on them unreliable.
Updated: 5/22/2026
SoMatic addresses this with a fine-tuned YOLO model that detects UI elements purely from vision and draws a labeled bounding box around each one. This enables Set-of-Marks prompting for native OS applications (Windows, macOS, Linux), with a reported ~20% accuracy gain over raw models. The framework maps each bounding-box ID to the element's coordinates, so agents can reference elements by label instead of by pixel coordinates.
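The ID-to-coordinate mapping can be illustrated with a minimal Python sketch. This is not SoMatic's actual API: the `Box` type, `assign_marks`, `resolve_mark`, and the reading-order numbering are all assumptions made for illustration, and the hard-coded boxes stand in for whatever the detector would return.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box in screen pixels (hypothetical type)."""
    x1: int
    y1: int
    x2: int
    y2: int

    def center(self) -> tuple[int, int]:
        # Integer midpoint of the box, used as the click target.
        return ((self.x1 + self.x2) // 2, (self.y1 + self.y2) // 2)


def assign_marks(boxes: list[Box]) -> dict[int, Box]:
    """Number boxes from 1 in reading order (top edge, then left edge)."""
    ordered = sorted(boxes, key=lambda b: (b.y1, b.x1))
    return {mark_id: box for mark_id, box in enumerate(ordered, start=1)}


def resolve_mark(marks: dict[int, Box], mark_id: int) -> tuple[int, int]:
    """Translate a mark ID chosen by the agent into a click point."""
    if mark_id not in marks:
        raise KeyError(f"unknown mark id: {mark_id}")
    return marks[mark_id].center()


# Stand-in detections (assumed values); a real run would take these
# from the detector's output for the current screenshot.
detections = [
    Box(200, 10, 260, 40),
    Box(10, 10, 90, 40),
    Box(10, 100, 300, 140),
]
marks = assign_marks(detections)

# The agent answers with a label such as "2"; the framework turns it
# back into pixel coordinates.
click_point = resolve_mark(marks, 2)
print(click_point)  # (230, 25)
```

Keeping the mapping in a plain dictionary means the agent only ever has to output a small integer, and an ID that matches no detected element fails loudly instead of producing a click at an arbitrary position.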