Vision-Based Automation
Vision-Based UI Automation
Definition
An approach to automating user interfaces that relies on computer vision (typically fine-tuned YOLO object-detection models) to detect and localize UI elements in screenshots, rather than on structural APIs such as accessibility trees or the DOM. Because it works from pixels alone, it lets agents interact with any interface that can be rendered on screen, including applications that expose no structural metadata.
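As a rough illustration of the idea, the sketch below starts from the kind of output a UI element detector might produce (a label, a confidence score, and a bounding box per element) and turns it into a click coordinate. The `Detection` schema, the function names, the 0.5 confidence threshold, and the sample values are all assumptions made for this example; they are not taken from any particular framework, and the detector itself is left out.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Detection:
    """One UI element as a detector might report it (hypothetical schema)."""
    label: str                       # e.g. "button", "text_field"
    confidence: float                # detector score in [0, 1]
    box: Tuple[int, int, int, int]   # (x_min, y_min, x_max, y_max) in pixels


def click_point(det: Detection) -> Tuple[int, int]:
    """Use the center of the bounding box as the click target."""
    x_min, y_min, x_max, y_max = det.box
    return ((x_min + x_max) // 2, (y_min + y_max) // 2)


def pick_target(
    detections: List[Detection],
    label: str,
    min_confidence: float = 0.5,     # assumed threshold, tune per model
) -> Optional[Detection]:
    """Return the most confident detection with the given label, if any."""
    candidates = [
        d for d in detections
        if d.label == label and d.confidence >= min_confidence
    ]
    return max(candidates, key=lambda d: d.confidence, default=None)


# Made-up detections standing in for the output of a real model.
detections = [
    Detection("button", 0.91, (100, 200, 180, 240)),
    Detection("button", 0.42, (300, 200, 380, 240)),
    Detection("text_field", 0.88, (100, 100, 400, 140)),
]

target = pick_target(detections, "button")
if target is not None:
    print(click_point(target))  # (140, 220)
```

In a real agent the coordinate would be handed to an input-injection layer, and the detections would come from running a model on a fresh screenshot before each action.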
Examples in the Wild
- Example 1: SoMatic framework using YOLO to detect text and interactable elements
- Example 2: OmniParser v2 approach for UI element identification