YOLO Model

You Only Look Once

Definition

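A family of real-time object detection neural network architectures that predict bounding boxes and class labels in a single forward pass over the image. In the context of agent automation, fine-tuned YOLO models are used to detect and localize UI elements (text, buttons, inputs) in screenshots. Such models can be exported to the ONNX format and run locally on CPU (for example with ONNX Runtime) for fast inference.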
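To make the "detect and localize" step concrete, the following is a minimal sketch of what an agent might do with a detector's output once it has been decoded into boxes. It assumes each detection arrives as a label, a confidence score, and pixel coordinates; the `Detection` class, the `select_elements` helper, the thresholds, and the sample boxes are all hypothetical and are not taken from any particular YOLO implementation. The sketch filters out low-confidence boxes, applies greedy non-maximum suppression to drop near-duplicates, and turns each surviving box into a click point.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    """One detected UI element. Field names are illustrative assumptions."""
    label: str    # e.g. "button", "text", "input"
    score: float  # detector confidence in [0, 1]
    x1: float     # left edge, screenshot pixels
    y1: float     # top edge
    x2: float     # right edge
    y2: float     # bottom edge

    def area(self) -> float:
        return max(0.0, self.x2 - self.x1) * max(0.0, self.y2 - self.y1)

    def center(self) -> tuple[float, float]:
        # Midpoint of the box, usable as a click target.
        return ((self.x1 + self.x2) / 2, (self.y1 + self.y2) / 2)


def iou(a: Detection, b: Detection) -> float:
    """Intersection over union of two axis-aligned boxes."""
    iw = max(0.0, min(a.x2, b.x2) - max(a.x1, b.x1))
    ih = max(0.0, min(a.y2, b.y2) - max(a.y1, b.y1))
    inter = iw * ih
    union = a.area() + b.area() - inter
    return inter / union if union > 0 else 0.0


def select_elements(dets, min_score=0.5, iou_threshold=0.5):
    """Confidence filter, then greedy class-agnostic non-maximum suppression."""
    candidates = sorted(
        (d for d in dets if d.score >= min_score),
        key=lambda d: d.score,
        reverse=True,
    )
    kept = []
    for d in candidates:
        # Keep a box only if it does not heavily overlap a higher-scoring one.
        if all(iou(d, k) < iou_threshold for k in kept):
            kept.append(d)
    return kept


# Made-up detections standing in for decoded model output.
raw = [
    Detection("button", 0.92, 100, 200, 220, 240),
    Detection("button", 0.81, 104, 202, 222, 242),  # near-duplicate of the first
    Detection("input", 0.77, 100, 300, 400, 340),
    Detection("text", 0.20, 10, 10, 60, 30),        # below the score threshold
]

elements = select_elements(raw)
click_points = [(d.label, d.center()) for d in elements]
print(click_points)
# [('button', (160.0, 220.0)), ('input', (250.0, 320.0))]
```

The suppression here ignores class labels, which is a simplification; a real pipeline may suppress per class, and some exported models already include this step, in which case only the confidence filter and the center computation would apply.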

Examples in the Wild

  • Example 1: SoMatic's fine-tuned YOLO for identifying text and interactable elements in native OS UIs
  • Example 2: OmniParser v2's approach, which uses YOLO-inspired detection