Vision-Based OS Automation with SoMatic

Enable AI agents to autonomously control any OS interface using vision and Set-Of-Marks prompting

Updated: 5/22/2026
Difficulty: medium
Time: 30m
Use Case: Give AI agents full autonomy to control Windows, Mac, or Linux computers by detecting UI elements through vision

About this automation

SoMatic uses a fine-tuned YOLO model, running locally on CPU via ONNX, to identify text and interactable elements in any UI. It draws labeled bounding boxes on the screenshot and maps each label ID to the element's coordinates, which enables Set-Of-Marks prompting for native OS automation. With GPT-4.5, it achieves roughly 20% higher accuracy than the raw model alone.
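
To make the ID-to-coordinate mapping concrete, here is a minimal sketch of the idea in Python. It is not SoMatic's actual code: the `Box` structure, the `assign_marks` and `describe_marks` helpers, the reading-order numbering, and the sample detections are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box in screen pixels (hypothetical structure)."""
    x1: int
    y1: int
    x2: int
    y2: int
    label: str

    def center(self) -> tuple[int, int]:
        """Return the point an agent would click to hit this element."""
        return ((self.x1 + self.x2) // 2, (self.y1 + self.y2) // 2)


def assign_marks(boxes: list[Box]) -> dict[int, Box]:
    """Number the boxes top-to-bottom, then left-to-right (an assumed ordering)."""
    ordered = sorted(boxes, key=lambda b: (b.y1, b.x1))
    return {mark_id: box for mark_id, box in enumerate(ordered, start=1)}


def describe_marks(marks: dict[int, Box]) -> str:
    """Render a text legend that could accompany the annotated screenshot."""
    return "\n".join(f"[{i}] {b.label}" for i, b in marks.items())


# Sample detections, made up for illustration.
detections = [
    Box(200, 40, 300, 70, "button: Save"),
    Box(20, 40, 120, 70, "button: Open"),
    Box(20, 100, 400, 130, "text field: Filename"),
]
marks = assign_marks(detections)
print(describe_marks(marks))
print(marks[2].center())  # (250, 55)
```

The point of the mapping is that the LLM only has to name a mark such as `2`; the lookup table, not the model, supplies the pixel coordinates.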

How to implement

1. Install the SoMatic CLI: npm install -g somatic-cli/cli
2. Add the SoMatic skill: npx skills add Smyan1909/SoMatic
3. Configure your LLM API (GPT-4.5 or compatible)
4. Use the stdio MCP server to parse base64-encoded screenshots directly
5. Define agent tasks that reference bounding box IDs instead of pixel coordinates
6. Run the agent with full OS access on Windows, Mac, or Linux
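
Steps 4 and 5 can be sketched roughly as follows, in Python. MCP messages are JSON-RPC 2.0 and tool invocations use the `tools/call` method, but the tool name `parse_screenshot`, the argument key `image_b64`, the action format, and the sample mark table are placeholders, not SoMatic's documented interface; check the server's tool listing for the real names.

```python
import base64
import json


def build_parse_request(png_bytes: bytes, request_id: int = 1) -> str:
    """Build a JSON-RPC 2.0 `tools/call` message carrying a base64 screenshot."""
    message = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {
            "name": "parse_screenshot",  # hypothetical tool name
            "arguments": {
                # hypothetical argument key
                "image_b64": base64.b64encode(png_bytes).decode("ascii"),
            },
        },
    }
    # The stdio transport sends one JSON message per line.
    return json.dumps(message) + "\n"


def resolve_action(action: dict, marks: dict[int, tuple[int, int]]) -> tuple[str, int, int]:
    """Turn an agent action that names a mark ID into a concrete screen point."""
    x, y = marks[action["mark_id"]]
    return (action["type"], x, y)


# PNG signature only; stands in for a real screenshot.
fake_png = b"\x89PNG\r\n\x1a\n"
line = build_parse_request(fake_png)

# Mark ID -> center point, made up for illustration.
marks = {1: (70, 55), 2: (250, 55)}
print(resolve_action({"type": "click", "mark_id": 2}, marks))  # ('click', 250, 55)
```

A real MCP session also starts with an `initialize` handshake before any tool call, which an MCP client library normally handles. Keeping the agent's tasks in terms of mark IDs means the same task definition keeps working when window positions or screen resolution change, since coordinates are resolved fresh from each screenshot.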