Difficulty

medium

Time

30m

Use Case

Give AI agents full autonomy to control Windows, Mac, or Linux computers by detecting UI elements through vision

About this automation

SoMatic uses a fine-tuned YOLO model, running locally on CPU via ONNX, to identify text and interactable elements in any UI. It draws labeled bounding boxes and maps each label ID to the element's coordinates, which enables Set-of-Marks prompting for native OS automation. The approach achieves ~20% higher accuracy than the raw model on GPT-4.5.
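
The ID-to-coordinate mapping described above can be sketched roughly as follows. This is an illustration only, not SoMatic's actual code: the Detection type, the build_mark_map and click_point helpers, and the sample boxes are all hypothetical stand-ins for whatever the detector really returns.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    """One detected UI element (hypothetical shape, not SoMatic's real output)."""
    label: str
    x1: int
    y1: int
    x2: int
    y2: int


def build_mark_map(detections):
    """Assign sequential IDs in reading order (top-to-bottom, then left-to-right)."""
    ordered = sorted(detections, key=lambda d: (d.y1, d.x1))
    return {mark_id: d for mark_id, d in enumerate(ordered, start=1)}


def click_point(mark_map, mark_id):
    """Resolve a mark ID chosen by the model to the centre pixel of its box."""
    d = mark_map[mark_id]
    return ((d.x1 + d.x2) // 2, (d.y1 + d.y2) // 2)


# Made-up boxes standing in for detector output on a screenshot.
detections = [
    Detection("button", 200, 300, 280, 330),
    Detection("text", 10, 10, 110, 30),
    Detection("input", 10, 300, 190, 330),
]
marks = build_mark_map(detections)

# The model answers with an ID such as 3; the agent turns it into a pixel target.
target = click_point(marks, 3)
```

With a map like this, the LLM only has to name an ID from the annotated screenshot, and the agent looks up the pixel target itself rather than asking the model to estimate coordinates.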

How to implement

Install SoMatic CLI: npm install -g somatic-cli/cli

Add SoMatic skill: npx skills add Smyan1909/SoMatic

Configure your LLM API (GPT-4.5 or compatible)

Use the stdio MCP server to parse base64-encoded screenshots directly

Define agent tasks that reference bounding box IDs instead of pixel coordinates

Run agent with full OS access across Windows, Mac, or Linux
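
As a rough sketch of the screenshot step, the snippet below base64-encodes image bytes and wraps them in a JSON-RPC 2.0 tools/call request, which is the message shape MCP uses for tool invocation. The tool name parse_screenshot and the argument key image_b64 are assumptions made for illustration; check the tool list the SoMatic server actually advertises before relying on them.

```python
import base64
import json


def build_tool_call(image_bytes, request_id=1):
    """Build a single-line JSON-RPC request carrying a base64-encoded screenshot."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    request = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {
            "name": "parse_screenshot",           # hypothetical tool name
            "arguments": {"image_b64": encoded},  # hypothetical argument key
        },
    }
    # The stdio transport expects one JSON message per line, so no pretty-printing.
    return json.dumps(request)


# Stand-in payload: the 8-byte PNG signature instead of a real screenshot file.
PNG_SIGNATURE = b"\x89PNG\r\n\x1a\n"
line = build_tool_call(PNG_SIGNATURE)
```

In a real session the client would first complete the MCP initialize handshake, then write a line like this to the server's stdin and read the annotated result from its stdout.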

Vision-Based OS Automation with SoMatic
