● Research Scientist @ MacPaw AIR
● Applied Mathematics PhD Candidate @ Institute of Mathematics of NAS of Ukraine
● Lecturer @ Kyiv Academic University
Who am I?
01 Computer Use problem
02 Commercial CUA
03 Open CUA
04 RL
05 Evaluation and Benchmarks
Computer Use
https://arxiv.org/pdf/2501.12326
1. Feed a screenshot of the current state to the CUA
2. The CUA returns the action it considers appropriate
3. The action is executed in the environment
4. A new screenshot is taken
https://platform.openai.com/docs/guides/tools-computer-use
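The four steps above form a loop. A minimal Python sketch, where the "environment" is a plain dict and `take_screenshot`, `query_cua`, and `execute` are toy stand-ins for a real CUA API and VM (all names here are illustrative assumptions):

```python
# Toy sketch of the CUA loop: screenshot -> model -> action -> repeat.
# A real agent would call a multimodal LLM and a VM, not these stubs.

def take_screenshot(env):
    # Steps 1 and 4: observe the current state.
    return dict(env)

def query_cua(task, state):
    # Step 2: the model proposes the action it considers appropriate.
    if state.get("replied"):
        return {"type": "done"}
    return {"type": "click", "target": "Reply"}

def execute(env, action):
    # Step 3: apply the action to the environment.
    if action["type"] == "click" and action["target"] == "Reply":
        env["replied"] = True

def run_agent(task, env, max_steps=50):
    state = take_screenshot(env)
    for _ in range(max_steps):
        action = query_cua(task, state)
        if action["type"] == "done":
            return True
        execute(env, action)
        state = take_screenshot(env)
    return False
```

The `max_steps` cap is how harnesses typically keep a stuck agent from looping forever.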
Markov decision process
https://www.ibm.com/think/topics/reinforcement-learning
What is the agent?
● A Multimodal Large Language Model (MLLM)
● or a Multi-Agent System (a bunch of MLLMs)
What is the state?
● Screenshot
● Metadata (e.g. accessibility tree, list of active applications, etc.)
What is the environment?
● A Virtual Machine (for safety considerations)
● Either an actual OS (in Docker, VMware, UTM, etc.)
● or a dedicated (fake) one
https://github.com/e2b-dev/desktop/
What is the action?
● Anything you want to allow and can develop
● Typically you want your agent to be able to at least click and type
● This set of allowed actions is called the action space
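An action space can be sketched as a small vocabulary of parameterized actions; the click/type/scroll/done set below is an illustrative assumption, not a standard:

```python
# A toy action space. The concrete vocabulary is up to whoever builds
# the environment; this one is only an illustrative assumption.
ACTIONS = {
    "click":  {"x": int, "y": int},    # click at screen coordinates
    "type":   {"text": str},           # type a string
    "scroll": {"dx": int, "dy": int},  # scroll by an offset
    "done":   {},                      # declare the task finished
}

def validate(action):
    """Check that an action uses only the allowed vocabulary."""
    kind = action.get("type")
    if kind not in ACTIONS:
        return False
    schema = ACTIONS[kind]
    args = {k: v for k, v in action.items() if k != "type"}
    return set(args) == set(schema) and all(
        isinstance(args[k], t) for k, t in schema.items()
    )
```

Validation like this is what keeps a model's free-form output inside the action space before it touches the environment.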
What is the reward?
● In Computer Use there is no direct reward signal
● It can come from a handcrafted evaluation script or an automated LLM judge
● This makes training difficult, but not impossible
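A handcrafted evaluation script is essentially a reward function over the final state. A minimal sketch, assuming an illustrative "reply to John Doe about the agenda" task and a toy mailbox state (both are assumptions for the example):

```python
# Sketch of a handcrafted evaluation script as a reward function.
# The task and the mailbox-state layout are illustrative assumptions.

def reward(final_state: dict) -> float:
    """Rule-based reward: 1.0 on success, 0.0 otherwise."""
    sent = final_state.get("sent_messages", [])
    ok = any(
        m.get("to") == "John Doe" and "agenda" in m.get("body", "").lower()
        for m in sent
    )
    return 1.0 if ok else 0.0
```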
Browser Use? Phone Use? etc.
● Same, but different
● You have to use the corresponding environment and action space:
○ Browser – Playwright, Selenium, BrowserGym
○ Android – AndroidEnv
● Typically easier than Computer Use (smaller state space, more data)
Commercial Computer Use Agents
Claude Computer Use
● Out since October 2024
● Still in beta
● Regular Claude Sonnet 3.5/4 with the computer-use tool
● https://docs.anthropic.com/en/docs/agents-and-tools/tool-use/computer-use-tool
OpenAI Computer Use
● Out since January 2025
● Also in beta
● GPT-4o or o3
● https://platform.openai.com/docs/guides/tools-computer-use
OpenAI CUA demo https://github.com/mshamrai/openai-cua-gradio
Training
Recap: Training pipeline of LLMs
https://cameronrwolfe.substack.com/p/demystifying-reasoning-models
Trajectories dataset
Task: Reply to John Doe that I like the agenda
1. Initial state
2. click on "John Doe"
3. click on "Reply"
4. type …
● A trajectory is a list of screenshots (states) and actions that corresponds to the successful completion of a task
● trajectory = (s0, a0, s1, a1, …)
● So the dataset consists of pairs (task, trajectory)
● It is better to also add "thoughts" before each action
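The dataset records described above can be sketched as plain dataclasses; the field names and the sample trajectory below are illustrative assumptions:

```python
from dataclasses import dataclass

# Sketch of one dataset record: a task paired with a trajectory
# (s0, a0, s1, a1, ...), with an optional "thought" per step.

@dataclass
class Step:
    screenshot: bytes  # s_t: the observed state
    thought: str       # reasoning emitted before acting
    action: str        # a_t, e.g. 'click on "Reply"'

@dataclass
class Record:
    task: str
    trajectory: list[Step]

# Illustrative record (the screenshots are placeholder bytes).
record = Record(
    task="Reply to John Doe that I like the agenda",
    trajectory=[
        Step(b"<png s0>", "The inbox is open", 'click on "John Doe"'),
        Step(b"<png s1>", "The thread is open", 'click on "Reply"'),
        Step(b"<png s2>", "The compose box is focused", 'type "I like the agenda"'),
    ],
)
```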
Supervised Fine-Tuning (SFT)
● From each record, compose a "conversation"
● Train an LLM using SFT
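A sketch of composing one "conversation" from a (task, trajectory) record. The message schema and the `<image ...>` placeholder are assumptions; real pipelines interleave actual image tokens for the screenshots:

```python
# Sketch: one (task, trajectory) record -> an SFT chat transcript.
# trajectory is a list of (screenshot, thought, action) triples.

def to_conversation(task, trajectory):
    messages = [{"role": "user", "content": task}]
    for screenshot, thought, action in trajectory:
        # The screenshot becomes a user turn (the observation) ...
        messages.append({"role": "user", "content": f"<image {screenshot}>"})
        # ... and the thought + action become the assistant turn to imitate.
        messages.append({
            "role": "assistant",
            "content": f"Thought: {thought}\nAction: {action}",
        })
    return messages
```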
Open Computer Use MLLMs
CogAgent
● Out since December 2023
● New version released in December 2024
● Still not as good as others
https://arxiv.org/abs/2312.08914
Agent S2
● Out since April 2025
● Multi-agent framework
● Works best with commercial LLMs as the planner
https://arxiv.org/pdf/2504.00906
UI-TARS
● Out since January 2025
● New version released in April 2025 (current SOTA)
● Based on Qwen2-VL; the new version on Qwen2.5-VL
● Trained on 50B tokens of manually annotated data (not open sourced)
https://arxiv.org/pdf/2501.12326
https://seed-tars.com/1.5
https://arxiv.org/pdf/2409.12191
RL in Computer Use
Recap: Reinforcement Learning from Human Feedback
https://huggingface.co/blog/rlhf
PPO
● Introduced by OpenAI in 2017
● Actor-Critic algorithm
● Still a good baseline for any RL task
https://openai.com/index/openai-baselines-ppo/
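PPO's core idea can be shown for a single action: the clipped surrogate objective from the 2017 paper, L = min(r·A, clip(r, 1−ε, 1+ε)·A), where r is the new-to-old policy probability ratio and A the advantage. A scalar sketch (real training averages this over a batch):

```python
import math

# Clipped PPO surrogate for one action: the min makes the objective
# pessimistic, so the policy cannot profit from moving the ratio far
# outside [1 - eps, 1 + eps] in a single update.

def ppo_clip_objective(logp_new, logp_old, advantage, eps=0.2):
    ratio = math.exp(logp_new - logp_old)
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped * advantage)
```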
DPO
https://arxiv.org/pdf/2305.18290
● Introduced in 2023
● Designed specifically for LLMs
● Does not require a reward or value model
● Trains on pairs of outputs where one is a winner and the other is a loser
● Derived in closed-form
● Slightly less effective than PPO, but cheaper and offline
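For a single preference pair the closed form is short enough to write out. A sketch of the per-pair DPO loss, −log σ(β[(log π(y_w) − log π_ref(y_w)) − (log π(y_l) − log π_ref(y_l))]), with log-probabilities as inputs (argument names are illustrative):

```python
import math

# Per-pair DPO loss: the policy is pushed to widen the winner-vs-loser
# log-probability margin relative to the frozen reference model.

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When the policy equals the reference the margin is zero and the loss is log 2; it shrinks as the winner becomes relatively more likely.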
SFT vs DPO
https://arxiv.org/pdf/2501.12326
GRPO
https://arxiv.org/pdf/2402.03300
● Introduced by DeepSeek in 2024
● Used to train a reasoning model
● A modification of PPO
● Without a value model (critic)
● Rule-based reward (only works with problems that can be evaluated automatically)
● Needs fewer resources than PPO, but is still online
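The critic-free part is easy to sketch: GRPO samples a group of outputs for one prompt, scores each with the rule-based reward, and uses the within-group normalized reward as the advantage (population statistics here, a simplification of the paper's formula):

```python
import statistics

# Group-relative advantages: no value model needed, the group's own
# reward statistics play the role of the baseline.

def group_advantages(rewards):
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:
        # All outputs scored the same: no learning signal in this group.
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]
```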
ARPO
https://arxiv.org/pdf/2505.16282
“Classic” RL
Android in the Wild (AitW)
https://github.com/google-research/google-research/tree/master/android_in_the_wild
DigiRL
● NeurIPS 2024
● Off-Policy RL via Advantage-Weighted Regression
● Offline + Online RL
● Automatic evaluation with Gemini
● Android Use?
https://arxiv.org/pdf/2406.11896

Digi-Q
● Out since February 2025
● Actor-Critic
● Offline!
● Android Use!
https://arxiv.org/pdf/2502.15760
DigiRL & Digi-Q
https://digirl-agent.github.io/DigiQ-agent.github.io/
Evaluation and Benchmarks
Evaluation

Offline
● Use pre-recorded trajectories
● Compare actions (perplexity, rule-based)
● Pros: cheap and fast (no need to run the environment)
● Cons: unreliable due to the dynamic nature of the environment

Online
● Use the environment and run the agent
● Evaluate the execution using rules (or automatically)
● Pros: reliable metric
● Cons: expensive and slow; have to run several times
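The offline protocol can be sketched as replaying recorded states and scoring the agent's predicted action against the recorded one (exact match here; perplexity-based scoring is the other common choice):

```python
# Offline evaluation sketch: replay a pre-recorded trajectory and
# count how often the agent's prediction matches the recorded action.

def offline_score(agent, trajectory):
    """trajectory: list of (state, gold_action); returns match rate."""
    hits = 0
    for state, gold_action in trajectory:
        if agent(state) == gold_action:
            hits += 1
    return hits / len(trajectory)
```

This is cheap precisely because `agent` is only ever queried, never allowed to change the state, which is also why the metric drifts from true online success.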
OSWorld
https://os-world.github.io/
● Online benchmark
● 369 manually labeled computer tasks with automatic evaluation code
● Ubuntu, Windows, and macOS
OSWorld
Tasks:
● Can you enable the 'Do Not Track' feature in Chrome to enhance my online privacy?
● Could you turn my image into CYMK mode?
● Fill all the blank cells in B1:E30 with the value in the cell above it.
● Color the first three textboxes on slide 1 yellow, red, and green, respectively, in top-to-bottom order. Use exactly these colors—no variations (e.g., no dark red, light green, etc.).
● Make the line spacing of first two paragraph into double line spacing
● I need to include the experiment results from "~/Documents/awesome-desktop/expe-results.xlsx" into the currently writing report. Specifically, extract the results of GPT-4 and insert a table into the "Main Results" section of my report.
● I click in terminal: terminal->132x43 to change terminal size but after each reboot terminal size is set to default setting and I have to change it again. Help me set it permanently
● Please help me install an extension in VS Code from a local VSIX file "/home/user/test.vsix".
ScreenSpot Pro
https://huggingface.co/datasets/likaixin/ScreenSpot-Pro
● High-resolution
● Grounding (1 action)
ScreenSpot Pro
https://likaixin2000.github.io/papers/ScreenSpot_Pro.pdf
ScreenSpot Pro
https://gui-agent.github.io/grounding-leaderboard/
Q&A

"Computer Use Agents: From SFT to Classic RL", Maksym Shamrai

Editor's Notes

  • #3 First I will explain why this particular topic was chosen, then we will look at how such tools can work using an example: face recognition and media search. We will summarize it all and answer questions.