● Research Scientist @ MacPaw AIR
● Applied Mathematics PhD Candidate @ Institute of Mathematics of NAS of Ukraine
● Lecturer @ Kyiv Academic University
Who am I?
01 Computer Use problem
02 Commercial CUA
03 Open CUA
04 RL
05 Evaluation and Benchmarks
Computer Use
https://arxiv.org/pdf/2501.12326
1. Feed a screenshot of the current state to the CUA
2. The CUA returns the action it considers appropriate
3. The action is executed in the environment
4. A new screenshot is taken
https://platform.openai.com/docs/guides/tools-computer-use
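The four steps above form a loop. A minimal Python sketch, where the "environment" is a plain dict and `take_screenshot`, `query_cua`, and `execute` are toy stand-ins for a real CUA API and VM (all names here are illustrative assumptions):

```python
# Toy sketch of the CUA loop: screenshot -> model -> action -> repeat.
# A real agent would call a multimodal LLM and a VM, not these stubs.

def take_screenshot(env):
    # Steps 1 and 4: observe the current state.
    return dict(env)

def query_cua(task, state):
    # Step 2: the model proposes the action it considers appropriate.
    if state.get("replied"):
        return {"type": "done"}
    return {"type": "click", "target": "Reply"}

def execute(env, action):
    # Step 3: apply the action to the environment.
    if action["type"] == "click" and action["target"] == "Reply":
        env["replied"] = True

def run_agent(task, env, max_steps=50):
    state = take_screenshot(env)
    for _ in range(max_steps):
        action = query_cua(task, state)
        if action["type"] == "done":
            return True
        execute(env, action)
        state = take_screenshot(env)
    return False
```

The `max_steps` cap is how harnesses typically keep a stuck agent from looping forever.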
Markov decision process
https://www.ibm.com/think/topics/reinforcement-learning
What is the agent?
● A Multimodal Large Language Model (MLLM)
● or a Multi-Agent System (a bunch of MLLMs)
What is the state?
● Screenshot
● Metadata (e.g. accessibility tree, list of active applications, etc.)
What is the environment?
● A Virtual Machine (for safety considerations)
● Either an actual OS (in Docker, VMware, UTM, etc.)
● or a dedicated (fake) one
https://github.com/e2b-dev/desktop/
What is the action?
● Anything you want to allow and can develop
● Typically you want your agent to be able to at least click and type
● This set of allowed actions is called the action space
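An action space can be sketched as a small vocabulary of parameterized actions; the click/type/scroll/done set below is an illustrative assumption, not a standard:

```python
# A toy action space. The concrete vocabulary is up to whoever builds
# the environment; this one is only an illustrative assumption.
ACTIONS = {
    "click":  {"x": int, "y": int},    # click at screen coordinates
    "type":   {"text": str},           # type a string
    "scroll": {"dx": int, "dy": int},  # scroll by an offset
    "done":   {},                      # declare the task finished
}

def validate(action):
    """Check that an action uses only the allowed vocabulary."""
    kind = action.get("type")
    if kind not in ACTIONS:
        return False
    schema = ACTIONS[kind]
    args = {k: v for k, v in action.items() if k != "type"}
    return set(args) == set(schema) and all(
        isinstance(args[k], t) for k, t in schema.items()
    )
```

Validation like this is what keeps a model's free-form output inside the action space before it touches the environment.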
What is the reward?
● In Computer Use there is no direct reward signal
● It can come from a handcrafted evaluation script or an automated LLM judge
● This makes training difficult, but not impossible
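A handcrafted evaluation script is essentially a reward function over the final state. A minimal sketch, assuming an illustrative "reply to John Doe about the agenda" task and a toy mailbox state (both are assumptions for the example):

```python
# Sketch of a handcrafted evaluation script as a reward function.
# The task and the mailbox-state layout are illustrative assumptions.

def reward(final_state: dict) -> float:
    """Rule-based reward: 1.0 on success, 0.0 otherwise."""
    sent = final_state.get("sent_messages", [])
    ok = any(
        m.get("to") == "John Doe" and "agenda" in m.get("body", "").lower()
        for m in sent
    )
    return 1.0 if ok else 0.0
```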
Browser Use? Phone Use? etc.
● Same, but different
● You have to use the corresponding environment and action space:
○ Browser – Playwright, Selenium, BrowserGym
○ Android – AndroidEnv
● Typically easier than Computer Use (smaller state space, more data)
Commercial Computer Use Agents
Claude Computer Use
● Out since October 2024
● Still in beta
● Regular Claude Sonnet 3.5/4 with the computer-use tool
● https://docs.anthropic.com/en/docs/agents-and-tools/tool-use/computer-use-tool
OpenAI Computer Use
● Out since January 2025
● Also in beta
● GPT-4o or o3
● https://platform.openai.com/docs/guides/tools-computer-use
OpenAI CUA demo https://github.com/mshamrai/openai-cua-gradio
Training
Recap: Training pipeline of LLMs
https://cameronrwolfe.substack.com/p/demystifying-reasoning-models
Trajectories dataset
Task: Reply to John Doe that I like the agenda
1. Initial state
2. click on "John Doe"
3. click on "Reply"
4. type …
● A trajectory is a list of screenshots (states) and actions that corresponds to the successful completion of a task
● trajectory = (s0, a0, s1, a1, …)
● So the dataset consists of pairs (task, trajectory)
● It is better to also add "thoughts" before each action
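The dataset records described above can be sketched as plain dataclasses; the field names and the sample trajectory below are illustrative assumptions:

```python
from dataclasses import dataclass

# Sketch of one dataset record: a task paired with a trajectory
# (s0, a0, s1, a1, ...), with an optional "thought" per step.

@dataclass
class Step:
    screenshot: bytes  # s_t: the observed state
    thought: str       # reasoning emitted before acting
    action: str        # a_t, e.g. 'click on "Reply"'

@dataclass
class Record:
    task: str
    trajectory: list[Step]

# Illustrative record (the screenshots are placeholder bytes).
record = Record(
    task="Reply to John Doe that I like the agenda",
    trajectory=[
        Step(b"<png s0>", "The inbox is open", 'click on "John Doe"'),
        Step(b"<png s1>", "The thread is open", 'click on "Reply"'),
        Step(b"<png s2>", "The compose box is focused", 'type "I like the agenda"'),
    ],
)
```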
Supervised Fine-Tuning (SFT)
● From each record, compose a "conversation"
● Train an LLM using SFT
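A sketch of composing one "conversation" from a (task, trajectory) record. The message schema and the `<image ...>` placeholder are assumptions; real pipelines interleave actual image tokens for the screenshots:

```python
# Sketch: one (task, trajectory) record -> an SFT chat transcript.
# trajectory is a list of (screenshot, thought, action) triples.

def to_conversation(task, trajectory):
    messages = [{"role": "user", "content": task}]
    for screenshot, thought, action in trajectory:
        # The screenshot becomes a user turn (the observation) ...
        messages.append({"role": "user", "content": f"<image {screenshot}>"})
        # ... and the thought + action become the assistant turn to imitate.
        messages.append({
            "role": "assistant",
            "content": f"Thought: {thought}\nAction: {action}",
        })
    return messages
```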
Open Computer Use MLLMs
CogAgent
● Out since December 2023
● New version released in December 2024
● Still not as good as others
https://arxiv.org/abs/2312.08914
Agent S2
● Out since April 2025
● Multi-agent framework
● Works best with commercial LLMs as the planner
https://arxiv.org/pdf/2504.00906
UI-TARS
● Out since January 2025
● New version released in April 2025 (current SOTA)
● Based on Qwen2-VL; the new version on Qwen2.5-VL
● Trained on 50B tokens of manually annotated data (not open sourced)
https://arxiv.org/pdf/2501.12326
https://seed-tars.com/1.5
https://arxiv.org/pdf/2409.12191
RL in Computer Use
Recap: Reinforcement Learning from Human Feedback
https://huggingface.co/blog/rlhf
PPO
● Introduced by OpenAI in 2017
● Actor-Critic algorithm
● Still a good baseline for any RL task
https://openai.com/index/openai-baselines-ppo/
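PPO's core idea can be shown for a single action: the clipped surrogate objective from the 2017 paper, L = min(r·A, clip(r, 1−ε, 1+ε)·A), where r is the new-to-old policy probability ratio and A the advantage. A scalar sketch (real training averages this over a batch):

```python
import math

# Clipped PPO surrogate for one action: the min makes the objective
# pessimistic, so the policy cannot profit from moving the ratio far
# outside [1 - eps, 1 + eps] in a single update.

def ppo_clip_objective(logp_new, logp_old, advantage, eps=0.2):
    ratio = math.exp(logp_new - logp_old)
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped * advantage)
```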
DPO
https://arxiv.org/pdf/2305.18290
● Introduced in 2023
● Designed specifically for LLMs
● Does not require a reward or value model
● Trains on pairs of outputs where one is a winner and the other is a loser
● Derived in closed-form
● Slightly less effective than PPO, but cheaper and offline
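For a single preference pair the closed form is short enough to write out. A sketch of the per-pair DPO loss, −log σ(β[(log π(y_w) − log π_ref(y_w)) − (log π(y_l) − log π_ref(y_l))]), with log-probabilities as inputs (argument names are illustrative):

```python
import math

# Per-pair DPO loss: the policy is pushed to widen the winner-vs-loser
# log-probability margin relative to the frozen reference model.

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

When the policy equals the reference the margin is zero and the loss is log 2; it shrinks as the winner becomes relatively more likely.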
SFT vs DPO
https://arxiv.org/pdf/2501.12326
GRPO
https://arxiv.org/pdf/2402.03300
● Introduced by DeepSeek in 2024
● Used to train a reasoning model
● A modification of PPO
● Without a value model (critic)
● Rule-based reward (only works with problems that can be evaluated automatically)
● Needs fewer resources than PPO, but is still online
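The critic-free part is easy to sketch: GRPO samples a group of outputs for one prompt, scores each with the rule-based reward, and uses the within-group normalized reward as the advantage (population statistics here, a simplification of the paper's formula):

```python
import statistics

# Group-relative advantages: no value model needed, the group's own
# reward statistics play the role of the baseline.

def group_advantages(rewards):
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0:
        # All outputs scored the same: no learning signal in this group.
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]
```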
ARPO
https://arxiv.org/pdf/2505.16282
“Classic” RL
Android in the Wild (AitW)
https://github.com/google-research/google-research/tree/master/android_in_the_wild
DigiRL
● NeurIPS 2024
● Off-Policy RL via Advantage-Weighted Regression
● Offline + Online RL
● Automatic evaluation with Gemini
● Android Use?
https://arxiv.org/pdf/2406.11896

Digi-Q
● Out since February 2025
● Actor-Critic
● Offline!
● Android Use!
https://arxiv.org/pdf/2502.15760
DigiRL & Digi-Q
https://digirl-agent.github.io/DigiQ-agent.github.io/
Evaluation and Benchmarks
Evaluation

Offline
● Use pre-recorded trajectories
● Compare actions (perplexity, rule-based)
● Pros: cheap and fast (no need to run the environment)
● Cons: unreliable due to the dynamic nature of the environment

Online
● Use the environment and run the agent
● Evaluate the execution using rules (or automatically)
● Pros: reliable metric
● Cons: expensive and slow; have to run several times
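The offline protocol can be sketched as replaying recorded states and scoring the agent's predicted action against the recorded one (exact match here; perplexity-based scoring is the other common choice):

```python
# Offline evaluation sketch: replay a pre-recorded trajectory and
# count how often the agent's prediction matches the recorded action.

def offline_score(agent, trajectory):
    """trajectory: list of (state, gold_action); returns match rate."""
    hits = 0
    for state, gold_action in trajectory:
        if agent(state) == gold_action:
            hits += 1
    return hits / len(trajectory)
```

This is cheap precisely because `agent` is only ever queried, never allowed to change the state, which is also why the metric drifts from true online success.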
OSWorld
https://os-world.github.io/
● Online benchmark
● 369 manually labeled computer tasks with automatic evaluation code
● Ubuntu, Windows, and macOS
OSWorld
Tasks:
● Can you enable the 'Do Not Track' feature in Chrome to enhance my online privacy?
● Could you turn my image into CYMK mode?
● Fill all the blank cells in B1:E30 with the value in the cell above it.
● Color the first three textboxes on slide 1 yellow, red, and green, respectively, in top-to-bottom order. Use exactly these colors—no variations (e.g., no dark red, light green, etc.).
● Make the line spacing of first two paragraph into double line spacing
● I need to include the experiment results from "~/Documents/awesome-desktop/expe-results.xlsx" into the currently writing report. Specifically, extract the results of GPT-4 and insert a table into the "Main Results" section of my report.
● I click in terminal: terminal->132x43 to change terminal size but after each reboot terminal size is set to default setting and I have to change it again. Help me set it permanently
● Please help me install an extension in VS Code from a local VSIX file "/home/user/test.vsix".
ScreenSpot Pro
https://huggingface.co/datasets/likaixin/ScreenSpot-Pro
● High-resolution
● Grounding (1 action)
ScreenSpot Pro
https://likaixin2000.github.io/papers/ScreenSpot_Pro.pdf
ScreenSpot Pro
https://gui-agent.github.io/grounding-leaderboard/
Q&A

"Computer Use Agents: From SFT to Classic RL", Maksym Shamrai

Editor's Notes

  • #3 First I will explain why this particular topic was chosen, then we will look at how such tools can work using an example: face recognition and media search. We will summarize it all and answer questions.