Prerequisites
- Install `poetry` (see Poetry docs).
- Install `surfkit` (see Quickstart).
- Set up your local or cloud environment (see Configuration).
- Install Tesseract on your machine (see Tesseract docs).
Creating an Agent
Creating a dummy agent that follows the SurfKit protocol is super easy.
How does it work?
Code
Let’s briefly look inside the repo. There are a bunch of files there, but the most critical files are the following:

`agent.yaml` is a configuration file; it contains a few self-explanatory sections:
- you will need to change the Docker image repo later, when you’re ready to publish the agent;
- you may also want to change the icon;
- by default, your agent will run locally;
- this agent is designed to work with a “desktop” device, as are the other GUI-driven agents that we build.
`server.py` is just a utility module used to host the server with the agent process.

`agent.py` is the main class that implements the logic of the agent:
- at the beginning of the task execution, we explain to the MLLM/LLM what this agent is expected to do;
- then we enter the loop:
  - given the task and the history of the chat, as well as the state of the desktop, we ask an MLLM/LLM to give us the next action and the reason for it;
  - the next action is returned by the MLLM/LLM as JSON, which is checked against the available device and executed on it;
  - we exit the loop when either the MLLM/LLM returns an action marked as “result” (which means that it thinks the task is solved), or the maximum number of iterations is reached (30 by default).
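To make that flow concrete, here is a minimal sketch of the loop. This is not the actual `agent.py` code: the `device` and `llm` objects and their methods are hypothetical stand-ins.

```python
import json

def solve_task(task: str, device, llm, max_steps: int = 30):
    """Minimal sketch of the agent loop; `device` and `llm` are hypothetical stand-ins."""
    history = [f"You are an agent controlling a desktop. Your task: {task}"]

    for _ in range(max_steps):
        screenshot = device.take_screenshot()        # observe the current desktop state
        reply = llm.chat(history, image=screenshot)  # ask for the next action and the reason
        action = json.loads(reply)                   # e.g. {"name": "click", "parameters": {...}}

        if action["name"] == "result":               # the model believes the task is solved
            return action["parameters"]

        device.use(action["name"], **action["parameters"])  # execute the action on the device
        history.append(reply)

    raise RuntimeError("Maximum number of iterations reached")
```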
Architecture
This type of agent works with a Desktop device. The Desktop device has an interface that allows it to programmatically control a VM (local or in the cloud) via mouse and keyboard. You’ve just created one above with `surfkit create device`.

When the agent works to solve a given task, it uses the device. The device implements the `Tool` interface and therefore has a schema of the actions that the agent can take at any given time. We get this schema in JSON format, ask an MLLM/LLM to return the next action in that format, and then, on each step of solving the task, we have an action that can be passed back to the device to be executed.

For example, the action returned by the MLLM/LLM might look like this:
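Here is an illustrative payload; the exact field names come from the device’s action schema, so they may differ:

```json
{
  "reason": "The browser icon is in the taskbar, so I will click it to open the browser.",
  "action": {
    "name": "click",
    "parameters": { "x": 400, "y": 350 }
  }
}
```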
Problem
And this is a big problem: at the time of this tutorial (June 2024, “gpt-4o” was released not so long ago), all frontier MLLMs are horrible at identifying the coordinates of an object on a screenshot. They can reason quite well about what should be clicked or typed to achieve their goal, but they can’t return correct coordinates for it. So it’s time to make this agent better with some additional tricks. Our goal is to help them convert an idea of what should be clicked into actual, correct screen coordinates. To do that, we need to expand the toolset of our device and add a semantic layer to it, so that instead of “click on (400, 350)” our agent will return something like “click on the big ‘Search’ button at the bottom of the screen”. From the code point of view, we’ll do the following:
- We’ll introduce the `SemanticDesktop` class, which is essentially a wrapper around the `Desktop` that we already have, and it inherits all the actions that the `Desktop` provides.
- We’ll then update the `Agent` class to use the `SemanticDesktop` alongside the actual `Desktop`: we want the actions to be taken from and by the `SemanticDesktop` and translated to low-level `Desktop` operations when needed; we also want to keep using the screenshotting and mouse-clicking abilities of the `Desktop`.
- After that, we’ll introduce a new action on the `SemanticDesktop`: `click_object`.
Adding SemanticDesktop
You can find the code for this step of the tutorial here.
First, we create the `SemanticDesktop` class. It will inherit from `Tool`, which allows us to pass it to the MLLM. See the full code for `tool.py` here.

The most interesting part of this class is that we add a new method, `click_object`:
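Here is a rough sketch of the idea. In the real repo the class inherits from `Tool`; the method and attribute names below are illustrative assumptions rather than the exact code.

```python
class SemanticDesktop:
    """Sketch only: the real class inherits from Tool and wraps the Desktop device."""

    def __init__(self, desktop):
        self.desktop = desktop  # the low-level Desktop device we delegate to

    def click_object(self, description: str, type: str = "single") -> None:
        """Click an object described in plain language, e.g. "the big 'Search' button"."""
        # First, deliberately dumb version: ignore the description and click the
        # middle of the screen. The grid and OCR steps below make this smarter.
        width, height = 1280, 800  # assumed screen size, just for the sketch
        self.desktop.click(x=width // 2, y=height // 2)
```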
We need to update the `Agent` class too. See the full code for `agent.py` here.

Note that we replace some of the usages of the `Desktop` device with the `SemanticDesktop` device, but not all of them. The best way to explain it is that we get observations from the `Desktop` (a screenshot and mouse coordinates), but we run the actions of and by the `SemanticDesktop`.
We also remove some actions that we don’t want our agent to know about.

However, our `click_object` function is still pretty dumb. So let’s fix that now.
Adding Grid
You can find the code for this step of the tutorial here.
You’ll need to add the fonts that you can see in the repository.

We put the grid-drawing code, which overlays a grid of numbered dots on the screenshot, into `image.py`. See the full code for `image.py` here.
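As a sketch of what such a helper might look like, assuming Pillow and a TTF font shipped in the repo (the function names, grid spacing, and font path are assumptions):

```python
from PIL import Image, ImageDraw, ImageFont

GRID_STEP = 100  # assumed distance between dots, in pixels

def draw_numbered_grid(screenshot: Image.Image, font_path: str = "fonts/arial.ttf") -> Image.Image:
    """Overlay a grid of numbered dots on a copy of the screenshot."""
    img = screenshot.copy()
    draw = ImageDraw.Draw(img)
    font = ImageFont.truetype(font_path, 16)

    number = 0
    for y in range(GRID_STEP, img.height, GRID_STEP):
        for x in range(GRID_STEP, img.width, GRID_STEP):
            draw.ellipse((x - 3, y - 3, x + 3, y + 3), fill="red")         # the dot
            draw.text((x + 5, y - 8), str(number), fill="red", font=font)  # its number
            number += 1
    return img

def grid_number_to_coords(number: int, width: int) -> tuple[int, int]:
    """Convert a chosen dot number back to screen coordinates."""
    cols = (width - 1) // GRID_STEP  # dots per row
    row, col = divmod(number, cols)
    return (col + 1) * GRID_STEP, (row + 1) * GRID_STEP
```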
Now, we can update the `click_object` function to give it some more power and perception:
- We generate the image with the grid, same as shown above.
- We craft the prompt to instruct our MLLM to return to us exactly what we need: the number of the closest dot.
- We run the prompt and get our result.
- We convert the number back to screen coordinates.
- Along the way, we record everything in the “debug” channel of our agent, so that you can see in the UI what exactly is going on.
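Putting those steps together, the grid-based `click_object` might look roughly like this. It replaces the naive version from the earlier sketch; `send_image_to_llm` and `_record_debug` are hypothetical stand-ins for the real MLLM call and debug-channel logging.

```python
def click_object(self, description: str, type: str = "single") -> None:
    """Click an object described in plain language, using the numbered grid."""
    screenshot = self.desktop.take_screenshot()    # current desktop state as a PIL image
    grid_image = draw_numbered_grid(screenshot)    # overlay the numbered dots

    prompt = (
        "You see a screenshot with a grid of numbered red dots. "
        f"Return only the number of the dot closest to: {description}"
    )
    reply = send_image_to_llm(prompt, grid_image)  # hypothetical MLLM call
    number = int(reply.strip())

    x, y = grid_number_to_coords(number, screenshot.width)  # dot number -> screen coordinates
    self._record_debug(grid_image, f"Clicking dot {number} at ({x}, {y})")  # debug channel
    self.desktop.click(x=x, y=y)
```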
See the full code for `tool.py` here.
When you run the agent now, you can see the grid images it generates in the debug tab. The MLLM picks the correct number pretty reliably. This method is obviously more intelligent than picking the middle of the screen. However, there is a good chance the bot misses the correct spot, because the element we’re interested in is not necessarily right under the dot.
To address this issue, we add a new capability, zooming in. We zoom in and scale up the part of the screenshot surrounding the chosen dot.
You can see the implementation in the SurfSlicer agent.
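A minimal sketch of that zoom step, assuming a PIL screenshot and the coordinates of the chosen dot (the crop radius and scale factor are arbitrary here):

```python
from PIL import Image

def zoom_around(screenshot: Image.Image, x: int, y: int,
                radius: int = 150, scale: int = 3) -> Image.Image:
    """Crop the region around (x, y) and scale it up so the MLLM sees more detail."""
    left, top = max(0, x - radius), max(0, y - radius)
    right, bottom = min(screenshot.width, x + radius), min(screenshot.height, y + radius)
    region = screenshot.crop((left, top, right, bottom))
    return region.resize((region.width * scale, region.height * scale), Image.LANCZOS)
```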
Adding Tesseract
You can find the code for this step of the tutorial here.
Using OCR to locate on-screen text has two big advantages:
- Finding text with a bounding box using Tesseract is exceptionally fast in comparison to OpenAI API calls: you get the result in a fraction of a second.
- The bounding box is very accurate: we can safely click in the middle of it and be sure that we hit the right object.

First, we add `pytesseract` as a dependency. Then we write helper functions to call Tesseract and to find bounding boxes for a given text. See the code for `ocr.py` here.
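A sketch of what such a helper might do with `pytesseract.image_to_data` (the function name and the exact matching logic are assumptions, not the repo’s code):

```python
import pytesseract
from PIL import Image

def find_text_center(screenshot: Image.Image, text: str):
    """Return the center (x, y) of the first OCR word matching `text`, or None."""
    data = pytesseract.image_to_data(screenshot, output_type=pytesseract.Output.DICT)
    target = text.strip().lower()

    for i, word in enumerate(data["text"]):
        if word.strip().lower() == target:
            left, top = data["left"][i], data["top"][i]
            width, height = data["width"][i], data["height"][i]
            return left + width // 2, top + height // 2  # aim for the middle of the box
    return None
```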
When we have this, we update `click_object`. We move the grid-related logic to a separate method, add a similar one with the OCR-related logic, and update the main action method like this:
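In outline, the updated method tries OCR first and falls back to the grid. This is a sketch only; `_locate_with_ocr` and `_locate_with_grid` are assumed names for the two helper methods described above.

```python
def click_object(self, description: str, type: str = "single") -> None:
    """Click an object described in plain language: OCR first, grid as a fallback."""
    screenshot = self.desktop.take_screenshot()

    # If the description matches on-screen text, Tesseract finds its bounding box
    # in a fraction of a second, and we click the middle of that box.
    coords = self._locate_with_ocr(screenshot, description)

    # Otherwise (or if OCR found nothing), fall back to the numbered-grid approach.
    if coords is None:
        coords = self._locate_with_grid(screenshot, description)

    x, y = coords
    self.desktop.click(x=x, y=y)
```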
See the full code for `tool.py` here.
If you look closely at the debug channel now, you’ll see that our agent tries to use OCR whenever it makes sense; if this operation succeeds, it goes on with the next iteration, and if it doesn’t, it falls back to the grid approach.
What’s next?
Now it’s your turn! There are a lot of techniques that we’ve personally tried with different levels of success; to name a few:
- Locating elements on a page with Grounding DINO.
- Cutting the image into pieces and compositing them on a new image with numbers alongside the various pieces.
- Zooming into the grid 2-3 times with new numbers.
- Layering coordinates over a screenshot.
- Upscaling a screenshot with a GAN.
- OCR, as noted above, but with some tweaks.
- Many more…