Building Your Agent from Scratch
A detailed guide from zero to GUI surfer with multiple approaches
In this guide, we’ll show you how to create your own agent using SurfKit, along with some of the techniques we came up with along the way to help your agent navigate GUIs and accomplish its goals.
Prerequisites
- Install poetry (see Poetry docs).
- Install surfkit (see Quickstart).
- Set up your local or cloud environment (see Configuration).
- Install Tesseract on your machine (see Tesseract docs).
Creating an Agent
Creating a dummy agent that follows the SurfKit protocol is super easy:
The last command will ask you to answer a few questions:
Feel free to leave the docker image repo empty for now and use a standard icon. When you run these commands, an Agent project will get initialized inside the folder you created. It contains all the components you need to give it a try:
If the browser tab with a VM desktop and an agent log opens and the agent starts solving the task, congratulations: you just created your first agent!
How does it work?
Code
Let’s briefly look inside the repo. There are a bunch of files there, but the most critical files are the following:
agent.yaml is a configuration file; it contains a few self-explanatory sections:
- you will need to change the docker image repo later when you’re ready to publish it;
- you may also want to change the icon;
- by default, your agent will run locally;
- this agent is designed to work with a “desktop” device, as are the other GUI-driven agents that we build.
server.py is just a utility class used to host the server with the agent process.
agent.py is the main class that implements the logic of the agent:
- at the beginning of the task execution, we explain to the MLLM/LLM what this agent is expected to do;
- then we enter the loop (see the sketch after this list):
  - given the task and the history of the chat, as well as the state of the desktop, we ask an MLLM/LLM to give us the next action and the reason for it;
  - the next action is returned by the MLLM/LLM as JSON, which is checked against the available device and executed on it;
  - we exit the loop when either the MLLM/LLM returns an action marked as “result” (which means it thinks the task is solved), or the maximum number of iterations is reached (30 by default).
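Conceptually, the loop boils down to something like the following sketch. This is a simplification, not the template’s actual code; the names device, ask_mllm, and the helper methods are illustrative stand-ins.

```python
# A simplified sketch of the agent loop; all names are illustrative stand-ins.
MAX_ITERATIONS = 30

def solve_task(task: str, device, ask_mllm) -> str | None:
    history = [f"You are an agent controlling a desktop GUI. Your task: {task}"]
    for _ in range(MAX_ITERATIONS):
        # Observe the current state of the desktop.
        screenshot_b64 = device.take_screenshot()

        # Ask the MLLM/LLM for the next action and the reason for it,
        # given the task, the chat history, and the device's action schema.
        action = ask_mllm(history, screenshot_b64, device.json_schema())

        # The model signals completion with an action marked as "result".
        if action["name"] == "result":
            return action["parameters"].get("value")

        # Otherwise, the action is checked against the device and executed on it.
        device.execute(action)
        history.append(f"Took action: {action}")
    return None
```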
Architecture
This type of agent works with a Desktop device. The Desktop device has an interface that allows it to control the VM (in the cloud or a local one) via a mouse and a keyboard programmatically. You’ve just created it above with surfkit create device.
When the agent works to solve a given task, it uses the device. The device implements the Tool interface and therefore has a schema of actions that the agent can take at any given time. We get this schema in JSON format and ask an MLLM/LLM to return the next action in that format, so on each step of solving a task we have an action that can be passed back to the device to be executed.
For example, the action returned by the MLLM/LLM might look like this:
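(A hedged illustration, shown here as a Python dict; the exact field names are defined by the device’s action schema, not by this guide.)

```python
# An illustrative action object; the real field names come from the device's schema.
action = {
    "name": "type_text",
    "parameters": {"text": "cool cat images"},
    "reason": "The search field is focused, so the next step is to type the query.",
}
```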
When this object gets returned to a device, the device can execute it: in this case, type the text “cool cat images” using the keyboard.
The actions include the low-level operations of a mouse and a keyboard, like moving the mouse, clicking on coordinates, typing letters, and sending key commands. They also include taking a screenshot and getting the current mouse coordinates, which helps an MLLM choose the next action toward completing the task.
The agent we just created works using these primitives: typing text and clicking the mouse. Cool, right? However, there is one little problem.
Problem
And it is a big one: at the time of writing this tutorial (June 2024, not long after “gpt-4o” was released), all frontier MLLMs are terrible at identifying the coordinates of an object on a screenshot. They can reason quite well about what should be clicked or typed to achieve their goal, but they can’t translate that into correct screen coordinates.
So it’s time to make this agent better with some additional tricks. Our goal is to help the model convert an idea of what should be clicked into actual, correct screen coordinates. To do that, we need to expand the toolset of our device and add a semantic layer to it, so that instead of “click on (400, 350)”, our agent will return something like “click on the big ‘Search’ button at the bottom of the screen”.
From the code point of view, we’ll do the following:
- We’ll introduce the SemanticDesktop class, which is essentially a wrapper around the Desktop that we already have, and it inherits all the actions that it provides.
- We’ll then update the Agent class to use the SemanticDesktop alongside the actual Desktop: we want the actions to be taken from and by the SemanticDesktop and translated to the low-level Desktop operations when needed; we also want to keep using the screenshotting and mouse-clicking abilities from the Desktop.
- After that, we’ll introduce a new action to the SemanticDesktop: click_object.
Adding SemanticDesktop
You can find the code for this step of the tutorial here.
First, we need to refine our SemanticDesktop. It will inherit from Tool, which allows us to pass it to the MLLM.
See the full code for tool.py here.
The most interesting part of this class is that we add a new method, click_object:
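As a rough sketch, the first version might look like this. The signature and the desktop helper calls are assumptions based on the description, not the repository’s exact code.

```python
def click_object(self, description: str, type: str = "single") -> None:
    """Click on an object on the screen, described in natural language."""
    # Naive placeholder: ignore the description and click the middle of the screen.
    info = self.desktop.info()  # assumed to expose the screen resolution
    x = info["screen_width"] // 2
    y = info["screen_height"] // 2
    self.desktop.move_mouse(x=x, y=y)
    self.desktop.click(button="left")
```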
As you can see, it is not very smart at the moment: it simply picks the coordinates of the middle of the screen. Don’t worry, we’ll work on it later!
Now that we have this class, we can update the Agent class too. See the full code for agent.py here.
Note that we replace some of the usages of the Desktop device with the SemanticDesktop device, but not all of them. The simplest way to put it: we get observations from the Desktop (a screenshot and mouse coordinates), but we run the actions through the SemanticDesktop.
We also remove some actions we don’t need our agent to know about:
If you run the agent now, you’ll notice in the console logs that the schema (the available actions) has changed, and the agent can now return a new kind of action:
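For example, something along these lines (field names are again illustrative):

```python
# An illustrative click_object action as the MLLM might now return it.
action = {
    "name": "click_object",
    "parameters": {
        "description": "the blue 'Search' button below the text field",
        "type": "single",
    },
    "reason": "Clicking Search will submit the query.",
}
```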
The only problem is that the implementation of this click_object function is still pretty dumb. So let’s fix that now.
Adding Grid
You can find the code for this step of the tutorial here. You’ll need to add the fonts that you can see in the repository.
There are many ways to assist an MLLM in picking the right location of the object on a screenshot. None of them are perfect (to the best of our knowledge at the moment of writing this tutorial), but combining a few in one agent can get you pretty high accuracy. Let’s start with something simple.
We call this approach “The Grid”. The idea is to overlay the screen with an NxN grid and put numbered dots at the corners of its cells.
Honestly, it’s easier to show than to explain:
If we desaturate the original screenshot and put this grid on top, we can ask an MLLM which dot is the closest one to the place the agent wants to click (for example, a search bar or a button). In order to do that, we’ll start a tiny thread with an MLLM (outside of the main thread) just to address this question.
We then simply convert the number that the MLLM returns back into screen coordinates.
First, we need to define a bunch of utility functions to generate this grid, merge it with the main image, and convert images to and from base64, since that’s the format we use to send images to gpt-4o. See the code for image.py here.
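Here is a condensed sketch of what such helpers could look like with Pillow. The function names, the dot styling, and the grid numbering are assumptions rather than the repository’s exact code.

```python
import base64
from io import BytesIO

from PIL import Image, ImageDraw

def create_grid_image(width: int, height: int, n: int = 6) -> Image.Image:
    """Draw numbered dots at the corners of an n x n grid on a transparent image."""
    img = Image.new("RGBA", (width, height), (0, 0, 0, 0))
    draw = ImageDraw.Draw(img)
    cell_w, cell_h = width / n, height / n
    num = 0
    for row in range(n + 1):
        for col in range(n + 1):
            x, y = int(col * cell_w), int(row * cell_h)
            draw.ellipse([x - 4, y - 4, x + 4, y + 4], fill="red")
            draw.text((x + 6, y + 6), str(num), fill="red")
            num += 1
    return img

def superimpose_images(screenshot: Image.Image, grid: Image.Image) -> Image.Image:
    """Desaturate the screenshot and put the grid overlay on top of it."""
    base = screenshot.convert("L").convert("RGBA")
    return Image.alpha_composite(base, grid)

def image_to_b64(img: Image.Image) -> str:
    """Encode an image as a base64 PNG string for the OpenAI API."""
    buffer = BytesIO()
    img.save(buffer, format="PNG")
    return base64.b64encode(buffer.getvalue()).decode("utf-8")

def b64_to_image(data: str) -> Image.Image:
    """Decode a base64 string back into a PIL image."""
    return Image.open(BytesIO(base64.b64decode(data)))
```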
Now, we can update the click_object function to give it some more power and perception:
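The sketch below shows what a grid-based version of this SemanticDesktop method could look like. It calls the OpenAI API directly instead of the repository’s thread helpers; the grid size, prompt wording, and desktop helper names are assumptions, and it reuses the image helpers sketched above.

```python
import json

from openai import OpenAI

GRID_N = 6  # assumed grid density; the real agent may use a different value

def click_object(self, description: str, type: str = "single") -> None:
    """Click on an object described in natural language, using the numbered grid."""
    info = self.desktop.info()  # assumed to expose the screen resolution
    width, height = info["screen_width"], info["screen_height"]

    # 1. Take a screenshot, desaturate it, and overlay the numbered grid.
    screenshot = b64_to_image(self.desktop.take_screenshot())
    grid = create_grid_image(width, height, GRID_N)
    composite = superimpose_images(screenshot, grid)
    # (The real agent also posts this image to the "debug" channel; omitted here.)

    # 2. Ask the MLLM which dot is closest to the object we want to click.
    prompt = (
        "The screenshot is covered with numbered red dots. "
        f"Return the number of the dot closest to: {description}. "
        'Answer as JSON, for example {"number": 17}.'
    )
    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4o",
        response_format={"type": "json_object"},
        messages=[{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/png;base64,{image_to_b64(composite)}"},
                },
            ],
        }],
    )
    number = json.loads(response.choices[0].message.content)["number"]

    # 3. Convert the dot number back to screen coordinates and click there.
    cell_w, cell_h = width / GRID_N, height / GRID_N
    col, row = number % (GRID_N + 1), number // (GRID_N + 1)
    self.desktop.move_mouse(x=int(col * cell_w), y=int(row * cell_h))
    self.desktop.click(button="left")
```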
It looks like a lot is going on here, but if you look closely, we’re just doing a few simple steps:
- We generate the image with the grid, same as shown above.
- We craft the prompt to instruct our MLLM to return to us exactly what we need: the number of the closest dot.
- We run the prompt and get our result.
- We convert the number back to screen coordinates.
- Along the way, we record intermediate results in the “debug” channel of our agent, so that you can see in the UI what exactly is going on.
You can find the full code for tool.py here.
When you run the agent now, you can see the grid images it generates in the debug tab. The MLLM picks the correct number pretty reliably. This method is obviously more intelligent than picking the middle of the screen. However, there is a good chance the bot misses the correct spot unless the element we’re interested in is right under the dot.
To address this issue, we add a new capability: zooming in. We crop and scale up the part of the screenshot surrounding the chosen dot.
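The core of that step is just a crop plus an upscale around the chosen point, roughly like this (the crop radius and scale factor are arbitrary illustrative choices):

```python
from PIL import Image

def zoom_in(screenshot: Image.Image, x: int, y: int, radius: int = 150, scale: int = 4) -> Image.Image:
    """Crop the region around (x, y) and scale it up so the MLLM can see fine detail."""
    left, top = max(0, x - radius), max(0, y - radius)
    right = min(screenshot.width, x + radius)
    bottom = min(screenshot.height, y + radius)
    region = screenshot.crop((left, top, right, bottom))
    return region.resize((region.width * scale, region.height * scale), Image.LANCZOS)
```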
You can see the implementation in the SurfSlicer agent.
Adding Tesseract
You can find the code for this step of the tutorial here.
As noted above, you can achieve the best results in your agent by combining many methods. One very simple but powerful idea is to use plain old-fashioned OCR to find text elements, whenever it makes sense, and click on them. But we use it with a twist: OCR returns not only the text but also its position on the screen.
In case there is no text to click on (because the object is an icon, for example) or the OCR engine doesn’t find the text we need (because it’s white text on a blue background, for example), we fall back to the Grid. But if we can find the text, it gives us two benefits:
- Finding text with a bounding box using Tesseract is exceptionally fast in comparison to OpenAI API calls: you get the result in a fraction of a second.
- The bounding box is very accurate: we can safely click in the middle of it and be sure that we hit the right object.
First of all, install pytesseract:
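Since the project is managed with Poetry, this most likely amounts to:

```bash
poetry add pytesseract
```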
Now we need another set of utility methods: to run Tesseract and to find bounding boxes for a given text. See the code for ocr.py here.
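Here is one possible shape for these helpers, built on pytesseract’s image_to_data output. The single-word matching is deliberately simplified compared to whatever the real ocr.py does.

```python
from typing import Optional, Tuple

import pytesseract
from PIL import Image

def find_text_bbox(screenshot: Image.Image, text: str) -> Optional[Tuple[int, int, int, int]]:
    """Return (left, top, width, height) of the first OCR word matching `text`, or None."""
    data = pytesseract.image_to_data(screenshot, output_type=pytesseract.Output.DICT)
    target = text.strip().lower()
    for i, word in enumerate(data["text"]):
        if word.strip().lower() == target and float(data["conf"][i]) > 0:
            return (data["left"][i], data["top"][i], data["width"][i], data["height"][i])
    return None

def bbox_center(bbox: Tuple[int, int, int, int]) -> Tuple[int, int]:
    """Middle point of a bounding box, which is where we want to click."""
    left, top, width, height = bbox
    return left + width // 2, top + height // 2
```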
Once we have this, we update click_object: we move the grid-related logic to a separate method, add a similar one with the OCR-related logic, and update the main action method like this:
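The resulting structure could look roughly like this. The method names are assumptions, _ask_mllm_for_clickable_text is a hypothetical helper, and the OCR functions come from the sketch above; the real tool.py is linked below.

```python
from typing import Optional, Tuple

class SemanticDesktop:  # inherits from Tool in the real code
    def click_object(self, description: str, type: str = "single") -> None:
        """Click an object described in natural language: try OCR first, fall back to the grid."""
        coords = self._click_coords_from_ocr(description)
        if coords is None:
            coords = self._click_coords_from_grid(description)
        x, y = coords
        self.desktop.move_mouse(x=x, y=y)
        self.desktop.click(button="left")

    def _click_coords_from_ocr(self, description: str) -> Optional[Tuple[int, int]]:
        """Ask the MLLM which visible text to click, then locate it with Tesseract."""
        screenshot = b64_to_image(self.desktop.take_screenshot())
        text = self._ask_mllm_for_clickable_text(description)  # hypothetical helper
        if not text:
            return None
        bbox = find_text_bbox(screenshot, text)  # from the ocr.py sketch above
        return bbox_center(bbox) if bbox else None

    def _click_coords_from_grid(self, description: str) -> Tuple[int, int]:
        """The grid-based logic from the previous section, moved into its own method."""
        raise NotImplementedError("see the grid-based sketch above")
```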
Grab the complete code for the final version of tool.py here.
If you look closely at the debug channel now, you’ll see that our agent tries to use OCR whenever it makes sense; if this succeeds, it moves on to the next iteration, and if it doesn’t, it falls back to the grid approach.
What’s next?
Now it’s your turn!
There are a lot of techniques that we’ve personally tried with different levels of success; to name a few:
- Locating elements on a page with Grounding Dino.
- Cutting the image into pieces and compositing them on a new image with numbers alongside the various pieces.
- Zooming into the Grid 2-3 times with new numbers.
- Layering coordinates over a screenshot.
- Upscaling a screenshot with a GAN.
- OCR, as noted above, but with some tweaks.
- Many more…
We strongly believe that the key to an agent’s success is mixing and matching a bunch of techniques, everything from classical ML to deep learning to the most bleeding-edge features of frontier models, spiced up with traditional programming. So get in there and try your own techniques! Get creative. Get tricky. Think of it as outthinking the model to get what you want.
We can’t wait to see what you come up with!