A detailed guide from zero to GUI surfer with multiple approaches
To follow along, you will need `poetry` (see Poetry docs) and `surfkit` (see Quickstart).

Inside the agent project, `agent.yaml` is a configuration file; it contains a few self-explanatory sections.
`server.py` is just a utility used to host the server with the agent process; `agent.py` holds the main class that implements the logic of the agent.
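Conceptually, the agent boils down to something like the sketch below. This is not the generated template code; the class, method names, and signatures here are illustrative only.

```python
# Bare-bones sketch of what agent.py does conceptually (not the real template).
class Agent:
    """Solves a task by repeatedly observing the device and asking an MLLM what to do next."""

    def __init__(self, device, llm):
        self.device = device  # the device the agent controls (created below with surfkit)
        self.llm = llm        # any multimodal LLM client

    def solve_task(self, task: str, max_steps: int = 20) -> None:
        for _ in range(max_steps):
            screenshot = self.device.take_screenshot()     # assumed method name
            action = self.choose_action(task, screenshot)  # ask the MLLM for the next action
            if action is None:                             # the model decides the task is done
                break
            self.execute(action)                           # pass the action back to the device

    def choose_action(self, task, screenshot):
        """Ask the MLLM for the next action; sketched in the next section."""
        raise NotImplementedError

    def execute(self, action):
        """Translate the model's JSON action into a device call; sketched in the next section."""
        raise NotImplementedError
```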
To create a device, we run `surfkit create device`.
When the agent works on a given task, it uses the device. The device implements the `Tool` interface and therefore has a schema of the actions that the agent can take at any given time. We get this schema in JSON format and ask an MLLM/LLM to return the next action in that format; on each step of solving a task, we then have an action that can be passed back to the device to be executed.
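A minimal sketch of that exchange, written as standalone functions for clarity; `json_schema()` on the device and `llm.chat()` are assumptions about the respective APIs, not confirmed signatures:

```python
import json

def choose_action(llm, device, task: str, screenshot_b64: str) -> dict | None:
    """Ask the MLLM for the next action, constrained to the device's action schema."""
    schema = device.json_schema()  # the JSON schema of actions exposed by the Tool interface
    prompt = (
        f"You are solving the task: {task}\n"
        f"These are the actions you can take:\n{json.dumps(schema)}\n"
        'Reply with one JSON object of the form {"name": ..., "parameters": {...}}, '
        "or null when the task is done."
    )
    reply = llm.chat(prompt, image=screenshot_b64)  # hypothetical MLLM client call
    return json.loads(reply)

def execute(device, action: dict) -> None:
    """Pass the chosen action back to the device to be executed."""
    # Assumes each action in the schema maps to a method of the same name on the device.
    getattr(device, action["name"])(**action.get("parameters", {}))
```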
For example, the action returned by the MLLM/LLM might look something like this (the exact shape is dictated by the device's schema):
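```python
# Illustrative only -- the real fields are dictated by the device's action schema.
action = {
    "name": "click",
    "parameters": {"x": 500, "y": 300, "button": "left"},
}
```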
The plan is to create a `SemanticDesktop` class, which is essentially a wrapper around the `Desktop` that we already have; it inherits all the actions that the `Desktop` provides. We then update the `Agent` class to use the `SemanticDesktop` alongside the actual `Desktop`: we want the actions to be taken from and by the `SemanticDesktop` and translated to the low-level `Desktop` operations when needed, while we keep using the screenshotting and mouse-clicking abilities of the `Desktop`. Finally, we add a new action to the `SemanticDesktop`: `click_object`.

Let's start with the `SemanticDesktop`. It will inherit from `Tool`, which allows us to pass it to the MLLM.
See the full code for `tool.py` here.
The most interesting part of this class is that we add a new method, `click_object`.
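Here is a stripped-down sketch of the class. The import path, the `action` decorator, and the `Desktop` method names are assumptions; the linked `tool.py` is the source of truth. The first version of `click_object` is deliberately naive and just clicks the middle of the screen:

```python
# Sketch only -- see the linked tool.py for the real implementation.
from toolfuse import Tool, action  # assumed import path for the Tool base class


class SemanticDesktop(Tool):
    """A semantic wrapper around the low-level Desktop device."""

    def __init__(self, task, desktop):
        super().__init__()
        self.task = task        # the current task, handy for posting debug info
        self.desktop = desktop  # the real Desktop; low-level operations are delegated to it

    @action
    def click_object(self, description: str) -> None:
        """Click an object described in natural language, e.g. 'the search field'."""
        # Deliberately dumb first version: ignore the description and click the
        # middle of the screen.
        width, height = 1280, 800                        # assumed resolution of the virtual desktop
        self.desktop.click(x=width // 2, y=height // 2)  # assumed Desktop method name
```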
We need to update the `Agent` class too. See the full code for `agent.py` here.
Note that we replace some of the usages of the `Desktop` device with the `SemanticDesktop` device, but not all of them. The best way to explain it: we get observations from the `Desktop` (a screenshot and mouse coordinates), but we run the actions through the `SemanticDesktop`.
We also remove some actions that we don't need our agent to know about.
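There is more than one way to do this; one simple option (not necessarily what the linked `agent.py` does) is to filter the action schema before handing it to the model:

```python
# Hypothetical sketch: hide a few actions from the model by filtering the schema.
EXCLUDED_ACTIONS = {"open_url", "press_key"}  # purely illustrative action names

def visible_actions(device) -> list[dict]:
    """Return the device's action schema minus the actions the agent shouldn't use."""
    # Assumes json_schema() returns a list of dicts, each with a "name" field.
    return [a for a in device.json_schema() if a.get("name") not in EXCLUDED_ACTIONS]
```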
At this point, the `click_object` function is still pretty dumb. So let's fix that now.
To give the model an easier way to point at things on the screen, we draw a grid of numbered dots over the screenshot. See the full code for `image.py` here.
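The idea is to overlay evenly spaced, numbered dots on the screenshot so that the model can answer with a dot number instead of raw coordinates. A rough sketch of such a helper with PIL (the real `image.py` is linked above):

```python
from PIL import Image, ImageDraw

def draw_numbered_grid(img: Image.Image, step: int = 100) -> tuple[Image.Image, dict[int, tuple[int, int]]]:
    """Overlay numbered dots every `step` pixels and return the new image plus a
    mapping from dot number to (x, y) coordinates."""
    out = img.copy()
    draw = ImageDraw.Draw(out)
    coords: dict[int, tuple[int, int]] = {}
    n = 0
    for y in range(step // 2, out.height, step):
        for x in range(step // 2, out.width, step):
            n += 1
            coords[n] = (x, y)
            draw.ellipse((x - 4, y - 4, x + 4, y + 4), fill="red")
            draw.text((x + 6, y - 6), str(n), fill="red")
    return out, coords
```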
Now, we can update the `click_object` function to give it some more power and perception.
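A simplified sketch of the grid-based version, reusing the helper above (the MLLM call and the `Desktop` method names are assumptions; see the linked `tool.py` for the real logic):

```python
class SemanticDesktop(Tool):  # continuing the earlier sketch
    @action
    def click_object(self, description: str) -> None:
        """Click the on-screen object matching `description`."""
        screenshot = self.desktop.take_screenshot()       # assumed: returns a PIL image
        gridded, coords = draw_numbered_grid(screenshot)  # helper sketched above
        self.task.post_debug_image(gridded)               # hypothetical: surface the image in the debug tab

        answer = self.ask_mllm(                           # hypothetical MLLM helper
            gridded,
            f"Which numbered dot is closest to: {description}? Answer with the number only.",
        )
        x, y = coords[int(answer)]
        self.desktop.click(x=x, y=y)                      # assumed Desktop method name
```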
See the full code for `tool.py` here.
When you run the agent now, you can see the grid images it generates in the debug tab. The MLLM picks the correct number pretty reliably. This method is obviously more intelligent than picking the middle of the screen. However, there is still a good chance the bot misses the correct spot, because the element we're interested in may be right under the dot.
To address this issue, we add a new capability, zooming in. We zoom in and scale up the part of the screenshot surrounding the chosen dot.
You can see the implementation in the SurfSlicer agent.
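One possible way to implement the zoom step with PIL (illustrative; the SurfSlicer agent has the real version), so that the dot-picking step can be repeated on a much larger view of the area:

```python
from PIL import Image

def zoom_around(img: Image.Image, x: int, y: int, radius: int = 100, scale: int = 4) -> Image.Image:
    """Crop a square around (x, y) and scale it up so small elements become easy to see."""
    left = max(x - radius, 0)
    top = max(y - radius, 0)
    right = min(x + radius, img.width)
    bottom = min(y + radius, img.height)
    crop = img.crop((left, top, right, bottom))
    return crop.resize((crop.width * scale, crop.height * scale), Image.LANCZOS)
```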
The next improvement is OCR, for which we use `pytesseract`. This library allows us to run Tesseract and to find bounding boxes for a given text. See the code for `ocr.py` here.
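For illustration, a minimal helper along those lines (single-word matching only; the linked `ocr.py` is more complete):

```python
import pytesseract
from PIL import Image

def find_text_box(img: Image.Image, text: str) -> tuple[int, int, int, int] | None:
    """Return the (left, top, width, height) box of the first OCR word matching `text`,
    or None if Tesseract doesn't find it."""
    data = pytesseract.image_to_data(img, output_type=pytesseract.Output.DICT)
    target = text.strip().lower()
    for i, word in enumerate(data["text"]):
        if word.strip().lower() == target:
            return data["left"][i], data["top"][i], data["width"][i], data["height"][i]
    return None
```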
Once we have this, we update `click_object`. We move the grid-related logic into a separate method, add a similar one with the OCR-related logic, and update the main action method accordingly.
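Put together, the main action might look roughly like this (the `_click_via_grid` name and the OCR-gating logic are assumptions; see the linked `tool.py` below):

```python
class SemanticDesktop(Tool):  # continuing the earlier sketch
    @action
    def click_object(self, description: str) -> None:
        """Click the object matching `description`, preferring OCR when the target is textual."""
        screenshot = self.desktop.take_screenshot()   # assumed: returns a PIL image

        # Try OCR first (the real agent only does this when it decides OCR makes sense).
        box = find_text_box(screenshot, description)  # OCR helper sketched above
        if box is not None:
            left, top, width, height = box
            self.desktop.click(x=left + width // 2, y=top + height // 2)
            return

        # OCR found nothing useful -- fall back to the numbered-grid approach.
        self._click_via_grid(description, screenshot)  # the grid logic, moved to its own method
```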
See the full code for `tool.py` here.
If you look closely at the debug channel now, you'll see that our agent tries to use OCR whenever it makes sense; if that succeeds, it goes on with the next iteration, and if it doesn't, it falls back to the grid approach.