
Stagehand: LLM-Driven Web Automation


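I recently came across an interesting project on GitHub Trending: Stagehand, an LLM-driven web automation tool. Its official site describes it as "Stagehand: A browser automation SDK built for developers and LLMs." Since it can drive a local browser, I gave it a quick try. Below are my first impressions.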

Environment setup

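Stagehand provides both TypeScript and Python interfaces. I am more familiar with Python, so I chose the latter. Installation is simple: just use pip. The package was missing from my default Tsinghua mirror, though, so I temporarily switched to the Aliyun mirror 😂.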

pip install stagehand -i https://mirrors.aliyun.com/pypi/simple/

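Next, start the browser with the remote debugging port open. You have to pass a dedicated user data directory here; the default one cannot be used, which the official documentation does not mention.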

# Windows
"C:\Program Files\Google\Chrome\Application\chrome.exe" --remote-debugging-port=9222 --user-data-dir="D:\Desktop\chrome-debug-profile"

# Linux
/usr/bin/google-chrome-stable --proxy-server="socks5://192.168.17.2:7897" --remote-debugging-port=9222 --user-data-dir=/tmp/chrome-debug-profile
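Before pointing Stagehand at the browser, it can help to confirm that the debugging port is actually open. The sketch below is not part of Stagehand. It assumes Chrome is listening on 127.0.0.1:9222 as in the commands above and queries the /json/version endpoint that Chrome exposes on that port. The helper names are made up for illustration.

```python
import json
from urllib.request import urlopen


def parse_version_payload(payload: str) -> dict:
    """Pick the browser name and WebSocket URL out of a /json/version reply."""
    info = json.loads(payload)
    return {
        "browser": info.get("Browser", ""),
        "ws_url": info.get("webSocketDebuggerUrl", ""),
    }


def fetch_cdp_version(base_url: str = "http://127.0.0.1:9222", timeout: float = 3.0) -> dict:
    """Query the debugging port; raises OSError if nothing is listening there."""
    with urlopen(f"{base_url}/json/version", timeout=timeout) as resp:
        return parse_version_payload(resp.read().decode("utf-8"))
```

Calling fetch_cdp_version() while Chrome is running should return the browser version and a ws:// URL. An OSError usually means the port is closed or Chrome was started without --user-data-dir.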

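Stagehand uses OpenAI by default. I originally wanted to switch to DeepSeek, but DeepSeek supports only the json_object response format, not json_schema, which Stagehand relies on. So although the official documentation lists DeepSeek as compatible, in my testing it was not.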
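To make the difference concrete, here is a minimal sketch of the two response_format payloads in an OpenAI-style chat request. The json_schema shape mirrors the request Stagehand sends (shown in the log later in this post). The schema itself and the helper names are illustrative assumptions, not Stagehand code.

```python
import json


def json_schema_format(name: str, schema: dict) -> dict:
    """Strict structured output: the API enforces the given JSON Schema."""
    return {
        "type": "json_schema",
        "json_schema": {"name": name, "schema": schema, "strict": True},
    }


def json_object_format() -> dict:
    """Plain JSON mode: the reply is valid JSON, but no schema is enforced."""
    return {"type": "json_object"}


def schema_hint(schema: dict) -> str:
    """With json_object, the expected shape has to be described in the prompt instead."""
    return "Reply with a JSON object matching this schema:\n" + json.dumps(schema, ensure_ascii=False)


# Illustrative schema only; a real one would come from the extraction model.
HOUSE_SCHEMA = {
    "type": "object",
    "properties": {"title": {"type": "string"}, "total_price": {"type": "string"}},
    "required": ["title", "total_price"],
    "additionalProperties": False,
}

strict_request = json_schema_format("House", HOUSE_SCHEMA)
loose_request = json_object_format()
print(strict_request["type"], loose_request["type"])
```

A client that always sends the first form, as Stagehand appears to, will fail against a backend that accepts only the second.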

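To use OpenAI models, you can register with an LLM aggregation service that includes OpenAI and point the base URL at it. I use DMXAPI (described on its site as a Chinese multimodal LLM API aggregation platform). When running the code, set the key and base URL through environment variables.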

# Use your own key
OPENAI_API_KEY=sk-YourToken

# Either variable name works: OPENAI_API_BASE or OPENAI_BASE_URL
OPENAI_API_BASE=https://www.dmxapi.cn/v1/
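If you prefer to check these settings from Python before launching anything, a small helper like the one below can fail fast when the key is missing. It is only a sketch: the function name is hypothetical, and the order in which the two base-URL variables are consulted is my assumption, not documented behavior.

```python
import os
from typing import Mapping, Optional


def resolve_openai_settings(env: Optional[Mapping[str, str]] = None) -> dict:
    """Read the API key and optional base URL from environment-style variables."""
    env = os.environ if env is None else env
    key = env.get("OPENAI_API_KEY")
    if not key:
        raise RuntimeError("OPENAI_API_KEY is not set")
    # Assumed precedence: OPENAI_API_BASE first, then OPENAI_BASE_URL.
    base = env.get("OPENAI_API_BASE") or env.get("OPENAI_BASE_URL")
    return {"api_key": key, "base_url": base}
```

Passing a plain dict instead of os.environ makes the helper easy to try without touching the real environment.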

Demo

I built a simple demo with Stagehand (project link). It looks like this:

[Animation: recording of the demo run]

The code is short, so I'll paste it here:

import asyncio

import litellm  # used only by the optional debug switch in main()
from pydantic import BaseModel
from stagehand import Stagehand, StagehandConfig


class House(BaseModel):
    """One listing as extracted from the results page."""
    title: str
    area: str
    total_price: str
    unit_price: str


class Houses(BaseModel):
    """Wrapper schema passed to page.extract()."""
    list_of_houses: list[House]


async def main():
    # litellm._turn_on_debug()  # uncomment for verbose litellm debug logging
    sh = Stagehand(StagehandConfig(
        model_name="openai/gpt-4.1-mini",
        local_browser_launch_options={
            # Attach to the Chrome instance started with --remote-debugging-port=9222
            "cdp_url": "http://127.0.0.1:9222",
        }
    ), env="LOCAL")

    try:
        await sh.init()
        page = sh.page
        await page.goto("https://esf.fang.com/")
        await page.act("fill search input with 利泽西园")
        await page.act("click 搜索 button")
        houses = await page.extract("extract all house info, include title,area,total_price,unit_price", schema=Houses)
        for house in houses.list_of_houses:
            print(house)
    finally:
        # Always release the session, even if a step above fails
        await sh.close()


if __name__ == '__main__':
    asyncio.run(main())

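The task is simple: open the page, run a search, then extract the content. As the animation above shows, the extraction completed successfully. The whole process needs no manual HTML parsing and no XPath or CSS selectors.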

Note that Stagehand must be used from within async functions, possibly because its native implementation is in TypeScript.

How it works

image

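When the user issues an instruction, such as clicking the "搜索" (Search) button, Stagehand observe obtains the page's accessibility tree from the browser and sends the instruction together with the tree to the LLM. The LLM analyzes both and returns a concrete action. Stagehand observe passes that action to Stagehand act, which executes it in the browser through Playwright.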
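The flow just described can be sketched in a few lines. This is a toy model under stated assumptions, not Stagehand's implementation: the LLM is replaced by a hypothetical fake_llm that always returns the same answer, the executor is a plain callback standing in for Playwright, and the tree is a two-line excerpt.

```python
import json
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Action:
    element_id: int
    method: str
    arguments: List[str]


def observe(instruction: str, tree: str, llm: Callable[[str], str]) -> List[Action]:
    """Send instruction plus accessibility tree to the model and parse its reply."""
    prompt = f"instruction: {instruction}\nAccessibility Tree: {tree}"
    reply = json.loads(llm(prompt))
    return [Action(e["element_id"], e["method"], e["arguments"]) for e in reply["elements"]]


def act(instruction: str, tree: str, llm: Callable[[str], str],
        execute: Callable[[Action], None]) -> bool:
    """Run the first suggested action; return False when the model found nothing."""
    actions = observe(instruction, tree, llm)
    if not actions:
        return False
    execute(actions[0])
    return True


def fake_llm(prompt: str) -> str:
    """Hypothetical stand-in for the real model call; always picks element 6."""
    return json.dumps({"elements": [{"element_id": 6, "method": "click", "arguments": []}]})


executed: List[Action] = []
TREE = "[10] textbox: 请输入小区名称…\n[6] button: 搜 索"
ok = act("click 搜索 button", TREE, fake_llm, executed.append)
print(ok, executed)
```

The point of the split is that observe only decides and act only executes, so the decision can be inspected before anything touches the page.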

What Stagehand observe sends to the LLM looks like this (the accessibility tree is long, so I trimmed it):

{'model': 'gpt-4.1-mini', 'messages': [{'role': 'system', 'content': 'You are helping the user automate the browser by finding elements based on what the user wants to observe in the page. You will be given: 1. an instruction of elements to observe 2. a hierarchical accessibility tree showing the semantic structure of the page. The tree is a hybrid of the DOM and the accessibility tree. Return an array of elements that match the instruction if they exist, otherwise return an empty array. Whenever suggesting actions, use supported playwright locator methods or preferably one of the following supported actions: scrollIntoView, scrollTo, scroll, mouse.wheel, fill, type, press, click, nextChunk, prevChunk, selectOptionFromDropdown'}, {'role': 'user', 'content': "instruction: Find the most relevant element to perform an action on given the following action: click 搜索 button.\nProvide an action for this element such as scrollIntoView, scrollTo, scroll, mouse.wheel, fill, type, press, click, nextChunk, prevChunk, selectOptionFromDropdown, or any other playwright locator method. Remember that to users, buttons and links look the same in most cases.\nIf the action is completely unrelated to a potential action to be taken on the page, return an empty array.\nONLY return one action. If multiple actions are relevant, return the most relevant one.\nIf the user is asking to scroll to a position on the page, e.g., 'halfway' or 0.75, etc, you must return the argument formatted as the correct percentage, e.g., '50%' or '75%', etc.\nIf the user is asking to scroll to the next chunk/previous chunk, choose the nextChunk/prevChunk method. No arguments are required here.\nIf the action implies a key press, e.g., 'press enter', 'press a', 'press space', etc., always choose the press method with the appropriate key as argument — e.g. 'a', 'Enter', 'Space'. Do not choose a click action on an on-screen keyboard. 
Capitalize the first character like 'Enter', 'Tab', 'Escape' only for special keys.\nIf the action implies choosing an option from a dropdown, AND the corresponding element is a 'select' element, choose the selectOptionFromDropdown method. The argument should be the text of the option to select.\nIf the action implies choosing an option from a dropdown, and the corresponding element is NOT a 'select' element, choose the click method.\nAccessibility Tree: [1] RootWebArea: 【北京二手房|北京二手房出售】 - 北京房天下\n  [141] scrollable\n    [301] body\n      [8] div\n        [306] link\n        [307] div\n          [308] link: 北京\n          [3] image\n        [354] link: 首页\n        [356] link: 新房\n        [382] link: 二手房\n        [410] link: 租房\n        [429] link: 查房价\n        [431] link: 探房笔记\n        [433] link: 装修家居\n        [452] link: 商铺写字楼\n        [472] link: 海外房产\n        [488] link: 资讯\n        [502] link: 直播看房\n        [504] link: 地产数据\n        [4337] StaticText: 更多\n        [563] div\n          [565] link: 更多服务\n          [571] link: 商家中心\n          [576] link: 登录\n      [131] div\n        [4363] StaticText: 下载房天下APP\n        [602] list\n          [603] listitem\n            [604] link: 在售\n          [605] listitem\n            [606] link: 特价房\n          [607] listitem\n            [608] link: 成交\n          [609] listitem\n            [610] link: 找别墅\n          [611] listitem\n            [612] link: 排行榜\n          [613] listitem\n            [614] link: 地图找房\n          [615] listitem\n            [616] link: 法拍房\n          [617] listitem\n            [618] link: 小区\n        [133] div\n          [619] link: 我要卖房\n          [9] form\n            [620] list\n              [621] listitem\n                [10] textbox: 请输入小区名称…\n                  [6791] StaticText: 利泽西园\n                [6] button: 搜 索\n            [634] div\n              [635] paragraph\n                [4376] StaticText: 输入中文/拼音/拼音首字母或用上下键选择\n              [636] list\n                [637] listitem\n         
         [4377] StaticText: 最近搜过\n                [6796] listitem\n                  [6801] StaticText: 利泽西园\n      [644] div\n        [648] div\n          [649] link: 房天下\n          [4380] StaticText: >\n          [651] link: 北京二手房\n        [652] div\n          [653] div\n            [654] paragraph\n              [4383] StaticText: 区域\n            [655] paragraph\n              [4384] StaticText: 地铁\n            [656] paragraph\n              [657] link: 地图\n          [658] div\n            [660] list\n              [661] listitem\n                [4387] StaticText: 区域\n                [663] list\n                  [4227] Iframe\n"}], 'temperature': 0.1, 'response_format': {'type': 'json_schema', 'json_schema': {'schema': {'$defs': {'ObserveElementSchema': {'properties': {'element_id': {'title': 'Element Id', 'type': 'integer'}, 'description': {'description': 'A description of the observed element and its purpose.', 'title': 'Description', 'type': 'string'}, 'method': {'title': 'Method', 'type': 'string'}, 'arguments': {'items': {'type': 'string'}, 'title': 'Arguments', 'type': 'array'}}, 'required': ['element_id', 'description', 'method', 'arguments'], 'title': 'ObserveElementSchema', 'type': 'object', 'additionalProperties': False}}, 'properties': {'elements': {'items': {'$ref': '#/$defs/ObserveElementSchema'}, 'title': 'Elements', 'type': 'array'}}, 'required': ['elements'], 'title': 'ObserveInferenceSchema', 'type': 'object', 'additionalProperties': False}, 'name': 'ObserveInferenceSchema', 'strict': True}}, 'extra_body': {}}

The LLM returns the following:

{
  "id": "chatcmpl-Bx0qE8zjrzPfjybG7uURR435u4LZ6",
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "logprobs": null,
      "message": {
        "content": "{\"elements\":[{\"arguments\":[\"click\"],\"description\":\"Button labeled '搜 索' to perform the search action when clicked.\",\"element_id\":6,\"method\":\"click\"}]}",
        "refusal": null,
        "role": "assistant",
        "annotations": [],
        "audio": null,
        "function_call": null,
        "tool_calls": null
      }
    }
  ],
  "created": 1753404286,
  "model": "gpt-4.1-mini-2025-04-14",
  "object": "chat.completion",
  "service_tier": "default",
  "system_fingerprint": "fp_6f2eabb9a5",
  "usage": {
    "completion_tokens": 36,
    "prompt_tokens": 33971,
    "total_tokens": 34007,
    "completion_tokens_details": {
      "accepted_prediction_tokens": 0,
      "audio_tokens": 0,
      "reasoning_tokens": 0,
      "rejected_prediction_tokens": 0
    },
    "prompt_tokens_details": {
      "audio_tokens": 0,
      "cached_tokens": 0
    }
  }
}
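Two things in this response are worth pulling out: the chosen element sits inside message.content as a JSON string that needs a second decode, and the usage block shows that almost all tokens went into the prompt (33,971 prompt tokens against 36 completion tokens). The sketch below works on a trimmed copy of the response above. The helper names are mine, not Stagehand's.

```python
import json

# Trimmed copy of the chat completion shown above; only the fields used here are kept.
RESPONSE = {
    "choices": [
        {
            "message": {
                "content": "{\"elements\":[{\"arguments\":[\"click\"],\"description\":\"Button labeled '搜 索' to perform the search action when clicked.\",\"element_id\":6,\"method\":\"click\"}]}"
            }
        }
    ],
    "usage": {"completion_tokens": 36, "prompt_tokens": 33971, "total_tokens": 34007},
}


def first_element(response: dict) -> dict:
    """Decode message.content (itself a JSON string) and return the first element."""
    content = response["choices"][0]["message"]["content"]
    elements = json.loads(content)["elements"]
    if not elements:
        raise ValueError("the model returned no matching element")
    return elements[0]


def prompt_share(response: dict) -> float:
    """Fraction of all tokens in this call that were spent on the prompt."""
    usage = response["usage"]
    return usage["prompt_tokens"] / usage["total_tokens"]


element = first_element(RESPONSE)
print(element["element_id"], element["method"])
print(f"{prompt_share(RESPONSE):.1%} of tokens were prompt tokens")
```

The element_id of 6 matches the "[6] button: 搜 索" node in the accessibility tree that was sent, which is how the action is mapped back to a concrete element.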

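Because Stagehand depends on the accessibility tree, and many web pages are built with little attention to accessibility, especially some sites in China, the LLM sometimes cannot return a suitable result. The Stagehand documentation seems to suggest that it can also send screenshots or the DOM tree to the LLM, and that the LLM can return coordinates, but I found neither a way to enable this nor any example. Perhaps it has to be requested explicitly in the instruction?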

Summary

All in all, Stagehand's principle is fairly simple, and it is not complicated to use. But with desktop browsers in decline, what is Stagehand's use case? Perhaps RPA?

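Still, Stagehand offers an LLM-based approach to automation: feed the interface to the LLM, let the LLM make the decision and output an action, then execute that action. In this process the LLM acts somewhat like a brain, with eyes and other sensory organs supplying input from the outside world and hands and feet carrying out the actions.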