01
AppAgent会成为新的趋势吗?
近日,腾讯团队发表了一篇论文,并开源了一款基于大语言模型的,用于手机端执行复杂任务的多模态智能代理框架——AppAgent。该框架设计的初衷,是让 AI 智能体能自己操作手机,完成特定的任务,这一创新技术引起了广泛关注。
图片
项目地址:https://appagent-official.github.io/
开源地址:https://github.com/mnotgod96/AppAgent
论文地址:https://arxiv.org/abs/2312.13771
根据该论文的说明,大型语言模型(LLM)的最新进展促使人们创建了能够执行复杂任务的智能代理。
论文介绍了一种新颖的基于 LLM 的多模态代理框架,旨在操作智能手机应用程序。
AppAgent功能展示视频
AppAgent框架使代理能够通过简化的操作空间来操作智能手机应用,模仿人类的交互方式,如点击和轻扫。这种新颖的方法绕过了对系统后端访问的需求,从而扩大了其在各种应用程序中的适用性。代理功能的核心是其创新的学习方法。该代理通过自主探索或观察人类演示来学习导航和使用新应用。这个过程会生成一个知识库,代理在执行不同应用程序的复杂任务时可以参考这个知识库。
为了证明我们的代理的实用性,我们对 10 种不同应用中的 50 个任务进行了广泛测试,包括社交媒体、电子邮件、地图、购物和复杂的图像编辑工具。测试结果肯定了我们的代理在处理各种高级任务方面的能力。
图片
AppAgent 的能力展示。AppAgent 是一种由大型语言模型驱动的高级多模态代理,能够掌握并利用 ANY 应用程序执行复杂的任务。它通过直观的点击和轻扫手势与应用程序进行交互,模仿人类的动作。
图片
X平台(原Twitter)科技大V“AK”也转发了这篇论文,腾讯团队论文作者Chi Zhang转发表示感谢,并公布了这一项目的一个重大更新:AppAgent现在支持Android模拟器,建议没有Android设备的的研究人员可以尝试一下!
02
AppAgent简单试用
以下为TesterHome社区的小伙伴的试用总结:
每一次有自动化的新工具问世,就有一堆人会说:“啊呀呀呀,测试要失业了。” 几天前 AppAgent 出来时,嗅觉灵敏的自媒体就开始搬运,然后剑锋直指测试工程师,于是咱们又失业了一次。因为团队内部对应用自动化测试也有诉求,所以第一时间就在自己电脑上跑起来看看。
安装步骤
安装很简单,我用的是 Windows 11 64bit,Android 环境已经装好(其实只要装了 adb 就可以了),python 环境也安装好了(我的 python 环境用的是 conda,大家可以自行百度)。然后把代码下载下来,pip install -r requirements.txt 安装好依赖就可以用了。
英语好的,直接看:https://github.com/mnotgod96/AppAgent
运行前配置
其实就是因为它用了 openAI 的 Gpt-4-vision-preview 模型,所以咱们必须得有 openAI 的收费账户,然后拿到对应的 OPENAI_API_KEY。对应 AppAgent 的配置文件 config.yaml
...OPENAI_API_BASE: "https://api.openai.com/v1/chat/completions"OPENAI_API_KEY: "sk-xxxx" ?# Set the value to sk-xxx if you host the openai interface for open llm modelOPENAI_API_MODEL: "gpt-4-vision-preview" ?# The only OpenAI model by now that accepts visual input...
这些参数会在 model.py 里调用,
ask_gpt4v 方法:这个方法是和 openAI 交互的方法。
def ask_gpt4v(content):
? ? headers = {
? ? ? ? "Content-Type": "application/json",
? ? ? ? "Authorization": f"Bearer {configs['OPENAI_API_KEY']}"
? ? }
? ? payload = {
? ? ? ? "model": configs["OPENAI_API_MODEL"],
? ? ? ? "messages": [
? ? ? ? ? ? {
? ? ? ? ? ? ? ? "role": "system",
? ? ? ? ? ? ? ? "content": content
? ? ? ? ? ? }
? ? ? ? ],
? ? ? ? "temperature": configs["TEMPERATURE"],
? ? ? ? "max_tokens": configs["MAX_TOKENS"]
? ? }
? ? response = requests.post(configs["OPENAI_API_BASE"], headers=headers, json=payload)
? ? print_with_color("resp: ", response)
? ? if "error" not in response.json():
? ? ? ? usage = response.json()["usage"]
? ? ? ? prompt_tokens = usage["prompt_tokens"]
? ? ? ? completion_tokens = usage["completion_tokens"]
? ? ? ? print_with_color(f"Request cost is "
? ? ? ? ? ? ? ? ? ? ? ? ?f"${'{0:.2f}'.format(prompt_tokens / 1000 * 0.01 + completion_tokens / 1000 * 0.03)}",
? ? ? ? ? ? ? ? ? ? ? ? ?"yellow")
? ? return response.json()
从 openAI 回来的数据会在 parse_explore_rsp 里进行解析,我感觉这个方法是最重要的,它利用 openAI 的 Thought/Action/Action Input/Observation 机制,对结构化的返回进行解析。很多这种 agent 其实都是基于这个机制,openai 的这块做的比较好,每次都能按照这个模式来给你返回,所以目前来说插件体系啥的也只有 openai 的搞起来了(From 挺神)。这里也挺有意思的,本来我想 openAI 太贵,AppAgent 调用一次,0.02 刀的样子,想换成阿里云的通义千问,翻了一遍文档,似乎没有 Thought/Action/Action Input/Observation 机制,这个我不专业,有懂的同学可以指正下。
所以这里话又说回来了,你还得花这个 openAI 的钱,否则你得大改 APPAgent 的代码。
图片
运行
运行很简单,按官方文档,先 learn 再 run。我这里拿 CSDN 做例子,先在手机上把 CSDN 打开,然后执行 python .\learn.py
图片
这里我选 human demonstration,autonomous exploration 没时间跑。在终端输入2,回车,就会进入下一步:
What is the name of the target app?
CSDN
Warning! No module named 'sounddevice'Warning! No module named 'matplotlib'Warning! No module named 'keras'List of devices attached:['42954ffb']Device selected: 42954ffb
Screen resolution of 42954ffb: 1440x3216
这里会通过 adb 命令,把设备信息拿回来。APPAgent 里自己封装了 adb 命令,比如点击就是用的 adb shell input tap 坐标,比较原始(我一开始以为会封装个啥 Appium 之类的),在文件 and_controller.py 里。这些信息打印好之后,会立刻让你输入你后面动作的描述。这里我就写 “search for testerhome”,然后回车,就会弹出一个界面来。
Please state the goal of your following demo actions clearly, e.g. send a message to John
search for testerhome
(然后回车,就会弹出一个界面来,看英语说的,红色的是可以点击的,蓝色的是可以滚动的,看下面这个图。)All interactive elements on the screen are labeled with red and blue numeric tags. Elements labeled with red tags are clickable elements; elements labeled with blue tags are scrollable elements.
图片
我们鼠标聚焦到图片之后,按回车,图片就会消失,接着提示我们就可以根据可以点击的地方,来操作,比如这里搜索的按钮是25,那我就需要点击25这个元素。
Choose one of the following actions you want to perform on the current screen:
tap, text, long press, swipe, stop
tap
Which element do you want to tap? Choose a numeric tag from 1 to 83:
25
这个时候,点击就成功了,会再把点击搜索按钮之后的界面截图出来,
图片
接下来都是一样的操作,总共 5 个步骤。
Which element do you want to tap? Choose a numeric tag from 1 to 14:
3
Choose one of the following actions you want to perform on the current screen:
tap, text, long press, swipe, stop
text
Which element do you want to input the text string? Choose a numeric tag from 1 to 14:
3
Enter your input text below:
testerhome
Choose one of the following actions you want to perform on the current screen:
tap, text, long press, swipe, stop
tap
Which element do you want to tap? Choose a numeric tag from 1 to 15:
4
Choose one of the following actions you want to perform on the current screen:
tap, text, long press, swipe, stop
stop
Demonstration phase completed. 5 steps were recorded.
然后就是 chatGPT 开始工作了,
Warning! No module named 'sounddevice'
Warning! No module named 'matplotlib'
Warning! No module named 'keras'
Starting to generate documentations for the app CSDN based on the demo demo_CSDN_2023-12-24_20-46-47
Waiting for GPT-4V to generate documentation for the element net.csdn.csdnplus.id_ll_order_tag_net.csdn.csdnplus.id_iv_home_bar_search_1
resp:
Request cost is $0.00
Documentation generated and saved to ./apps\CSDN\demo_docs\net.csdn.csdnplus.id_ll_order_tag_net.csdn.csdnplus.id_iv_home_bar_search_1.txt
Waiting for GPT-4V to generate documentation for the element android.widget.LinearLayout_1008_144_net.csdn.csdnplus.id_et_search_content_1
resp:
Request cost is $0.00
Documentation generated and saved to ./apps\CSDN\demo_docs\android.widget.LinearLayout_1008_144_net.csdn.csdnplus.id_et_search_content_1.txt
Waiting for GPT-4V to generate documentation for the element android.widget.LinearLayout_1008_144_net.csdn.csdnplus.id_et_search_content_1
resp:
Request cost is $0.00
Documentation generated and saved to ./apps\CSDN\demo_docs\android.widget.LinearLayout_1008_144_net.csdn.csdnplus.id_et_search_content_1.txt
Waiting for GPT-4V to generate documentation for the element android.widget.LinearLayout_1440_176_net.csdn.csdnplus.id_tv_search_search_2
resp:
Request cost is $0.00
Documentation generated and saved to ./apps\CSDN\demo_docs\android.widget.LinearLayout_1440_176_net.csdn.csdnplus.id_tv_search_search_2.txt
Documentation generation phase completed. 4 docs generated.
最后生成的样子是这样的:
图片
其中 task_desc 就是我们前面的 search for testerhome,record 是每一步的命令的合并,然后有打标签的截图等等。
到这里,我们的学习就完成了,下面就要运行了, python run.py
Warning! No module named 'sounddevice'Warning! No module named 'matplotlib'Warning! No module named 'keras'Welcome to the deployment phase of AppAgent!
Before giving me the task, you should first tell me the name of the app you want me to operate and what documentation base you want me to use. I will try my best to complete the task without your intervention. First, please enter the main interface of the app on your phone and provide the following information.
What is the name of the target app?
CSDN
Warning! No module named 'sounddevice'Warning! No module named 'matplotlib'Warning! No module named 'keras'Documentations generated from human demonstration were found for the app CSDN. The doc base is selected automatically.
List of devices attached:['42954ffb']Device selected: 42954ffb
Screen resolution of 42954ffb: 1440x3216
Please enter the description of the task you want me to complete in a few sentences:
search for testerhome
Round 1
Documentations retrieved for the current interface:
Documentation of UI element labeled with the numeric tag '25':
This UI element is clickable. To assist you effectively, I'll need to see the screenshots of the mobile app you're referring to. Please provide the images showing the UI element before and after it has been tapped.
Thinking about what to do in the next step...
resp:
Request cost is $0.01
Observation:
The image shows a smartphone screen with a Google search bar at the top. There are various app icons on the screen, and the UI element labeled '25' appears to be the Google search bar.
Thought:
To complete the given task, I should tap on the Google search bar to bring up the keyboard and enter the search term "testerhome."Action:
tap(25)Summary:
I have tapped on the Google search bar to initiate a search for "testerhome."Round 2
Documentations retrieved for the current interface:
Documentation of UI element labeled with the numeric tag '3':
This UI element is clickable. I'm sorry, but I can't provide assistance without the images being displayed. If you can provide the screenshots of the mobile app before and after tapping the UI element, I would be able to help you describe its functionality.
This UI element can receive text input. The text input is used for the following purposes: I'm sorry, but I can't provide assistance without the images being displayed. If you can provide the screenshots you mentioned, I'd be happy to help describe the functionality of the UI element labeled with the number 3.
Documentation of UI element labeled with the numeric tag '4':
This UI element is clickable. I'm sorry, but I can't provide assistance without the images being displayed. If you can provide the screenshots you mentioned, I'd be happy to help you with the description of the UI element.
Thinking about what to do in the next step...
resp:
Request cost is $0.01
Observation:
I'm sorry, but I cannot provide observations without an image being displayed. If you can provide the screenshot you mentioned, I'd be happy to assist you further.
Thought:
Since I cannot see the image, I am unable to determine the next step to complete the given task.
Action:
FINISH
Summary:
I have tapped on the Google search bar to initiate a search for "testerhome," but without the image, I cannot proceed further with the task.
Task completed successfully
这个过程,其实就是拿着前面 learn 的时候,记录的这些信息,去组成 prompt 模板,再去调用 chatGPT。代码是下图,里面的 image_url,就是打标签的图片。把某一步的操作和对应的图片提交给 GPT。
图片
我前面运行 run.py 里面第一步就成功的把图片和 tap 的操作给传给 chatGPT 了,GPT 说 tap(25) 。但是大家再往下看的时候,就发现 GPT 开始胡说八道了,所以很遗憾,我 learn 时候的操作,并没有在 run 的时候重放出来。
试用总结
至此,基本把 APPAgent 跑了一遍了,我和群友说,demo 很性感,现实很骨感,显然 chatGPT 对 CSDN 不够了解。在我看来,现阶段的 AppAgent 只不过是一个客户端录制回放的,而且非常简陋的工具。但是思路非常不错,我自己组里准备着手改造,看看能不能真正用起来。
图片
针对TesterHome社区小伙伴的本次试用,AppAgent的作者之一,也在TesterHome社区进行了回应。所以,对于AppAgent这个创新项目,大家可以试用起来,有什么问题,也可以在TesterHome社区发帖、留言,AppAgent项目官方的同学们也都会看得到。