- Pure visual solution, independent of XML and system metadata.
- Unrestricted operation scope, capable of multi-app operations.
- Multiple visual perception tools for operation localization.
- No need for exploration and training, plug and play.
- [3.10]🔥🔥Mobile-Agent has been accepted by the ICLR 2024 Workshop on Large Language Model (LLM) Agents.
- [3.4]🔥🔥We provide Mobile-Agent via Qwen-VL-Max, a free multi-modal large language model.
- [2.21] 🔥🔥We provide a demo that accepts screenshots uploaded from mobile devices. You can now try it on Hugging Face and ModelScope.
- [2.5] 🔥🔥We provide a free API and a fully deployed pipeline so you can experience Mobile-Agent even without an OpenAI API Key. Check out Quick Start.
- [1.31] 🔥Our code is available! Welcome to try Mobile-Agent.
- [1.31] 🔥Human-operated data in Mobile-Eval is in preparation and will be open-sourced soon.
- [1.30] Our paper is available at LINK.
- [1.30] Our evaluation results on Mobile-Eval are available.
- [1.30] The code and Mobile-Eval benchmark are coming soon!
Mobile-Agent.mp4
The demo can now be experienced at Hugging Face and ModelScope.
git clone https://github.com/X-PLUG/MobileAgent.git
cd MobileAgent
pip install -r requirements.txt
- Download the Android Debug Bridge (ADB).
- Enable ADB debugging on your Android phone; you need to turn on Developer options first.
- Connect your phone to the computer with a data cable and select "Transfer files".
- Test your ADB environment as follows (a Python connectivity check is also sketched after this list):
/path/to/adb devices
If the connected devices are displayed, the preparation is complete.
- If you are using a macOS or Linux system, make sure to grant adb execute permission as follows:
sudo chmod +x /path/to/adb
- If you are using a Windows system, the path will be xx/xx/adb.exe
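If you prefer to script the connectivity check, here is a minimal Python sketch (not part of the repository) that wraps `adb devices` with `subprocess`; the `ADB_PATH` value is a placeholder you should replace with your own adb path.

```python
# Minimal sketch: confirm that adb can see a connected device before running Mobile-Agent.
import subprocess

ADB_PATH = "/path/to/adb"  # placeholder: the adb binary downloaded above

def list_devices(adb_path: str = ADB_PATH) -> list[str]:
    """Return the serials of devices reported by `adb devices`."""
    output = subprocess.run(
        [adb_path, "devices"], capture_output=True, text=True, check=True
    ).stdout
    # Skip the "List of devices attached" header and keep lines that end with "device".
    return [
        line.split()[0]
        for line in output.splitlines()[1:]
        if line.strip().endswith("device")
    ]

if __name__ == "__main__":
    devices = list_devices()
    if devices:
        print("ADB is ready. Connected devices:", devices)
    else:
        print("No device found: check the cable, ADB debugging, and adb permissions.")
```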
❗Since GPT-4V can produce severe hallucinations when perceiving non-English screenshots, we strongly recommend using Mobile-Agent with English-only systems and apps to ensure its performance.
❗Due to currently limited resources, please contact us to get a free API Key, which consists of a URL and a token.
- Email: junyangwang@bjtu.edu.cn, or junyangwang287@gmail.com (if the former cannot be reached)
- WeChat: Wangjunyang0410
python run_api.py --adb_path /path/to/adb --url "The url you got" --token "The token you got" --instruction "your instruction"
- Download the icon detection model Grounding DINO; a minimal usage sketch follows below.
- The text detection model will be automatically downloaded from ModelScope after you run Mobile-Agent.
python run.py --grounding_ckpt /path/to/GroundingDINO --adb_path /path/to/adb --api "your API_TOKEN" --instruction "your instruction"
API_TOKEN is an OpenAI API Key with permission to access gpt-4-vision-preview.
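For intuition about what the icon detection step does, here is an illustrative sketch using the standard GroundingDINO inference helpers. It is not the exact code path inside run.py, and the config, checkpoint, and screenshot paths as well as the text prompt are placeholders.

```python
# Sketch: locate an icon on a screenshot with Grounding DINO (open-set detection).
from groundingdino.util.inference import load_model, load_image, predict

# Placeholder paths: point them at the downloaded Grounding DINO config and checkpoint.
model = load_model(
    "/path/to/GroundingDINO/groundingdino/config/GroundingDINO_SwinT_OGC.py",
    "/path/to/GroundingDINO/groundingdino_swint_ogc.pth",
)
image_source, image = load_image("screenshot.png")  # e.g. a screenshot pulled via adb

# Ask the detector for boxes that match a natural-language phrase.
boxes, logits, phrases = predict(
    model=model,
    image=image,
    caption="settings icon",
    box_threshold=0.35,
    text_threshold=0.25,
)
print(list(zip(phrases, logits.tolist(), boxes.tolist())))
```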
Mobile-Eval is a benchmark designed for evaluating the performance of mobile device agents. This benchmark includes 10 mainstream single-app scenarios and 1 multi-app scenario.
For each scenario, we have designed three instructions:
- Instruction 1: a relatively simple, basic task
- Instruction 2: Instruction 1 with additional requirements
- Instruction 3: an abstract user demand with no explicit task indication
The detailed content of Mobile-Eval is as follows:
Application | Instruction |
---|---|
Alibaba.com | 1. Help me find caps in Alibaba.com. 2. Help me find caps in Alibaba.com. If the "Add to cart" is available in the item information page, please add the item to my cart. 3. I want to buy a cap. I've heard things are cheap on Alibaba.com. Maybe you can find it for me. |
Amazon Music | 1. Search singer Jay Chou in Amazon Music. 2. Search a music about "agent" in Amazon Music and play it. 3. I want to listen music to relax. Find an App to help me. |
Chrome | 1. Search result for today's Lakers game. 2. Search the information about Taylor Swift. 3. I want to know the result for today's Lakers game. Find an App to help me. |
Gmail | 1. Send an empty email to {address}. 2. Send an email to {address} to tell my new work. 3. I want to let my friend know my new work, and his address is {address}. Find an App to help me. |
Google Maps | 1. Navigate to Hangzhou West Lake. 2. Navigate to a nearby gas station. 3. I want to go to Hangzhou West Lake, but I don't know the way. Find an App to help me. |
Google Play | 1. Download WhatsApp in Play Store. 2. Download Instagram in Play Store. 3. I want WhatsApp on my phone. Find an App to help me. |
Notes | 1. Create a new note in Notes. 2. Create a new note in Notes and write "Hello, this is a note", then save it. 3. I suddenly have something to record, so help me find an App and write down the following content: meeting at 3pm. |
Settings | 1. Turn on the dark mode. 2. Turn on the airplane mode. 3. I want to see the real time internet speed at the battery level, please turn on this setting for me. |
TikTok | 1. Swipe a video about pet cat in TikTok and click a "like" for this video. 2. Swipe a video about pet cat in TikTok and comment "Ohhhh, so cute cat!". 3. Swipe videos in TikTok. Click "like" for 3 pet cat videos. |
YouTube | 1. Search for videos about Stephen Curry on YouTube. 2. Search for videos about Stephen Curry on YouTube and open "Comments" to comment "Oh, chef, your basketball spirit has always inspired me". 3. I need you to help me show my love for Stephen Curry on YouTube. |
Multi-App | 1. Open the calendar and look at today's date, then go to Notes and create a new note to write "Today is {today's date}". 2. Check the temperature in the next 5 days, and then create a new note in Notes and write a temperature analysis. 3. Search the result for today's Lakers game, and then create a note in Notes to write a sport news for this result. |
We evaluated Mobile-Agent on Mobile-Eval. The evaluation results are available at LINK.
- We have stored the evaluation results for the 10 apps and the multi-app scenario in folders named after each app.
- The numbers within each app's folder represent the results for different types of instruction within that app.
- For example, if you want to view the results of Mobile-Agent for the second instruction in Google Maps, you should go to the following path:
results/Google Maps/2
- If the last action of Mobile-Agent is not "stop", it indicates that Mobile-Agent did not complete the corresponding instruction. During the evaluation, we manually terminated these cases where completion was not possible.
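If you want to browse the released results programmatically, a small sketch based on the folder layout described above could look like this; the `results` root comes from the path shown above, and everything else is an assumption.

```python
# Sketch: list the (app, instruction) result folders released for Mobile-Eval.
# Assumes the results/<App name>/<instruction index> layout described above.
from pathlib import Path

results_root = Path("results")  # adjust to where the released results are unpacked

for app_dir in sorted(p for p in results_root.iterdir() if p.is_dir()):
    for instruction_dir in sorted(p for p in app_dir.iterdir() if p.is_dir()):
        n_items = sum(1 for _ in instruction_dir.iterdir())
        print(f"{app_dir.name} / instruction {instruction_dir.name}: {n_items} items")
```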
- Development of Mobile-Agent app on Android platform.
- Adaptation to other mobile device platforms.
If you find Mobile-Agent useful for your research and applications, please cite using this BibTeX:
@article{wang2024mobile,
title={Mobile-Agent: Autonomous Multi-Modal Mobile Device Agent with Visual Perception},
author={Wang, Junyang and Xu, Haiyang and Ye, Jiabo and Yan, Ming and Shen, Weizhou and Zhang, Ji and Huang, Fei and Sang, Jitao},
journal={arXiv preprint arXiv:2401.16158},
year={2024}
}
- AppAgent: Multimodal Agents as Smartphone Users
- mPLUG-Owl & mPLUG-Owl2: Modularized Multimodal Large Language Model
- Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond
- GroundingDINO: Marrying DINO with Grounded Pre-Training for Open-Set Object Detection
- CLIP: Contrastive Language-Image Pretraining