A new MLLM tailored for enhanced understanding of mobile UI screens, equipped with referring, grounding, and reasoning capabilities. [Paper]
Keen You, Haotian Zhang, Eldon Schoop, Floris Weers, Amanda Swearngin, Jeffrey Nichols, Yinfei Yang, Zhe Gan
Key contributions:
- We introduce Ferret-UI, the first UI-centric MLLM that is capable of effectively executing referring, grounding, and reasoning tasks.
- We define a set of elementary and advanced UI tasks, for which we have meticulously gathered training samples for model training.
- We develop a comprehensive test benchmark encompassing all the tasks under investigation.
- We release two Ferret-UI checkpoints, built on Gemma-2B and Llama-3-8B respectively, for public exploration.
While Ferret-UI-base closely follows Ferret’s architecture, Ferret-UI-anyres incorporates additional fine-grained image features. In particular, a pre-trained image encoder and projection layer produce image features for the entire screen. For each sub-image obtained based on the original image aspect ratio, additional image features are generated. For text with regional references, a visual sampler generates a corresponding regional continuous feature. The LLM uses the full-image representation, sub-image representations, regional features, and text embeddings to generate a response.
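To make the any-resolution input flow concrete, here is a minimal sketch of how the different feature streams could be assembled before being handed to the LLM. All module names, placeholder encoders, and dimensions are illustrative assumptions, not the actual Ferret-UI implementation.

```python
# Minimal sketch of the "any-resolution" input assembly described above.
# Every module here is a toy stand-in; names and dimensions are illustrative only.
import torch
import torch.nn as nn

class AnyResInputBuilder(nn.Module):
    def __init__(self, vision_dim=1024, llm_dim=4096):
        super().__init__()
        # Stand-in for the pre-trained image encoder: average-pool the pixels, then
        # lift the 3 channel means to a vision_dim feature (purely a placeholder).
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.image_encoder = nn.Linear(3, vision_dim)
        self.projector = nn.Linear(vision_dim, llm_dim)       # projection layer into LLM space
        self.visual_sampler = nn.Linear(vision_dim, llm_dim)  # placeholder for the visual sampler

    def encode(self, image):                 # image: (3, H, W) -> (vision_dim,)
        pooled = self.pool(image).flatten()  # (3,)
        return self.image_encoder(pooled)

    def forward(self, full_image, sub_images, region_feats, text_embeds):
        # 1) Features for the whole screen.
        full_tokens = self.projector(self.encode(full_image)).unsqueeze(0)
        # 2) Extra features for each sub-image cut according to the original aspect ratio.
        sub_tokens = torch.stack([self.projector(self.encode(s)) for s in sub_images])
        # 3) Continuous regional features for regions referred to in the text.
        region_tokens = self.visual_sampler(region_feats)
        # 4) The LLM consumes all of these alongside the text embeddings.
        return torch.cat([full_tokens, sub_tokens, region_tokens, text_embeds], dim=0)

builder = AnyResInputBuilder()
full = torch.randn(3, 336, 336)                          # whole screen
subs = [torch.randn(3, 336, 336) for _ in range(2)]      # e.g. two sub-images for a tall portrait screen
regions = torch.randn(1, 1024)                           # one referred region's visual feature
text = torch.randn(16, 4096)                             # 16 text token embeddings
print(builder(full, subs, regions, text).shape)          # torch.Size([20, 4096])
```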
- [10/08/2024] 🔥 We release two Ferret-UI checkpoints trained with the any-resolution schema: FerretUI-Gemma2b and FerretUI-Llama8B.
- [07/03/2024] 🔥 Ferret-UI is accepted to ECCV 2024.
Usage and License Notices: The data and code are intended and licensed for research use only. They are also restricted to uses that follow the license agreements of Gemma and LLaMA. The dataset is CC BY NC 4.0 (allowing only non-commercial use), and models trained using the dataset should not be used outside of research purposes.
Model checkpoints can be found at Ferret-UI-gemma2b and Ferret-UI-Llama8B.
Model | Referring (iPhone) | Referring (Android) | Grounding (iPhone) | Grounding (Android) |
---|---|---|---|---|
Ferret-UI-Gemma2b | 82.75 | 80.80 | 72.21 | 73.20 |
Ferret-UI-Llama8b | 87.04 | 85.18 | 78.63 | 76.58 |
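For reading the grounding columns, a common criterion (an assumption here, not taken from the released evaluation scripts) is to count a prediction correct when its box overlaps the ground truth with IoU of at least 0.5. A minimal sketch of that accuracy computation:

```python
# Hedged sketch of a grounding accuracy metric, assuming [x1, y1, x2, y2] boxes and
# the common IoU >= 0.5 correctness criterion; not the released evaluation code.
def iou(box_a, box_b):
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    ix1, iy1 = max(ax1, bx1), max(ay1, by1)
    ix2, iy2 = min(ax2, bx2), min(ay2, by2)
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0

def grounding_accuracy(predictions, ground_truths, threshold=0.5):
    hits = sum(iou(p, g) >= threshold for p, g in zip(predictions, ground_truths))
    return 100.0 * hits / len(ground_truths)

# Toy usage: one well-placed box and one badly placed one -> 50.0
print(grounding_accuracy([[10, 10, 50, 50], [0, 0, 5, 5]],
                         [[12, 12, 52, 52], [100, 100, 150, 150]]))
```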
Both models are trained on the data collection described in the paper, with a learning rate of 2e-5 for 3 epochs and a global batch size of 128.
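As a rough illustration of how those hyperparameters map onto a standard Hugging Face `TrainingArguments` setup (the actual launch commands live in `scripts/train/`; the per-device/GPU split of the global batch size below is an assumption):

```python
# Illustrative mapping of the reported hyperparameters onto transformers.TrainingArguments;
# the per-device batch size / GPU count split (16 x 8 = 128) is an assumption, not from the paper.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="checkpoints/ferret-ui",
    learning_rate=2e-5,              # reported learning rate
    num_train_epochs=3,              # reported number of epochs
    per_device_train_batch_size=16,  # 16 per GPU on 8 GPUs -> global batch size 128 (assumed split)
    gradient_accumulation_steps=1,
    logging_steps=10,
)
print(training_args.learning_rate, training_args.num_train_epochs)
```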
- Clone this repository and navigate to the `ferretui` folder

      git clone https://github.com/apple/ml-ferret
      cd ferretui

- Install the package

      conda create -n ferretui python=3.10 -y
      conda activate ferretui
      pip install --upgrade pip  # enable PEP 660 support
      pip install -e .
      pip install flash-attn --no-build-isolation
- Training script examples are located in `scripts/train/`.
- Evaluation script examples are located in `scripts/eval/`.
- Example data is located in `playground/sample_data/` (see the snippet below for a quick way to inspect it).
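A quick, schema-agnostic way to peek at the bundled sample data before launching training; this snippet only lists whatever JSON it finds and makes no assumption about the exact annotation format:

```python
# Inspect the example data under playground/sample_data/ generically;
# field names depend on the actual files, so nothing about the schema is assumed.
import glob
import json

for path in sorted(glob.glob("playground/sample_data/**/*.json", recursive=True)):
    with open(path) as f:
        data = json.load(f)
    if isinstance(data, list) and data:
        head = data[0]
        summary = list(head) if isinstance(head, dict) else type(head).__name__
        print(f"{path}: list of {len(data)} records; first record: {summary}")
    elif isinstance(data, dict):
        print(f"{path}: dict with keys {list(data)[:10]}")
    else:
        print(f"{path}: {type(data).__name__}")
```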
If you find Ferret-UI useful, please cite using this BibTeX:
@misc{you2024ferretuigroundedmobileui,
title={Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs},
author={Keen You and Haotian Zhang and Eldon Schoop and Floris Weers and Amanda Swearngin and Jeffrey Nichols and Yinfei Yang and Zhe Gan},
year={2024},
eprint={2404.05719},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2404.05719},
}