ByteDance’s UI-TARS can take over your computer, outperforming GPT-4o and Claude


A new AI agent has emerged from TikTok’s parent company to take control of your computer and perform complex workflows.

Like Anthropic’s Computer Use, ByteDance’s new UI-TARS understands graphical user interfaces (GUIs), applies logic and takes autonomous, sequential actions.

Trained on nearly 50B tokens and offered in 7B and 72B parameter versions, the PC/MacOS agents achieve state-of-the-art (SOTA) performance on more than 10 GUI benchmarks spanning performance, perception, grounding and overall agent capabilities, consistently beating OpenAI’s GPT-4o, Anthropic’s Claude and Google’s Gemini.

“Through iterative training and reflection, UI-TARS continuously learns from its mistakes and adapts to unexpected situations with minimal human intervention,” researchers from ByteDance and Tsinghua University wrote in a new research paper.

Source: arXiv

How UI-TARS explains its thinking

UI-TARS works across desktop, mobile and web applications, using multimodal inputs (text, images, interactions) to understand the visual environment.

Its UI has two tabs: one on the left that shows its step-by-step “thinking,” and a larger one on the right where it pulls up files, websites and apps and acts on them automatically.

For example, in a demo video released today, the model is prompted to “Find round trip flights from SEA to NYC on the 5th and back on the 10th of the next month and filter by price in ascending order.”

In response, UI-TARS navigates to the website for Delta Airlines, fills in the “from” and “to” fields, clicks on the relevant dates, selects the class and filters by price, explaining each step in its thinking box before acting.

In another scenario, it is instructed to install the autoDocstring extension in VS Code. Here is its step-by-step thought process while completing that task (a rough code sketch of such a thought-and-action trace follows the list):

  • It first reports that “to proceed with the installation of the autoDocstring extension in VS Code, I must first open the VS Code application.”
  • Recognizing that the VS Code window is not yet fully loaded, the model states that “it is necessary to wait for the application to finish initializing before proceeding with the installation of the autoDocstring extension. This ensures that all the necessary tools are available for the next steps.”
  • It then explains the need to access the Extensions view in VS Code, which can be done by clicking the Extensions tab in the left sidebar.
  • Running into a small glitch, it reasons: “The previous attempt to click on the ‘Extensions’ tab may not have been precise enough, so I will try again to make sure the Extensions view opens successfully.”
  • Once in the Extensions view, it types ‘autoDocstring’ before concluding: “Since the installation process has already started, no further action is required at this time. I will wait for the installation to complete and confirm its success once the process is finished.”
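To make that walkthrough concrete, here is a minimal, hypothetical sketch of how such a thought-and-action trace might be represented in code. The class and field names are illustrative assumptions, not UI-TARS’s actual interface.

```python
# Hypothetical sketch of the thought-and-action trace described above.
# Class and field names are illustrative assumptions, not UI-TARS's API.
from dataclasses import dataclass
from typing import Optional

@dataclass
class AgentStep:
    thought: str                   # reasoning shown in the "thinking" panel
    action: str                    # e.g. "click", "type", "wait"
    target: Optional[str] = None   # UI element the action applies to, if any

vscode_trace = [
    AgentStep("Open the VS Code application.", "click", "VS Code icon"),
    AgentStep("Wait for VS Code to finish initializing.", "wait"),
    AgentStep("Open the Extensions view from the left sidebar.", "click", "Extensions tab"),
    AgentStep("Previous click may have missed; retry to open the Extensions view.", "click", "Extensions tab"),
    AgentStep("Search for the extension and wait for the install to finish.", "type", "autoDocstring"),
]
```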

Outperforming its rivals

In various benchmarks, the researchers report that UI-TARS consistently outperforms OpenAI’s GPT-4o; Anthropic’s Claude-3.5-Sonnet; Google’s Gemini-1.5-Pro and Gemini-2.0; four Qwen models; and numerous academic models.

For example, on VisualWebBench, which measures a model’s ability to ground web elements in tasks such as webpage question answering (QA) and optical character recognition, UI-TARS-72B scored 82.8%, beating GPT-4o (78.5%) and Claude 3.5 (78.2%).

It also performed better on WebSRC (understanding semantic content and layout in web contexts) and ScreenQA-short (understanding complex mobile screen layouts and web structure). UI-TARS-7B achieved a top score of 93.6% on WebSRC, while UI-TARS-72B achieved 88.6% on ScreenQA-short, outperforming Qwen, Gemini, Claude 3.5 and GPT-4o.

“These results demonstrate the superior perception and understanding capabilities of UI-TARS in web and mobile environments,” the researchers wrote. “Such perceptual ability lays the foundation for agent tasks, where accurate understanding of the environment is essential for task execution and decision-making.”

UI-TARS also showed impressive results on ScreenSpot Pro and ScreenSpot v2, which assess a model’s ability to understand and localize elements in GUIs. In addition, the researchers tested its capabilities in planning multi-step actions and low-level tasks in mobile environments, and benchmarked it on OSWorld (which assesses open-ended computer tasks) and AndroidWorld (which scores autonomous agents on 116 programmatic tasks across 20 mobile apps).

Source: arXiv
Source: arXiv

Under the hood

To help it take a series of actions and recognize what it sees, UI-TARS was trained on a large dataset of screenshots with parsed metadata that includes element description and type, visual description, bounding boxes (position information), element function and text from various websites, applications and operating systems. This allows the model to provide a comprehensive, detailed description of a screenshot, capturing not only the elements but also their spatial relationships and the overall layout.
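As a rough illustration of the kind of annotation being described, a single element in such a dataset might look something like the record below. The field names and values are hypothetical, not the paper’s actual schema.

```python
# Hypothetical per-element screenshot annotation of the kind described above.
# Field names and values are illustrative only, not UI-TARS's actual format.
element_annotation = {
    "element_type": "button",
    "description": "Primary search button in the page header",
    "visual_description": "Blue rounded rectangle with a white 'Search' label",
    "bounding_box": {"x": 912, "y": 48, "width": 96, "height": 36},  # position in pixels
    "function": "Submits the query typed into the adjacent text field",
    "text": "Search",
    "source": "example travel-booking website",  # website, app or OS the screenshot came from
}
```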

The model also uses state-transition captioning to identify and describe the differences between two consecutive screenshots and to determine when an action, such as a mouse click or keyboard input, has occurred. Meanwhile, set-of-mark (SoM) prompting allows it to overlay distinct marks (letters, numbers) on specific regions of an image.
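To make the set-of-mark idea concrete, here is a minimal sketch using the Pillow imaging library. The screenshot path and bounding boxes are placeholders, and this is an assumption about how such an overlay could be drawn, not the paper’s implementation.

```python
# Minimal sketch of set-of-mark (SoM) prompting: overlay numbered marks on
# candidate UI regions so a vision-language model can refer to them by label.
# The screenshot path and bounding boxes below are placeholders.
from PIL import Image, ImageDraw

def overlay_marks(screenshot_path: str, boxes: list[tuple[int, int, int, int]]) -> Image.Image:
    img = Image.open(screenshot_path).convert("RGB")
    draw = ImageDraw.Draw(img)
    for i, (x0, y0, x1, y1) in enumerate(boxes, start=1):
        draw.rectangle((x0, y0, x1, y1), outline="red", width=3)  # box around the region
        draw.text((x0 + 4, y0 + 4), str(i), fill="red")           # numeric mark the model can cite
    return img

# Usage sketch: marked = overlay_marks("screenshot.png", [(100, 200, 260, 240), (300, 200, 460, 240)])
```

The point of the marks is to give the model stable labels (“click region 2”) rather than forcing it to reason about raw pixel coordinates.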

The model is equipped with both short-term and long-term memory to handle the task at hand while also retaining historical interactions to improve later decision-making. The researchers trained the model to perform both System 1 (fast, automatic, intuitive) and System 2 (slow, deliberate) reasoning, which allows for multi-step decision-making, “reflective” thinking, milestone recognition and error correction.
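A rough sketch of how short-term and long-term memory might be wired into such an agent loop is shown below; the class and method names are assumptions made for illustration, not the actual UI-TARS implementation.

```python
# Illustrative sketch of two-tier agent memory; names are assumptions only.
from collections import deque

class AgentMemory:
    def __init__(self, short_term_size: int = 5):
        self.short_term = deque(maxlen=short_term_size)  # most recent (thought, action, observation) steps
        self.long_term = []                              # full interaction history across the task

    def record(self, step: dict) -> None:
        # Every step goes into both stores; the short-term buffer evicts old entries.
        self.short_term.append(step)
        self.long_term.append(step)

    def context(self) -> list:
        # Short-term memory feeds the immediate prompt; long-term memory can be
        # summarized or searched when reflection or error correction is needed.
        return list(self.short_term)

# Usage sketch:
# memory = AgentMemory()
# memory.record({"thought": "Open the Extensions view", "action": "click", "observation": "view opened"})
```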

The researchers emphasize that it is critical for the model to maintain consistent goals and to engage in trial and error: hypothesizing, testing and evaluating potential actions before completing a task. They introduced two types of data to support this: error-correction data and post-reflection data. For error correction, they identified mistakes and labeled the corrective actions; for post-reflection, they simulated recovery steps.
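As an illustration of what those two data types could look like, here is a sketch with made-up records; the structure and field names are assumptions, not the paper’s annotation format.

```python
# Illustrative-only examples of the two data types described above.
# Structure and field names are assumptions, not the paper's actual format.
error_correction_sample = {
    "state": "Extensions view did not open after the click",
    "erroneous_action": "click(target='Extensions tab', x=12, y=310)",   # the annotated mistake
    "corrective_action": "click(target='Extensions tab', x=18, y=305)",  # the labeled fix
}

post_reflection_sample = {
    "trajectory": ["open_app('VS Code')", "click('Extensions tab')  # missed"],
    "reflection": "The click likely missed the tab; retry before searching for the extension.",
    "recovery_steps": ["click('Extensions tab')", "type('autoDocstring')"],  # simulated recovery
}
```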

“This strategy ensures that the agent not only learns to avoid errors but also adapts dynamically when they occur,” the researchers wrote.

Clearly, UI-TARS shows impressive capabilities, and it will be interesting to watch its use cases develop in the increasingly crowded field of AI agents. As the researchers note: “Looking ahead, while native agents represent a significant leap forward, the future lies in the integration of active and lifelong learning, where agents autonomously drive their own learning through continuous, real-world interactions.”

The researchers pointed out that Claude Computer Use “performs strongly in web-based tasks but struggles more in mobile scenarios, indicating that Claude’s GUI operational capability has not yet transferred well to the mobile domain.”

In contrast, “UI-TARS shows excellent performance in the website and mobile domain.”


