Widget2Code

Abstract

User interface to code (UI2Code) aims to generate executable code that can faithfully reconstruct a given input UI. Prior work focuses largely on web pages and mobile screens, leaving app widgets underexplored. Unlike web or mobile UIs with rich hierarchical context, widgets are compact, context-free micro-interfaces that summarize key information through dense layouts and iconography under strict spatial constraints. Moreover, while (image, code) pairs are widely available for web or mobile UIs, widget designs are proprietary and lack accessible markup.

We formalize this setting as the Widget-to-Code (Widget2Code) task and introduce an image-only widget benchmark with fine-grained, multi-dimensional evaluation metrics. Benchmarking shows that although generalized multimodal large language models (MLLMs) outperform specialized UI2Code methods, they still produce unreliable and visually inconsistent code.

To address these limitations, we develop a baseline that jointly advances perceptual understanding and structured code generation. At the perceptual level, we follow widget design principles to assemble atomic components into complete layouts, equipped with icon retrieval and reusable visualization modules. At the system level, we design an end-to-end infrastructure, WidgetFactory, which includes a framework-agnostic widget-tailored domain-specific language (WidgetDSL) and a compiler that translates it into multiple front-end implementations (e.g., React, HTML/CSS). An adaptive rendering module further refines spatial dimensions to satisfy compactness constraints. Together, these contributions substantially enhance visual fidelity, establishing a strong baseline and unified infrastructure for future Widget2Code research.

Framework Architecture

Our Widget2Code framework consists of three key components: (1) Data Curation, which collects and processes widget images to construct our benchmark; (2) the Perceptual Agent, which decomposes the input into atomic components and extracts visual, semantic, and stylistic cues; and (3) WidgetFactory, an end-to-end infrastructure that generates, compiles, and adaptively renders WidgetDSL to reconstruct the input widget.

Widget2Code Benchmark

We introduce Widget2Code, the first image-only benchmark specifically designed to evaluate widget-to-code generation. Unlike existing UI2Code datasets that focus on web pages or mobile screens with accessible markup, widgets are compact, context-free micro-interfaces with dense layouts and strict spatial constraints, making code-paired data nearly impossible to collect. Our benchmark comprises 2,825 high-quality widgets curated from design platforms (Figma, Dribbble, Refero) and real device screenshots, with 1,000 samples reserved for rigorous testing.

To enable fine-grained evaluation beyond coarse global similarity metrics, we propose a suite of visual-only metrics inspired by Apple's Human Interface Guidelines. These metrics assess three critical dimensions: Layout (margin symmetry, content aspect ratio, area ratio), Legibility (text overlap, contrast consistency), and Style (palette fidelity, vibrancy, polarity). Our benchmark reveals that while generalized MLLMs (GPT-4o, Gemini-2.5) outperform specialized UI2Code models, all methods struggle with structural consistency, color accuracy, and dimension preservation—highlighting the unique challenges of widget reconstruction.
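As an illustration of what a visual-only layout metric can look like, the sketch below scores margin symmetry from a binary content mask. The exact formulas used by the benchmark are not given here, so the scoring rule, the mask representation, and all function names are assumptions made for this example only.

```python
from typing import List, Tuple

Mask = List[List[bool]]  # True where a pixel belongs to widget content (assumed input)


def content_margins(mask: Mask) -> Tuple[int, int, int, int]:
    """Return (left, right, top, bottom) margins around the content bounding box."""
    rows = [y for y, row in enumerate(mask) if any(row)]
    cols = [x for x in range(len(mask[0])) if any(row[x] for row in mask)]
    if not rows or not cols:
        raise ValueError("mask contains no content pixels")
    height, width = len(mask), len(mask[0])
    return cols[0], width - 1 - cols[-1], rows[0], height - 1 - rows[-1]


def symmetry(a: int, b: int) -> float:
    """1.0 when two opposing margins are equal, approaching 0.0 as they diverge."""
    return 1.0 if a == b else 1.0 - abs(a - b) / (a + b)


def margin_symmetry_score(mask: Mask) -> float:
    """Average of horizontal and vertical symmetry (a hypothetical scoring rule)."""
    left, right, top, bottom = content_margins(mask)
    return 0.5 * (symmetry(left, right) + symmetry(top, bottom))
```

A centered content block scores 1.0 under this rule, while content pushed toward one edge scores lower. Because the score is computed from pixels alone, it needs no markup or ground-truth code.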

Figure panels: (a) Web UI, (b) Mobile UI, (c) Widget 1, (d) Widget 2.

Comparison across interface modalities. Web and mobile UIs provide rich structural and textual context that supports rule-based code mapping, whereas widgets employ dense iconography, embedded graphs, and vivid color schemes within highly constrained layouts. These stylistic and structural compactness factors pose substantial challenges for UI-to-Code reconstruction.

Main Results

We benchmark two groups of methods: (1) generalized MLLMs, including GPT-4o, Gemini-2.5-Pro, Seed1.6-Thinking, Qwen3-VL, and Qwen3-VL-235b; and (2) specialized UI2Code methods built on MLLMs, including ScreenCoder, UI-UG, DCGen, UICopilot, LatCoder, Design2Code, and WebSight-VLM-8B.

Key Findings: Specialized UI2Code models, although effective on web and mobile datasets, exhibit pronounced performance degradation on widgets. In contrast, general-purpose MLLMs such as GPT-4o and Gemini achieve higher visual fidelity, suggesting better perceptual grounding, yet they still struggle to preserve structural consistency and stylistic accuracy. Moreover, all methods fail to reproduce the exact widget dimensions, even when explicitly prompted to match the input size.

Style score comparison on our widget benchmark. Generalized MLLMs outperform specialized UI2Code models, which are tuned for other UI formats instead of widgets.

Leaderboard

Detailed quantitative results on our Widget2Code benchmark. The table shows performance across all evaluation metrics: Layout (Margin, Content, Area), Legibility (Text, Contrast, LocCon), Style (Palette, Vibrancy, Polarity), Perceptual (SSIM, LPIPS, CLIP), and Geometry. The best-performing model for each metric is in bold, and the second best is underlined. Note that for LPIPS, lower values are better.

Row groups: Our Method, Generalized MLLM, Specialized UI2Code

Methods | Layout: Margin, Content, Area | Legibility: Text, Contrast, LocCon | Style: Palette, Vibrancy, Polarity | Perceptual: SSIM, LPIPS↓, CLIP | Geometry
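The bold and underline marking used in the leaderboard depends on each metric's direction, since LPIPS is the one column where lower is better. A minimal sketch of that selection follows. The function name and the score layout are assumptions, and any values passed in would be placeholders rather than benchmark results.

```python
from typing import Dict, Tuple

# Metrics where a smaller value is the better one (per the leaderboard note on LPIPS).
LOWER_IS_BETTER = {"LPIPS"}


def top_two(scores: Dict[str, Dict[str, float]], metric: str) -> Tuple[str, str]:
    """Return (best, second_best) method names for one metric column."""
    ranked = sorted(
        scores,
        key=lambda method: scores[method][metric],
        reverse=metric not in LOWER_IS_BETTER,  # descending unless lower is better
    )
    return ranked[0], ranked[1]
```

Here `scores` maps a method name to its per-metric values. The first returned name would be set in bold and the second underlined.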

Qualitative Comparison

We present qualitative comparisons between our method (Widget2Code) and baseline approaches across diverse widget designs. Each example shows the input widget alongside outputs from generalized MLLMs (Gemini-2.5-Pro, GPT-4o, Qwen3-VL), specialized UI2Code models (ScreenCoder, UI-UG), and our method. Our approach shows higher visual fidelity, better structural consistency, and more accurate color reproduction across widget types.

Gallery of five examples, each showing the Target Design alongside Model Generations from Ours (Widget2Code), GPT-4o, Gemini 1.5 Pro, Qwen3-VL, ScreenCoder, and UI-UG.

Qualitative comparison across methods. For each example, we compare outputs from generalized MLLMs (Gemini-2.5-Pro, GPT-4o, Qwen3-VL), specialized UI2Code models (ScreenCoder, UI-UG), our method (Widget2Code), and the ground truth input. Our method consistently produces widgets with better visual fidelity, accurate color schemes, and proper structural layout compared to baseline approaches.

WidgetFactory: Baseline Framework

Overview

To address the limitations exposed by our benchmark, we develop WidgetFactory, an end-to-end infrastructure that bridges perceptual understanding and executable code generation. At the perceptual level, we design a modular agent that follows widget design principles to decompose input images into atomic components—integrating icon retrieval from a 50k SVG library, reusable component templates for charts and buttons, and automated color palette extraction. This structured analysis prevents hallucination and ensures semantic consistency.
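Automated color palette extraction can be approximated in several ways. The sketch below uses coarse per-channel quantization and frequency counting. It is an assumed stand-in for illustration, not the method used by the perceptual agent, and `extract_palette`, `to_hex`, and the `step` parameter are hypothetical names.

```python
from collections import Counter
from typing import Iterable, List, Tuple

RGB = Tuple[int, int, int]


def extract_palette(pixels: Iterable[RGB], k: int = 3, step: int = 32) -> List[RGB]:
    """Return up to k dominant colors by bucketing each channel into step-wide bins."""
    buckets = Counter((r // step, g // step, b // step) for r, g, b in pixels)
    half = step // 2
    # Represent each bucket by its center so near-identical shades collapse together.
    return [
        (r * step + half, g * step + half, b * step + half)
        for (r, g, b), _count in buckets.most_common(k)
    ]


def to_hex(color: RGB) -> str:
    """Format an RGB triple as a CSS hex color."""
    return "#{:02x}{:02x}{:02x}".format(*color)
```

Passing the extracted palette to the generation step as explicit hex values gives the model concrete colors to reuse rather than colors to guess.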

At the system level, WidgetFactory introduces WidgetDSL, a compact domain-specific language that encodes layouts, styles, and hierarchies in a framework-agnostic format. A deterministic compiler translates WidgetDSL into multiple front-end implementations (React, HTML/CSS), while an adaptive rendering module uses feedback-guided binary search to optimize dimensions, prevent overflow, and satisfy compactness constraints. This unified pipeline substantially enhances visual fidelity, achieving perfect geometry scores and establishing a strong baseline for future Widget2Code research.
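The feedback-guided binary search in the adaptive rendering module can be sketched as follows. This is a minimal version under stated assumptions: the feedback signal is reduced to a boolean overflow check, only one dimension is searched, and fitting is monotone in that dimension. The `overflows` callback is hypothetical and stands in for rendering the compiled widget and measuring it.

```python
from typing import Callable


def smallest_fitting_width(overflows: Callable[[int], bool], lo: int, hi: int) -> int:
    """Binary-search the smallest width in [lo, hi] at which overflows(width) is False.

    Assumes monotonicity: if the content fits at some width, it fits at every larger one.
    """
    if overflows(hi):
        raise ValueError("content overflows even at the maximum width")
    while lo < hi:
        mid = (lo + hi) // 2
        if overflows(mid):
            lo = mid + 1  # too small: content spills out of the container
        else:
            hi = mid      # fits: try a more compact size
    return lo
```

For example, with a stand-in check such as `lambda w: w < 173`, a search over the range 100 to 400 settles on 173 after about eight render calls instead of hundreds of linear probes.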

Component Library & Usage Examples

For non-icon components, we define a library of reusable templates written in our DSL format. Each template encodes the structural and functional logic of common widget components such as buttons, charts, and text blocks, exposing configurable parameters for style, data binding, and runtime behavior. Given an extracted component, the system retrieves the corresponding component template and prompts the MLLM to refine or populate the template, producing a customized DSL instance that preserves the predefined component structure while adapting its visual style and data semantics to the input widget.
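The template-then-populate flow described above can be illustrated with a small sketch. The actual WidgetDSL syntax is not shown here, so the dictionary-based node format, the parameter names, and the `instantiate` and `compile_progress_bar` functions are all hypothetical. The MLLM's role is reduced to supplying the override values.

```python
from html import escape
from typing import Any, Dict

# Hypothetical template: fixed structure plus configurable parameters with defaults.
PROGRESS_BAR_TEMPLATE: Dict[str, Any] = {
    "type": "ProgressBar",
    "params": {"value": 0.5, "color": "#007aff", "track": "#e5e5ea", "height": 6},
}


def instantiate(template: Dict[str, Any], **overrides: Any) -> Dict[str, Any]:
    """Fill a template with extracted values, rejecting parameters it does not expose."""
    unknown = set(overrides) - set(template["params"])
    if unknown:
        raise KeyError(f"unknown parameters: {sorted(unknown)}")
    return {"type": template["type"], "params": {**template["params"], **overrides}}


def compile_progress_bar(node: Dict[str, Any]) -> str:
    """Deterministically lower a ProgressBar node to an HTML/CSS fragment."""
    p = node["params"]
    percent = max(0.0, min(1.0, float(p["value"]))) * 100
    return (
        f'<div style="background:{escape(p["track"])};height:{int(p["height"])}px">'
        f'<div style="background:{escape(p["color"])};width:{percent:.0f}%;height:100%"></div>'
        "</div>"
    )
```

Because only the exposed parameters can change, the structure of the component is preserved while its style and data adapt to the input widget. Rejecting unknown parameters is one way to keep a model from inventing fields the compiler cannot handle.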

Below we showcase usage examples from our component library: BarChart, LineChart, PieChart, RadarChart, StackedBar, Sparkline, Button, Icon, ProgressBar, AppLogo, Text, and Checkbox. Each component type is demonstrated through multiple instantiations with varying visual styles and data configurations, illustrating how the same reusable template can be adapted to different design requirements.

Component library usage examples. The figure above demonstrates how each reusable component template (BarChart, LineChart, PieChart, RadarChart, StackedBar, Sparkline, Button, Icon, ProgressBar, AppLogo, Text, and Checkbox) can be instantiated with different visual styles, color schemes, and data configurations while maintaining structural consistency.

BibTeX

@article{widget2code2025,
  title={Widget2Code: From Visual Widgets to UI Code via Multimodal LLMs},
  author={Houston H. Zhang and Tao Zhang and Baoze Lin and Yuanqi Xue and Yincheng Zhu and Huan Liu and Li Gu and Linfeng Ye and Ziqiang Wang and Xinxin Zuo and Yang Wang and Yuanhao Yu and Zhixiang Chi},
  journal={arXiv preprint},
  year={2025}
}
