An Empirical Study of GPT-4o Image Generation Capabilities

Paper · arXiv 2504.05979 · Published April 8, 2025

The landscape of image generation has rapidly evolved, from early GAN-based approaches to diffusion models and, most recently, to unified generative architectures that seek to bridge understanding and generation tasks. Recent advances, especially GPT-4o, have demonstrated the feasibility of high-fidelity multimodal generation, but their architectural design remains undisclosed. This prompts the question of whether these methods have already integrated image and text generation into a unified framework. In this work, we conduct an empirical study of GPT-4o’s image generation capabilities, benchmarking it against leading open-source and commercial models. Our evaluation covers four main categories (text-to-image, image-to-image, image-to-3D, and image-to-X generation) spanning more than 20 tasks. Our analysis highlights the strengths and limitations of GPT-4o under various settings and situates it within the broader evolution of generative modeling. Through this investigation, we identify promising directions for future unified generative models, emphasizing the role of architectural design and data scaling.

Introduction. Over the past decade, image generation has undergone a remarkable evolution—from the early successes of GANs [35] to the dominance of diffusion models [89, 82, 26], which have significantly advanced image fidelity and diversity [37, 7]. In parallel, Large Language Models (LLMs) have achieved exceptional performance across diverse natural language tasks by scaling autoregressive next-token prediction, demonstrating the power of unified modeling principles. These advances naturally raise a compelling question: can such principles be extended to image generation? However, fundamental differences between autoregressive and diffusion-based paradigms present non-trivial challenges. Autoregressive models excel in sequential text generation, while diffusion models have become the de facto standard for high-quality image synthesis. Bridging these modalities within a unified framework remains an open challenge.

Discussion / Conclusion. Although GPT-4o demonstrates impressive capabilities across a wide range of image generation tasks, several limitations remain. These challenges highlight key areas for future improvement in developing unified foundation models for vision-language generation. Despite strong alignment between text and vision modalities, GPT-4o struggles to generate underrepresented cultural elements and to render non-Latin scripts such as Chinese, Japanese, and Arabic. The generated characters are often incomplete, distorted, or replaced with Latin-like approximations. These artifacts reflect underlying challenges in multilingual representation, likely due to limited exposure to diverse scripts during training and the inherent difficulty of accurate typographic rendering in pixel space. This phenomenon is emblematic of a larger issue in AI systems: data bias. The training data used to develop models like GPT-4o may disproportionately represent certain languages, cultures, and writing systems, leading to disparities in performance across different linguistic groups.