Abstract
The emergence of visual autoregressive (AR) models has revolutionized image generation while presenting new challenges for synthetic image detection. Unlike previous GAN- or diffusion-based methods, AR models generate images through discrete token prediction, exhibiting both marked improvements in image synthesis quality and unique characteristics in their vector-quantized representations. In this paper, we propose to leverage Discrete Distribution Discrepancy-aware Quantization Error (D³QE) for autoregressive-generated image detection, exploiting the distinctive patterns and the frequency-distribution bias of codebook usage that differ between real and fake images. We introduce a discrete distribution discrepancy-aware transformer that integrates dynamic codebook frequency statistics into its attention mechanism, fusing semantic features with the quantization-error latent representation. To evaluate our method, we construct a comprehensive dataset termed ARForensics covering 7 mainstream visual AR models. Experiments demonstrate superior detection accuracy and strong generalization of D³QE across different AR models, along with robustness to real-world perturbations.
Introduction
Autoregressive (AR) models create forgeries in the discrete latent space via discrete token prediction, evading conventional detectors. We observe a strong Discrete Distribution Discrepancy: real images follow a long-tail token distribution, while fakes concentrate probability mass on a small set of high-frequency codebook entries, showing polarized codebook usage.
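This discrepancy can be made concrete by comparing empirical codebook-usage histograms of real and generated images. Below is a minimal sketch, not the paper's exact procedure: it assumes tokenized images are available, and the function names, `codebook_size`, and the signed-difference statistic are illustrative.

```python
import torch

def token_usage_histogram(token_ids: torch.Tensor, codebook_size: int) -> torch.Tensor:
    """Empirical usage frequency of each codebook entry over a batch of token maps."""
    counts = torch.bincount(token_ids.flatten(), minlength=codebook_size).float()
    return counts / counts.sum()

def distribution_discrepancy(p_real: torch.Tensor, p_fake: torch.Tensor) -> torch.Tensor:
    """A simple signed per-entry discrepancy between real and fake usage statistics;
    the paper's exact definition of the statistic may differ."""
    return p_fake - p_real
```

Under the observed discrepancy, `p_fake` places most of its mass on a few frequently used entries, while `p_real` stays closer to a long-tail distribution.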
Our Main Contributions:
- D³QE Framework: Analyzes the codebook distribution bias and quantization error from AR generation.
- D³AT Transformer: Features Discrepancy-Aware Self-Attention (D³ASA), integrating codebook statistics (ΔD) to fuse quantization error with semantic features.
- ARForensics Benchmark: The first dataset for AR-generated image detection, covering 7 mainstream models to test generalization.
Dataset Visualization
The ARForensics dataset contains samples from 7 mainstream AR models (LlamaGen, VAR, Infinity, Janus-Pro, RAR, Switti, Open-MAGVIT2), serving as a robust visual benchmark for testing the generalization of AI-generated image detection models.
Methodology

The D³QE framework fuses local discrete artifacts with global semantic features through four key components (see the code sketch after this list):
- Quantization Error Representation: A frozen VQVAE Encoder extracts the error between the continuous latent map z and its discrete representation z_q.
- Discrete Distribution Statistics: Computes the discrete distribution discrepancy (ΔD) from real vs. fake token usage statistics.
- D³AT Transformer: Its core Discrepancy-Aware Self-Attention (D³ASA) module processes the quantization error, guided by global ΔD.
- Semantic Feature Fusion: A frozen CLIP extracts semantic features, which are fused with the D³AT's output for final classification.
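To make the data flow concrete, here is a hedged, minimal sketch of how these components could fit together. The class name `D3QESketch`, the tensor shapes, and the way ΔD biases the attention are our illustrative assumptions, not the released implementation; the frozen VQ-VAE encoder and CLIP backbone are assumed to run upstream and supply `z`, `z_q`, and `clip_feat`.

```python
import torch
import torch.nn as nn

class D3QESketch(nn.Module):
    def __init__(self, codebook_size=16384, dim=256, clip_dim=768, num_classes=2):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads=8, batch_first=True)
        self.delta_d_proj = nn.Linear(codebook_size, dim)  # inject ΔD statistics
        self.classifier = nn.Linear(dim + clip_dim, num_classes)

    def forward(self, z, z_q, delta_d, clip_feat):
        # z, z_q: (B, dim, H, W) continuous latent and its quantized counterpart
        # delta_d: (B, codebook_size) discrepancy statistic; clip_feat: (B, clip_dim)
        err = (z - z_q).flatten(2).transpose(1, 2)       # (B, H*W, dim) error tokens
        bias = self.delta_d_proj(delta_d).unsqueeze(1)   # (B, 1, dim) global ΔD bias
        h, _ = self.attn(err + bias, err + bias, err)    # discrepancy-aware attention
        h = h.mean(dim=1)                                # pool over spatial tokens
        return self.classifier(torch.cat([h, clip_feat], dim=-1))

# Usage with random tensors (shapes only; a frozen VQ-VAE/CLIP would supply these):
model = D3QESketch()
logits = model(torch.randn(2, 256, 16, 16), torch.randn(2, 256, 16, 16),
               torch.randn(2, 16384), torch.randn(2, 768))
```

The key design point is that the discrepancy statistic ΔD enters as a global bias on the attention over quantization-error tokens, so the transformer attends to local error patterns in a distribution-aware way before fusing with semantics.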
Experiments
We evaluate D³QE extensively on our proposed ARForensics benchmark, demonstrating its superior performance in both intra-model testing and cross-model generalization.

Generalization beyond AR generators is further assessed on ForenSynths for GAN-generated images and on the GenImage dataset for diffusion-based images.

We also evaluate the model's robustness under common real-world corruptions, such as JPEG compression and Gaussian blur; a sketch of these perturbations follows.
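For reference, here is a minimal sketch of the two corruptions using Pillow; the quality and radius values are illustrative assumptions, not the paper's evaluation settings.

```python
import io
from PIL import Image, ImageFilter

def jpeg_compress(img: Image.Image, quality: int = 75) -> Image.Image:
    """Round-trip the image through JPEG encoding at the given quality."""
    buf = io.BytesIO()
    img.save(buf, format="JPEG", quality=quality)
    buf.seek(0)
    return Image.open(buf).convert("RGB")

def gaussian_blur(img: Image.Image, radius: float = 1.0) -> Image.Image:
    """Apply Gaussian blur with the given radius."""
    return img.filter(ImageFilter.GaussianBlur(radius=radius))
```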

Poster
BibTeX
@article{zhang2025d3qe,
title={$\bf{D^3}$QE: Learning Discrete Distribution Discrepancy-aware Quantization Error for Autoregressive-Generated Image Detection},
author={Zhang, Yanran and Yu, Bingyao and Zheng, Yu and Zheng, Wenzhao and Duan, Yueqi and Chen, Lei and Zhou, Jie and Lu, Jiwen},
journal={arXiv preprint arXiv:2510.05891},
year={2025}
}