UniGenDet: A Unified Generative-Discriminative Framework for Co-Evolutionary Image Generation and Generated Image Detection

Department of Automation, Tsinghua University, China
* Corresponding author    Project leader

CVPR 2026

UniGenDet teaser
UniGenDet bridges generation and detection in a unified, co-evolutionary framework.

Image generation and generated-image detection have both advanced rapidly, yet they are typically developed in isolation: generators focus on perceptual realism, while detectors react to a moving target by training on snapshots of existing forgeries. This “arms race” creates a persistent detection lag: detectors may overfit to transient artifacts and struggle to generalize to new generators and post-processing.

UniGenDet breaks this separation by unifying generation and authenticity discrimination within one framework. As illustrated in Fig. 1, (a) generative knowledge helps detection reduce distribution gaps and improves interpretability; (b) detection feedback refines generation toward higher realism; and (c) a single architecture supports multiple modalities and tasks (detection + explanation, and text-to-image generation).

Abstract

In recent years, significant progress has been made in both image generation and generated image detection. Despite their rapid, yet largely independent, development, these two fields have evolved distinct architectural paradigms: the former predominantly relies on generative networks, while the latter favors discriminative frameworks. A recent trend in both domains is the use of adversarial information to enhance performance, revealing potential for synergy. However, the significant architectural divergence between them presents considerable challenges. Departing from previous approaches, we propose UniGenDet: a Unified generative-discriminative framework for co-evolutionary image Generation and generated image Detection. To bridge the task gap, we design a symbiotic multimodal self-attention mechanism and a unified fine-tuning algorithm. This synergy allows the generation task to improve the interpretability of authenticity identification, while authenticity criteria guide the creation of higher-fidelity images. Furthermore, we introduce a detector-informed generative alignment mechanism to facilitate seamless information exchange. Extensive experiments on multiple datasets demonstrate that our method achieves state-of-the-art performance.

Key Contributions

Framework design and methodological highlights

  • Unified generative-discriminative co-evolution: We jointly optimize image generation and detection in a closed loop, so that the “spear” and “shield” evolve synchronously instead of independently.
  • Bridging the task gap via SMSA + unified fine-tuning: We introduce a Symbiotic Multi-modal Self-Attention mechanism and a unified fine-tuning algorithm to transfer generative distributional knowledge to authenticity classification and explanation.
  • Detector-informed generative alignment (DIGA): We inject the detector's forensic criteria back into the generator through feature alignment, encouraging authenticity-aware synthesis while preserving generative fidelity.
  • These contributions establish a novel paradigm for bridging image generation and detection, effectively mitigating the traditional lag between generative model advances and detection capabilities.

Method Overview

Two main modules: GDUF and DIGA

GDUF pipeline
Fig. 2. Generation-Detection Unified Fine-tuning (GDUF) pipeline. The Symbiotic Multi-modal Self-Attention transfers generator knowledge to the detector, enabling both accurate classification and explanatory output.
DIGA pipeline
Fig. 3. Detector-Informed Generative Alignment (DIGA). The detector's forensic knowledge is injected into the generator, guiding image synthesis toward undetectable realism while maintaining generative fidelity.

Stage I — GDUF (Generation-Detection Unified Fine-tuning). We build upon a unified generation-understanding model and fine-tune it on both generation data and detection-with-explanation data. For detection, the Symbiotic Multi-modal Self-Attention (SMSA) lets the detector attend to generator-side latents together with visual/text features, enabling accurate real/fake prediction and more grounded artifact explanations. For generation, discriminative cues from the detector are injected as conditions, making synthesis more aligned with authenticity criteria.
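The core idea of SMSA — letting detector queries attend jointly over generator latents, visual tokens, and text tokens — can be illustrated with a minimal single-head attention sketch. This is a simplified illustration, not the paper's actual implementation: the function name `symbiotic_self_attention`, the single-head formulation, and the plain concatenation of the three token streams are all assumptions for exposition.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def symbiotic_self_attention(gen_latents, vis_tokens, txt_tokens, Wq, Wk, Wv):
    """Single-head self-attention over the concatenation of generator-side
    latents, visual tokens, and text tokens, so that detection queries can
    draw on generative distributional knowledge.
    gen_latents: (n_g, d), vis_tokens: (n_v, d), txt_tokens: (n_t, d),
    Wq/Wk/Wv: (d, d) projection matrices (hypothetical shapes)."""
    x = np.concatenate([gen_latents, vis_tokens, txt_tokens], axis=0)  # (N, d)
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    attn = softmax(q @ k.T / np.sqrt(k.shape[-1]))  # (N, N) cross-stream weights
    return attn @ v  # (N, d) fused representation shared by both tasks

# Example: 3 generator latents, 5 visual tokens, 4 text tokens, dim 8.
rng = np.random.default_rng(0)
d = 8
fused = symbiotic_self_attention(
    rng.normal(size=(3, d)), rng.normal(size=(5, d)), rng.normal(size=(4, d)),
    rng.normal(size=(d, d)), rng.normal(size=(d, d)), rng.normal(size=(d, d)))
```

Because all three streams live in one attention matrix, every detector-side token can weight generator-side evidence directly, which is the mechanism the text credits for more grounded artifact explanations.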

Stage II — DIGA (Detector-Informed Generative Alignment). After obtaining a strong detector, we freeze it and use it as an authenticity teacher to guide the generator. Specifically, DIGA aligns generator intermediate features to the detector's representation of real images (cosine-similarity feature alignment), combined with the flow matching objective. This forms an explicit feedback loop: the generator is pushed away from detector-sensitive (easily detectable) feature subspaces, improving realism while keeping the base generative capability.
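The DIGA objective described above can be sketched as a weighted sum of a flow-matching regression term and a cosine-alignment term against features from the frozen detector. This is a minimal sketch under assumptions: the function names, the linear-interpolation flow-matching target `x1 - x0`, and the weight `lam` are illustrative choices, not the paper's stated hyperparameters.

```python
import numpy as np

def cosine_alignment_loss(gen_feats, real_feats):
    """Pull generator intermediate features toward the frozen detector's
    representation of real images: 1 - mean cosine similarity."""
    g = gen_feats / np.linalg.norm(gen_feats, axis=-1, keepdims=True)
    r = real_feats / np.linalg.norm(real_feats, axis=-1, keepdims=True)
    return 1.0 - float((g * r).sum(axis=-1).mean())

def flow_matching_loss(v_pred, x0, x1):
    """Conditional flow matching with a linear path: the regression target
    for the predicted velocity field is x1 - x0."""
    return float(((v_pred - (x1 - x0)) ** 2).mean())

def diga_loss(v_pred, x0, x1, gen_feats, real_feats, lam=0.1):
    """Combined objective: base generative loss plus detector-informed
    alignment, weighted by a hypothetical coefficient lam."""
    return flow_matching_loss(v_pred, x0, x1) + lam * cosine_alignment_loss(gen_feats, real_feats)

# Sanity check: a perfect velocity prediction and perfectly aligned
# features drive both terms to zero.
x0, x1 = np.zeros((2, 4)), np.ones((2, 4))
feats = np.array([[1.0, 0.0], [0.0, 1.0]])
zero_loss = diga_loss(x1 - x0, x0, x1, feats, feats)
```

The alignment term only receives gradients through `gen_feats` (the detector is frozen), so the generator is steered out of detector-sensitive subspaces without the detector itself drifting.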

Qualitative Visualizations

Generation and detection comparisons

Generation comparison
Generation results: UniGenDet produces more natural lighting, coherent shadows, and artifact-free textures than BAGEL, validating the benefit of detector-informed training for visual realism.
Detection comparison
Detection results: UniGenDet identifies real and fake images more accurately than the BAGEL baseline, while providing clearer artifact explanations and better interpretability for forensic evaluation.

Quantitative Results

Detection and generation metrics explained

Detection table
Detection Performance: UniGenDet outperforms existing methods on the FakeClue, DMImage, and ARForensics datasets. Accuracy and F1 improvements highlight the efficacy of joint training and detector-guided generator feedback.
Generation table
Generation Metrics: Fréchet Inception Distance (FID) and GenEval results demonstrate improved realism and text-image alignment. The two-stage training approach significantly enhances visual fidelity over baseline BAGEL models.

Overall, quantitative results confirm the effectiveness of the co-evolutionary framework. Detection accuracy, F1, semantic consistency scores, and FID collectively indicate that UniGenDet successfully balances both tasks, offering state-of-the-art performance in generation realism and forensic detection.
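For reference, FID compares the Gaussian statistics of Inception features from real and generated images: FID = ||μ_r − μ_g||² + Tr(Σ_r + Σ_g − 2(Σ_r Σ_g)^{1/2}). A minimal sketch of this formula, simplified to diagonal covariances so the matrix square root becomes elementwise (an assumption for brevity; the standard metric uses full covariance matrices):

```python
import numpy as np

def fid_diagonal(mu_r, var_r, mu_g, var_g):
    """FID between two Gaussians with diagonal covariances:
    ||mu_r - mu_g||^2 + sum(var_r + var_g - 2*sqrt(var_r * var_g)).
    mu_*: (d,) feature means; var_*: (d,) per-dimension variances."""
    mean_term = float(((mu_r - mu_g) ** 2).sum())
    cov_term = float((var_r + var_g - 2.0 * np.sqrt(var_r * var_g)).sum())
    return mean_term + cov_term

# Identical statistics give FID 0; lower is better.
mu, var = np.array([0.0, 1.0]), np.array([1.0, 2.0])
fid_same = fid_diagonal(mu, var, mu, var)
```

Under this view, the generation-side gains reported above correspond to the generator's feature statistics moving closer to those of real images, which is exactly what the DIGA alignment term encourages.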

BibTeX

@inproceedings{zhang2026unigendet,
  title     = {UniGenDet: A Unified Generative-Discriminative Framework for Co-Evolutionary Image Generation and Generated Image Detection},
  author    = {Zhang, Yanran and Zheng, Wenzhao and Li, Yifei and Yu, Bingyao and Zheng, Yu and Chen, Lei and Zhou, Jie and Lu, Jiwen},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2026}
}