UniGenDet: A Unified Generative-Discriminative Framework for Co-Evolutionary Image Generation and Generated Image Detection

Department of Automation, Tsinghua University, China
* Corresponding author    Project leader

CVPR 2026

UniGenDet teaser
UniGenDet bridges generation and detection in a unified, co-evolutionary framework.

Image generation and generated-image detection have both advanced rapidly, yet they are typically developed in isolation: generators focus on perceptual realism, while detectors react to a moving target by training on snapshots of existing forgeries. This “arms race” creates a persistent detection lag: detectors may overfit to transient artifacts and struggle to generalize to new generators and post-processing.

UniGenDet breaks this separation by unifying generation and authenticity discrimination within one framework. As illustrated in Fig. 1, (a) generative knowledge helps detection reduce distribution gaps and improves interpretability; (b) detection feedback refines generation toward higher realism; and (c) a single architecture supports multiple modalities and tasks (detection + explanation, and text-to-image generation).

Abstract

In recent years, significant progress has been made in both image generation and generated image detection. Despite their rapid, yet largely independent, development, these two fields have evolved distinct architectural paradigms: the former predominantly relies on generative networks, while the latter favors discriminative frameworks. A recent trend in both domains is the use of adversarial information to enhance performance, revealing potential for synergy. However, the significant architectural divergence between them presents considerable challenges. Departing from previous approaches, we propose UniGenDet: a Unified generative-discriminative framework for co-evolutionary image Generation and generated image Detection. To bridge the task gap, we design a symbiotic multimodal self-attention mechanism and a unified fine-tuning algorithm. This synergy allows the generation task to improve the interpretability of authenticity identification, while authenticity criteria guide the creation of higher-fidelity images. Furthermore, we introduce a detector-informed generative alignment mechanism to facilitate seamless information exchange. Extensive experiments on multiple datasets demonstrate that our method achieves state-of-the-art performance.

Key Contributions

Framework design and methodological highlights

  • Unified generative-discriminative co-evolution: We jointly optimize image generation and detection in a closed loop, so that the “spear” and “shield” evolve synchronously instead of independently.
  • Bridging the task gap via SMSA + unified fine-tuning: We introduce a Symbiotic Multi-modal Self-Attention mechanism and a unified fine-tuning algorithm to transfer generative distributional knowledge to authenticity classification and explanation.
  • Detector-informed generative alignment (DIGA): We inject the detector's forensic criteria back into the generator through feature alignment, encouraging authenticity-aware synthesis while preserving generative fidelity.
  • These contributions establish a novel paradigm for bridging image generation and detection, effectively mitigating the traditional lag between generative model advances and detection capabilities.

Method Overview

Two main modules: GDUF and DIGA

GDUF pipeline
Fig. 2. Generation-Detection Unified Fine-tuning (GDUF) pipeline. The Symbiotic Multi-modal Self-Attention transfers generator knowledge to the detector, enabling both accurate classification and explanatory output.
DIGA pipeline
Fig. 3. Detector-Informed Generative Alignment (DIGA). The detector's forensic knowledge is injected into the generator, guiding image synthesis toward undetectable realism while maintaining generative fidelity.

Stage I — GDUF (Generation-Detection Unified Fine-tuning). We build upon a unified generation-understanding model and fine-tune it on both generation data and detection-with-explanation data. For detection, the Symbiotic Multi-modal Self-Attention (SMSA) lets the detector attend to generator-side latents together with visual/text features, enabling accurate real/fake prediction and more grounded artifact explanations. For generation, discriminative cues from the detector are injected as conditions, making synthesis more aligned with authenticity criteria.
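The core idea of SMSA — letting detector queries attend jointly over generator latents, visual tokens, and text tokens — can be illustrated with a minimal single-head attention sketch. This is a simplified illustration, not the paper's actual implementation: the function name `symbiotic_self_attention`, the single-head formulation, and the plain concatenation of the three token streams are all assumptions for exposition.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def symbiotic_self_attention(gen_latents, vis_tokens, txt_tokens, Wq, Wk, Wv):
    """Single-head self-attention over the concatenation of generator-side
    latents, visual tokens, and text tokens, so that detection queries can
    draw on generative distributional knowledge.
    gen_latents: (n_g, d), vis_tokens: (n_v, d), txt_tokens: (n_t, d),
    Wq/Wk/Wv: (d, d) projection matrices (hypothetical shapes)."""
    x = np.concatenate([gen_latents, vis_tokens, txt_tokens], axis=0)  # (N, d)
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    attn = softmax(q @ k.T / np.sqrt(k.shape[-1]))  # (N, N) cross-stream weights
    return attn @ v  # (N, d) fused representation shared by both tasks

# Example: 3 generator latents, 5 visual tokens, 4 text tokens, dim 8.
rng = np.random.default_rng(0)
d = 8
fused = symbiotic_self_attention(
    rng.normal(size=(3, d)), rng.normal(size=(5, d)), rng.normal(size=(4, d)),
    rng.normal(size=(d, d)), rng.normal(size=(d, d)), rng.normal(size=(d, d)))
```

Because all three streams live in one attention matrix, every detector-side token can weight generator-side evidence directly, which is the mechanism the text credits for more grounded artifact explanations.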

Stage II — DIGA (Detector-Informed Generative Alignment). After obtaining a strong detector, we freeze it and use it as an authenticity teacher to guide the generator. Specifically, DIGA aligns generator intermediate features to the detector's representation of real images (cosine-similarity feature alignment), combined with the flow matching objective. This forms an explicit feedback loop: the generator is pushed away from detector-sensitive (easily detectable) feature subspaces, improving realism while keeping the base generative capability.
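The DIGA objective described above can be sketched as a weighted sum of a flow-matching regression term and a cosine-alignment term against features from the frozen detector. This is a minimal sketch under assumptions: the function names, the linear-interpolation flow-matching target `x1 - x0`, and the weight `lam` are illustrative choices, not the paper's stated hyperparameters.

```python
import numpy as np

def cosine_alignment_loss(gen_feats, real_feats):
    """Pull generator intermediate features toward the frozen detector's
    representation of real images: 1 - mean cosine similarity."""
    g = gen_feats / np.linalg.norm(gen_feats, axis=-1, keepdims=True)
    r = real_feats / np.linalg.norm(real_feats, axis=-1, keepdims=True)
    return 1.0 - float((g * r).sum(axis=-1).mean())

def flow_matching_loss(v_pred, x0, x1):
    """Conditional flow matching with a linear path: the regression target
    for the predicted velocity field is x1 - x0."""
    return float(((v_pred - (x1 - x0)) ** 2).mean())

def diga_loss(v_pred, x0, x1, gen_feats, real_feats, lam=0.1):
    """Combined objective: base generative loss plus detector-informed
    alignment, weighted by a hypothetical coefficient lam."""
    return flow_matching_loss(v_pred, x0, x1) + lam * cosine_alignment_loss(gen_feats, real_feats)

# Sanity check: a perfect velocity prediction and perfectly aligned
# features drive both terms to zero.
x0, x1 = np.zeros((2, 4)), np.ones((2, 4))
feats = np.array([[1.0, 0.0], [0.0, 1.0]])
zero_loss = diga_loss(x1 - x0, x0, x1, feats, feats)
```

The alignment term only receives gradients through `gen_feats` (the detector is frozen), so the generator is steered out of detector-sensitive subspaces without the detector itself drifting.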

Qualitative Visualizations

Generation and detection comparisons

Generation comparison
Generation results: UniGenDet produces more natural lighting, coherent shadows, and artifact-free textures than BAGEL, validating the benefit of detector-informed training for visual realism.
Detection comparison
Detection results: UniGenDet identifies real and fake images more accurately than the BAGEL baseline, while providing clearer artifact explanations and better interpretability for forensic evaluation.

Quantitative Results

Detection and generation metrics explained

Detection table
Detection Performance: UniGenDet outperforms existing methods on the FakeClue, DMImage, and ARForensics datasets. Accuracy and F1 improvements highlight the efficacy of joint training and detector-guided generator feedback.
Generation table
Generation Metrics: Fréchet Inception Distance (FID) and GenEval results demonstrate improved realism and text-image alignment. The two-stage training approach significantly enhances visual fidelity over baseline BAGEL models.

Overall, quantitative results confirm the effectiveness of the co-evolutionary framework. Detection accuracy, F1, semantic consistency scores, and FID collectively indicate that UniGenDet successfully balances both tasks, offering state-of-the-art performance in generation realism and forensic detection.
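For reference, FID compares the Gaussian statistics of Inception features from real and generated images: FID = ||μ_r − μ_g||² + Tr(Σ_r + Σ_g − 2(Σ_r Σ_g)^{1/2}). A minimal sketch of this formula, simplified to diagonal covariances so the matrix square root becomes elementwise (an assumption for brevity; the standard metric uses full covariance matrices):

```python
import numpy as np

def fid_diagonal(mu_r, var_r, mu_g, var_g):
    """FID between two Gaussians with diagonal covariances:
    ||mu_r - mu_g||^2 + sum(var_r + var_g - 2*sqrt(var_r * var_g)).
    mu_*: (d,) feature means; var_*: (d,) per-dimension variances."""
    mean_term = float(((mu_r - mu_g) ** 2).sum())
    cov_term = float((var_r + var_g - 2.0 * np.sqrt(var_r * var_g)).sum())
    return mean_term + cov_term

# Identical statistics give FID 0; lower is better.
mu, var = np.array([0.0, 1.0]), np.array([1.0, 2.0])
fid_same = fid_diagonal(mu, var, mu, var)
```

Under this view, the generation-side gains reported above correspond to the generator's feature statistics moving closer to those of real images, which is exactly what the DIGA alignment term encourages.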

BibTeX

@inproceedings{zhang2026unigendet,
  title     = {UniGenDet: A Unified Generative-Discriminative Framework for Co-Evolutionary Image Generation and Generated Image Detection},
  author    = {Zhang, Yanran and Zheng, Wenzhao and Li, Yifei and Yu, Bingyao and Zheng, Yu and Chen, Lei and Zhou, Jie and Lu, Jiwen},
  booktitle = {Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year      = {2026}
}