TL;DR: We simplify and improve Representation Autoencoders (RAE). Introducing RAEv2: over 10× faster convergence, better reconstruction, better generation, and consistent gains on text-to-image (T2I) generation and world models.
Generalized Representation Encoders. Pretrained vision encoders are more than their final layer. Aggregating features across layers of a pretrained vision encoder greatly improves reconstruction without encoder finetuning or specialized data (e.g., text, faces).
RAE and REPA exhibit complementary working mechanisms. RAE leverages semantic quality while REPA regularizes spatial structure. This complementarity allows using the same pretrained representation as both the encoder (RAE) and the target for intermediate diffusion features (REPA). It also explains why stronger representations like DINOv3-L, which excel in both global and spatial performance, achieve the best generation with RAEv2.
REPA enables self-guidance. REPA is x-prediction in RAE latent space. By reformulating the output head also as x-prediction, the REPA head itself can be used for internal guidance. This eliminates the need for a separate model (AutoGuidance) or an extra forward pass (CFG).
Representation Autoencoders (RAE)[1] replace the traditional VAE with pretrained vision encoders (DINOv2[8], DINOv3[9], SigLIP, ...). This provides an elegant solution for unified tokenization across understanding and generation.
We make representation autoencoders simpler and better:
Introducing RAEv2
Prior RAE work treats the encoder output as the final-layer feature of a pretrained vision encoder. However, different layers of a pretrained encoder capture complementary features. Can we leverage these features across all layers without encoder finetuning or specialized data?
We consider a simple solution. We define a generalized formulation where the RAE output is the sum of the last $K$ encoder layers. The original RAE is recovered at $K{=}1$. By simply varying $K$, we get easy control over reconstruction quality while also improving generation performance and preserving understanding performance. See Section 3.1 for the full $K$ sweep.
| $K$ | LP top-1 (%) ↑ |
|---|---|
| 1 (last layer; RAE) | 85.39 |
| 4 | 85.15 |
| 7 | 85.10 |
| 23 (full MLS) | 85.24 |
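The generalized formulation above (sum of the last $K$ encoder layers, with $K{=}1$ recovering the original RAE) can be sketched as follows. This is a minimal illustration, not the released implementation: the function name `sum_last_k_layers` and the toy values are assumptions, plain nested lists stand in for feature tensors, and any per-layer normalization the actual method may apply is omitted.

```python
def sum_last_k_layers(layer_outputs, k):
    """Sum the last k per-layer feature maps elementwise.

    layer_outputs: list over encoder layers (first to last). Each entry is a
    list of tokens, and each token is a list of floats. All layers are assumed
    to share the same shape.
    """
    if not 1 <= k <= len(layer_outputs):
        raise ValueError("k must be between 1 and the number of layers")
    selected = layer_outputs[-k:]
    return [
        # zip(*token_per_layer) groups the same channel across the k layers.
        [sum(channel_values) for channel_values in zip(*token_per_layer)]
        # zip(*selected) groups the same token position across the k layers.
        for token_per_layer in zip(*selected)
    ]


# Toy encoder with 3 layers, 2 tokens, 2 channels (hypothetical values).
layers = [
    [[1.0, 2.0], [3.0, 4.0]],
    [[10.0, 20.0], [30.0, 40.0]],
    [[100.0, 200.0], [300.0, 400.0]],
]

last_only = sum_last_k_layers(layers, k=1)  # final-layer feature only (K=1)
last_two = sum_last_k_layers(layers, k=2)   # final layer plus the one before
```

With an array library the same operation is a single sum over the stacked last $K$ layers. The only knob is `k`.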
The prevailing assumption[1][15][16] is that RAE eliminates the need for REPA[2], since RAE already uses the encoder representation as the latent space. Distilling the same representation again to intermediate diffusion layers looks like a wasteful skip connection. We test this assumption at scale across 27 vision encoders and find the opposite: using REPA on top of RAE consistently improves generation. This suggests fundamentally different working mechanisms.
REPA[2] on top of RAE has minimal impact on the peak global semantic information (linear probing) of intermediate diffusion features. It instead substantially improves their spatial self-similarity structure. This effect was identified in iREPA[3].
To quantify this, we correlate two encoder properties with downstream gFID across 27 encoders: ImageNet linear probing (LP, global semantics) and local distance similarity (LDS, spatial structure). Since lower gFID is better, a more negative Pearson $r$ means the property is a stronger predictor of good generation.
| Method | LP ($r$) ↓ | LDS ($r$) ↓ | Avg ($r$) ↓ |
|---|---|---|---|
| REPA alone | +0.34 | −0.89 | −0.56 |
| RAE alone | −0.81 | −0.13 | −0.55 |
| RAE + REPA | −0.64 | −0.53 | −0.83 |
REPA alone correlates most with spatial structure (LDS). RAE alone benefits most from global semantics (LP). Together, RAE+REPA benefit from encoders strong in both. This is why stronger encoders like DINOv3-L, which excel on both axes, yield the best generation with RAEv2.
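For readers who want to reproduce this kind of analysis on their own encoders, the Pearson correlation used above can be computed as in the sketch below. The encoder scores and gFID values here are made up for illustration and are not taken from the 27-encoder study.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical linear-probing accuracy (%) and gFID for four encoders.
lp_accuracy = [70.0, 75.0, 80.0, 85.0]
gfid = [4.0, 3.5, 2.5, 2.0]

# Strongly negative: higher LP goes with lower (better) gFID in this toy data.
r = pearson_r(lp_accuracy, gfid)
```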
RAE struggles with traditional CFG[4] and instead relies on AutoGuidance[5]: a separately-trained weaker diffusion model. We show this is unnecessary.
REPA is x-prediction[6] in RAE latent space. In RAE, the clean latent is the encoder representation: $\mathbf{x} = E(\mathbf{I})$. The REPA head $h_\phi$ predicts $\hat{\mathbf{x}}_{\text{repa}} = h_\phi(\mathbf{h})$ from early-layer features $\mathbf{h}$. Since $h_\phi$ is a lightweight MLP on early features, its prediction is naturally weaker than the full model's, so it can play the same role as the weaker model in AutoGuidance.
Reformulating the full DiT output also as x-prediction puts both outputs in the same space, enabling internal guidance[7] in a single forward pass:
$\hat{\mathbf{x}}_{\text{guided}} \;=\; \hat{\mathbf{x}}_{\text{full}} \;+\; w \cdot \bigl(\hat{\mathbf{x}}_{\text{full}} - \hat{\mathbf{x}}_{\text{repa}}\bigr)$
No AutoGuidance model. No extra forward pass. Halves the NFEs versus CFG.
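The guidance rule above is a simple extrapolation away from the weaker REPA prediction. A minimal sketch, assuming both x-predictions come from the same forward pass and are flattened to lists of floats; the function name `repa_guidance` and the numeric values are illustrative only.

```python
def repa_guidance(x_full, x_repa, w):
    """Apply x_guided = x_full + w * (x_full - x_repa) elementwise.

    x_full: x-prediction of the full model.
    x_repa: x-prediction of the REPA head (the weaker predictor).
    w:      guidance scale; w = 0 disables guidance.
    """
    if len(x_full) != len(x_repa):
        raise ValueError("both predictions must have the same length")
    return [full + w * (full - repa) for full, repa in zip(x_full, x_repa)]


# Hypothetical flattened predictions for a 3-dimensional latent.
x_full = [1.0, 2.0, 3.0]
x_repa = [0.5, 2.0, 4.0]

unguided = repa_guidance(x_full, x_repa, w=0.0)  # equals x_full
guided = repa_guidance(x_full, x_repa, w=2.0)    # pushed away from x_repa
```

Where the two predictions agree (the second entry), guidance leaves the output unchanged. Where they differ, the output moves further in the direction of the full model's prediction.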
| Guidance | gFID ($K{=}7$) ↓ | gFID ($K{=}23$) ↓ |
|---|---|---|
| w/o Guidance | 1.65 | 3.01 |
| CFG | 1.49 | 2.83 |
| AutoGuidance (AG) | 1.14 | 1.37 |
| REPA Guidance (ours) | 1.06 | 1.25 |
Ablation on guidance mechanism in RAEv2. Guidance with REPA and x-prediction achieves the best results at no extra inference cost.
Across various vision encoders, RAEv2 converges substantially faster than the original RAE.
| Method | Epochs | E@FID-2 ↓ | gFID ↓ |
|---|---|---|---|
| SiT-XL/2 | 800 | >800 | 2.12 |
| DDT-XL | 800 | – | 1.26 |
| SiT-XL/2-REPA | 800 | >800 | 1.42 |
| LightningDiT | 800 | >800 | 1.42 |
| REG | 800 | 560 | 1.54 |
| REPA-E | 800 | 480 | 1.12 |
| RAE-XL | 800 | 177 | 1.13 |
| RAEv2 ($K{=}7$, ours) | 80 | 35 | 1.06 |
Training efficiency. Compared to gFID, E@FID-$k$ (epochs to reach unguided gFID $\le k$; $k{=}2$ by default) separates methods far more clearly. Notably, RAE marks a large jump over previous methods, cutting E@FID-2 from 480 epochs (REPA-E) to 177. RAEv2 improves further, reaching an E@FID-2 of just 35 epochs.
Suggestion. Incremental improvements in absolute gFID values might provide limited signal for practical applications. Inspired by recent speedruns in the language domain, we also report training convergence using E@FID-$k$ (epochs to reach unguided gFID $\le k$).
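E@FID-$k$ is straightforward to compute from a logged training curve. The sketch below assumes the curve is a list of `(epoch, unguided_gfid)` pairs, which is a hypothetical logging format; the values are made up. Note that the result can only be as fine-grained as the evaluation schedule.

```python
def epochs_to_fid(curve, k=2.0):
    """Return the first evaluated epoch with unguided gFID <= k.

    curve: iterable of (epoch, unguided_gfid) pairs in any order.
    Returns None if the threshold is never reached (reported as ">N" in
    tables, where N is the last evaluated epoch).
    """
    for epoch, gfid in sorted(curve):
        if gfid <= k:
            return epoch
    return None


# Hypothetical evaluation log: (epoch, unguided gFID).
curve = [(10, 6.1), (20, 3.4), (40, 1.9), (80, 1.6)]

e_at_fid_2 = epochs_to_fid(curve)            # first epoch with gFID <= 2
e_at_fid_1_5 = epochs_to_fid(curve, k=1.5)   # threshold never reached here
```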
Beyond gFID, we evaluate sample fidelity in six feature spaces.
| Method | Incep. | ConvNeXt | DINOv2 | MAE | SigLIP | CLIP | FD$_r^6$ ↓ |
|---|---|---|---|---|---|---|---|
| SiT-XL/2 | 1.26 | 2.02 | 7.89 | 5.62 | 16.14 | 17.69 | 8.44 |
| DDT-XL | 0.75 | 1.02 | 4.26 | 4.11 | 10.16 | 13.86 | 5.70 |
| SiT-XL/2-REPA | 0.85 | 1.22 | 4.27 | 3.85 | 9.87 | 12.65 | 5.45 |
| LightningDiT | 0.85 | 1.09 | 3.76 | 3.02 | 8.47 | 10.21 | 4.57 |
| REG | 0.92 | 1.14 | 3.45 | 3.02 | 8.42 | 10.86 | 4.64 |
| REPA-E | 0.70 | 1.28 | 2.44 | 2.52 | 5.04 | 6.28 | 3.04 |
| RAE-XL | 0.69 | 1.79 | 2.11 | 3.30 | 3.79 | 7.87 | 3.26 |
| RAEv2 ($K{=}7$, ours) | 0.64 | 0.77 | 1.15 | 2.67 | 2.54 | 5.21 | 2.17 |
Representation Fréchet Distance. FD$_r$ measured in 6 feature spaces, with FD$_r^6$ as the average. All baselines train for 800 epochs. RAEv2 achieves state-of-the-art FD$_r^6$ of 2.17 in just 80 epochs without any post-training.
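As a sanity check on how FD$_r^6$ aggregates the six columns, the sketch below recomputes it as an unweighted mean of the per-space values copied from the table. Recomputing from the rounded table entries matches the reported column to within 0.01; the small residual differences presumably come from rounding of the per-space values.

```python
# Per-space FD_r values copied from the table above, in column order:
# Inception, ConvNeXt, DINOv2, MAE, SigLIP, CLIP.
FD_R = {
    "SiT-XL/2": [1.26, 2.02, 7.89, 5.62, 16.14, 17.69],
    "RAE-XL": [0.69, 1.79, 2.11, 3.30, 3.79, 7.87],
    "RAEv2 (K=7)": [0.64, 0.77, 1.15, 2.67, 2.54, 5.21],
}

# FD_r^6 values as reported in the last column of the table.
REPORTED = {"SiT-XL/2": 8.44, "RAE-XL": 3.26, "RAEv2 (K=7)": 2.17}


def fd_r6(per_space_values):
    """Unweighted mean of the six per-feature-space Fréchet distances."""
    return sum(per_space_values) / len(per_space_values)


averages = {name: fd_r6(values) for name, values in FD_R.items()}
max_gap = max(abs(averages[name] - REPORTED[name]) for name in FD_R)
```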
We also validate our approach for large-scale text-to-image generation. We simply adapt DiT$^{DH}$-XL for T2I by replacing the in-context class-embedding tokens with 256 text-condition tokens from a Qwen3-0.6B model.
| Method | Pretraining GenEval ↑ | Pretraining DPG ↑ | Finetuning GenEval ↑ | Finetuning DPG ↑ |
|---|---|---|---|---|
| Flux-VAE | 41.7 | 77.6 | 78.3 | 79.2 |
| RAE | 58.4 | 80.1 | 81.5 | 80.6 |
| RAEv2 (ours) | 62.4 | 81.7 | 82.7 | 82.3 |
Quantitative text-to-image generation results. RAEv2 leads to consistent improvements over Flux-VAE and the original RAE for T2I generation.
We further test our improved training recipe (RAEv2) on the navigation world model task[12]. Given $N{=}4$ past RGB frames, a sequence of egocentric actions, and a target time step, the model predicts the future RGB frame autoregressively. We train on RECON[13] at 4 FPS, reusing the DiT$^{DH}$-XL backbone and flow-matching recipe from our ImageNet experiments.
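The autoregressive setup described above can be sketched as a sliding-window rollout. This is a simplified illustration under stated assumptions: `predict_next` is a hypothetical stand-in for the trained world model, frames are represented as scalars in the toy example, and the target-time-step conditioning mentioned above is left out for brevity.

```python
from collections import deque


def rollout(past_frames, actions, predict_next, context_len=4):
    """Predict one future frame per action, feeding predictions back in.

    past_frames:  observed frames, oldest first (at least context_len).
    actions:      one action per future step.
    predict_next: callable (context_frames, action) -> next frame.
    """
    if len(past_frames) < context_len:
        raise ValueError("need at least context_len past frames")
    # A bounded deque drops the oldest frame when a new one is appended.
    context = deque(past_frames[-context_len:], maxlen=context_len)
    predictions = []
    for action in actions:
        frame = predict_next(list(context), action)
        predictions.append(frame)
        context.append(frame)
    return predictions


def toy_predict(context_frames, action):
    """Toy stand-in model: next frame = most recent frame + action."""
    return context_frames[-1] + action


predicted = rollout([0.0, 1.0, 2.0, 3.0], [1.0, 1.0, -2.0], toy_predict)
```

Because each prediction re-enters the context, errors compound over the horizon, which is why long-horizon rollout metrics are informative for this task.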
Video prediction quality. RAEv2-NWM achieves an FVD of 105.61 on RECON, substantially better than DIAMOND, NWM, and RAE. The same ordering holds at every horizon from 1 to 16 seconds on both FID and LPIPS. Qualitative rollouts also exhibit much less flickering between consecutive frames.
| Method | DIAMOND | NWM | RAE | RAEv2 (ours) |
|---|---|---|---|---|
| FVD ↓ | 762.73 | 200.97 | 312.01 | 105.61 |
Video prediction quality up to 16s on RECON.
Importance of generalized formulation. A large fraction of these gains comes from the generalized RAE formulation (Section 2.1). Earlier encoder layers retain low-level texture and geometry critical for temporally consistent navigation rollouts. This leads to better future-state prediction and video quality across rollout horizons.
Convergence speed. We also find that the generalized formulation leads to significant improvements in convergence speed for NWM. Since the model relies on features from previous frames, a representation that captures more low-level details not only gives better final performance but also allows much faster training.
We present an improved baseline that simplifies and strengthens RAE. We find that frozen vision encoders already contain the low-level details needed for reconstruction. Simply aggregating the last $K$ layers leads to Pareto-optimal reconstruction-generation performance.
We next perform a large-scale empirical analysis showing that RAE and REPA exhibit complementary working mechanisms. Their combination is not only useful but also simplifies guidance with RAE. Furthermore, it allows stronger representations (e.g., DINOv3-L), which excel in both spatial and global performance, to translate into better generation performance.
Overall, RAEv2 achieves over 10× faster convergence than RAE, improves reconstruction, and achieves state-of-the-art gFID and FD$_r^6$ in just 80 epochs without any post-training. We also validate our improved recipe across diverse tasks, including T2I generation and world models, showing consistent improvements. We hope our work provides useful insights for practical adoption of representation autoencoders.
@article{singh2026raev2,
title = {Improved Baselines with Representation Autoencoders},
author = {Singh, Jaskirat and Zheng, Boyang and Wu, Zongze and Zhang, Richard and Shechtman, Eli and Xie, Saining},
journal = {arXiv preprint arXiv:2605.18324},
year = {2026},
}