Generative models are revolutionizing daily life through applications such as image and audio synthesis, while also enabling breakthroughs in scientific discovery. Despite their enormous practical success, the interpretability of modern generative models remains relatively underexplored. In this talk, I will present a line of my recent work that investigates the intrinsic dynamics and latent geometric structures of generative models, drawing on both theoretical and physical perspectives, and demonstrates how these insights can be harnessed at the sampling stage to guide and control pre-trained multimodal models in fine-grained scenarios. This enables versatile downstream applications, including text-based image editing [NeurIPS’23], image customization [ICLR’24], controllable enhancement of low-level visual attributes [ICCV’25 Highlight], acoustic masking [NeurIPS’25a], and diversity enhancement [ArXiv’26].