Step3 is a cutting-edge multimodal reasoning model from StepFun. It is built on a Mixture-of-Experts (MoE) architecture with 321B total parameters and 38B active parameters. The model is designed end-to-end to minimize decoding costs while delivering top-tier performance in vision-language reasoning. Through the co-design of Multi-Matrix Factorization Attention (MFA) and Attention-FFN Disaggregation (AFD), Step3 maintains exceptional efficiency across both flagship and low-end accelerators. During pretraining, Step3 processed over 20T text tokens and 4T image-text mixed tokens, spanning more than ten languages. The model has achieved state-of-the-art performance for open-source models on various benchmarks, including math, code, and multimodality...