Microsoft releases Phi-4-reasoning-vision-15B, a compact multimodal model rivaling much larger systems

Microsoft has released Phi-4-reasoning-vision-15B, a 15B-parameter open-weight model that takes both images and text as input and targets complex math and science problems, chart interpretation, and GUI navigation. Available on Hugging Face, GitHub, and Azure under a permissive license, the model uses a novel "selective thinking" approach: it decides when deep reasoning is worth the compute and when a quick response suffices. The model matches or exceeds much larger systems on multimodal reasoning benchmarks while using a fraction of the compute and training data. Microsoft trained it with a three-stage pipeline: supervised fine-tuning on chain-of-thought data, reinforcement learning to improve reasoning efficiency, and model merging to recover general capabilities lost during specialization.

View full digest for March 5, 2026