Published on February 11th, 2025
This paper introduces DeepSeekMath 7B, a cutting-edge language model dedicated to mathematical reasoning. The authors combine large-scale pre-training on 120B math-related tokens sourced from Common Crawl with innovative reinforcement learning techniques to approach the performance of state-of-the-art closed models on benchmarks like GSM8K and MATH.
The authors detail an iterative data collection and decontamination pipeline that begins with a high-quality seed corpus (OpenWebMath). A fastText classifier, trained on 500,000 positive examples and an equal number of negatives, is employed to recall math-related pages from a deduplicated Common Crawl (reduced to 40B HTML pages). By iterating over data collection, with additional domain-based filtering and manual URL annotation, the corpus expands to 35.5 million pages (120B tokens), while stringent filtering safeguards the benchmark datasets against contamination.
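To make the recall step concrete, here is a minimal sketch of how such a fastText classifier might be trained and applied. The file name, hyperparameters, and threshold are my assumptions for illustration; the paper's exact settings may differ.

```python
import fasttext

# Train a supervised fastText classifier on labeled page text.
# "train.txt" is a hypothetical file where each line looks like:
#   __label__math  <page text>
#   __label__other <page text>
model = fasttext.train_supervised(
    input="train.txt",  # 500K positive + 500K negative examples
    lr=0.1,
    epoch=3,
    wordNgrams=2,       # bigrams help capture mathematical phrasing
    dim=100,
)

def is_math_page(text: str, threshold: float = 0.5) -> bool:
    """Return True if the classifier scores the page as math-related."""
    # fastText's predict() rejects newlines, so flatten the page first.
    labels, probs = model.predict(text.replace("\n", " "))
    return labels[0] == "__label__math" and probs[0] >= threshold
```

In the paper, recalled pages are ranked by classifier score and only the highest-scoring ones are kept, then the classifier is retrained on the enriched corpus for the next iteration.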
Pre-training experiments using DeepSeek-LLM 1.3B across various mathematical corpora (such as MathPile, OpenWebMath, and Proof-Pile-2) revealed that the DeepSeekMath Corpus consistently led to higher accuracy. Its multilingual nature—emphasizing both English and Chinese content—underscores its robustness and highlights the potential of high-quality, large-scale web data for advancing mathematical reasoning.
Initialized from DeepSeek-Coder-Base-v1.5 7B, the authors train DeepSeekMath-Base 7B for 500B tokens with a meticulously balanced data mix: 56% math content, 20% GitHub code, 10% arXiv papers, 4% AlgebraicStack data, and 10% natural language from Common Crawl. This extensive training enables the model to excel at step-by-step problem solving, tool-assisted reasoning (via Python), and even formal theorem proving using Isabelle.
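A minimal sketch of how a fixed data mix like this might be realized at batch time follows. The source names mirror the percentages above, but the sampling mechanics here are an assumption, not the authors' actual pipeline.

```python
import random

# Sampling weights matching the reported pre-training mix.
DATA_MIX = {
    "deepseekmath_corpus": 0.56,  # math web pages
    "github_code":         0.20,
    "arxiv":               0.10,
    "common_crawl_nl":     0.10,  # natural language text
    "algebraic_stack":     0.04,
}

def sample_source(rng: random.Random) -> str:
    """Pick the source of the next training document according to the mix."""
    sources, weights = zip(*DATA_MIX.items())
    return rng.choices(sources, weights=weights, k=1)[0]

rng = random.Random(0)
draws = [sample_source(rng) for _ in range(10_000)]
# Over many draws, roughly 56% of documents come from the math corpus,
# 20% from GitHub code, and so on.
```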
To further refine the model’s reasoning capabilities, the authors compiled a 776K-example instruction-tuning dataset covering diverse mathematical problems in both English and Chinese. Each problem is paired with solutions formatted as chain-of-thought, program-of-thought, or tool-integrated reasoning. This supervised fine-tuning process yields DeepSeekMath-Instruct 7B, which demonstrates notable improvements across a range of quantitative reasoning benchmarks.
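To illustrate what those three solution formats look like, here is one hypothetical record per format. The field names and wording are illustrative, not taken from the released dataset.

```python
# Hypothetical instruction-tuning records, one per reasoning format.
examples = [
    {   # Chain-of-thought: natural-language step-by-step reasoning.
        "question": "If 3x + 5 = 20, what is x?",
        "format": "chain-of-thought",
        "solution": "Subtract 5 from both sides: 3x = 15. Divide by 3: x = 5.",
    },
    {   # Program-of-thought: a program whose output is the answer.
        "question": "If 3x + 5 = 20, what is x?",
        "format": "program-of-thought",
        "solution": "x = (20 - 5) / 3\nprint(x)",
    },
    {   # Tool-integrated: reasoning text interleaved with executable code.
        "question": "If 3x + 5 = 20, what is x?",
        "format": "tool-integrated",
        "solution": "Isolate x, then compute: print((20 - 5) / 3)  # -> 5.0",
    },
]
```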
The authors introduce Group Relative Policy Optimization (GRPO), an innovative reinforcement learning method that forgoes the need for a separate value network. By sampling multiple outputs per question and computing advantages based on group-relative reward normalization, GRPO efficiently reinforces beneficial reasoning steps while reducing computational overhead. This approach results in significant performance gains—evidenced by accuracies of 88.2% on GSM8K and 51.7% on MATH—surpassing many open-source models and even challenging certain closed-source systems.
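Here is a minimal sketch of the group-relative advantage at the heart of GRPO, assuming one scalar reward per sampled output (outcome supervision). The clipped policy-gradient objective and KL penalty that surround it in the paper are omitted.

```python
import torch

def grpo_advantages(rewards: torch.Tensor) -> torch.Tensor:
    """Normalize rewards within each group of G outputs per question.

    rewards: shape (num_questions, G), one scalar reward per sampled output.
    Returns advantages of the same shape: (r - mean) / std within each group,
    so no learned value network is needed as a baseline.
    """
    mean = rewards.mean(dim=1, keepdim=True)
    std = rewards.std(dim=1, keepdim=True)
    return (rewards - mean) / (std + 1e-8)

# Example: 2 questions, 4 sampled answers each; reward 1.0 if correct, else 0.0.
rewards = torch.tensor([[1.0, 0.0, 0.0, 1.0],
                        [0.0, 0.0, 1.0, 0.0]])
adv = grpo_advantages(rewards)
# Correct answers receive positive advantage and incorrect ones negative,
# relative to their own group; these advantages then weight a PPO-style
# clipped objective, with a KL penalty against a reference model.
```

Because the baseline is computed from the group itself, GRPO trains only the policy network, roughly halving the memory and compute that PPO spends on a separate value model.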
Beyond summarizing the paper's technical contributions, several key insights emerge from the authors' work: carefully filtered web data can rival or outperform smaller curated math corpora; code pre-training measurably helps mathematical reasoning, both with and without tool use; arXiv papers, perhaps surprisingly, brought little benefit in the authors' experiments; and reinforcement learning raises Maj@K accuracy without raising Pass@K, suggesting that GRPO makes correct answers more likely to be sampled rather than expanding the model's fundamental capability.
Overall, the authors’ work not only pushes the boundaries of mathematical reasoning in open-source models but also lays a solid foundation for future explorations in data curation, multimodal training, and reinforcement learning strategies.
I'm excited for what the future holds for LLM and AI research. Like many folks in the tech space, I've been closely following and doing my best to understand the latest revolution in AI technology with DeepSeek. In trying to understand what DeepSeek-R1 did to so vastly improve on other reasoning models, I came across DeepSeekMath, and now I feel more prepared to tackle big problems in the AI space.