The impressive Qwen2.5-Math-Instruct
From the Qwen2.5-Math series blog:
🚨 Qwen2.5-Math mainly supports solving English and Chinese math problems through CoT and TIR. We do not recommend using this series of models for other tasks.
As a software developer, I have always been skeptical of applications that use Large Language Models (LLMs) to solve math problems. The Qwen2.5-Math technical report states my concerns clearly: “While CoT plays a vital role in enhancing the reasoning capabilities of LLMs, it faces challenges in achieving computational accuracy and handling complex mathematical or algorithmic reasoning tasks, such as finding the roots of a quadratic equation or computing the eigenvalues of a matrix.” LLMs can “understand”[1] math problems in natural language, but they cannot replace your phone’s calculator. Never trust them to add, subtract, multiply, or divide numbers unless they use external tools to do the computation.
What impressed me about Qwen2.5-Math, from an application developer’s perspective, is its Tool-integrated Reasoning (TIR) capability. This means Qwen2.5-Math can reason over a math problem AND write the Python code to make the calculations precise.[2] TIR makes arithmetic mistakes much easier to avoid. I did try a coding-specific LLM for solving math problems, but this specialized math model is more consistent in its output format and therefore more reliable. Other models besides Qwen2.5-Math also integrate code for solving math problems, such as Microsoft’s ToRA, but I haven’t tried them.
I integrated Qwen2.5-Math into a project where it is responsible for computing how much a customer will pay based on their order. Specifically, I used Qwen2.5-Math-7B-Instruct-GGUF with the llama.cpp server. To make the model answer in TIR mode, the system prompt should be worded in a specific way, as instructed in the model card:
# TIR
messages = [
    {"role": "system", "content": "Please integrate natural language reasoning with programs to solve the problem above, and put your final answer within \\boxed{}."},
    {"role": "user", "content": prompt},
]
For my use case, it’s like this:
messages = [
    {
        "role": "system",
        "content": "You will assist in computing sales transactions in pesos as the currency or monetary unit. Please integrate natural language reasoning with programs to solve the problems, and put your final answer within \\boxed{}.",
    },
    {
        "role": "user",
        "content": f"{transcript}\n\nImportant: When writing the python code, ALWAYS store the final answer in the 'answer' variable.",
    },
]
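To show how these messages travel to the model, here is a minimal sketch of the request as I make it against the llama.cpp server’s OpenAI-compatible chat endpoint. The URL, port, and function names are illustrative, not the project’s actual code:

```python
import json
import urllib.request

# Assumption: a llama.cpp server is running locally; the host and port
# below are illustrative defaults, not guaranteed to match your setup.
LLAMA_SERVER_URL = "http://localhost:8080/v1/chat/completions"


def build_request(transcript: str) -> dict:
    """Build the non-streaming chat request carrying the TIR system prompt."""
    return {
        "messages": [
            {
                "role": "system",
                "content": (
                    "You will assist in computing sales transactions in pesos "
                    "as the currency or monetary unit. Please integrate natural "
                    "language reasoning with programs to solve the problems, "
                    "and put your final answer within \\boxed{}."
                ),
            },
            {
                "role": "user",
                "content": (
                    f"{transcript}\n\nImportant: When writing the python code, "
                    "ALWAYS store the final answer in the 'answer' variable."
                ),
            },
        ],
        # Request the whole response at once rather than streaming,
        # which keeps the later code-extraction step simple.
        "stream": False,
    }


def query_model(transcript: str) -> str:
    """Send the request and return the model's full text response."""
    payload = json.dumps(build_request(transcript)).encode("utf-8")
    req = urllib.request.Request(
        LLAMA_SERVER_URL,
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

The non-streaming choice matters here: with the whole response in hand, pulling out the Python block is a single pattern match instead of incremental buffering.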
Rather than streaming the model’s response, I opted to receive it as a whole, which made extracting the Python code easy. The extracted code is then executed with Python’s exec() (fine for prototyping only; use a container sandbox like gVisor in production). Below is an example query I threw at Qwen2.5-Math-7B-Instruct-Q4_K_M. Though this is just a simple arithmetic problem, notice how the model “reasons” and writes the code for it.
The customer ordered a Spanish Latte at 110 pesos with a 5% discount, an Americano at 100 pesos without a discount, and a Strawberry Matcha at 105 pesos with a 20% discount. How much will the customer pay?
Model response:
Let's calculate the total cost step-by-step using Python code.
1. Calculate the discounted price of the Spanish Latte.
2. The Americano has no discount.
3. Calculate the discounted price of the strawberry matcha.
Then sum up the prices to get the final amount the customer will pay.
Here's the Python code to do this:
# Prices of the items
spanish_latte_price = 110
americano_price = 100
strawberry_matcha_price = 105
# Discounts
spanish_latte_discount = 0.05
strawberry_matcha_discount = 0.20
# Calculate the discounted prices
discounted_spanish_latte = spanish_latte_price * (1 - spanish_latte_discount)
discounted_strawberry_matcha = strawberry_matcha_price * (1 - strawberry_matcha_discount)
# Calculate the total cost
total_cost = discounted_spanish_latte + americano_price + discounted_strawberry_matcha
print(total_cost)
The customer will pay \boxed{288.5} pesos.
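The extract-and-execute step described earlier can be sketched roughly like this, assuming the model returns its code in a fenced ```python block as in the response above. The function names are illustrative, and, as noted, exec() belongs in a sandbox for anything beyond prototyping:

```python
import re

# The fence marker is built at runtime only so this listing doesn't
# contain literal triple-backtick lines; it is an ordinary "```".
FENCE = "`" * 3


def extract_python_code(response: str) -> str:
    """Pull the first fenced Python block out of the model's full response."""
    match = re.search(FENCE + r"python\n(.*?)" + FENCE, response, re.DOTALL)
    if match is None:
        raise ValueError("no Python code block found in the response")
    return match.group(1)


def run_model_code(response: str):
    """Execute the extracted code and read back the 'answer' variable the
    user prompt asked the model to set. exec() is for prototyping only;
    sandbox it (e.g. with gVisor) in production."""
    namespace = {}
    exec(extract_python_code(response), namespace)
    return namespace["answer"]
```

With a response shaped like the one above, but storing the total in an `answer` variable as my user prompt instructs, run_model_code returns 288.5.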
And here’s a demo of Saia, which shows the model in action.
[1] Whether LLMs can understand and/or reason, and what understanding or reasoning even means in the context of LLMs, is controversial. My position is that, since they are probabilistic machines, they relate the concepts in a given context by their probabilities; they do not truly understand or reason.
[2] Yang, A., Zhang, B., Hui, B., Gao, B., Yu, B., Li, C., … Zhang, Z. (2024). Qwen2.5-Math Technical Report: Toward Mathematical Expert Model via Self-Improvement. arXiv [Cs.CL]. Retrieved from http://arxiv.org/abs/2409.12122