Training LLMs to Predict World Events (Guest Post with Mantic)

Mantic have been using Tinker since it launched. This guest post is a technical deep dive on what they have built so far.

The top AI forecasting systems are approaching superforecaster-level accuracy on geopolitics and current affairs ("It's over", Polymarket, 2026). This is exciting because scalable, automated forecasting could significantly improve the quality of decision-making across the economy and in government.

To date, the most successful recipe in forecasting tournaments has been to combine an off-the-shelf LLM (like Gemini 3 or GPT-5) with forecasting-specific context-gathering. These models, to our knowledge, have not been explicitly trained for forecasting. Can we improve the recipe by replacing them with models fine-tuned specifically for forecasting?

We target “judgmental forecasting”: prediction problems that require human-like research and reasoning. Judgmental forecasting is needed for domains like geopolitics, politics, technology, business, and economic policy, where there often isn’t enough data for a standard statistical approach like time-series extrapolation. It was popularized by the book Superforecasting, and more recently by prediction markets like Polymarket and Kalshi.

In this post, we show it’s possible to significantly improve the forecasting performance of gpt-oss-120b using reinforcement learning. With Tinker, we fine-tune a model on around 10,000 binary questions of the form “Will [event] occur before [date]?”. We reward the model for putting greater probability on the correct real-world outcome.

In a head-to-head contest, the fine-tuned model achieves marginally superior performance to the frontier LLMs (see Figure 1), despite much lower initial performance. We find that providing forecast-specific context increases the gains from fine-tuning.

In the optimal ensemble of different models (which outperforms any single model), Grok 4 and our fine-tuned model are the most important contributors. The fine-tuned model learns a forecasting policy that is as accurate as the frontier LLMs, yet decorrelated from them.

Together, the results demonstrate that on-task training can extend the state-of-the-art in AI forecasting.

RL fine-tuning makes gpt-oss-120b competitive with the frontier LLMs on questions from the Metaculus AI Benchmark Q2 2025. Naively predicting 50% on every question would get a score of 0, and perfect foresight would get a score of 100, per the construction of the Metaculus “baseline score”. Naively predicting 18.8% on every question (the rate at which the equivalent questions resolved “yes” in the previous tournament, Q1 2025) yields a score of 22.3, which we use to truncate the Y-axis. Fine-tuning improves gpt-oss-120b’s score from 38.6 to 45.8, on par with the best general models.

The best existing recipe uses off-the-shelf LLMs

The past two years have seen considerable progress in AI judgmental forecasting capabilities (Approaching human-level forecasting with language models, Halawi et al., 2024). In the Metaculus Cup, a major tournament for amateur and professional forecasters, the best AI systems now rival the top humans (Figure 2).

Human and AI scores in the Metaculus Cup, a premier forecasting competition. Scores from the top 5 AI forecasters have been steadily improving since they first entered in the Summer of 2024. Mantic first entered in the Summer of 2025 and then in Fall 2025 beat the community prediction and the majority of professional forecasters. These results were without fine-tuning.

The trend has been driven by more capable off-the-shelf LLMs, and accelerated by more sophisticated forecasting architectures. Our architecture – which has performed well in recent Metaculus tournaments – consists of two standard phases: (1) a research phase, and (2) a prediction phase (Figure 3). This two-phase process appears in early work on AI forecasting; see Approaching human-level forecasting with language models (Halawi et al., 2024) and Forecasting Future World Events with Neural Networks (Zou et al., 2022).

Mantic’s architecture. The research phase takes the forecasting question as input and performs deep research to collect information relevant to the question which goes into the prompt for the prediction LLM. The prediction LLM outputs chain-of-thought reasoning and specifies a probability distribution using specialized tools.

The research phase is conducted by deep research agents that collect the context needed to make a good prediction. For example, for the question “Will the United States attack Venezuela before 2026?”, search agents will find information about military buildup in the Caribbean, statements from President Trump, the health of the Venezuelan economy, and so on. The collected research is summarized into a prompt for the prediction phase.

The model’s task at the prediction phase is to use our specialized tools to output a probability distribution. In this post, we consider a canonical type of forecasting question: “Will [event] occur before [date]?”. We instruct the LLM to parameterize a mixture model for when the event will next occur – illustrated in Figure 4. The mixture model defines a cumulative distribution function, and from that we can read off the probability of the event occurring before the date specified in the original question.

Illustrative mixture model. The LLM selects: the number of components in the mixture, their parameters, and their respective weights. The LLM is prompted to select components capturing different scenarios that could lead to the event occurring. The final prediction is a weighted combination of the components.
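To make the tool output concrete, here is a minimal sketch of how a mixture over time-to-event distributions yields the answer to a “before [date]” question. The choice of lognormal components, the function names, and the parameters are our own illustrative assumptions, not Mantic’s actual tool interface.

```python
import math

def lognormal_cdf(t, mu, sigma):
    """CDF of a lognormal time-to-event distribution at time t (in days)."""
    if t <= 0:
        return 0.0
    return 0.5 * (1.0 + math.erf((math.log(t) - mu) / (sigma * math.sqrt(2))))

def mixture_event_probability(components, weights, days_until_deadline):
    """P(event occurs before the deadline) under a weighted mixture of
    component CDFs, each component capturing one scenario."""
    assert abs(sum(weights) - 1.0) < 1e-9, "weights must sum to 1"
    return sum(w * lognormal_cdf(days_until_deadline, mu, sigma)
               for w, (mu, sigma) in zip(weights, components))

# Two scenarios: a fast path (median ~e^3 ≈ 20 days, weight 0.3)
# and a slow path (median ~e^5 ≈ 150 days, weight 0.7).
p = mixture_event_probability([(3.0, 0.5), (5.0, 0.5)], [0.3, 0.7], 90)
```

The final probability is read off the mixture CDF at the question’s deadline; components further in the future contribute less mass before that date.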

In past tournaments, we’ve used off-the-shelf models as the prediction LLM. Existing literature has shown promising results from RL fine-tuning small models using a simple architecture (Outcome-based Reinforcement Learning to Predict the Future, Turtel et al., 2025). Can we improve frontier AI forecasting systems through on-task fine-tuning?

Training details

Datasets

We train the prediction LLM on ~10k questions about whether an event will happen by a given date. The questions span August 2024 to December 2025, after the model’s knowledge cutoff, so each resolution is known to us but not to the model. We generated the training set using an LLM pipeline similar to existing work (Automating Forecasting Question Generation and Resolution for AI Evaluation, Bosse et al., 2026; Future-as-Label: Scalable Supervision from Real-World Outcomes, Turtel et al., 2026). Before training, we run the research phase for each question and store static prompts for the prediction LLM.

We test on unseen questions from the Q2 2025 Metaculus AI Benchmark Tournament. The Fall 2025 iteration would have been a more obvious choice (being more recent) but contains lower-quality questions, indicated by less performance differentiation between strong and weak forecasters. We compare models whose knowledge cutoff is before this tournament’s start date. The full list of questions can be accessed on GitHub.

Three example binary event questions from the Metaculus Q2 2025 AI Benchmark.

We evaluate using the baseline score, following the Metaculus platform. This is log scoring, i.e. ln(probability assigned to the true outcome), rescaled such that 100 is the maximum possible score and 0 is the score for a uniform prediction (in our setting, 50%).
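For a binary question, the rescaling described above works out to a base-2 log score relative to the uniform 50% prediction. A sketch, consistent with that description (the function name is ours):

```python
import math

def baseline_score(p_yes, resolved_yes):
    """Baseline score for a binary question: log score rescaled so that
    predicting 50% scores 0 and a correct certain prediction scores 100."""
    p_true = p_yes if resolved_yes else 1.0 - p_yes
    return 100.0 * math.log2(p_true / 0.5)
```

For example, predicting 80% on a question that resolves “yes” scores about 67.8, while predicting 80% on a “no” resolution scores about -132.2: the score is unbounded below, which penalises confident misses heavily.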

Implementation

We run the experiments on Tinker. Of the models available through the API, we choose to train gpt-oss-120b because of its strong initial performance — second only to Kimi K2.5 — while being cheaper and faster.

We use a standard policy gradient algorithm with GRPO-style advantage normalisation, without division by the standard deviation. For rewards we use the Brier score, which is strictly proper. We found that the Brier score leads to more stable training than the log score, even though the log score is also strictly proper. This could be because the Brier score is bounded in [0, 1] and so produces lower-variance policy gradient estimates.
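A minimal sketch of the reward and advantage computation just described, for one group of rollouts on a single question; `brier_reward` and `grpo_advantages` are illustrative names, not Tinker API calls:

```python
import numpy as np

def brier_reward(p_yes, resolved_yes):
    """Brier-based reward in [0, 1]: 1 minus the squared forecast error."""
    outcome = 1.0 if resolved_yes else 0.0
    return 1.0 - (p_yes - outcome) ** 2

def grpo_advantages(group_probs, resolved_yes):
    """Mean-centred advantages within one rollout group, without
    dividing by the group standard deviation."""
    rewards = np.array([brier_reward(p, resolved_yes) for p in group_probs])
    return rewards - rewards.mean()
```

Because the reward is strictly monotonic in the predicted probability for a fixed outcome, any two rollouts with different probabilities receive different rewards, so within-group ties are rare even at small group sizes.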

Open-source RL packages often use vLLM for the sampling backend and FSDP for the training backend. These can disagree on token probabilities produced by identical policies, which biases policy gradient estimates and destabilises training. We found these discrepancies to be lower on Tinker’s integrated infrastructure, but further mitigate them with an importance sampling correction on the advantages.
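One simple form such a correction can take is scaling each rollout’s advantage by the (clipped) importance ratio between the two backends’ sequence probabilities. A sketch under that assumption; the function name and clipping scheme are ours, not Tinker’s implementation:

```python
import numpy as np

def is_corrected_advantages(adv, logp_train, logp_sample, clip=2.0):
    """Scale each rollout's advantage by the importance ratio between the
    training backend's sequence log-probability and the sampler's,
    clipped to [1/clip, clip] to bound the variance of the estimate."""
    log_ratio = np.clip(logp_train - logp_sample, -np.log(clip), np.log(clip))
    return adv * np.exp(log_ratio)
```

When the two backends agree exactly, the ratio is 1 and the advantages pass through unchanged; the correction only matters where the probabilities diverge.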

The Brier score reward function is strictly monotonic in the predicted probability for a fixed outcome, so different rollouts almost always produce different rewards. This makes within-group reward ties extremely unlikely and lets us train with a relatively small group size (8) without needing to break ties or induce variance. We use a batch size of 64, as we find that larger batch sizes tend to destabilise training.

Results

Fine-tuning elevates gpt-oss-120b to frontier LLM performance

The model’s test set score improves through training (Figure 6), moving from an initial score of 38.6 mean baseline points per question (below all frontier models) to a final score of 45.8 mean baseline points per question (marginally above). This demonstrates that on-task forecasting training can provide a large performance uplift.

Test set baseline score of gpt-oss-120b with and without Mantic research and tools. In the model-only setup, test set performance improves, but never reaches the initial score of gpt-oss-120b with Mantic research and tools. With Mantic research and tools, gpt-oss-120b climbs 7 points through training and marginally exceeds the performance of Gemini 3 Pro. Training continued for further steps but performance no longer improved.

The LLM fine-tuned without our pre-generated research context, and without the tools for constructing mixture models, gains only 3 points from training instead of 7. This suggests these components improve the optimization dynamics, in addition to improving the initialization.

The fine-tuned model is an important member of the optimal ensemble

In human forecasting, there is a well-known “wisdom of the crowd” effect: aggregate forecasts from multiple people often outperform any one individual. This effect, in part, explains the impressive accuracy of prediction markets. Can we get the same benefit from ensembling the predictions of different LLMs? (See also Wisdom of the Silicon Crowd: LLM Ensemble Prediction Capabilities Rival Human Crowd Accuracy, Schoenegger et al., 2024.)

To contribute to an ensemble, an LLM must be sufficiently accurate on its own and also decorrelated from the other LLMs in the ensemble. Predictions from most frontier LLMs, while accurate, add little diversity beyond the top-performing model (in our case, Gemini 3 Pro); see Figure 7. Among the frontier LLMs, Grok 4 is the exception: its predictions score well whilst correlating less with the other frontier LLMs.

Mean baseline score on binary event questions from the Metaculus Q2 2025 AI Benchmark, plotted against each model’s Jensen-Shannon divergence from Gemini 3 Pro, for a suite of closed-source and open-source LLMs. Marker colour indicates each model’s weight in the optimal 5-sample ensemble. The optimal ensemble consists of fine-tuned gpt-oss-120b (40%), Gemini 3 Pro (20%), GPT-5 (20%) and Grok 4 (20%).
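The divergence axis in this comparison can be computed by treating each model’s binary forecast on a question as a Bernoulli distribution and averaging over questions. A sketch of the per-question Jensen-Shannon divergence (function names are ours):

```python
import math

def kl_bernoulli(p, q, eps=1e-9):
    """KL divergence between two Bernoulli distributions, clamped for stability."""
    p = min(max(p, eps), 1.0 - eps)
    q = min(max(q, eps), 1.0 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def js_divergence(p, q):
    """Jensen-Shannon divergence between two binary forecasts:
    the mean KL of each forecast to their midpoint mixture."""
    m = 0.5 * (p + q)
    return 0.5 * kl_bernoulli(p, m) + 0.5 * kl_bernoulli(q, m)
```

Unlike raw KL, this quantity is symmetric and bounded, which makes it a convenient summary of how differently two models forecast the same question.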

The optimal ensemble, on a budget of 5 samples, is 2 samples from our fine-tuned gpt-oss-120b plus 1 sample from each of Gemini 3 Pro, Grok 4, and GPT-5. We can test each model’s contribution by removing it, recomputing the optimal ensemble, and seeing how much the score degrades. We find that Grok 4 is the least replaceable, with fine-tuned gpt-oss-120b in second place. Other models can be replaced with little to no performance degradation (Figure 8).
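With a budget of only 5 samples, the optimal allocation can be found by brute-force enumeration. The sketch below assumes predictions are pooled by taking the mean probability, which may differ from Mantic’s actual aggregation; the data layout and names are hypothetical.

```python
import itertools
import math

def rescaled_log_score(p_true):
    """Log score rescaled so a 50% prediction scores 0 and certainty scores 100."""
    return 100.0 * math.log2(max(p_true, 1e-9) / 0.5)

def best_allocation(model_preds, outcomes, budget=5):
    """Enumerate every way to spend `budget` samples across models and
    return (score, counts) for the best mean-pooled ensemble.
    model_preds[m][q] is model m's probability on question q."""
    n_models = len(model_preds)
    best = None
    for counts in itertools.product(range(budget + 1), repeat=n_models):
        if sum(counts) != budget:
            continue
        total = 0.0
        for q, resolved_yes in enumerate(outcomes):
            # Mean-pool: each sample contributes its model's probability.
            p = sum(c * preds[q] for c, preds in zip(counts, model_preds)) / budget
            total += rescaled_log_score(p if resolved_yes else 1.0 - p)
        score = total / len(outcomes)
        if best is None or score > best[0]:
            best = (score, counts)
    return best
```

Replaceability then falls out directly: drop one model, rerun `best_allocation` over the rest, and measure how far the best score falls.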

When selecting an ensemble of frontier and open-source models, Grok 4 and fine-tuned gpt-oss-120b are the least replaceable. Model replaceability is defined as the reduction in score incurred when removing a model from the optimal ensemble. By definition, if a model is not included in the optimal ensemble, there is no cost to removing it.

GPT-5 and Gemini 3 Pro make similar predictions, and thus don’t benefit much from ensembling with each other. Both models improve from mixing their predictions with either Grok 4 or the fine-tuned gpt-oss-120b, and Grok 4 also benefits most from mixing with the fine-tuned model. In either optimal 3-way ensemble from Figure 9, the fine-tuned model gets about half of the total weight.

Three-way ensembles. Left: The optimal three-way ensemble between the fine-tuned gpt-oss-120b, Gemini 3 Pro and GPT-5 is weighted 56%, 26% and 18% respectively, depicted by the black star. Right: The optimal three-way ensemble between the fine-tuned gpt-oss-120b, Gemini 3 Pro and Grok 4 is weighted 48%, 26% and 26% respectively, depicted by the white star.

Conclusions and next steps

We have shown that we can elevate the forecasting performance of gpt-oss-120b to match frontier LLMs with RL fine-tuning. This work can be extended in many ways, some of which we have already begun exploring:

  1. Training larger models. Tinker enables training larger models with higher initial performance than gpt-oss, specifically Kimi K2.5.
  2. Training on all question formats. We are already training models on numerical questions such as economic indicators and multiple choice questions such as election results.
  3. Improved question sets. As the models become stronger forecasters, we need more challenging forecasting questions.
  4. Information retrieval inside the loop. We could give the prediction LLM tools for information retrieval and include this in the training loop.

Citation

Please cite this work as:

Jeen, Scott; Aitchison, Matthew; and Mantic, "Training LLMs to Predict World Events", 
Thinking Machines Lab: News, Mar 2026.

Or use the BibTeX citation:

@article{scott2026forecasting,
  author = {Scott Jeen and Matthew Aitchison and Mantic},
  title = {Training LLMs to Predict World Events},
  journal = {Thinking Machines Lab: News},
  year = {2026},
  note = {https://thinkingmachines.ai/news/training-llms-to-predict-world-events/}
}