
Vincent: Building a specialist model that stays fresh

Dec 3, 2025 by Terminal Team

Vincent is our new specialized model designed specifically for scaffolding AI applications. On our internal benchmarks¹, the model significantly outperforms general frontier models, which often struggle with the rapid pace of breaking changes in AI SDKs.

We achieve this by abandoning the "one-size-fits-all" approach. Instead, we train a smaller, focused model on a clean, sandboxed dataset. Crucially, Vincent is refreshed regularly via lightweight adapters and uses tool-based search to bridge the gap between "frozen" training data and reality.

[Chart: "Search isn't enough. Native knowledge + search wins." Y-axis: SnapBench Score (compilation success rate, %).]

Figure 1: The "Velocity Gap." Frozen models (gray) fail on breaking changes. Simple RAG (cyan) struggles because it retrieves conflicting docs (v1 vs. v2) and lacks the context to distinguish them. Vincent (green), combining fresh adapters with tool use, adapts immediately.

The Problem: The "Unknown Unknowns"

Our motivation comes from wrestling with the "Velocity Gap." General AI models are frozen in time. In the AI engineering world, a knowledge cutoff of even a few months can be fatal.

We found that simply giving a frozen model access to search (RAG) often fails.

Why? Because simple RAG retrieves both outdated and current documentation. A frozen model, lacking the context of the breaking change, often treats the outdated docs as authoritative. It hallucinates a broken hybrid of v1 and v2 syntax that looks correct but fails to compile.
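To make this failure mode concrete, here is a toy Python example. The VectorClient SDK and both of its API versions are invented purely for illustration:

```python
# Hypothetical SDK, invented for illustration: the v2 client after a
# breaking change (factory constructor, renamed method and parameters).
class VectorClient:
    def __init__(self, token: str):
        self.token = token

    @classmethod
    def create(cls, token: str) -> "VectorClient":
        return cls(token)

    def search(self, query: str, limit: int = 5) -> list:
        return []

# v1 (outdated docs): VectorClient(api_key=...).query(text=..., top_k=...)
# A frozen model that retrieves both v1 and v2 docs tends to emit a hybrid:
client = VectorClient.create(token="...")
try:
    client.search(text="hello", top_k=5)  # v2 method name, v1 keyword args
except TypeError as err:
    print(err)  # looks plausible, fails at runtime: unexpected keyword 'text'
```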

Phase 1: The "Clean Set" (SFT)

To fix this, we realized we couldn't just fine-tune on a raw dump of GitHub. Public code is often broken or insecure.

We built a custom evaluation engine that sandboxes and verifies open-source snippets. Instead of training on everything, we aggressively filter the dataset. We only train on code that compiles and adheres to modern patterns.
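As a rough sketch of this compile-gated filtering (the build command and pass criterion below are simplified stand-ins, not our actual evaluation engine):

```python
import shutil
import subprocess
import tempfile
from pathlib import Path

def compiles_in_sandbox(snippet: str, timeout: int = 30) -> bool:
    """Write a snippet into an isolated temp dir and try to build it.

    Stand-in check: byte-compiles a Python file. A real pipeline would run
    each project's own toolchain (tsc, next build, ...) inside a container.
    """
    workdir = Path(tempfile.mkdtemp(prefix="cleanset-"))
    try:
        src = workdir / "snippet.py"
        src.write_text(snippet)
        result = subprocess.run(
            ["python", "-m", "py_compile", str(src)],
            cwd=workdir, capture_output=True, timeout=timeout,
        )
        return result.returncode == 0
    except subprocess.TimeoutExpired:
        return False
    finally:
        shutil.rmtree(workdir, ignore_errors=True)

candidates = ["print('ok')", "def broken(:"]  # toy stand-ins for scraped code
clean_set = [s for s in candidates if compiles_in_sandbox(s)]  # keeps only the first
```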

This "Clean Set" is used for Supervised Fine-Tuning (SFT). This teaches Vincent the high-level "shape" of modern architecture like how to structure a Next.js App Router project or a LangChain agent before it ever sees a user prompt.

Phase 2: Hybrid Updates (LoRA + Search)

However, even a clean dataset decays. To handle "Day 0" changes, we use a two-part system:

  1. Regular LoRA Adapters: We don't fully retrain the model every time. Instead, we train lightweight Low-Rank Adaptation (LoRA) adapters on recent changelogs and verified repos. This efficiently layers fresh syntax on top of the frozen base weights without the cost of a full training run (see the first sketch after this list).
  2. Tool-Use Policy: We trained Vincent to treat search as a browser-style tool. When the model detects ambiguity (e.g., a library version mismatch), it pauses generation to fetch the exact new syntax rather than guessing (see the second sketch below).
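To illustrate the adapter step, here is a minimal sketch using Hugging Face's peft library. The base checkpoint name, target modules, and hyperparameters are placeholders, not Vincent's actual configuration:

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Placeholder checkpoint; Vincent's actual base model differs.
base = AutoModelForCausalLM.from_pretrained("some-base-code-model")

lora_cfg = LoraConfig(
    r=16,                                 # low-rank dimension of the update
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],  # typical attention projections; model-dependent
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()        # usually well under 1% of base weights

# Train only the adapter on recent changelogs / verified repos, then ship the
# small adapter file; the frozen base weights are never touched.
```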
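And a minimal sketch of the pause-and-fetch loop. The tool-call markers, the search_docs helper, and the model.generate interface are assumptions for illustration, not Vincent's real protocol:

```python
import json

def search_docs(query: str) -> str:
    """Hypothetical search tool; a real one would query a live docs index."""
    return "v2: VectorClient.create(token=...), then .search(query=..., limit=...)"

def generate_with_tools(model, prompt: str, max_rounds: int = 3) -> str:
    """Pause generation at each tool call, fetch fresh docs, then resume."""
    transcript = prompt
    for _ in range(max_rounds):
        out = model.generate(transcript)   # assumed to return text and stop
        if "<tool_call>" not in out:       # after emitting a closing tool tag
            return out                     # no ambiguity detected: done
        payload = out.split("<tool_call>")[1].split("</tool_call>")[0]
        docs = search_docs(json.loads(payload)["query"])
        transcript += out + f"\n<tool_result>{docs}</tool_result>\n"
    return model.generate(transcript)      # final answer with fetched context
```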

Phase 3: Alignment (DPO and RL)

To refine the model, we use Direct Preference Optimization (DPO) and Reinforcement Learning (RL).

For DPO, we show the model pairs of code: one using an old pattern and one using the new pattern. We train the model to prefer the new syntax, effectively counteracting the massive volume of outdated tutorials online.
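Concretely, a pair might look like the toy record below, trained against the standard DPO objective. This is a minimal sketch assuming per-sequence log-probabilities from the policy and a frozen reference model; the pair contents are illustrative:

```python
import torch
import torch.nn.functional as F

# Toy preference pair; the chosen completion uses the new (v2) syntax,
# the rejected one the old (v1) syntax.
pair = {
    "prompt":   "Instantiate a client for the vector SDK.",
    "chosen":   "client = VectorClient.create(token=...)",  # new pattern
    "rejected": "client = VectorClient(api_key=...)",       # old pattern
}

def dpo_loss(pi_chosen, pi_rejected, ref_chosen, ref_rejected, beta=0.1):
    """Standard DPO loss over summed per-sequence log-probabilities.

    Pushes the policy to prefer the chosen (new-syntax) completion relative
    to a frozen reference model, counteracting stale patterns in the data.
    """
    logits = beta * ((pi_chosen - ref_chosen) - (pi_rejected - ref_rejected))
    return -F.logsigmoid(logits).mean()

# Toy tensors standing in for real log-probs:
loss = dpo_loss(torch.tensor([-4.0]), torch.tensor([-6.0]),
                torch.tensor([-5.0]), torch.tensor([-5.5]))
```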

For RL, we optimize for Tool Use. We define a reward function based on successful execution. The model is rewarded when it successfully calls a search tool that leads to valid code, and penalized when it guesses blindly or searches unnecessarily.
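A minimal sketch of that reward shaping, assuming each rollout records compilation status and search behavior (the field names and magnitudes are placeholders, not our production values):

```python
from dataclasses import dataclass

@dataclass
class Episode:
    """Placeholder rollout record, not our actual schema."""
    compiled: bool        # did the generated project build/start?
    was_ambiguous: bool   # did the task involve a version mismatch?
    search_calls: int     # how many times the model invoked search
    needed_searches: int  # how many lookups the task actually required

def tool_use_reward(ep: Episode) -> float:
    """Reward verified execution; shape search toward necessity."""
    reward = 1.0 if ep.compiled else -1.0
    if ep.compiled and ep.search_calls > 0:
        reward += 0.2                       # search that led to valid code
    if not ep.compiled and ep.was_ambiguous and ep.search_calls == 0:
        reward -= 0.5                       # guessed blindly instead of searching
    reward -= 0.05 * max(0, ep.search_calls - ep.needed_searches)
    return reward                           # unnecessary searches cost a little

print(tool_use_reward(Episode(True, True, 1, 1)))   # 1.2: searched, then compiled
```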

A Continuous Battle

It is important to acknowledge that this is not a solved problem. The AI ecosystem moves faster than any single team can track perfectly. While Vincent drastically reduces the time spent debugging "yesterday's syntax," it is a continuous battle against entropy.

We view Vincent not as a final answer, but as the best current attempt at a "living" code generator.

¹ Benchmarked on SnapBench, our internal evaluation set of 500 unseen AI app scaffolding tasks. "Compilation Rate" measures the percentage of generated projects that start the dev server without fatal errors on the first attempt.