AI Is Not Software. Yet.

Where LLMs actually sit on the maturity curve — and what that means if you’re building with them in finance.

Every technology follows a maturity arc. It starts rough, gets iterated on, stabilises, and eventually reaches a point where you can depend on it. Software has well-understood stages for this: pre-alpha, alpha, beta, release candidate, general availability, and eventually long-term support. Each stage has a contract with the people using it.

AI language models don’t follow that contract. And that’s something anyone building with them — especially in finance — needs to understand before they go too far down the path.

The Software Maturity Arc (A Quick Recap)

Before we talk about AI, it helps to anchor the conversation in how traditional software matures.

Pre-Alpha / Alpha — Proof of concept. Expect it to break. Only internal teams or brave early adopters go near it. No stability guarantees. APIs change overnight.

Beta — Wider access, but still experimental. Known bugs. Rough edges. You’re a tester, not a user. Companies explicitly say: don’t use this in production.

Release Candidate (RC) — Feature-complete. Entering a hardening phase. Breaking changes are rare but not unheard of. Most bugs are cosmetic. Some serious teams start building.

General Availability (GA) — Declared stable. Versioned. Documented. You can build against it and expect it to stay that way. This is the production-ready milestone.

Long-Term Support (LTS) — The gold standard. Patches only, no new features. Often the version enterprise software lives on for years. The contract is: we will not change how this behaves.

Most of the software your firm relies on sits at GA or LTS. Your accounting platform, your ERP, your file storage. They change — but on scheduled release cycles, with changelogs, with migration paths, and with deprecation notices measured in months, not days.

Where AI LLMs Actually Sit

If you apply that framework honestly, most commercial LLMs today are sitting somewhere between Beta and RC — with the behaviour of a Pre-Alpha when it comes to stability guarantees.

Here’s why.

Models Are Continuously Updated — Without Versioning Contracts

When OpenAI, Anthropic, or Google updates its flagship model, it doesn’t always issue a deprecation notice. It doesn’t always version the model in a way that lets you pin to a known state. The model you integrated against in January may respond differently by April, sometimes subtly, sometimes in ways that break your application logic entirely.

This is not a bug in the traditional sense. It’s a design choice. The companies running these models are in a race to improve them, and stability is a second-order concern compared to capability. The model gets smarter, but your application gets unpredictable.

For consumer chat products, this is mostly fine. A slightly different writing style doesn’t matter if you’re drafting a birthday message.

For anything in finance, it matters a lot.

The Hallucination Problem Hasn’t Been Solved

AI models can confidently state things that are wrong. This is called hallucination, and while newer models hallucinate less than older ones, it has not been eliminated. In a general chatbot context, this is an inconvenience. In a financial context, it is a liability.

A knowledge-based chatbot that tells a client the wrong tax treatment for a transaction, or surfaces an outdated regulatory reference with full confidence, is worse than no chatbot at all. It creates a false sense of authority.

Context Windows Are Finite — And That Has Consequences

Every AI model has a limit to how much it can “see” at once. Throw a 300-page consolidation report at it and it may silently skip sections, compress details, or lose track of earlier context. There’s no error message. There’s no flag. You just get an answer that looks complete but isn’t.
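One way to avoid the silent-truncation failure described above is to check document size before the model ever sees it, and to fail loudly rather than quietly. The sketch below is illustrative only: the four-characters-per-token ratio is a rough assumption (real tokenizers vary by model), and the names estimate_tokens and split_into_chunks are hypothetical helpers, not part of any vendor SDK.

```python
# Assumption: roughly 4 characters per token. Real tokenizers differ by model,
# so treat this as a rough estimate, not an exact count.
CHARS_PER_TOKEN = 4


def estimate_tokens(text: str) -> int:
    """Crude token estimate; swap in the provider's tokenizer if one is available."""
    return -(-len(text) // CHARS_PER_TOKEN)  # ceiling division


def split_into_chunks(pages: list[str], max_tokens: int) -> list[list[int]]:
    """Group page indices into consecutive chunks that each fit within max_tokens.

    Raises ValueError if a single page exceeds the budget, instead of truncating it.
    """
    chunks: list[list[int]] = []
    current: list[int] = []
    used = 0
    for index, page in enumerate(pages):
        cost = estimate_tokens(page)
        if cost > max_tokens:
            raise ValueError(
                f"page {index} needs about {cost} tokens; budget is {max_tokens}"
            )
        # Close the current chunk when adding this page would overflow it.
        if current and used + cost > max_tokens:
            chunks.append(current)
            current, used = [], 0
        current.append(index)
        used += cost
    if current:
        chunks.append(current)
    return chunks
```

Every page index ends up in exactly one chunk, so a reviewer can confirm that nothing was dropped before any summary is trusted.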

Prompt Sensitivity Means Inconsistency

Two users asking the same question in slightly different ways can get substantially different answers from the same model. This is fine for creative tasks. It’s a problem for anything that needs to be auditable, consistent, and defensible.

What This Means for Finance Use Cases

Financial Reporting and Consolidation

This is the highest-stakes area. Consolidation under IFRS requires mechanical precision — NCI calculations, intercompany eliminations, foreign currency translation, disposal accounting. These are not fuzzy tasks. They are deterministic. The answer is either right or it isn’t.

AI as the engine for consolidation arithmetic is premature. The model doesn’t know your chart of accounts, your group structure, or your specific IFRS elections. And when it’s wrong, it doesn’t tell you.
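To make "deterministic" concrete, here is a minimal sketch of one such calculation: non-controlling interest measured at the proportionate share of net assets, rolled forward for one period. It is a simplified illustration, not a full IFRS treatment. It ignores goodwill, other comprehensive income, and changes in ownership, and the function names and half-up rounding to cents are assumptions made for the example.

```python
from decimal import Decimal, ROUND_HALF_UP

CENT = Decimal("0.01")  # assumed reporting precision


def nci_share(parent_ownership: Decimal) -> Decimal:
    """Fraction of the subsidiary held by non-controlling shareholders."""
    if not (Decimal("0") < parent_ownership <= Decimal("1")):
        raise ValueError("parent_ownership must be in (0, 1]")
    return Decimal("1") - parent_ownership


def nci_at_acquisition(net_assets: Decimal, parent_ownership: Decimal) -> Decimal:
    """NCI at the proportionate share of identifiable net assets."""
    amount = net_assets * nci_share(parent_ownership)
    return amount.quantize(CENT, rounding=ROUND_HALF_UP)


def closing_nci(
    opening_nci: Decimal,
    subsidiary_profit: Decimal,
    subsidiary_dividends: Decimal,
    parent_ownership: Decimal,
) -> Decimal:
    """Roll NCI forward: opening plus NCI share of profit, less NCI share of dividends."""
    movement = (subsidiary_profit - subsidiary_dividends) * nci_share(parent_ownership)
    return (opening_nci + movement).quantize(CENT, rounding=ROUND_HALF_UP)
```

With 80% ownership and net assets of 1,000,000, the opening NCI is 200,000. A profit of 150,000 and dividends of 50,000 move it to 220,000. The same inputs give the same output on every run, which is the property an auditor needs.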

Where AI adds genuine value here is in the layer above the numbers — natural language summaries of variance analysis, automated commentary on movement in equity, pattern recognition across periods. Work where “approximately right” is useful and an accountant is reviewing the output before it leaves the building.

Knowledge-Based Chatbots

This is where AI is most useful in its current state — but it needs to be structured correctly.

A raw LLM pointed at your firm’s knowledge base is unreliable. It can mix up document versions. It can blend IFRS and US GAAP if both are in the corpus. It can hallucinate citations.

A retrieval-augmented system — where the AI is grounded against specific, verified documents before it generates a response — is a different story. The model’s job shifts from “knowing the answer” to “articulating the answer from a document you’ve already vetted.” That’s a more controllable problem.
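The grounding step can be sketched in a few lines. This example uses naive keyword overlap in place of a real search index, filters by accounting framework so IFRS and US GAAP material never mix, and declines to build a prompt when nothing relevant is found. The class and function names (VettedDocument, retrieve, build_grounded_prompt) and the overlap threshold are hypothetical, chosen for illustration only.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class VettedDocument:
    doc_id: str
    framework: str  # e.g. "IFRS" or "US GAAP"
    text: str


def tokenize(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def retrieve(
    question: str,
    corpus: list[VettedDocument],
    framework: str,
    min_overlap: int = 2,  # assumed threshold; tune against real queries
    top_k: int = 3,
) -> list[VettedDocument]:
    """Return the best-matching vetted documents from a single framework."""
    question_terms = tokenize(question)
    scored = []
    for doc in corpus:
        if doc.framework != framework:
            continue  # never blend frameworks in one answer
        overlap = len(question_terms & tokenize(doc.text))
        if overlap >= min_overlap:
            scored.append((overlap, doc))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:top_k]]


def build_grounded_prompt(
    question: str, sources: list[VettedDocument]
) -> Optional[str]:
    """Build a prompt restricted to the sources, or None if there is nothing to cite."""
    if not sources:
        return None  # caller should escalate to a human instead of guessing
    excerpts = "\n".join(f"[{doc.doc_id}] {doc.text}" for doc in sources)
    return (
        "Answer using only the excerpts below and cite the ID in brackets.\n"
        "If the excerpts do not answer the question, say so.\n\n"
        f"{excerpts}\n\nQuestion: {question}"
    )
```

Returning None when retrieval comes back empty is the important design choice here: the system hands off to a person rather than letting the model answer from memory.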

Even so: the AI layer is still evolving. A workflow built tightly against one model version may behave differently when that model is updated. Any chatbot deployed in a client-facing context needs human review in the loop, clear disclosure that it’s AI-assisted, and a feedback mechanism to catch drift over time.

Document Processing and Mapping

Extracting data from financial statements, trial balances, and management accounts is one of the more reliable AI use cases right now. The task is bounded, the outputs are checkable, and errors are catchable before they cascade. This is where firms are already seeing productivity gains without taking on significant risk.

Even here, though, “reliable” is not the same as “stable.” An update to the underlying model can change how it interprets column headers or handles edge cases in your document format. You monitor. You test. You keep a human in the loop for exception handling.
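"Checkable" can be taken literally. A minimal sketch of post-extraction checks on a trial balance: every extracted line must map to a known account, amounts must be non-negative and one-sided, and total debits must equal total credits. Anything that fails goes to a human as an exception. The ExtractedLine shape and find_exceptions function are assumptions for the example, not a description of any particular product.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass(frozen=True)
class ExtractedLine:
    account_code: str
    debit: Decimal
    credit: Decimal


def find_exceptions(
    lines: list[ExtractedLine],
    known_accounts: set[str],
    tolerance: Decimal = Decimal("0.00"),
) -> list[str]:
    """Return human-readable issues; an empty list means the extraction passed."""
    issues: list[str] = []
    for line in lines:
        if line.account_code not in known_accounts:
            issues.append(f"unknown account {line.account_code}")
        if line.debit < 0 or line.credit < 0:
            issues.append(f"negative amount on {line.account_code}")
        if line.debit and line.credit:
            issues.append(f"both debit and credit on {line.account_code}")
    total_debit = sum((line.debit for line in lines), Decimal("0"))
    total_credit = sum((line.credit for line in lines), Decimal("0"))
    difference = total_debit - total_credit
    if abs(difference) > tolerance:
        issues.append(f"trial balance out by {difference}")
    return issues
```

Because these checks do not depend on the model at all, they keep working unchanged when the underlying model is updated, which makes them a useful early signal of drift.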

So Should You Use AI at All?

Yes — with clear eyes about what you’re doing.

The mistake isn’t using AI. The mistake is treating it like production-grade, LTS software when it isn’t there yet. The firms getting value from AI right now are the ones who:

Define the scope tightly. Use AI for specific, bounded tasks where outputs can be reviewed before they matter.
Keep humans in the verification loop. Not as a formality — as a genuine quality gate.
Abstract the AI layer. Don’t build directly against a specific model. Use an architecture that lets you swap models when the one you’re on changes behaviour or gets deprecated.
Monitor continuously. The same prompt can behave differently next month. Test regularly against known good outputs.
Document the AI’s role. Internally, and in client-facing contexts. Transparency is both good practice and increasingly a regulatory expectation.
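Two of these practices, abstracting the AI layer and monitoring against known good outputs, fit together naturally. The sketch below defines a small interface that application code depends on, plus a regression check that replays recorded prompts through whichever model sits behind it. StubModel stands in for a real vendor adapter, and all names here are hypothetical. Exact-match comparison after normalisation suits short, classification-style outputs; free-text answers would need a looser comparison.

```python
from typing import Protocol


class TextModel(Protocol):
    """The only surface application code is allowed to depend on."""

    name: str

    def complete(self, prompt: str) -> str: ...


class StubModel:
    """Stand-in for a vendor adapter; a real one would call the provider's SDK."""

    def __init__(self, name: str, responses: dict[str, str]) -> None:
        self.name = name
        self._responses = responses

    def complete(self, prompt: str) -> str:
        return self._responses.get(prompt, "")


def normalise(text: str) -> str:
    """Ignore case and whitespace differences when comparing outputs."""
    return " ".join(text.lower().split())


def run_golden_checks(model: TextModel, golden: dict[str, str]) -> list[str]:
    """Return the prompts whose output no longer matches the recorded answer."""
    return [
        prompt
        for prompt, expected in golden.items()
        if normalise(model.complete(prompt)) != normalise(expected)
    ]
```

Run on a schedule, a non-empty result is the trigger to investigate before clients notice. Because callers only see TextModel, switching providers means writing one new adapter and re-running the same checks.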

The firms that sit out AI entirely are falling behind on productivity. The firms that over-trust AI are building on sand. The sustainable path is disciplined adoption — using AI where it adds clear value, with guardrails that account for the technology’s current maturity level.

Where We Stand on This

At BrizoSystem, we’ve made deliberate choices about where AI belongs in our products — and where it doesn’t.

In BrizoConsol, our multi-entity financial consolidation platform, we use AI in specific, controlled ways: account mapping, pattern recognition in trial balances, narrative generation layered on top of verified numbers. Commentary drafts that a human reviews before they leave the building.

The consolidation engine itself — the IFRS logic, the NCI calculations, the elimination entries — is deterministic code. It behaves the same way every time. It’s auditable. It doesn’t hallucinate.

That’s a deliberate architectural choice. We’re not waiting for AI to mature before we ship. We’re using it where it adds genuine value today, while keeping the high-stakes accounting logic in code that we control and can stand behind.

AI will reach LTS eventually. The models will stabilise, the versioning contracts will improve, and the hallucination rate will come down to something auditors can work with. When it does, more of the workflow will be AI-native.

Until then, the answer isn’t all-in or stay out. It’s build with intention.

BrizoSystem builds tools for accounting firms managing complex, multi-entity clients. BrizoConsol is our financial consolidation and reporting platform — built for precision, not approximation.
