Why Probably

Probably is for analytically minded people who want an AI that follows a rigorous analytical process when working with raw tabular and unstructured data.

So far our users include:

CEOs and other executives
data analysts
data engineers
software engineers
marketers
journalists

Their use cases share a number of commonalities:

Their data lives in many different sources, notably databases.
The data is large in volume and organized across many (often wide) tables.
The data has mixed types, both quantitative and qualitative.
The data is raw and messy.
The data may lack curation or context beyond what the operator can provide.
The analysis needs to be accurate, verifiable, and reproducible.

Put simply, Probably is for people who are doing serious things with serious data, often at large scale and requiring the highest degrees of both accuracy and precision.

Problems it solves

There are two categories of problems with using AI for tabular data analysis.

Mechanical problems (LLMs cannot do math).
Methodology problems (LLMs are sycophants, not objective reasoners).

Each of these categories is deep. This page gives an overview and plenty of links to go further.

Mechanical problems

A high-level view of the mechanical problems.

LLMs cannot do math

Simply put: LLMs are not calculators and calculators are not LLMs.

The ability to perform a large number of precise calculations efficiently is central to accurate data analysis.

Because transformers are not calculators, applying them to data analysis runs into a foundational impediment.

The Probably harness solves this in two ways:

The harness strictly prohibits the LLM from directly attempting (that is, hallucinating) calculations, interpolations, extrapolations, or estimations of any kind. This is the obvious first step.
The harness contains built-in verifiers for every calculation and output attempted by the LLM. Verification happens continuously during analysis plan execution and again, comprehensively, over the final output. If the LLM hallucinates during analysis or report writing, the engine rejects the action and sends it back for correction before you ever see it.

While LLMs are poor calculators, they are exceptionally good translators. The key to this architecture is removing the LLM from activities for which it is unsound while maximizing it for the activities where it is unrivaled.

Importantly, this also creates an execution environment that can be deterministically evaluated both during development and in production. This is critical to developing reliable agentic systems at increasing scales.
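To make the verification idea concrete, here is a minimal sketch. It is not the Probably engine: the names `OPERATIONS`, `execute_step`, and `verify_claim`, the plan-step format, and the tolerance are all assumptions made for illustration. The point it shows is that the number is computed by ordinary deterministic code, and any value the LLM asserts is only accepted if it matches.

```python
import math
import statistics

# Hypothetical whitelist of operations the engine computes on its own.
OPERATIONS = {
    "sum": sum,
    "mean": statistics.fmean,
    "min": min,
    "max": max,
}

def execute_step(step, table):
    """Compute one plan step deterministically; the LLM never supplies the number."""
    values = [row[step["column"]] for row in table]
    return OPERATIONS[step["op"]](values)

def verify_claim(step, claimed_value, table, rel_tol=1e-9):
    """Return (accepted, computed). A claim is accepted only if it matches the computed value."""
    computed = execute_step(step, table)
    return math.isclose(claimed_value, computed, rel_tol=rel_tol), computed

# Illustrative data only.
table = [{"revenue": 120.0}, {"revenue": 80.0}, {"revenue": 100.0}]
step = {"op": "mean", "column": "revenue"}

accepted, computed = verify_claim(step, 100.0, table)  # matches the computed mean
rejected, _ = verify_claim(step, 105.0, table)         # does not match, so it is rejected
```

Because every check is a pure function of the plan step and the data, the same inputs always produce the same verdict, which is what makes this kind of environment testable in development and in production.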

Tabular data eats context

Ever tried uploading a large spreadsheet or a large JSON file directly into your favorite AI chatbot or CLI agent?

You may have noticed a range of problems with this.

Very low limits on file size.
Amnesia-like behavior. The LLM is both compacting context and diffusing attention as the chat grows longer. It starts randomly losing chunks of the data.
Generally poor recall over the full range of the data set. Analysis will be missing critical data points such that the conclusions become wrong by omission.
Higher latency than normal language tasks for two reasons:
- Tabular data creates a lot of noise in the context, which inflates reasoning latency.
- Most models compensate for the aforementioned computational limits by generating software programs on the fly. This is a time-consuming and error-prone process.
Higher token costs, for the same reasons that latency is higher.

What this means for data analysis: when you ask an LLM to look directly at raw tabular data, especially if it’s more than a few thousand rows, it is almost always missing something.

Methodology problems

By far the biggest methodology issue with applying LLMs to data analysis is their inherent bias towards providing satisfaction back to the user.

Inherent bias

This bias comes from many sources. Some of it is introduced during pre-training and some during post-training. Depending on the lab, the product may even be designed to maximize user engagement through sycophancy, reflecting users’ biases back to them.

Lack of awareness

Even if LLMs were aligned to a more neutral starting place, they don’t really have an internal concept of objectivity because they are pattern replicators. They have no in-the-moment awareness of their own bias of the kind we might observe in a self-aware person.

The Probably harness attacks this problem from two angles:

Deterministic validation of values cited in claims, facilitated by the computational traces.
Dynamic context management, paired with a meta-process to force the LLM into a repeatable set of neutral observations grounded in the deterministic outputs.

The two approaches work in tandem to constrain LLM outputs to observations, and conclusions drawn from them, that can be checked against the computational traces.

The Probably harness is designed to prevent the LLM from speculating or guessing in any way, shape, or form.
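The first angle, deterministic validation of values cited in claims, might look something like the sketch below. This is an assumption-heavy illustration rather than the harness itself: the trace is modeled as a plain dictionary of computed values, and `cited_numbers` and `unsupported_values` are hypothetical helpers. The number extraction is deliberately naive (it would also flag years or identifiers that appear in a sentence).

```python
import math
import re

# Matches integers and decimals, with optional thousands separators.
NUMBER = re.compile(r"-?\d+(?:,\d{3})*(?:\.\d+)?")

def cited_numbers(text):
    """Extract every numeric literal that appears in a piece of prose."""
    return [float(match.replace(",", "")) for match in NUMBER.findall(text)]

def unsupported_values(claim, trace, abs_tol=0.005):
    """Return the numbers cited in `claim` that match no value in the trace."""
    return [
        number
        for number in cited_numbers(claim)
        if not any(math.isclose(number, value, abs_tol=abs_tol) for value in trace.values())
    ]

# Hypothetical trace produced by deterministic computation.
trace = {"mean_revenue": 100.0, "row_count": 3}

ok = unsupported_values("Mean revenue was 100.0 across 3 rows.", trace)
bad = unsupported_values("Mean revenue was 112.5 across 3 rows.", trace)
```

Here `ok` is empty, so the sentence could pass, while `bad` contains the unsupported figure and the sentence would be sent back for correction.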

All data is bad data

A long-standing catchphrase in machine learning is “garbage in, garbage out.” Despite the magic of LLMs, they are more noise-sensitive than most people realize.

You may have seen a number of vendors offering to make your data AI-ready. We have a different value proposition: we specialize in data that is very much not AI-ready.

The Probably harness comes with a sophisticated ingestion layer that maps and profiles raw data very efficiently. This enables the agent to be immediately aware of quality issues before committing to an analysis that would be inherently defeated by them.
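As a rough illustration of what profiling a raw column can surface, consider the sketch below. It is not the Probably ingestion layer; the `MISSING` token list, `infer_kind`, and `profile_column` are hypothetical, and a real profiler would handle many more cases (dates, units, encodings, special float strings such as "nan").

```python
from collections import Counter

# Assumed set of strings treated as missing values.
MISSING = {"", "na", "n/a", "null", "none"}

def infer_kind(value):
    """Classify a raw string as numeric or text."""
    try:
        float(value)
        return "numeric"
    except ValueError:
        return "text"

def profile_column(values):
    """Summarize a raw column: row count, missing count, distinct values, mix of kinds."""
    present = [v.strip() for v in values if v.strip().lower() not in MISSING]
    kinds = Counter(infer_kind(v) for v in present)
    return {
        "rows": len(values),
        "missing": len(values) - len(present),
        "distinct": len(set(present)),
        "kinds": dict(kinds),
    }

# Illustrative messy column: missing markers, a stray text value, a duplicate.
raw = ["12.5", "n/a", "13", "", "twelve", "13"]
profile = profile_column(raw)
```

A compact summary like this tells an agent, before any analysis is planned, that a third of the column is missing and that one value will not parse as a number.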

The agent has a wide range of tools at its disposal to address data quality issues. Notably it can communicate issues to the user, explain inherent limitations, and still execute those analyses that are possible.

Crucially, it will not base conclusions on low-quality data.

Examples of low-quality data include:

sparse distributions
mismatched sample sizes
heavily skewed distributions

…and many more.

The agent will always accompany a response with quality callouts, making clear why conclusions cannot be drawn and whether any options exist for resolving or mitigating the quality issues.
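A quality gate for the three example issues listed above could be sketched as follows. Everything here is an assumption for illustration: the function names, the message wording, and especially the thresholds (`min_n`, `max_ratio`, `max_abs_skew`), which are arbitrary placeholders rather than values used by Probably.

```python
import statistics

def skewness(values):
    """Population skewness (third standardized moment); 0.0 for constant data."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)
    if sd == 0:
        return 0.0
    return statistics.fmean([((v - mean) / sd) ** 3 for v in values])

def quality_callouts(group_a, group_b, min_n=30, max_ratio=3.0, max_abs_skew=1.0):
    """Return human-readable callouts for two groups about to be compared."""
    callouts = []
    for name, group in (("a", group_a), ("b", group_b)):
        if len(group) < min_n:
            callouts.append(f"group {name}: only {len(group)} observations (sparse)")
        elif abs(skewness(group)) > max_abs_skew:
            callouts.append(f"group {name}: heavily skewed distribution")
    larger = max(len(group_a), len(group_b))
    smaller = min(len(group_a), len(group_b))
    if smaller == 0 or larger / smaller > max_ratio:
        callouts.append("mismatched sample sizes")
    return callouts

# Illustrative data only.
balanced = list(range(40))     # symmetric, 40 observations
tiny = [1, 2, 3]               # far too few observations
skewed = [1] * 39 + [1000]     # one extreme value dominates

sparse_case = quality_callouts(balanced, tiny)
skewed_case = quality_callouts(skewed, balanced)
clean_case = quality_callouts(balanced, balanced)
```

An empty list means the comparison can proceed; a non-empty list is what would be surfaced to the user alongside any analysis that is still possible.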

This ability is absolutely critical to producing an effective data agent, and we are honored to provide it to you.

Why now?

Suffice it to say, accurate and objective tabular data analysis presents a uniquely challenging domain for the current state of transformers.

Even before AI, objective data analysis was hard enough for humans. We are not exactly calculators either.

But finally we have arrived at a unique moment in time. We have the technology to solve the problem from both sides.

We can drop the transformer into a data mech suit, giving it superhuman analysis ability. Its conversational interface makes this whole process accessible to absolutely anybody in the simplest possible terms.

The Probably agent was an extraordinarily challenging harness to build, but we believe it was worth it.

Our team is proud to offer and support this agent for any individuals or companies looking to do serious data work effortlessly and efficiently.