How AI assistants should learn from your behavior without becoming creepy
There is a Hawaii example that stuck with us. If you tell an AI agent "I want to lose weight before Hawaii," that sentence means completely different things depending on who said it. For the type-A founder it means "give me the most efficient five-day-a-week interval training plan." For the rest of us it means "saw a TikTok about Hawaii, kind of want to look ok if I go in September." Same words. Different commitment. If the agent treats both the same, it is wrong half the time.
This is the personalisation problem. And almost every AI assistant gets it wrong, because they confuse "knows things about you" with "calibrated to you."
The two kinds of personalisation
There is a shallow kind and a deep kind, and most products only do the shallow.
Shallow personalisation. The agent stores facts. Your name. Your role. Your dietary preferences. Your timezone. These are useful, but they are static: a profile that does not shift with time and behaviour. Most AI assistants get this far.
Deep personalisation. The agent calibrates how it behaves based on what you actually do. If you keep dismissing a particular kind of suggestion, the agent quietly stops making them. If you accept "repeat last year" suggestions reliably, the agent offers to auto-prepare them. If you tend to plan birthdays a month ahead, the agent surfaces them at month-out instead of week-out.
Shallow personalisation is what you tell the agent. Deep personalisation is what the agent learns from you. Both matter. Most products only have the first.
What outcome learning actually means
The mechanism is simple in concept. For each kind of suggestion the agent makes, track whether the user acted on it, dismissed it, snoozed it, or ignored it until it expired. Over time you have a per-user-per-type acceptance rate.
If the rate is high — say above 80% — the suggestion is well-calibrated for this user. Keep making it. Maybe even auto-prepare it.
If the rate is below 25% — the user is dismissing most of these — the suggestion is noise to this user. Demote it. Make it quieter. If it has to come, surface it as a low-priority briefing line, not a push notification.
If the sample size is too small — below 4 decisive interactions — leave the suggestion alone. Cold start matters. You cannot calibrate based on three data points.
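Here is a minimal sketch of that bookkeeping in TypeScript. The names and types (Interaction, acceptanceRate) are illustrative assumptions, not Prio's actual schema, and whether an expired suggestion counts as decisive is a product choice we leave open here.

```typescript
// Illustrative types only; not Prio's actual schema.
type Outcome = "accepted" | "dismissed" | "snoozed" | "expired";

interface Interaction {
  userId: string;
  insightType: string; // e.g. "meeting-prep", "networking-stale"
  outcome: Outcome;
  at: Date;
}

// Decisive interactions are the ones that carry calibration signal.
// Snoozes are deferred, not decisive (see the snooze section below).
const isDecisive = (i: Interaction) =>
  i.outcome === "accepted" || i.outcome === "dismissed";

// Per-user, per-type acceptance rate over a set of interactions.
// Returns null when there is too little signal to calibrate on.
function acceptanceRate(
  interactions: Interaction[],
  minSamples = 4
): number | null {
  const decisive = interactions.filter(isDecisive);
  if (decisive.length < minSamples) return null; // cold start: do nothing
  const accepted = decisive.filter((i) => i.outcome === "accepted").length;
  return accepted / decisive.length;
}
```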
Why this matters more than it sounds
Without outcome learning, every detector runs at the same volume forever. The user is constantly swatting away suggestions that do not apply to them. They learn to ignore the channel.
Once the channel is being ignored, the agent has lost. Even when a genuinely useful suggestion comes through, the user is not paying attention. The product becomes wallpaper.
With outcome learning, the noisy detectors get auto-quiet. The user starts to feel like the agent is on their side. The signal-to-noise ratio improves over time without anyone configuring anything. Trust compounds quietly.
This is the "no compiler for taste" problem we wrote about. Coding agents have tests; consumer AI does not. Outcome learning is the closest equivalent: the user's clicks become the eval.
The conservative version (which is the right one)
There is a temptation to build outcome learning aggressively. If acceptance is below 50%, demote. If it is above 90%, promote. Run a per-user, per-type, per-time-of-day model and fully personalise everything.
That is wrong for three reasons.
Conservative beats clever for trust. Users notice when the agent suddenly gets louder. They do not notice when it gradually gets quieter. So the right asymmetry is to demote aggressively but never auto-promote.
Cold start is a real problem. With three data points you can talk yourself into anything. Wait for at least four decisive interactions, the same floor the rule below uses, before making any meaningful change. That sometimes means the agent feels generic for the first few weeks. That is fine; better than feeling jumpy.
Sample bias is a real problem. A user who never sees a kind of suggestion cannot accept it, so the agent has to surface enough variety even at low acceptance rates to keep learning. Pure exploitation of "what works" leads to local maxima.
The right rule is something like: 90-day window, minimum 4 decisive interactions, acceptance below 25% triggers a single-step demotion (urgent → high → medium → low). Never silence completely; even a noisy detector occasionally produces something the user wants.
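A sketch of that rule, reusing the acceptanceRate helper from the earlier sketch. The thresholds mirror the prose; the names and the shape of the return value are assumed for illustration.

```typescript
// Thresholds come straight from the prose; everything else is
// illustrative, not Prio's actual API.
type Priority = "urgent" | "high" | "medium" | "low";

const LADDER: Priority[] = ["urgent", "high", "medium", "low"];

function calibratePriority(
  original: Priority,
  rate: number | null // acceptanceRate() over a 90-day window
): { priority: Priority; reason?: string } {
  // Cold start (fewer than 4 decisive interactions): leave alone.
  if (rate === null) return { priority: original };
  // At or above 25%: calibrated well enough. Promotion is
  // deliberately never automatic.
  if (rate >= 0.25) return { priority: original };

  // Single-step demotion. "low" is the floor: never fully silent,
  // which also keeps enough exposure to guard against sample bias.
  const idx = LADDER.indexOf(original);
  const demoted = LADDER[Math.min(idx + 1, LADDER.length - 1)];
  return {
    priority: demoted,
    reason: `acceptance ${(rate * 100).toFixed(0)}% below 25% threshold`,
  };
}
```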
Snooze is the third option
Most agent UIs offer two responses to a suggestion: accept or dismiss. That is wrong. There is a third response that captures most of the actual decisions: "ask me again later."
Treating snooze as a first-class action does two things. First, it captures a real user intent — "this is potentially useful but not now" — without forcing them into a bad binary. Second, it keeps the data clean. A snooze should not count as a dismissal in the calibration loop. The user is not telling you the suggestion was wrong; they are telling you the timing was.
When the snooze timer expires, the suggestion comes back at its original priority. If the user accepts it then, that is positive signal. If they dismiss it, that is the negative signal that calibration was already missing.
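Sketched in the same illustrative types as above, snooze looks something like this. The two properties that matter: a snooze never feeds the calibration loop, and resurfacing restores the original priority.

```typescript
// Snooze as a first-class outcome, reusing the Priority type above.
// The Suggestion shape is an assumption for illustration.
interface Suggestion {
  id: string;
  insightType: string;
  priority: Priority;         // as calibrated when it surfaced
  originalPriority: Priority;
  snoozedUntil?: Date;
}

// A snooze is "not now", not "not this": it is recorded, but it is
// not a decisive interaction and never feeds calibration.
function snooze(s: Suggestion, until: Date): Suggestion {
  return { ...s, snoozedUntil: until };
}

function isDue(s: Suggestion, now: Date): boolean {
  return s.snoozedUntil !== undefined &&
    s.snoozedUntil.getTime() <= now.getTime();
}

// When the timer expires, the suggestion comes back at its original
// priority. The accept or dismiss that follows is the decisive
// signal calibration was missing.
function resurface(s: Suggestion): Suggestion {
  return { ...s, priority: s.originalPriority, snoozedUntil: undefined };
}
```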
The trust loop made visible
There is one more piece. Outcome learning runs invisibly by default. The user benefits from it but does not see it. Over time, that means the user does not understand why the agent is getting better — just that it is.
That is fine until it is not. The first time a calibration causes the user to MISS a suggestion they would have wanted, they are confused. "Why did Prio not tell me about X?"
The fix is to make the calibration visible. Add an audit field to the insight metadata: originalPriority: "high", calibratedTo: "medium", reason: "rate 18% over 11 samples". Show this in the dashboard when the user looks for it. When acceptance for a category is genuinely low, surface a soft acknowledgment in the briefing — "Acceptance is low here, Prio is dialing it down."
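One possible shape for that audit field, with the field names taken from the example above; the wrapper type and the briefing helper are assumptions.

```typescript
// Audit trail for a calibrated insight.
interface CalibrationAudit {
  originalPriority: Priority;
  calibratedTo: Priority;
  reason: string; // e.g. "rate 18% over 11 samples"
}

interface InsightMetadata {
  calibration?: CalibrationAudit; // absent when nothing was adjusted
}

// The soft acknowledgment surfaced in the briefing.
function briefingNote(a: CalibrationAudit): string {
  return `Acceptance is low here (${a.reason}); dialing this down ` +
    `from ${a.originalPriority} to ${a.calibratedTo}.`;
}
```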
This is the trust contract. The system can do quiet things on the user's behalf, but the user can always see what it is doing and override.
What this looks like in practice
In Prio's anticipation system, here is the actual flow.
A user opts in to morning briefings. Each cron tick (every six hours), several detectors run: meeting prep, deadline warnings, networking nudges, anticipation, calendar hygiene, and more. Each detector emits insights with priority levels.
Before insight insertion, a calibration pass runs. For each (user × type) pair, it looks up the last 90 days of decisive interactions. If the acceptance rate is below 25% with 4+ interactions, it demotes the priority by one step and tags the metadata so the demotion is auditable.
The insights then surface at their calibrated priority. The user reads, dismisses, snoozes, or acts. The interactions feed back into the next calibration pass. The system tightens.
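Wiring the earlier sketches together, a tick might look like this. The Detector interface and the two callbacks are hypothetical stand-ins for Prio's actual internals.

```typescript
// Hypothetical wiring of the sketches above.
interface Detector {
  type: string;
  run(userId: string): Suggestion[];
}

async function cronTick(
  userId: string,
  detectors: Detector[],
  loadInteractions: (userId: string, type: string, days: number) => Promise<Interaction[]>,
  insert: (s: Suggestion, meta: InsightMetadata) => Promise<void>
): Promise<void> {
  for (const detector of detectors) {
    // Last 90 days of interactions for this (user x type) pair.
    const history = await loadInteractions(userId, detector.type, 90);
    const rate = acceptanceRate(history);

    for (const raw of detector.run(userId)) {
      const { priority, reason } = calibratePriority(raw.originalPriority, rate);
      // Tag the metadata only when a demotion actually happened,
      // so every adjustment is auditable.
      const meta: InsightMetadata = reason
        ? { calibration: { originalPriority: raw.originalPriority, calibratedTo: priority, reason } }
        : {};
      await insert({ ...raw, priority }, meta);
    }
  }
}
```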
After three months of usage, a user who keeps dismissing networking-stale insights has them quietly downgraded to briefing-only mentions. A user who keeps acting on board prep insights keeps seeing them at full priority, with auto-preparation on offer. Same code, different UX, no configuration needed.
What about sensitive learning?
Outcome learning is fundamentally privacy-respecting because it is per-user. There is no cross-user model that learns from one person's behaviour and applies it to another. The patterns extracted stay local.
It is also reversible. Drop the user's interaction data and the agent goes back to defaults. There is nothing baked in.
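In the sketches above, that reversibility falls out for free: calibration is derived from interaction history on every pass, never persisted as learned state. A hypothetical reset against an assumed store interface:

```typescript
// Deleting the history is a complete reset, because nothing
// learned is stored separately.
async function resetPersonalisation(
  userId: string,
  store: { deleteInteractions(userId: string): Promise<void> }
): Promise<void> {
  await store.deleteInteractions(userId);
  // On the next cron tick, acceptanceRate() returns null for every
  // type and calibratePriority() leaves every default untouched.
}
```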
We get asked about training. The short answer is: outcome data does not train models. It calibrates thresholds in a deterministic system. Models are trained on aggregated, anonymised, opt-in data only — and only for the purpose of improving the underlying capabilities, not for personalising one user with another's data.
The takeaway
Personalisation in AI is not a magic feature. It is a system design choice. Most products only do the shallow version because the deep version requires infrastructure: per-user-per-type interaction tracking, calibration windows, snooze handling, visibility surfaces.
The result of doing it well is an agent that gets quieter where it should be quiet and louder where it should be loud — without anyone configuring it. The user feels the agent is on their side. Trust compounds.
If you are evaluating an AI assistant, the question to ask is "does this product get smarter the longer I use it, in ways I can verify?" If the answer is "yes, here are the dials and the audit," you are looking at a product that has solved deep personalisation. If the answer is "yes, just trust us, we have a model," that is a different conversation.
We have written separately about why memory matters and the trust ladder for autonomous agents. Outcome learning is the connective tissue. Without it the other two pieces never fully click into place.