Agentic AI in science
Can we trust AI agents
to reason autonomously in science?
As agentic AI systems become capable of increasingly complex reasoning, they will be adopted more and more widely in scientific workflows. In science, as in many other domains, we cannot blindly trust conclusions or predictions; we need to scrutinize the reasoning behind them. So when we deploy agentic AI systems that work autonomously, how do we ensure accountability?
Consider a concrete scenario: an AI agent spends 20 minutes autonomously reasoning and analyzing data and produces a conclusion. How can you verify its work? How do you know it didn't hallucinate at a critical decision point five minutes in, leading to convincing reasoning in the second half that is still fundamentally flawed because of the initial mistake?
Learning from human collaboration
In this work, we approached this challenge by first asking how human scientists build trust with each other when collaborating on single-cell biology projects. In these collaborations, individual scientists accumulate reasoning over weeks and months before sharing their work. To convey that reasoning, they document their work in computational notebooks, e.g. Jupyter notebooks for Python-centric analyses. These notebooks interleave code, results, and interpretations, creating a transparent record that enables feedback and iterative refinement even in this complex setting.
Introducing kai – an agentic AI assistant
for single-cell biology
Building on this insight, we designed kai: an agentic AI system that adopts computational notebooks as its primary reasoning interface. Rather than operating as a black box, kai generates Jupyter notebooks that human scientists can interpret. This is enabled by several components:
A chat interface in VS Code where scientists interact with kai through natural language
A notebook editing interface that allows kai to modify and execute Jupyter notebook cells and observe their outputs
A retrieval system that synthesizes information from thousands of published computational workflows to inform kai’s analysis planning
Specialized agents for planning, coding, and reasoning that work together to address complex queries
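One way to picture how these components fit together is the following minimal sketch. All names here (`RetrievalStore`, `PlanningAgent`, `ExecutionAgent`, and so on) are illustrative assumptions, not kai's actual API: the point is only the separation between a planner that consults retrieved workflows and an executor that produces notebook cells.

```python
from dataclasses import dataclass

# Hypothetical sketch of the component split described above;
# class and method names are illustrative, not kai's real interface.

@dataclass
class NotebookCell:
    source: str
    output: str = ""

@dataclass
class RetrievalStore:
    """Stands in for the database of published computational workflows."""
    workflows: dict  # workflow name -> list of analysis steps

    def lookup(self, query: str) -> list:
        # Naive keyword match over stored workflows (real retrieval
        # would synthesize across thousands of published analyses).
        return [step
                for name, steps in self.workflows.items()
                if query in name
                for step in steps]

@dataclass
class PlanningAgent:
    """Designs a workflow by consulting the retrieval store."""
    store: RetrievalStore

    def plan(self, query: str) -> list:
        steps = self.store.lookup(query)
        return steps or [f"explore data for: {query}"]

@dataclass
class ExecutionAgent:
    """Turns each planned step into an executed notebook cell."""
    def execute(self, step: str) -> NotebookCell:
        # In a real system this would generate and run code and
        # observe its outputs; here we stub both.
        return NotebookCell(source=f"# {step}", output="ok")

def run_session(query, planner, executor):
    # Planning is kept separate from execution, mirroring the
    # architecture described above.
    return [executor.execute(step) for step in planner.plan(query)]
```

The stub makes the accountability argument concrete: because every step becomes a cell, the session's output is a sequence a human can inspect and revise, not a single opaque answer.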
The agentic architecture separates planning from execution. Planning agents consult a database of published analyses to design workflows, and execution agents generate code, interpret results, and reason over their findings. This separation enables kai to tackle analyses that require extended reasoning – often 20 minutes or more of autonomous work. Critically, the output isn't just a final answer: it's a complete Jupyter notebook documenting every analysis step and decision. Human scientists can inspect this notebook cell by cell, modify the approach, and provide feedback that kai can incorporate in subsequent iterations.
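The notebook-as-output idea is quite concrete: a Jupyter notebook is just JSON in the nbformat v4 schema, so an agent's interleaved reasoning and code can be serialized into a file a scientist opens and inspects cell by cell. A stdlib-only sketch (the `load` call in the example is hypothetical):

```python
import json

def make_notebook(steps):
    """Serialize (code, note) pairs into Jupyter's nbformat v4 JSON.

    Each analysis step becomes a markdown cell explaining the decision,
    followed by a code cell - the interleaved record described above.
    """
    cells = []
    for code, note in steps:
        cells.append({
            "cell_type": "markdown",
            "metadata": {},
            "source": note.splitlines(keepends=True),
        })
        cells.append({
            "cell_type": "code",
            "metadata": {},
            "outputs": [],
            "execution_count": None,
            "source": code.splitlines(keepends=True),
        })
    return {"nbformat": 4, "nbformat_minor": 5, "metadata": {}, "cells": cells}

# Example: write a two-step record to disk; any Jupyter frontend can open it.
nb = make_notebook([
    ("adata = load('pbmc.h5ad')", "Load the dataset (hypothetical example)."),
    ("adata", "Inspect the loaded object before further analysis."),
])
with open("analysis.ipynb", "w") as f:
    json.dump(nb, f, indent=1)
```

Because the record is an ordinary notebook, feedback is symmetric: a scientist can edit or rerun any cell and hand the modified notebook back for the next iteration.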
Where we are at now
We tested kai on complex scenarios in single-cell biology that require integrating multiple analysis types, interpreting specialized tools, and critically evaluating published hypotheses. In comparisons with one-shot analysis generation by large language models, kai demonstrated several advantages:
Reliability: kai consistently produced executable notebooks
Reasoning quality: kai produced data-driven conclusions, avoiding hallucinations
Sycophancy resistance: kai showed partial resistance to sycophancy – the tendency to blindly agree with user suggestions regardless of their validity
Autonomy: kai successfully formulated and addressed its own research questions based on data and literature
Why transparency matters
kai is an assistant for single-cell biology optimized for human-agent collaboration. Like AI assistants in other domains, kai enhances human efficiency while maintaining accountability – a key advantage over fully autonomous systems in science where the reasoning details matter as much as the final conclusion.