A hot topic since the advent of commercial LLMs is just how “smart” they are. They’ve been benchmarked (maybe too much) and given standardized tests, all in an effort to create standardized ways of understanding what they’re capable of. A common refrain has been that the initial GPT-4o type models have high school intelligence, reasoning models have college intelligence, and the newest slate of models have Master’s or even PhD-level intelligence.
To be honest, I don’t really know what PhD-level intelligence means. What I do know is that these models are now fully capable of facilitating the work that doctoral students and academics do (whether this work is the “real work” of PhDs is not something I’m interested in discussing). That’s why the entire field of social science is now undergoing the kind of existential reflection that many writers have been having for the past several years.
I don’t think these models can autonomously do the whole pipeline of empirical research, yet (that said, we should entertain the real possibility that humans slowly and imperfectly doing capital-R Research becomes a quaint notion). But I already see a divergence happening between those who do their research with agentic AI and those who do not. Ethan Mollick calls this co-intelligence; educational institutions call this AI literacy; I am increasingly calling it table stakes.
For the past several weeks I’ve been writing about redesigning my statistics course for a world of agentic AI, and I will get back to that. But I want to spend this post taking seriously the question of what it actually takes to do rigorous analytical work with AI agents. I don’t think we have a good answer yet. What I keep seeing, both in myself and in others, is a gap between people who are getting genuinely impressive results from these tools and people who are getting confidently wrong ones. And the difference is not “prompt engineering.” It’s something more fundamental, and I think it’s worth naming.
In an effort to put these thoughts down on paper, I’ve been working on a course proposal called Agentic Analysis for Policy (the closest analog I’ve found is Gabor Bekes’ Doing Data Analysis with AI course at CEU). It’s designed as a follow-on to our existing quantitative methods sequence, not a replacement for it. Students would arrive with the statistical reasoning they need to formulate questions, choose methods, and interpret results. This course teaches them how to wield the AI tools that are rapidly becoming the standard way that reasoning gets applied in practice.
Why “Agentic Analysis”?
As we reflect on the future of quantitative analysis, we can look to the transformation/existential crisis that the software development community is currently experiencing as a canary in the coal mine. On one hand, the future may belong to the vibe coders, people with no software development background who tell AI what they want the software to do, review the results, and repeat. But I think this is unlikely. Rather, I think the most talented developers will be the ones not coding the most, because they have the best judgment to direct teams of agents effectively. In this vein, “agentic engineering” has begun gaining traction as a more disciplined version of vibe coding: using structured methods to direct, scrutinize, and test AI agents.
I think the same distinction applies to data analysis. There is a version of this where a policymaker asks ChatGPT to “analyze this data” and takes whatever comes back. And who knows, maybe one day these tools will work so well that that will be enough. But once again, I doubt it. That’s because even if it makes no mistakes, there is an enormous amount of researcher discretion in any analysis that will lead different agents to make justifiable, but different, choices. For that reason, I think agentic analysis is a real skill that we need to develop ourselves and in our students. This means directing AI agents through a rigorous analytical pipeline, pre-specifying expectations, verifying outputs at each stage, and exercising the statistical judgment to know when something has gone wrong.
Of course, this requires strong statistical judgment, but it also requires skills that even a great statistical analyst will lack if they have never used agentic tools. Here are the four I’ve come up with.
Four Skills for Working with Agentic Tools
Rather than organizing around a list of AI tools (a fool’s errand), the course is organized around four skills that I believe will endure regardless of which model or interface is in vogue six months from now.
1. Managing Cognitive Debt
I’m leading with this because I think it’s the most important. In software engineering, technical debt refers to the shortcuts and workarounds that developers take to ship a product, which accrue over time and must eventually be addressed. A recent paper introduced me to an analogous term that we should all be thinking about: cognitive debt. Cognitive debt refers to the gap between what an AI has done on your behalf and what you actually understand about what it did.
Accruing cognitive debt is not a result of bad AI use. Rather, it’s an inherent component of agentic work. In order to make the most of AI tools, you must offload bits of detail, assumptions, and understandings to the model you’re working with. Every time you let an agent handle a step without deeply understanding its choices, you accumulate cognitive debt. Sometimes that’s fine. But eventually it could be catastrophic.
So, a key skill for the agentic analyst is knowing where they are on this spectrum at any given moment, and deciding when a deeper (or shallower) understanding is needed to move forward. This is a genuinely hard task for the reasons described above, and also because AI removes the correlation between presentation and quality (i.e., a crappy analysis will look strikingly similar to an excellent one).
This also means using AI to learn techniques as they surface, turning moments of debt into moments of understanding. While I’d like to think that this can be done mid-analysis (especially since Claude Code introduced the very cool /btw command), this can also be a moment where AI’s sycophancy and desire to get things done can cause real problems.
2. Verification-First Analysis
If there’s one thing I want students to internalize, it’s this: decide what you expect to see, and how you will react, before you look at what the AI produced. This is not a new idea. Pre-analysis plans in social science research and test-driven development in software development both embody the same principle. In an agentic workflow, where AI can produce an entire analysis in minutes, it is nearly impossible to prevent yourself from data mining under the guise of exploration, or saying “yes” to whatever helpful next step the AI proposes.
In practice, this skill is about structuring your interactions well rather than relying on in-the-moment judgment. We need to get into the habit of pre-specifying data checks, deciding which code snippets from the eventual output to review for assumptions, designing falsification tests before seeing results, and building sub-agents to critique output with a distinct context window.
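To make this concrete, here is a minimal sketch of what a pre-specified check might look like in code. The dataset, column names (`age`, `treated`), and thresholds are all hypothetical; the point is that the expectations are written down before the analyst looks at what the agent produced, and the agent’s output either passes them or does not.

```python
# A minimal sketch of "verification-first" checks: expectations are written
# down *before* looking at the AI-produced analysis, then run against its
# output. All names and thresholds here are illustrative.
import pandas as pd

# 1. Pre-specified expectations, recorded before the agent runs.
EXPECTATIONS = {
    "n_rows_min": 900,              # survey should have ~1,000 respondents
    "age_range": (18, 99),          # implausible ages suggest a merge problem
    "treatment_share": (0.4, 0.6),  # randomization should be roughly balanced
}

def verify(df: pd.DataFrame) -> list[str]:
    """Run the pre-registered checks; return a list of failures (empty = pass)."""
    failures = []
    if len(df) < EXPECTATIONS["n_rows_min"]:
        failures.append(f"too few rows: {len(df)}")
    lo, hi = EXPECTATIONS["age_range"]
    if not df["age"].between(lo, hi).all():
        failures.append("implausible ages found")
    lo, hi = EXPECTATIONS["treatment_share"]
    share = df["treated"].mean()
    if not (lo <= share <= hi):
        failures.append(f"treatment share {share:.2f} outside {lo}-{hi}")
    return failures

# 2. A stand-in for the agent's cleaned dataset. Here it fails the row-count
#    check, which is exactly the kind of silent problem we want surfaced.
df = pd.DataFrame({"age": [25, 34, 51, 62], "treated": [0, 1, 0, 1]})
print(verify(df))  # → ['too few rows: 4']
```

The failing check is the feature: a verification-first workflow converts “does this output look plausible?” into a yes/no question that was answered in advance.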
3. Reproducibility and Documentation
One of the underappreciated risks of AI-assisted analysis is that it can be incredibly hard to reproduce. This has already been true — sharing a link to a ChatGPT thread is absolutely not sufficient for anything complex — but is even more so when you are working with multiple agents across different windows. Agents make decisions, write code, iterate with the user, and eventually arrive at a result, but the reasoning trail may be scattered across multiple sources or lost to the ether.
This is a more challenging proposition than “generate reproducible code”; a good agentic analyst will create a trail of decisions, assumptions, scrapped analyses, and reflections that will help other humans, other AIs, and their future selves understand and build upon their analysis.
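One lightweight way to keep such a trail is an append-only decision log that both the analyst and the agent write to. This is a minimal sketch under assumed conventions: the filename, fields, and example entries are illustrative, not a prescribed format.

```python
# A minimal sketch of an append-only analysis decision log. The filename,
# fields, and example entries are hypothetical conventions, not a standard.
import json
import datetime
import pathlib

LOG = pathlib.Path("decision_log.jsonl")

def record(decision: str, rationale: str, status: str = "adopted") -> dict:
    """Append one analytical decision to a newline-delimited JSON trail."""
    entry = {
        "time": datetime.datetime.now(datetime.timezone.utc).isoformat(),
        "decision": decision,
        "rationale": rationale,
        "status": status,  # e.g. "adopted", "scrapped", "revisit"
    }
    with LOG.open("a") as f:
        f.write(json.dumps(entry) + "\n")
    return entry

# Example entries: adopted choices and scrapped analyses both belong in the log.
record("Drop respondents with missing income",
       "Less than 2% of sample; missingness check passed", "adopted")
record("Log-transform outcome",
       "Tried after seeing right skew in exploratory plot", "scrapped")
```

Because the log is plain JSON lines, it can be grepped, diffed, and handed to a fresh agent (or a future self) as context for why the analysis looks the way it does.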
4. Agentic Workflows
This is the least interesting and most ephemeral of the skills, but one that is pretty important for any professional degree. This includes the practical mechanics of working with AI coding agents: configuring them, structuring projects so they can work effectively, setting up different agents for different parts of a pipeline, and coordinating across them. This is the closest thing to “learning the tool,” and yes, it will be somewhat transient; if I were to teach this tomorrow I would probably focus on Claude Code and git, but who knows whether that will be the case in a few months. But as I noted last time, practice with these tools is essential to using them well, and the underlying patterns (isolating tasks, managing context, structuring information for an agent) are likely to persist even as the specific tools evolve.
What I’m Still Wrestling With
I want to be upfront about the tensions I see in this design.
The prerequisites problem. This course assumes students have already completed a full quantitative methods sequence. That’s a lot of training before they ever touch these tools in a structured way. Is there a way to introduce agentic analysis earlier, without the risks of cognitive offloading I wrote about previously? My gut says no, but it may be naive to think that students can have their hands tied as they develop their expertise.
The shelf life. I am designing a course around tools that may look very different by the time I teach it. I think the tools still need to be taught for students to graduate with the right skills, and the four cross-cutting skills are my hedge against this, but I would need to redesign significant portions of this course on an ongoing basis.
The access gap. Not all students (or their future employers) will have access to tools like Claude Code or Codex. If instead they have an enterprise LLM chatbot, there will be a huge gap in what they will be able to do with the tools at their disposal. I need to think carefully about what’s transferable and what’s tool-specific, and make sure the course emphasizes the former.
The other thing. A notable absence from this list of skills is the whole separate set of applications where generative AI is used not just as a workflow tool but as a research instrument: using LLMs to classify text at scale, simulate human behavior, generate experimental stimuli, or deploy chatbots as data collection instruments. Some courses are already doing this, and I imagine it will become a standard part of any social science curriculum. But I think of this as distinct from directing AI to do analyses, and so I’m excluding it from this list.
Next Steps
I’m sharing this now because I want to get it in front of people who can poke holes in it. If you teach quantitative methods, if you’ve been experimenting with agentic AI in your research, or if you just have strong opinions about what policy analysts will need to know in two years, I would genuinely love to hear from you. In the meantime, if you know of others, please send them my way.