Thinking about agents: enablement and over-reliance
AI is fantastic when we can't do a task.
But what about when we can, and we simply choose not to, out of laziness, pressure to be more productive, or competition?
If you're like me, you're being bombarded with people using AI to do cool things like build apps for themselves or automate their lives. Even Andrej Karpathy has hopped onto the vibe-coding trend, and he uses LLMs every day.
It's thrilling to offload our daily drudgery onto tools that can do it all for us. More and more systems promise to automatically write our emails, manage our social media, read our documents, and help us code. When AI genuinely succeeds in saving us hours of sweat every week, it can be hard not to fall in love.
But I've been wrestling with a question: Just because we can offload so many tasks, does that mean we should?
Enablement
After a decade working in AI, I stepped away from my role building a platform for fine-tuning LLMs to take a break and work on projects I'd long postponed. With AI lowering the barriers to making rapid, meaningful progress, these projects finally felt within reach.
For example, the paper we published using a decentralized council of LLMs to self-rank would have taken my last research team at Google several months just to build the experimentation harness, another several months for experiments, and weeks more to visualize results and write the paper. Yet with AI, our tiny team of mostly independent researchers produced a paper accepted to ACL — most of the work done in just a few weeks.
AI has grown and changed so much since I started in 2014. The days of researchers lamenting that LLMs are stochastic parrots are numbered. People in my network with minimal coding experience are using AI to build real tools aligned with their interests. For example, my Product Lead friend with near-zero SWE experience has built multiple tools around his personal passions (e.g. the meditation coaching app Apollo), and my partner, who has never written a single line of code in his life, built a patient-staff scheduling tool for his medical team at Stanford Hospital. These tools are simple, but there's a lot of low-hanging fruit like this that can have a transformative impact on organizations. No SWEs needed, no fine-tuning, just use your words.
Projects that seem daunting are suddenly accessible, not just for AI specialists but for nearly anyone with an idea. Earlier this week, I made an app that transcribes, diarizes, and summarizes YouTube and TikTok videos, from start to finish, in 20 minutes.
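For the curious, here's roughly the shape of that pipeline. This is a simplified sketch rather than the exact code I shipped, and it assumes yt-dlp for downloading audio, openai-whisper for transcription, and the OpenAI API for summarization; diarization would slot in between with a library like pyannote.audio.

```python
# Minimal sketch of a transcribe-and-summarize pipeline (not the exact app I built).
# Assumes: yt-dlp for audio download, openai-whisper for transcription,
# and the OpenAI API for summarization. Diarization (e.g. pyannote.audio)
# would be an additional step between transcription and summarization.
import yt_dlp
import whisper
from openai import OpenAI

def download_audio(url: str, out_path: str = "audio") -> str:
    """Download the audio track of a YouTube/TikTok video."""
    opts = {"format": "bestaudio/best", "outtmpl": f"{out_path}.%(ext)s"}
    with yt_dlp.YoutubeDL(opts) as ydl:
        info = ydl.extract_info(url, download=True)
        return ydl.prepare_filename(info)

def transcribe(audio_path: str) -> str:
    """Transcribe the downloaded audio with Whisper."""
    model = whisper.load_model("base")
    return model.transcribe(audio_path)["text"]

def summarize(transcript: str) -> str:
    """Ask an LLM for a short summary of the transcript."""
    client = OpenAI()  # expects OPENAI_API_KEY in the environment
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Summarize the transcript in a few bullet points."},
            {"role": "user", "content": transcript},
        ],
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    audio = download_audio("https://www.youtube.com/watch?v=EXAMPLE")  # hypothetical URL
    print(summarize(transcribe(audio)))
```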
It's not just apps that are accelerating due to AI – AI is also revolutionizing data analysis and research. Today, more than 21 AI research papers are published on Hugging Face every day on average, more than double the rate of a year ago. Source.
If LLMs were already game-changers, LLM agents (LLMs that can semi-autonomously execute on ambiguous tasks) push the boundaries further. Some of my favorite demos:
OpenAI's DeepResearch (as well as versions from HF, You, Jina, and Gemini; see this comparison)
Tutorials from stephengpope@ using tools like n8n and make
This past quarter has seen a deluge of agent-related releases — OpenAI's Agents SDK, Anthropic's MCP, and Manus, just to name a few. Each new agentic system promises to enable LLMs to handle tasks both tedious and complex with increasingly minimal human input.
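To make "agent" a bit more concrete: under the hood, most of these systems boil down to a loop in which the LLM repeatedly chooses between calling a tool and giving a final answer. Here's a deliberately minimal sketch of that loop; call_llm and the single tool are hypothetical stand-ins for illustration, not any particular vendor's SDK.

```python
# A deliberately minimal agent loop: at each step the LLM either calls a tool
# or gives a final answer. `call_llm` and the single tool are hypothetical
# stand-ins for illustration, not any particular vendor's SDK.
import json

def search_web(query: str) -> str:
    """Placeholder tool: a real agent would hit a search API here."""
    return f"(pretend search results for: {query})"

TOOLS = {"search_web": search_web}

def call_llm(messages: list[dict]) -> dict:
    """Stand-in for a real LLM call. Returns either a tool request
    {"tool": ..., "args": {...}} or a final {"answer": ...}."""
    if not any(m["role"] == "tool" for m in messages):
        return {"tool": "search_web", "args": {"query": messages[0]["content"]}}
    return {"answer": "Here's what I found: " + messages[-1]["content"]}

def run_agent(task: str, max_steps: int = 5) -> str:
    messages = [{"role": "user", "content": task}]
    for _ in range(max_steps):
        decision = call_llm(messages)
        if "answer" in decision:  # the model decided it's done
            return decision["answer"]
        result = TOOLS[decision["tool"]](**decision["args"])  # run the requested tool...
        messages.append({"role": "tool", "content": json.dumps(result)})  # ...and loop
    return "Stopped: exceeded the step budget."

print(run_agent("What agent frameworks were released this quarter?"))
```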
When you spend enough time with these tools, you see their limits, but you also see amazing capabilities, and it's hard not to get excited about what's coming in the future.
In Sam Altman's recent essay "Three Observations", he makes a bold prediction that "In a decade, perhaps everyone on earth will be capable of accomplishing more than the most impactful person can today". It's optimistic, but at the pace we're going, it feels within reach.
Claude Plays Pokemon. Does beating the game mean we've reached AGI?
Over-reliance
AI seems genuinely fantastic when we truly can't do a task. But what about when we can, and we simply choose not to, out of laziness, pressure to be more productive, or competition?
Research has long shown that when we stop doing something ourselves, those abilities tend to atrophy. For example, engineering managers who haven't coded in years often struggle to find new jobs. Research also suggests reliance on technology can lead to cognitive skill erosion.
Ophir, Nass, and Wagner (2009) studied multitasking and found that over-reliance on digital tools can weaken focus and deep learning.
Clements et al. (2011) argue that while calculators can be helpful, they should be introduced only after conceptual mathematical understanding is solid.
Ishikawa et al. (2008) found that people who use GPS frequently develop worse spatial awareness and topographical memory and often lose their intuitive sense of spatial relationships.
Barr et al. (2015) showed that frequent smartphone users demonstrate lower analytical thinking because they rely on quick lookups instead of deep reasoning.
Manu Kapur (2016) suggests that struggling with problems before receiving instruction leads to deeper learning and retention.
Bjork and Bjork (2011) argue that learning should involve challenges and setbacks because effortful retrieval strengthens memory and understanding. If things are too easy, deeper comprehension is hindered.
Every time we ask AI to write code for something instead of doing it ourselves, we may be weakening the neural "muscles" we once used for those tasks.
The Financial Times reports that average information processing, reasoning, and problem-solving capabilities have been declining since the mid-2010s.
Nobody would argue that the fundamental biology of the human brain has changed in that time span. But there is growing evidence that the extent to which people can practically apply that capacity has been diminishing. Is this just a by-product of exponential technological progress? What happens to our brains in a world where AI becomes our generalized calculator for everything?
‘Brain rot’ (Oxford's word of the year in 2024) is defined as “the supposed deterioration of a person’s mental or intellectual state, especially viewed as the result of overconsumption of material (now particularly online content) considered to be trivial or unchallenging.”
What happens to our brains in a world where AGI makes everything trivial?
Brain Rot or Just Natural Progress?
"Brain rot" sounds negative, implying AI weakens us. But history contains countless examples of cognitive offloading that proved beneficial:
No one laments our inability to do long division mentally now that we have calculators
Cursive writing was once essential; now it's decorative
People used to memorize phone numbers; now they don't need to
Not all cognitive decline is problematic. Cognitive offloading can be positive when it redirects effort toward more meaningful work.
The key distinction is whether we're losing something valuable.
If AI fully automates email writing and calendar management, are we losing anything essential? Probably not.
If AI automates creative writing, argumentation, or complex decision-making, does that degrade human abilities that do matter?
We should also consider robustness to AI failure. If systems fail, go offline, or can't be controlled, and people can't function without them, that's a genuine vulnerability.
We see a parallel happening in the world of LLM evals, too. Even though the state of the art for many of the most popular AI tasks of 2017 is still specialized neural networks rather than LLMs, no one seems to care. The Hugging Face Open LLM Leaderboard, once an industry obsession, is being retired because it "is slowly becoming obsolete; we feel it could encourage people to hill climb in irrelevant directions."
When is Over-Reliance Okay?
Over-reliance is acceptable when the skill is no longer relevant or when AI failure wouldn't be catastrophic. But it becomes problematic when:
The skill is foundational for higher-level reasoning (e.g., critical thinking, creativity) – atrophy.
AI failure would leave us helpless (e.g., doctors losing diagnostic skills) – fragility.
It limits our ability to recognize when AI is wrong – vulnerability.
AI and Technical Interviews
@im_roy_lee, creator of InterviewCoder, which provides a translucent overlay on your screen, invisible to Zoom screen sharing, where an LLM can look at your screen, listen to your interview, and feed you answers on the fly.
Ask any experienced engineer about technical interviews, and you'll likely get an eye roll. The industry has long debated LeetCode-style interviews — algorithmic whiteboarding under pressure. They provide standardized filtering but feel disconnected from real-world engineering, favoring those who grind problems over those with practical experience.
At my last company, we opted for interviews centered around pair programming — collaborative exercises that gave a clearer picture of how candidates actually think and problem-solve in realistic scenarios. Yet even at senior/staff levels in FAANG, LeetCode-style formats persist because, despite their flaws, they seem to correlate with something useful, for now.
InterviewCoder – an invisible AI for technical interviews – was created by a 20-year-old Columbia University student with a single purpose: to hack the technical interview process. Despite broad condemnation from the industry, which criticized the tool as one that empowers cheating and led to Roy's own pending suspension from Columbia, the service has allegedly hit $1M ARR in just 36 days.
The premise for tools like InterviewCoder is understandable: if you view these interviews as arbitrary gatekeeping, especially when all SWEs use some form of AI to write code on the actual job, why not use AI to level the playing field during the interview?
AI-assisted coding makes you feel stronger in the moment, but overuse may quietly erode your problem-solving abilities. You might ace an interview with AI assistance, but what happens when facing an ambiguous, high-stakes engineering problem without your AI crutch?
It's easy to criticize LeetCode interviews. But they do force development of skills companies still value — structured problem-solving, algorithmic fluency, clear communication. AI-assisted interviewing blurs the line: If everyone uses these tools, do technical interviews still test anything meaningful? Or do they become an AI-assistance arms race?
I don't have a definitive stance on whether tools like InterviewCoder are good or bad. But they act as a forcing function to clarify paradoxes in our processes and reflect on how we conceptualize competence in an AI-augmented world.
At the end of the day, I'll still grind some LeetCode problems the old-fashioned way. Not because I love it, but because the ability to struggle without AI remains valuable. Maybe not forever, but for now.
A Different Kind of Agent
At the end of February, I registered for the ElevenLabs Worldwide "AI Agents" hackathon with a few friends in New York whom I'd long wanted to hack with.
When we think of agents, we often imagine highly active entities, but my hackathon teammates and I wondered if an agent could be something very different: What if an agent could be intentionally quiet? Instead of trying to run your life or solve tasks end-to-end, what if it only stepped in when absolutely necessary?
Imagine a different kind of “agent” that is almost always silent. It listens thoughtfully, as a quiet companion in conversations, offering gentle yet precise fact-checking, but only when a significant discrepancy arises.
That's how we built Cue — a real-time fact-checking AI that listens to conversations but remains silent most of the time. It only speaks up if it detects a factual statement so off-base that it could derail discussion if uncorrected. No unsolicited suggestions, no micro-corrections, no overshadowing natural interaction.
We wanted to explore a different relationship with AI—one where technology is practically invisible unless something truly consequential might slip by. In other words, an AI that actively respects human agency, allowing us to brainstorm, hypothesize, and even be wrong without constant correction. Because part of being human is figuring things out through trial and error.
This stems from the same idea behind letting students struggle with problems before tutors intervene. If tutors do all the heavy lifting, students learn less. Likewise, if AI agents are always in the driver's seat — dictating decisions, generating every email, completing every project — when do we build or refine our own thought processes?
AI is an incredible tool. But the more we rely on it for everything, the less we hone our ability to handle complexity or creative challenges. When AI eventually fails or gives incorrect answers, we may lack the instincts to spot errors.
Cue doesn't solve this conundrum — it's a small demonstration of an alternative path, a reminder that agents can be designed to stay in the background. They don't have to replace us; they can watch, wait, and offer nudges only when genuinely needed.
Building Cue taught us about calibration difficulties: intervene too frequently, and you foster dependency; too rarely, and the system becomes pointless. But we showed it's possible to create agents that respect our messy, flawed human way of learning rather than overshadowing it.
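For the technically curious, the heart of that calibration problem is a single gate. Here's a simplified sketch of the idea, not Cue's actual code; the severity scale, prompt, and threshold below are all illustrative.

```python
# Simplified sketch of Cue's "stay quiet unless it really matters" gate.
# Not the actual implementation; the severity scale and threshold are illustrative.
from openai import OpenAI

client = OpenAI()
SEVERITY_THRESHOLD = 8  # only interrupt for severe, conversation-derailing errors

def assess_claim(claim: str) -> int:
    """Ask an LLM to rate how severely wrong a factual claim is, on a 0-10 scale."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": (
                "Rate how severely factually wrong this claim is on a 0-10 scale. "
                "0 = fine or unverifiable, 10 = egregiously wrong. Reply with only the number."
            )},
            {"role": "user", "content": claim},
        ],
    )
    return int(response.choices[0].message.content.strip())

def maybe_interject(claim: str) -> str | None:
    """Return a correction only if the error clears the severity threshold."""
    if assess_claim(claim) >= SEVERITY_THRESHOLD:
        return f"Quick fact-check: '{claim}' doesn't appear to be accurate."
    return None  # stay silent: most of the time, the agent says nothing
```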
Hacking on Cue at the end of my time in NYC.
Final Thoughts
It's not going to be a question of what AI is capable of — it's what we want AI to be in our lives. We're at a juncture where AI's abilities are expanding rapidly, and it's tempting to let them run large parts of our personal and professional lives.
I'm not advocating blocking progress. But maybe we should each consider our relationship with these increasingly competent systems. Are we using them to complement and challenge us — or in ways that slowly erode our capacity to think, create, and be human?
Personal Note
Building Cue was one of my last projects before making the move from NYC to the Bay Area last week.
It’s been rewarding to reflect on how much has changed since I last called California home, and it feels great to return to familiar territory. The side-project list continues to grow, but I'm also beginning to explore what's next professionally. If you're hiring, I'd be delighted to connect.
P.S. If you're in the Bay Area and want to catch up, please reach out — would love to hear from you.
Leaving Google
(As published originally on Medium)
Natural Language Generation team at Google ~2018.
All good things come to an end, and October was my last month at Google. I was at Google for 6 years, and when I left I had been there longer than 90% of the company! This post will serve as a brief announcement and small summary of my career there.
For my entire tenure at Google, I worked on a specific problem in NLP called natural language generation. I even produced a YouTube video about it with Google Cloud in 2017.
14 minutes of nerdy brilliance?
The dream of natural language generation is to make a computer that can talk and text with perfect fluency, in any language. Of course, languages are really complex, and saying the right things in the right way requires a lot more than good grammar. Advancements in machine translation and large language models like GPT-3 have effectively solved how to get computers to write “believably” like humans. Getting a deep learning model to perfect quality, however, is an entirely different story. Our team published a paper about how incredibly difficult this is, even on an extremely narrow set of domains and languages.
The relentless pursuit of the “perfect generative language model” led me on rather lovely adventures in linguistics, ML infrastructure, data science, and NLP. I travelled to all sorts of places including London, Zurich, and Hong Kong where I hung out, learned from, and befriended many people who were often smarter and more talented than me.
Deep learning is stunning technology, but applying it to real world problems is really hard, often not because of the power of models themselves, but because of everything else — data, evaluation, quality, and serving. The existing solutions for these other aspects feel immature, even at a place like Google, and teams like ours often opted to invest in bespoke systems.
I began to wonder about a fundamentally different interface to deep learning — one that’s technically expressive, but crucially simple and easy to use — for any problem, for any person. This led me to follow through on an interview at a seed-stage startup with a colleague I have known for years and one of the pioneers of declarative machine learning. This is where I will work next.
As an outspoken green-bubble ambassador, leaving Google is certainly a bittersweet moment. In all, I am very grateful to Google. I’ve learned so much, and I feel incredibly lucky to have met and worked with many amazing colleagues. In the future, it would be nice to one day return to the problem of generative language (or music!) models. For now, I’m excited to see what it’s like being on a smaller, riskier ship, and hopefully to build something that makes even the most technical deep learning technology compellingly easy to use.
Google searching your own name: Visiting Ground Zero
(As published originally on Medium)
Ground Zero in NYC.
When I was 11, I googled my own name to see who else had the same name as me. The most prominent results were for a “Justin Zhao” who graduated from the City University of New York. Justin worked as a computer technician for Aon in New York City. Justin died on September 11, 2001. He was a victim of 9/11.
Seeing such misfortune as the top search result for my own name sharpened my ambition to become more well known. I didn’t want people to Google my name and be reminded of a tragedy that killed thousands of people. In the years that followed, I leaned into my interests in music, computer science, and engineering. I won awards, created social media, and posted YouTube videos. Gradually, content about me became the top search results for my own name, and the other Justin Zhao from my youth felt like little more than a distant memory.
At 22, I graduated from Columbia University and started working as a software engineer in New York City. The new World Trade Center had finished construction, and I decided to check it out. As I walked towards the triangular skyscraper, I was surprised to find two massive square craters with names carved into their stone borders. I eavesdropped on a nearby tour guide, who explained that the pools are memorials built on the footprints of the Twin Towers, and that the names around the pools are of the people who died on 9/11.
I instantly remembered the “Justin Zhao” I discovered in my adolescence, and I felt a wave of excitement that his name might be engraved somewhere nearby.
As I walked around, I noticed folks sitting on benches, checking their phone, casually enjoying a bite to eat. Around them was a steady stream of stereotypical New Yorkers briskly pacing through the park without a minute to waste. Scattered throughout were tourists — families and kids taking pictures, marveling at the architectural spectacle. It was all very consistent with the diversity, energy, and hullabaloo of New York City.
Finally, there were those who stood closer to the pools, who moved more slowly than everyone around them, looking on with heavier hearts. Some of them would touch or underline an engraved name with their hand, or place flowers. I was touched that even on the most ordinary day, there were many who had come to cherish their loved ones who passed more than a decade ago. At the eastern edge of the North Pool, one family sobbed quietly. Perhaps they were remembering a lost parent, a relative, or a friend. The light drizzle masked their tears. My heart goes out to them.
The showers became rain. I considered resuming my search another day, but then I found it — “Justin Zhao”, engraved on the southeast corner of the South Pool. Seeing my own name memorialized in front of me was surreal. I stood there drenched.
Articles about the Justin Zhao from 9/11 can still be found, but they are a few pages deep in the search results. According to a New York Times archive, Justin is survived by his brother Jimmy.
Face to face, I contemplated our similarities. We were both named “Justin Zhao”. We were both raised in Chinese families; we both studied and worked in New York, and in technical roles. I recently turned 26, which is the same age he was when he passed away.
In myself, I recognize the enthusiasm I have for the path ahead of me, and the privilege of dreaming of who I want to be and what I want to accomplish. There’s so much food I want to eat, music to play, experiments to try, places to see, books to read, people to meet, and time to spend with people I love. This makes me feel tremendous sadness for Justin and the other victims, who may have felt the same, but whose lives were cut short.
If I’m ever in the area with time to spare, I will go to the Ground Zero Memorial to find Justin. Sometimes I’ll pull up the latest search results for our name on Google to check out what some of the other Justin Zhaos have been up to. Recently, there’s a high school basketball star and a Chinese actor. I wonder if they know about 9/11 Justin Zhao.
I reflect on the fleeting lives that we all live. I admire the unique and complex paths we forge for ourselves, and the deep and peculiar connections we share as a human species through living, loving, and ultimately dying, on this Earth. A simple coincidence of having the same name and an internet-powered search engine has woven Justin’s story into my own. I use Justin to remind myself of the weight of tragic events like 9/11, the consequences of terrorism, and the tough challenges that lie ahead for our world. But our story is also about appreciation, sympathy, and shared humanity.
With the time that we have, may we hope and work for a better world.
Jie Yao Justin Zhao’s name engraved on the South Pool. Rest in peace.
Natural Language Generation at Google Research
(Below is a summary of the video transcript, courtesy of ChatGPT).
As technology becomes more integral to our daily lives, the need for seamless human-computer interaction grows. Virtual assistants like Siri, Google Assistant, and Alexa have popularized the idea of conversational AI—software that allows us to interact with machines using natural language. However, beneath the surface, creating smooth, human-like conversations involves complex challenges, particularly in the realm of Natural Language Generation (NLG).
The Two Pillars of Natural Language Interfaces
At its core, NLG is the task of enabling computers to generate human-like responses based on input data. For years, much of the focus in natural language processing (NLP) revolved around understanding text—analyzing words to determine their meaning, intent, and structure. Yet the other half of the equation, the ability to generate natural responses, is equally vital for building conversational systems. Without fluent and contextually appropriate replies, even the most advanced understanding models will fall short in user satisfaction.
Natural language interfaces depend on two crucial tasks:
- Understanding user input: Ensuring that the system correctly interprets the user's question or statement.
- Generating responses: Formulating an answer that addresses the user’s query while feeling natural and coherent in the flow of conversation.
The Role of Structured Data in Natural Responses
In the field of NLG, the problem becomes more pronounced when dealing with structured data. For instance, if a user asks for the weather forecast, a conversational system must transform raw data—like temperatures and conditions for the coming week—into a natural, coherent response.
As Justin Zhao, a Google Research Engineer, explains, reading data in a mechanical, templated format (e.g., "On Sunday, it will be 65 degrees...") creates responses that are correct but awkward and repetitive. Instead, a more human response might summarize: “It’ll be cloudy until Thursday, with temperatures ranging from the mid-60s to the low 70s.”
This challenge highlights why rule-based systems struggle to generate flexible, spontaneous responses. They rely on manually crafted templates and require extensive engineering. Machine learning offers a more scalable solution by training models to generalize and create responses in ways that rule-based systems cannot.
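A toy example makes the contrast concrete. The template below (illustrative only, not Google's system) fills its slots faithfully but produces the same stilted sentence for every day, which is exactly the repetitiveness a learned model is meant to avoid.

```python
# Toy illustration of why templated NLG sounds robotic: every day gets
# the same sentence shape, with no summarization across days.
forecast = [
    ("Sunday", "sunny", 65),
    ("Monday", "cloudy", 68),
    ("Tuesday", "cloudy", 71),
]

TEMPLATE = "On {day}, it will be {condition} with a high of {temp} degrees."

for day, condition, temp in forecast:
    print(TEMPLATE.format(day=day, condition=condition, temp=temp))

# The output repeats the same structure three times. A human (or a learned model)
# would instead summarize: "Cloudy after Sunday, with highs in the mid-60s to low 70s."
```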
Machine Learning and Recurrent Neural Networks (RNNs)
Machine learning, particularly neural networks, has transformed natural language generation. By training models on large datasets, engineers can teach them to infer appropriate structures and patterns for generating natural responses.
One of the key architectures used in NLG is the Recurrent Neural Network (RNN). Unlike traditional feed-forward networks, RNNs process inputs sequentially, making them ideal for language generation where word order is crucial. For example, "The cat sat on the mat" is very different from "Sat the mat on the cat."
RNNs maintain context by feeding previous outputs back into the model, enabling it to generate more coherent sentences. Additionally, RNNs use attention mechanisms to focus on specific parts of the input data at different points in the conversation.
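In code, that recurrence is just a hidden state updated at every time step. A bare-bones NumPy sketch, with untrained random weights purely for illustration:

```python
# Bare-bones RNN step in NumPy: the hidden state carries context forward.
# Untrained random weights, purely to illustrate the recurrence.
import numpy as np

hidden_size, embed_size = 16, 8
W_xh = np.random.randn(hidden_size, embed_size) * 0.1   # input -> hidden
W_hh = np.random.randn(hidden_size, hidden_size) * 0.1  # hidden -> hidden
b_h = np.zeros(hidden_size)

def rnn_step(x_t: np.ndarray, h_prev: np.ndarray) -> np.ndarray:
    """One time step: combine the current input with the previous hidden state."""
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

# Process a "sentence" of 6 token embeddings, one word at a time.
tokens = np.random.randn(6, embed_size)
h = np.zeros(hidden_size)
for x_t in tokens:
    h = rnn_step(x_t, h)  # h now summarizes everything seen so far
```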
The Power of Attention Mechanisms and Transformers
Attention mechanisms allow models to "pay attention" to relevant parts of input data when generating responses. In weather forecasting, for instance, the model might prioritize temperature data when constructing a sentence about the week's weather. Visualizing this process often shows diagonal lines in attention graphs, illustrating how the model decides what data to focus on at each step.
While RNNs have been instrumental in advancing NLG, transformer models have further revolutionized the field. Transformers use self-attention mechanisms and process text in parallel, offering both speed and accuracy improvements over RNNs. These models are the foundation for state-of-the-art systems like GPT and BERT.
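The self-attention at the heart of transformers is compact enough to write out directly. A minimal NumPy sketch of scaled dot-product attention (single head, no masking, illustrative only):

```python
# Minimal scaled dot-product attention in NumPy (single head, no masking).
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q: np.ndarray, K: np.ndarray, V: np.ndarray) -> np.ndarray:
    """softmax(QK^T / sqrt(d_k)) V: each output row is a weighted mix of the value rows."""
    d_k = Q.shape[-1]
    weights = softmax(Q @ K.T / np.sqrt(d_k))  # how much each position attends to each other
    return weights @ V

# 5 input positions, dimension 8: each output row attends over all inputs.
Q = K = V = np.random.randn(5, 8)
out = attention(Q, K, V)  # shape (5, 8)
```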
Ethical Considerations in AI-Driven Conversation Systems
As conversational AI systems become more sophisticated, ethical concerns arise. AI-driven language generation can unintentionally propagate biases present in the training data or generate misleading information. As these systems become more human-like, there are important questions about transparency, consent, and the boundaries of human-computer interaction.
Conclusion
Natural language generation is essential to the future of conversational AI. Researchers like Justin Zhao and his team are pushing the boundaries of NLG, moving us closer to a world where interactions with computers feel as natural as talking to another human. However, with this progress comes the responsibility to use these technologies ethically, ensuring transparency and trust between users and machines.