In May 2026, Anthropic — the company that makes Claude, the AI you use every day — hired Andrej Karpathy to lead their pre-training research. That's the team responsible for making the underlying model smarter. This is roughly equivalent to Real Madrid hiring the world's best football coach: it signals both Karpathy's standing in the field and where the frontier of AI is being fought.
But the reason to care about Karpathy isn't his résumé. It's that he's spent a decade building tools and teaching resources specifically to help smart people who aren't ML researchers actually understand what's happening inside AI. His micrograd library teaches backpropagation in 150 lines. His nanoGPT re-created GPT-2 in 300 lines. His YouTube series — "Neural Networks: Zero to Hero" — has hundreds of thousands of learners. He is, arguably, the best explainer of deep learning alive today.
If you want to understand the technology you use and sell every day, Karpathy is the best mentor available. He doesn't require a PhD. He requires curiosity and willingness to sit with code for a few hours.
Born 1986 in Slovakia, raised in Canada. PhD at Stanford under Fei-Fei Li (2015), thesis on image captioning. Co-founded OpenAI in 2015 alongside Sam Altman, Elon Musk, and Greg Brockman. Left for Tesla in 2017 to run their Autopilot AI team — the group responsible for the neural networks that make Teslas drive themselves. Built Tesla's "occupancy network" approach to autonomous driving. Left Tesla in 2022. Returned to OpenAI briefly in 2023. Left again in February 2024 to start Eureka Labs, an AI-native education startup. Coined the term "vibe coding" in February 2025 (Collins Dictionary Word of the Year 2025). Joined Anthropic in May 2026 to lead pre-training research.
Outside of his career: he maintains one of the most educational GitHub accounts in existence, with tools that have collectively amassed over 300,000 stars. He is not a theoretician — he is an engineer who understands theory deeply enough to explain it simply.
In November 2017, Karpathy published an essay called "Software 2.0." The central claim: we are in the middle of a fundamental transition in how software is written. Traditional software (Software 1.0) is explicit instructions — if-statements, for-loops, logic a human writes. Neural networks represent Software 2.0: instead of writing the logic yourself, you specify what good behavior looks like (training data), and an optimization process discovers the program. The weights of a neural network are the program.
Mental Model: Think of it like hiring for two completely different roles. Software 1.0 is hiring an employee and giving them a detailed rulebook: "If the customer says X, do Y. If they say Z, do W." Software 2.0 is hiring an employee, showing them 10,000 examples of great customer service, and letting them figure out the pattern. The second employee can handle situations your rulebook never anticipated, but you can't easily read their "rulebook" because it's encoded in the patterns they internalized, not written anywhere.
Real Example: Tesla's Autopilot under Karpathy moved from a Software 1.0 approach (explicit rules like "if lane marking detected at angle X, steer Y degrees") to Software 2.0: feed millions of hours of human driving into a neural network and let it learn what "good driving" looks like from the data. The network developed its own understanding of lanes, cars, cyclists, and edge cases that no human would have thought to code explicitly. This is also exactly what ChatGPT is: a massive Software 2.0 program that learned language from text, not from explicit grammar rules.
Common Mistake: People think Software 2.0 means "AI writes the code." It doesn't. It means the behavior itself is encoded in data and weights, not in traditional code. The Python file that runs a neural network is tiny and simple, but the intelligence is in the 70 billion parameters it loads. When people say "AI is just statistics," they're using Software 1.0 vocabulary to describe a Software 2.0 phenomenon.
What to Do Next: Read the original essay: search for "Karpathy Software 2.0 Medium." It's 1,500 words and takes 10 minutes. Notice that he wrote it in 2017, more than a year before GPT-2 was released (February 2019). Then read his 2025 follow-up on "Software 3.0" (LLMs as operating systems, prompts as programs).
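The contrast between the two paradigms can be sketched in a few lines. This is a toy illustration, not anything from Karpathy's essay: the temperature data, the function names, and the choice of a one-feature logistic classifier are all assumptions made for the example. The first function is a rule a human wrote; the second "program" is two numbers found by gradient descent from labeled examples.

```python
import math


# Software 1.0: a human writes the rule explicitly.
def is_hot_v1(temp_c):
    return temp_c > 25.0  # threshold chosen by the programmer


# Software 2.0: a human supplies labeled examples; optimization finds the rule.
# Hypothetical toy data: (temperature in Celsius, label) with 1 meaning "hot".
EXAMPLES = [(10.0, 0), (15.0, 0), (20.0, 0), (24.0, 0),
            (26.0, 1), (30.0, 1), (35.0, 1), (40.0, 1)]


def train_v2(examples, steps=2000, lr=0.5):
    """Fit a one-feature logistic classifier with plain gradient descent."""
    mean = sum(x for x, _ in examples) / len(examples)  # center the input
    w, b = 0.0, 0.0  # the entire learned "program" is these two weights
    for _ in range(steps):
        grad_w = grad_b = 0.0
        for x, y in examples:
            feature = (x - mean) / 10.0
            p = 1.0 / (1.0 + math.exp(-(w * feature + b)))  # predicted P(hot)
            grad_w += (p - y) * feature  # gradient of the log loss w.r.t. w
            grad_b += (p - y)            # gradient of the log loss w.r.t. b
        w -= lr * grad_w / len(examples)
        b -= lr * grad_b / len(examples)
    return w, b, mean


def is_hot_v2(temp_c, params):
    w, b, mean = params
    return w * (temp_c - mean) / 10.0 + b > 0.0


PARAMS = train_v2(EXAMPLES)
```

Nobody typed a threshold into `is_hot_v2`; the behavior lives in `PARAMS`. Scale the same idea up from two weights to billions and you have the situation the essay describes.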
Backpropagation ("backprop") is the algorithm that allows neural networks to learn. A neural network makes a prediction. That prediction is wrong. Backprop calculates, for every single parameter (weight) in the network, how much that parameter contributed to the error; an optimizer such as gradient descent then adjusts each parameter to be slightly less wrong next time. Do this millions of times, and the network gets good. Virtually every neural network in use today, from the model inside Claude to the one recommending your next YouTube video, learns through backprop.
Mental Model: Imagine you're adjusting the settings on a mixing board to make a song sound right. You have 100 knobs. You turn them all slightly, listen to the result, hear what got better and what got worse, and adjust accordingly. Backprop is the mathematical version of this: it calculates the "gradient" (which direction to turn each knob, and by how much) using the chain rule of calculus. The key insight: you don't have to try every combination. Calculus tells you the direction of improvement without brute-force search.
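To make the "calculus instead of brute force" point concrete, here is a small sketch with three knobs. The function `loss` and its inputs are made up for illustration. It computes the gradient twice: once with the chain rule (what backprop does), and once by nudging each knob and re-measuring (the slow way), so you can see that they agree.

```python
def loss(a, b, c):
    """A tiny made-up 'network': multiply, add, then square."""
    inner = a * b + c
    return inner ** 2


def chain_rule_grads(a, b, c):
    """Gradient via the chain rule: one pass, no re-evaluation per knob."""
    inner = a * b + c
    d_inner = 2 * inner  # d(inner^2) / d(inner)
    # d(inner)/da = b, d(inner)/db = a, d(inner)/dc = 1
    return d_inner * b, d_inner * a, d_inner * 1.0


def finite_difference_grads(a, b, c, h=1e-5):
    """Gradient by nudging each knob up and down and measuring the change."""
    args = [a, b, c]
    grads = []
    for i in range(len(args)):
        up = list(args)
        up[i] += h
        down = list(args)
        down[i] -= h
        grads.append((loss(*up) - loss(*down)) / (2 * h))
    return tuple(grads)
```

With `a=2.0, b=-3.0, c=10.0`, `chain_rule_grads` returns `(-24.0, 16.0, 8.0)`, and the finite-difference version lands on the same values up to rounding. The nudging approach needs two evaluations per knob, which is hopeless at billions of parameters; the chain rule gets every gradient from a single backward pass.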
Real Example: Karpathy's micrograd library implements backprop in 150 lines of pure Python, with no NumPy and no PyTorch. It can train a small neural network to classify points. Here's the core loop it implements:
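```python
# Assumes a micrograd-style model, an input x, and a target correct_answer
# Loss: how wrong are we? (squared, so the sign of the error doesn't matter)
loss = (model(x) - correct_answer) ** 2
# Clear gradients left over from the previous step
model.zero_grad()
# Backprop: how should each weight change?
loss.backward()  # fills .grad for every parameter
# Update: nudge weights in the right direction
for param in model.parameters():
    param.data -= learning_rate * param.grad
```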
That's it. That's the core of training every neural network in the world. The math behind loss.backward() is what Karpathy's micrograd lecture makes completely transparent.
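If you want a feel for what sits behind `loss.backward()` before watching the lecture, here is a heavily simplified sketch in the spirit of micrograd. It is not Karpathy's code: it supports only addition and multiplication, and the `fit_weight` helper and its numbers are invented for the demo. Each `Value` remembers its parents and a small function that passes gradient back to them.

```python
class Value:
    """A scalar that remembers how it was computed, so gradients can flow back."""

    def __init__(self, data, parents=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents
        self._backward = lambda: None  # replaced by the op that creates this node

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data + other.data, (self, other))

        def _backward():
            # Addition passes the gradient through unchanged to both inputs
            self.grad += out.grad
            other.grad += out.grad

        out._backward = _backward
        return out

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        out = Value(self.data * other.data, (self, other))

        def _backward():
            # Product rule: each input's gradient is the other input's value
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad

        out._backward = _backward
        return out

    def backward(self):
        # Order nodes so each one is processed after everything that uses it
        order, seen = [], set()

        def visit(node):
            if node not in seen:
                seen.add(node)
                for parent in node._parents:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0  # d(output)/d(output)
        for node in reversed(order):
            node._backward()


def fit_weight(x=3.0, target=6.0, lr=0.05, steps=20):
    """Learn w so that w * x is close to target (hypothetical demo values)."""
    w = Value(0.5)
    for _ in range(steps):
        err = w * x + (-target)  # forward pass
        loss = err * err         # squared error
        w.grad = 0.0             # reset before each backward pass
        loss.backward()
        w.data -= lr * w.grad    # gradient descent step
    return w.data
```

Calling `fit_weight()` drives `w` toward 2.0, the value that makes `w * 3.0` equal `6.0`. The real micrograd adds more operations (powers, `tanh`, ReLU) and a small neural-network layer on top, but the mechanism is the same.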
Common Mistake: People learn backprop abstractly and think they understand it, but they can't build it. Karpathy's core teaching insight: if you can't implement it from scratch, you don't understand it. You understand the word "backpropagation"; you don't understand backpropagation. The micrograd exercise fixes this in about 4 hours.
What to Do Next: Watch Karpathy's micrograd video (2h 30min, "The spelled-out intro to neural networks and backpropagation: building micrograd" on YouTube). Code along in a Colab notebook. By the end, you will have built a working autograd engine from scratch. That understanding transfers to every neural network you'll ever encounter.
The Transformer is the architecture behind GPT, Claude, Gemini, LLaMA — every major language model in use today. It was introduced in a 2017 Google paper titled "Attention Is All You Need." Its core mechanism, attention, allows every token in a sequence to "look at" every other token and decide how much to weight it when making a prediction. This is what allows GPT to understand that "it" in "The cat sat on the mat because it was comfortable" refers to the cat, not the mat.
Mental Model: Before Transformers, language models processed text left-to-right, word by word, like reading with a finger covering everything ahead. The Transformer is like being allowed to see the entire sentence at once and highlight which words matter most for understanding each other word. "Attention" is a learned spotlight: the model learns, through training, which words to pay attention to when processing each token. This is why GPT can maintain context across thousands of words: it's not remembering a sequence, it's attending to the right parts of a window.
Real Example: Karpathy's nanoGPT re-implements the full GPT-2 Transformer in ~300 lines of Python. The attention mechanism in that code looks roughly like:
```python
# Excerpt from inside an attention module's forward(); F is torch.nn.functional
# For each token, compute attention scores with every other token
q = self.query(x)  # "what am I looking for?"
k = self.key(x)    # "what do I contain?"
v = self.value(x)  # "what do I communicate?"
# Score = how much each token attends to each other token,
# scaled by 1/sqrt(head_size) to keep the softmax well-behaved
att = (q @ k.transpose(-2, -1)) * k.size(-1) ** -0.5
# mask is True at future positions: a token can't look ahead
att = att.masked_fill(mask, float("-inf"))
att = F.softmax(att, dim=-1)
# Weighted combination of values
y = att @ v
```
Seven lines of code that implement the core mechanism used in virtually every major language model today.
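The snippet above depends on PyTorch and on a surrounding module, so it won't run on its own. As a dependency-free sketch of the same computation, the version below uses plain Python lists. The function names and the toy inputs are assumptions for illustration, and the learned query, key, and value projections are skipped: you pass in `q`, `k`, and `v` directly.

```python
import math


def softmax(row):
    """Turn a list of scores into weights that sum to 1."""
    m = max(row)  # subtract the max for numerical stability
    exps = [math.exp(v - m) for v in row]  # exp(-inf) is 0.0, so masked slots vanish
    total = sum(exps)
    return [e / total for e in exps]


def causal_attention(q, k, v):
    """Single-head causal attention over T tokens.

    q, k, v are lists of T vectors (lists of floats), one vector per token.
    Returns (outputs, weights), where weights[i][j] is how much token i
    attends to token j.
    """
    T, d = len(q), len(q[0])
    scale = 1.0 / math.sqrt(d)
    outputs, weights = [], []
    for i in range(T):
        scores = []
        for j in range(T):
            if j > i:
                scores.append(float("-inf"))  # can't look at future tokens
            else:
                dot = sum(a * b for a, b in zip(q[i], k[j]))
                scores.append(dot * scale)
        w = softmax(scores)
        weights.append(w)
        # Weighted combination of value vectors
        outputs.append([sum(w[j] * v[j][c] for j in range(T))
                        for c in range(len(v[0]))])
    return outputs, weights


# Hypothetical toy input: three tokens with two-dimensional vectors
TOKENS = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
OUTPUTS, WEIGHTS = causal_attention(TOKENS, TOKENS, TOKENS)
```

The first token can only attend to itself, so `WEIGHTS[0]` is `[1.0, 0.0, 0.0]` and its output equals its own value vector. Later tokens spread their weight across everything at or before their position.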
Common Mistake: People think the Transformer "understands" language the way a person does. It is better described as a very sophisticated pattern-completion engine: what it captures is statistical relationships between tokens at enormous scale. The distinction matters: it's why LLMs confidently confabulate facts (they're completing patterns, not recalling truth), and why prompting style matters so much (you're tuning the completion signal).
What to Do Next: Watch Karpathy's "Let's build GPT: from scratch, in code, spelled out" (YouTube, ~2h). He builds the full attention mechanism live, explaining every line. After that, the architecture behind Claude and GPT-4 will not be a black box to you.
Karpathy's "State of GPT" talk (Microsoft Build 2023) demystified how ChatGPT-style models are actually created. There are four stages, each building on the previous: Pretraining → Supervised Fine-Tuning (SFT) → Reward Modeling → Reinforcement Learning from Human Feedback (RLHF). Most people think of LLMs as one thing (a chatbot), but they are the result of this four-stage process, and each stage creates a fundamentally different kind of model.
Mental Model: Think of a person becoming a doctor. Pretraining is reading every medical textbook, journal article, and patient forum that exists, absorbing the structure of medical knowledge through massive exposure. SFT is a residency where experienced doctors show you what good answers look like for specific cases. Reward Modeling is building a rating system: medical reviewers rank several possible answers to each case from best to worst. RLHF is using that rating system to further train the doctor, repeatedly asking "which answer would the raters prefer?" and optimizing toward that. The result: a doctor who not only knows medicine, but knows how to communicate it in a way humans find helpful.
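The reward-modeling step can be made a little more concrete. The text above only says that reviewers rank answers; the sketch below assumes one common way of turning a ranking into a training signal, a pairwise preference loss of the form -log(sigmoid(r_chosen - r_rejected)). The function names and the example ranking are hypothetical.

```python
import math
from itertools import combinations


def pairs_from_ranking(ranked_answers):
    """Turn a best-to-worst ranking into (preferred, rejected) pairs."""
    return list(combinations(ranked_answers, 2))


def pairwise_preference_loss(reward_chosen, reward_rejected):
    """-log(sigmoid(margin)): small when the preferred answer scores higher."""
    margin = reward_chosen - reward_rejected
    # Numerically stable form of log(1 + exp(-margin))
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Hypothetical ranking from a reviewer, best answer first
RANKING = ["answer_A", "answer_B", "answer_C"]
PAIRS = pairs_from_ranking(RANKING)
```

Here `PAIRS` is `[("answer_A", "answer_B"), ("answer_A", "answer_C"), ("answer_B", "answer_C")]`. A reward model trained on such pairs is pushed to score the first item of each pair above the second: the loss shrinks as the margin grows and gets large when the model prefers the rejected answer.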
Real Example: The base GPT-4 model after pretraining is remarkable but strange: it completes text, answers questions sometimes, but mostly just continues whatever you started. If you give it "The capital of France is" it says "Paris." But if you give it "What is the capital of France?" it might just ask another question, because that's a common pattern in text. SFT + RLHF is what turns "text completer" into "helpful assistant."
Karpathy's point: base models are not assistants. They are document completers trained on internet-scale text. Understanding this explains why prompt engineering works the way it does: you are essentially writing the start of a document that a text-completer wants to continue in a specific way.
Common Mistake: Thinking that RLHF makes models "smarter." It makes them more aligned with human preferences, not more knowledgeable. A highly RLHF-tuned model can be confidently wrong in a way that sounds very helpful. The knowledge is in pretraining. RLHF is shaping how that knowledge is expressed. This is also why Karpathy has been bullish on letting models think longer (chain-of-thought, reasoning models): it accesses the pretrained knowledge more reliably.
What to Do Next: Search YouTube for "State of GPT Karpathy Microsoft Build 2023." It's 43 minutes and will permanently change how you think about every LLM interaction you have. The second half, on how to use LLMs effectively, is particularly actionable for your daily Claude usage.
Karpathy has a single, consistent teaching methodology across everything he's produced: start from the simplest possible implementation, understand every line, and add complexity only when the simple version is fully understood. He refuses to use black-box libraries in educational contexts. His micrograd doesn't use NumPy. His llm.c doesn't use Python. His Zero to Hero series doesn't use HuggingFace until the very end. The philosophy: understanding the building blocks means you can understand anything built on top of them. Using abstractions before understanding them means you're always one abstraction away from confusion.
Mental Model: Imagine learning to drive in a car with all the controls labeled in a foreign language. You can follow the pattern of what instructors do and eventually drive okay. But if anything unusual happens, you're lost. Karpathy's approach is to first understand what the steering wheel physically does (how it connects to the wheels, how the wheels contact the road) before getting in the car. Slower start, much more durable understanding.
"I like to build things from scratch as much as possible. When I want to understand something I implement it myself. I don't trust myself to understand something unless I can implement it." — Andrej Karpathy
Real Example: When Karpathy built his makemore series, he didn't start with a Transformer. He started with a bigram model, the simplest possible approach: look at the previous character, predict the next one. It performs terribly. Then he asked: what would make this better? Answer: look at more context. That led to an MLP. What would make that better? Better architecture. And so on, through every major architecture in ML history, each one motivated by the failure of the previous one. By the end of the series, students understand not just what a Transformer is, but why it exists.
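The bigram starting point described above is small enough to sketch in full. This is a count-based illustration rather than code from makemore: the five-name dataset and the function names are made up, and `"."` is used here as an assumed start/end marker.

```python
import random
from collections import defaultdict

# Hypothetical toy dataset; makemore trains on a much larger list of names
NAMES = ["emma", "olivia", "ava", "mia", "ella"]


def train_bigram(words):
    """Count how often each character follows each other character."""
    counts = defaultdict(lambda: defaultdict(int))
    for word in words:
        chars = ["."] + list(word) + ["."]  # "." marks start and end
        for prev, nxt in zip(chars, chars[1:]):
            counts[prev][nxt] += 1
    return counts


def sample_name(counts, rng, max_len=20):
    """Generate a name one character at a time, looking only one step back."""
    out, prev = [], "."
    while len(out) < max_len:
        options = list(counts[prev].keys())
        weights = [counts[prev][c] for c in options]
        nxt = rng.choices(options, weights=weights)[0]
        if nxt == ".":
            break
        out.append(nxt)
        prev = nxt
    return "".join(out)


COUNTS = train_bigram(NAMES)
SAMPLE = sample_name(COUNTS, random.Random(0))
```

The model's entire "knowledge" is the table in `COUNTS`, for example that `"a"` ended a name five times. Because it sees only one character of context, its samples are name-like at best, which is exactly the weakness that motivates the next architecture in the series.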
Common Mistake: Starting with the most powerful tool (GPT-4, LangChain, HuggingFace) and trying to understand ML from the outside. You can build applications this way, but your mental model will have gaps that bite you the moment something doesn't work as expected. Karpathy's alternative: spend a few weeks on micrograd + makemore + nanoGPT, and those gaps close permanently.
What to Do Next: Commit to the full Zero to Hero playlist in order. Don't skip the micrograd video because it "seems basic." The backpropagation understanding from lecture 1 makes every subsequent lecture 3x clearer.
In February 2025, Karpathy posted about "vibe coding": a new way of building software where you describe what you want in natural language, let an AI generate the code, accept changes without reviewing every line, and let the codebase grow organically from prompts. He said: "I just see stuff, say stuff, run stuff, and copy paste stuff, and it mostly works." Collins Dictionary named "vibe coding" its Word of the Year 2025. This is directly connected to his "Software 3.0" thesis: LLMs are a new kind of operating system, and natural language is the new programming language.
Mental Model: Software 1.0 = write every instruction explicitly (C, Python, Java). Software 2.0 = define desired behavior via data, let optimization write the program (neural networks). Software 3.0 = describe desired behavior in English, let an LLM write both the logic and adapt it dynamically. The "programmer" role shifts from writing code to writing specifications: knowing what you want, evaluating whether you got it, and iterating. The bottleneck is no longer typing speed or syntax knowledge. It's clarity of thought about what you actually want.
Real Example: This is exactly what you're doing on this server every day. When you ask Claude to write a Python scraping script, generate a report, or set up n8n workflows, you are vibe coding. You specify outcomes ("find candidates with these criteria"), evaluate results ("this missed people in Hamburg"), and iterate. The distinction Karpathy is making: this isn't a shortcut; it's a genuinely new paradigm that will become the primary way software gets written.
Common Mistake: Vibe coding is not the same as "not understanding the code." Karpathy himself has one of the deepest understandings of ML systems alive. His point is that for many tasks, the bottleneck is no longer implementation; it's specification and evaluation. The people who will be best at Software 3.0 are those who understand Software 1.0 and 2.0 deeply enough to evaluate what the AI produces. Pure vibe coding without any technical understanding produces fragile, untestable systems.
What to Do Next: Search for "Karpathy Software 3.0 Sequoia", a 2025 talk where he lays out what this paradigm shift means for builders. The talk is ~45 minutes and directly relevant to what you're building on your server.
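One practical way to act on "specification and evaluation" is to write your expectations down as checkable cases before accepting generated code. The sketch below is purely illustrative: `generated_filter` is a hypothetical stand-in for a function an assistant produced, and the candidates and spec cases are invented. It deliberately contains the kind of bug mentioned above (it misses people in Hamburg) so the check has something to catch.

```python
# Hypothetical stand-in for code an assistant generated from the prompt
# "find candidates in Berlin or Hamburg with Python skills".
def generated_filter(candidates):
    return [c for c in candidates
            if c["city"] == "Berlin" and "python" in c["skills"]]


# The specification, written as (candidate, should_match) cases
SPEC_CASES = [
    ({"name": "Ana", "city": "Berlin", "skills": ["python"]}, True),
    ({"name": "Bo", "city": "Hamburg", "skills": ["python"]}, True),
    ({"name": "Cy", "city": "Hamburg", "skills": ["java"]}, False),
    ({"name": "Di", "city": "Munich", "skills": ["python"]}, False),
]


def evaluate(filter_fn, cases):
    """Return the names of candidates where the code disagrees with the spec."""
    failures = []
    for candidate, should_match in cases:
        matched = bool(filter_fn([candidate]))
        if matched != should_match:
            failures.append(candidate["name"])
    return failures


FAILURES = evaluate(generated_filter, SPEC_CASES)
```

`FAILURES` comes back as `["Bo"]`: the generated code dropped the Hamburg candidate. You didn't need to read the implementation to find that out, but you did need to know precisely what you wanted, which is the skill the paragraph above is pointing at.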
| Repo | Stars | What it teaches | Who it's for |
|---|---|---|---|
| micrograd | ~16k | Backpropagation in 150 lines of Python. No dependencies. See exactly how neural networks train. | Everyone. Start here. |
| nn-zero-to-hero | ~23k | Full curriculum: micrograd → makemore → nanoGPT → tokenizer. 8 lectures, each with video + Colab notebook. | Anyone who wants structured ML learning. |
| makemore | ~4k | Same task (name generation), 5 increasingly powerful architectures: bigram → MLP → CNN → RNN → Transformer. Teaches why architectures evolved. | Beginner–Intermediate. |
| minGPT | ~25k | Clean 300-line GPT-1/2 implementation. Archived but still the clearest pedagogical transformer code. | After makemore. |
| nanoGPT | ~60k | Can reproduce GPT-2 results. Multi-GPU, tiktoken, compile. Deprecated in Nov 2025 — use nanochat instead. | Intermediate. Best for 2022-2024 content. |
| nanochat | ~55k | Full LLM pipeline: pretraining + SFT + RLHF + chat UI, all on one GPU. GPT-2 capability for ~$48. | Intermediate–Advanced. The current flagship. |
| llama2.c | ~20k | Llama 2 inference in ~700 lines of pure C. Int8 quantization. No Python, no dependencies. | Intermediate. Bridge to systems-level understanding. |
| llm.c | ~30k | GPT-2/3 training in pure C/CUDA. Manual backprop, CUDA kernels. Understand what PyTorch is hiding. | Advanced. Deep systems understanding. |
| autoresearch | ~88k | AI agent that runs ML research overnight autonomously. Modifies code, trains, evaluates, iterates ~100x. | Advanced. Frontier research automation. |
Recommended path for Felix: micrograd → Zero to Hero lectures (in order) → nanochat README. You don't need to run the code to benefit — reading and understanding is 80% of the value.
Karpathy doesn't trust explanations. He trusts implementations. If someone says "here's how attention works" with a diagram, he builds it in code and runs it. When it works, he understands it. When it doesn't, he finds out where his understanding was wrong. This is not just his teaching style — it's his epistemology.
He has a consistent instinct to find the simplest possible thing that still works, and be surprised at how far it goes. micrograd is 150 lines. llm.c trains GPT-2 in C. microgpt (2026) puts a full LLM — tokenizer, architecture, training, inference — in 200 lines of Python with zero dependencies. Each of these was a deliberate exercise in compression: "how small can I make this and still have it work?"
In the first lecture of Zero to Hero, he spends 45 minutes on a single derivative calculation. He doesn't rush. He knows that the people who skip this to "get to the good stuff" will have gaps in their understanding that cost them later. His most famous teaching move: slow down exactly where most people would speed up.
He doesn't over-explain or over-soften. He shows you the code, walks through it, and trusts you to follow. His lectures have almost no graphics or animations — just a Jupyter notebook and his commentary. This reflects a belief that the material itself, explained clearly, is enough. The person who can follow that is the person he's teaching.
Despite coining "vibe coding," Karpathy is not utopian about AI replacing human judgment. His autoresearch project, for instance, automates ML experiments — but a human still decides what to experiment on, evaluates the results, and judges what matters. His view: AI dramatically amplifies what a single person can accomplish, but the human in the loop still determines quality and direction.
"Neural networks are not magic. They are just a mathematical function. The magic is in what happens when you train them on enough data." — Karpathy, Zero to Hero lecture 1
"The most important thing I've learned about learning is that the best way to learn something is to implement it from scratch. Reading is not enough. Watching is not enough. Building is the test." — Karpathy (paraphrased from multiple talks)
"Software 2.0 is eating the world. The question is not whether neural networks will take over a task — it's when." — Karpathy, Software 2.0 essay, 2017
"I just see stuff, say stuff, run stuff, and it mostly works. I'm not even sure what the code does half the time, and it doesn't matter." — Karpathy, coining "vibe coding," February 2025
"LLMs are a new kind of operating system. English is the new programming language. Prompts are programs." — Karpathy, Software 3.0 framing, 2025
"Base models are not assistants. Their objective is to complete documents." — Karpathy, State of GPT, Microsoft Build 2023
"You shouldn't judge the power of a model just by the number of parameters it contains." — Karpathy, State of GPT, on LLaMA vs GPT-3
| Resource | Type | Why It's Worth Your Time |
|---|---|---|
| The Unreasonable Effectiveness of RNNs | Blog post (2015) | His most famous pre-LLM writing. Shows how sequence models can generate text, code, and more. Still accurate in its core intuitions. |
| Software 2.0 (Medium, 2017) | Essay | The paradigm-shifting essay. 1,500 words. Read before anything else. Search "Karpathy Software 2.0 Medium." |
| karpathy/micrograd | GitHub + Video | Start here. 150 lines. 2.5h video. Will change how you think about AI training forever. |
| Neural Networks: Zero to Hero | YouTube Series + GitHub | The complete curriculum. 8 lectures. Follow in order. Nothing else required. |
| State of GPT (YouTube, May 2023) | Talk (43 min) | Best 43-minute explanation of how LLMs work and how to use them. Search "Karpathy State of GPT Microsoft Build." |
| karpathy/nanochat | GitHub | His current flagship. Full LLM pipeline on one GPU. The README alone teaches the complete training process. |
| karpathy/llm.c | GitHub | LLM training in pure C. For understanding what happens below Python. Read the README even if you never run the code. |
| Software 3.0 / Sequoia Talk (2025) | Talk (~45 min) | His vision for where AI is taking software development. Directly relevant to your server and automation work. |
| Wikipedia: Andrej Karpathy | Reference | Good biographical overview. Links to all major papers and talks. |