Video: Agent Quality & Token Optimization (AMER Friendly) | Duration: 3448s | Summary: Agent Quality & Token Optimization (AMER Friendly) | Chapters: Welcome and Introduction (10.8s), Workshop Motivation (65.23s), Agent Quality Focus (178.84s), Agent Quality Value (313.76s), Quality Over Tokens (404.835s), Context Balance Fundamentals (553.985s), Agent Context Management (702.85s), Token Basics (870.39s), Context Window Problems (998.915s), Optimization Strategy (1159.22s), Model Selection Strategy (1278.08s), Context Engineering (1425.715s), Prompt Optimization Strategies (1536.33s), Deterministic Testing Controls (1704.54s), Agent Configurations (2003.655s), Custom Agents (2274.115s), Skills Management (2434.41s), MCP Server Optimization (2559.015s), Sub Agents & Configs (2741.865s), Power User Optimization (2953.315s), Long-Term Guidance (3224.58s), Wrap-Up & Tips (3344.025s)
Transcript for "Agent Quality & Token Optimization (AMER Friendly)":
Alright. Well, hello, everyone. Great to see you all on here, and I wanna welcome each of you to this presentation on agent quality and token optimization. My name is John Casanova, and I've been at GitHub for the past five years as a customer success architect. I'm out of the Dallas Fort Worth area, and I work primarily with our strategic customers across many industries to help them get the most out of our platform and, of course, Copilot. And I'm also joined by a small army of my colleagues who are here to help answer questions in the chat, and we have a lot of great content to cover for you all today. So, question wise, I won't be fielding questions during the session, but if you can use the Q&A portion to put your questions in there, then my colleagues will help answer those questions as they come in. Alright. The motivation for this workshop is, of course, GitHub's switch from the premium request usage model to usage based billing. And if you're joining us today, then you're probably well aware of that upcoming change, which has naturally raised a lot of questions around token consumption. And our customers are, of course, asking, how do we improve our token spend? But we feel focusing purely on cost can actually diminish the value that you get out of Copilot. And so we think the better question to ask is, how can we make the most out of the tokens that we're spending? And that actually comes with token optimization as we'll learn today. The presentation that you'll be watching today has three parts. In the first part, we'll cover why agent quality is the better focus. We'll explore why optimizing your agents for quality beats optimizing for cost alone and how quality improvements naturally reduce token spend. And then in the second part, we'll cover foundational concepts about large language models, agents, and context windows. 
And we wanna make sure you have a good grasp of the fundamentals because having this foundation will not only help you gain a deep understanding of token optimizations and agent quality, but also help you make sense out of the barrage of optimization tips you're likely to run across online. And then you can form your own opinion about what works and what doesn't. And then in the third part, we'll discuss quality and token controls. We'll take a look at the practical controls you can employ, so things like model choice, prompts, agent configs, all with actionable optimization tips. So the overarching goal here is to help you all learn how to work better with less. So we're gonna start today with why agent quality is the better focus. And to understand this, we wanna look at how the industry is operating today and how each of us is probably acting. So we're using agents in what we like to call a gambling system. So looking at the graphic on the slide here, imagine that unmanned rockets were super cheap. NASA could fire off 20 of these things in the general direction of the moon, and if one of them landed, awesome. If none of them landed, no big deal, and they just send off the next 20 and keep hoping for the best. And that maps precisely to how we're seeing agents used today. Often, we're providing only a little bit of context, and this is akin to launching a rocket without providing the exact coordinates of the moon. And then there's our prompt that we're frequently putting too little effort into. We're sending off a lazy prompt where you know you could have taken some more time to do better, but you may have been multitasking and just fired one off. And this is like scribbling a vague flight plan instead of calculating a precise trajectory and sequence of maneuvers for the rocket to execute on its way to the moon. And then you send the agent on its way, like firing off the rocket and then, you know, hoping for the best. 
And if it comes back with a good result, fantastic. And if it doesn't, you just send off the next agent. And so this is like waiting to see if the rocket lands, and then if not, then simply launching another one without really learning from the last attempt. And the problem with this type of gambling approach is that it's no longer sustainable, and it never really was in the first place. It was sustainable when you're only firing off a handful of agents a day. But given the direction we're moving, where dozens, if not hundreds, of agents get dispatched by every single developer every day, those agent sessions also become longer and longer. So this isn't sustainable from a financial point of view for us at GitHub because we've been bearing that cost the whole time. And now that we're switching to a usage based model, it's no longer sustainable for you as a customer. So it's definitely time for change on this front. But if you look at this purely from a cost angle, you're asking yourself, how can we make the fuel cheaper? And that's really just asking, how can we continue to gamble? And this isn't the right thing to do. And I also don't think it's actually what you want to do anyways. So instead, we should work on decreasing the number of rockets we're sending out with a better chance that they'll actually hit their target. So in other words, we need to increase the value of each and every agent. We need to increase the quality of the agents before we send them out so that more of them are likely to reach their targets. So overall, we need fewer agents, and fewer agents automatically means fewer tokens spent. We need to start working toward a better return on investment on our agents. Increasing the value of an agent means that we have to increase its quality. Remember, we wanna make the rocket land. Optimizing costs when the value is zero is pointless work, as we can clearly see when looking at the ROI formula. 
This formula is the standard ROI formula that's used in agentic development, and there's one catch to it, which is that you can't really calculate it cleanly because how exactly do you quantify the value of an agent's output? A lot of companies struggle with this, and frankly, we don't have the perfect recommendation either. But just because you can't calculate the formula, it doesn't mean it isn't useful. You can look at it as guidance. There's another dynamic at play that makes this argument even stronger. So in many cases, we can see that increasing the value of an agent is achieved by actually decreasing the amount of tokens. And what do I mean by that? Well, if you send off an agent with a lot of irrelevant context, the quality of the output is going to be low. The agent is going to be confused. It's not going to know what to prioritize. It's not gonna know what's relevant and what's not relevant, and so it's more likely to produce low quality output. And not only that, but the conversations compound with much of that useless information that you're paying for with every turn during the agent session. So reducing this context is a huge first lever for both quality and lower token costs. But this isn't the only reason that quality plays a huge role in our agent ROI. The third reason why focusing on quality is more important can be explained by looking at the compounding error problem in multi agent, multi step agentic workflows. So large language models are nondeterministic, as we all know. They've got an error margin and won't ever be 100% accurate. Apologies if that information may have shocked you. In multi step agentic workflows, this creates compounding errors. So a quick look at the math. At 99% accuracy per step, which obviously is optimistic, 50 steps reduces your overall accuracy to just 61%. And drop to 95% per step, which is still pretty good, right? And you end up at 8% accuracy over 50 steps. 
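The compounding figures quoted here can be reproduced with a few lines of Python. This is a minimal sketch that assumes every step succeeds or fails independently with the same probability, which is a simplification; the function name `workflow_success_probability` is made up for illustration and is not part of any Copilot tooling.

```python
def workflow_success_probability(per_step_accuracy: float, steps: int) -> float:
    """Chance that every step of a workflow succeeds.

    Assumes the steps are independent and equally accurate, so the
    per-step probabilities simply multiply.
    """
    return per_step_accuracy ** steps


if __name__ == "__main__":
    # The two per-step accuracies mentioned in the talk, over 50 steps.
    for accuracy in (0.99, 0.95):
        overall = workflow_success_probability(accuracy, 50)
        print(f"{accuracy:.0%} per step over 50 steps -> {overall:.0%} overall")
```

Under that independence assumption, 0.99 to the power of 50 is about 0.61 and 0.95 to the power of 50 is about 0.08, which matches the 61% and 8% on the slide.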
Now this doesn't mean that every agent will fail, but it means every quality improvement significantly increases your chance of success. And consider the cost. Every agent miss effectively sends those tokens off into the sun. And then you've got bug fixes, which is more tokens, reviews, and additional agent runs. You might even have incidents caused by low quality agent output. And in traditional software development, we'd handle that with the shift left movement where we move quality testing and security checks earlier on in the development process. And all of that becomes even more true in agentic systems. So to sum things up, instead of counting tokens, focus more on making every one of those tokens count. So, yes, reduce token size, but not driven by cost and instead by quality. You wanna send fewer rockets with higher accuracy, which will automatically optimize the fuel powering those rockets. So in order to increase the quality of agents, we're gonna go back to the fundamentals to understand a bit about how all of this works. So the LLMs, the agents, the context window, and how they all tie together. So a quick high level refresher. An LLM is ultimately just a text in, text out machine. It's a word probability machine. So given the input and the model's training data, it predicts the most probable sequence of words until you have a sentence. And in coding, you have the same mechanics at play. It's just predicting the next instruction or statement as opposed to prose. And, of course, models have gotten better and better over the course of time. Some bells and whistles have been added for more compute, some higher accuracy, biases towards software development, et cetera. But the underlying principles and capabilities are still very much this. So when mapped to response quality, the core principle is context balance. You wanna provide as little context as possible, but as much as required. 
And finding that sweet spot maximizes the LLM's effectiveness. So if you give too much context, the irrelevant information biases the model toward incorrect answers. It can't distinguish relevant from irrelevant data and considers everything when calculating its responses. And if you give too little context, the model lacks the necessary information to generate accurate responses. It may make incorrect assumptions or hallucinate details to fill in those gaps, and this, of course, leads to low quality output. And even worse, there's no error message that gives you an indication that this is what happened. Now the math makes no distinction between hallucination and fact, and this is why context engineering is the fundamental skill for working with agents. So with agents, the previous rule matters even more because it's no longer a single back and forth interaction. The agent talks to the LLM on your behalf dozens of times before it comes back to you with its response. So an agent is just application code. You'll often hear it referred to as a harness, and we offer multiple different harnesses with Copilot. So you've got VS Code chat, you know, where it all started, with an inline command, code completion. And then came the Copilot CLI and the Copilot cloud agent, formerly known as the Copilot coding agent. And even third party harnesses you may have heard of, like Claude Code and Codex. And the LLM itself is, you know, just the model. So your GPT, your Claude models, your Gemini models. Now what we need to understand here is that the way that an agent interacts with it is not magic. It's still just text. And most importantly, it's stateless text. So an LLM does not store conversations. And what is a conversation? A conversation really just means resending the entirety of inputs and outputs in order every single turn, and there's your conversation. 
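To make the stateless resend concrete, here is a rough Python sketch of how input tokens pile up when the whole history goes back to the model on every turn. The token counts are invented for illustration, the function name `input_tokens_per_turn` is hypothetical, and the sketch ignores caching and compaction, both of which change the real numbers.

```python
def input_tokens_per_turn(turns: list[tuple[int, int]]) -> list[int]:
    """Input tokens sent to the model on each turn of a stateless conversation.

    `turns` holds (new_input_tokens, output_tokens) pairs. Because the model
    keeps no state, each turn resends everything that came before it.
    """
    sent_per_turn = []
    history = 0
    for new_input, output in turns:
        history += new_input           # the new prompt or tool result joins the history
        sent_per_turn.append(history)  # the entire history goes out as input
        history += output              # the model's reply becomes history for the next turn
    return sent_per_turn


if __name__ == "__main__":
    # Hypothetical session: a 1,000 token first request, then two short follow-ups.
    session = [(1000, 200), (100, 300), (100, 250)]
    per_turn = input_tokens_per_turn(session)
    print(per_turn)       # input tokens sent on each turn
    print(sum(per_turn))  # total input tokens over the session
```

In this made-up session only 1,200 tokens of new input were ever written, but 4,000 input tokens were sent in total, because the earlier turns ride along every time.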
Tokens and context compound as your session progresses and spans, you know, multiple turns. So as you can see, the harness itself already has a significant role in agent quality, but you still have ways that you can influence it. And your levers to influence the agent include your prompt, the files in your prompt, and agent configs such as custom instructions, custom agents, skills, and MCP servers. All of these things are part of the context that gets sent to the model, and they all influence the model's output. So you have a lot of power to steer the agent in the right direction, but you also have power to steer it in the wrong direction if you're not careful with your context. So to better understand token consumption, let's first look under the hood at context windows. On every loop, the agent sends the entire conversation to the LLM again. So on the initial loop, you have your system prompt, your tools, your prompt itself, and any file references you included in the prompt. All of that gets sent to the model as input tokens, and then the model responds back with output tokens. And then on the second loop, all of loop one plus previous responses and any new inputs, they all get sent again. So the tokens compound with every single loop. So we'll cover a little around token basics as well. So as a rough mental model, one token is equivalent to about three fourths of an English word, though that varies by language, punctuation, formatting, and whether you're looking at prose or code. Common words may be a single token while punctuation, symbols, code syntax, and longer or less common words may be split into multiple tokens. And in practice, providers usually talk about tokens in three billing categories. You have input tokens, output tokens, and cached tokens. Input tokens are the prompt, the instructions, the tool definitions, file content, and conversation history sent to the model. 
They're usually cheaper than output tokens, but they still cost money and consume space in your context window. Output tokens are the tokens that the model generates in its response. They're usually the most expensive. They're generated sequentially. And then cached tokens are the tokens the provider can reuse from prior context instead of fully reprocessing them. They're typically the cheapest, but caching behavior and pricing will vary by provider and model. And then context windows vary a lot by model as well. You know, they range from tens of thousands of tokens to hundreds of thousands, with some models today supporting, you know, 1,000,000. And just to make that number feel a bit more tangible, 1,000,000 tokens is roughly equivalent to several very large books worth of text, so in the ballpark of the Lord of the Rings trilogy plus the Hobbit, that would fit into a 1,000,000 token context window. But let's not get overly caught up in the tokenization details as you really don't have that much control over it anyways. Think of it at a more basic level. So your prompts, files, and responses, they all consume tokens, and they compound with each loop. Alright. Context rot. So before diving into controls, let's understand two major problems with context windows. The first is the lost in the middle scenario in which less than 50% of the context window is taken up. So in this situation, the model can tend to favor content at the beginning and end of the context window as opposed to the middle. And this is usually fine because the beginning contains your instructions, your goals defined by the prompt, and perhaps an implementation plan that you started with. So you want this prioritized. At the end of the context window is the current work stream, which is also really important. 
And then you have the middle of the window, which contains the past work, which isn't quite as relevant as the beginning and the end, but still contains important information the model needs to be aware of. And the problem hits when you switch tasks mid session. For example, you start with a bug fix, and then mid conversation, you say, okay, now let's implement this cool feature. As the window grows, the model may suddenly switch back to the bug fix because it biases toward the initial statement over the recent ones. So the solution here is to use a new context window for each distinct task. And then there's recency bias, which comes into play with over 50% of the context window being consumed. Above 50% capacity, the model starts favoring only the end of the conversation. It can tend to forget your system instructions, your custom instructions, and even your original prompt. So the model drifts and starts doing things you don't understand based solely on recent context. And so the solution here is to try to avoid letting your context window, your token window, grow beyond 60 to 70% unless it's absolutely necessary. Dividing and conquering tasks from the get go is the best way to counter this problem, followed by potentially compacting the conversation, which is a feature present in VS Code and the Copilot CLI, which we'll talk more about a bit later. Lastly, don't take this as something that will always happen. We don't wanna talk in absolutes here. Recency bias doesn't mean the model will forget everything in the middle. There are, of course, scenarios where you'll need to go above 50% of the token window and hit sixty, seventy, 80%, and that's fine. It's just something to keep in mind that you can use to optimize your usage. So hopefully, this has given you a good foundational understanding of tokens, context windows, large language models, and agents, and some initial ideas on how to optimize your usage. 
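As a way to turn those percentages into a habit, here is a small Python sketch of a context utilization check. The 50% and 70% thresholds come from the rule of thumb in the talk and are tendencies rather than hard limits; the function name `context_advice` and its return labels are invented for this example and do not correspond to any Copilot feature.

```python
def context_advice(tokens_used: int, window_size: int,
                   soft_limit: float = 0.5, hard_limit: float = 0.7) -> str:
    """Suggest what to do with a session based on how full its context window is.

    Below `soft_limit` the session is left alone. Between the two limits,
    recency bias becomes more likely, so the session is worth watching.
    At or above `hard_limit`, splitting the task or compacting is suggested.
    """
    if window_size <= 0:
        raise ValueError("window_size must be positive")
    utilization = tokens_used / window_size
    if utilization < soft_limit:
        return "ok"
    if utilization < hard_limit:
        return "watch"
    return "split-or-compact"


if __name__ == "__main__":
    # Hypothetical 200,000 token window at three different fill levels.
    for used in (40_000, 120_000, 160_000):
        print(used, context_advice(used, 200_000))
```

The limits are parameters on purpose, since the talk is explicit that going to 70 or 80% is sometimes fine when the task needs it.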
So with this knowledge under our belts, now let's look at the practical guidance and things that you can do starting today to improve the quality of your agents and your tokens. Before we dive in, one thing to understand here is that depending on your level of maturity in working with agents, you should either take more or less care about optimizations. So if you're on the left side of the spectrum and only sending off a handful of agents a day, you tend to work synchronously and you treat AI as an assistant rather than an autonomous part of your team. These optimizations aren't gonna make a huge impact. So if you only spend $20 a month on a few tokens, even saving 50% through every optimization in this deck will only get you to about $10 of savings. So probably not worth a huge effort on your part. But compare that to what we might call an AI engineer over on the far right of the spectrum. So someone who typically orchestrates multiple asynchronous agents, and dozens, if not hundreds, of them every day. So every percent of token usage that you can shave off and every percent of quality that you can add, and if you think back to the compounding error problem we looked at earlier, that's effort well spent across the accumulation of your agents. So the optimizations that we're going to look at next are ordered by lever and by the effect that they have. So if you're on the left side of the maturity spectrum, the first few of these are more relevant than the later ones. And I'm also gonna share some power user type tips at the end. The common pattern that we see is that the majority of us tend to use the biggest reasoning model for everything, including typo fixes. I've seen countless customer premium request usage reports, and the waste is pretty significant. So imagine Opus four seven comes back from its long day spent in the harnesses and walks into its house and immediately begins venting to its partner. Alright. 
"Well, they did it again to me today, all day with the typo fixes and readme.md updates. I wrote 758 hello world apps before lunch. I'm literally a Frontier model, and this is what they've got me doing." The cost difference between GPT five four mini and Claude Opus four seven is a staggering 24 x. Model choice drastically impacts both token cost and quality, and bigger doesn't always mean better. So when do you wanna use reasoning models like Opus? So whenever you're doing planning and architecture tasks for one, debugging complex bugs, synchronous work where you drive the agent, and tasks that you know are going to require large context windows. So lots of files probably being dumped into that context window. When to use smaller models. You typically wanna use these for implementation after planning is done. So the heavy lifting of the reasoning already happened in the planning phase, and now you just need execution of that plan. And in fact, using the beefier reasoning models for implementation can actually hurt quality. So even if you have a tight spec, a reasoning model might reopen the plan, second guess it, and go rogue. So we have some help for this on the way. Starting in June, along with the switch to usage based billing, the auto model picker will be improved from its current capacity based model selection to a more mature task based selection to choose the model automatically. And this task based selection will only get better over time. And if you don't wanna go the auto model picker route, just try to become more intentional about your model choice. Try to avoid defaulting to the biggest one for every task that you're doing. Lever number two, provide only relevant context. So try to avoid loading up your prompts with might need information. Let the agent discover what it needs. You know, it can find files on its own. 
Don't attach your entire project for a small menial change. Context engineering is the core skill you wanna build and continue to refine. All the tips that are going to follow here are context engineering techniques. As I mentioned earlier, compacting. And compacting is the process of summarizing an agent's past conversation history, the tool calls, and the reasoning steps. So it extends an agent's effective context window by replacing older interactions with concise summaries of them. And that allows the model to handle long running tasks without exhausting token limits or degrading performance. And you should also be somewhat careful with this because compacting does come with a non zero amount of information loss due to the fact that you're summarizing information. So if lost information was relevant, you then have the potential for creating agent misses, and then token savings become quality loss. So I'd say the best approach here is to use slash clear regularly. Start fresh for each new task when the context window gets too packed. So you wanna try to avoid steering a bloated session with 80% of the context window filled because recency bias will become your mortal enemy in that situation. Tokens don't accumulate across sessions, so don't be afraid to throw away context without hesitation. Alright. Your prompt. Don't optimize prompts for fewer tokens. You wanna optimize your prompt to steer the agent correctly from the start. So prompts, system prompts, and tools are always at the beginning of the context window, which gives them outsized influence. You'll recall our lost in the middle bias in which the model prioritizes the beginning of the context window. So your prompt has a lot of power to steer the agent, but it also has a lot of power to steer it in the wrong direction if it's not clear and precise. So be precise. 
Instead of prompting it to fix the bug, try instead saying issue number 45 describes a bug where x happened. Fix it. Include stop signals in your prompt. So once the bug is fixed and tests pass, stop. So this keeps the agent from continuing with unnecessary work like git commit, push, linting files, et cetera. Don't spend money having the agent do your git commits and pushes for you because you're paying for those tokens. You know, just do it yourself and save the tokens for the work that you actually want the agent to do. So define a clear endpoint for the agent to stop at so that it doesn't continue doing work that you don't want it to do. Add known context beforehand. So if you know where files or other context reside, help the agent out and provide it as part of your prompt. The same is true for documentation websites that you want the agent to fetch, skills to invoke. Whatever you can put there from the start will improve your experience and reduce the cycles and tokens, you know, for the same outcome. Anytime I think I'm having to point the model to the Internet to look for guidance on something, I always try to find some sort of specific link like a GitHub docs page, whatever, to ensure it immediately has the right information instead of having to search for it and potentially finding the wrong information. As an anecdote on this topic, even small prompt habits add up at scale. Sam Altman has said that polite filler like please and thank you costs major AI providers tens of millions of dollars in cumulative processing and energy spend, which is a useful reminder that prompt brevity matters when you multiply it across massive usage. So there's no need to be polite to these models. You know, Copilot knows you're a good person. It's not thinking about those spicy PR comments you left earlier in the week. It has no memory. You know? Completely forgot that. Alright. Work in phases. 
So, research, which you can invoke with the slash research command in the Copilot CLI, loads many files, and most won't be relevant for implementation. And if you do all three of these phases in one session, you carry irrelevant context through all turns. And this degrades quality and wastes tokens, you know, even the cached ones. So a better approach to this is to create new context windows between phases. There's surely gonna be some duplication involved in this, but in exchange, you'll get improved quality and token efficiency. For planning, use good models. It doesn't always have to be Opus four seven, but use something with decent reasoning capability. And use planning mode for complex features. You know, this is where you wanna run with the heavier reasoning models because they're particularly adept at viewing plans, you know, from every angle and identifying gaps. So the goal should be to create a precise specification that covers all of the thinking upfront. And then there is parallel agent implementation. So with a clear spec, you may have the option to deploy multiple agents or sub agents in parallel. These can be split up by architecture layer, so front end, back end, and database, and they can help define contracts between components. So each agent works efficiently with only the relevant context that it needs. A good example of this is the slash fleet command in the Copilot CLI, which orchestrates multiple sub agents and allows you to track them individually. This approach saves both time and tokens because the agents aren't carrying unnecessary knowledge for their specific tasks. Okay. While not strictly context optimization, deterministic controls like tests are essential context engineering tools that will help counter nondeterministic LLM behavior. 
And so what this means is that you should write tests, then write even more tests, and then make sure Copilot knows to execute these tests anytime it's making code changes, like through your copilot-instructions.md file or through your prompt. As an anecdote here, the Copilot CLI team ships 500 PRs a week. Their number one context engineering practice is tests, and over half of their code base is just tests. And why? Because a test is a deterministic control. It either fails or it doesn't. And the agent will execute the deterministic control, and it counters the compounding error problem effectively. So if after 10 steps, you've landed at 50% accuracy, the test will fail and bring the agent back on track to 99. So you basically restart the accuracy by having tests. And it's not just tests, of course. You've also got linters, you've got security scanners, and any other guardrails that you can employ. Whatever deterministic controls you can throw at the agent and have it execute, that is a great way to approach this. So let's visualize it in the context window. So looking at the slide here, this is what we'll see with a unit test in place. So picture the scenario in which the agent happened to introduce a buggy change. A failing test immediately signals to the agent, hey, you know, stop. You can't continue, and we have a problem here. So the agent then corrects the change and then builds on top of a stable working base. And then it continues on making the rest of the changes until all the tests succeed and the agent is done. Whereas, if you don't have tests, the agent will build a buggy change on top of a buggy change on top of a buggy change. It might get done quicker, maybe a little earlier, and with fewer tokens. But what you risk having without tests is a bug or even an incident. 
And the costs add up in the form of wasted CI/CD minutes, Copilot review cycles, and agent runs spent unnecessarily, human time required to fix these bugs, and a debugging session that fills the next context window. The accumulation of tokens is much higher, and the cost is much higher than if you just had tests in the first place. So spend some time on testing, shifting left. And that's what we've done in this industry for many years now to improve our outputs and improve the value. And it's just that much more applicable now for agents than it was before. Alright. Agent configs. So when we talk about context engineering, often it's basically one to one with agent configurations. Agent configurations are all those markdown files and controls that you can put in place that agents will take into account automatically. So persistent instructions, custom agents, skills, MCP servers, sub agents, scoped instructions. We're gonna take a look at some of these in more detail and how you can use them to improve quality and tokens. Alright. Persistent instructions. So these are your Copilot custom instructions files or your AGENTS.md files that live in the .github directory in the repository, or you can have them in your user space so that they apply across all repositories that you're working in. Persistent instructions are in the context window for every agent session and every interaction. The contents of these files are sent up to the model with every single prompt that you send. So how do we look at approaching these files? What are some of the requirements? Keep them concise and small. You know, don't take entire documentation or human readable guides and just dump them in there. So you wanna think of it as your human in the loop proactive guidance for every agent. Put your nonnegotiables in there, your project guardrails that every agent should be following. 
Anytime you're encountering an error that an agent makes, you know, during a session, you can correct those recurring errors, like when the wrong testing framework is used or the wrong build command was used. Add those to your instructions file so that you don't have them occur next time. Along with that, there's output trimming. So things like be concise, drop niceties, only return code. This also goes back to the notion of don't let the agent do your git commits and handle those kind of easier tasks that you can do yourself without having to spend tokens. You know, output tokens are the most expensive, so trimming them matters a whole lot. Don't use AI to generate your instructions. So this is your chance as a human to guide the agents and fill gaps that AI can't know. So AI generated instructions, they tend to be verbose and sometimes imprecise, which costs tokens. Some of this is a bit counterintuitive to the easy methods that we've provided in VS Code to generate these instructions, along with the slash init command in the Copilot CLI. But in the token optimization mindset, it's actually better to write them yourself. You know your project best, and you can be much more precise and concise than an LLM can be in this case. So fill this file with domain specific knowledge that only you have and that the model can't infer on its own. And that's not to say never use those tools in the IDE or in the CLI to help you, you know, get these files going. It's fine to use them as kind of a starting point, especially if you're new to persistent instructions. But just keep in mind, generally, you want to be very intentional about what you put in there, and, you know, these are nuances that you are familiar with about working in your code base. That's generally what you wanna keep in there. And along with that, you know, these files don't need to be perfect right from the start. 
Revisit them regularly and make small adjustments as you learn from your agents. So if you see a recurring error, add a line to prevent it. And if you see an agent doing something unnecessary, then add a line to trim that behavior out. And then recreate them every few months. Models change and so does your project. As an example, the Copilot CLI team throws away their entire instructions files every three months because they might be outdated, no longer relevant, no longer contain the required information, or compound useless information. Treat them as living, breathing documents. Custom agents. Alright. Custom agents are a way to force an agent to adopt a specific persona or way of working. Their best use is something manually invoked by you as a human when you wanna orchestrate an agent workflow and have an agent behave in a very specific way. And this is done by either switching to that custom agent in the agent dropdown of VS Code or the cloud agent, or by invoking it directly via a slash command. In the example here on the slide, we have a test-driven development agent, one that's very specifically scoped to only implement red, failing tests. This is something an agent wouldn't do on its own and would require a lot of prompting to get done. So a custom agent is a nice way to write that prompt once and then reuse it again and again. As I mentioned earlier, you usually invoke them manually. They can also be invoked automatically by the harnesses, but we'll leave that part out for now. For our mental model, you invoke them with a slash command. So in this case, you might say, hey, add an API endpoint and implement the test first. The harness will then retrieve that custom agent file and adjust the available tools accordingly. And that's another neat thing that you can do with these custom agents. You can adjust which tools the agent has access to.
So be intentional about which tools you define in the tools section of a custom agent's YAML front matter. That's this part at the top here where the tools definition block is. The tools take up space in the context window, and if you don't need them, don't give them to an agent. This alone will reduce token consumption a little bit, though it's probably not the most relevant lever here. Input tokens will get cached. And even though tools can make up a large portion of the system and tools prompt, they're usually not a big lever when we talk about token optimization. The real benefit here is preventing agent misses. What it will do is prevent your agent from going down a path that you didn't intend for it to go down in the first place. So, for example, if you only want it to read an issue in GitHub to get the specification, not write or update that issue, you prevent the agent from going down that path by simply not giving it access to the tool to do so. Skills. Alright. Skills are very close to custom agents, but they're not quite the same. Skills allow you to have a markdown description that makes your agent behave in a very specific way, just like a custom agent. But the key difference here is that a skill is something you offer to your agent based on the task that it's doing, and it can be loaded dynamically. So it's not always-on context. Part of it is that the harness pulls the skill description and puts it into the context window. So same as with tools, the harness is now offering skills to the LLM. When the LLM detects a task matching a skill, for example, work on the API, it tells the harness to load that skill file along with any reference files or scripts that are also provided by the skill. And this is a great way to offload context that isn't always relevant. You only pull it in when it's needed for the task at hand. So some best practices around skills: don't overdo it. You don't need hundreds of skills.
Be aware of the fact that the skills tool description has to go into the context window. So every single skill that you add adds tokens to the context window. Also, be wary of redundant skills. For example, the LLMs are highly proficient at React development. So a skill that describes how to do React development might actually not add very much value, but it will add a lot of tokens. So be mindful of the fact that skills can be a double-edged sword. Lastly, maintain them just like everything else we've just covered, the persistent instructions and the custom agent files. As the LLMs evolve, some skills become unnecessary. So regularly review and update your skills to ensure they remain relevant and efficient. MCP servers. So MCP, short for Model Context Protocol. MCP grants us the ability to pull in information from external sources and use it as context. MCP servers provide dynamic tools to the agents, and once they're activated, they return tool descriptions that also go into the context window. As an example, the GitHub MCP server offers a get issue tool. So when the user sends off a prompt like read issue number 45, the LLM recognizes the tool provided by the MCP server based off that description, and it invokes it via the harness. So be intentional with your MCP servers. They've got the potential to bloat tool descriptions in the context window, which leads to token waste. If you have an MCP server that offers 20 tools but your agent only ever needs to use two of them, you're paying for the tokens of those other 18 tools in every single turn. And more importantly, they can lead agents to call undesirable tools. So, for example, if your GitHub MCP server configuration offers both get issue and update issue, but you only want the agent to read issues and not update them, the agent might still call update issue because it's in the context window.
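To put rough numbers on the 20-tools-versus-two example, here is a minimal Python sketch. Everything in it is an assumption for illustration: the `Tool` class, the tool names, and the figure of 150 description tokens per tool are hypothetical, and it ignores prompt caching, which the talk notes will soften the cost of input tokens.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tool:
    name: str
    description_tokens: int  # hypothetical size of the tool description


def allow_only(tools, allowed_names):
    """Keep only the tools the agent is meant to use."""
    return [tool for tool in tools if tool.name in allowed_names]


def description_tokens_per_turn(tools):
    """Tokens the tool descriptions occupy in the context window each turn."""
    return sum(tool.description_tokens for tool in tools)


# Hypothetical server: 20 tools, each assumed to cost 150 description tokens.
server_tools = [Tool("get_issue", 150), Tool("update_issue", 150)]
server_tools += [Tool(f"other_tool_{i}", 150) for i in range(18)]

# Read-only setup: update_issue is deliberately left out of the allowlist.
enabled = allow_only(server_tools, {"get_issue", "other_tool_0"})

all_tokens = description_tokens_per_turn(server_tools)
enabled_tokens = description_tokens_per_turn(enabled)
print(all_tokens, enabled_tokens, all_tokens - enabled_tokens)
```

Under these assumptions, trimming to two tools removes 2,700 description tokens from every turn. The other half of the point is that `update_issue` is simply not on offer, so the agent cannot call it.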
Of course, an agent calling a tool you didn't intend can lead to unintended consequences, so be very intentional about what tools you're enabling through MCP. Don't just enable everything because it's easier. And then lastly, deactivate MCP servers that you don't always need, or just put them in custom agents. I wanna give another example of this. The Playwright MCP server is a really powerful tool for web front-end work. It allows the LLM to drive an automated browser session. However, it's costly in that it can take screenshots and execute page reads that can consume many tokens. Images are very expensive in terms of tokens, so be very intentional about when you use this thing. If it's always left on, it might trigger unnecessary work, like reading a web page for a simple CSS color change. So use it only in combination with custom agents when it's actually needed. And there's also a hidden trap with image tokens and multi-turn agent workflows. If an image is attached early in the session as context, it gets reprocessed and billed again on every subsequent turn, because the full conversation history is sent back to the model each time. Let's talk a little bit about sub agents now. Sub agents open a second context window for specific tasks, like, for example, research. And they help prevent filling the main session with irrelevant information. The sub agent processes documents, creates a summary, and returns only relevant information back to the main session. This improves main session quality but, of course, comes with a trade-off of tokens spent in the sub agent. So when to use sub agents? A lot of the time, the agent is gonna decide this automatically for you, but often you're also able to explicitly invoke them, for example, with research tasks. You can kick those off, and that will trigger a sub agent. Just be mindful that they should be used cautiously because they're a conditional optimization.
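The image-token trap mentioned above can be sketched with some simple arithmetic. This is a minimal Python illustration, not a billing model: the function name, the 1,000-token image, and the 200 new tokens per turn are all hypothetical, and it ignores caching and output tokens. It only assumes what the talk states, that the full history is sent back on every turn.

```python
def cumulative_input_tokens(new_tokens_per_turn, image_tokens=0):
    """Total input tokens sent over a session when the full history
    is resent on every turn."""
    history = image_tokens  # the image sits in the history from turn one
    total = 0
    for new_tokens in new_tokens_per_turn:
        history += new_tokens  # each turn appends to the history
        total += history       # and the whole history is sent again
    return total


turns = [200] * 5  # hypothetical: five turns adding 200 tokens each
without_image = cumulative_input_tokens(turns)
with_image = cumulative_input_tokens(turns, image_tokens=1000)
print(without_image, with_image, with_image - without_image)
```

In this sketch the image is counted once per turn, so a 1,000-token image adds 5,000 input tokens over five turns rather than 1,000.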
So sub agents are a bit of a trade-off: a few more tokens to ensure that you don't end up adding too much back into the main context window. Other agent configs that we can use as levers. These other agent configs have less of an impact on tokens and quality, but they're still good to be aware of. Scoped instructions, these are also known as path-specific .instructions.md files, and they allow you to provide instructions that only apply to agents working in a specific directory. They have the applyTo field in the YAML front matter at the top, where you can specify the path that they apply to. And we find that these are most beneficial in monorepos with distinct code sections. You should start with repository-wide custom instructions and then look to leverage the scoped instructions if your main instructions file starts to become too long, too big. They also come with real maintenance overhead, because they're likely gonna be spread all over the repository, and every scoped file has to be kept current as the code base evolves. So that's something to consider about the scoped instructions. You've got prompt files as well. Prompt files have been around for a while, and they're reusable prompts which are manually invoked. Usually, skills or custom agents are better choices, but just be aware of the existence of prompt files. And then there's Copilot memory. Copilot memory automatically learns from your behavior and team patterns and creates instructions that improve agent quality over time. They're stored in the repository. This all works in the background. Copilot determines when and how they're created, and Copilot also manages the life cycle of the memories itself. So if a memory hasn't been used in 28 days, then it will be removed. There's not much to proactively optimize with memory, but they're worth checking out periodically. Go into your repo where those are stored.
They're stored in the settings. In the Copilot settings, you'll find Copilot memory, if it's enabled there. They're enabled on a per-user basis. So the thing to take away here is that you can go in and delete these as needed. But largely, they're just kinda managed behind the scenes for you. Power user guidance. Alright. So these tips require a bit more knowledge and testing. They sometimes trade quality for token savings, but there are still some things to take away from these items. Think in code. Create scripts to filter outputs before analysis. So for example, filter the GitHub REST API to relevant fields only. Instead of sending a full, massive API response, a big old JSON blob, up to the model, maybe you create a script that extracts only the relevant fields from that and sends that up to the model to save some tokens. That can also improve quality, because it removes irrelevant information from what you're sending up to the model. CLIs versus MCP servers, and this is a bit of an ongoing debate. CLI tools like the GitHub CLI are already known to models and may be leaner than MCP equivalents. You're not having to add tool descriptions for the CLI because the model already knows how to use it, whereas an MCP server adds the tool description to the context window, which takes up tokens. So if you have the option to use a CLI tool instead of an MCP server for a specific task, that may be a more efficient choice. Shell output optimization. Here's a really good tip. I think this is one of the bigger levers that we can call out here. Keep your harnesses up to date. Newer releases of VS Code include token optimization enhancements like terminal output compression, where VS Code post-processes terminal command output before sending it up to the model.
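Going back to the "think in code" tip above, a filtering script might look roughly like the following Python sketch. The sample issue is made up and heavily shortened, and the choice of which fields count as relevant (`RELEVANT_FIELDS`) is an assumption; in practice you would pick the fields your task actually needs. No network call is made here, so the script only shows the trimming step.

```python
import json

# Assumption: these are the only fields the agent needs for this task.
RELEVANT_FIELDS = ("number", "title", "state", "body")


def trim_issue(issue):
    """Reduce an issue payload to the fields worth sending to the model."""
    trimmed = {field: issue[field] for field in RELEVANT_FIELDS if field in issue}
    # Labels arrive as objects; keep just their names.
    trimmed["labels"] = [label["name"] for label in issue.get("labels", [])]
    return trimmed


# Made-up, shortened example payload in the general shape of an issue response.
raw_issue = {
    "number": 45,
    "title": "Login button misaligned",
    "state": "open",
    "body": "The button overlaps the footer on mobile.",
    "labels": [
        {"id": 1, "name": "bug", "color": "d73a4a"},
        {"id": 2, "name": "frontend", "color": "0e8a16"},
    ],
    "user": {"login": "octocat", "id": 1, "avatar_url": "https://example.com/a.png"},
    "html_url": "https://example.com/issues/45",
    "comments_url": "https://example.com/issues/45/comments",
}

compact = json.dumps(trim_issue(raw_issue), separators=(",", ":"))
print(compact)
print(len(json.dumps(raw_issue)), "->", len(compact), "characters")
```

Character counts are only a rough stand-in for tokens, but the direction is the same: fewer fields in, fewer tokens spent, and less irrelevant detail for the model to wade through.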
And I can't stress enough that this is one of the easiest things that you can do on the token optimization front: simply keep your harness up to date. New releases of VS Code and the Copilot CLI include optimizations that will save you tokens without you having to do anything other than update the application. VS Code releases are now coming out on a weekly basis, and the last four of them have all contained some form of token optimization improvement. So check the release notes. Sometimes these features are in preview, and they may require you to go into VS Code settings and check a box, but they're certainly worth experimenting with and will have an impact on the amount of tokens that you're sending up. The Copilot CLI has also been getting updates throughout the week recently and contains similar improvements on that front. Chronicle. This is now available in both VS Code and the Copilot CLI to analyze your session logs and suggest prompt optimizations. So use the /chronicle tips command combined with a prompt requesting an analysis of your prompting behavior and your model selection. It should come back and print out an actionable report on when you could have used a less expensive model or when you could have optimized your prompt for better quality and token savings. So try it out. It's in both the Copilot CLI and the VS Code chat interface now. I've been using this regularly, and I really find the feedback to be helpful, because I am certainly guilty of using way too beefy models for tasks that didn't require such a heavy model. Collapsing tool calls. Just be aware there are tools out in the wild that can help batch multiple tool calls into one to help reduce turns. And then you've got model-specific optimization.
So this is really only for power users with thousands of agents who happen to have a really strong understanding of the quirks of different models. For example, if you know that a certain model is particularly good at following instructions but not so great at code generation, you might use that model for planning and then switch to a different model for implementation. Alright. Long-term guidance. Let's end the session with a more forward-looking, long-term outlook on what to focus on to be truly successful with agentic development and enhance your context engineering skill set. So build your analytical skills. What's always set developers apart was never just writing code. It's the analytical skills that you bring to the table. You build domain knowledge quickly, you understand customer needs, and you know how to translate those requirements into technology. This is your strong suit, and it's going to continue to be highly in demand going forward. And this is something agents can't do on their own. They don't understand the nuances of your customers. They're not able to make high-level decisions about what matters in an application. They're able to execute but not strategize like you're able to. So apply good architecture. This is now more important than ever, because good architecture is able to reduce agent misses. We can provide navigation and guardrails, help prevent agents from placing code in the wrong locations, and just in general maintain high code quality. These architecture patterns clearly distinguish low-level technology from your differentiating domain core and give agents excellent guardrails. And lastly, iterate on your prompts and your agent configs, because you're now a context engineer, and this isn't one-time work. It's continuous engineering. So approach it with an engineering mindset. Set your agents up for success consistently. Tools like Chronicle can help analyze and optimize your prompting behavior over time.
So make it your goal to continuously improve your prompts and agent configs based on data and feedback from your agents. Alright. So let's wrap up here. We are getting close to time. I realize this is a lot to take in. So as a reminder, here are the biggest tips we can give you today that you can use to start improving your agent quality and token spend without too much effort. Choose the right model for the job, or just rock the auto model picker come June 1. Provide clear guidance in your prompts. Split your tasks into separate research, plan, and implementation sessions. Provide deterministic guardrails, so tests, tests, and more tests, linters, security scans. Maintain a concise, human-written persistent instructions file going forward. And lastly, keep your harnesses up to date to benefit from the built-in optimizations that are being added to them on a regular basis now. And if you take anything out of this talk, it's provide as little context as required and as much as necessary. So implement those tips, and you'll already be in a fantastic place going forward. Alright. So that's the end of the presentation today. I wanna thank you all for your time and attention. I hope this was useful and that you learned something new about how to work with agents, how to optimize your tokens, and how to be a better context engineer. So until next time, take care.