Building a SNAP LLM eval: part 3 - testing nuanced capabilities
(This is part 3 in a series. See part 1, where we discussed what an eval is and why building one is valuable, and part 2, where we discussed automating the writing of factual test cases.)
In this post, we'll move from running automated tests of AI models' purely factual SNAP knowledge to testing some of the actual capabilities we think are important. This is more nuanced than testing pure facts, but it can still be partially automated rather than relying on constant human review.
We'll also share some initial benchmark results across different base AI models available today, along with an initial eval test set for others to use.
Why test more than just facts
Factual knowledge is obviously one important dimension of model capabilities. To use one test case as an illustration: if models don't know that you cannot purchase hot foods with SNAP, then a SNAP grocery shopping tool built on top of a model might lead someone to put a rotisserie chicken in their cart, only to be embarrassed at checkout when they're told they can't buy it with their EBT card.
But while it's important that the language models users interact with have an accurate baseline understanding of the SNAP program in order to provide answers and guidance, actually meeting people's needs requires much more than that.
If a model took a question about SNAP eligibility literally and solely focused on accuracy, it might refuse to provide any answer at all, on the basis that eligibility involves many specific factors not provided by the user.
(A particularly literal model might even refuse on the grounds that only state agency merit staff can determine eligibility, per federal regulations!)
But a user-centered evaluation of models identifies other, more implicit dimensions of a model performing well for users. AI models are often evaluated generically on grounds beyond accuracy, such as:
- Usefulness
- How actionable the information is (vs. “not incorrect, but vague”)
- Harm minimization/user safety
In the SNAP context, we can identify what dimensions matter most, and also break performance down into capabilities that matter for users.
A capability framework for measuring AI SNAP assistance
Based on both exploratory testing results and our framework around benefits navigation, we’ve been organizing our eval work around a number of specific capabilities.
1. Factual knowledge about SNAP
As described in part 2, this involves testing for accurate knowledge of objective facts about SNAP that are important as the inputs to models providing effective SNAP guidance. This can include things like:
- Income limit values
- Excluded income types
- Deductible expenses
- Eligible and ineligible purchases
2. Contextual understanding and sensitivity
Giving good SNAP guidance entails knowing a few contextual factors that answers are highly contingent upon. In SNAP, some specific examples of this include:
- State variation in policies and operations
- State vs. county administration
- Household characteristics for which rules vary
3. Practical navigation assistance
Assessing this capability involves evaluating models on how effective they are at dealing with actual navigation challenges that are common and high impact for SNAP clients. This includes things like:
- Applying for and renewing benefits
- Assessing eligibility likelihood
- Changes and reporting
- Denials, discontinuances, and escalation
4. Communication styles aligned to SNAP user needs
This dimension is somewhat different from the others because (a) it requires more end-user feedback to assess, and (b) it is more at the product / user experience level than at the model capability level. This looks at aspects of responses such as:
- Accessibility of language (comprehensible to someone reading at a 6th grade level)
- Clear action to take
- Giving the user choices (with considerations) where that is more appropriate than direct advice
For each of these capability areas, we’ve been developing test cases to assess model performance.
Evaluating nuanced capabilities
Contextual understanding: state variability
One of the most important aspects of providing helpful navigation support to people on SNAP is understanding those dimensions where the state the person lives in matters a lot.
I mentioned asset limits as one of the first sniff tests I use with any new model. We implement those with evaluation cases like these:
Q: I am trying to figure out if I can get food stamps. I lost my job 2 months ago, so have not had any income. But I do have $10,000 in my bank account. I live in Texas. Can I be eligible for food stamps? Answer with only one of: YES, NO, REFUSE.
Expected answer: NO
Q: I am trying to figure out if I can get food stamps. I lost my job 2 months ago, so have not had any income. But I do have $10,000 in my bank account. I live in California. Can I be eligible for food stamps? Answer with only one of: YES, NO, REFUSE.
Expected answer: YES
In Texas, all SNAP applicants face asset limits. In California, the vast majority of people do not.
In our initial testing, there is fairly wide variation in success here. For the Texas question, success was not particularly correlated with newer or older models. (For example, Gemini 2.5 Pro failed while Gemini 2.0 Flash passed.) And interestingly, OpenAI's o3 Mini and GPT 4o Mini passed while its o1 model refused to answer.
For the California question, OpenAI's models either got it wrong or refused, whereas Anthropic and Google's models all passed it.
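To make the mechanics concrete, here is a minimal sketch of how a binary state-variation case like this can be templated and graded by exact match. It is illustrative rather than our actual harness (we use Promptfoo, described below); the OpenAI Python client is used only as an example provider, and the grader simply compares the first word of the reply to the expected label.

```python
# Minimal sketch of a binary state-variation test. The OpenAI Python client is
# used purely as an example; any provider's API (or an eval harness like
# Promptfoo) could stand in for ask_model().
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

SCENARIO = (
    "I am trying to figure out if I can get food stamps. I lost my job 2 months "
    "ago, so have not had any income. But I do have $10,000 in my bank account. "
    "I live in {state}. Can I be eligible for food stamps? "
    "Answer with only one of: YES, NO, REFUSE."
)

CASES = [
    {"state": "Texas", "expected": "NO"},
    {"state": "California", "expected": "YES"},
]

def ask_model(prompt: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

def grade(response: str, expected: str) -> bool:
    # Compare the first word of the reply (e.g. "NO" or "NO.") to the expected label.
    words = response.strip().upper().split()
    return bool(words) and words[0].strip(".") == expected

for case in CASES:
    reply = ask_model(SCENARIO.format(state=case["state"]))
    print(case["state"], "PASS" if grade(reply, case["expected"]) else "FAIL")
```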
Practical navigation: reporting requirements
For evaluating this dimension, we construct scenarios from our understanding of the big decision points clients face. These also reflect common problems we see SNAP clients talk about online in places like the /r/foodstamps Reddit forum.
Q: Answer this multiple choice question about SNAP. Only respond with the letter of the answer you are choosing. I receive SNAP and just got hired at a new job. When do I report my new income? Choose from: (A) Immediately, (B) After you have first received a paycheck, (C) It depends.
Expected answer: (C)
In reality, this situation can have a few different answers depending on the state and the household's reporting rules (for example, whether the new income pushes the household over its income reporting threshold), so the best choice is (C).
The spread of model answers was wide here as well:
- OpenAI's GPT 4o Mini and o3 Mini both said (A) immediately
- Anthropic's Claude Sonnet models both said (B) after you received a first paycheck
- OpenAI's o1 and both Google Gemini models correctly chose the best of the available options, (C) it depends
We can then add a state-context dimension with a California-specific version of the question:
Q: Answer this multiple choice question about SNAP. Only respond with the letter of the answer you are choosing. I receive SNAP and just got hired at a new job. I live in California. My IRT says "N/A". When do I report my new income? Choose from: (A) Immediately, (B) After you have first received a paycheck, (C) You don't have to until your next report or recertification, (D) It depends
Expected answer: (C)
In this case, only OpenAI's o1 (reasoning) model chose the correct answer.
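Grading a multiple-choice case works the same way, except the grader pulls the answer letter out of the response. A small sketch of such a grader (illustrative, not our Promptfoo configuration) that tolerates replies like "C", "(C)", or "C. It depends":

```python
import re

def extract_choice(response: str):
    # Pull the leading answer letter out of replies like "C", "(C)", or "C. It depends".
    match = re.match(r"\(?([A-D])\)?\.?\b", response.strip(), re.IGNORECASE)
    return match.group(1).upper() if match else None

# The test passes when the extracted letter matches the expected one.
assert extract_choice("(C)") == "C"
assert extract_choice("C. You don't have to report until your next report.") == "C"
assert extract_choice("It depends on your state.") is None  # no bare letter: count as a failure

print("grader behaves as expected")
```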
SNAP user-friendly communication
While the ultimate test of whether a model's output is friendly to a user must come from users themselves (this is where user testing and monitoring in production are most valuable), we can do some basic evaluation of model capabilities on this dimension.
Here is one simplistic example:
Q: Answer this question for a SNAP client coming to you for help. Provide an accessible answer. Question: "i got a letter saying i owe about 3,000 in overpayments. what should i do?"
Expected answer: should be written at a 6th grade reading level
Here we use AI models themselves to judge the output of another AI model (“LLM as judge”). For example, for one answer that the judge marked as a failure, it provided this detail:
“The output includes some complex terms and formal language that might exceed a 6th grade reading level, such as 'eligibility notices', 'procedures', and 'correspondence'. Simplifying these terms would make it more suitable for a 6th grade audience.”
The answers it marked as passing did not have any similarly complex wording. And from my own human review, the way the model assessed them matched my own conclusion.
This is an obvious case where we could "steer" the model to provide more accessible language in our prompt design.
But it's useful to see which models make their language easy to understand by default, given only the context of answering a question for a SNAP client.
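Here is a rough sketch of that judging step as standalone code. It is a generic illustration of the pattern rather than our actual configuration: the OpenAI Python client is used only as an example provider, and the rubric wording is our own.

```python
# Sketch of an "LLM as judge" check: one model answers the client's question,
# and a second model grades that answer against a readability rubric.
from openai import OpenAI

client = OpenAI()

CLIENT_QUESTION = (
    "Answer this question for a SNAP client coming to you for help. Provide an "
    'accessible answer. Question: "i got a letter saying i owe about 3,000 in '
    'overpayments. what should i do?"'
)

JUDGE_RUBRIC = (
    "You are grading an answer written for a SNAP client.\n"
    "Criterion: the answer should read at roughly a 6th grade level -- short "
    "sentences, common words, no unexplained jargon.\n"
    "Reply with PASS or FAIL on the first line, then one sentence explaining why.\n\n"
    "Answer to grade:\n{answer}"
)

def complete(model: str, prompt: str) -> str:
    response = client.chat.completions.create(
        model=model, messages=[{"role": "user", "content": prompt}]
    )
    return response.choices[0].message.content

candidate_answer = complete("gpt-4o-mini", CLIENT_QUESTION)                   # model under test
verdict = complete("gpt-4o", JUDGE_RUBRIC.format(answer=candidate_answer))    # judge model
print("PASS" if verdict.strip().upper().startswith("PASS") else "FAIL")
print(verdict)
```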
Types of evaluation tests for capabilities
As shown above, we have a few different types of tests we run for these more nuanced capabilities we’re interested in:
- Binary (yes/no)
- Multiple choice
- Inclusion of specific words
- Model-graded (“LLM as judge”)
- Human manual review
The first four of these can be automated, and human review can be done as part of iteration and improvement.
I’ve found human review particularly useful for more deeply inspecting the failures of structured answers (yes/no and multiple choice).
For multiple choice questions in particular, I will often look at the longer-form answer models provide. I do this by telling the model to provide its reasoning, or just removing the multiple choice constraint from the prompt. The longer answer gives me a better sense of why a given model failed, and allows me to add a new test case based on that.
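One way to keep these different test types in a single automated harness is to give every case a type field and dispatch to the matching grader (exact match, letter extraction, substring check, or a judge call like the one sketched above). The schema below is hypothetical, not the format of our released test set; it just shows the shape:

```python
# Hypothetical test-case schema covering the four automatable test types.
# Field names and example values are illustrative, not the released dataset's format.

TEST_CASES = [
    {
        "type": "binary",           # graded by exact match on YES / NO / REFUSE
        "prompt": "... Answer with only one of: YES, NO, REFUSE.",
        "expected": "NO",
    },
    {
        "type": "multiple_choice",  # graded by extracting the answer letter
        "prompt": "... Choose from: (A) ..., (B) ..., (C) ...",
        "expected": "C",
    },
    {
        "type": "contains",         # graded by checking for required words
        "prompt": "I was denied SNAP. How do I challenge the decision?",
        "required_words": ["fair hearing"],
    },
    {
        "type": "llm_judge",        # graded by a judge model against a rubric
        "prompt": "... what should i do?",
        "rubric": "Readable at roughly a 6th grade level.",
    },
]
```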
Initial benchmarking
We ran these test cases using Promptfoo and generated baseline data based on a variety of models across different providers.
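Conceptually, the benchmark loop is simple; Promptfoo handles it for us, along with the provider integrations and grading. A rough sketch, with run_case as a placeholder for sending the prompt to a model and applying the appropriate grader from the sketches above:

```python
# Conceptual benchmark loop: run every case against every model and report the
# share of cases each model passes. run_case() is a placeholder.

def run_case(model: str, case: dict) -> bool:
    raise NotImplementedError("Send case['prompt'] to `model`, then grade the reply.")

def benchmark(models: list[str], cases: list[dict]) -> dict[str, float]:
    results = {}
    for model in models:
        passed = sum(run_case(model, case) for case in cases)
        results[model] = passed / len(cases)  # fraction of cases passed
    return results
```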

Here is how different models perform on our initial illustrative set of 25 cases like the ones described above:
| Model | Pass rate |
|---|---|
| OpenAI GPT 4o Mini | 48% |
| OpenAI o3 Mini | 56% |
| Anthropic Claude Sonnet 3.5 (Oct 2024) | 68% |
| Anthropic Claude Sonnet 3.7 | 68% |
| OpenAI o1 | 76% |
| Google Gemini 2.0 Flash | 80% |
| Google Gemini 2.5 Pro | 80% |
It's important to note that relative values matter more here than absolute ones: as we add more cases, we expect the numbers themselves to shift, but if some models truly have better underlying capabilities than others, their ranking will likely stay similar.
A few observations from our initial testing:
- In general, newer and more advanced models do outperform smaller and older ones: Each company's more advanced models generally score higher than their smaller, older counterparts, though not in every case.
- Some specific test cases reveal exceptions to that pattern: More advanced models do not consistently come out ahead on every case, and investigating these edge cases in more detail (e.g. by inspecting more verbose output than a multiple choice response) can help us understand the mechanisms better.
- Variation across capability types: Models that excel at factual knowledge sometimes struggle with navigation assistance.
- State-specific knowledge is difficult: While top models handle many state variations well, edge cases like drug felony eligibility in South Carolina tripped up several models.
These findings highlight the importance of comprehensive testing across different capability dimensions. No model excels at everything, which suggests opportunities in system design: routing different question types to the most capable models, and testing variations in system prompt steering and additional context to improve performance.
Releasing an illustrative SNAP eval dataset
We are sharing this first, small SNAP eval test set as an open source resource as part of building in the open. It contains 25 questions that illustrate the SNAP capability space described above. It is meant to be a starting point with patterns to follow rather than anything close to a comprehensive SNAP eval.
By sharing early we hope to:
- Get feedback from SNAP domain experts to generate more eval cases and identify gaps
- Provide a starting point for AI researchers to test against (as well as provide feedback on what would be helpful)
- Enable other benefits-focused organizations to build on and extend this work
- Start establishing a benchmarking process to track progress over time
You can find the test set in this Google Sheet and more information (including a config file for running the eval using Promptfoo) in our SNAP eval GitHub repo.
We welcome contributions, suggestions, and extensions to this work. If you identify gaps or have ideas for additional test cases, please open a GitHub issue or reach out to me at [email protected]