Evaluating LLM agents in practice
LangChain provides several predefined evaluators that assess outputs against specific rubrics or sets of criteria. Common criteria include conciseness, relevance, correctness, coherence, helpfulness, and controversiality.
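As a minimal sketch of how this looks in code, a criteria evaluator can be loaded with load_evaluator and applied to a single prediction. This assumes the langchain and langchain-openai packages are installed and an OpenAI API key is configured; the model name here is an illustrative choice, and any chat model can be passed instead.

```python
from langchain.evaluation import load_evaluator
from langchain_openai import ChatOpenAI

# Criteria-based evaluation: an LLM grades the output against a rubric.
# The model name is an assumption; any chat model can be used as the judge.
llm = ChatOpenAI(model="gpt-4o-mini", temperature=0)
evaluator = load_evaluator("criteria", criteria="conciseness", llm=llm)

result = evaluator.evaluate_strings(
    prediction=(
        "Paris is the capital of France. It is a wonderful city with "
        "rich history, world-class museums, and excellent food."
    ),
    input="What is the capital of France?",
)
# The result is a dict with the judge's reasoning, a Y/N value, and a 0/1 score
print(result)
```

Swapping the criteria argument for "relevance", "correctness", "coherence", "helpfulness", or "controversiality" applies the corresponding rubric without any other changes.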
We can also compare results from an LLM or agent against reference outputs using methods such as pairwise string comparisons, string distances, and embedding distances. These comparisons help determine which LLM or agent is preferable, and confidence intervals and p-values can be calculated to assess how reliable that conclusion is.
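The following sketch shows these reference-based evaluators side by side. It assumes the rapidfuzz package for string distance and OpenAI credentials for the embedding and pairwise evaluators (both accept your own models via the embeddings and llm arguments); the questions and answers are made up for illustration.

```python
from langchain.evaluation import load_evaluator

answer_a = "The capital of France is Paris."
answer_b = "Paris."
reference = "Paris is the capital of France."

# String distance between prediction and reference (no LLM call needed)
string_eval = load_evaluator("string_distance")
print(string_eval.evaluate_strings(prediction=answer_a, reference=reference))

# Embedding distance (defaults to OpenAI embeddings; lower score = more similar)
embedding_eval = load_evaluator("embedding_distance")
print(embedding_eval.evaluate_strings(prediction=answer_a, reference=reference))

# Pairwise comparison: an LLM judge picks the better of two candidate answers
pairwise_eval = load_evaluator("labeled_pairwise_string")
print(pairwise_eval.evaluate_string_pairs(
    prediction=answer_a,
    prediction_b=answer_b,
    input="What is the capital of France?",
    reference=reference,
))
```

Running such comparisons over a whole dataset of questions, rather than a single example, is what makes the aggregate statistics mentioned above meaningful.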
Let’s go through a few basics and apply useful evaluation strategies. We’ll start with LangChain.
Evaluating the correctness of results
Let’s think of an example where we want to verify that an LLM’s answer is correct (or how...