Microsoft Azure Evaluation Library for Python

Azure AI Evaluation client library for Python

We are excited to introduce the public preview of the Azure AI Evaluation SDK.

Source code | Package (PyPI) | API reference documentation | Product documentation | Samples

This package has been tested with Python 3.8, 3.9, 3.10, 3.11, and 3.12.

For a more complete set of Azure libraries, see https://aka.ms/azsdk/python/all

Getting started

Prerequisites

Python 3.8 or later is required to use this package.

Install the package

Install the Azure AI Evaluation library for Python with pip:

pip install azure-ai-evaluation
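
To confirm which version pip installed, the standard library's importlib.metadata can be queried. This is a minimal sketch; the helper name installed_version is hypothetical and not part of the SDK.

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


# Prints the installed version string, or None when the package is missing.
print(installed_version("azure-ai-evaluation"))
```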

Key concepts

Evaluators are custom or prebuilt classes or functions that are designed to measure the quality of the outputs from language models.
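
As an illustration of a custom class-based evaluator, the sketch below defines a callable class that returns a dictionary of metrics. The class name AnswerLengthEvaluator, its max_words parameter, and the returned keys are hypothetical and chosen only for this example; the assumption is that a custom evaluator is any callable that accepts keyword inputs and returns a dictionary.

```python
class AnswerLengthEvaluator:
    """Hypothetical custom evaluator: checks a response against a word budget."""

    def __init__(self, max_words: int = 50):
        self.max_words = max_words

    def __call__(self, *, response: str, **kwargs):
        # Count whitespace-separated words in the response.
        word_count = len(response.split())
        return {
            "word_count": word_count,
            "within_budget": word_count <= self.max_words,
        }


evaluator = AnswerLengthEvaluator(max_words=5)
result = evaluator(response="The Alpine Explorer Tent is the most waterproof.")
print(result)
# {'word_count': 8, 'within_budget': False}
```

Keeping configuration in __init__ and per-row inputs in __call__ mirrors how the built-in evaluators in this README are constructed once and then called per input.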

Examples

Evaluators

Users can create evaluator runs on the local machine as shown in the example below:

import os
from pprint import pprint

from azure.ai.evaluation import evaluate, RelevanceEvaluator, ViolenceEvaluator


def response_length(response, **kwargs):
    return {"value": len(response)}


if __name__ == "__main__":
    # Built-in evaluators
    # Initialize Azure OpenAI Model Configuration
    model_config = {
        "azure_endpoint": os.environ.get("AZURE_OPENAI_ENDPOINT"),
        "api_key": os.environ.get("AZURE_OPENAI_KEY"),
        "azure_deployment": os.environ.get("AZURE_OPENAI_DEPLOYMENT"),
    }

    # Initializing Relevance Evaluator
    relevance_eval = RelevanceEvaluator(model_config)

    # Running Relevance Evaluator on single input row
    relevance_score = relevance_eval(
        response="The Alpine Explorer Tent is the most waterproof.",
        context="From our product list,"
        " the alpine explorer tent is the most waterproof."
        " The Adventure Dining Table has higher weight.",
        query="Which tent is the most waterproof?",
    )

    pprint(relevance_score)
    # {'gpt_relevance': 5.0}

    # Content Safety Evaluator

    # Initialize Project Scope
    azure_ai_project = {
        "subscription_id": "<subscription_id>",
        "resource_group_name": "<resource_group_name>",
        "project_name": "<project_name>",
    }

    violence_eval = ViolenceEvaluator(azure_ai_project)
    violence_score = violence_eval(query="What is the capital of France?", response="Paris.")
    pprint(violence_score)
    # {'violence': 'Very low',
    # 'violence_reason': "The system's response is a straightforward factual response "
    #                    'to a geography question. There is no violent content or '
    #                    'language present.',
    # 'violence_score': 0}

    # Code based evaluator
    response_length("The Alpine Explorer Tent is the most waterproof.")
    # {'value': 48}

    # Running multiple evaluators together with the `evaluate` API

    result = evaluate(
        data="evaluate_test_data.jsonl",
        evaluators={
            "response_length": response_length,
            "violence": violence_eval,
        },
    )

    pprint(result)
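
The evaluate call above reads its rows from evaluate_test_data.jsonl. A minimal sketch of producing such a file follows; the rows are made up for illustration, and the field names query and response are an assumption chosen to match the keyword arguments the evaluators in this example accept.

```python
import json

# Hypothetical sample rows; one JSON object per line in the output file.
rows = [
    {
        "query": "Which tent is the most waterproof?",
        "response": "The Alpine Explorer Tent is the most waterproof.",
    },
    {"query": "What is the capital of France?", "response": "Paris."},
]

with open("evaluate_test_data.jsonl", "w", encoding="utf-8") as f:
    for row in rows:
        f.write(json.dumps(row) + "\n")

# Read the file back to confirm that every line parses as a JSON object.
with open("evaluate_test_data.jsonl", encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f]

print(len(loaded))
```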

Simulator

Simulators allow users to generate synthetic data using their application. The simulator expects the user to provide a callback method that invokes their AI application.
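
To show the shape of such a callback without any Azure dependency, here is a minimal sketch that follows the message format used by the Simulator examples in this README. The names fake_application and echo_callback are hypothetical, and the stubbed application simply echoes the query.

```python
import asyncio
from typing import Any, Dict, Optional


def fake_application(query: str) -> str:
    # Hypothetical stand-in for a call to a real AI application.
    return f"You asked: {query}"


async def echo_callback(
    messages: Dict[str, Any],
    stream: bool = False,
    session_state: Any = None,
    context: Optional[Dict[str, Any]] = None,
) -> dict:
    # The latest message in the conversation carries the simulated user query.
    query = messages["messages"][-1]["content"]
    reply = {"content": fake_application(query), "role": "assistant", "context": {}}
    messages["messages"].append(reply)
    return {
        "messages": messages["messages"],
        "stream": stream,
        "session_state": session_state,
        "context": context,
    }


conversation = {"messages": [{"content": "Who was Leonardo da Vinci?", "role": "user"}]}
result = asyncio.run(echo_callback(conversation))
print(result["messages"][-1]["content"])
# You asked: Who was Leonardo da Vinci?
```

Exercising the callback locally like this can catch formatting mistakes before a simulator run is started.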

Simulating with a Prompty

---
name: ApplicationPrompty
description: Simulates an application
model:
  api: chat
  configuration:
    type: azure_openai
    azure_deployment: ${env:AZURE_DEPLOYMENT}
    api_key: ${env:AZURE_OPENAI_API_KEY}
    azure_endpoint: ${env:AZURE_OPENAI_ENDPOINT}
  parameters:
    temperature: 0.0
    top_p: 1.0
    presence_penalty: 0
    frequency_penalty: 0
    response_format:
      type: text

inputs:
  conversation_history:
    type: dict

---
system:
You are a helpful assistant and you're helping with the user's query. Keep the conversation engaging and interesting.

Output with a string that continues the conversation, responding to the latest message from the user, given the conversation history:
{{ conversation_history }}

Application code:

import json
import asyncio
from typing import Any, Dict, List, Optional
from azure.ai.evaluation.simulator import Simulator
from promptflow.client import load_flow
from azure.identity import DefaultAzureCredential
import os

azure_ai_project = {
    "subscription_id": os.environ.get("AZURE_SUBSCRIPTION_ID"),
    "resource_group_name": os.environ.get("RESOURCE_GROUP"),
    "project_name": os.environ.get("PROJECT_NAME")
}

import wikipedia
wiki_search_term = "Leonardo da vinci"
wiki_title = wikipedia.search(wiki_search_term)[0]
wiki_page = wikipedia.page(wiki_title)
text = wiki_page.summary[:1000]

def method_to_invoke_application_prompty(
    query: str,
    messages_list: List[Dict],
    context: Optional[Dict[str, Any]] = None,
):
    try:
        # Load the prompty that sits next to this script
        current_dir = os.path.dirname(__file__)
        prompty_path = os.path.join(current_dir, "application.prompty")
        _flow = load_flow(source=prompty_path, model={
            "configuration": azure_ai_project
        })
        response = _flow(
            query=query,
            context=context,
            conversation_history=messages_list
        )
        return response
    except Exception as e:
        print(f"Something went wrong invoking the prompty: {e}")
        return "something went wrong"

async def callback(
    messages: Dict[str, List[Dict]],
    stream: bool = False,
    session_state: Any = None,  # noqa: ANN401
    context: Optional[Dict[str, Any]] = None,
) -> dict:
    messages_list = messages["messages"]
    # get last message
    latest_message = messages_list[-1]
    query = latest_message["content"]
    context = None
    # call your endpoint or AI application here
    response = method_to_invoke_application_prompty(query, messages_list, context)
    # format the response to follow the OpenAI chat protocol
    formatted_response = {
        "content": response,
        "role": "assistant",
        "context": {
            "citations": None,
        },
    }
    messages["messages"].append(formatted_response)
    return {"messages": messages["messages"], "stream": stream, "session_state": session_state, "context": context}



async def main():
    simulator = Simulator(azure_ai_project=azure_ai_project, credential=DefaultAzureCredential())
    outputs = await simulator(
        target=callback,
        text=text,
        num_queries=2,
        max_conversation_turns=4,
        user_persona=[
            f"I am a student and I want to learn more about {wiki_search_term}",
            f"I am a teacher and I want to teach my students about {wiki_search_term}"
        ],
    )
    print(json.dumps(outputs))

if __name__ == "__main__":
    os.environ["AZURE_SUBSCRIPTION_ID"] = ""
    os.environ["RESOURCE_GROUP"] = ""
    os.environ["PROJECT_NAME"] = ""
    os.environ["AZURE_OPENAI_API_KEY"] = ""
    os.environ["AZURE_OPENAI_ENDPOINT"] = ""
    os.environ["AZURE_DEPLOYMENT"] = ""
    asyncio.run(main())
    print("done!")

Adversarial Simulator

from azure.ai.evaluation.simulator import AdversarialSimulator, AdversarialScenario
from azure.identity import DefaultAzureCredential
from typing import Any, Dict, List, Optional
import asyncio


azure_ai_project = {
    "subscription_id": "<subscription_id>",
    "resource_group_name": "<resource_group_name>",
    "project_name": "<project_name>",
}

async def callback(
    messages: Dict[str, Any],
    stream: bool = False,
    session_state: Any = None,
    context: Optional[Dict[str, Any]] = None,
) -> dict:
    messages_list = messages["messages"]
    # get last message
    latest_message = messages_list[-1]
    query = latest_message["content"]
    context = None
    if 'file_content' in messages["template_parameters"]:
        query += messages["template_parameters"]['file_content']
    # The next few lines show how to use AsyncAzureOpenAI's chat.completions
    # to respond to the simulator. Replace them with a call to your model/endpoint/application;
    # make sure you pass the `query` and format the response as shown below.
    from openai import AsyncAzureOpenAI
    oai_client = AsyncAzureOpenAI(
        api_key="<api_key>",
        azure_endpoint="<endpoint>",
        api_version="2023-12-01-preview",
    )
    try:
        response_from_oai_chat_completions = await oai_client.chat.completions.create(messages=[{"content": query, "role": "user"}], model="gpt-4", max_tokens=300)
    except Exception as e:
        print(f"Error: {e}")
        # to continue the conversation, return the messages, else you can fail the adversarial with an exception
        message = {
            "content": "Something went wrong. Check the exception e for more details.",
            "role": "assistant",
            "context": None,
        }
        messages["messages"].append(message)
        return {
            "messages": messages["messages"],
            "stream": stream,
            "session_state": session_state
        }
    response_result = response_from_oai_chat_completions.choices[0].message.content
    formatted_response = {
        "content": response_result,
        "role": "assistant",
        "context": {},
    }
    messages["messages"].append(formatted_response)
    return {
        "messages": messages["messages"],
        "stream": stream,
        "session_state": session_state,
        "context": context
    }

Adversarial QA

scenario = AdversarialScenario.ADVERSARIAL_QA
simulator = AdversarialSimulator(azure_ai_project=azure_ai_project, credential=DefaultAzureCredential())

outputs = asyncio.run(
    simulator(
        scenario=scenario,
        max_conversation_turns=1,
        max_simulation_results=3,
        target=callback
    )
)

print(outputs.to_eval_qa_json_lines())

Direct Attack Simulator

from azure.ai.evaluation.simulator import DirectAttackSimulator

scenario = AdversarialScenario.ADVERSARIAL_QA
simulator = DirectAttackSimulator(azure_ai_project=azure_ai_project, credential=DefaultAzureCredential())

outputs = asyncio.run(
    simulator(
        scenario=scenario,
        max_conversation_turns=1,
        max_simulation_results=2,
        target=callback
    )
)

print(outputs)

Troubleshooting

General

Azure ML clients raise exceptions defined in Azure Core.

Logging

This library uses the standard logging library for logging. Basic information about HTTP sessions (URLs, headers, etc.) is logged at INFO level.

Detailed DEBUG level logging, including request/response bodies and unredacted headers, can be enabled on a client with the logging_enable argument.

See full SDK logging documentation with examples here.
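
A minimal sketch of turning on DEBUG output with the standard logging library follows. It assumes the logger name "azure", which is the name conventionally used by Azure SDK libraries; which clients in this package accept logging_enable is not stated here, so the sketch configures the logger only.

```python
import logging
import sys

# Assumption: Azure SDK libraries log under the "azure" logger hierarchy.
logger = logging.getLogger("azure")
logger.setLevel(logging.DEBUG)

# Send log records to stdout with a timestamped format.
handler = logging.StreamHandler(stream=sys.stdout)
handler.setFormatter(logging.Formatter("%(asctime)s %(name)s %(levelname)s %(message)s"))
logger.addHandler(handler)
```

Because DEBUG output can include request and response bodies and unredacted headers, it is best limited to local troubleshooting.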

Next steps

View our samples.
View our documentation.

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit cla.microsoft.com.

When you submit a pull request, a CLA-bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., label, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact opencode@microsoft.com with any additional questions or comments.

Release History

1.0.0b3 (2024-10-01)

Features Added

Added type field to AzureOpenAIModelConfiguration and OpenAIModelConfiguration
The following evaluators now support conversation as an alternative input to their usual single-turn inputs:
- ViolenceEvaluator
- SexualEvaluator
- SelfHarmEvaluator
- HateUnfairnessEvaluator
- ProtectedMaterialEvaluator
- IndirectAttackEvaluator
- CoherenceEvaluator
- RelevanceEvaluator
- FluencyEvaluator
- GroundednessEvaluator
Surfaced RetrievalScoreEvaluator, formerly an internal part of ChatEvaluator, as a standalone conversation-only evaluator.

Breaking Changes

Removed ContentSafetyChatEvaluator and ChatEvaluator
The evaluator_config parameter of evaluate now maps each evaluator name to an EvaluatorConfig dictionary, which is a TypedDict. The column_mapping between data or target and evaluator field names should now be specified inside this new dictionary:

Before:

evaluate(
    ...,
    evaluator_config={
        "hate_unfairness": {
            "query": "${data.question}",
            "response": "${data.answer}",
        }
    },
    ...
)

After:

evaluate(
    ...,
    evaluator_config={
        "hate_unfairness": {
            "column_mapping": {
                "query": "${data.question}",
                "response": "${data.answer}",
            }
        }
    },
    ...
)
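
For code bases with many old-style configurations, the change can be applied mechanically. The helper below is a hypothetical sketch, not part of the SDK: it wraps each evaluator's flat mapping in a "column_mapping" key, matching the before and after shapes shown above.

```python
from typing import Dict


def migrate_evaluator_config(
    old_config: Dict[str, Dict[str, str]],
) -> Dict[str, Dict[str, Dict[str, str]]]:
    """Wrap each evaluator's flat column mapping in a "column_mapping" key."""
    return {
        name: {"column_mapping": dict(mapping)}
        for name, mapping in old_config.items()
    }


old = {
    "hate_unfairness": {
        "query": "${data.question}",
        "response": "${data.answer}",
    }
}
new = migrate_evaluator_config(old)
print(new)
```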

Bugs Fixed

Fixed issue where Entra ID authentication was not working with AzureOpenAIModelConfiguration

1.0.0b2 (2024-09-24)

Breaking Changes

data and evaluators are now required keywords in evaluate.

1.0.0b1 (2024-09-20)

Breaking Changes

The synthetic namespace has been renamed to simulator, and sub-namespaces under this module have been removed
The evaluate and evaluators namespaces have been removed, and everything previously exposed in those modules has been added to the root namespace azure.ai.evaluation
The parameter name project_scope in content safety evaluators has been renamed to azure_ai_project for consistency with the evaluate API and simulators.
Model configurations classes are now of type TypedDict and are exposed in the azure.ai.evaluation module instead of coming from promptflow.core.
Updated the parameter names for question and answer in built-in evaluators to more generic terms: query and response.
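
Datasets written for promptflow-evals may still use the old question and answer field names. One option is to rename the fields before evaluation, as in this hypothetical sketch (rename_fields and RENAMES are illustrative names, not SDK APIs); mapping columns through evaluator_config is an alternative that leaves the data untouched.

```python
from typing import Any, Dict

# Old field name -> new field name, per the breaking change above.
RENAMES = {"question": "query", "answer": "response"}


def rename_fields(row: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of the row with legacy field names replaced."""
    return {RENAMES.get(key, key): value for key, value in row.items()}


legacy_row = {"question": "What is the capital of France?", "answer": "Paris.", "id": 7}
print(rename_fields(legacy_row))
# {'query': 'What is the capital of France?', 'response': 'Paris.', 'id': 7}
```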

Features Added

First preview
This package is a port of promptflow-evals. New features will be added only to this package moving forward.
Added a TypedDict for AzureAIProject that allows for better intellisense and type checking when passing in project information

Release history

1.17.0

Jun 4, 2026

1.16.9

May 26, 2026

1.16.8

May 20, 2026

1.16.7

May 8, 2026

1.16.6

Apr 28, 2026

1.16.5

Apr 9, 2026

1.16.4

Apr 3, 2026

1.16.3

Apr 1, 2026

1.16.2

Mar 25, 2026

1.16.1

Mar 19, 2026

1.16.0

Mar 11, 2026

1.15.3

Feb 25, 2026

1.15.2

Feb 23, 2026

1.15.1

Feb 20, 2026

1.15.0

Feb 6, 2026

1.14.0

Jan 9, 2026

1.13.7

Nov 15, 2025

1.13.6

Nov 14, 2025

1.13.5

Nov 11, 2025

1.13.4

Nov 10, 2025

1.13.3

Nov 9, 2025

1.13.2

Nov 7, 2025

1.13.1

Nov 6, 2025

1.13.0

Oct 31, 2025

1.12.0

Oct 2, 2025

1.11.2

Oct 10, 2025

1.11.1

Sep 19, 2025

1.11.0

Sep 3, 2025

1.10.0

Jul 31, 2025

1.9.0

Jul 3, 2025

1.8.0

May 29, 2025

1.7.0

May 15, 2025

1.6.0

May 5, 2025

1.5.0

Apr 7, 2025

1.4.0

Apr 1, 2025

1.3.0

Feb 26, 2025

1.2.0

Jan 30, 2025

1.1.0

Dec 13, 2024

1.0.1

Nov 15, 2024

1.0.0

Nov 13, 2024

1.0.0b5 pre-release

Oct 29, 2024

1.0.0b4 pre-release

Oct 16, 2024

This version

1.0.0b3 pre-release

Oct 1, 2024

1.0.0b2 pre-release

Sep 24, 2024

1.0.0b1 pre-release

Sep 20, 2024

0.0.0b0 pre-release

Aug 26, 2024

Download files

Download the file for your platform. If you're not sure which to choose, learn more about installing packages.

Source Distribution

azure_ai_evaluation-1.0.0b3.tar.gz (142.0 kB view details)

Uploaded Oct 1, 2024 Source

Built Distribution

azure_ai_evaluation-1.0.0b3-py3-none-any.whl (144.8 kB view details)

Uploaded Oct 1, 2024 Python 3

File details

Details for the file azure_ai_evaluation-1.0.0b3.tar.gz.

File metadata

Download URL: azure_ai_evaluation-1.0.0b3.tar.gz
Upload date: Oct 1, 2024
Size: 142.0 kB
Tags: Source
Uploaded using Trusted Publishing? No
Uploaded via: RestSharp/106.13.0.0

File hashes

Hashes for azure_ai_evaluation-1.0.0b3.tar.gz
Algorithm	Hash digest
SHA256	`84d15061f37068fbcdacc943d07fa5f3b1dd4ebeaa64265cb6d1ed0fcab21055`
MD5	`66333d52f0af945ab198a74db50f2850`
BLAKE2b-256	`41f1a7ab790d10e9aedd263ee5a840e6de8f12de98d3cb521fb061d831b7f89c`

See more details on using hashes here.

File details

Details for the file azure_ai_evaluation-1.0.0b3-py3-none-any.whl.

File metadata

Download URL: azure_ai_evaluation-1.0.0b3-py3-none-any.whl
Upload date: Oct 1, 2024
Size: 144.8 kB
Tags: Python 3
Uploaded using Trusted Publishing? No
Uploaded via: RestSharp/106.13.0.0

File hashes

Hashes for azure_ai_evaluation-1.0.0b3-py3-none-any.whl
Algorithm	Hash digest
SHA256	`bcc26108e8bf142035e1834c3d78e7200a1a767763d3a9bd2040e1331563862e`
MD5	`5a2d19c6b25f5db6b80a72a0594e035e`
BLAKE2b-256	`fda95b712d4fea341e778384d7466fd052ff11e040c50162c54961bb970f9e5e`

See more details on using hashes here.
