Deploying Large Language Models on Databricks
Course Outline
Course Introduction
Module 6 - LLMOps
Before we begin
. Why LLMs?
. Primer on NLP
(LLMs)
                                 Questions we hear
                                   about LLMs
“Chegg shares drop more than  % after company says ChatGPT is killing its business”
05/02/2023 (Link)
“[...] ask GitHub Copilot to explain a piece of code. Bump into an error? Have GitHub Copilot fix it. It’ll even generate unit tests so you can get back to building what’s next.”
03/22/2023* (Link)
Decision criteria
Exec: “We need to add LLMs”
You: “Where do I start?”
What is NLP?
We use NLP every day
NLP is useful for a variety of domains
Sentiment analysis: product reviews
“This book was terrible and went on and on about…” → Negative
Translation
“I like this book.” → “Me gusta este libro.”
Other use cases
Semantic similarity
 • Literature search.
 • Database querying.
 • Question-Answer matching.
Summarization
 • Clinical decision support.
 • News article sentiments.
 • Legal proceeding summary.
Types of sequence tasks
Translation
I like this book.                 Me gusta este libro.        Sequence to sequence prediction
Speech recognition
...
“The ball hit the table and it broke.” “What’s the best sci-fi book ever?”
Large Language Model: what about these makes them “larger” than other language models?
                                  Source: txt.cohere.com
What is a Language Model?
LMs assign probabilities to word sequences: find the most likely word
Categories:
• Generative: find the most likely next word
• Classification: find the most likely classification/answer
What is a Large Language Model?
Tokenization - Words
The moon, Earth's only natural satellite, has been a subject of fascination and wonder for thousands of years.
Corpus of training data used to build our vocabulary.
Building vocabulary: build index (dictionary of tokens = words)
    a: 0
    The: 1
    is: 2
    what: 3
    I: 4
    and: 5
    …
Tokenization: map tokens to indices
    The → [ ]
    moon, → [ ]
    Earth’s → [ ]
    only → [ ]
    natural → [ ]
    satellite → [ ]
    …
Pros
• Intuitive.
Cons
• Big vocabularies.
• Complications such as handling misspellings and other out-of-vocabulary words.
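A minimal sketch of word-level tokenization, assuming a simple regex splitter and a hypothetical `<unk>` entry for out-of-vocabulary words (neither is specified on the slide). It builds the index from the corpus sentence and then maps tokens to indices.

```python
import re

UNK = "<unk>"  # hypothetical placeholder for out-of-vocabulary words

def split_words(text):
    # Words (apostrophes kept inside words) or single punctuation marks.
    return re.findall(r"[A-Za-z0-9']+|[^\sA-Za-z0-9']", text)

def build_vocab(corpus):
    # Assign each new token the next free index; index 0 is reserved for UNK.
    vocab = {UNK: 0}
    for token in split_words(corpus):
        if token not in vocab:
            vocab[token] = len(vocab)
    return vocab

def encode(text, vocab):
    # Unknown tokens (for example misspellings) fall back to the UNK index.
    return [vocab.get(token, vocab[UNK]) for token in split_words(text)]

corpus = ("The moon, Earth's only natural satellite, has been a subject of "
          "fascination and wonder for thousands of years.")
word_vocab = build_vocab(corpus)
print(encode("The moon has been a wonder", word_vocab))
print(encode("The mooon", word_vocab))  # the misspelling maps to UNK
```

The second call illustrates the "Cons" column: a single misspelled word already falls outside the vocabulary.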
Tokenization - Characters
The moon, Earth's only natural satellite, has been a subject of fascination and wonder for thousands of years.
Corpus of training data used to build our vocabulary.
Build index (alphabet; dictionary of tokens = letters/characters)
    a: 0
    b: 1
    c: 2
    d: 3
    e: 4
    f: 5
    …
Map tokens to indices
    t →
    h →
    e →
    m →
    o →
    o →
    n →
    … → …
This vocab is too small!
Pros
• Small vocabulary.
• No out-of-vocabulary words.
Cons
• Loss of context within words.
• Much longer sequences for a given input.
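As a rough illustration of the trade-off above, the sketch below builds a character vocabulary from the same sentence. The function names are illustrative, not from the course material.

```python
def build_char_vocab(corpus):
    # One index per distinct character, in sorted order.
    return {ch: i for i, ch in enumerate(sorted(set(corpus)))}

def encode_chars(text, vocab):
    # Every character (including spaces and punctuation) becomes one token.
    return [vocab[ch] for ch in text]

sentence = ("The moon, Earth's only natural satellite, has been a subject of "
            "fascination and wonder for thousands of years.")
char_vocab = build_char_vocab(sentence)
ids = encode_chars(sentence, char_vocab)
print(len(char_vocab), "distinct characters,", len(ids), "tokens,",
      len(sentence.split()), "words")
```

The vocabulary stays tiny, but the sequence has one token per character, so it is much longer than the word-level encoding.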
Tokenization - Sub-words
The moon, Earth's only natural satellite, has been a subject of fascination and wonder for thousands of years.
Corpus of training data used to build our vocabulary.
Build index (byte-pair encoding; dictionary of tokens = mix of words and sub-words)
    a: 0
    as: 1
    ask: 2
    be: 3
    ca: 4
    cd: 5
    …
Map tokens to indices
    The →
    moon →
    **, →
    Earth →
    **‘s →
    on →
    ly →
    … → …
This vocab is just right!
Byte Pair Encoding (BPE) is a popular encoding.
• Start with a small vocab of characters.
• Iteratively merge frequent pairs into new tokens in the vocab (such as “b”, “e” → “be”).
Compromise
• “Smart” vocabulary built from characters which co-occur frequently.
• More robust to novel words.
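The merge procedure described above can be sketched as follows. This is a simplified toy version of BPE training over whitespace-split words, assuming a tiny made-up corpus; production tokenizers add details (byte-level handling, end-of-word markers) that are omitted here.

```python
from collections import Counter

def most_frequent_pair(words):
    # Count adjacent symbol pairs, weighted by how often each word occurs.
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return max(pairs, key=pairs.get) if pairs else None

def merge_pair(words, pair):
    # Replace every occurrence of the pair with a single merged symbol.
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def learn_bpe(corpus, num_merges):
    # Start from characters, then merge the most frequent pair repeatedly.
    words = dict(Counter(tuple(w) for w in corpus.split()))
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        words = merge_pair(words, pair)
        merges.append(pair)
    return merges, words

merges, segmented = learn_bpe("be bee been best", num_merges=2)
print(merges)
```

In this toy corpus the first merge is “b”, “e” → “be”, matching the example on the slide.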
Tokenization
Tokenization method: Sentence
  Tokens: ‘The moon, Earth's only natural satellite, has been a subject of fascination and wonder for thousands of years.’
  Token count: 1
  Vocab size: # sentences in doc
Tokenization method: Sub-word
  Tokens: 'The', 'moon', ',', 'Earth', "'", 's', 'on', 'ly', 'n', 'atur', 'al', 's', 'ate', 'll', 'it', 'e',
    ',', 'has', 'been', 'a', 'subject', 'of', 'fascinat', 'ion', 'and', 'w', 'on', 'd', 'er',
    'for', 'th', 'ous', 'and', 's', 'of', 'y', 'ears', '.'
  Token count: 37
  Vocab size: (varies)
Tokenization method: Character
  Tokens: 'T', 'h', 'e', ' ', 'm', 'o', 'o', 'n', ',', ' ', 'E', 'a', 'r', 't', 'h', "'", 's', ' ', 'o', 'n', 'l', 'y', ' ',
    'n', 'a', 't', 'u', 'r', 'a', 'l', ' ', 's', 'a', 't', 'e', 'l', 'l', 'i', 't', 'e', ',', ' ', 'h', 'a', 's', ' ',
    'b', 'e', 'e', 'n', ' ', 'a', ' ', 's', 'u', 'b', 'j', 'e', 'c', 't', ' ', 'o', 'f', ' ', 'f', 'a', 's', 'c', 'i',
    'n', 'a', 't', 'i', 'o', 'n', ' ', 'a', 'n', 'd', ' ', 'w', 'o', 'n', 'd', 'e', 'r', ' ', 'f', 'o', 'r', ' ',
    't', 'h', 'o', 'u', 's', 'a', 'n', 'd', 's', ' ', 'o', 'f', ' ', 'y', 'e', 'a', 'r', 's', '.'
  Token count: 110
  Vocab size: 52 + punctuation (English)
                                                      ¹Source: BBC.com
Word Embeddings: The surprising power of similar context
Represent words with vectors
                                             Source: victorzhou.com
Creating dense vector representation
Sparse vectors lose meaningful notion of similarity
New idea: Let’s give each word a vector representation and use data to build our embedding space.
Typical dimension sizes:    ,    ,
“puppy” (word/token) → Embedding function (pre-trained module, e.g. word2vec model) → [ . , . , . …. . ] (word embedding/vector)
When done well, similar words will be closer in these embedding/vector spaces.
                                   Word Embedding: Basics. Create a vector from a word | by Hariom Gautam | Medium
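One common way to check whether "similar words are closer" is cosine similarity. The sketch below uses made-up 3-dimensional vectors purely for illustration; real embeddings come from a trained model and have far more dimensions.

```python
import math

def cosine_similarity(u, v):
    # Dot product divided by the product of the vector lengths.
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy embeddings, not taken from any real model.
embeddings = {
    "puppy": [0.9, 0.8, 0.1],
    "dog": [0.8, 0.9, 0.2],
    "car": [0.1, 0.2, 0.9],
}

print(cosine_similarity(embeddings["puppy"], embeddings["dog"]))
print(cosine_similarity(embeddings["puppy"], embeddings["car"]))
```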
Dense vector representations
Visualizing common words using word vectors.
                          Word Embedding: Basics. Create a vector from a word | by Hariom Gautam | Medium
Natural Language Processing (NLP)
Let’s review
• Large LMs are just LMs with transformer architectures, but bigger.
with LLMs
Learning Objectives
(CNN) A magnitude 6.7 earthquake rattled Papua New Guinea early Friday afternoon, according to the U.S. Geological Survey. The quake was centered about 200 miles north-northeast of Port Moresby and had a depth of 28 miles. No tsunami warning was issued…
→ <Article summary>
* For Spark NLP, this is missing counts from Conda & Maven downloads.
Hugging Face: The GitHub of Large Language Models
Hugging Face
Under the hood, these libraries can use PyTorch, TensorFlow, and JAX.
                                 Source: stackoverflow.com
Hugging Face Pipelines: Overview
LLM Pipeline
Input: “(CNN) A magnitude 6.7 earthquake rattled…” → Output: <Article summary>
from transformers import pipeline

summarizer = pipeline("summarization")
summarizer("A magnitude 6.7 earthquake rattled ...")
Hugging Face Pipelines: Inside
(Optional) Prompt construction → Tokenizer (encoding) → Model (LLM) → Tokenizer (decoding)
Input text: Summarize: “A magnitude 6.7 earthquake rattled…”
Encoded input: [ , , , …]
Encoded output: [ , , , …]
Output: <Article summary>
Tokenizers
Model
    inputs.input_ids,
    attention_mask=inputs.attention_mask,  # Mask handles variable-length inputs
    num_beams=10,                          # Models search for best output
    min_length=5,                          # Adjust output lengths to match task
    max_length=40)
Encoded output: [ , , , …]
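To make "mask handles variable-length inputs" concrete, here is a small pure-Python sketch of padding a batch and building the matching attention mask. The padding id of 0 is an assumption for illustration; real tokenizers define their own.

```python
PAD_ID = 0  # hypothetical padding token id

def pad_batch(sequences, pad_id=PAD_ID):
    # Pad every sequence to the longest length in the batch.
    max_len = max(len(seq) for seq in sequences)
    input_ids = [seq + [pad_id] * (max_len - len(seq)) for seq in sequences]
    # 1 marks a real token, 0 marks padding the model should ignore.
    attention_mask = [[1] * len(seq) + [0] * (max_len - len(seq))
                      for seq in sequences]
    return input_ids, attention_mask

input_ids, attention_mask = pad_batch([[5, 8, 2], [7, 3]])
print(input_ids)
print(attention_mask)
```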
Datasets
Datasets library
• -line APIs for loading and sharing datasets
• NLP, Audio, and Computer Vision tasks
    (CNN)
    A magnitude 6.7 earthquake rattled Papua New Guinea early
    Friday afternoon, according to the U.S. Geological Survey.                 <Article
    The quake was centered about 200 miles north-northeast                     summary>
    of Port Moresby and had a depth of 28 miles. No tsunami
    warning was issued…
NLP task behind this app: Summarization
• Extractive: Select representative pieces of text.
• Abstractive: Generate new text.
Find a model for this task:
• Hugging Face Hub →      ,      models.
• Filter by task →     models.
• Then…? Consider your needs.
Selecting a model: filtering and sorting
We’ll focus on these examples in this module.
•   Summarization
•   Sentiment analysis
•   Translation
•   Zero-shot classification
•   Few-shot learning
Some “tasks” are very general and overlap with other tasks.
•   Conversation / chat
•   (Table) Question-answering
•   Text / token classification
•   Text generation
Task: Sentiment analysis
I need to monitor the stock market, and I want to use Twitter commentary as an early indicator of trends.
for real…"
"<company> stock price target cut to $ vs. $ at BofA Merrill Lynch" → Negative
sentiment_classifier(tweets)
Out:[{'label': 'positive', 'score': 0.997},
     {'label': 'negative', 'score': 0.996},
     …]
en_to_es_translator = pipeline(
   task="text2text-generation", # task of variable length
   model="Helsinki-NLP/opus-mt-en-es") # translates English to Spanish
# General models may support multiple languages and require prompts / instructions.
t5_translator("translate English to Romanian: Existing, open-source models...")
Article: “The full cost of damage in Newton Stewart, one of the areas worst affected, is still being…” → Breaking news
predicted_label = zero_shot_pipeline(
    sequences=article,
    candidate_labels=["politics", "breaking news", "sports"])
                                             *Source: huggingface.co
Prompts get complicated
Few-shot learning
Annotated prompt parts: Instruction; Query to answer
pipeline(
"""For each tweet, describe its sentiment:
###
[Tweet]: "This is the link to the article"
[Sentiment]: Neutral
###
As an example, for the schema {"properties": {"foo": {"title": "Foo", "description": "a list of strings", "type": "array",
"items": {"type": "string"}}}, "required": ["foo"]}} the object {"foo": ["bar", "baz"]} is a well-formatted instance of
the schema. The object {"properties": {"foo": ["bar", "baz"]}} is not well-formatted.
                           Main instruction
Tell me a joke.""")
Prompt Engineering
As an example, for the schema {"properties": {"foo": {"title": "Foo", "description": "a list of strings", "type": "array",
"items": {"type": "string"}}}, "required": ["foo"]}} the object {"foo": ["bar", "baz"]} is a well-formatted instance of
the schema. The object {"properties": {"foo": ["bar", "baz"]}} is not well-formatted.
                                   Output format
Here is the output schema:
```
{"properties": {"setup": {"title": "Setup", "description": "question to set up a joke", "type": "string"}, "punchline":
{"title": "Punchline", "description": "answer to resolve the joke", "type": "string"}}, "required": ["setup","punchline"]}
```
                Input / Question
Tell me a joke.""")
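When a prompt asks for output in a schema like the one above, the response still needs checking before use. The sketch below is a hand-rolled check of the two required string fields using only the standard library; it is not a full JSON Schema validator, and the function name is illustrative.

```python
import json

REQUIRED_STRING_FIELDS = ("setup", "punchline")

def parse_joke(raw_output):
    """Return the parsed joke dict, or raise ValueError if it is malformed."""
    try:
        data = json.loads(raw_output)
    except json.JSONDecodeError as exc:
        raise ValueError(f"not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    for field in REQUIRED_STRING_FIELDS:
        # Each required field must be present and be a string.
        if not isinstance(data.get(field), str):
            raise ValueError(f"missing or non-string field: {field}")
    return data

good = parse_joke('{"setup": "Why did the token cross the road?", "punchline": "To get embedded."}')
print(good["punchline"])
```

A caller can catch `ValueError` and retry the model call or fall back to a default.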
How to help the model to reach a better answer?
                                 Jailbreaking:
                                 Bypass moderation rule
Prompt leaking:
Extract sensitive information
• Post-processing/filtering
   •   Use another model to clean the output
   •   "Before returning the output, remove all offensive words, including f***, s***
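Besides using another model, a simple rule-based filter is one possible post-processing step. The sketch below masks blocklisted words with a regular expression; the blocklist content and function name are placeholders, and a keyword filter like this is only a partial safeguard.

```python
import re

def redact(text, blocklist):
    # Mask each blocklisted word, keeping only its first letter.
    if not blocklist:
        return text
    pattern = re.compile(
        r"\b(" + "|".join(re.escape(word) for word in blocklist) + r")\b",
        re.IGNORECASE,
    )
    return pattern.sub(
        lambda m: m.group(0)[0] + "*" * (len(m.group(0)) - 1), text
    )

print(redact("That was a darn mess", ["darn"]))
```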
Databases,
and Search
Learning Objectives
• Learn best practices for when to use vector stores and how to improve
  search-retrieval performance
How do language models learn knowledge?
                                            Source: OpenAI
Refresher: We represent words with vectors
                Word Embedding: Basics. Create a vector from a word | by Hariom Gautam | Medium
Turn images and audio into vectors too
 Data objects     Vectors                 Tasks
                                      • Object recognition
                [ . , . , - . , ….]   • Scene detection
                                      • Product search
                                      • Translation
                [ . , . , - . , ….]   • Question Answering
                                      • Semantic search
                                      • Speech to text
                [ . , . , - . , ….]   • Music transcription
                                      • Machinery malfunction
Use cases of vector databases
• Similarity search: text, images, audio
    • De-duplication
    • Semantic match, rather than keyword match!
      Example queries: “Are electric cars better for the environment?” and “electric cars climate impact”
                                                                      Source: Spotify
Search and Retrieval-Augmented Generation
The RAG workflow
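A toy sketch of the retrieve-then-prompt part of a RAG workflow. It assumes a bag-of-words stand-in for the embedding model and an in-memory list instead of a vector store; the final LLM call is left out, and all names and the prompt wording are illustrative.

```python
import math
from collections import Counter

def embed(text):
    # Toy stand-in for an embedding model: bag-of-words counts.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def retrieve(question, documents, k=1):
    # Rank documents by similarity to the question and keep the top k.
    q = embed(question)
    ranked = sorted(documents, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]

def build_prompt(question, documents, k=1):
    # Augment the question with the retrieved context before calling an LLM.
    context = "\n".join(retrieve(question, documents, k))
    return (f"Answer using only this context:\n{context}\n\n"
            f"Question: {question}\nAnswer:")

docs = ["electric cars reduce tailpipe emissions", "bananas contain potassium"]
prompt = build_prompt("are electric cars better for the environment", docs)
print(prompt)
```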
How Does Vector Search Work?
Vector search strategies
Distance metrics: the higher the metric, the less similar.
Similarity metrics: the higher the metric, the more similar.
                                                    Source: buildin.com
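The two directions can be seen side by side with Euclidean (L2) distance and cosine similarity. A minimal sketch with made-up 2-dimensional vectors:

```python
import math

def l2_distance(u, v):
    # Distance metric: larger means less similar.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine_similarity(u, v):
    # Similarity metric: larger means more similar.
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u))
                  * math.sqrt(sum(b * b for b in v)))

query, near, far = [1.0, 0.0], [0.9, 0.1], [0.0, 1.0]
print(l2_distance(query, near), l2_distance(query, far))
print(cosine_similarity(query, near), cosine_similarity(query, far))
```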
Compressing vectors with Product Quantization
PQ stores vectors with fewer bytes
                                     Source: Pinecone
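A rough sketch of the encoding side of product quantization: split a vector into sub-vectors and store, for each one, only the index of its nearest centroid. The codebooks here are hand-picked for illustration; in practice they would be learned (typically with k-means) from the data.

```python
def nearest(codebook, sub):
    # Index of the centroid closest (squared L2) to the sub-vector.
    return min(
        range(len(codebook)),
        key=lambda i: sum((c - s) ** 2 for c, s in zip(codebook[i], sub)),
    )

def pq_encode(vector, codebooks):
    # One small integer code per sub-vector instead of the raw floats.
    size = len(vector) // len(codebooks)
    return [nearest(codebooks[j], vector[j * size:(j + 1) * size])
            for j in range(len(codebooks))]

def pq_decode(codes, codebooks):
    # Approximate reconstruction: concatenate the chosen centroids.
    out = []
    for j, code in enumerate(codes):
        out.extend(codebooks[j][code])
    return out

# Hypothetical codebooks: 2 sub-spaces, 2 centroids each.
codebooks = [
    [[0.0, 0.0], [1.0, 1.0]],
    [[0.0, 1.0], [1.0, 0.0]],
]
codes = pq_encode([0.9, 1.1, 0.1, 0.8], codebooks)
print(codes, pq_decode(codes, codebooks))
```

The four floats are stored as two small codes, at the cost of an approximate reconstruction.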
HNSW: Hierarchical Navigable Small Worlds
Builds proximity graphs based on Euclidean (L2) distance
                                                           Source: Pinecone
Ability to search for similar
          objects is
 •   Post-query
 •   In-query
 •   Pre-query
• # of results is highly
  unpredictable
• Branding as a scalar
• Not as performant as
  post- or in-query filtering
Vector Stores
• Specialized, full-fledged
  databases for unstructured data
    • Inherit database properties, i.e.
      Create-Read-Update-Delete (CRUD)
Pros Cons
Open-Sourced
Qdrant No HNSW
Redis No HNSW
Weaviate No HNSW
Not Open-Sourced
• Splitting doc into smaller docs = doc can produce N vectors of M tokens
Chunking strategy is use-case specific
Another iterative step! Experiment with different chunk sizes and approaches
Existing resources:
• Text Splitters by LangChain
• Blog post on semantic search by Vespa - light mention of chunking
• Chunking Strategies by Pinecone
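One simple chunking approach to experiment with is fixed-size chunks with optional overlap. The sketch below works on a list of tokens; the function name and parameters are illustrative and do not come from any of the resources listed above.

```python
def chunk_tokens(tokens, chunk_size, overlap=0):
    # Split a token list into chunks of at most chunk_size tokens,
    # where consecutive chunks share `overlap` tokens.
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last chunk already reaches the end of the document
    return chunks

print(chunk_tokens(list(range(10)), chunk_size=4, overlap=1))
```

Varying `chunk_size` and `overlap` is one way to run the experiments the slide recommends.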
Preventing silent failures and undesired
performance
• For users: include explicit instructions in prompts
   • "Tell me the top 3 hikes in California. If you do not know the answer, do not
     make it up. Say 'I don’t have information for that.'"
   • Helpful when upstream embedding model selection is incorrect
Reasoning
Learning Objectives
• Apply LangChain to leverage multiple LLM providers such as OpenAI and Hugging Face.
• Create complex logic flow with agents in LangChain to pass prompts and use logical
  reasoning to complete tasks.
LLM Limitations
LLMs are great at single tasks… but we want more!
LLM Tasks vs. LLM-based Workflows
LLMs can complete a huge array of challenging tasks.
Prompt → Response
• Summarization
• Sentiment analysis
• Translation
• Zero-shot classification
• Few-shot learning
• Conversation / chat
• Question-answering
• Table question-answering
• Token classification
• Text classification
• Text generation
Workflow: Applications with more than a single interaction
Workflow Initiated → multiple Tasks → Workflow Completed
End-to-end workflow
Summarize and Sentiment
Example multi-LLM problem: get the sentiment of many articles on a topic
Article : “...”, Article : “...”, Article : “...”, … → Summary LLM → Summary + Summary + “...” → Sentiment LLM → Overall Sentiment
Goal: Create a reusable workflow for multiple articles.
For this we’ll focus on the first task first.
How do we make this process systematic?
Prompt Engineering:
Article : “...”, Article : “...”, Article : “...”, … → Summary LLM (DONE) → Summary + Summary + “...” → Sentiment LLM → Overall Sentiment
Now we need the output from our new engineered prompts to be the input to the sentiment analysis LLM.
For this we’re going to chain together these LLMs.
LLM Chains: Linking multiple LLM interactions to build complexity and functionality
LLM Extension Libraries
• Released in late
• Useful for multi-stage reasoning,
  LLM-based workflows
# We will also need another prompt template like before, a new sentiment prompt
sentiment_prompt_template = """
Evaluate the sentiment of the following summary: {summary}
Sentiment: """
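The chaining idea can be sketched without any library: the summary step's output fills the `{summary}` slot of the sentiment prompt. The summary template and the two stub "LLM" functions below are hypothetical stand-ins so the flow runs on its own; in practice each would be a real model call.

```python
# Hypothetical summary template; the original one is not shown in this section.
summary_prompt_template = """
Summarize the following article: {article}
Summary: """

sentiment_prompt_template = """
Evaluate the sentiment of the following summary: {summary}
Sentiment: """

def run_chain(article, summary_llm, sentiment_llm):
    # Step 1: summarize. Step 2: feed the summary into the sentiment prompt.
    summary = summary_llm(summary_prompt_template.format(article=article))
    sentiment = sentiment_llm(sentiment_prompt_template.format(summary=summary))
    return {"summary": summary, "sentiment": sentiment}

# Stub callables standing in for real LLMs.
def fake_summary_llm(prompt):
    return "Stocks rose sharply."

def fake_sentiment_llm(prompt):
    return "positive" if "rose" in prompt else "negative"

result = run_chain("Markets rallied today...", fake_summary_llm, fake_sentiment_llm)
print(result)
```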
Workflow Chain
                              Source: python.langchain.com
Going ever further
What if we want to use our LLM results to do more?
• ……
Building reasoning loops
Agents are LLM-based systems that execute the ReasonAction
Simplified code from the LangChain Agent:
    """
    intermediate_steps: Steps the LLM has taken to date, along with observations
    """
    output = self.llm_chain.run(intermediate_steps=intermediate_steps)
    return self.output_parser.parse(output)

def take_next_step():
    """Take a single step in the thought-action-observation loop."""
    # Call the LLM to see what to do.

A set of tools that the LLM will select and execute to perform steps to achieve the task.
# Tool names are written as shown on the slide; they are illustrative, not actual LangChain tool identifiers.
tools = load_tools(["Google Search", "Python Interpreter"])
agent = initialize_agent(tools, llm)
agent.run("In what year was Isaac Newton born? What is that year raised to the power of 0.3141?")
LLM Plugins are coming
LangChain was first to show LLMs+tools. But companies are catching up!
Source: csdn.net
Source: Twitter.com
                                                 Source: arstechnica.com
OpenAI and ChatGPT Plugins
OpenAI acknowledged the open-source community moving in similar directions
LangChain
HF transformers Agents
HuggingGPT/Jarvis BabyAGI
Evaluating LLMs
Learning Objectives
• Be familiar with common tools for training and fine-tuning, such as those from Hugging
  Face and DeepSpeed.
What we have: News API, “Some” premade examples
What we could do: Open-source instruction-following LLM; Paid LLM-as-a-Service; Build your own…
What we want: <Article summary riddle>
Fine-Tuning: Few-shot learning
Potential LLM Pipelines
What we have: News API, “Some” premade examples
What we want: <Article summary riddle>
Pros and cons of Few-shot Learning
Pros                                              Cons
LLM needs to reframe the output as a riddle.
• Large version of base LLM
• Long input sequence
[Article 1]: "Residents were awoken to the surprise…"
[Summary Riddle 1]: "In houses they stay, the peop… "
###
[Article 2]: "Gas prices reached an all time …"
[Summary Riddle 2]: "Far you will drive, to find…"
###
…
###
[Article n]: {article}
[Summary Riddle n]:""")
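A few-shot prompt in this format can be assembled programmatically from the premade examples. The helper below is a sketch with an illustrative name; it follows the `###` separator and bracketed labels shown above.

```python
def build_few_shot_prompt(examples, article):
    # examples: list of (article_text, riddle_text) pairs.
    parts = []
    for i, (example_article, riddle) in enumerate(examples, start=1):
        parts.append(
            f'[Article {i}]: "{example_article}"\n'
            f'[Summary Riddle {i}]: "{riddle}"'
        )
    # The final block leaves the riddle empty for the model to complete.
    n = len(examples) + 1
    parts.append(f'[Article {n}]: "{article}"\n[Summary Riddle {n}]:')
    return "\n###\n".join(parts)

examples = [
    ("Residents were awoken to the surprise…", "In houses they stay, the peop…"),
    ("Gas prices reached an all time …", "Far you will drive, to find…"),
]
few_shot_prompt = build_few_shot_prompt(examples, "A new article to summarize")
print(few_shot_prompt)
```

Each added example lengthens the input sequence, which is the cost noted on the slide.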
Fine-Tuning: Instruction-following LLMs
Potential LLM Pipelines
What we have: News API, “Some” premade examples
What we could do: Instruction-following LLM
What we want: <Article summary riddle>
Pros and cons of Instruction-following LLMs
Pros                                               Cons
LLMs-as-a-Service
Potential LLM Pipelines
What we have: News API, “Some” premade examples
What we could do: Paid LLM-as-a-Service
What we want: <Article summary riddle>
Pros and cons of LLM-as-a-Service
Pros                                                 Cons
back.
LLM_API(prompt(article),api_key="sk-@sjr…")
Fine-tuning: DIY
Potential LLM Pipelines
What we have: News API, “Some” premade examples
What we could do: Build your own…
What we want: <Article summary riddle>
Potential LLM Pipelines
What we have: News API, “Some” premade examples
What we could do: Build your own… (Create full model from scratch, or Fine-tune an existing model)
What we want: <Article summary riddle>
Pythia 12B: Layers: 36, Dimensions: 5120, Heads: 40, Seq. Len: 2048
databricks-dolly-15k
The Pile:    GB Dataset of Diverse Text for Language Modeling
The foundation model era: racing to trillion parameter transformer models
"I think we're at the end of the era ..[of these]... giant, giant models"
          and beyond
The Age of small LLMs and Applications
Dolly Demo
So you’ve decided to fine-tune…
Did it work? How can you measure LLM performance?
 EVALUATION TIME!
Evaluating LLMs
But for a good LLM what does the loss tell us?
A good language model will have high accuracy and low perplexity
Language model probability distribution
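To make "low perplexity" concrete, here is a minimal sketch. It assumes we already have the probability the model assigned to each correct next token; the probability values below are made up for illustration.

```python
import math
from typing import Sequence


def perplexity(token_probs: Sequence[float]) -> float:
    """exp of the average negative log-probability of the correct tokens."""
    if not token_probs:
        raise ValueError("need at least one token probability")
    avg_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(avg_nll)


# Hypothetical per-token probabilities from two models on the same text.
confident = perplexity([0.9, 0.8, 0.95])   # model puts high mass on the right tokens
uncertain = perplexity([0.2, 0.1, 0.25])   # model spreads mass elsewhere
```

A model that always assigns probability 0.5 to the correct token has perplexity 2, as if it were choosing between two equally likely options at each step.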
Evaluations
BLEU for translation
bi-grams
Reference Life is what happens when you're busy making other plans.
N-gram recall = Total matching N-grams / Total N-grams
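The n-gram recall ratio can be sketched on the reference sentence above using bi-grams. This is a simplified illustration under stated assumptions: whitespace tokenization, lowercasing, and a made-up candidate sentence. It computes only the recall ratio; full BLEU is precision-based and adds a brevity penalty, which this sketch does not implement.

```python
from collections import Counter
from typing import List


def ngrams(tokens: List[str], n: int) -> Counter:
    """Count every contiguous n-gram in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def ngram_recall(reference: str, candidate: str, n: int = 2) -> float:
    """Matching n-grams divided by total n-grams in the reference."""
    ref = ngrams(reference.lower().split(), n)
    cand = ngrams(candidate.lower().split(), n)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    # Clip each match by how often the n-gram occurs in the candidate.
    matching = sum(min(count, cand[gram]) for gram, count in ref.items())
    return matching / total


reference = "Life is what happens when you're busy making other plans."
candidate = "Life is what happens when you are busy making plans."  # hypothetical output
score = ngram_recall(reference, candidate)  # 5 of the 9 reference bi-grams match
```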
 1. Target application
    a. NLP tasks: Q&A, reading comprehension, and summarization
    b. Queries chosen to match the API distribution
    c. Metric: human preference ratings
 2. Alignment
    a. “Helpful” → Follow instructions, and infer user intent. Main metric: human preference ratings
    b. “Honest” → Metrics: human grading on “hallucinations” and TruthfulQA benchmark dataset
    c. “Harmless” → Metrics: human and automated grading for toxicity (RealToxicityPrompts); automated grading for bias (Winogender, CrowS-Pairs)
       i.   Note: Human labelers were given very specific definitions of “harmful” (violent content, etc.)
                                       Reference: https://arxiv.org/abs/   .
Module Summary
Fine-tuning and Evaluating LLMs - What have we learned?
LLMs
The models developed or used in this course are for demonstration and
learning purposes only. Models may occasionally output offensive,
inaccurate, biased information, or harmful instructions.
Learning Objectives
• Examine datasets used to train LLMs and assess their inherent bias
Source: Brynjolfsson et al
Limitations
There are many risks and limitations
Many without good (or easy) mitigation strategies
 •   Big data != good data
 •   Discrimination, exclusion, toxicity
 •   Information hazard
 •   Misinformation harms
 •   Malicious uses
 •   Human-computer interaction harm
 •   Automation of human jobs
 •   Environmental harms and costs
Automation undermines creative economy
Automation displaces jobs and increases inequality
                                 Source: Weidinger et al
Incurs environmental and financial cost
• LLaMa:             B parameters = $ M
     •       days of training
     •     ,    A      GPUs
Image source: giphy.com
                               Source: Bender et al
Big training data != good data
We don’t audit the data
Source: Allen AI
                        Source: Brown et al
(Mis)information hazard
Compromise privacy, spread false information, lead to unethical behaviors
                                             Source: Weidinger et al
Malicious uses
Easy to facilitate fraud, censorship, surveillance, cyber attacks
Source: Weidinger et al
Image source: giphy.com
Intrinsic: Output contradicts the source
Source: The first Ebola vaccine was approved by the FDA in         , five years after the initial outbreak in     .
Extrinsic: Cannot verify output from the source, but it might not be wrong
Source: Alice won first prize in fencing last week.
Output:
                                       Source: Ji et al
Data leads to hallucination
                                      Source: Ji et al
Model leads to hallucination
                             Source: Ji et al
Evaluating hallucination is tricky and imperfect
Lots of subjective nuances: toxic? misinformation?
                                            Source: Ji et al
Mitigation
Mitigate hallucination from data and model
   •   EU AI Act
   •   Japan AI regulation approach
   •   Biden-Harris Responsible AI Actions
Figure 2: Outputs from audits on one level become inputs for audits on other levels (labels: Model limitations, Model characteristics, Environmental data)
Source: Mokander et al
Who should audit LLMs?
“Any auditing is only as good as the institution delivering it”
                                   Source: Mokander et al
Module Summary
Society and LLMs - What have we learned?
Goals of MLOps
• Maintain stable performance
    • Meet KPIs
    • Update models and systems as needed
    • Reduce risk of system failures
Google Search popularity of “MLOps”
Different production tooling: big models, vector databases, etc.
Adapting MLOps for LLMs
If model training or tuning is needed, managing cost and performance can be challenging.
Larger cost, latency, and performance tradeoffs for model serving, especially with 3rd-party LLM APIs.
Adapting MLOps for LLMs
Some things change, but even more remain similar.
LLMOps details
“Plan for key concerns which you may
encounter with operating LLMs”
Key concerns
• Prompt engineering
• Packaging models or pipelines for deployment
• Scaling out
• Managing cost/performance tradeoffs
• Human feedback, testing, and monitoring
• Deploying models vs. deploying code
• Service infrastructure: vector databases and complex models
Prompt engineering
Model API
(New) fine-tuned model
LangChain chain: Vector DB lookup, Prompt template, Hugging Face pipeline
Packaging models or pipelines for deployment
Standardizing deployment for many types of models and pipelines
Model API: mlflow.openai.log_model(model="gpt-3.5-turbo", task=openai.ChatCompletion, …)
(New) fine-tuned model: mlflow.pytorch.log_model(pytorch_model=my_finetuned_model, …)
LangChain chain (Vector DB lookup, Prompt template, Hugging Face pipeline): mlflow.langchain.log_model(lc_model=llm_chain, …)
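The idea behind standardizing deployment can be sketched without any particular library: every model or pipeline is wrapped behind the same predict interface, so the serving code does not care what sits underneath. This is a conceptual sketch only. PackagedModel, the registry, and the three stand-in functions are hypothetical and are not MLflow's API.

```python
from typing import Callable, Dict, List


class PackagedModel:
    """Hypothetical wrapper exposing any model as predict(list[str]) -> list[str]."""

    def __init__(self, name: str, predict_fn: Callable[[str], str]):
        self.name = name
        self._predict_fn = predict_fn

    def predict(self, inputs: List[str]) -> List[str]:
        return [self._predict_fn(text) for text in inputs]


# Stand-ins for the three kinds of artifacts on the slide (assumptions).
def api_model(text: str) -> str:
    return f"api:{text}"                      # would call a hosted model API


def finetuned_model(text: str) -> str:
    return f"tuned:{text}"                    # would run a fine-tuned model


def chain(text: str) -> str:
    # Toy "vector DB lookup" followed by a prompt template.
    context = {"bikes": "Bike lanes were approved."}.get(text, "")
    filled = f"Context: {context}\nQuestion: {text}"
    return f"chain:{filled}"                  # would pass the prompt to a pipeline


registry: Dict[str, PackagedModel] = {
    m.name: m
    for m in (
        PackagedModel("api", api_model),
        PackagedModel("finetuned", finetuned_model),
        PackagedModel("chain", chain),
    )
}


def serve(model_name: str, inputs: List[str]) -> List[str]:
    """Serving code sees one interface regardless of what is packaged."""
    return registry[model_name].predict(inputs)
```

For example, serve("api", ["hi"]) and serve("chain", ["bikes"]) go through the same call path even though one is a single model and the other is a multi-step pipeline.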
An open source platform for the machine learning lifecycle
Deployment Options: In-Line Code, Cloud Inference Services
Metrics to optimize
•   Cost of queries and training
•   Time for development
•   ROI of the LLM-powered product
•   Accuracy/metrics of model
•   Query latency
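The cost and latency metrics above can be compared with simple arithmetic. The sketch below assumes per-token pricing and a fixed typical latency per option; the option names, prices, and latencies are made-up placeholders, not real vendor figures.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ServingOption:
    name: str
    usd_per_1k_tokens: float   # hypothetical price
    latency_s: float           # hypothetical typical query latency


def cost_per_query(option: ServingOption, prompt_tokens: int, output_tokens: int) -> float:
    """Cost of one query under simple per-token pricing."""
    return (prompt_tokens + output_tokens) / 1000 * option.usd_per_1k_tokens


def monthly_cost(option: ServingOption, queries_per_day: int,
                 prompt_tokens: int, output_tokens: int, days: int = 30) -> float:
    return cost_per_query(option, prompt_tokens, output_tokens) * queries_per_day * days


def cheapest_within_latency(options: List[ServingOption], max_latency_s: float,
                            prompt_tokens: int, output_tokens: int) -> Optional[ServingOption]:
    """Pick the cheapest option that still meets the latency budget."""
    eligible = [o for o in options if o.latency_s <= max_latency_s]
    if not eligible:
        return None
    return min(eligible, key=lambda o: cost_per_query(o, prompt_tokens, output_tokens))


options = [
    ServingOption("hosted-api", usd_per_1k_tokens=0.002, latency_s=1.5),
    ServingOption("self-hosted-small", usd_per_1k_tokens=0.0005, latency_s=4.0),
]
fast_choice = cheapest_within_latency(options, 2.0, prompt_tokens=800, output_tokens=200)
relaxed_choice = cheapest_within_latency(options, 5.0, prompt_tokens=800, output_tokens=200)
```

With these placeholder numbers, a tight latency budget forces the more expensive option, while a relaxed budget lets the cheaper one win, which is the tradeoff the metrics list points at.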
Deploy models
Next Steps
THANK YOU!