
2025/4/7 09:38 Using LLM to transcribe restaurant menu photos - DoorDash

Using LLM to transcribe restaurant menu photos

March 19, 2025

Zhe Mai

Zheng Hu

Ying Yang

https://careersatdoordash.com/blog/doordash-llm-transcribe-menu/ 1/15


A restaurant's menu is one of its most important representations on a delivery platform. To ensure accuracy and alignment with their latest offerings, DoorDash's restaurant partners must actively maintain their menus. This can be challenging, however, for business owners who already are managing demanding daily operations. As a delivery company committed to their success, DoorDash sees a valuable opportunity to integrate AI into this traditionally human-managed process, streamlining efficient updates through submitted menu photos.

Previously, we relied on humans to transcribe and update restaurant menus manually, which is costly and time-consuming. The rapid improvement of large language models, or LLMs, creates an opportunity for a step change, allowing AI to transcribe information from menu photos. However, the diverse menu structures restaurants use make it challenging for an LLM to do an accurate job at scale. In this post, we discuss how we built a system with a guardrail layer for LLMs that leverages traditional machine learning (ML) techniques. The guardrail layer serves as an effective control mechanism that enables LLM applications to run at scale with high accuracy. It lets AI practitioners swiftly adopt newly released LLMs while mitigating risks that could affect final product quality. At the same time, the use of traditional ML in this system offers advantages in both latency and cost efficiency.

Rapid start with prototyping

LLMs have greatly accelerated how quickly we can develop an initial minimum viable product,

completely changing the way we discover possibilities. Figure 1 shows an example of what we could

put together quickly for initial evaluation. The process first uses optical character recognition, or

OCR, to extract text from a menu image, which is then passed over to an LLM for item-level

information extraction and summarization, creating a structured data format.


Figure 1: OCR extracts text from a menu photo that an LLM then can summarize into a structured data format.
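The two-stage prototype described above can be sketched roughly as follows. This is a minimal illustration, not the production pipeline: the `MenuItem` fields, the prompt wording, and the `run_ocr` and `call_llm` callables are all assumptions, and the stand-in functions at the bottom replace real OCR and LLM services so that the example runs on its own.

```python
import json
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class MenuItem:
    # Hypothetical schema; the real structured format is not specified in this post.
    name: str
    category: Optional[str] = None
    price: Optional[float] = None
    description: Optional[str] = None


# Hypothetical prompt wording, for illustration only.
PROMPT_TEMPLATE = (
    "Extract every menu item from the OCR text below. Return a JSON list of "
    "objects with keys: name, category, price, description.\n\n"
    "OCR text:\n{ocr_text}"
)


def transcribe_menu(
    image_bytes: bytes,
    run_ocr: Callable[[bytes], str],
    call_llm: Callable[[str], str],
) -> List[MenuItem]:
    """Run OCR on the photo, then ask an LLM to structure the raw text."""
    ocr_text = run_ocr(image_bytes)
    raw = call_llm(PROMPT_TEMPLATE.format(ocr_text=ocr_text))
    records = json.loads(raw)  # raises ValueError if the LLM returns invalid JSON
    return [
        MenuItem(
            name=r["name"],
            category=r.get("category"),
            price=r.get("price"),
            description=r.get("description"),
        )
        for r in records
    ]


# Stand-ins so the sketch runs without any external service.
def fake_ocr(image_bytes: bytes) -> str:
    return "APPETIZERS\nSpring Rolls 5.95\nMAINS\nPad Thai 12.50"


def fake_llm(prompt: str) -> str:
    return json.dumps(
        [
            {"name": "Spring Rolls", "category": "Appetizers", "price": 5.95},
            {"name": "Pad Thai", "category": "Mains", "price": 12.5},
        ]
    )


items = transcribe_menu(b"", fake_ocr, fake_llm)
```

Passing the OCR and LLM steps in as callables keeps either stage swappable, which matters later when newer transcription models are evaluated.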

LLM key challenges and pain points


An LLM's text understanding provides excellent summarization and organization. However, our use cases require very high transcription accuracy, which is difficult for an LLM to achieve because of its unfamiliarity with the variety of menu structures and its limited ability to follow instructions in complicated scenarios. Through human evaluation of a large number of menu photos, we found that a notable share of menus were transcribed with errors, such as incorrect item names or categories. After a thorough investigation, we found that the LLM made transcription errors primarily when it encountered three types of suboptimal menu photos, as shown in Figure 2:

Inconsistent menu structure, leading to confusing OCR raw texts

Incomplete menus, causing difficulty in the correct linkage between items and their attributes

Low photographic quality, such as too dark, too many flares, or too many irrelevant items in the foreground or background


Figure 2: Example transcriptions of menu photos resulting in lower accuracy.

To enhance accuracy, we made an intensive effort to close the LLM's performance gap. However, given our high accuracy standards, improving the LLMs themselves would still require a tremendous amount of time and investment, postponing the realization of their value. As a result, we've developed more innovative approaches to move AI automation to production. The key to ensuring accuracy is to build a system that pairs the LLM with a suitable automatic guardrail process, instead of treating the LLM as a standalone product. This system allows us not only to optimize for high accuracy but also to pursue lower cost and latency.

Introducing an LLM guardrail

Our guardrail framework is based on a machine learning (ML) model that identifies whether an LLM

transcription can achieve high accuracy. Simultaneously, the framework must be flexible enough to

adapt to rapid developments in AI models. The following outlines our journey toward achieving

these goals.

Generating guardrail model training features

To understand transcription quality, the guardrail model must learn how each menu photo interacts with both OCR and LLM summarization. As with building any other machine learning model, it is key to identify and process the right set of features. We focus in particular on generating features that can explain the interactions between a menu photo, its OCR output, and the LLM summarization because:

An inconsistent menu structure leads to an illogical order in the OCR output's raw text. For example, the OCR might not be able to read the menu by category or in any particular order. We have observed arbitrary ordering of text recognition that makes it more difficult for an LLM to link the right item attributes together.

Incomplete menus may yield attributes from items that are only partially visible, resulting in extraneous or mismatched attributes that confuse the LLM about the correct item-to-attribute linkage.

Because photo quality can be subpar in many different ways, both the OCR and the LLM face challenges, including minuscule, unreadable fonts and cluttered foregrounds and backgrounds that obscure text.

It soon became clear that we could not rely solely on menu photos for machine learning. Instead,

we decided to use three types of features/inputs for the model, as shown in Table 1:

Table 1: Guardrail model features and inputs
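The sketch below illustrates the kind of tabular feature that could capture interactions between the OCR output and the LLM summarization. The specific features, the `OcrToken` fields, and all function names are hypothetical examples chosen for illustration; they are not the feature set from Table 1.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class OcrToken:
    # Hypothetical OCR token record; real OCR engines expose similar fields.
    text: str
    x: float           # left edge of the bounding box
    y: float           # top edge of the bounding box (larger = lower on the page)
    confidence: float  # OCR confidence in [0, 1]


def reading_order_inversions(tokens: List[OcrToken]) -> float:
    """Fraction of adjacent token pairs where the OCR order jumps back up the page."""
    if len(tokens) < 2:
        return 0.0
    jumps = sum(1 for a, b in zip(tokens, tokens[1:]) if b.y < a.y)
    return jumps / (len(tokens) - 1)


def item_name_coverage(tokens: List[OcrToken], item_names: List[str]) -> float:
    """Share of LLM item names whose words all appear somewhere in the OCR text."""
    if not item_names:
        return 0.0
    vocab = {t.text.lower() for t in tokens}
    hits = sum(
        1 for name in item_names if all(w in vocab for w in name.lower().split())
    )
    return hits / len(item_names)


def guardrail_features(tokens: List[OcrToken], item_names: List[str]) -> Dict[str, float]:
    """Assemble illustrative OCR-side and LLM-side features into one flat record."""
    n = len(tokens)
    return {
        "ocr_token_count": float(n),
        "ocr_mean_confidence": sum(t.confidence for t in tokens) / n if n else 0.0,
        "ocr_order_inversions": reading_order_inversions(tokens),
        "llm_item_count": float(len(item_names)),
        "llm_name_coverage": item_name_coverage(tokens, item_names),
    }


tokens = [
    OcrToken("Pad", 0, 10, 0.9),
    OcrToken("Thai", 5, 10, 0.9),
    OcrToken("Spring", 0, 0, 0.8),  # read after a lower line: an order inversion
    OcrToken("Rolls", 5, 0, 0.8),
]
features = guardrail_features(tokens, ["Pad Thai", "Green Curry"])
```

Here a high inversion rate would hint at an inconsistent menu structure, while low name coverage would hint that the LLM produced items the OCR text does not support.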

Guardrail model training and performance


We developed a simple model structure with a three-component neural network design, shown in Figure 3, to predict whether a transcription is sufficiently accurate. It uses a pre-trained image model to extract image features, concatenates them with the output of fully connected layers for tabular features, and passes the result to final classification layers (fully connected layers and a classifier head). We considered the following pre-trained image models for exploration:

1. Convolutional neural network (CNN)-based pre-trained image models: Visual Geometry Group 16 (VGG16) and Deep Residual Network (ResNet)

2. Transformer-based pre-trained image models: Vision Transformer (ViT) and Document Image Transformer (DiT)

Figure 3: We developed a three-component neural network as our guardrail model to take advantage of various types of features
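To make the data flow through the three components concrete, here is a dependency-free sketch of a forward pass. It assumes the image embedding has already been produced by a pre-trained backbone, uses small random untrained weights, and picks arbitrary layer sizes; a real implementation would use a deep learning framework and trained parameters. The class and function names are hypothetical.

```python
import math
import random
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def dense(x: Vector, w: Matrix, b: Vector, relu: bool = True) -> Vector:
    """Fully connected layer; each row of w holds the weights of one output unit."""
    out = [sum(wi * xi for wi, xi in zip(row, x)) + bi for row, bi in zip(w, b)]
    return [max(0.0, v) for v in out] if relu else out


def init_weights(n_out: int, n_in: int, rng: random.Random) -> Matrix:
    """Small random weights; stands in for trained parameters."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]


class GuardrailNet:
    """Image branch + tabular branch + classification layers (illustrative only)."""

    def __init__(self, image_dim: int, tabular_dim: int, hidden: int = 8, seed: int = 0):
        rng = random.Random(seed)
        # Component 2: fully connected layer for tabular features.
        self.w_tab = init_weights(hidden, tabular_dim, rng)
        self.b_tab = [0.0] * hidden
        # Component 3: classification layers over the concatenated features.
        self.w_fc = init_weights(hidden, image_dim + hidden, rng)
        self.b_fc = [0.0] * hidden
        self.w_head = init_weights(1, hidden, rng)
        self.b_head = [0.0]

    def predict_proba(self, image_embedding: Vector, tabular: Vector) -> float:
        """Probability that the transcription is sufficiently accurate."""
        # Component 1 (the pre-trained image model) is assumed to have run already
        # and produced image_embedding.
        tab = dense(tabular, self.w_tab, self.b_tab)
        fused = image_embedding + tab  # list concatenation of the two branches
        hidden = dense(fused, self.w_fc, self.b_fc)
        logit = dense(hidden, self.w_head, self.b_head, relu=False)[0]
        return 1.0 / (1.0 + math.exp(-logit))


net = GuardrailNet(image_dim=4, tabular_dim=3)
p = net.predict_proba([0.1, 0.2, 0.3, 0.4], [1.0, 2.0, 3.0])
```

Because the weights are untrained, the output value itself carries no meaning; the point is only how the image and tabular branches are concatenated before the classifier head.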

Table 2 below compares the model architectures on two main metrics: average transcription accuracy across all test menu photos and the percentage of transcriptions that met accuracy requirements. Surprisingly, we found that the simplest model, Light Gradient-Boosting Machine, or LightGBM for short, outperforms all the others while maintaining the fastest run time. The neural network with ResNet follows closely behind, while the neural network with Vision Transformer, or ViT, performs the worst of the five. A key reason for ViT's poor performance is that we have limited labeled data, making it difficult to take full advantage of more complex model designs.

Table 2: Model performance based on architecture. The highest values are represented by deeper green.
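As a rough illustration of the two metrics, the snippet below computes them from per-photo accuracy scores. The function name, the sample scores, and the required accuracy level of 0.95 are made-up assumptions; the post does not state the actual requirement or how per-photo accuracy is measured.

```python
from typing import List, Tuple


def summarize_accuracy(per_photo_accuracy: List[float], required: float) -> Tuple[float, float]:
    """Return (average accuracy, percentage of photos meeting the requirement)."""
    if not per_photo_accuracy:
        raise ValueError("need at least one photo")
    n = len(per_photo_accuracy)
    average = sum(per_photo_accuracy) / n
    pct_meeting = 100.0 * sum(1 for a in per_photo_accuracy if a >= required) / n
    return average, pct_meeting


# Hypothetical scores for four test photos and a hypothetical requirement.
average, pct_meeting = summarize_accuracy([1.0, 0.9, 0.5, 1.0], required=0.95)
```

The two numbers answer different questions: the average can look acceptable while a large share of individual menus still falls below the bar, which is why both are reported.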


Enabling automation of partial transcriptions


Figure 4: Automatic menu transcription pipeline combines human and ML transcriptions through the guardrail model

To bring the LLM transcription model to production, we designed a partially automated transcription pipeline that combines human and ML transcriptions, as shown in Figure 4. In this pipeline, all validated photos are passed to our transcription model; the guardrail model then generates features from each transcription and evaluates its expected accuracy. For menu photos that pass the accuracy threshold, the transcribed information becomes readily available. For those that don't pass, the system routes the menu photos to the human transcription process. This system marked our first step toward improving the efficiency of manual human processes without sacrificing quality.
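The routing step in this pipeline can be sketched as a simple threshold check on the guardrail score. The function names, the score values, and the 0.9 threshold below are illustrative assumptions rather than the values used in production.

```python
from typing import Callable, Dict, Iterable, List


def route_transcriptions(
    photo_ids: Iterable[str],
    guardrail_score: Callable[[str], float],
    threshold: float,
) -> Dict[str, List[str]]:
    """Split photos into automated and human queues based on the guardrail score."""
    routed: Dict[str, List[str]] = {"automated": [], "human": []}
    for photo_id in photo_ids:
        queue = "automated" if guardrail_score(photo_id) >= threshold else "human"
        routed[queue].append(photo_id)
    return routed


def automation_rate(routed: Dict[str, List[str]]) -> float:
    """Share of photos whose transcription was accepted without human work."""
    total = len(routed["automated"]) + len(routed["human"])
    return len(routed["automated"]) / total if total else 0.0


# Hypothetical guardrail scores keyed by photo id.
scores = {"photo_a": 0.97, "photo_b": 0.40, "photo_c": 0.91}
routed = route_transcriptions(scores, scores.get, threshold=0.9)
rate = automation_rate(routed)
```

Raising the threshold trades automation rate for quality: fewer transcriptions are accepted automatically, but those that are accepted are more likely to meet the accuracy bar.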

Quick adaptation to improved transcription automation

During the six months following the development of our first guardrail model, the generative AI world evolved rapidly, including the development of multimodal models. We continue to explore and test new transcription models, evaluating their pros and cons. Each generation of transcription model has unique advantages and shortcomings, but none significantly outperforms the others. For example, multimodal models are great at context understanding but more prone to errors when handling poor-quality photos, resulting in higher overall transcription failure rates. OCR+LLM models, on the other hand, maintain relatively stable performance but underperform on context understanding.


Nonetheless, our guardrail model framework has allowed us to leverage newly released state-of-the-art AI models quickly. It balances the pros and cons of different models and helps the system steadily reach a higher ratio of automation while ensuring quality.

Figure 5: Updated automatic menu transcription pipeline with both multimodal GenAI models and guardrail model in place.
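One way a guardrail could balance several transcription models is to score each candidate transcription and keep the best one that clears the threshold. The post does not describe how the updated pipeline arbitrates between models, so the selection rule, names, and scores below are assumptions made purely for illustration.

```python
from typing import Callable, Dict, Optional, Tuple

Transcriber = Callable[[bytes], str]
Scorer = Callable[[bytes, str], float]


def best_passing_transcription(
    photo: bytes,
    transcribers: Dict[str, Transcriber],
    guardrail: Scorer,
    threshold: float,
) -> Optional[Tuple[str, str, float]]:
    """Return (model name, transcription, score) for the best passing candidate.

    Returns None when no candidate clears the threshold, meaning the photo
    should go to human transcription.
    """
    best: Optional[Tuple[str, str, float]] = None
    for name, transcribe in transcribers.items():
        text = transcribe(photo)
        score = guardrail(photo, text)
        if score >= threshold and (best is None or score > best[2]):
            best = (name, text, score)
    return best


# Stand-in models and a stand-in guardrail with fixed scores per output.
transcribers = {
    "ocr_llm": lambda photo: "ocr+llm output",
    "multimodal": lambda photo: "multimodal output",
}
fixed_scores = {"ocr+llm output": 0.92, "multimodal output": 0.96}


def guardrail(photo: bytes, text: str) -> float:
    return fixed_scores[text]


accepted = best_passing_transcription(b"", transcribers, guardrail, threshold=0.9)
rejected = best_passing_transcription(b"", transcribers, guardrail, threshold=0.99)
```

Adding a newly released model under this scheme only requires registering another entry in `transcribers`; the guardrail decides per photo whether its output is good enough to use.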

Looking into the future

With the rapid development of generative AI and increasing investment, this has become a fast-paced process of learning and exploration for all of us. From this journey, we've learned that more supervision is needed to realize full value and move into reliable production. The guardrail ML model has proven most viable for achieving these purposes.

As our journey continues, we are seeing improvement in our current pipeline, even as we explore additional options for optimizing and improving the performance of both transcription and guardrail models. For example, current LLM/multimodal models are trained on general datasets, with no domain expertise in restaurant menus. Because we have a growing volume of manually transcribed data, we could use it to fine-tune custom LLM/multimodal models. One of the biggest challenges with both transcription models, however, is the poor quality of menu photos. Additional processes could be put in place to improve photo quality, which could lead to advancements in downstream transcription. Those are just some of the areas we plan to continue working on. We are excited about the potential to continually improve our AI system to provide the most up-to-date information from restaurants to consumers.

About the Authors

Zhe Mai is a Machine Learning Engineer on the Merchant team at DoorDash. Her focus is on using machine learning to solve real-world problems and deliver value.

Zheng Hu is a Machine Learning Engineer at DoorDash.

Ying Yang is a Machine Learning Engineering Manager on the Merchant team at DoorDash.
