2025/4/7 09:38 Using LLM to transcribe restaurant menu photos - DoorDash
Using LLM to transcribe restaurant
menu photos
March 19, 2025
Zhe Mai
Zheng Hu
Ying Yang
https://careersatdoordash.com/blog/doordash-llm-transcribe-menu/ 1/15
A restaurant's menu is one of its most important representations on a delivery platform. To ensure
accuracy and alignment with their latest offerings, DoorDash's restaurant partners must actively
maintain their menus. This can be challenging, however, for business owners who already are
managing demanding daily operations. As a delivery company committed to their success,
DoorDash sees a valuable opportunity to integrate AI into this traditionally human-managed
process, streamlining efficient updates through submitted menu photos.
Previously, we relied on humans to transcribe and update restaurant menus manually, which was
costly and time-consuming. The rapid improvement of large language models, or LLMs, creates an
opportunity for a big stepwise change, allowing AI to transcribe information from menu photos.
However, the diverse menu structures restaurants use make it challenging for an LLM to do an
accurate job at scale. In this post, we discuss how we built a system with a guardrail layer for
LLMs that leverages traditional machine learning (ML) techniques. The guardrail layer serves as an
effective control mechanism for LLMs that enables LLM applications to run at scale with high
accuracy. It enables AI practitioners to swiftly leverage newly released LLMs while mitigating
potential risks that may impact the final product quality. At the same time, the use of
traditional ML in this system offers advantages in both low latency and cost efficiency.
Rapid start with prototyping
LLMs have greatly accelerated how quickly we can develop an initial minimum viable product,
completely changing the way we discover possibilities. Figure 1 shows an example of what we could
put together quickly for initial evaluation. The process first uses optical character recognition, or
OCR, to extract text from a menu image, which is then passed over to an LLM for item-level
information extraction and summarization, creating a structured data format.
Figure 1: OCR extracts text from a menu photo that an LLM then can summarize into a structured data format.
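The two-step flow in Figure 1 can be sketched roughly as follows. This is a minimal illustration, not DoorDash's implementation: the `MenuItem` fields, the prompt wording, and the `run_ocr` / `call_llm` callables are all assumptions, and the OCR engine and LLM are passed in as plain functions so that any provider could be plugged in.

```python
import json
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class MenuItem:
    # Hypothetical schema; the real structured format is not given in the post.
    name: str
    category: Optional[str] = None
    price: Optional[float] = None
    description: Optional[str] = None


# Hypothetical prompt; the actual instructions are not published.
PROMPT_TEMPLATE = (
    "Extract every menu item from the OCR text below. Return a JSON list of "
    "objects with the keys name, category, price, description.\n\n"
    "OCR text:\n{ocr_text}"
)


def transcribe_menu(
    image_bytes: bytes,
    run_ocr: Callable[[bytes], str],
    call_llm: Callable[[str], str],
) -> List[MenuItem]:
    """Step 1: OCR the photo. Step 2: ask the LLM for structured items."""
    ocr_text = run_ocr(image_bytes)
    raw_response = call_llm(PROMPT_TEMPLATE.format(ocr_text=ocr_text))
    records = json.loads(raw_response)  # raises ValueError on malformed JSON
    return [
        MenuItem(
            name=record["name"],
            category=record.get("category"),
            price=record.get("price"),
            description=record.get("description"),
        )
        for record in records
    ]


# Stand-ins for a real OCR engine and a real LLM client.
def fake_ocr(_image: bytes) -> str:
    return "BURGERS\nCheeseburger 8.99"


def fake_llm(_prompt: str) -> str:
    return '[{"name": "Cheeseburger", "category": "Burgers", "price": 8.99}]'


items = transcribe_menu(b"", fake_ocr, fake_llm)
```

Keeping OCR and the LLM behind simple callables makes it easy to swap in a newer model later without touching the parsing logic.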
LLM key challenges and pain points
An LLM's text understanding provides excellent summarization and organization. However, our
use cases require very high transcription accuracy, which is difficult for an LLM to achieve
because of its lack of familiarity with the variety of menu structures and its limited ability to follow
instructions in complicated scenarios. Human evaluation of a large number of menu
photos showed that a meaningful proportion of menus were transcribed with errors, such as incorrect
item names or categories. After a thorough investigation, we found that the LLM made
transcription errors primarily when it encountered three sub-optimal types of menu photos, as
shown in Figure 2:
Inconsistent menu structure, leading to confusing OCR raw texts
Incomplete menus, causing difficulty in the correct linkage between items and their attributes
Low photographic quality, such as too dark, too many flares, or too many irrelevant items in the
foreground or background
Figure 2: Example transcriptions of menu photos resulting in lower accuracy.
To enhance accuracy, we have made an intensive effort to close the LLM's performance gap.
However, given our high accuracy standards, we would still need a tremendous amount of time and
investment to improve the LLMs, postponing the realization of their value. As a result, we've
developed more innovative approaches to move AI automation to production. The key to ensuring
accuracy is to build a system that pairs the LLM with a suitable automatic guardrail process,
instead of treating the LLM as a standalone product. The system allows us not only to optimize
for high accuracy but also to pursue lower cost and latency.
Introducing an LLM guardrail
Our guardrail framework is based on a machine learning (ML) model that identifies whether an LLM
transcription can achieve high accuracy. Simultaneously, the framework must be flexible enough to
adapt to rapid developments in AI models. The following outlines our journey toward achieving
these goals.
Generating guardrail model training features
To understand transcription quality, the guardrail model must learn how each menu photo interacts
with both OCR and LLM summarization. As with building any other machine learning model, it is
key to identify and process the right set of features. We focus in particular on generating features
that can explain the interactions between a menu photo, its OCR output, and the LLM
summarization because:
An inconsistent menu structure leads to an illogical order in the OCR output's raw text. For
example, the OCR might not be able to read the menu by category or in any particular order.
We have observed arbitrary ordering of text recognition that makes it more difficult for an LLM
to link the right item attributes together.
Incomplete menus may output attributes from items that are only partially visible, resulting in
extraneous or mismatched attributes, confusing the LLM on the correct item-to-attribute
linkage.
Because photo quality can be subpar in many different ways, challenges are generated for both
the OCR and the LLM, including minuscule, unusable fonts and cluttered foregrounds and
backgrounds that obscure text.
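To make the first point concrete, here is one hypothetical feature that captures how "out of order" an OCR output is. It is purely illustrative and is not one of the features the guardrail model is stated to use: the function name, the `(x, y)` box format, and the `y_tolerance` parameter are assumptions.

```python
from typing import List, Tuple


def reading_order_disorder(
    boxes: List[Tuple[float, float]], y_tolerance: float = 5.0
) -> float:
    """Fraction of consecutive OCR boxes that jump upward on the page.

    `boxes` holds the (x, y) top-left corner of each text box, in the order
    the OCR engine emitted them, with y growing downward. A value near 0
    suggests a top-to-bottom read; larger values suggest arbitrary ordering.
    """
    if len(boxes) < 2:
        return 0.0
    upward_jumps = 0
    for (_, y_prev), (_, y_next) in zip(boxes, boxes[1:]):
        # Ignore small vertical wobble between boxes on the same text line.
        if y_next < y_prev - y_tolerance:
            upward_jumps += 1
    return upward_jumps / (len(boxes) - 1)


ordered_score = reading_order_disorder([(0, 0), (0, 20), (0, 40)])
scrambled_score = reading_order_disorder([(0, 40), (0, 0), (0, 20), (0, 10)])
```

A multi-column menu legitimately jumps upward once per column, so a feature like this would only be a rough signal to combine with others.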
It soon became clear that we could not rely solely on menu photos for machine learning. Instead,
we decided to use three types of features/inputs for the model, as shown in Table 1:
Table 1: Guardrail model features and inputs
Guardrail model training and performance
We developed a simple model structure with a three-component neural network design, as shown in
Figure 3, to predict whether a transcription is sufficiently accurate. It uses pre-trained image
models to encode image features, concatenates their output with that of fully connected layers for tabular
features, and passes the result to the final classification layers (fully connected layers and a classifier head). We
considered the following pre-trained image models for exploration:
1. Convolutional Neural Network (CNN) based pre-trained image models: Visual Geometry Group
16 (VGG16) and Deep Residual Network (ResNet)
2. Transformer-based pre-trained image models: Vision Transformer (ViT) / Document Image
Transformer (DiT)
Figure 3: We developed a three-component neural network as our guardrail model to take advantage of various
types of features
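The data flow of such a three-component design can be sketched in plain Python. This is a toy forward pass with random, untrained weights, meant only to show where the concatenation happens. The layer sizes and the `GuardrailSketch` class are assumptions, and the image embedding is assumed to come from a pre-trained encoder that is not shown. A real model would be built and trained in a deep learning framework.

```python
import math
import random
from typing import List, Tuple

Layer = Tuple[List[List[float]], List[float]]  # (weights, biases)


def init_layer(n_in: int, n_out: int, rng: random.Random) -> Layer:
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def dense(x: List[float], layer: Layer) -> List[float]:
    weights, biases = layer
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]


def relu(x: List[float]) -> List[float]:
    return [max(0.0, v) for v in x]


class GuardrailSketch:
    """Image embedding + tabular branch -> concatenate -> classifier head."""

    def __init__(self, image_dim: int, tabular_dim: int, hidden_dim: int = 8, seed: int = 0):
        rng = random.Random(seed)
        self.tabular_layer = init_layer(tabular_dim, hidden_dim, rng)
        self.fusion_layer = init_layer(image_dim + hidden_dim, hidden_dim, rng)
        self.head = init_layer(hidden_dim, 1, rng)

    def predict_proba(self, image_embedding: List[float], tabular: List[float]) -> float:
        # Component 1 (not shown): a pre-trained image model produced image_embedding.
        # Component 2: fully connected layer over the tabular features.
        tabular_hidden = relu(dense(tabular, self.tabular_layer))
        # Component 3: concatenate both branches, then classify.
        fused = list(image_embedding) + tabular_hidden
        hidden = relu(dense(fused, self.fusion_layer))
        logit = dense(hidden, self.head)[0]
        return 1.0 / (1.0 + math.exp(-logit))  # probability the transcription is accurate


model = GuardrailSketch(image_dim=4, tabular_dim=3)
probability = model.predict_proba([0.2, 0.1, 0.0, 0.5], [12.0, 0.8, 1.0])
```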
Table 2 below shows the comparison among different model architectures based on two main
metrics: average transcription accuracy across all test menu photos and percentage of
transcriptions that met accuracy requirements. Surprisingly, we found that the simplest model,
Light Gradient-Boosting Machine, or LightGBM for short, outperforms all other models while
maintaining the fastest run time. The neural network with ResNet (residual networks) follows
closely behind, while the neural network with Vision Transformers, or ViT, performs the worst of the
five. A key reason for its poor performance is that we have limited labeled data, making it difficult to
take full advantage of more complex model designs.
Table 2: Model performance based on architecture. The highest values are
represented by deeper green.
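The two metrics behind Table 2 can be computed as in the following sketch. The per-menu accuracy values and the 0.9 requirement are made-up placeholders, since the post does not publish the actual numbers or threshold.

```python
from typing import Sequence, Tuple


def summarize_accuracy(
    per_menu_accuracy: Sequence[float], required_accuracy: float
) -> Tuple[float, float]:
    """Return (average accuracy, percentage of menus meeting the requirement)."""
    if not per_menu_accuracy:
        raise ValueError("need at least one evaluated menu photo")
    count = len(per_menu_accuracy)
    average = sum(per_menu_accuracy) / count
    meeting = sum(1 for acc in per_menu_accuracy if acc >= required_accuracy)
    return average, 100.0 * meeting / count


# Placeholder values for four test menu photos, not real results.
average_accuracy, pct_meeting = summarize_accuracy([1.0, 0.9, 0.5, 0.8], required_accuracy=0.9)
```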
Enabling automation of partial transcriptions
Figure 4: Automatic menu transcription pipeline combines human and ML transcriptions through the guardrail
model
To bring the LLM transcription model to production, we designed a partial-automation
transcription pipeline that combines human and ML transcriptions, as shown in Figure 4. In this
pipeline, all validated photos are passed to our transcription model, and the guardrail model
generates features for each transcription and evaluates its expected accuracy. Transcribed information
becomes readily available for the menu photos that pass the auditing threshold for accuracy. For
those that don't pass, the system routes the menu photos to the human transcription process. This system marked
our first step toward improving efficiency in the manual human processes without sacrificing
quality.
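The routing step of this pipeline can be sketched in a few lines. The names `ScoredTranscription` and `route_transcriptions`, and the 0.9 threshold, are hypothetical. The guardrail score is assumed to be the model's predicted probability that a transcription is accurate enough.

```python
from dataclasses import dataclass
from typing import Iterable, List, Tuple


@dataclass
class ScoredTranscription:
    photo_id: str
    guardrail_score: float  # assumed: probability the transcription meets the bar


def route_transcriptions(
    transcriptions: Iterable[ScoredTranscription], threshold: float
) -> Tuple[List[str], List[str]]:
    """Split photo ids into (automated, sent to human transcription)."""
    automated: List[str] = []
    manual: List[str] = []
    for transcription in transcriptions:
        if transcription.guardrail_score >= threshold:
            automated.append(transcription.photo_id)
        else:
            manual.append(transcription.photo_id)
    return automated, manual


batch = [
    ScoredTranscription("photo-a", 0.97),
    ScoredTranscription("photo-b", 0.42),
    ScoredTranscription("photo-c", 0.90),
]
automated_ids, manual_ids = route_transcriptions(batch, threshold=0.9)
```

Raising the threshold trades automation rate for quality: fewer photos are automated, but those that are carry less risk.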
Quick adaptation to improved transcription automation
During the six months following the development of our first guardrail model, there was rapid
evolution in the generative AI world, including the development of multimodality models. We
continue to explore and test new transcription models, evaluating their pros and cons. Each
generation of transcription models has unique advantages and shortcomings, but none significantly
outperforms the others. For example, multimodality models are great at context understanding but
more prone to errors when handling bad-quality photos, resulting in overall higher transcription
failure rates. OCR+LLM models, on the other hand, maintain relatively stable performance but
underperform on context understanding.
Nonetheless, our guardrail model framework has allowed us to leverage newly released
state-of-the-art AI models quickly. It balances the pros and cons of different models and helps the system
steadily reach a higher ratio of automation while ensuring quality.
Figure 5: Updated automatic menu transcription pipeline with both multimodality GenAI models and guardrail
model in place.
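The post does not spell out how candidates from several transcription models are combined. One plausible arrangement, shown here only as an assumption, is to score each model's transcription with the guardrail and keep the best one if it clears the threshold, falling back to human transcription otherwise. The model names and scores below are placeholders.

```python
from typing import Dict, Optional


def select_transcription(
    scores_by_model: Dict[str, float], threshold: float
) -> Optional[str]:
    """Return the best-scoring model name, or None to fall back to humans."""
    if not scores_by_model:
        return None
    best_model = max(scores_by_model, key=lambda name: scores_by_model[name])
    if scores_by_model[best_model] >= threshold:
        return best_model
    return None


# Placeholder guardrail scores for one menu photo.
chosen = select_transcription({"ocr_llm": 0.70, "multimodal": 0.92}, threshold=0.9)
fallback = select_transcription({"ocr_llm": 0.70, "multimodal": 0.60}, threshold=0.9)
```

Under this arrangement, adding a newly released model only means adding one more entry to the candidates; the guardrail decides when it is trusted.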
Looking into the future
With the rapid development of generative AI and increasing investment, this has become a fast
learning and exploring process for all of us. From this journey, we've learned that more supervision
is needed to realize full value and move into reliable production. The guardrail ML model has
proven most viable for achieving these purposes.
As our journey continues, we are seeing improvement in our current pipeline, even as we explore
additional options for optimizing and improving the performance of both transcription and
guardrail models. For example, current LLM/multimodal models are trained with a general dataset
and no domain expertise on restaurant menus. Because we have an increasing availability of
manually transcribed data, we could extend its use to fine-tune custom LLM/multimodal models.
One of the biggest challenges with both transcription models, however, is the poor quality of menu
photos. Additional processes could be put in place to ensure quality improvements, which could
lead to advancements in downstream transcription. Those are just some of the areas we plan to
continue working on. We are excited about the potential to continually improve our AI system to
provide the most up-to-date information from restaurants to consumers.
About the Authors
Zhe Mai is a Machine Learning Engineer on the Merchant team at DoorDash. Her focus is on
utilizing machine learning to solve real-world problems and deliver value.
Zheng Hu is a Machine Learning Engineer at DoorDash.
Ying Yang is a Machine Learning Engineering Manager on the Merchant team at DoorDash.