Ratings and reviews are an invaluable resource for users exploring an app on the App Store, providing insights into how others have experienced the app. With review summaries now available in iOS 18.4, users can quickly get a high-level overview of what other users think about an app, while still having the option to dive into individual reviews for more detail. This feature is powered by a novel, multi-step LLM-based system that periodically summarizes user reviews.

Our goal in producing review summaries is to ensure they are inclusive, balanced, and accurately reflect the user’s voice. To achieve this, we adhere to key principles of summary quality, prioritizing safety, fairness, truthfulness, and helpfulness.

Summarizing crowd-sourced user reviews presents several challenges, each of which we addressed to deliver accurate, high-quality summaries that are useful for users:

  • Timeliness: App reviews change constantly due to new releases, features, and bug fixes. Summaries must dynamically adapt to stay relevant and reflect the most up-to-date user feedback.
  • Diversity: Reviews vary in length, style, and informativeness. Summaries need to capture this diversity to provide both detailed and high-level insights without losing nuance.
  • Accuracy: Not all reviews focus specifically on the app experience, and some include off-topic comments. The system must filter out this noise to produce trustworthy summaries.

In this post, we explain how we developed a robust approach that leverages generative AI to overcome these challenges. In developing our solution, we also created novel frameworks to evaluate the quality of generated summaries across various dimensions. We assessed the effectiveness of this approach using thousands of sample summaries.

Review Summarization Model Design

The overall workflow for summarizing user reviews is shown in Figure 1.

For each app, we first filter out reviews containing spam, profanity, and fraud. Eligible reviews are then passed through a sequence of modules powered by LLMs. These modules extract key insights from each review, understand and aggregate commonly occurring themes, balance sentiment, and finally output a summary reflective of broad user opinion in an informative paragraph between 100 and 300 characters in length. We describe each component in more detail in the subsequent sections.
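The sketch below outlines this pipeline in Python. It is an orchestration skeleton only: the function and parameter names are hypothetical assumptions, not Apple’s implementation, and each LLM-backed stage is supplied as a callable.

```python
def passes_content_filters(review: str) -> bool:
    # Placeholder for the spam, profanity, and fraud filters.
    return bool(review.strip())

def summarize_app_reviews(reviews, extract_insights, group_topics,
                          select_insights, generate_summary) -> str:
    eligible = [r for r in reviews if passes_content_filters(r)]
    insights = [i for r in eligible for i in extract_insights(r)]  # LoRA-tuned LLM
    topics = group_topics(insights)                                # dynamic topic modeling
    selected = select_insights(topics)                             # popularity + balance criteria
    summary = generate_summary(selected)                           # style-constrained LLM
    assert 100 <= len(summary) <= 300, "summary should be 100-300 characters"
    return summary
```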

Insight Extraction

To extract the key points from reviews, we leverage an LLM fine-tuned with LoRA adapters (Hu et al., 2022) to efficiently distill each review into a set of distinct insights. Each insight is an atomic statement, encapsulating one specific aspect of the review, articulated in standardized, natural language, and confined to a single topic and sentiment. This approach facilitates a structured representation of user reviews, allowing for effective comparison of relevant topics across different reviews.
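As a rough illustration, insight extraction can be framed as structured generation. The prompt wording, the `llm` callable, and the JSON schema below are all assumptions for the sake of the example; the fine-tuned model described above does not necessarily operate this way.

```python
import json

# Hypothetical prompt and output schema; `llm` stands in for the
# LoRA-fine-tuned model.
INSIGHT_PROMPT = (
    "Distill the app review below into distinct insights. Each insight must "
    "be an atomic statement confined to a single topic and sentiment, phrased "
    "in standardized natural language. Respond with a JSON list of objects "
    "with keys 'text', 'topic', and 'sentiment'.\n\nReview: "
)

def extract_insights(review: str, llm) -> list[dict]:
    return json.loads(llm(INSIGHT_PROMPT + review))

# A review like "Love the new dark mode, but it crashes on launch" might yield:
# [{"text": "The new dark mode is appreciated", "topic": "Dark Mode", "sentiment": "positive"},
#  {"text": "The app crashes on launch", "topic": "Stability", "sentiment": "negative"}]
```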

Dynamic Topic Modeling

After extracting insights, we use dynamic topic modeling to group similar themes from user reviews and identify the most prominent topics discussed. To this end, we developed another fine-tuned language model to distill each insight into a topic name in a standardized fashion while avoiding a fixed taxonomy. We then apply careful deduplication logic on an app-by-app basis, leveraging embeddings to combine semantically related topics and pattern matching to account for variations in topic names. Lastly, our model leverages its learned knowledge of the app ecosystem to determine whether a topic is linked to the "App Experience" or an "Out-of-App Experience." We prioritize topics relating to app features, performance, and design, while Out-of-App Experiences (like opinions about the quality of food in a review for a food delivery app) are deprioritized.
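A minimal sketch of embedding-based deduplication follows, assuming a hypothetical `embed` function that returns unit-norm vectors and an illustrative similarity threshold; the pattern matching on name variants used in the real system is omitted here.

```python
import numpy as np

def dedupe_topics(topic_names: list[str], embed, threshold: float = 0.85) -> dict[str, str]:
    """Greedily merge semantically related topic names into canonical ones."""
    canonical, vectors, mapping = [], [], {}
    for name in topic_names:
        v = embed(name)
        if vectors:
            sims = np.array([v @ u for u in vectors])  # cosine similarity for unit vectors
            best = int(np.argmax(sims))
            if sims[best] >= threshold:
                mapping[name] = canonical[best]  # fold into an existing topic
                continue
        canonical.append(name)
        vectors.append(v)
        mapping[name] = name
    return mapping  # e.g. {"dark theme": "Dark Mode", "Dark Mode": "Dark Mode"}
```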

Topic & Insight Selection

For each app, a set of topics is automatically selected for summarization, prioritizing topic popularity while incorporating additional criteria to enhance balance, relevance, helpfulness, and freshness. To ensure that the selected topics reflect the broader sentiment expressed by users, we verify that the representative insights gathered are consistent with the app's overall ratings. Then, we extract the most representative insights corresponding to each topic for inclusion in the final summary. We generate the final summary using these selected insights rather than the topics themselves, because the insights offer a more naturally phrased perspective coming from users. This results in summaries that are more expressive and rich in detail.
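The following sketch shows one simplified way such selection could work, using insight dicts shaped like those in the earlier extraction sketch. The popularity ranking and the crude rating-alignment heuristic are assumptions for illustration; the actual criteria also weigh balance, relevance, helpfulness, and freshness as described above.

```python
from collections import Counter

def select_insights(insights: list[dict], overall_rating: float, k: int = 5) -> list[dict]:
    # Rank topics by how often they appear across extracted insights.
    topic_counts = Counter(i["topic"] for i in insights)
    top_topics = [t for t, _ in topic_counts.most_common(k)]
    # Crude rating-alignment heuristic (an assumption, not the real criteria):
    # lean toward the sentiment implied by the overall star rating.
    lean = "positive" if overall_rating >= 3.5 else "negative"
    selected = []
    for topic in top_topics:
        pool = [i for i in insights if i["topic"] == topic]
        selected.append(next((i for i in pool if i["sentiment"] == lean), pool[0]))
    return selected
```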

Summary Generation

A third LLM fine-tuned with LoRA adapters then generates a summary from the selected insights that is tailored to the desired length, style, voice, and composition. We fine-tuned the model for this task using a large, diverse set of reference summaries written by human experts. We then continued fine-tuning this model using preference alignment (Ziegler et al., 2019). Here, we utilized Direct Preference Optimization (DPO, Rafailov et al., 2023) to tailor the model's output to match human preferences. To run DPO, we assembled a comprehensive dataset of summary pairs, each comprising the model's initially generated output and a subsequent human-edited version, focusing on examples where the model's output could have been improved in composition to adhere more closely to the intended style.
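For reference, the DPO objective on such preference pairs can be written compactly in PyTorch. The function below is a standard formulation of the published loss (Rafailov et al., 2023), not code from this system; the beta value is a common default, not one reported here.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor, policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor, ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    # Inputs are summed token log-probabilities of the human-edited summary
    # ("chosen") and the model's original output ("rejected") under the policy
    # model and a frozen reference model.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```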

Evaluation

To evaluate the summary workflow, sample summaries were reviewed by human raters against four criteria. A summary was rated high in Safety if it was devoid of harmful or offensive content; Groundedness assessed whether it faithfully represented the input reviews; Composition evaluated grammar and adherence to Apple’s voice and style; and Helpfulness determined whether it would assist a user in making a download or purchase decision. Each summary was sent to multiple raters: Safety required a unanimous vote, while the other three criteria were decided by majority. We sampled and evaluated thousands of summaries during development of the model workflow to measure its performance and provide feedback to engineers. Simultaneously, some evaluation tasks were automated, enabling us to direct human expertise to where it was most needed.
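The vote-aggregation rule described above is straightforward to express in code. The sketch below assumes a hypothetical per-criterion pass/fail vote format:

```python
def accept_summary(votes: dict[str, list[bool]]) -> bool:
    # Safety requires a unanimous pass from all raters.
    if not all(votes["safety"]):
        return False
    # The remaining criteria each require a majority pass.
    return all(
        sum(votes[c]) > len(votes[c]) / 2
        for c in ("groundedness", "composition", "helpfulness")
    )

# accept_summary({"safety": [True, True, True],
#                 "groundedness": [True, True, False],
#                 "composition": [True, False, True],
#                 "helpfulness": [True, True, True]})  # -> True
```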

Conclusion

To generate accurate and useful summaries of reviews in the App Store, our system addresses a number of challenges, including the dynamic nature of this multi-document environment and the diversity of user reviews. Our approach leverages a sequence of LLMs fine-tuned with LoRA adapters to extract insights, group them by theme, select the most representative, and finally generate a brief summary. Our evaluations indicate that this workflow successfully produces summaries that faithfully represent user reviews and are helpful, safe, and presented in an appropriate style. In addition to delivering useful summaries for App Store users, this work more broadly demonstrates the potential of LLM-based summarization to enhance decision-making in high-volume, user-generated content settings.

Acknowledgements

Many people contributed to this project including (in alphabetical order): Sean Chao, Srivas Chennu, Yukai Liu, Jordan Livingston, Karie Moorman, Chloe Prud’homme, Sonia Purohit, Hesam Salehian, Sanjay Srivastava, and Susanna Stone.
