LLM-Enhanced Wearables for Comprehensible Health Guidance in LMICs

Mohammad Shaharyar Ahsan (25100169@lums.edu.pk), Lahore University of Management Sciences (LUMS), Lahore, Pakistan; Areeba Shahzad Shaikh (25100046@lums.edu.pk), Lahore University of Management Sciences (LUMS), Lahore, Pakistan; Maham Zahid (25100152@lums.edu.pk), Lahore University of Management Sciences (LUMS), Lahore, Pakistan; Umer Irfan (27100363@lums.edu.pk), Lahore University of Management Sciences (LUMS), Lahore, Pakistan; Maryam Mustafa (maryam_mustafa@lums.edu.pk), Lahore University of Management Sciences (LUMS), Lahore, Pakistan; Naveed Anwar Bhatti (naveed.bhatti@lums.edu.pk), Lahore University of Management Sciences (LUMS), Lahore, Pakistan; and Muhammad Hamad Alizai (hamad.alizai@lums.edu.pk), Lahore University of Management Sciences (LUMS), Lahore, Pakistan

Abstract.

Personal health monitoring via IoT in LMICs is limited by affordability, low digital literacy, and weak health-data comprehension. We present Guardian Angel, a low-cost, screenless wearable paired with a WhatsApp-based LLM agent that delivers plain-language, personalized insights. The LLM operates directly on raw, noisy sensor waveforms and is robust to the poor signal quality of low-cost hardware. On a benchmark dataset, a standard open-source algorithm produced valid outputs for only 70.29% of segments, whereas Guardian Angel achieved 100% availability (reported as coverage under field noise, distinct from accuracy), yielding a continuous and understandable physiological record. In a 96-hour study involving 20 participants (1,920 participant-hours), users demonstrated significant improvements in health data comprehension and mindfulness of vital signs. These results suggest a practical approach to enhancing health literacy and adoption in resource-constrained settings.

1. Introduction

Preventive health remains a critical, unmet challenge in Low- and Middle-Income Countries (LMICs) (haryanti2023comparative). Restricted access to continuous, understandable health insights contributes significantly to the burden of preventable disease and delayed diagnoses in these regions. While personal health monitoring technologies have proliferated in resource-rich settings, their transformative potential in LMICs is stifled by a triad of socio-economic and technological hurdles: prohibitive costs of devices and data, low levels of digital and health literacy, and the substantial cognitive burden required to interpret health information (malik2020health; babalola2022health; perrins2024health). Bridging this gap calls for personal health monitoring solutions that remain affordable and accessible while also being immediately comprehensible and actionable for users with diverse literacy levels in their current technological contexts.

Challenges. Four obstacles interact and reinforce one another, severely limiting both the adoption and the meaningful use of personal health monitoring technologies in LMICs. (i) The digital divide and interface complexity: low digital literacy thwarts engagement with conventional mobile health platforms, and the complexity of these systems compounds the problem. Feature-rich graphical user interfaces impose steep learning curves, particularly for users with limited prior technology exposure or general literacy, hampering navigation and the effective retrieval and interpretation of personal metrics (reid2024overcoming; kaphle2024systematic; guimaraes2023interface). (ii) Economic constraints: the high upfront cost of wearables inhibits individual adoption and large-scale deployment, especially where out-of-pocket healthcare spending dominates (huesch2024usage; swain2024analysis; wipo2024innovative; hpn2024public). (iii) Poor personalization and explanation: generic advice and raw data dashboards seldom clarify a metric’s personal relevance, leaving users with low health literacy struggling to convert data into informed action (krebs2010meta; lee2025challenges; fadhil2024transforming; alqahtani2024recent). (iv) Low-cost sensor fidelity and algorithmic limits: affordability pressures favour inexpensive, off-the-shelf sensors whose noisy signals overwhelm signal-processing algorithms that were not designed for such conditions. Motion artifacts and low signal-to-noise ratios yield missing or inaccurate readings, eroding user trust and utility. Consequently, existing personal health monitoring solutions, often designed without these deeply embedded constraints in mind, struggle for traction.
Although prior work has attempted to tackle individual issues through simplified interfaces (guimaraes2023interface; choukou2022covid) or ultra-low-cost hardware (larnyoh2022using; yaakob2021cost), a holistic solution that simultaneously addresses affordability, accessibility, personalization, and clear explanation in an integrated manner has remained elusive.

Approach. Recent advancements in LLMs (alkhayat2024large; li2024systematic) present an opportunity to fundamentally re-imagine personal health monitoring accessibility, particularly by addressing the critical barriers of data comprehension and interface complexity. We introduce Guardian Angel, an end-to-end platform architected from the ground up to dismantle the interlocking barriers prevalent in LMICs. Guardian Angel (Figure 1) consists of a wearable health band and an LLM-driven WhatsApp chatbot working in concert (lacroix2023using; conway2023improving; deniz2023quality). The wearable captures vital signs at ultra-low cost, while the chatbot delivers personalized explanations and advice. We deliberately avoid a custom smartphone app; by leveraging WhatsApp, already ubiquitous in our target communities, we eliminate a major adoption barrier of learning new apps. Likewise, the wearable is screenless to minimize cost and complexity, offloading interpretation to the cloud where a tiered LLM pipeline analyzes data. Guardian Angel is a wellness aid, not a diagnostic tool; guidance is framed as conservative self-care cues.

Figure 1. System Architecture.

Contributions. Our work makes three key contributions to the design and evaluation of low-cost personal wellness monitoring in resource-constrained settings. First, we introduce a holistic architecture that fuses ultra-low-cost, screen-less hardware with a WhatsApp-based conversational agent driven by a tiered LLM pipeline, all tailored to the constraints and user needs of LMICs. Second, we focus on coverage under field noise for low-rate commodity sensors: on a benchmark, a conventional quality-gated pipeline returned values for 70.29% of windows, whereas our system produced outputs for 100% of windows, reported as coverage, distinct from accuracy, to maintain a continuous, understandable record. Third, in an exploratory in-the-wild deployment totalling 1920 participant-hours (20 participants, each continuously engaged for 96 h), we observe substantial improvements in health data comprehension and mindfulness of vital signs, even though pre-study surveys revealed low confidence in interpreting raw physiological data, a pattern common in LMICs (babalola2022health; lee2025challenges).

These contributions position Guardian Angel as a scalable, affordable pathway to democratize personal health monitoring and empower individuals to manage their well-being proactively, where such tools are most critically needed.

2. System Design and Architecture

Guardian Angel couples an ultra–low-cost edge with a resource-aware cloud to deliver usable health guidance in LMICs. A screenless wristband and minimalist phone app minimize BoM and power, while the cloud turns sensor streams into concise, actionable messages in familiar channels (e.g., WhatsApp). Screen-less bands already exist, but most operate as closed-loop, proprietary stacks, preventing meaningful interaction or customization. Guardian Angel opens the loop by decoupling sensing from interpretation and moving feedback to a programmable WhatsApp LLM agent, making interaction transparent and adaptable without changing hardware. The contribution is the co-design of a simplified edge with an adaptive backend that jointly addresses affordability, digital literacy, and comprehension barriers. We adopt a coverage-first design and position Guardian Angel as a wellness aid, not a diagnostic tool.

2.1. Design Goals

• Affordability and longevity: prefer commodity parts and duty‑cycled sensing to extend lifetime on small batteries.

• Low cognitive load: avoid device UI; deliver guidance in plain language in the user’s preferred channel and language.

• Robustness to noise and outages: tolerate motion artifacts, intermittent connectivity, and missing samples without losing continuity.

2.2. Wearable Band

Screenless, passive node. We omit a display and on‑device menus, making the band a passive data node rather than a mini‑phone. This avoids UI complexity, lowers cost, and improves battery life. Interaction is offloaded to the mobile messaging interface. Unlike prior screen-less bands that keep sensing and feedback locked together, we explicitly decouple sensing from explanation so the latter can evolve in the cloud without touching hardware.

Sensor set and sampling. With tight cost/power budgets, the band samples a small set of commodity sensors for wellness trends: PPG for pulse/SpO₂, skin temperature, and a 3‑axis accelerometer for activity/sleep. Sampling uses short fixed windows and simple on‑device filtering to suppress motion artifacts before BLE transfer. We intentionally avoid expensive sensors (e.g., ECG, GPS). Using commodity parts risks lower signal quality; the backend is explicitly engineered to tolerate such noise and maintain coverage.

BLE uplink. BLE is the sole radio for pairing and uploading to the phone. It fits our duty‑cycled sampling, reduces peak power, and keeps the BoM low versus Wi‑Fi/cellular (montanari2017ble). The firmware stages readings in a small FIFO and streams packets opportunistically to the background app when the phone is in range; otherwise, data is buffered locally until the next contact. Pairing uses authenticated BLE; payloads are sent over TLS via the app; device IDs are pseudonymous.

2.3. Background Companion Application

The app bridges the passive band and the cloud service. On first launch, the user provides a phone number and a passcode. The app handles BLE pairing, time sync, and background pulls. It caches outgoing data, retries on network loss, and resumes interrupted uploads to survive low‑connectivity settings. The app stores only pseudonymous tuples (device ID, coarse timestamp, feature vectors). It exposes a minimal UI for consent, data export/delete, and alert preferences (e.g., daily summaries, escalation contacts). Users can pause uploads without unpairing.

2.4. Back-End Orchestration & LLM Pipeline

At the core of Guardian Angel is an orchestrator that supervises all services. When a sensor packet or a WhatsApp message arrives, the orchestrator spawns three LLM agents: Sensor Data Analyzer, LLM Selector, and Prompter Analyzer, and hands them a shared context. It then provisions auxiliary tools requested by each agent: a retrieval augmented generation (RAG) interface backed by a medical literature index for clinical grounding, a cron-style scheduler for reminders, and an on-demand chart generator for trend visualization.

LLM-Driven Processing Stages. Although the orchestrator owns the control logic, the reasoning work is divided into three sequential stages executed by the agents it launches:

Stage 1 – Sensor Data Analyzer: Runs a deterministic routine to derive HR, SpO₂, temperature, and activity labels from each four-second sensor burst. Performing this aggregation step shortens prompts and flags outliers early.

Stage 2 – LLM Selector: Inspects request complexity and urgency, then routes the task to the cheapest model able to answer it: GPT-4o-mini for look-ups, o3-mini for multi-step trend reasoning, and o1 for safety-critical interpretation. Section 3.3.1 details the routing, and Section 4.2 quantifies the resulting $\sim$57% cost reduction.

Stage 3 – Prompter Analyzer (Context & RAG): Assembles a structured prompt comprising (i) fresh clinical features, (ii) user profile + history, (iii) the last dialogue turns, and (iv) explicit output instructions. It then performs a RAG call to inject peer-reviewed medical guidance before passing the enriched prompt to the model chosen in Stage 2. The returned answer may also request a chart or schedule a reminder, in which case the orchestrator dynamically invokes the data visualizer or scheduler tool.

3. Implementation

To support reproducibility, we release the wearable firmware/hardware, Android app, backend (WhatsApp server), and evaluation scripts in an open-source repo on GitHub (guardianangel_github).

3.1. Wearable Hardware and Firmware

The Guardian Angel wearable is built around an ATmega328P MCU (atmega328p_datasheet). The custom printed circuit board (PCB, Figure 2) hosts three off-the-shelf sensors: an ADXL345 accelerometer (adxl345_datasheet) for motion, a MAX30102 PPG module (max30102_datasheet) for HR and SpO₂, and a 10 k $\Omega$ NTC thermistor (ntcm10k_datasheet) for both ambient and skin temperature. Communication is handled by an HM-10 BLE module (hm10_datasheet), and power is supplied by a 2.7 Ah ER14505 Li-SOCl₂ primary cell (er14505_datasheet) chosen for longevity. Table 1 details the specific components and associated prototype costs.

To balance sensing fidelity with battery life, we adopted specific sampling rates: 34.4 Hz for the accelerometer (sufficient to capture gait granularity and activity types (khan2016optimising)), 31 Hz for the PPG sensor (adequate for reliable HR/SpO₂ waveform reconstruction (bui4182587lossless)), and 1 Hz for the slower-changing skin temperature. These rates allow data collection in the MCU’s 2 kB FIFO buffer, defining a 4-second acquisition window per cycle. We also implement a four-state loop to minimize power consumption: Reset $\rightarrow$ Scan (BLE advertising) $\rightarrow$ Collect (4-second sensor burst acquisition) $\rightarrow$ Transmit. After the Collect phase, the MCU powers down the sensors, activates the BLE module, and streams a 3.76 kB data packet at 9600 baud. A 3.5-second connection window plus a 1 ms inter-sample delay during transmission prevents BLE buffer overflows. The total transmission time per cycle is approximately 7.4 seconds. Between cycles, the radio enters a sleep state, targeting an average current budget below 500 $\mu$ A, and extending battery life. Once transmission is complete or the connection window expires, the device returns to the Scan state. Each BLE transmission contains timestamped sensor readings and metadata, structured as <ts, accel[128], ir[124], red[124], temp[4], CRC16> with authenticated pairing and pseudonymous device IDs.
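As a rough consistency check on the figures above, the following sketch recomputes the per-cycle transmission time and an idealized battery-life bound. It assumes 10 bits on the wire per byte (8N1 serial framing), 1 kB = 1000 bytes, and that the 3.5-second connection window simply adds to the serial transfer time; none of these assumptions is stated in the text. The battery figure is an upper bound that ignores self-discharge and pulse-current derating.

```python
# Back-of-envelope check of the duty-cycle numbers quoted in the text.
PACKET_BYTES = 3760          # 3.76 kB packet, assuming 1 kB = 1000 bytes
BAUD = 9600                  # serial rate between MCU and BLE module
BITS_PER_BYTE = 10           # assumption: 8N1 framing (start + 8 data + stop)
CONNECTION_WINDOW_S = 3.5    # connection window from the text

serial_time_s = PACKET_BYTES * BITS_PER_BYTE / BAUD
total_tx_s = CONNECTION_WINDOW_S + serial_time_s

BATTERY_MAH = 2700           # 2.7 Ah ER14505 primary cell
AVG_CURRENT_MA = 0.5         # 500 uA target average current budget
lifetime_h = BATTERY_MAH / AVG_CURRENT_MA
lifetime_days = lifetime_h / 24

print(f"serial transfer: {serial_time_s:.2f} s, total per cycle: {total_tx_s:.2f} s")
print(f"idealized lifetime at budget: {lifetime_h:.0f} h ({lifetime_days:.0f} days)")
```

Under these assumptions the total comes out close to the stated 7.4 seconds, which suggests the quoted figure is the connection window plus the serial transfer time.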

3.2. Android Companion App

The Android application uses the open-source FastBLE library (fastble) for BLE management, handling device scanning, pairing, and encrypted communication. After initial pairing, during which the band’s unique BLE UUID is stored, a background service maintains the connection and manages the incoming sensor data queue. The app’s sign-up screen (Figure 2) is deliberately minimal, collecting only the phone number and a password; the remaining information is collected through the conversational sign-up discussed in Section 3.5.

The raw ADC values are then processed with three filters: (i) a 3 Hz low-pass Butterworth filter to smooth the temperature readings; (ii) a 0.5–2.5 Hz band-pass filter to isolate the relevant frequencies in the PPG signals; and (iii) a 0.2 Hz high-pass filter to remove the gravitational component from the acceleration data.

The processed sensor data, along with any anomaly flags, is serialized into a JSON object and transmitted securely via HTTPS POST request to the /v1/sensors endpoint on the backend server at 4-minute intervals, except in the case of anomalies, which are transmitted immediately. The application is designed for minimal resource consumption, typically idling below 1% CPU usage and averaging around 6 MB of RAM.
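The upload policy described above (batched every 4 minutes, with anomalies sent immediately) can be sketched as follows. This is a minimal illustration, not the app's actual code: the JSON field names and the helper names `build_payload` and `should_upload_now` are hypothetical, and only the 4-minute interval and the anomaly bypass come from the text.

```python
import json

UPLOAD_INTERVAL_S = 4 * 60  # regular batches every 4 minutes (from the text)


def build_payload(device_id, window_start, features, anomaly_flags):
    """Serialize one processed window as JSON. All field names are hypothetical."""
    return json.dumps({
        "device_id": device_id,        # pseudonymous device ID
        "window_start": window_start,  # coarse timestamp of the window
        "features": features,          # filtered sensor data for the window
        "anomalies": anomaly_flags,    # empty list when nothing was flagged
    })


def should_upload_now(last_upload_ts, now_ts, anomaly_flags):
    """Return True when a batch is due or when an anomaly must bypass batching."""
    if anomaly_flags:
        return True
    return now_ts - last_upload_ts >= UPLOAD_INTERVAL_S


body = build_payload("dev-01", 1_700_000_000, {"temp_skin": [33.1, 33.2]}, [])
print(body)
print(should_upload_now(last_upload_ts=0, now_ts=60, anomaly_flags=["hr_spike"]))
```

The resulting string would form the body of the HTTPS POST to /v1/sensors; transport, retries, and authentication are omitted here.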

3.3. Back-End LLM Processing Pipeline

Figure 3 shows the end-to-end interaction flow, highlighting both the sensor-data path and the user-message path.

3.3.1. Adaptive Model Selection

High-performance LLMs (like o1) can incur significant inference costs, challenging scalability. To mitigate this, our system employs a tiered approach. Incoming user queries, received via the WhatsApp interface (right side of Figure 3) and pre-processed if necessary, are first classified based on their inferred complexity. This classification is performed by a lightweight GPT-4o model acting as a router (footnote 1: all experiments used pre-GPT-5 models; GPT-5 was released after the manuscript was prepared). Based on the router’s classification, the system dynamically selects and invokes the most suitable LLM, balancing cost and capability:

• Simple queries (e.g., greetings, basic data requests) are routed to GPT-4o-mini.

• Reasoning-based queries (e.g., summarizing trends, requiring multi-step thought) are directed to o3-mini.

• High-risk or complex queries (e.g., interpreting potentially urgent anomalies, complex context) are escalated to o1.

This dynamic routing strategy ensures that computational resources are used efficiently, significantly reducing overall operational costs (as evaluated in Section 4.2).
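The three-way routing above amounts to a lookup from the router's label to a model. The sketch below illustrates that mapping under stated assumptions: the label strings, the `Tier` enum, and the `select_model` helper are hypothetical, and the fallback of escalating unrecognized labels to the most capable tier is a conservative choice of ours rather than something the text specifies.

```python
from enum import Enum


class Tier(Enum):
    SIMPLE = "simple"          # greetings, basic data requests
    REASONING = "reasoning"    # trend summaries, multi-step thought
    HIGH_RISK = "high_risk"    # potentially urgent anomalies, complex context


# Tier-to-model mapping as described in the text.
MODEL_FOR_TIER = {
    Tier.SIMPLE: "gpt-4o-mini",
    Tier.REASONING: "o3-mini",
    Tier.HIGH_RISK: "o1",
}


def select_model(router_label: str) -> str:
    """Map the router's label to a model name.

    Assumption: labels the router emits outside the known set are escalated
    to the most capable tier instead of being dropped.
    """
    try:
        tier = Tier(router_label.strip().lower())
    except ValueError:
        tier = Tier.HIGH_RISK
    return MODEL_FOR_TIER[tier]


for label in ("simple", "Reasoning", "unrecognized"):
    print(label, "->", select_model(label))
```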

3.3.2. Adaptive Prompt Routing

A parallel prompt selection component determines the requisite level of prompt specificity for the conversational agent. Simple utterances (e.g., greetings) trigger minimal prompt templates, whereas comprehensive health summary requests automatically trigger more detailed instructions to be embedded in the system prompt. This contextual routing is critical in both conserving computational resources and minimizing model hallucination by providing the LLM with only necessary and actionable contextual information.

3.3.3. Structured Sensor-Data Analysis

Sensing streams arriving from the companion app traverse a two-stage pipeline:

Interpretation Module. Low-cost personal health monitoring suffers from fragmented, intermittent data. Conventional pipelines drop noisy segments to preserve precision, which breaks continuity and undermines trust. We integrate an LLM as a pragmatic, instruction-following estimator over raw multi-modal waveforms. The model exploits correlations between accelerometer signatures and concurrent PPG artifacts to infer a plausible physiological state from segments a rule-based system would discard. The result is a coverage-first (continuity under field noise, distinct from accuracy), readable physiological stream that supports user trust and health literacy.

For each 4 s burst, raw time series from IR/Red PPG (31 Hz; 124 points per channel), the 3-axis accelerometer (34 Hz; 408 points across $a_{x},a_{y},a_{z}$ ), and wrist/ambient temperature (1 Hz; 4 points per sensor) are passed to GPT-4o-mini whose parameters are set as follows: temperature=1.3, top-p=0.8.

Unlike feature detectors that search for landmarks (e.g., systolic peaks in clean waveforms), the LLM acts as a sequence-to-values regressor. It processes the full multi-modal window, including accelerometer-linked noise in the PPG, and maps waveform shape to a plausible physiological estimate. Pre-training on diverse sequential data helps it see through noise that defeats deterministic algorithms. The prompt (Listing 3.3.3) directs the model to return ‘the HR and SpO₂ values you think it represents’ and to ‘focus on outlying data’, framing the task as informed estimation rather than exact feature extraction. The model outputs strict JSON: {hr, spo2, activity, activity_verbose, temp_body, temp_ambient}, enabling a stream of numerical results validated in Section 4.1.
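Because downstream stages consume the interpreter's reply as data, a defensive parse of the strict-JSON output is a natural companion to the prompt. The sketch below checks for the six keys named above; the helper name `parse_interpreter_output` and the requirement that the four vital-sign fields be numeric are our assumptions, not details given in the text.

```python
import json

# Keys of the interpreter's JSON output, as listed in the text.
REQUIRED_KEYS = {
    "hr", "spo2", "activity", "activity_verbose", "temp_body", "temp_ambient",
}
NUMERIC_KEYS = ("hr", "spo2", "temp_body", "temp_ambient")  # assumption


def parse_interpreter_output(raw: str) -> dict:
    """Parse the model reply; raise ValueError if it is not the expected object."""
    data = json.loads(raw)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    missing = REQUIRED_KEYS - data.keys()
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    for key in NUMERIC_KEYS:
        value = data[key]
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise ValueError(f"{key} must be numeric")
    return data


reply = ('{"hr": 72, "spo2": 97, "activity": "sit", '
         '"activity_verbose": "seated, little wrist motion", '
         '"temp_body": 33.4, "temp_ambient": 27.0}')
print(parse_interpreter_output(reply)["hr"])
```

The sample reply is invented for illustration and is not taken from the study data.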

This end-to-end conversion builds on prior work (xie2025physllmharnessinglargelanguage; time-series-linguistic-scaffolding; feli2025llmpoweredagentphysiologicaldata). Although internal mechanisms are opaque, we hypothesize that broad sequential pre-training enables direct interpretation of raw physiological waveforms as an emergent capability. Accordingly, our evaluation focuses on empirical comparisons with conventional algorithms (Section 4.1).

Reasoning and Urgency Module. Extracted features are fused with longitudinal trends, demographics, and user-defined thresholds to generate (i) a concise health summary and (ii) an URGENCY flag (urgent/not urgent). Urgent messages undergo an additional quality-control pass before delivery; corresponding events are persisted as long-term memory alongside the vital-sign archive.

Together, these layers convert noisy sensor bursts and free-form chat into timely, personalised coaching, closing the loop from low-cost data acquisition to actionable insight.

3.4. Dynamic Contextualization

To anchor each reply in the user’s evolving context, the backend assembles a bespoke prompt immediately before every LLM call. The prompt embeds:

• Conversational History: Ensuring context continuity across interactions.

• Long Term Memory: Past events or medical alerts previously tagged as urgent to inform future interventions.

• Recent Health Metrics: To set physiological context.

• Personal Profile Data: User-specific information (name, age, BMI, medical background, demographic, phone number) to enable database lookup and contextual association.

• Temporal Context: All queries and sensor records are furnished with their respective timestamps.

• Task Relevant Instructions: To shape the nature and breadth of agent responses.

• Output Structure Specifications: System prompts require all LLM output to conform to a structured JSON schema (e.g., PERSONAL, IMAGE, URGENCY, RESPONSES, QUESTIONS). Schema enforcement is managed via supported API configurations.

This dynamic scaffolding yields responses that are personalised, temporally grounded, and traceable to the user’s longitudinal record, crucial for safe, trustworthy health guidance.
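One way to picture the prompt assembly is as a fixed sequence of labelled sections built fresh for each call. The sketch below is illustrative only: the function `build_prompt`, the section titles, and the plain-text layout are hypothetical, while the seven kinds of embedded context and the output keys follow the list above.

```python
import json
from datetime import datetime, timezone


def build_prompt(profile, history, long_term_memory, recent_metrics,
                 task_instructions, now=None):
    """Assemble a per-call prompt from the context sources listed in the text."""
    now = now or datetime.now(timezone.utc)
    sections = [
        ("TASK INSTRUCTIONS", task_instructions),
        ("CURRENT TIME", now.isoformat()),
        ("PERSONAL PROFILE", json.dumps(profile)),
        ("RECENT HEALTH METRICS", json.dumps(recent_metrics)),
        ("LONG TERM MEMORY", "\n".join(long_term_memory) or "(none)"),
        ("CONVERSATION HISTORY", "\n".join(history) or "(none)"),
        ("OUTPUT FORMAT",
         "Reply as JSON with keys PERSONAL, IMAGE, URGENCY, RESPONSES, QUESTIONS."),
    ]
    return "\n\n".join(f"## {title}\n{body}" for title, body in sections)


prompt = build_prompt(
    profile={"name": "A.", "age": 24},
    history=["user: how was my sleep?"],
    long_term_memory=[],
    recent_metrics={"hr": 71, "spo2": 98, "ts": "2025-01-01T08:00:00Z"},
    task_instructions="Summarize the last night in plain language.",
    now=datetime(2025, 1, 1, 9, 0, tzinfo=timezone.utc),
)
print(prompt)
```

Keeping the sections in a fixed order makes it easier to trim or expand individual parts, which is in the spirit of the adaptive prompt routing in Section 3.3.2.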

3.5. WhatsApp Interface Design

End-users interact almost exclusively through WhatsApp (top right of Figure 3), which supports multilingual communication at scale. The backend handles five interaction facets:

• Input Processing: The backend exposes webhooks for incoming messages of type text, audio, and button. Audio messages are processed by retrieving the media file, transcribing it via the OpenAI Whisper API, and handling the transcript like any other text message.

• Conversational Sign-up: When users sign up on the companion app, they are automatically sent a welcoming info message, which then prompts them to send personal information for contextual responses.

• Output Formatting: Agent-generated output arrays (under RESPONSES) are broken into sequential WhatsApp messages; line breaks are explicitly handled for enhanced naturalness.

• Interactive Question Handling: Follow-up questions, sometimes generated by the agent (i.e., QUESTIONS), are presented to the user with WhatsApp interactive buttons, affording high engagement and rapid iterative querying.

• Visual Data Delivery: If the agent requests a visualization, the image file, generated through a server-side pipeline, is uploaded via WhatsApp’s media API, accompanied by explanatory message content.

This design keeps all functionality within a familiar chat environment, eliminating the need for an app beyond one-time device sign-up.
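The output-formatting step described above can be sketched as a small transformation from the agent's JSON reply to a list of outgoing messages. This is a hedged illustration: the helper `to_whatsapp_messages` is hypothetical, and treating "line-break handling" as the conversion of literal backslash-n sequences into real newlines is our assumption about what that step involves. Sending the messages through the WhatsApp API is out of scope here.

```python
import json


def to_whatsapp_messages(agent_output: str) -> list[str]:
    """Split the RESPONSES array into one outgoing message per entry."""
    payload = json.loads(agent_output)
    messages = []
    for item in payload.get("RESPONSES", []):
        # Assumption: models sometimes emit a literal "\n" instead of a newline.
        text = str(item).replace("\\n", "\n").strip()
        if text:  # skip empty entries so no blank messages are sent
            messages.append(text)
    return messages


sample = json.dumps({
    "URGENCY": "not urgent",
    "RESPONSES": ["Your heart rate looks steady today.", "  ", "Line one\\nLine two"],
})
for message in to_whatsapp_messages(sample):
    print(repr(message))
```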

3.6. Agentic Abstractions and Tool Integration

Built on LangChain, our agent layer endows the LLM with concrete tool affordances that turn chat into action:

• Task Scheduling: Scheduling is orchestrated through an external job service (cron-job.org) and an internal metadata registry. Callbacks trigger server-side actions at scheduled times, supporting use cases such as medication or check-in reminders. Some tasks are scheduled by the system on sign-up, such as automated summaries and reminders if no data has been collected for a defined interval.

• Data Visualization: The agent can invoke a standardized plotting API to generate data visualizations.

• On-Demand Data Retrieval: LLM agents can programmatically poll for the latest vital signs, user metadata, or timestamp context, facilitating grounded responses and reducing hallucination risk.

• RAG: The architecture allows for expansion with a vector database (Pinecone) supporting retrieval-augmented workflows. Using multi-query retrieval, the system can synthesize knowledge both from general medical resources and user-uploaded documents, reinforcing the agent’s factual grounding and personalization for wellness education.

These modular tools underpin robust planning, multimodal feedback, and fact-checked explanations, delivering a richer, more interactive wellness experience.

4. Evaluation

To assess the system’s feasibility, effectiveness, and potential impact, particularly in resource-constrained settings, our evaluation spans three areas: end-to-end system robustness, cost-efficiency and response quality, and user experience and impact. Guided by this scope, our evaluation focuses on where leverage is highest: the interpretation layer. In our target settings, success hinges on delivering continuous, comprehensible feedback at very low cost, not on shaving marginal analog tolerances. The interpretation layer (i) stays robust to commodity-part variance and wear/fit effects, (ii) turns noisy bursts into uninterrupted streams that sustain user trust, and (iii) transfers across devices and supply chains; fine-grained hardware micro-benchmarks, while valuable, are orthogonal to these deployment drivers and thus deliberately excluded from evaluation as they operate at a different level of analysis.

4.1. Evaluating End-to-End System Robustness

We quantitatively compare an LLM with conventional signal processing for deriving HR, SpO₂, and activity levels on a benchmark dataset. Key metrics are activity accuracy, mean absolute error (MAE) for HR and SpO₂, and the fraction of segments with valid HR/SpO₂ estimates (data availability) under noisy PPG.

4.1.1. Methodology

We use the PhysioNet Pulse Transit Time PPG dataset (ppg-dataset), which provides three recordings per subject for 22 healthy subjects with time-aligned reference SpO₂, HR, and IMU signals (iHealth Air pulse oximeter, OMRON monitor). The dataset employs the MAX30101 PPG sensor, which belongs to the same MAX3010x family as the MAX30102 used in Guardian Angel. Original signals (PPG at 1000 Hz, IMU at 500 Hz) are downsampled via linear interpolation to our device rates: 31 Hz for IR/Red PPG and 34 Hz for the tri-axial accelerometer. This choice ensures sensor-family parity (MAX3010x) and trusted references, so accuracy numbers reflect algorithmic differences rather than mismatched optics or weak ground truth. We target feasibility under commodity, low-rate operation; results should be read in this deployment context rather than as SOTA claims. We compare two processing pipelines:

Conventional Algorithmic Approach.

• HR/SpO₂ estimation: A widely used open-source implementation (hrcalc), representative of typical signal-processing chains (soriano2022design; singh2023measurement). Steps include DC removal, filtering, IR peak detection, and AC/DC analysis on IR. Internal quality checks suppress outputs when signal criteria are not met.

• Activity classification: An LSTM (50 units) trained on MotionSense (Malekzadeh:2019:MSD:3302505.3310068) following prior work (har-survey; zhao2017deepresidualbidirlstmhuman; odhiambo2022humanactivityrecognitiontime). Input is tri-axial accelerometer (x, y, z) only; classes are {sit, walk, run}; early stopping is used.

LLM-Based Approach (Guardian Angel). Each 4 s window is processed with the GPT-4o-mini interpreter described in Section 3.3.3, with no additional prompt or parameter tuning during evaluation. We treat the model as a practical, instruction-following regressor that returns best-effort outputs for all windows.
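The downsampling step mentioned at the start of this subsection (from the dataset's native rates to the device rates via linear interpolation) can be sketched in a few lines. The function `resample_linear` is a hypothetical stand-in for whatever the evaluation scripts use; it applies no anti-aliasing filter, and whether the original pipeline does is not stated in the text.

```python
def resample_linear(samples, src_hz, dst_hz):
    """Resample uniformly spaced samples to dst_hz by linear interpolation."""
    if len(samples) < 2:
        return list(samples)
    duration_s = (len(samples) - 1) / src_hz
    n_out = int(duration_s * dst_hz) + 1
    out = []
    for k in range(n_out):
        pos = k * src_hz / dst_hz             # fractional index into the source
        i = min(int(pos), len(samples) - 2)   # left neighbour, clamped at the end
        frac = pos - i
        out.append(samples[i] * (1 - frac) + samples[i + 1] * frac)
    return out


# A 4 s window at 1000 Hz (synthetic ramp) resampled to the 31 Hz device rate.
window_1khz = [float(n) for n in range(4000)]
window_31hz = resample_linear(window_1khz, src_hz=1000, dst_hz=31)
print(len(window_31hz))
```

For a 4 s window this yields 124 points, matching the per-channel PPG length quoted in Section 3.3.3.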

Performance is measured using:

• MAE for HR (BPM) and SpO₂ (% points) against dataset references.

• Activity accuracy from the confusion matrix.

• Data availability for the conventional HR/SpO₂ pipeline, defined as the percentage of segments yielding valid numeric outputs. The LLM returned numeric estimates for all segments in our runs.
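The three metrics can be computed as in the sketch below. It assumes a suppressed output is represented as `None` and that MAE is averaged only over segments with a valid output, which matches the footnote to Table 2; the function names and the toy numbers are ours and are not study data.

```python
def mae_and_availability(predictions, references):
    """Return (MAE over valid outputs, availability in percent).

    A prediction of None stands for an output suppressed by quality checks.
    """
    valid = [(p, r) for p, r in zip(predictions, references) if p is not None]
    availability = 100.0 * len(valid) / len(predictions)
    mae = sum(abs(p - r) for p, r in valid) / len(valid) if valid else float("nan")
    return mae, availability


def accuracy(predicted_labels, true_labels):
    """Fraction of correct labels in percent (the confusion-matrix diagonal)."""
    correct = sum(p == t for p, t in zip(predicted_labels, true_labels))
    return 100.0 * correct / len(true_labels)


# Toy example with invented values.
hr_pred = [70, None, 80, 90]
hr_ref = [72, 75, 80, 86]
print(mae_and_availability(hr_pred, hr_ref))
print(accuracy(["sit", "walk", "run", "run"], ["sit", "walk", "walk", "run"]))
```

Because the conventional pipeline's MAE excludes the segments it suppressed, its error and availability figures should be read together.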

4.1.2. Results and Discussion

Table 2 summarizes results on 1003 traces. The conventional pipeline produced valid HR/SpO₂ outputs for 70.29% of segments. The LLM achieved lower error and full availability: HR MAE 11.96 BPM vs 22.49 BPM and SpO₂ MAE 1.39 % vs 2.30 %. Activity accuracy was modest for both accelerometer-only methods, with a small LLM gain (38.48% vs 32.80%).

Table 2. Performance metrics (N = 1003 traces). *MAE for the conventional algorithm computed on segments with valid outputs.

Metric	Conventional	LLM (GPT-4o mini)
Heart Rate MAE (BPM)	22.49*	11.96
SpO₂ MAE (%)	2.30*	1.39
HR/SpO₂ Data Availability (%)	70.29	100.00
Activity Classification Accuracy (%)	32.80	38.48

PPG-Derived Metrics. LLM estimates for HR and SpO₂ are both more accurate and consistently available. Figure 4 shows the average error by subject for both metrics; the algorithm’s inability to adapt to irregular data leads to a noticeably larger error for certain subjects (S11, S13, S14, and S15) and exposes the limits of its deterministic design. Error densities in Figure 5 are sharply peaked near zero for the LLM, indicating more frequent low-error predictions.

The largest gap is availability. The conventional MAX30101 pipeline, sensitive to motion and strict quality checks, emits valid HR/SpO₂ values for fewer segments, which limits continuous monitoring. The LLM produced estimates for all segments in this evaluation, enabling uninterrupted feedback.

Accelerometer-Derived Metric. With accelerometer-only inputs, both methods achieve modest activity accuracy. The baseline LSTM shows a bias toward the run class, often confusing walk and sit. The LLM yields more balanced predictions and higher recall for walk. Future gains likely require additional motion cues such as gyroscope data or improved feature extraction.

Overall, the LLM delivers lower HR and SpO₂ error and full availability, with a small advantage in activity recognition, which makes it better suited for continuous monitoring on low-cost wearables.

4.2. Cost-Efficiency and Response Quality

As detailed in Section 3.3.1, Guardian Angel employs an adaptive, tiered model selection strategy to optimize inference costs while maintaining response capability. We evaluate the effectiveness of this strategy using a set of 30 diverse user queries, analyzing both cost reduction and response quality preservation compared to a baseline using only the high-performance OpenAI o1 model.

4.2.1. Methodology

Cost. To compare costs, we approximate the token count for each query using OpenAI’s rule-of-thumb heuristic ( $\text{Tokens}\approx\text{length}(\text{query})/4$ ) (openai_token_estimate). The processing cost for both the tiered system (including router overhead) and the baseline was then computed based on the models used and their input token prices (Table 3), using:

$$\text{TotalCost}(q)=\sum_{m\in M_{q}}\left(\frac{\text{Tokens}_{m}(q)}{1000}\times\text{Price}_{m,\text{1K}}\right)\qquad(1)$$
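Equation (1) translates directly into code. The sketch below uses the per-1K-token prices from Table 3 and the length/4 token heuristic; it assumes every model in $M_{q}$ sees the same query text, and the function names are hypothetical. The router is left out of the example call because the text does not pin down which price applies to it.

```python
# USD per 1K input tokens, copied from Table 3.
PRICE_PER_1K = {
    "gpt-4o-mini": 0.00015,
    "gpt-3.5-turbo": 0.001,
    "o3-mini": 0.00125,
    "o1": 0.015,
}


def estimate_tokens(query: str) -> float:
    """Rule-of-thumb token estimate: roughly one token per four characters."""
    return len(query) / 4


def total_cost(query: str, models) -> float:
    """Equation (1): sum over the models invoked for this query (M_q)."""
    tokens = estimate_tokens(query)  # assumption: same token count for each model
    return sum(tokens / 1000 * PRICE_PER_1K[m] for m in models)


query = "x" * 400  # a 400-character query, about 100 tokens
print(total_cost(query, ["o1"]))            # o1-only baseline
print(total_cost(query, ["gpt-4o-mini"]))   # tiered path for a simple query
```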

Response quality. To assess perceived quality, we generated two responses for each of the 30 queries: one from the model selected by our tiered system (Response A) and one from the o1 baseline (Response B). Five independent users blindly compared these response pairs and selected their preferred one for each query, yielding 150 total comparisons. This is an exploratory, preference-only assessment (no hypothesis testing). Baseline responses were generated under a matched context.

4.2.2. Results and Discussion

The adaptive model selection strategy demonstrated estimated benefits in both cost and perceived quality.

Cost reduction. As summarized in Table 4, the tiered system achieved a 56.57% relative reduction in inference costs compared to the o1-only baseline. Figure 7 shows the tiered system’s per-query cost consistently remaining below the baseline, particularly for simpler queries that can be served by less capable models. This supports the value of the tiering strategy for economic efficiency, especially in resource-constrained deployments.

Response quality. Importantly, the cost savings did not come at the expense of perceived quality. In the blind user preference tests involving 150 comparisons, responses generated by the tiered model selection system were preferred 63.33% of the time over those from the o1-only baseline. To keep the comparison fair, the o1 baseline was given the user’s sensor data and other relevant metadata as context. This suggests that the tiered approach reduces costs significantly while maintaining user satisfaction with the generated responses.
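
The preference share follows directly from the vote counts. As a small sketch (the `preference_share` helper and the vote encoding are illustrative assumptions, not the authors' analysis script), 63.33% of 150 comparisons corresponds to 95 votes for the tiered response:

```python
from collections import Counter


def preference_share(votes: list[str], label: str = "A") -> float:
    """Return the percentage of blind comparisons in which `label` was preferred."""
    counts = Counter(votes)
    return 100 * counts[label] / len(votes)


# 5 raters x 30 queries = 150 comparisons. The 95/55 split is inferred from the
# reported 63.33% (95 / 150); the ordering of votes here is arbitrary.
votes = ["A"] * 95 + ["B"] * 55
print(f"{preference_share(votes):.2f}% preferred the tiered response")
```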

These results support adaptive model selection as a promising strategy, yielding significant cost savings while preserving, and often improving, the perceived quality of responses.

Table 3. Token cost.

Model	Price per 1K input tokens (USD)
GPT-4o-mini	0.00015
GPT-3.5-turbo	0.001
o3-mini-2025-01-31	0.00125
o1-2024-12-17	0.015

Table 4. Inference cost.

Metric	Value (USD)
Total cost (tiered system)	0.0024481
Total cost (o1-only)	0.0056363
Total savings	0.0031882
Relative reduction	56.57%
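
As a quick consistency check on the values in Table 4, the savings and the relative reduction follow from the two totals:

$$\text{Savings}=0.0056363-0.0024481=0.0031882\ \text{USD},\qquad\text{Relative reduction}=\frac{0.0031882}{0.0056363}\approx 0.5657=56.57\%.$$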

4.3. User Study

To evaluate Guardian Angel’s real-world feasibility, usability, and health data comprehension, we ran a 96-hour (4-day) micro-longitudinal study.²

² Due to IEEE PerCom’s strict policy, we could not include interview protocols, prompts, and code snippets in the appendix. We released detailed documentation through an anonymized GitHub repository (guard_code).

Phase 1 enrolled 20 university students (11 male, 9 female; age 19–29, M=21.78, SD=2.46; 50% prior wearable experience) and produced 4,644 logged interactions. Hardware limits (five functional prototypes; Figure 8) and IRB timelines required a tightly scoped deployment, consistent with iterative HCI practice using early discount studies to surface usability issues before larger trials (nielsen1994guerrilla; scholtz2004framework). This convenience sample was intentionally homogeneous to isolate interface effects; generalizability is future work.

Participants were digitally skilled yet showed low health literacy typical of LMIC settings; pre-study surveys indicated uncertainty in interpreting personal physiological data (babalola2022health; lee2025challenges). The design aimed to isolate the effect of a WhatsApp-delivered LLM on comprehension while holding digital literacy constant. Positive feasibility results motivate an IRB-approved Phase 2 with 60 participants across three age and literacy strata, following a staged deployment model suited to LMIC contexts (perrins2024health).

4.3.1. Study Protocol and Data Collection

Participants completed a baseline survey and semi-structured interview covering demographics, technology habits, attitudes toward AI and health tracking, and self-rated understanding of health metrics. We built five identical devices (Figure 8) and deployed them in sequential batches of five participants. During the 96-hour study, participants wore a device and interacted with the LLM over WhatsApp. System logs captured interaction frequency, timing, query type, and wear time. After deployment, participants completed a follow-up survey and interview on attitudes, understanding, self-reported behaviors, usability, and perceived impact. All procedures used informed consent and anonymization. Pre/post Likert responses (5-point) were analyzed as paired data; interviews underwent thematic analysis.

4.3.2. System Engagement and Usage

Interaction logs indicate strong engagement. Follow-up questions were suggested on average 22 times per participant, with 12% uptake. Users interacted in bursts (mean 9.1 sessions/day, SD=2.7). Figure 9 shows user-initiated messages (blue) and automated summaries (green). Some participants experienced peripheral issues that reduced message flow, including a faulty Bluetooth link (P20) and a battery failure (P14). Others scheduled additional summaries (P1, P17, P18, P19) or turned summaries off in favor of ad-hoc queries (P7, P11, P12). Wear adherence averaged 7.3 hours/day (SD=1.8), mainly during waking hours.

Users most often requested HR and SpO₂, temperature, and activity information, along with on-demand graphs and guidance from the agent. Table 5 summarizes user and system activity across all 20 participants.

Table 5. Summary of interactions during the 96-hour study (N=20).

Category	Interaction Type / Detail	Count / Value
User-Initiated	Voice Messages Sent	84
	Text Messages Sent	554
System-Generated	Scheduled Summaries Delivered	2,030
	Recommendations / Suggestions Given	1,976
Overall Metric	Total Logged Interactions	4,644

4.3.3. Quantitative Survey Results

Pre/post surveys show positive shifts (Table 6). The largest gains were in accessibility of health data (+1.85), and mindfulness of HR (+1.70) and SpO₂ (+1.60). Additional improvements include trust in AI (+0.55), importance of personalized insights (+0.85), mindfulness of movement (+0.85), and health data comprehension (+0.55). Results suggest short-term use increased awareness, perceived value, and the ability to interpret personal health information through the conversational interface.

Table 6. Pre- vs. post-deployment survey results.

Measure	Pre (M $\pm$ SD)	Post (M $\pm$ SD)	$\Delta$
Device-Related Attitudes
Trust in AI	3.25 $\pm$ 0.91	3.80 $\pm$ 0.77	+0.55
Perceived Usability	4.15 $\pm$ 1.09	4.30 $\pm$ 0.73	+0.15
Importance of Personalized Insights	3.00 $\pm$ 1.12	3.85 $\pm$ 0.88	+0.85
Accessibility of Health Data	2.15 $\pm$ 1.14	4.00 $\pm$ 0.65	+1.85
Personal Changes
Proactivity (Self-Reported Behavior Change)	2.90 $\pm$ 1.37	3.00 $\pm$ 1.38	+0.10
Mindfulness of Daily Movement & Activity	3.10 $\pm$ 1.52	3.95 $\pm$ 0.95	+0.85
Mindfulness of Heart Rate (HR)	2.15 $\pm$ 1.23	3.85 $\pm$ 0.81	+1.70
Mindfulness of Blood Oxygen (SpO₂)	1.80 $\pm$ 1.15	3.40 $\pm$ 1.10	+1.60
Mindfulness of Body Temperature	2.25 $\pm$ 1.16	3.45 $\pm$ 1.05	+1.20
Self-Rated Physical Activity Level	2.85 $\pm$ 0.99	3.05 $\pm$ 0.95	+0.20
Health Data Comprehension	3.05 $\pm$ 1.00	3.60 $\pm$ 1.19	+0.55
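
To make the table's columns concrete, the following sketch shows one way the per-measure mean, standard deviation, and $\Delta$ could be computed from paired 5-point responses. The responses below are made-up placeholders, not study data, and `summarize_paired` is a hypothetical helper name. The use of the sample standard deviation is also an assumption, since the text does not state which estimator was used.

```python
from statistics import mean, stdev


def summarize_paired(pre: list[int], post: list[int]) -> dict[str, float]:
    """Summarize paired Likert responses: pre/post mean and SD, and mean change."""
    if len(pre) != len(post):
        raise ValueError("pre and post must contain one response per participant")
    diffs = [b - a for a, b in zip(pre, post)]
    return {
        "pre_mean": mean(pre),
        "pre_sd": stdev(pre),      # sample SD (n - 1); an assumption
        "post_mean": mean(post),
        "post_sd": stdev(post),
        "delta": mean(diffs),      # equals post_mean - pre_mean for paired data
    }


# Placeholder responses for five hypothetical participants (NOT study data).
pre = [2, 3, 1, 2, 3]
post = [4, 4, 3, 4, 5]
s = summarize_paired(pre, post)
print(f"Pre {s['pre_mean']:.2f} ± {s['pre_sd']:.2f}, "
      f"Post {s['post_mean']:.2f} ± {s['post_sd']:.2f}, Δ {s['delta']:+.2f}")
```

Keeping responses paired per participant means $\Delta$ is the mean of individual changes, which is what a paired analysis would build on.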

4.3.4. Qualitative Insights from Interviews

Thematic analysis adds context:

• Awareness and action. Participants reported greater awareness of physiological states and valued concise, actionable guidance (for example, hydrate, move, rest).
• Conversational usability. WhatsApp reduced learning burden and supported natural language, voice input, and code-switching; brief prompt acclimation was common and short-lived.
• Feature value. On-demand graphs and personalized reminders were frequently cited as useful and engaging.

4.3.5. Design Principles for Resource-Constrained Wellness Systems

To evaluate the feasibility, user acceptance, and behavioural impact of delivering an LLM-driven wearable via WhatsApp, we conducted post-study interviews. Grounded in our 1,920 participant-hour field study, we distill three design principles that generalize beyond Guardian Angel:

(P1) Ease of Use. Participants praised WhatsApp’s familiarity: it required no new app learning, integrated seamlessly with existing notifications, and accepted mixed-language input.

Fifteen respondents echoed this view. They liked rich, daily summaries yet preferred concise replies to on-demand questions.

Flexible language support further smoothed interaction.

(P2) Perceived Value. Users valued actionable nudges over lengthy coaching, but wanted accuracy to sustain trust.

A few preferred broad guidance that encouraged reflection instead of prescriptive detail.

(P3) Health Awareness. Before the study, many relied on “gut feeling.” Continuous data often contradicted those instincts and prompted reflection.

Data helped users connect metrics to concrete actions and spot sedentary routines.

Outlook. Most participants expressed willingness to keep using the system if sensors and feedback were refined.

Together, these insights show that a WhatsApp-based, LLM-mediated wearable can lower engagement barriers, build situational trust, and foster healthier habits with minimal user effort.

5. Related Work and Discussion

The convergence of mobile devices, wearables, and AI has catalysed digital health for physiology monitoring, behaviour support, and tailored interventions. We focus on two strands most relevant to ours: digital behaviour-change interventions and passive sensing via mobile or wearable devices.

Digital interventions for health behaviour change. Digital behaviour-change interventions (DBCIs) scale widely and cost-effectively, yet sustaining engagement remains difficult (Lee2024). Predictors such as self-efficacy explain only part of participation variance (Lee2024). Chatbots add interactivity and have improved diet in young adults (Ashton2023) and physical activity in older adults (wiratunga2020fitchatconversationalartificialintelligence). Design factors such as dialogue style, barrier reduction, and social support shape acceptance; co-creation is especially important for older adults (Ashton2023; wiratunga2020fitchatconversationalartificialintelligence). Voice interfaces can motivate use but need careful conversational design (wiratunga2020fitchatconversationalartificialintelligence). Evidence is promising (Arakawa2024; Yang2024), although superiority over simpler modalities for long-term change is unproven. Delivery mode matters: a high-engagement SMS programme for community health workers in Vietnam was feasible but did not improve knowledge, indicating that information push alone is insufficient (article).

Mobile and wearable sensing for health monitoring. Passive sensing offers continuous, low-burden tracking outside clinics. For example, fusing smartphone usage with wearable signals improves sleep detection in the wild (martinez2020improved), and privacy-aware mHealth platforms are emerging for broader populations (Fernandes2024). Key challenges include throughput, user burden, and analytics to fuse heterogeneous streams (10.1145/3675094.3677583; Fernandes2024). Recent PerCom evidence shows that data interruptions and charging behavior materially impact real-world datasets (MartinezMMNCDS20). Non-invasive blood-pressure work illustrates the trend toward deep learning: FewShotBP adapts to individuals with one-tenth the labelled data needed for conventional fine-tuning by combining multimodal features with physiological priors (10.1145/3610918).

Overall, gaps persist in sustained engagement (Lee2024; Ashton2023), theory-grounded interventions (Ashton2023; wiratunga2020fitchatconversationalartificialintelligence), effects beyond information delivery (article), and personalised models from limited real-world data (10.1145/3610918; 10.1145/3675094.3677583). Guardian Angel addresses these gaps by coupling ultra-low-cost, screenless wearables with an LLM-powered WhatsApp agent, targeting affordability, digital-literacy barriers, interface complexity, and data comprehension in low-resource settings. The contribution lies in the integration, not isolated components.

Outlook. Our chosen LLM covers general health topics but lacks domain depth and adds serving cost with reliance on a third-party provider. Future work will evaluate lower-cost, open-source and sensor-language foundation models (zhang2025sensorlmlearninglanguagewearable), edge- or on-device–optimized variants, and domain-specific options (e.g., Med-PaLM) when accessible and economical for this use case. Lightweight personalization, such as small on-device adapters or profiles, is a priority to tailor guidance without increasing cost.

The hardware prioritizes low cost but leaves room for refinement. Future iterations may add sensors such as a gyroscope for richer activity features or electrodermal activity for stress, with careful budgeting for cost and battery life. Guardian Angel is a personal wellness aid, not a medical instrument; its values are not clinical-grade and must not be used for diagnosis or treatment. The primary aim is awareness through trends and conversational feedback. Based on user feedback, we will iterate on materials and attachment methods for comfortable day–night wear and improved durability (e.g., dust and water resistance).

6. Conclusion

Guardian Angel demonstrates that pairing screenless sensing with a familiar chat interface can meaningfully extend wearable health technology in LMICs. By co-designing a simplified edge with an adaptive, WhatsApp-based LLM backend, it tackles three entrenched barriers: device cost, limited digital literacy, and the cognitive load of raw metrics. We intentionally favor a continuous physiological narrative over clinical-grade precision: where conventional pipelines drop noisy segments, Guardian Angel sustains uninterrupted, comprehensible feedback, turning failure points into engagement. Our exploratory field deployment showed sustained use and clearer understanding of health metrics, suggesting a practical path from inexpensive hardware to a trusted wellness aid.

Component	Role	USD
ADXL345	Accelerometer	1.35
MAX30102	Pulse oximeter	1.62
ATmega328P	Microcontroller	2.42
HM-10 BLE	BLE module	2.85
ER14505 battery	Power (2.7 Ah)	1.78
Misc. passives	Caps/resistors	0.59
Custom PCB	Circuit board	2.11
Total		12.72