
An Automatic Scaling Solution for LLM Inference Services Based on Knative

This article describes an automatic scaling solution for LLM inference services based on Knative.

By Xiangxian and Yuanyi

1. Background

In the era of AI commercialization, model inference will see far wider use than model training, because the core of commercializing a technology is its broad application value. It is foreseeable that model inference will become the main battlefield in the future.

However, the main challenge of large-scale model inference lies in balancing cost and performance during large-scale deployment, and cost is the most critical factor. As the scale of large models continues to expand, so do the computing resources they require, and the scarcity and high price of GPU resources keep driving up the cost of each inference. End users are willing to pay only for value, not for the high cost of inference. Therefore, reducing the cost of each inference is the top priority of the infrastructure team.

At the same time, performance is a key competitive advantage, especially in the ToC field. Faster inference and better inference results are the keys to enhancing user stickiness and experience. Therefore, when optimizing inference services, we need to reduce costs while continuously improving performance and efficiency to keep up with changing market and user demands.

Because GPUs are scarce, most teams reserve a fixed number of GPU instances sized for peak traffic when serving large language model inference. This approach ensures service reliability, but it also wastes a significant amount of resources. In online inference scenarios, traffic fluctuates sharply between peak and off-peak hours. For example, chatbots receive less traffic at noon and at night, leaving massive GPU resources idle. Unlike traditional CPU workloads, GPU workloads scale with much more uncertainty and their resources are not easily released, which further aggravates the waste.

However, inspired by practices shared by large-model inference service providers, we can free up GPU resources for model training and offline inference during low-traffic periods to effectively improve resource utilization. This method is called "reverse automatic scaling". Therefore, this article does not discuss the traditional automatic scaling of compute nodes; instead, it focuses on maximizing resource utilization through workload automatic scaling and resource preemption on a fixed number of nodes. However, due to the characteristics of LLMs, implementing automatic scaling while meeting low-latency requirements is not easy. The following three prerequisites must be met:

  1. A scientific and reasonable automatic scale-out mechanism
  2. Computing resource guarantee during scale-out
  3. SLA user-experience guarantee during the scale-out process

A Scientific and Reasonable Automatic Scale-out Mechanism

Kubernetes provides HPA (Horizontal Pod Autoscaler) and CronHPA to scale the number of instances on demand and reduce resource waste. However, traditional HPA suffers from scaling lag and lacks sufficient support for common traffic-burst scenarios. CronHPA can scale on a fixed schedule, but it requires the business to have a clear periodicity and the policies to be adjusted manually, which increases O&M costs.

Moreover, there are relatively few automatic scaling metrics suitable for LLM inference. Traditional HPA supports CPU and memory metrics by default, which are difficult to apply to GPU scenarios. Even though existing GPU exporters can provide some GPU metrics, these metrics often fail to accurately reflect the actual GPU load.

Resource Guarantee During Scale-out

This article focuses on maximizing resource utilization through workload automatic scaling and resource preemption with a fixed number of nodes. Therefore, resource guarantee during scale-out is mainly achieved by preempting other low-priority tasks, such as some training tasks and offline inference tasks.

During automatic scaling, because different GPUs deliver different inference performance, users may want to customize priority configurations to make full use of heterogeneous resources. The ability to schedule and preempt heterogeneous resources based on priorities is not available in the native Kubernetes scheduler.

SLA User-experience Guarantee during the Scale-out Process

Compared with traditional online services, the biggest difference of large-model inference services is that they take much longer to go from startup to readiness, mainly because LLM inference images and models occupy a large amount of storage. For example, the image of v0.6.3 of the currently popular inference framework vLLM is about 5 GB. LLM model files are also usually large: in production environments, mainstream models typically have 7B to 14B parameters. Taking the common Qwen2.5-7B-Instruct as an example, its model files are approximately 15 GB. This makes loading the model from network storage to the GPU particularly time-consuming.

To address these issues, we propose automatic scaling solutions for inference scenarios on ACK.

[Figure 1]

2. Introduction

2.1 Overall Architecture

To address the preceding issues, we combine multiple enterprise-level capabilities of ACK and propose the following deployment solution based on Knative + ResourcePolicy + Fluid. Knative supports the concurrency-based Knative Pod Autoscaler (KPA), which detects load changes and responds to traffic bursts. During scaling, you can use ResourcePolicy to customize priority-based resource scheduling; when the remaining idle resources are insufficient, high-priority online inference services can preempt resources from lower-priority tasks. To guarantee the SLA and user experience during scale-out, we use Fluid to improve scaling efficiency: the model is downloaded in advance, saved in network storage, and its access is accelerated by Fluid. On top of the improved scaling efficiency, the response time of the first token is guaranteed while the automatic scaling requirements of inference scenarios are met.

[Figure 2]

2.2 Automatic Scaling Based on Inference Requests - Knative

A scientific and reasonable automatic scaling strategy mainly includes two parts: one is a reasonable scaling metric, and the other is a robust automatic scaling mechanism.

A reasonable scaling metric is the basis of any scaling strategy. Conventionally, GPU utilization is used as the automatic scaling metric. However, because the architecture and computing model of GPUs differ from those of CPUs, GPU utilization often fails to truly reflect the GPU's computing load and can only indicate whether a machine is idle. So what is a reasonable scaling metric for LLM inference? The industry has no unified answer, and it has to be determined through extensive testing. Some approaches use the metrics provided by the inference framework itself, such as num_requests_waiting in vLLM/NVIDIA NIM, or batch_size and queue_size in TGI. The optimal metric remains open for discussion; however, according to our tests, the concurrency metric of Knative Serving performs well. In addition, Knative automatically exposes the concurrency and RPS metrics through its sidecar, so the business container does not need to expose them itself, which makes it applicable to a wider range of scenarios.
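
As a quick way to see what the framework itself reports, you can inspect the queue-related metrics on vLLM's Prometheus endpoint. The pod name below is a placeholder, and the metric names follow recent vLLM releases, so they may differ in your version:

$ kubectl port-forward pod/<vllm-pod-name> 8080:8080
$ curl -s http://localhost:8080/metrics | grep num_requests
# vllm:num_requests_running  - requests currently being processed on the GPU
# vllm:num_requests_waiting  - requests queued and waiting to be scheduled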

Knative is an open-source, Kubernetes-based serverless container framework. It supports request-driven pod autoscaling, version management, and canary releases for applications, and it scales the number of pods to zero when no traffic is being processed. KPA supports concurrency and requests per second (RPS) as autoscaling metrics. Concurrency is suitable for businesses in which a single request consumes a large amount of resources and takes a long time to process, while RPS is suitable for businesses with short processing times. Since a single LLM inference usually takes from hundreds of milliseconds to several seconds, concurrency is the appropriate scaling metric.

Besides metrics that fit LLM inference well, Knative KPA also handles traffic bursts more flexibly than HPA:

Quick scale-out: To handle burst traffic, KPA uses stable and panic modes to distinguish burst traffic from normal traffic. During a short panic window, if the observed concurrency exceeds the threshold of the current processing capacity (200% by default), the system scales out immediately. By default, HPA instead avoids frequent scaling by acting only if no scaling action has been taken within the last 5 minutes, which delays its response to bursts. (A sketch of the related KPA annotations follows this list.)

Traffic cache: The activator and queue-proxy of Knative can cache traffic, offering the following benefits:

  • Prevent existing instances from suffering performance degradation caused by processing too many requests.
  • Delay request scheduling to avoid allocating all traffic to existing instances during brief periods of high traffic. New instances can then be created to help distribute the load.
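
If the defaults do not fit your traffic pattern, the stable window, panic window, and panic threshold can be tuned per revision through annotations. The fragment below is a sketch intended to be merged into the Service manifest shown later in Section 3.5; the values are Knative's defaults, written out explicitly for illustration:

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: qwen
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/class: kpa.autoscaling.knative.dev
        # Length of the stable window over which concurrency is averaged.
        autoscaling.knative.dev/window: "60s"
        # Panic window as a percentage of the stable window (10% of 60s = 6s).
        autoscaling.knative.dev/panic-window-percentage: "10.0"
        # Enter panic mode when observed concurrency reaches 200% of the target.
        autoscaling.knative.dev/panic-threshold-percentage: "200.0"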

[Figure 3]

2.3 Configure Priority-based Resource Scheduling - ResourcePolicy

Custom priority-based resource scheduling is an advanced scheduling policy provided by the ACK Pro scheduler. In a cluster with heterogeneous GPU resources, you can use ResourcePolicy to configure the order in which pods are scheduled to different types of GPU nodes during the scale-out of an LLM application, as well as the preemption policy. During scale-in, pods are reclaimed in the reverse of the scheduling order. This feature lets you manage the scheduling of heterogeneous resources in a fine-grained manner. ResourcePolicy also allows you to set a preemption policy: when a scheduling unit fails to be scheduled, the scheduler attempts to preempt resources. Therefore, during off-peak hours of inference traffic, the system can allocate resources to more training or offline inference tasks; when online inference traffic increases, ResourcePolicy schedules the inference pods to a resource pool of the specified type and, if resources are insufficient, the preemption mechanism reclaims them from lower-priority tasks to ensure effective use of resources.

2.4 Accelerate Model Data - Fluid

In the cloud, pre-downloaded models are usually stored in network storage such as NAS or OSS. However, accessing such remote storage from an ACK cluster often suffers from high latency and limited bandwidth. In particular, when an AI model inference service is started, released, or updated, a large number of service instances start simultaneously and concurrently read the same model file from the storage system to load it into GPU memory, which slows down pulling the model directly from network storage.

Fluid separates the capabilities of a storage system into two parts, data storage and data access, and moves part of the data access capability up into the computing cluster. In big data and AI scenarios, Fluid abstracts the process by which computing tasks access data, proposes the concept of an elastic dataset, and combines distributed data caching with cloud-native autoscaling, portability, and scheduling capabilities. Data offloading reduces the pressure on central storage; tiered locality caching and cache-locality scheduling improve data access performance; and automatic scale-out of the cache cluster provides elastic I/O throughput when computing resources access data with high concurrency. Together, these capabilities improve data access efficiency.

To fully leverage Fluid's data acceleration, you need to configure it according to your performance requirements and budget, including selecting appropriate ECS instance types, cache media, and cache system parameters. Although this sounds complicated, Fluid provides detailed best-practice configuration documentation that you can follow.

3. Quick Practice

3.1 Prepare the Environment

Inference framework: vLLM 0.4.1

LLM: Qwen-7B-Chat-Int8

• An ACK cluster is created and multiple GPU-accelerated nodes are added. In this example, the following machines are included:

  • Three ecs.gn7i-c32g1.8xlarge (A10) instances: the mainstream inference machines.
  • One ecs.gn6i-c4g1.xlarge (T4) instance: the lower-specification backup machine used during off-peak hours.
  • Three ecs.g8i.24xlarge instances: machines used to host the Fluid data cache.

Install ack-knative: See Deploy Knative in the Alibaba Cloud Container Service for Kubernetes (ACK) documentation.

Install ack-fluid: Install the cloud-native AI suite and deploy the ack-fluid components. For more information, see Install the Cloud-native AI Suite.

3.2 Prepare Model Data

Prepare model data by referring to the documentation. In this example, a Qwen-7B-Chat-Int8 model is uploaded to NAS or OSS.

Create a persistent volume (PV) and a persistent volume claim (PVC) named qwen-7b-chat-int8 in the cluster.
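
The exact PV definition depends on where the model is stored. The following is a minimal sketch for a statically provisioned NAS volume, assuming the ACK NAS CSI plugin; the NAS mount target, export path, and capacity are placeholders that you must replace with your own values:

apiVersion: v1
kind: PersistentVolume
metadata:
  name: qwen-7b-int8
  labels:
    alicloud-pvname: qwen-7b-int8
spec:
  capacity:
    storage: 20Gi
  accessModes:
    - ReadWriteMany
  storageClassName: nas
  csi:
    driver: nasplugin.csi.alibabacloud.com
    volumeHandle: qwen-7b-int8
    volumeAttributes:
      server: "xxxxxx.cn-beijing.nas.aliyuncs.com"  # Placeholder NAS mount target.
      path: "/models/Qwen-7B-Chat-Int8"             # Placeholder path that contains the model files.
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: qwen-7b-chat-int8
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: nas
  resources:
    requests:
      storage: 20Gi
  selector:
    matchLabels:
      alicloud-pvname: qwen-7b-int8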

3.3 Accelerate Model Loading Based on Fluid

Fluid can cache data stored in Kubernetes persistent volumes (PVs) to accelerate data access and therefore speed up model loading. We recommend that you use JindoRuntime.

1.  Configure the Dataset so that its cache workers are scheduled to the prepared cache nodes (ecs.g8i.24xlarge).

apiVersion: data.fluid.io/v1alpha1
kind: Dataset
metadata:
  name: qwen-7b-chat-int8-dataset
spec:
  mounts:
    - mountPoint: pvc://qwen-7b-chat-int8 # The name of the PVC that you prepared.
      name: data
      path: /
  accessModes:
    - ReadOnlyMany
  # Configure the worker pods of the cache system to be scheduled to the node of the specified ECS model.
  nodeAffinity:
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: node.kubernetes.io/instance-type
              operator: In
              values: 
                - "ecs.g8i.24xlarge"
---
apiVersion: data.fluid.io/v1alpha1
kind: JindoRuntime
metadata:
  name: qwen-7b-chat-int8-dataset
spec:
  replicas: 2
  tieredstore:
    levels:
      # Configure the storage type as memory.
      - mediumtype: MEM
        volumeType: emptyDir
        path: /dev/shm
        quota: 20Gi
        high: "0.9"
        low: "0.8"

Expected output:

$ kubectl get datasets.data.fluid.io
NAME                        UFS TOTAL SIZE   CACHED   CACHE CAPACITY   CACHED PERCENTAGE   PHASE   AGE
qwen-7b-chat-int8-dataset   17.01GiB         0.00B    40.00GiB         0.0%                Bound   5m46s
$ kubectl get pvc
NAME                        STATUS   VOLUME                              CAPACITY   ACCESS MODES   STORAGECLASS   VOLUMEATTRIBUTESCLASS   AGE
qwen-7b-chat-int8           Bound    qwen-7b-int8                        20Gi       RWX            nas            <unset>                 7d1h
qwen-7b-chat-int8-dataset   Bound    default-qwen-7b-chat-int8-dataset   100Pi      ROX            fluid          <unset>                 4m11s

2.  Create a DataLoad to perform cache preheating.

apiVersion: data.fluid.io/v1alpha1
kind: DataLoad
metadata:
  name: dataset-warmup
spec:
  dataset:
    name: qwen-7b-chat-int8-dataset
    namespace: default
  loadMetadata: true
  target:
    - path: /
      replicas: 1

Expected output:

$ kubectl get dataloads               
NAME             DATASET                     PHASE      AGE   DURATION
dataset-warmup   qwen-7b-chat-int8-dataset   Complete   12m   47s
$ kubectl get dataset  
NAME                        UFS TOTAL SIZE   CACHED     CACHE CAPACITY   CACHED PERCENTAGE   PHASE   AGE
qwen-7b-chat-int8-dataset   17.01GiB         17.01GiB   40.00GiB         100.0%              Bound   31m

Note: When Fluid accelerates data access for a model volume, a JindoFuse pod must be running on the corresponding node. If a business pod uses the dataset, it must wait for the JindoFuse pod to be ready. To reduce the time required to start the business pod, you can start the JindoFuse pod in advance.
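
Because the fuse component is created lazily when the first pod that uses the dataset lands on a node, one simple way to pre-start it is to run a lightweight placeholder workload that mounts the dataset PVC on the nodes you expect to scale onto. The sketch below pins a plain busybox pod to the A10 nodes; the pod name and image are illustrative, and a DaemonSet can be used instead if you want to cover every inference node:

apiVersion: v1
kind: Pod
metadata:
  name: jindofuse-warmup   # Illustrative name.
spec:
  nodeSelector:
    aliyun.accelerator/nvidia_name: NVIDIA-A10   # Pre-start the fuse component on the A10 inference nodes.
  containers:
  - name: warmup
    image: busybox:1.36    # Any minimal image works; it only needs to mount the volume.
    command: ["sh", "-c", "sleep 360000"]
    volumeMounts:
    - mountPath: /mnt/models
      name: model
  volumes:
  - name: model
    persistentVolumeClaim:
      claimName: qwen-7b-chat-int8-dataset   # The PVC created by the Fluid dataset.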

3.4 Priority-based Scheduling and Preemption Configuration

During automatic scaling, because different GPUs deliver different inference performance, users may customize priority configurations so that different AI services are scheduled to different resource pools. In addition, preemption can be enabled for services with strict real-time requirements.

Priority-based scheduling: Configure a ResourcePolicy so that pods are scheduled to A10 nodes first, and to T4 nodes when A10 nodes are insufficient.

Preemption:

  • Set preemptPolicy to BeforeNextUnit so that the scheduler attempts preemption whenever a unit fails to be scheduled.
  • The PriorityClass of inference pods is higher than that of training pods to ensure successful preemption.

apiVersion: scheduling.alibabacloud.com/v1alpha1
kind: ResourcePolicy
metadata:
  name: qwen
  namespace: default
spec:
  selector:
    release: qwen # You must specify the label of the pods to which you want to apply the ResourcePolicy.
  strategy: prefer
  # Try to preempt the training pod.
  preemptPolicy: BeforeNextUnit
  units:
  - resource: ecs
    nodeSelector:
      aliyun.accelerator/nvidia_name: NVIDIA-A10
  - resource: ecs
    nodeSelector:
      aliyun.accelerator/nvidia_name: Tesla-T4 
  # - resource: eci
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: inference-with-high-priority
value: 1000000
globalDefault: false
description: "This priority class should only be used for inference service pods."

3.5 Configure KPA Automatic Scaling

Knative Pod Autoscaler (KPA) is an out-of-the-box feature that scales pods based on the number of requests. Knative Serving injects a Queue Proxy container named queue-proxy into each pod, which automatically reports the request metrics of the application pod to KPA. After receiving these metrics, KPA adjusts the number of pods of the Deployment according to the number of concurrent requests and its scaling algorithm. Knative supports the concurrency, RPS, CPU, and memory metrics. In LLM and text-to-image scenarios, where a single inference takes a long time, concurrency is the recommended scaling metric.

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  labels:
    release: qwen
  name: qwen
  namespace: default
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/class: kpa.autoscaling.knative.dev
        autoscaling.knative.dev/metric: "concurrency" # Configure the concurrency as the automatic scaling metric.
        autoscaling.knative.dev/target: "2" # Set the concurrency target to 2.
        autoscaling.knative.dev/min-scale: "1" # Set the minimum number of replicas.
        autoscaling.knative.dev/max-scale: "3" # Set the maximum number of replicas.
      labels:
        release: qwen
    spec:
      priorityClassName: inference-with-high-priority # Set the priority for scheduling preemption (a pod-level field).
      containers:
      - command:
        - sh
        - -c
        - python3 -m vllm.entrypoints.openai.api_server --port 8080 --trust-remote-code
          --served-model-name qwen --model /mnt/models/Qwen-7B-Chat-Int8 --gpu-memory-utilization
          0.95 --quantization gptq --max-model-len=6144
        image: kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/vllm:0.4.1
        imagePullPolicy: IfNotPresent
        name: vllm-container
        readinessProbe:
          tcpSocket:
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 5
        resources:
          limits:
            cpu: "32"
            memory: 64Gi
            nvidia.com/gpu: "1"
          requests:
            cpu: "16"
            memory: 64Gi
            nvidia.com/gpu: "1"
        volumeMounts:
        - mountPath: /mnt/models/Qwen-7B-Chat-Int8
          name: qwen-7b-chat-int8
      volumes:
      - name: qwen-7b-chat-int8
        persistentVolumeClaim:
          claimName: qwen-7b-chat-int8-dataset # The PVCs created by using the dataset.

The following table describes the parameters.

  • scaleMetric: The scaling metric. Valid values: concurrency and RPS. Default value: concurrency.
  • scaleTarget: The scaling threshold.
  • minReplicas: The minimum number of pod replicas. The value must be an integer greater than or equal to 0. KPA supports scaling to 0.
  • maxReplicas: The maximum number of pod replicas. The value must be an integer greater than the value of minReplicas.

Expected output:

$ kubectl get ksvc
NAME   URL                               LATESTCREATED   LATESTREADY   READY   REASON
qwen   http://qwen.default.example.com   qwen-00001      qwen-00001    True
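
To check that KPA actually scales on concurrency, you can drive concurrent requests against the service and watch the replica count. The sketch below assumes the hey load generator and a placeholder gateway address (the gateway type and address depend on your Knative installation); the request body follows the OpenAI-compatible API served by vLLM:

# Replace with the external address of your Knative gateway (for example, the Kourier or ALB gateway).
$ GATEWAY_IP=<your-knative-gateway-address>

# Send 10 concurrent chat requests for 2 minutes; the Host header routes traffic to the qwen service.
$ hey -z 2m -c 10 -m POST \
    -host qwen.default.example.com \
    -H "Content-Type: application/json" \
    -d '{"model": "qwen", "messages": [{"role": "user", "content": "Hello"}]}' \
    "http://${GATEWAY_IP}/v1/chat/completions"

# In another terminal, watch the replicas grow beyond 1 once concurrency exceeds the target of 2.
$ kubectl get pods -l release=qwen -w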

3.6 Configure Resource Downgrading

Most inference services have a certain traffic cycle. If you keep high-specification GPU-accelerated instances running during off-peak hours, a large amount of GPU resources is wasted; if you instead scale the number of pods to zero during off-peak hours to reduce costs, the application experiences a time-consuming cold start the next time it starts. ACK Knative provides the reserved-instance and resource-downgrading feature, which allows you to retain a low-specification GPU-accelerated instance to balance cost and startup duration.

For example, we use ecs.gn7i-c32g1.8xlarge instances (¥21.253 per hour) to handle traffic peaks and ecs.gn6i-c4g1.xlarge instances (¥8.896 per hour) during off-peak hours, saving about 60% of the cost during off-peak periods. The idle high-specification GPU-accelerated instances can then be used for tasks such as model training.

apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  labels:
    release: qwen
  name: qwen
  namespace: default
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/class: kpa.autoscaling.knative.dev
        autoscaling.knative.dev/metric: "concurrency" # Configure the concurrency as the automatic scaling metric.
        autoscaling.knative.dev/target: "2" # Set the concurrency target to 2.
        autoscaling.knative.dev/min-scale: "0"
        autoscaling.knative.dev/max-scale: "3" 
        knative.aliyun.com/reserve-instance: enable # Enable the instance reservation feature.
        knative.aliyun.com/reserve-instance-type: ecs # Configure the reserved instance type.
        knative.aliyun.com/reserve-instance-ecs-use-specs: ecs.gn6i-c4g1.xlarge # Configure the reserved instance specifications. You can configure multiple instance specifications.
        knative.aliyun.com/reserve-instance-cpu-resource-request: "2" # Configure the CPU request of the reserved instance.
        knative.aliyun.com/reserve-instance-cpu-resource-limit: "2"
        knative.aliyun.com/reserve-instance-memory-resource-request: "8Gi"
        knative.aliyun.com/reserve-instance-memory-resource-limit: "8Gi"
      labels:
        release: qwen
    spec:
      priorityClassName: inference-with-high-priority # Set the priority for scheduling preemption (a pod-level field).
      containers:
      - command:
        - sh
        - -c
        - python3 -m vllm.entrypoints.openai.api_server --port 8080 --trust-remote-code
          --served-model-name qwen --model /mnt/models/Qwen-7B-Chat-Int8 --gpu-memory-utilization
          0.95 --quantization gptq --max-model-len=6144
        image: kube-ai-registry.cn-shanghai.cr.aliyuncs.com/kube-ai/vllm:0.4.1
        imagePullPolicy: IfNotPresent
        name: vllm-container
        readinessProbe:
          tcpSocket:
            port: 8080
          initialDelaySeconds: 5
          periodSeconds: 5
        resources:
          limits:
            cpu: "16"
            memory: 60Gi
            nvidia.com/gpu: "1"
          requests:
            cpu: "8"
            memory: 36Gi
            nvidia.com/gpu: "1"
        volumeMounts:
        - mountPath: /mnt/models/Qwen-7B-Chat-Int8
          name: qwen-7b-chat-int8
      volumes:
      - name: qwen-7b-chat-int8
        persistentVolumeClaim:
          claimName: qwen-7b-chat-int8-dataset

Expected output:

$ kubectl get po -o wide
NAME                                             READY   STATUS        RESTARTS   AGE   IP            NODE                         NOMINATED NODE   READINESS GATES
qwen-00001-deployment-reserve-67664799b7-s7kmr   2/2     Running       0          51s   10.0.6.250    ap-southeast-1.10.0.6.236    <none>           <none>

4. Test Performance

In production scenarios, users are often concerned about the efficiency of automatic scaling and the response latency experienced during the process. Taking the scale-to-zero scenario as the baseline, we test performance in various scenarios in the experimental environment described above.

According to the test results, we have the following suggestions:

1. Fluid for model data acceleration: In scaling scenarios, Fluid significantly reduces the cold-start time of the inference service container and thus shortens user response latency. We mainly use time to first token (TTFT) as the measure here (a simple client-side way to approximate TTFT is sketched after this list). Fluid acceleration effectively reduces the Max, P99, and P90 TTFT, regardless of whether the deployment scales to 0, reserves an instance, or reserves an instance with resource downgrading.

2. Reserved instance + Fluid: This mode is suitable for users with strict response latency requirements, such as online inference services. When a reserved instance carries 10 QPS of requests, Max_TTFT stays within roughly 1.2s and P99_TTFT within 0.6s.

3. Reserved instance with resource downgrading + Fluid: This mode is suitable for scenarios where users need to balance response latency and cost. The maximum latency is higher after the resource is downgraded; however, in clusters with heterogeneous resources, the user cost can be significantly reduced.

4. Scale to 0 + Fluid: This mode is suitable for business types that are not sensitive to response latency, such as offline inference. Although the latency of individual requests is high during the scale-out phase, the batch-processing throughput remains acceptable in terms of P99_TTFT and Avg_Output_Tokens.
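
TTFT can be approximated on the client side by sending a streaming request and measuring the time until the first byte of the response body arrives. A minimal sketch with curl is shown below; the gateway address is the same placeholder as in the load-test sketch in Section 3.5, and time_starttransfer only approximates TTFT because it also includes network overhead:

# Send a streaming chat request and report the time to the first response byte.
$ curl -s -o /dev/null \
    -H "Host: qwen.default.example.com" \
    -H "Content-Type: application/json" \
    -d '{"model": "qwen", "stream": true, "messages": [{"role": "user", "content": "Hello"}]}' \
    -w "TTFT (approx): %{time_starttransfer}s\n" \
    http://<your-knative-gateway-address>/v1/chat/completions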

5. Summary

The request-based automatic scaling of Knative is a good fit for large language model (LLM) inference scenarios, and its resource downgrading feature can significantly reduce costs. ResourcePolicy enables fine-grained use of heterogeneous computing resources and significantly improves utilization through custom priorities and preemption. For scaling during inference, model loading time is the main contributor to cold-start latency, and Fluid significantly improves the scaling efficiency of LLM applications: the loading time of a 17 GB model is shortened from 90 seconds to about 20 seconds; when scaling out from 0, the maximum TTFT drops from 94 seconds to 21 seconds; and with one A10 instance reserved, the maximum TTFT drops from 2 seconds to 1.2 seconds.

Reference

[1] Best Practices for Fluid Data Cache Optimization Policies (Alibaba Cloud Container Service for Kubernetes documentation)

[2] Deploy Knative (Alibaba Cloud Container Service for Kubernetes documentation)

[3] Configure Priority-based Resource Scheduling (Alibaba Cloud Container Service for Kubernetes documentation)
