SpotServe: Serving Generative Large Language Models on Preemptible Instances

Authors: Xupeng Miao et al.

Background

Generative LLM

img

  • Input: a sequence of tokens (the prompt)

  • Output: a token sequence, generated one token at a time

  • Stops when: the output reaches the maximum length or an ending (EOS) token
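
A minimal sketch of this token-by-token generation loop (the `model` and `tokenizer` objects here are hypothetical placeholders, not from the paper):

```python
# Minimal sketch of autoregressive decoding; model/tokenizer are hypothetical placeholders.
def generate(model, tokenizer, prompt: str, max_len: int = 128) -> str:
    tokens = tokenizer.encode(prompt)          # input: a sequence of tokens
    while len(tokens) < max_len:               # stop at the maximum length...
        next_token = model.next_token(tokens)  # one decoding step per output token
        if next_token == tokenizer.eos_id:     # ...or at the ending (EOS) token
            break
        tokens.append(next_token)
    return tokenizer.decode(tokens)
```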

Two types of GPU Instances

  • On-demand GPU instances:

    • Expensive
    • Available whenever you need them
  • Preemptible GPU instances (e.g., spot instances):

    • Cheap (they run on spare capacity)
    • Might be preempted at any time
    • Offer a grace period after the preemption notice to finish currently running tasks

    img

Challenges

  • The number of available preemptible instances changes frequently => dynamic reparallelization is needed to keep serving performance optimized

  • Restarting the LLM incurs a large overhead from reloading parameters => find the migration strategy that minimizes this cost

  • The grace period may not be long enough to finish the currently running requests

  • The throughput drop during reconfiguration may cause subsequent requests to accumulate

SpotServe

Overview

img

  • Request manager: handles incoming requests, partitions them into batches, assigns them to instances, and sends outputs back to users
  • Instance manager: monitors the preemption and acquisition of GPU instances
  • Meta-context manager: schedules the context migration (parameters, intermediate outputs, etc.) between GPU instances

Meta-context Manager

Parallelization Controller

Parallelization configurations:
  • D: data parallelism: partition incoming requests and assign them to different pipelines
  • P: pipeline model parallelism: run different stages of an inference pass simultaneously (like the instruction pipeline in a CPU)
  • M: tensor model parallelism: split the model weights into shards and assign them to different GPUs
  • Configuration C = (D, P, M)
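
As a rough sketch (field names are mine, not the paper's), a configuration can be modeled as a small value type; the number of GPU instances it occupies is D × P × M:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ParallelConfig:
    D: int  # data-parallel pipelines (requests are partitioned across them)
    P: int  # pipeline-parallel stages per pipeline
    M: int  # tensor-parallel shards per stage

    def num_gpus(self) -> int:
        # a (D, P, M) configuration occupies D * P * M GPU instances
        return self.D * self.P * self.M
```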

img

img

Configuration Optimizer

img

If some configuration achieves a throughput higher than the request arrival rate, choose the configuration with minimum inference latency among those whose throughput exceeds the arrival rate. Otherwise, choose the C with maximum throughput.

After adjusting the configuration, allocate or free instances.

Offline process => adds little latency to online serving
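
A hedged sketch of the selection rule described above, assuming profiled estimates `throughput(C)` and `latency(C)` are available for every candidate configuration (these estimator functions are assumptions, not the paper's API):

```python
def choose_config(candidates, throughput, latency, arrival_rate, available_gpus):
    # Only consider configurations that fit on the currently available instances.
    feasible = [c for c in candidates if c.num_gpus() <= available_gpus]
    # Configurations whose throughput can keep up with the request arrival rate.
    sustaining = [c for c in feasible if throughput(c) >= arrival_rate]
    if sustaining:
        # Pick the lowest-latency configuration among those that keep up.
        return min(sustaining, key=latency)
    # Otherwise fall back to the configuration with maximum throughput.
    return max(feasible, key=throughput)
```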

What do we have now?

  • We now have a target configuration for the next step.
  • However, we have only decided the parallelization structure.
  • How should we decide the mapping from physical instances to logical positions?

Device Mapper

  • Goal: find the matching that maximizes the reusable context (i.e., the sum of edge weights)

img

  • Approach: KM (Kuhn-Munkres) algorithm for maximum-weight matching on a bipartite graph

Edges: the weight e(u, v) indicates how much context (parameters and caches) is reusable when mapping GPU u to logical position v.
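
A small sketch of this matching step; here SciPy's Hungarian-algorithm solver stands in for KM, and `weight[u][v]` is assumed to hold the amount of reusable context when GPU u is placed at logical position v:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def map_devices(weight):
    """weight[u][v]: reusable context (parameters + caches, e.g., in bytes)
    if GPU instance u is mapped to logical position v."""
    w = np.asarray(weight, dtype=float)
    # linear_sum_assignment minimizes cost, so negate the weights to maximize reuse.
    rows, cols = linear_sum_assignment(-w)
    mapping = dict(zip(rows.tolist(), cols.tolist()))  # GPU u -> position v
    reused = w[rows, cols].sum()                       # total reusable context
    return mapping, reused
```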

Migration Planner

img

  1. Going from the front layers (0, 1, ...) to the back layers (N), find the layers whose context does not exceed the buffer size and prioritize migrating these layers.
  2. Append the remaining layers to the sequence in order of their instances' buffer memory usage.
  3. Migrate the layers sequentially in that order. Once all layers of a pipeline stage have been migrated, that stage can start running.
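
A rough sketch of this ordering logic (all names are hypothetical; the real planner works over the chosen configuration's stages and instances):

```python
def plan_migration_order(layers, buffer_size, instance_mem_usage):
    """layers: list of (layer_id, context_bytes), ordered front (0) to back (N).
    buffer_size: per-instance migration buffer capacity.
    instance_mem_usage: layer_id -> buffer memory usage of the instance holding it."""
    # 1) Front-to-back, prioritize layers whose context fits in the buffer.
    fits = [lid for lid, ctx in layers if ctx <= buffer_size]
    # 2) Remaining layers follow, ordered by their instance's buffer memory usage.
    rest = sorted((lid for lid, ctx in layers if ctx > buffer_size),
                  key=lambda lid: instance_mem_usage[lid])
    # 3) Layers are migrated sequentially; a stage starts once all its layers arrive.
    return fits + rest
```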

Just-in-time Arrangement

What should we do if we receive a preemption/acquisition notification while there are still requests running/waiting?

  • Immediately suspend and migrate? => high inference latency
  • Finish all running requests first? => not enough time left for migration
  • SpotServe: on preemption, maximize token generation within the grace period (since migration must happen during the grace period); on acquisition, minimize token generation while still exploiting the whole grace period (since migration happens after the grace period)
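
For the preemption case, the idea boils down to simple arithmetic: keep decoding only as long as the remaining grace period still leaves room for the migration itself (the timing estimates below are assumed inputs, not values from the paper):

```python
def decode_iters_before_migration(grace_period_s, migration_time_s, iter_time_s):
    # Reserve enough of the grace period for the context migration to complete,
    # and spend the rest on as many decoding iterations as possible.
    budget = grace_period_s - migration_time_s
    return max(0, int(budget // iter_time_s))

# e.g., a 30 s grace period, 12 s migration, 0.2 s per decoding iteration
# => run 90 more decoding iterations before suspending and migrating.
```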

Experimental Evaluation

Comparison

  • Stable workload

  • P99: latency at the 99th percentile (99% of requests have latency <= this value; see the small example after this list)

  • Baselines: FasterTransformer (reparallelization but no context migration; rerouting with a pre-defined config)

    img

  • Fluctuating Workload

    img
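
As a small illustration of the P99 metric (not a result from the paper), the 99th-percentile latency of a set of measured request latencies can be computed as:

```python
import numpy as np

latencies_s = [0.8, 1.1, 0.9, 3.2, 1.0, 1.4]  # hypothetical per-request latencies
p99 = np.percentile(latencies_s, 99)          # 99% of requests finish within this time
print(f"P99 latency: {p99:.2f} s")
```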

Ablation

img

