llm

vllm

https://docs.vllm.ai/en/latest/getting_started/quickstart.html

Hugging Face is not reachable from this machine, so the following line from the quick start fails:

llm = LLM(model="facebook/opt-125m")

First download the model via proxychains; the download code is as follows (https://github.com/vllm-project/vllm/discussions/1405):

from huggingface_hub import snapshot_download

#model_id="deepseek-ai/DeepSeek-R1-Distill-Qwen-14B"
model_id="Qwen/Qwen2.5-1.5B-Instruct"
model_path = snapshot_download(
    repo_id=model_id,
    local_dir="./models/"+model_id,
    max_workers=4  # Increase for faster parallel downloads
)
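
Save the snippet above (e.g. as download_model.py; the filename is just an example) and run it through proxychains so the Hugging Face traffic goes over the proxy:

proxychains python download_model.py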

Then scp the downloaded model directory to the server.
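
A minimal transfer sketch (user, host, and destination are placeholders; the destination mirrors the hf-models layout used below):

scp -r ./models/Qwen/Qwen2.5-1.5B-Instruct user@server:/share_data/users/like/hf-models/Qwen/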

offline mode

(torch2) A|a141|2025-03-24 16:46:36[like@ vllm]cat n1_fb.py

from vllm import LLM, SamplingParams
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
#llm = LLM(model="facebook/opt-125m")
#llm = LLM(model="/share_data/users/like/hf-models/facebook/opt-125m")
gpu_memory_utilization=0.013
llm = LLM(model="/share_data/users/like/hf-models/facebook/opt-125m", gpu_memory_utilization=gpu_memory_utilization)
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
print(f"enter to end,,gpu_memory_utilization:{gpu_memory_utilization}")
x = input()
print(f"x:{x}")

vllm serve

vllm serve /share_data/users/like/hf-models/facebook/opt-125m/ --gpu-memory-utilization 0.2
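
vllm serve exposes an OpenAI-compatible API (port 8000 by default). A minimal client sketch, assuming the openai Python package is installed and the server above is running; the model name must match what the server reports under /v1/models:

from openai import OpenAI

# vLLM does not check the API key by default; any placeholder works.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.completions.create(
    model="/share_data/users/like/hf-models/facebook/opt-125m/",
    prompt="San Francisco is a",
    max_tokens=20,
    temperature=0,
)
print(completion.choices[0].text)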

vllm bench serve

https://github.com/vllm-project/vllm/pull/17625

# Server side with triton backend (use VLLM_ATTENTION_BACKEND=CUTLASS_MLA_VLLM_V1 for the cutlass backend):
VLLM_LOGGING_LEVEL=DEBUG \
VLLM_WORKER_MULTIPROC_METHOD=spawn \
  vllm serve deepseek-ai/DeepSeek-V3 \
    --trust-remote-code \
    --max-model-len=2048 \
    --block-size=128 \
    --max-num-seqs=512 \
    --gpu_memory_utilization=0.97 \
    --data-parallel-size $NUM_GPUS --enable-expert-parallel \
    --disable-log-requests

# client side:
python $VLLM_PATH/benchmarks/benchmark_serving.py \
  --model deepseek-ai/DeepSeek-V3 \
  --dataset-name random \
  --ignore-eos \
  --num-prompts 3000 \
  --max-concurrency 3000 \
  --random-input-len 1000 \
  --random-output-len 1

Results

# With default triton backend:
============ Serving Benchmark Result ============
Successful requests:                     2989
Benchmark duration (s):                  1046.01
Total input tokens:                      2989000
Total generated tokens:                  2989000
Request throughput (req/s):              2.86
Output token throughput (tok/s):         2857.52
Total Token throughput (tok/s):          5715.04
---------------Time to First Token----------------
Mean TTFT (ms):                          200716.51
Median TTFT (ms):                        199463.35
P99 TTFT (ms):                           395239.25
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          826.04
Median TPOT (ms):                        826.20
P99 TPOT (ms):                           1001.39
---------------Inter-token Latency----------------
Mean ITL (ms):                           826.04
Median ITL (ms):                         648.89
P99 ITL (ms):                            8337.69
==================================================

# With cutlass_mla backend:
============ Serving Benchmark Result ============
Successful requests:                     2989
Benchmark duration (s):                  881.52
Total input tokens:                      2989000
Total generated tokens:                  2989000
Request throughput (req/s):              3.39
Output token throughput (tok/s):         3390.73
Total Token throughput (tok/s):          6781.46
---------------Time to First Token----------------
Mean TTFT (ms):                          190244.11
Median TTFT (ms):                        189563.96
P99 TTFT (ms):                           372713.07
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          685.60
Median TPOT (ms):                        686.96
P99 TPOT (ms):                           858.01
---------------Inter-token Latency----------------
Mean ITL (ms):                           685.60
Median ITL (ms):                         518.56
P99 ITL (ms):                            7738.23
==================================================
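
Comparing the two runs: cutlass_mla serves the same 2989 requests in 881.5 s instead of 1046.0 s, roughly 19% higher token throughput (6781 vs 5715 tok/s), with lower TTFT, TPOT, and ITL across the board.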

curl

curl v1/chat/completions

sglang launch command:

CUDA_VISIBLE_DEVICES=0,1 python3 -m sglang.launch_server \
    --model-path /mnt/yrfs/llm_weights/Meta-Llama-3.1-8B-Instruct/ \
    --quantization fp8 \
    --port 30000 --host 0.0.0.0 \
    --tp-size 2 \
    > ~/package/sglang_kernel_src/temp/sglang-server.log 2>&1 &
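
Here --tp-size 2 shards the model across the two visible GPUs with tensor parallelism, --quantization fp8 runs the weights in FP8, and the trailing redirect backgrounds the server with its log captured to a file.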

Client command:

curl http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "/mnt/yrfs/llm_weights/Meta-Llama-3.1-8B-Instruct/",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "what is rust?"}
    ]
  }'

The command style recommended by vLLM:
https://docs.vllm.ai/en/latest/getting_started/quickstart/#openai-compatible-server

curl http://localhost:30000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "/mnt/yrfs/llm_weights/Meta-Llama-3.1-8B-Instruct/",
    "prompt": "San Francisco is a",
    "max_tokens": 20,
    "temperature": 0
  }'

v1/chat/completions can also be called the same way:

curl http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "/mnt/yrfs/llm_weights/Meta-Llama-3.1-8B-Instruct/",
    "messages": [
      {"role": "system", "content": "You are a helpful assistant."},
      {"role": "user", "content": "Who won the world series in 2020?"}
    ]
  }'

Result:

{
  "id": "2c2b992d86f344cfb04467baa08df3a9",
  "object": "chat.completion",
  "created": 1764814659,
  "model": "/mnt/yrfs/llm_weights/Meta-Llama-3.1-8B-Instruct/",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "The Los Angeles Dodgers won the World Series in 2020, defeating the Tampa Bay Rays in the series 4 games to 2. It was the Dodgers' first World Series title since 1988.",
        "reasoning_content": null,
        "tool_calls": null
      },
      "logprobs": null,
      "finish_reason": "stop",
      "matched_stop": 128009
    }
  ],
  "usage": {
    "prompt_tokens": 31,
    "total_tokens": 74,
    "completion_tokens": 43,
    "prompt_tokens_details": null,
    "reasoning_tokens": 0
  },
  "metadata": {"weight_version": "default"}
}
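
The same chat request can also be scripted; a minimal sketch with the requests package against the sglang server above:

import requests

resp = requests.post(
    "http://localhost:30000/v1/chat/completions",
    json={
        "model": "/mnt/yrfs/llm_weights/Meta-Llama-3.1-8B-Instruct/",
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Who won the world series in 2020?"},
        ],
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])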
