vllm
https://docs.vllm.ai/en/latest/getting_started/quickstart.html
Since huggingface.co is not reachable from this machine, the following line from the quickstart fails:
llm = LLM(model="facebook/opt-125m")
Download the model through proxychains first; the download script is below (https://github.com/vllm-project/vllm/discussions/1405):
from huggingface_hub import snapshot_download

# model_id = "deepseek-ai/DeepSeek-R1-Distill-Qwen-14B"
model_id = "Qwen/Qwen2.5-1.5B-Instruct"
model_path = snapshot_download(
    repo_id=model_id,
    local_dir="./models/" + model_id,
    max_workers=4,  # increase for faster parallel downloads
)
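Once the snapshot is copied over, you can also force offline mode so nothing attempts to reach the Hub at load time. A minimal sketch, assuming the local path from the download above (HF_HUB_OFFLINE and TRANSFORMERS_OFFLINE are standard huggingface_hub / transformers environment switches):

import os
os.environ["HF_HUB_OFFLINE"] = "1"        # forbid huggingface_hub network access
os.environ["TRANSFORMERS_OFFLINE"] = "1"  # same for transformers

from vllm import LLM

# hypothetical local path: wherever the snapshot above ended up
llm = LLM(model="./models/Qwen/Qwen2.5-1.5B-Instruct")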
Then scp the downloaded files to the server.
offline mode
(torch2) A|a141|2025-03-24 16:46:36[like@ vllm]cat n1_fb.py
from vllm import LLM, SamplingParams

prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)

# llm = LLM(model="facebook/opt-125m")
# llm = LLM(model="/share_data/users/like/hf-models/facebook/opt-125m")
gpu_memory_utilization = 0.013
llm = LLM(model="/share_data/users/like/hf-models/facebook/opt-125m",
          gpu_memory_utilization=gpu_memory_utilization)

outputs = llm.generate(prompts, sampling_params)
for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")

# block so GPU memory usage can be inspected before the process exits
print(f"press Enter to end, gpu_memory_utilization: {gpu_memory_utilization}")
x = input()
print(f"x: {x}")
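gpu_memory_utilization is the fraction of each GPU's total memory that vLLM may preallocate for weights, activations, and the KV cache. A rough sanity check of the 0.013 value above, assuming an 80 GB GPU (the GPU size is an assumption, not stated in these notes):

total_gib = 80                    # assumed GPU memory
budget_gib = 0.013 * total_gib    # ~1.04 GiB handed to vLLM
weights_gib = 125e6 * 2 / 2**30   # opt-125m fp16 weights: ~0.23 GiB
print(budget_gib, weights_gib)    # leaves ~0.8 GiB for KV cache etc.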
vllm serve
vllm serve /share_data/users/like/hf-models/facebook/opt-125m/ --gpu-memory-utilization 0.2
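vllm serve exposes an OpenAI-compatible API (port 8000 by default), so besides curl you can query it with the openai Python client. A minimal sketch, assuming the server above is running locally and the v1 openai package (vLLM ignores the API key, so "EMPTY" is a conventional placeholder):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
resp = client.completions.create(
    model="/share_data/users/like/hf-models/facebook/opt-125m/",
    prompt="San Francisco is a",
    max_tokens=20,
    temperature=0,
)
print(resp.choices[0].text)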
vllm bench serve
https://github.com/vllm-project/vllm/pull/17625
# Server side with triton backend (please use VLLM_ATTENTION_BACKEND=CUTLASS_MLA_VLLM_V1 for the cutlass backend):
VLLM_LOGGING_LEVEL=DEBUG \
VLLM_WORKER_MULTIPROC_METHOD=spawn \
vllm serve deepseek-ai/DeepSeek-V3 \
--trust-remote-code \
--max-model-len=2048 \
--block-size=128 \
--max-num-seqs=512 \
--gpu-memory-utilization=0.97 \
--data-parallel-size $NUM_GPUS --enable-expert-parallel \
--disable-log-requests
# client side:
python $VLLM_PATH/benchmarks/benchmark_serving.py \
--model deepseek-ai/DeepSeek-V3 \
--dataset-name random \
--ignore-eos \
--num-prompts 3000 \
--max-concurrency 3000 \
--random-input-len 1000 \
--random-output-len 1
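As a sanity check, the aggregate figures in the results below follow directly from the per-request settings (2989 successful requests, 1000 random input tokens each):

successful = 2989
input_len = 1000
duration_s = 1046.01
total_input = successful * input_len  # 2,989,000, matches "Total input tokens"
req_per_s = successful / duration_s   # ~2.86, matches "Request throughput"
print(total_input, round(req_per_s, 2))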
Results
# With default triton backend:
============ Serving Benchmark Result ============
Successful requests: 2989
Benchmark duration (s): 1046.01
Total input tokens: 2989000
Total generated tokens: 2989000
Request throughput (req/s): 2.86
Output token throughput (tok/s): 2857.52
Total Token throughput (tok/s): 5715.04
---------------Time to First Token----------------
Mean TTFT (ms): 200716.51
Median TTFT (ms): 199463.35
P99 TTFT (ms): 395239.25
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 826.04
Median TPOT (ms): 826.20
P99 TPOT (ms): 1001.39
---------------Inter-token Latency----------------
Mean ITL (ms): 826.04
Median ITL (ms): 648.89
P99 ITL (ms): 8337.69
==================================================
# With cutlass_mla backend:
============ Serving Benchmark Result ============
Successful requests: 2989
Benchmark duration (s): 881.52
Total input tokens: 2989000
Total generated tokens: 2989000
Request throughput (req/s): 3.39
Output token throughput (tok/s): 3390.73
Total Token throughput (tok/s): 6781.46
---------------Time to First Token----------------
Mean TTFT (ms): 190244.11
Median TTFT (ms): 189563.96
P99 TTFT (ms): 372713.07
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 685.60
Median TPOT (ms): 686.96
P99 TPOT (ms): 858.01
---------------Inter-token Latency----------------
Mean ITL (ms): 685.60
Median ITL (ms): 518.56
P99 ITL (ms): 7738.23
==================================================
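Comparing the two backends from the numbers above, cutlass_mla is roughly 19% faster end to end (pure arithmetic on the reported figures):

triton_tput, cutlass_tput = 5715.04, 6781.46  # total tok/s
triton_tpot, cutlass_tpot = 826.04, 685.60    # mean ms per output token
print(cutlass_tput / triton_tput)             # ~1.19x total throughput
print(triton_tpot / cutlass_tpot)             # ~1.20x lower mean TPOT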
To reproduce the results, run the server and client commands above.
curl
curl v1/chat/completions
sglang launch command:
CUDA_VISIBLE_DEVICES=0,1 python3 -m sglang.launch_server --model-path /mnt/yrfs/llm_weights/Meta-Llama-3.1-8B-Instruct/ --quantization fp8 --port 30000 --host 0.0.0.0 --tp-size 2 > ~/package/sglang_kernel_src/temp/sglang-server.log 2>&1 &
Client command:
curl http://localhost:30000/v1/chat/completions -H "Content-Type: application/json" -d '{ "model": "/mnt/yrfs/llm_weights/Meta-Llama-3.1-8B-Instruct/", "messages": [ {"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "what is rust?"} ] }'
The command recommended by the vLLM docs:
https://docs.vllm.ai/en/latest/getting_started/quickstart/#openai-compatible-server
curl http://localhost:30000/v1/completions -H "Content-Type: application/json" -d '{ "model": "/mnt/yrfs/llm_weights/Meta-Llama-3.1-8B-Instruct/", "prompt": "San Francisco is a", "max_tokens": 20, "temperature": 0 }'
v1/chat/completions can also be written this way:
curl http://localhost:30000/v1/chat/completions -H "Content-Type: application/json" -d '{ "model": "/mnt/yrfs/llm_weights/Meta-Llama-3.1-8B-Instruct/", "messages": [ {"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "Who won the world series in 2020?"} ] }'
Result: {"id":"2c2b992d86f344cfb04467baa08df3a9","object":"chat.completion","created":1764814659,"model":"/mnt/yrfs/llm_weights/Meta-Llama-3.1-8B-Instruct/","choices":[{"index":0,"message":{"role":"assistant","content":"The Los Angeles Dodgers won the World Series in 2020, defeating the Tampa Bay Rays in the series 4 games to 2. It was the Dodgers' first World Series title since 1988.","reasoning_content":null,"tool_calls":null},"logprobs":null,"finish_reason":"stop","matched_stop":128009}],"usage":{"prompt_tokens":31,"total_tokens":74,"completion_tokens":43,"prompt_tokens_details":null,"reasoning_tokens":0},"metadata":{"weight_version":"default"}}
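The same chat request through the openai Python client, for completeness; a sketch assuming the sglang server above is still listening on port 30000 (field access mirrors the JSON response shown above):

from openai import OpenAI

client = OpenAI(base_url="http://localhost:30000/v1", api_key="EMPTY")
resp = client.chat.completions.create(
    model="/mnt/yrfs/llm_weights/Meta-Llama-3.1-8B-Instruct/",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Who won the world series in 2020?"},
    ],
)
print(resp.choices[0].message.content)  # "The Los Angeles Dodgers won ..."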