llm

vllm

https://docs.vllm.ai/en/latest/getting_started/quickstart.html

Because Hugging Face is not reachable from the server, the following line from the quickstart errors out:

llm = LLM(model="facebook/opt-125m")

So first download the model via proxychains; the download code is as follows (https://github.com/vllm-project/vllm/discussions/1405):

from huggingface_hub import snapshot_download

# model_id = "deepseek-ai/DeepSeek-R1-Distill-Qwen-14B"
model_id = "Qwen/Qwen2.5-1.5B-Instruct"
model_path = snapshot_download(
    repo_id=model_id,
    local_dir="./models/" + model_id,  # mirror the repo layout locally
    max_workers=4,                     # increase for faster parallel downloads
)
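
Assuming the snippet above is saved as download.py (a name used here just for illustration) and proxychains is already configured, the download runs through the proxy like this:

proxychains python download.py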

Then scp the downloaded model to the server.
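
A sketch of the copy, with the user/host taken from the shell prompt below and the destination matching the path used in the script (adjust both for your environment):

scp -r ./models/facebook/opt-125m like@a141:/share_data/users/like/hf-models/facebook/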

offline mode

(torch2) A|a141|2025-03-24 16:46:36[like@ vllm]cat n1_fb.py

from vllm import LLM, SamplingParams
prompts = [
    "Hello, my name is",
    "The president of the United States is",
    "The capital of France is",
    "The future of AI is",
]
sampling_params = SamplingParams(temperature=0.8, top_p=0.95)
# llm = LLM(model="facebook/opt-125m")  # would try to reach the Hub
# llm = LLM(model="/share_data/users/like/hf-models/facebook/opt-125m")  # default gpu_memory_utilization (0.9)
gpu_memory_utilization = 0.013  # fraction of GPU memory vLLM may claim; opt-125m needs very little
llm = LLM(model="/share_data/users/like/hf-models/facebook/opt-125m", gpu_memory_utilization=gpu_memory_utilization)
outputs = llm.generate(prompts, sampling_params)

for output in outputs:
    prompt = output.prompt
    generated_text = output.outputs[0].text
    print(f"Prompt: {prompt!r}, Generated text: {generated_text!r}")
print(f"enter to end,,gpu_memory_utilization:{gpu_memory_utilization}")
x = input()
print(f"x:{x}")

vllm serve

vllm serve /share_data/users/like/hf-models/facebook/opt-125m/ --gpu-memory-utilization 0.2
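
vllm serve exposes an OpenAI-compatible HTTP API (port 8000 by default). A minimal sketch of querying it, assuming the server above is running on the same machine (the model name defaults to the path passed to vllm serve):

import requests

resp = requests.post(
    "http://localhost:8000/v1/completions",
    json={
        "model": "/share_data/users/like/hf-models/facebook/opt-125m/",  # path passed to vllm serve
        "prompt": "The capital of France is",
        "max_tokens": 32,
        "temperature": 0.8,
    },
)
print(resp.json()["choices"][0]["text"])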
