vllm.v1.outputs
AsyncModelRunnerOutput
Bases: ABC
Source code in vllm/v1/outputs.py
get_output abstractmethod
Get the ModelRunnerOutput for this async output.
This is a blocking call that waits until the results are ready, which might involve copying device tensors to the host. This method should only be called once per AsyncModelRunnerOutput.
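The blocking, call-once contract described above can be sketched with a small stand-in. This is not vLLM's implementation: `AsyncOutputSketch`, `FutureBackedOutput`, `slow_copy`, and the dict-shaped result are hypothetical names introduced for illustration, and a thread-pool future stands in for the device-to-host copy.

```python
from abc import ABC, abstractmethod
from concurrent.futures import Future, ThreadPoolExecutor
import time


class AsyncOutputSketch(ABC):
    """Hypothetical stand-in for an async output handle (not the vLLM class)."""

    @abstractmethod
    def get_output(self) -> dict:
        """Block until the result is ready, then return it."""


class FutureBackedOutput(AsyncOutputSketch):
    """Hypothetical subclass: wraps a Future and enforces the call-once rule."""

    def __init__(self, future: Future) -> None:
        self._future = future
        self._consumed = False

    def get_output(self) -> dict:
        if self._consumed:
            raise RuntimeError("get_output() should only be called once")
        self._consumed = True
        # Future.result() blocks until the background work has finished.
        return self._future.result()


def slow_copy() -> dict:
    # Simulates a device-to-host copy that takes some time.
    time.sleep(0.05)
    return {"req_ids": ["req-0"], "sampled_token_ids": [[42]]}


with ThreadPoolExecutor(max_workers=1) as pool:
    handle = FutureBackedOutput(pool.submit(slow_copy))
    result = handle.get_output()  # blocks here until slow_copy returns
```

Raising on a second call is one possible way to make the "only once" rule explicit. The source only states that the method should be called once; it does not say how a repeat call behaves.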
LogprobsTensors
Bases: NamedTuple
empty_cpu staticmethod
empty_cpu(num_positions: int, num_tokens_per_position: int) -> LogprobsTensors
Create empty LogprobsTensors on CPU.
filter
filter(mask: Tensor) -> LogprobsTensors
Filter the logprobs tensors with the given bool mask.
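The two `LogprobsTensors` methods can be illustrated together with a list-based stand-in. This is a minimal sketch, not the library's code: `LogprobsSketch` and its field names (`token_ids`, `logprobs`, `ranks`) are assumptions, plain Python lists replace tensors, and the mask is assumed to select whole positions (rows).

```python
from typing import List, NamedTuple


class LogprobsSketch(NamedTuple):
    """Hypothetical list-based stand-in; field names are assumptions."""

    token_ids: List[List[int]]   # one row per position
    logprobs: List[List[float]]  # same shape as token_ids
    ranks: List[int]             # one entry per position

    @staticmethod
    def empty_cpu(
        num_positions: int, num_tokens_per_position: int
    ) -> "LogprobsSketch":
        # Lists have no uninitialized state, so zeros act as placeholders.
        return LogprobsSketch(
            token_ids=[[0] * num_tokens_per_position for _ in range(num_positions)],
            logprobs=[[0.0] * num_tokens_per_position for _ in range(num_positions)],
            ranks=[0] * num_positions,
        )

    def filter(self, mask: List[bool]) -> "LogprobsSketch":
        # Keep only the positions whose mask entry is True.
        if len(mask) != len(self.ranks):
            raise ValueError("mask needs one entry per position")
        keep = [i for i, flag in enumerate(mask) if flag]
        return LogprobsSketch(
            token_ids=[self.token_ids[i] for i in keep],
            logprobs=[self.logprobs[i] for i in keep],
            ranks=[self.ranks[i] for i in keep],
        )


empty = LogprobsSketch.empty_cpu(num_positions=3, num_tokens_per_position=2)
kept = empty.filter([True, False, True])
```

Because the class is a `NamedTuple`, `filter` returns a new instance and leaves the original untouched, which matches the `-> LogprobsTensors` return type in the signature above.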
make_empty_encoder_model_runner_output
Create a ModelRunnerOutput stub that contains the correct per-request bookkeeping but no generated data yet.
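The idea of a stub with bookkeeping but no generated data might look like the sketch below. Everything here is hypothetical: `RunnerOutputSketch` is a simplified record, not vLLM's `ModelRunnerOutput`, and `make_empty_output_sketch` takes a plain list of request IDs because the real function's arguments are not shown in this section.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class RunnerOutputSketch:
    """Hypothetical, simplified output record (not vLLM's ModelRunnerOutput)."""

    req_ids: List[str]
    req_id_to_index: Dict[str, int]
    sampled_token_ids: List[List[int]] = field(default_factory=list)


def make_empty_output_sketch(req_ids: List[str]) -> RunnerOutputSketch:
    # Per-request bookkeeping is filled in; generated data is left empty.
    return RunnerOutputSketch(
        req_ids=list(req_ids),
        req_id_to_index={rid: i for i, rid in enumerate(req_ids)},
    )


stub = make_empty_output_sketch(["req-a", "req-b"])
```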