vllm.model_executor.models.step3_vl ¶
Step3VLImageEmbeddingInputs ¶
Bases: TensorSchema
Dimensions
- bn: Batch size * number of images
- f: Image feature size
- h: Hidden size (must match the hidden size of the language model backbone)
Source code in vllm/model_executor/models/step3_vl.py
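For reference, TensorSchema subclasses in vLLM pair a type discriminator with shape-annotated tensor fields. Below is a minimal sketch of what this schema likely looks like; the field names (`type`, `data`) and the `TensorShape` annotation helper are assumptions based on the common pattern, not a copy of the actual source:

```python
from typing import Annotated, Literal

import torch

# Assumed import path for the schema helpers; verify against your vLLM version.
from vllm.utils.tensor_schema import TensorSchema, TensorShape


class Step3VLImageEmbeddingInputs(TensorSchema):
    """Pre-computed image embeddings passed to the language model.

    Dimensions:
        - bn: Batch size * number of images
        - f: Image feature size
        - h: Hidden size (must match the language model backbone)
    """

    # Discriminator used to tell embedding inputs apart from pixel inputs.
    type: Literal["image_embeds"]
    # Shape is validated against the symbolic dimensions above.
    data: Annotated[torch.Tensor, TensorShape("bn", "f", "h")]
```

The schema validates tensor shapes at construction time, so a mismatched hidden size fails fast instead of surfacing as a shape error deep inside the model.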
Step3VLImagePixelInputs ¶
Bases: TensorSchema
Dimensions
- bn: Batch size * number of images
- c: Number of channels (3)
- h: Height
- w: Width
- bnp: Batch size * number of images * number of patches
- hp: Height of patch
- wp: Width of patch
Source code in vllm/model_executor/models/step3_vl.py
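Analogously, a hedged sketch of the pixel-input schema, covering both whole-image tensors and per-patch crops; the field names (`pixel_values`, `patch_pixel_values`, `num_patches`) are illustrative assumptions:

```python
from typing import Annotated, Literal, Optional

import torch

# Assumed import path for the schema helpers; verify against your vLLM version.
from vllm.utils.tensor_schema import TensorSchema, TensorShape


class Step3VLImagePixelInputs(TensorSchema):
    """Raw pixel inputs for the vision encoder.

    Dimensions:
        - bn: Batch size * number of images
        - c: Number of channels (3)
        - h / w: Image height / width
        - bnp: Batch size * number of images * number of patches
        - hp / wp: Patch height / width
    """

    type: Literal["pixel_values"]
    # Full images, channels fixed to 3 (RGB).
    pixel_values: Annotated[torch.Tensor, TensorShape("bn", 3, "h", "w")]
    # Optional patch-level crops flattened across images.
    patch_pixel_values: Annotated[
        Optional[torch.Tensor], TensorShape("bnp", 3, "hp", "wp")
    ]
    # Number of patches contributed by each image.
    num_patches: Annotated[Optional[torch.Tensor], TensorShape("bn")]
```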
Step3VisionAttention ¶
Bases: Module
Multi-headed attention from the 'Attention Is All You Need' paper
Source code in vllm/model_executor/models/step3_vl.py
forward ¶
forward(hidden_states: Tensor)
Input shape: Batch x Time x Channel
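To make the shape convention concrete, here is a minimal sketch of what this forward pass computes. It uses plain `nn.Linear` and `scaled_dot_product_attention` where the real Step3VisionAttention would use vLLM's tensor-parallel layers; the class name and layer choices are illustrative assumptions:

```python
import torch
from torch import nn


class VisionAttentionSketch(nn.Module):
    """Multi-head self-attention over (Batch, Time, Channel) inputs."""

    def __init__(self, embed_dim: int, num_heads: int) -> None:
        super().__init__()
        assert embed_dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads
        # Fused projection producing query, key, and value in one matmul.
        self.qkv_proj = nn.Linear(embed_dim, 3 * embed_dim)
        self.out_proj = nn.Linear(embed_dim, embed_dim)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        b, t, c = hidden_states.shape  # Batch x Time x Channel
        q, k, v = self.qkv_proj(hidden_states).chunk(3, dim=-1)
        # Split channels into heads: (batch, heads, time, head_dim).
        q = q.view(b, t, self.num_heads, self.head_dim).transpose(1, 2)
        k = k.view(b, t, self.num_heads, self.head_dim).transpose(1, 2)
        v = v.view(b, t, self.num_heads, self.head_dim).transpose(1, 2)
        out = nn.functional.scaled_dot_product_attention(q, k, v)
        # Merge heads back to (batch, time, channel) and project out.
        out = out.transpose(1, 2).reshape(b, t, c)
        return self.out_proj(out)
```

Since the vision encoder attends over image patches with no causal ordering, no attention mask is applied; every patch token can attend to every other.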