vllm.model_executor.models.idefics2_vision_model ¶
PyTorch Idefics2 model.
Idefics2Encoder ¶
Bases: Module
Transformer encoder consisting of `config.num_hidden_layers` self-attention layers. Each layer is an Idefics2EncoderLayer.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| config | Idefics2Config | Idefics2Config | required |
Source code in vllm/model_executor/models/idefics2_vision_model.py
forward ¶
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| inputs_embeds | Tensor | Optionally, instead of passing `input_ids`, you can choose to directly pass an embedded representation. | required |
Source code in vllm/model_executor/models/idefics2_vision_model.py
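For orientation, here is a minimal sketch of how an encoder like this can be assembled: a stack of `config.num_hidden_layers` self-attention layers applied in sequence to `inputs_embeds`. The class name, constructor arguments, and the use of `nn.TransformerEncoderLayer` as a stand-in for Idefics2EncoderLayer are illustrative assumptions, not the actual vLLM implementation.

```python
import torch
import torch.nn as nn


class VisionEncoderSketch(nn.Module):
    """Stack of self-attention layers, one per num_hidden_layers."""

    def __init__(self, hidden_size: int, num_attention_heads: int,
                 num_hidden_layers: int):
        super().__init__()
        # Stand-in for Idefics2EncoderLayer: any block mapping
        # (batch, seq_len, hidden_size) to the same shape works here.
        self.layers = nn.ModuleList([
            nn.TransformerEncoderLayer(d_model=hidden_size,
                                       nhead=num_attention_heads,
                                       batch_first=True)
            for _ in range(num_hidden_layers)
        ])

    def forward(self, inputs_embeds: torch.Tensor) -> torch.Tensor:
        # inputs_embeds is already an embedded representation of shape
        # (batch, seq_len, hidden_size); no token-id lookup happens here.
        hidden_states = inputs_embeds
        for layer in self.layers:
            hidden_states = layer(hidden_states)
        return hidden_states


# Example: 2 layers over a batch of 3 sequences of 16 patch embeddings.
encoder = VisionEncoderSketch(hidden_size=64, num_attention_heads=4,
                              num_hidden_layers=2)
out = encoder(torch.randn(3, 16, 64))
print(out.shape)  # torch.Size([3, 16, 64])
```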
Idefics2EncoderLayer ¶
Bases: Module
Source code in vllm/model_executor/models/idefics2_vision_model.py
forward ¶
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| hidden_states | `torch.FloatTensor` | Input to the layer, of shape `(batch, seq_len, embed_dim)`. | required |
Source code in vllm/model_executor/models/idefics2_vision_model.py
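The layer's forward contract (same-shape in, same-shape out over `(batch, seq_len, embed_dim)`) can be illustrated with a generic pre-norm attention-plus-MLP block. The module names and exact normalization placement below are assumptions for illustration, not vLLM's code.

```python
import torch
import torch.nn as nn


class EncoderLayerSketch(nn.Module):
    """Pre-norm self-attention + MLP block with residuals around each sub-layer."""

    def __init__(self, embed_dim: int, num_heads: int, mlp_dim: int):
        super().__init__()
        self.layer_norm1 = nn.LayerNorm(embed_dim)
        self.self_attn = nn.MultiheadAttention(embed_dim, num_heads,
                                               batch_first=True)
        self.layer_norm2 = nn.LayerNorm(embed_dim)
        self.mlp = nn.Sequential(nn.Linear(embed_dim, mlp_dim), nn.GELU(),
                                 nn.Linear(mlp_dim, embed_dim))

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, embed_dim)
        residual = hidden_states
        hidden_states = self.layer_norm1(hidden_states)
        attn_out, _ = self.self_attn(hidden_states, hidden_states,
                                     hidden_states)
        hidden_states = residual + attn_out

        residual = hidden_states
        hidden_states = self.mlp(self.layer_norm2(hidden_states))
        return residual + hidden_states
```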
Idefics2VisionAttention ¶
Bases: Module
Multi-headed attention from the 'Attention Is All You Need' paper.
Source code in vllm/model_executor/models/idefics2_vision_model.py
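As a reminder of what multi-headed attention computes, below is a self-contained sketch of scaled dot-product attention split across heads: softmax(QKᵀ/√d)·V per head, followed by an output projection. The projection names such as `q_proj` and `out_proj` are illustrative, not necessarily those used by this module.

```python
import math

import torch
import torch.nn as nn


class MultiHeadAttentionSketch(nn.Module):
    """Scaled dot-product self-attention split across several heads."""

    def __init__(self, embed_dim: int, num_heads: int):
        super().__init__()
        assert embed_dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = embed_dim // num_heads
        self.q_proj = nn.Linear(embed_dim, embed_dim)
        self.k_proj = nn.Linear(embed_dim, embed_dim)
        self.v_proj = nn.Linear(embed_dim, embed_dim)
        self.out_proj = nn.Linear(embed_dim, embed_dim)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        batch, seq_len, _ = hidden_states.shape

        def split_heads(x: torch.Tensor) -> torch.Tensor:
            # (batch, seq_len, embed_dim) -> (batch, num_heads, seq_len, head_dim)
            return x.view(batch, seq_len, self.num_heads,
                          self.head_dim).transpose(1, 2)

        q = split_heads(self.q_proj(hidden_states))
        k = split_heads(self.k_proj(hidden_states))
        v = split_heads(self.v_proj(hidden_states))

        # softmax(Q K^T / sqrt(d)) V, computed independently per head.
        scores = q @ k.transpose(-2, -1) / math.sqrt(self.head_dim)
        attn = torch.softmax(scores, dim=-1) @ v

        # Merge heads back to (batch, seq_len, embed_dim).
        attn = attn.transpose(1, 2).reshape(batch, seq_len, -1)
        return self.out_proj(attn)
```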
Idefics2VisionEmbeddings ¶
Bases: Module
This is a modified version of siglip.modeling_siglip.SiglipVisionEmbeddings that enables images of variable resolution.
The modifications are adapted from Patch n' Pack: NaViT, a Vision Transformer for any Aspect Ratio and Resolution, which allows treating images in their native aspect ratio without resizing them to a single fixed size. In particular, we start from the original pre-trained SigLIP model (which uses fixed-size square images) and adapt it by training on images of variable resolution.
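To make the variable-resolution idea concrete, here is a hedged sketch: patches are embedded with a convolution, and each patch's position id is obtained by bucketing its fractional (row, column) coordinate into a fixed grid, so images of different sizes share one learned position table. The class name, bucket count, and layer names are assumptions for illustration only, not the module's actual implementation.

```python
import torch
import torch.nn as nn


class VariableResolutionEmbeddingsSketch(nn.Module):
    """Patch embeddings whose position ids come from bucketed fractional coords."""

    def __init__(self, embed_dim: int = 64, patch_size: int = 16,
                 num_buckets: int = 32):
        super().__init__()
        self.patch_size = patch_size
        self.num_buckets = num_buckets
        # Non-overlapping patches -> embeddings, as in a standard ViT stem.
        self.patch_embed = nn.Conv2d(3, embed_dim, kernel_size=patch_size,
                                     stride=patch_size)
        # One shared position table indexed by (row_bucket, col_bucket).
        self.pos_embed = nn.Embedding(num_buckets * num_buckets, embed_dim)

    def forward(self, pixel_values: torch.Tensor) -> torch.Tensor:
        # pixel_values: (batch, 3, H, W); H and W may differ between calls.
        patches = self.patch_embed(pixel_values)           # (B, C, H/ps, W/ps)
        batch, _, n_rows, n_cols = patches.shape
        embeddings = patches.flatten(2).transpose(1, 2)    # (B, rows*cols, C)

        # Fractional coordinates in [0, 1) bucketed into a fixed grid, so a
        # 224x224 image and a 448x672 image index the same position table.
        row_ids = (torch.arange(n_rows) / n_rows * self.num_buckets).long()
        col_ids = (torch.arange(n_cols) / n_cols * self.num_buckets).long()
        position_ids = (row_ids[:, None] * self.num_buckets +
                        col_ids[None, :]).flatten()        # (rows*cols,)

        return embeddings + self.pos_embed(position_ids)


emb = VariableResolutionEmbeddingsSketch()
print(emb(torch.randn(1, 3, 224, 320)).shape)  # torch.Size([1, 280, 64])
```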