vllm.lora.ops.triton_ops.fused_moe_lora_op ¶
_adjust_kernel_inputs ¶
_adjust_kernel_inputs(
num_active_loras: int,
sorted_token_ids: Tensor | None,
expert_ids: Tensor,
)
Helper function that adjusts the kernel inputs when sorted_token_ids is None.
Source code in vllm/lora/ops/triton_ops/fused_moe_lora_op.py
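The body is not shown here, so the following is only a minimal host-side sketch of what such a fallback could look like, assuming that a missing sorted_token_ids switches the kernel into its naive block-assignment mode. The function name, return values, and placeholder tensor are assumptions, not the actual vLLM implementation.

```python
import torch

def adjust_kernel_inputs_sketch(
    num_active_loras: int,
    sorted_token_ids: torch.Tensor | None,
    expert_ids: torch.Tensor,
) -> tuple[torch.Tensor, torch.Tensor, bool]:
    # Assumption: a missing sorted_token_ids means no per-LoRA sorted token
    # layout exists, so the kernel falls back to naive block assignment.
    naive_block_assignment = sorted_token_ids is None
    if naive_block_assignment:
        # Hypothetical placeholder so downstream pointer arithmetic stays
        # uniform; the real helper may substitute a different tensor.
        sorted_token_ids = torch.empty(
            0, dtype=torch.int32, device=expert_ids.device)
    return sorted_token_ids, expert_ids, naive_block_assignment
```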
_get_expert_id ¶
_get_expert_id(
expert_ids_ptr,
lora_id,
pid_m,
stride_el,
max_loras,
naive_block_assignment: constexpr,
)
Returns the expert_id for the given program block (pid_m).
Source code in vllm/lora/ops/triton_ops/fused_moe_lora_op.py
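A host-side Python model of what a device helper with this signature could compute. The indexing scheme (one shared expert-id table under naive block assignment, per-LoRA rows separated by stride_el otherwise) is inferred from the parameter names and is an assumption; max_loras, presumably a bounds guard, is omitted for brevity.

```python
import torch

def get_expert_id_sketch(
    expert_ids: torch.Tensor,  # flat int buffer, as the kernel sees it
    lora_id: int,
    pid_m: int,
    stride_el: int,
    naive_block_assignment: bool,
) -> int:
    if naive_block_assignment:
        # Assumed: a single shared table with one expert id per M-block.
        return int(expert_ids[pid_m])
    # Assumed: per-LoRA rows, stride_el elements apart in the flat buffer.
    return int(expert_ids[lora_id * stride_el + pid_m])
```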
_get_lora_id ¶
_get_lora_id(
lora_ids,
token_lora_mapping_ptr,
lora_idx,
pid_m,
top_k_num,
naive_block_assignment: constexpr,
)
Returns the lora_id for the given program block.
Source code in vllm/lora/ops/triton_ops/fused_moe_lora_op.py
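Again a hedged host-side model, not the device code: the top_k_num parameter suggests that under naive block assignment each program block corresponds to one routed (token, expert) pair, so dividing by top_k recovers the source token. That reading, and the body below, are assumptions.

```python
import torch

def get_lora_id_sketch(
    lora_ids: torch.Tensor,            # ids of the active LoRA slots
    token_lora_mapping: torch.Tensor,  # per-token LoRA assignment
    lora_idx: int,
    pid_m: int,
    top_k_num: int,
    naive_block_assignment: bool,
) -> int:
    if naive_block_assignment:
        # Assumed: undo the top-k expansion to find the source token, then
        # read its LoRA id from the per-token mapping.
        token_idx = pid_m // top_k_num
        return int(token_lora_mapping[token_idx])
    # Assumed: blocks are already grouped per LoRA slot.
    return int(lora_ids[lora_idx])
```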
_get_ptr ¶
_LORA_PTR_DICT collects the required information during profile_run. After that, it remains constant and subsequent lookups go through the LUT. Refer to: https://github.com/triton-lang/triton/blob/release/3.1.x/python/tutorials/08-grouped-gemm.py
Source code in vllm/lora/ops/triton_ops/fused_moe_lora_op.py
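The referenced grouped-GEMM tutorial builds a small tensor of raw data pointers once and indexes it from the kernel as a lookup table. Below is a minimal sketch of that caching pattern, assuming the dict is keyed on the weight tensors' data pointers; the key choice and the function signature are assumptions, not the vLLM API.

```python
import torch

# Built once during profile_run, constant afterwards.
_LORA_PTR_DICT: dict[tuple[int, ...], torch.Tensor] = {}

def get_ptr_sketch(lora_weights: list[torch.Tensor],
                   device: torch.device) -> torch.Tensor:
    key = tuple(w.data_ptr() for w in lora_weights)
    if key not in _LORA_PTR_DICT:
        # Cache a device tensor of raw pointers; the Triton kernel can then
        # load the right weight pointer with a single LUT read.
        _LORA_PTR_DICT[key] = torch.tensor(
            [w.data_ptr() for w in lora_weights],
            dtype=torch.int64, device=device)
    return _LORA_PTR_DICT[key]
```

Building the table during profile_run keeps host-side allocation off the hot path; later calls only hit the cache.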
_get_token_offs ¶
_get_token_offs(
sorted_token_ids_ptr,
lora_id,
pid_m,
offs,
stride_tl,
max_loras,
num_valid_tokens,
naive_block_assignment: constexpr,
BLOCK_SIZE_M: constexpr,
)
Returns the token offsets for the given program block.
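A host-side model of the offsets computation, assuming the non-naive path reads a per-LoRA, pre-sorted token-id buffer (rows stride_tl apart) while the naive path enumerates tokens contiguously, and that ids at or beyond num_valid_tokens mark padding the caller masks out. Everything below is an assumption about a function whose body is not shown here.

```python
import torch

def get_token_offs_sketch(
    sorted_token_ids: torch.Tensor,  # flat per-LoRA sorted buffer
    lora_id: int,
    pid_m: int,
    stride_tl: int,
    num_valid_tokens: int,
    naive_block_assignment: bool,
    BLOCK_SIZE_M: int,
) -> torch.Tensor:
    offs = torch.arange(BLOCK_SIZE_M)
    if naive_block_assignment:
        # Assumed: blocks cover routed tokens contiguously.
        return pid_m * BLOCK_SIZE_M + offs
    # Assumed: read this LoRA's sorted token ids; ids >= num_valid_tokens
    # are padding slots, masked by the consuming kernel.
    base = lora_id * stride_tl + pid_m * BLOCK_SIZE_M
    return sorted_token_ids[base + offs]
```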