vllm.model_executor.layers.fused_moe.fused_moe ¶
Fused MoE Triton kernels.
TritonExperts ¶
Bases: FusedMoEPermuteExpertsUnpermute
Triton-based fused MoE expert implementation.
Source code in vllm/model_executor/layers/fused_moe/fused_moe.py, lines 1889–2124.
_ensure_block_size_k_divisible ¶
Ensure block_size_k is a divisor of size_k and divisible by group_size.
This ensures BLOCK_SIZE_K compatibility with the MoeWNA16 CUDA kernel, which requires size_k % BLOCK_SIZE_K == 0 and BLOCK_SIZE_K % group_size == 0; a sketch of such an adjustment follows this entry.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| size_k | int | The size_k dimension, which must be divisible by the result. | required |
| block_size_k | int | Preferred block size (will be adjusted if needed). | required |
| group_size | int | The result must be divisible by this. | required |
Returns:
| Type | Description |
|---|---|
| int | A valid BLOCK_SIZE_K that divides size_k and is divisible by group_size. |
Source code in vllm/model_executor/layers/fused_moe/fused_moe.py
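The following is a minimal sketch of the kind of adjustment described above. It is not the vLLM implementation, only an illustration of satisfying the two divisibility constraints:

```python
def pick_block_size_k(size_k: int, block_size_k: int, group_size: int) -> int:
    """Illustrative sketch (not vLLM's code): choose a BLOCK_SIZE_K that
    divides size_k evenly and is itself a multiple of group_size."""
    # Start from the preferred block size rounded down to a multiple of
    # group_size, then walk down in group_size steps until size_k divides evenly.
    candidate = (block_size_k // group_size) * group_size
    while candidate >= group_size:
        if size_k % candidate == 0:
            return candidate
        candidate -= group_size
    # Fall back to group_size itself, assuming size_k is a multiple of it.
    assert size_k % group_size == 0, "size_k must be a multiple of group_size"
    return group_size


print(pick_block_size_k(4096, 128, 32))  # 128: divides 4096 and is a multiple of 32
```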
_get_config_quant_dtype ¶
_get_config_quant_dtype(
use_fp8_w8a8: bool,
use_int8_w8a8: bool,
ocp_mx_scheme: str | None,
) -> None | dtype | str
Get the quantization type based on the quantization strategy flags. We don't have a quant_config at this point, so we need to work backwards. A return type of None means no quantization is required, either because the input is unquantized or because it has been quantized prior to calling fused_experts_impl.
Source code in vllm/model_executor/layers/fused_moe/fused_moe.py
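As a rough illustration of the "work backwards from flags" logic described above. The return values here are assumptions for illustration only, not necessarily what the helper actually returns:

```python
import torch


def config_quant_dtype_sketch(
    use_fp8_w8a8: bool,
    use_int8_w8a8: bool,
    ocp_mx_scheme: str | None,
) -> None | torch.dtype | str:
    # Assumed mapping for illustration; the real helper may differ in details.
    if use_fp8_w8a8:
        return torch.float8_e4m3fn  # fp8 activation/weight quantization
    if use_int8_w8a8:
        return torch.int8           # int8 activation/weight quantization
    if ocp_mx_scheme is not None:
        return ocp_mx_scheme        # OCP MX scheme identified by its string name
    return None                     # unquantized, or quantized before fused_experts_impl
```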
fused_experts ¶
fused_experts(
hidden_states: Tensor,
w1: Tensor,
w2: Tensor,
topk_weights: Tensor,
topk_ids: Tensor,
inplace: bool = False,
activation: MoEActivation = SILU,
apply_router_weight_on_input: bool = False,
global_num_experts: int = -1,
expert_map: Tensor | None = None,
quant_config: FusedMoEQuantConfig | None = None,
) -> Tensor
Run fused MoE expert computation using Triton kernels.
Source code in vllm/model_executor/layers/fused_moe/fused_moe.py
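A hedged usage sketch: the tensor shapes and the plain softmax/top-k routing below are illustrative assumptions, not vLLM's router:

```python
import torch
from vllm.model_executor.layers.fused_moe.fused_moe import fused_experts

num_tokens, hidden, inter, num_experts, top_k = 16, 1024, 4096, 8, 2

hidden_states = torch.randn(num_tokens, hidden, dtype=torch.float16, device="cuda")
# w1 holds the fused gate/up projections per expert, w2 the down projection.
w1 = torch.randn(num_experts, 2 * inter, hidden, dtype=torch.float16, device="cuda")
w2 = torch.randn(num_experts, hidden, inter, dtype=torch.float16, device="cuda")

# Stand-in routing: softmax over random logits, then top-k per token.
router_probs = torch.randn(num_tokens, num_experts, device="cuda").softmax(dim=-1)
topk_weights, topk_ids = torch.topk(router_probs, top_k, dim=-1)
topk_ids = topk_ids.to(torch.int32)  # routing ids are typically int32 in vLLM

out = fused_experts(hidden_states, w1, w2,
                    topk_weights=topk_weights, topk_ids=topk_ids)
print(out.shape)  # torch.Size([16, 1024])
```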
fused_moe_kernel ¶
fused_moe_kernel(
a_ptr,
b_ptr,
c_ptr,
b_bias_ptr,
a_scale_ptr,
b_scale_ptr,
topk_weights_ptr,
sorted_token_ids_ptr,
expert_ids_ptr,
num_tokens_post_padded_ptr,
N,
K,
EM,
num_valid_tokens,
stride_am: int64,
stride_ak: int64,
stride_be: int64,
stride_bk: int64,
stride_bn: int64,
stride_cm: int64,
stride_cn: int64,
stride_asm: int64,
stride_ask: int64,
stride_bse: int64,
stride_bsk: int64,
stride_bsn: int64,
stride_bbe: int64,
stride_bbn: int64,
group_n: constexpr,
group_k: constexpr,
naive_block_assignment: constexpr,
BLOCK_SIZE_M: constexpr,
BLOCK_SIZE_N: constexpr,
BLOCK_SIZE_K: constexpr,
GROUP_SIZE_M: constexpr,
SPLIT_K: constexpr,
MUL_ROUTED_WEIGHT: constexpr,
top_k: constexpr,
compute_type: constexpr,
use_fp8_w8a8: constexpr,
use_int8_w8a8: constexpr,
use_int8_w8a16: constexpr,
per_channel_quant: constexpr,
HAS_BIAS: constexpr,
)
Implements the fused computation for a Mixture of Experts (MOE) using token and expert matrices.
Key Parameters:

- A: The input tensor representing tokens with shape (*, K), where '*' can be any shape representing batches and K is the feature dimension of each token.
- B: The stacked MOE weight tensor with shape (E, N, K), where E is the number of experts, K is the input feature dimension, and N is the output feature dimension.
- C: The output cache tensor with shape (M, topk, N), where M is the total number of tokens post padding, topk is the number of times each token is repeated, and N is the output feature dimension.
- sorted_token_ids: A tensor containing the sorted indices of tokens, repeated topk times and arranged by the expert index they are assigned to.
- expert_ids: A tensor containing the indices of the expert for each block. It determines which expert matrix from B should be used for each block in A.
- naive_block_assignment: A boolean flag indicating whether to use naive token-wise block assignment. If True, each block corresponds to a single token.

This kernel performs the multiplication of a token by its corresponding expert matrix as determined by expert_ids. The sorting of sorted_token_ids by expert index and the padding ensure divisibility by BLOCK_SIZE_M, which is necessary to maintain consistency in block matrix multiplication across different blocks processed by the same expert.
Source code in vllm/model_executor/layers/fused_moe/fused_moe.py, lines 313–571.
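To make the computation above concrete, here is an unfused PyTorch reference (a sketch, not vLLM code) of what the kernel produces, ignoring the block and padding machinery handled by sorted_token_ids:

```python
import torch


def moe_reference(a, b, topk_weights, topk_ids, mul_routed_weight=True):
    """C[m, t] = A[m] @ B[topk_ids[m, t]].T, optionally scaled by the routing
    weight. Shapes: a (M, K), b (E, N, K), topk_* (M, top_k), output (M, top_k, N)."""
    M, K = a.shape
    E, N, _ = b.shape
    top_k = topk_ids.shape[1]
    c = a.new_zeros(M, top_k, N)
    for e in range(E):
        # All (token, slot) pairs routed to expert e. The Triton kernel instead
        # processes them in BLOCK_SIZE_M-sized, padded blocks via sorted_token_ids
        # and expert_ids, so each block touches exactly one expert matrix.
        token_idx, slot_idx = (topk_ids == e).nonzero(as_tuple=True)
        if token_idx.numel() == 0:
            continue
        out = a[token_idx] @ b[e].t()  # (n_e, N)
        if mul_routed_weight:
            out = out * topk_weights[token_idx, slot_idx].to(out.dtype).unsqueeze(-1)
        c[token_idx, slot_idx] = out.to(c.dtype)
    return c
```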
fused_moe_kernel_gptq_awq ¶
fused_moe_kernel_gptq_awq(
a_ptr,
b_ptr,
c_ptr,
b_scale_ptr,
b_zp_ptr,
topk_weights_ptr,
sorted_token_ids_ptr,
expert_ids_ptr,
num_tokens_post_padded_ptr,
N: constexpr,
K: constexpr,
EM,
num_valid_tokens,
stride_am: int64,
stride_ak: int64,
stride_be: int64,
stride_bk: int64,
stride_bn: int64,
stride_cm: int64,
stride_cn: int64,
stride_bse: int64,
stride_bsk: int64,
stride_bsn: int64,
stride_bze: int64,
stride_bzk: int64,
stride_bzn: int64,
block_k_diviable: constexpr,
group_size: constexpr,
BLOCK_SIZE_M: constexpr,
BLOCK_SIZE_N: constexpr,
BLOCK_SIZE_K: constexpr,
GROUP_SIZE_M: constexpr,
SPLIT_K: constexpr,
MUL_ROUTED_WEIGHT: constexpr,
top_k: constexpr,
compute_type: constexpr,
has_zp: constexpr,
use_int4_w4a16: constexpr,
use_int8_w8a16: constexpr,
)
Implements the fused computation for a Mixture of Experts (MOE) using token and expert matrices.
Key Parameters:

- A: The input tensor representing tokens with shape (*, K), where '*' can be any shape representing batches and K is the feature dimension of each token.
- B: The stacked MOE weight tensor with shape (E, N, K), where E is the number of experts, K is the input feature dimension, and N is the output feature dimension.
- C: The output cache tensor with shape (M, topk, N), where M is the total number of tokens post padding, topk is the number of times each token is repeated, and N is the output feature dimension.
- sorted_token_ids: A tensor containing the sorted indices of tokens, repeated topk times and arranged by the expert index they are assigned to.
- expert_ids: A tensor containing the indices of the expert for each block. It determines which expert matrix from B should be used for each block in A.

This kernel performs the multiplication of a token by its corresponding expert matrix as determined by expert_ids. The sorting of sorted_token_ids by expert index and the padding ensure divisibility by BLOCK_SIZE_M, which is necessary to maintain consistency in block matrix multiplication across different blocks processed by the same expert.
Source code in vllm/model_executor/layers/fused_moe/fused_moe.py, lines 80–310.
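The GPTQ/AWQ variant additionally dequantizes B on the fly from packed int4 (or int8) values using per-group scales and optional zero points. Below is a sketch of that group-wise dequantization, under an assumed nibble packing that may differ from vLLM's actual weight layout:

```python
import torch


def dequant_int4_groupwise(qweight, scales, zeros, group_size):
    """Group-wise int4 dequantization sketch: w = (q - zero_point) * scale.

    Assumed layout (illustration only):
      qweight: (K // 2, N) uint8, two 4-bit values packed per byte along K
      scales:  (K // group_size, N)
      zeros:   (K // group_size, N) integer zero points
    Returns a dense (K, N) weight in scales.dtype."""
    low = qweight & 0xF
    high = (qweight >> 4) & 0xF
    # Interleave low/high nibbles back into a (K, N) tensor of values in [0, 15].
    q = torch.stack([low, high], dim=1).reshape(-1, qweight.shape[1]).to(torch.int32)
    # Each group of group_size rows along K shares one scale and zero point.
    g = torch.arange(q.shape[0], device=q.device) // group_size
    return (q - zeros[g].to(torch.int32)).to(scales.dtype) * scales[g]
```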
get_moe_configs cached ¶
get_moe_configs(
E: int,
N: int,
dtype: str | None,
block_n: int | None = None,
block_k: int | None = None,
) -> dict[int, Any] | None
Return optimized configurations for the fused MoE kernel.
The return value will be a dictionary that maps an irregular grid of batch sizes to configurations of the fused_moe kernel. To evaluate the kernel on a given batch size bs, the closest batch size in the grid should be picked and the associated configuration chosen to invoke the kernel.
Source code in vllm/model_executor/layers/fused_moe/fused_moe.py, lines 1032–1095.
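A hedged usage sketch of the nearest-batch-size lookup described above. The E/N/dtype values are illustrative, and None is returned when no tuned configuration file exists for that shape:

```python
from vllm.model_executor.layers.fused_moe.fused_moe import get_moe_configs

configs = get_moe_configs(E=8, N=4096, dtype=None)
if configs is not None:
    M = 37  # current number of tokens in the batch
    # Pick the grid batch size closest to M and use its tuned kernel config.
    closest_bs = min(configs.keys(), key=lambda bs: abs(bs - M))
    print(closest_bs, configs[closest_bs])  # e.g. BLOCK_SIZE_M/N/K, GROUP_SIZE_M, ...
else:
    # No tuned config found; vLLM falls back to a default heuristic configuration.
    pass
```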