Skip to content

Latest commit

 

History

History
15 lines (12 loc) · 513 Bytes

README.md

File metadata and controls

15 lines (12 loc) · 513 Bytes

Optimus: HuggingFace-Aligned 3D-parallel backend

  • flash attention 2 support on training
  • flash attention 2 support on left-padding generation with kv cache
  • fmha on GQA & MQA
  • multi model topology support by mpu context
  • more model type for experiment (PPL,RM,...)

TODO:

  • GQA & MQA generation (left-padding)

  • less model control option

  • generator based on non-batch flash attention and self-design cuda fused kernel

  • Fixed pipeline model

  • KV Cache management by pre-malloc and reuse (pre-calculate)