cambricon: update mlu with master #46

Closed · wants to merge 19 commits
42 changes: 42 additions & 0 deletions .github/workflows/python-test.yaml
@@ -0,0 +1,42 @@

# This workflow will install Python dependencies, run tests and lint with a single version of Python
# For more information see: https://docs.github.com/en/actions/automating-builds-and-tests/building-and-testing-python

name: flag-gems-test

on:
  push:
    branches: [ "master" ]
  pull_request:
    branches: [ "master" ]

jobs:
  container-unit-test:
    runs-on: [self-hosted, docker]
    container:
      image: localhost:5000/flag-gems-ci:v1.0
      ports:
        - 81
      options: --gpus all --hostname flag-gems_cicd_ut
    steps:
      - name: checkout-code
        uses: actions/checkout@v2

      - name: unit_test-flag-gems
        run: |
          CUDA_VISIBLE_DEVICES=0 pytest -s tests/test_*

  container-model-test:
    runs-on: [self-hosted, docker]
    container:
      image: localhost:5000/flag-gems-ci:v1.0
      ports:
        - 82
      options: --gpus all --hostname flag-gems_cicd_model -v /home/flaggems_cicd/huggingface_cache_bert:/__w/_temp/_github_home/.cache/huggingface
    steps:
      - name: checkout-code
        uses: actions/checkout@v2

      - name: examples-flag-gems
        run: |
          CUDA_VISIBLE_DEVICES=1 pytest -s examples/model_bert_test.py
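
For contributors without access to the self-hosted runner, the two CI jobs reduce to the pytest invocations above. A minimal sketch of reproducing them locally (illustrative only, not part of the workflow; it assumes a checkout of the repository root and at least two visible GPUs):

```python
import subprocess

# Mirror the two CI jobs above: unit tests on GPU 0, the BERT example on GPU 1.
# shell=True keeps the tests/test_* glob expansion identical to the workflow.
commands = [
    "CUDA_VISIBLE_DEVICES=0 pytest -s tests/test_*",
    "CUDA_VISIBLE_DEVICES=1 pytest -s examples/model_bert_test.py",
]

for cmd in commands:
    print(f"running: {cmd}")
    subprocess.run(cmd, shell=True, check=True)  # raises if a test suite fails
```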
8 changes: 3 additions & 5 deletions CONTRIBUTING.md
@@ -24,11 +24,9 @@ FlagGems
│ │ ├──ops: single operators
│ │ ├──fused: fused operators
│ │ ├──__init__.py
├── tests
│ ├──flag_gems
│ │ ├──model_bert_test.py: test for BERT model running with flag_gems
│ │ ├──op_accu_test.py: test for accuracy of operators
│ │ ├──op_perf_test.py: test for performance of operators
├── tests: accuracy test files
├── benchmark: performance test files
├── examples: model test files
├── LICENSE
├── README.md
├── README_cn.md
4 changes: 1 addition & 3 deletions OperatorList.md
@@ -1,7 +1,5 @@
## Operator List

FlagGems will implement the following operators as planned. Version 1.0 will be released within 6 months.

## v1.0
- addmm
- bmm
@@ -32,6 +30,7 @@ FlagGems will implement the following operators as planned. Version 1.0 will be

## v2.0

- mv
- all
- any
- bitwise_and
@@ -41,7 +40,6 @@ FlagGems will implement the following operators as planned. Version 1.0 will be
- eq
- ge
- gt
- is_nonzero
- isinf
- isnan
- le
49 changes: 34 additions & 15 deletions README.md
@@ -14,13 +14,19 @@ By registering with the ATen backend of PyTorch, FlagGems facilitates a seamless
- support pointwise operators: abs, add, div, dropout, exp, gelu, mul, pow, reciprocal, relu, rsqrt, silu, sub, triu
- support reduction operators: cumsum, layernorm, mean, softmax

### v2.0
- support BLAS operator: mv, outer
- support pointwise operators: bitwise_and, bitwise_not, bitwise_or, cos, clamp, eq, ge, gt, isinf, isnan, le, lt, ne, neg, or, sin, tanh, sigmoid
- support reduction operators: all, any, amax, argmax, max, min, prod, sum, var_mean, vector_norm, cross_entropy_loss, group_norm, log_softmax, rms_norm
- support fused operators: skip_rms_norm, skip_layer_norm, gelu_and_mul, silu_and_mul, apply_rotary_position_embedding
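
Because these operators are registered against PyTorch's ATen backend, existing model code picks them up without modification. A minimal usage sketch, assuming the package exposes an `enable()` entry point that installs the registrations (check the installed `flag_gems` package for the exact API):

```python
import torch
import flag_gems

# Assumed entry point: installs the Triton implementations over the ATen ops.
flag_gems.enable()

x = torch.randn(1024, 1024, device="cuda", dtype=torch.float16)
y = torch.mm(x, x)                      # dispatched to the FlagGems kernel once enabled
z = torch.nn.functional.gelu(y)
print(z.dtype, z.shape)
```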

## Quick Start

### Requirements

1. Triton >= 2.2.0
2. PyTorch >= 2.1.2
3. Transformers >= 4.31.0
3. Transformers >= 4.40.2

### Installation

@@ -61,37 +67,50 @@ pip install .

### Execute

1. Run Tests
- Operator Accuracy
1. Test Operator Accuracy
- Run reference on cuda
```shell
cd tests
pytest test_xx_ops.py
```
- Run reference on cpu
```shell
cd tests/flag_gems
pytest op_accu_test.py
cd tests
pytest test_xx_ops.py --device cpu
```
- Model Accuracy

2. Test Model Accuracy
```shell
cd examples
pytest model_xx_test.py
```

3. Test Operator Performance
- Test CUDA performance
```shell
cd tests/flag_gems
pytest model_bert_test.py
cd benchmark
pytest test_xx_perf.py -s
```
- Operator Performance
- Test end-to-end performance
```shell
cd tests/flag_gems
python op_perf_test.py
cd benchmark
pytest test_xx_perf.py -s --mode cpu
```

2. Run tests with logging information
4. Run tests with logging information
```shell
pytest program.py --log-cli-level debug
```
This is not recommended when testing performance.

## Supported Operators

Operators will be implemented according to [OperatorList.md](https://github.com/FlagOpen/FlagGems/blob/master/OperatorList.md).

## Supported Models

| Model | float16 | float32 | bfloat16 |
| :---: | :---: | :---: | :---: |
| Bert_base | ✓ | ✓ | ✓ |
- Bert-base-uncased
- Llama-2-7b

## Supported Platforms

42 changes: 30 additions & 12 deletions README_cn.md
@@ -13,13 +13,19 @@ FlagGems通过对PyTorch的后端aten算子进行覆盖重写,实现算子库
- 支持pointwise类算子:abs, add, div, dropout, exp, gelu, mul, pow, reciprocal, relu, rsqrt, silu, sub, triu
- 支持reduction类算子:cumsum, layernorm, mean, softmax

### v2.0
- 支持BLAS类算子: mv, outer
- 支持pointwise类算子: bitwise_and, bitwise_not, bitwise_or, cos, clamp, eq, ge, gt, isinf, isnan, le, lt, ne, neg, or, sin, tanh, sigmoid
- 支持reduction类算子: all, any, amax, argmax, max, min, prod, sum, var_mean, vector_norm, cross_entropy_loss, group_norm, log_softmax, rms_norm
- 支持融合算子: skip_rms_norm, skip_layer_norm, gelu_and_mul, silu_and_mul, apply_rotary_position_embedding

## 快速入门

### 依赖

1. Triton >= 2.2.0
2. PyTorch >= 2.1.2
3. Transformers >= 4.31.0
3. Transformers >= 4.40.2

### 安装

@@ -60,37 +66,49 @@ pip install .

### 执行

1. 运行测试
- 算子正确性测试
1. 算子正确性测试
- 在CUDA上运行参考实现
```shell
cd tests/flag_gems
pytest op_accu_test.py
```
- 模型正确性测试
- 在CPU上运行参考实现
```shell
cd tests/flag_gems
pytest model_bert_test.py
cd tests
pytest test_xx_ops.py --device cpu
```
- 算子性能测试
2. 模型正确性测试
```shell
cd examples
pytest model_xx_test.py
```

3. 算子性能测试
- 测试CUDA性能
```shell
cd tests/flag_gems
python op_perf_test.py
cd benchmark
pytest test_xx_perf.py -s
```
- 测试端到端性能
```shell
cd benchmark
pytest test_xx_perf.py -s --mode cpu
```

2. 运行时打印日志信息
```shell
pytest program.py --log-cli-level debug
```
测试性能时不建议打开。

## 支持算子

算子将按照文档[OperatorList.md](https://github.com/FlagOpen/FlagGems/blob/master/OperatorList.md)的顺序逐步实现。

## 支持模型

| Model | float16 | float32 | bfloat16 |
| :---: | :---: | :---: | :---: |
| Bert_base | ✓ | ✓ | ✓ |
- Bert-base-uncased
- Llama-2-7b

## 支持平台

Empty file added benchmark/__init__.py
Empty file.
15 changes: 15 additions & 0 deletions benchmark/conftest.py
@@ -0,0 +1,15 @@
def pytest_addoption(parser):
    parser.addoption(
        "--mode",
        action="store",
        default="cuda",
        required=False,
        choices=["cuda", "cpu"],
        help="record latency in cuda or cpu",
    )


def pytest_configure(config):
    value = config.getoption("--mode")
    global CPU_MODE
    CPU_MODE = value == "cpu"
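
The `--mode` option only selects how latency is recorded. A hedged sketch of the timing split a benchmark might implement on top of the `CPU_MODE` flag set above — the helper below is illustrative and not the repository's benchmark code:

```python
import time

import torch


def latency_ms(fn, *args, cpu_mode=False, warmup=5, iters=20):
    """Average latency of fn(*args) in milliseconds.

    cpu_mode mirrors the CPU_MODE global set above: host wall-clock timing
    (including asynchronous kernel launches) when True, CUDA-event timing
    on the device otherwise.
    """
    for _ in range(warmup):
        fn(*args)
    if cpu_mode:
        tic = time.perf_counter()
        for _ in range(iters):
            fn(*args)
        return (time.perf_counter() - tic) / iters * 1e3
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    torch.cuda.synchronize()
    start.record()
    for _ in range(iters):
        fn(*args)
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters


if __name__ == "__main__":
    a = torch.randn(1024, 1024, device="cuda", dtype=torch.float16)
    print(f"mm latency: {latency_ms(torch.mm, a, a):.3f} ms")
```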