TensorRT OSS v8.2 Early Access Release
Signed-off-by: Rajeev Rao <[email protected]>
rajeevsrao committed Oct 5, 2021
1 parent 80674b3 commit 2d517d2
Showing 278 changed files with 432,506 additions and 56,929 deletions.
40 changes: 39 additions & 1 deletion CHANGELOG.md
@@ -1,5 +1,44 @@
# TensorRT OSS Release Changelog

## [8.2.0 EA](https://docs.nvidia.com/deeplearning/tensorrt/release-notes/tensorrt-8.html#rel-8-2-0-EA) - 2021-10-05
### Added
- [Demo applications](demo/HuggingFace) showcasing TensorRT inference of [HuggingFace Transformers](https://huggingface.co/transformers).
- Support currently covers GPT-2 and T5 models.
- Added support for the following ONNX operators (see the import sketch at the end of this section):
- `Einsum`
- `IsNan`
- `GatherND`
- `Scatter`
- `ScatterElements`
- `ScatterND`
- `Sign`
- `Round`
- Added support for building the TensorRT Python API on Windows.
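
A minimal C++ sketch of importing a model that uses these operators, assuming the TensorRT 8.2 ONNX parser; `model.onnx` is a placeholder path, and error handling is trimmed for brevity:

```cpp
// Sketch: import an ONNX model that may use newly supported ops
// (Einsum, ScatterND, Sign, Round, ...) and build a serialized engine.
#include <NvInfer.h>
#include <NvOnnxParser.h>
#include <iostream>
#include <memory>

class StderrLogger : public nvinfer1::ILogger
{
    void log(Severity severity, const char* msg) noexcept override
    {
        if (severity <= Severity::kWARNING)
            std::cerr << msg << std::endl;
    }
};

int main()
{
    StderrLogger logger;
    auto builder = std::unique_ptr<nvinfer1::IBuilder>(nvinfer1::createInferBuilder(logger));
    // ONNX import requires an explicit-batch network.
    auto const flags = 1U << static_cast<uint32_t>(nvinfer1::NetworkDefinitionCreationFlag::kEXPLICIT_BATCH);
    auto network = std::unique_ptr<nvinfer1::INetworkDefinition>(builder->createNetworkV2(flags));
    auto parser = std::unique_ptr<nvonnxparser::IParser>(nvonnxparser::createParser(*network, logger));

    if (!parser->parseFromFile("model.onnx", static_cast<int>(nvinfer1::ILogger::Severity::kWARNING)))
    {
        std::cerr << "failed to parse model.onnx" << std::endl;
        return 1;
    }

    auto config = std::unique_ptr<nvinfer1::IBuilderConfig>(builder->createBuilderConfig());
    auto engine = std::unique_ptr<nvinfer1::IHostMemory>(builder->buildSerializedNetwork(*network, *config));
    return engine ? 0 : 1;
}
```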

### Updated
- Notable API updates in the TensorRT 8.2.0.6 EA release. See the [TensorRT Developer Guide](https://docs.nvidia.com/deeplearning/tensorrt/developer-guide/index.html) for details.
- Added three new `IExecutionContext` APIs, `getEnqueueEmitsProfile()`, `setEnqueueEmitsProfile()`, and `reportToProfiler()`, which can be used to collect layer profiling information when inference is launched as a CUDA graph (see the profiling sketch after this list).
- Eliminated the global logger; each `Runtime`, `Builder` or `Refitter` now has its own logger.
- Added new operators: `IAssertionLayer`, `IConditionLayer`, `IEinsumLayer`, `IIfConditionalBoundaryLayer`, `IIfConditionalOutputLayer`, `IIfConditionalInputLayer`, and `IScatterLayer` (see the construction sketch after this list).
- Added new `IGatherLayer` modes: `kELEMENT` and `kND`
- Added new `ISliceLayer` modes: `kFILL`, `kCLAMP`, and `kREFLECT`
- Added new `IUnaryLayer` operators: `kSIGN` and `kROUND`
- Added a new runtime class, `IEngineInspector`, that can be used to inspect detailed information about an engine, including the layer parameters, the chosen tactics, the precision used, etc. (see the profiling sketch after this list).
- `ProfilingVerbosity` enums have been updated to show their functionality more explicitly.
- Updated TensorRT OSS container defaults to CUDA 11.4.
- Updated CMake to target C++14 builds.
- Updated the following ONNX operators:
- `Gather` and `GatherElements` implementations to natively support negative indices
- `Pad` layer to support ND padding, along with `edge` and `reflect` padding mode support
- `If` layer with general performance improvements.
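
The profiling sketch referenced above, in hedged form: per-object loggers, the enqueue-emits-profile controls, and `IEngineInspector`. It assumes an engine serialized from a build with `ProfilingVerbosity::kDETAILED`; buffer setup and the CUDA-graph capture itself are elided, and `inspectAndProfile` is an illustrative name:

```cpp
// Sketch of the TensorRT 8.2 profiling additions.
#include <NvInfer.h>
#include <iostream>
#include <memory>
#include <vector>

class StderrLogger : public nvinfer1::ILogger
{
    void log(Severity severity, const char* msg) noexcept override
    {
        if (severity <= Severity::kWARNING)
            std::cerr << msg << std::endl;
    }
};

struct PrintProfiler : public nvinfer1::IProfiler
{
    void reportLayerTime(const char* layerName, float ms) noexcept override
    {
        std::cout << layerName << ": " << ms << " ms" << std::endl;
    }
};

void inspectAndProfile(std::vector<char> const& engineData)
{
    // 8.2 removes the global logger: each Runtime/Builder/Refitter owns its own.
    StderrLogger logger;
    auto runtime = std::unique_ptr<nvinfer1::IRuntime>(nvinfer1::createInferRuntime(logger));
    auto engine = std::unique_ptr<nvinfer1::ICudaEngine>(
        runtime->deserializeCudaEngine(engineData.data(), engineData.size()));

    // IEngineInspector: layer parameters, chosen tactics, precisions, etc.
    // Full detail requires the engine to have been built with
    // config->setProfilingVerbosity(ProfilingVerbosity::kDETAILED).
    auto inspector = std::unique_ptr<nvinfer1::IEngineInspector>(engine->createEngineInspector());
    std::cout << inspector->getEngineInformation(nvinfer1::LayerInformationFormat::kJSON) << std::endl;

    auto context = std::unique_ptr<nvinfer1::IExecutionContext>(engine->createExecutionContext());
    PrintProfiler profiler;
    context->setProfiler(&profiler);

    // When inference is captured into a CUDA graph, per-layer timings cannot be
    // emitted during enqueue; disable emission and report explicitly after replay.
    context->setEnqueueEmitsProfile(false);
    // ... capture context->enqueueV2(...) into a CUDA graph and replay it ...
    if (!context->getEnqueueEmitsProfile())
    {
        context->reportToProfiler(); // emits layer timings for the last launch
    }
}
```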

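A companion construction sketch for the new builder-API surface (`IEinsumLayer`, the `kND` gather mode, the `kFILL` slice mode, `kSIGN`, `kROUND`, and if-conditionals); all shapes, tensor names, and the einsum equation are illustrative only:

```cpp
// Sketch of builder-API additions in 8.2; dimensions are placeholders.
#include <NvInfer.h>

void addNewLayers(nvinfer1::INetworkDefinition& network)
{
    using namespace nvinfer1;

    ITensor* a = network.addInput("a", DataType::kFLOAT, Dims3{2, 3, 4});
    ITensor* b = network.addInput("b", DataType::kFLOAT, Dims3{2, 4, 5});

    // IEinsumLayer: a batched matmul written as an einsum equation.
    ITensor* einsumInputs[] = {a, b};
    IEinsumLayer* einsum = network.addEinsum(einsumInputs, 2, "bij,bjk->bik");

    // IGatherLayer in kND mode (GatherND semantics).
    ITensor* indices = network.addInput("indices", DataType::kINT32, Dims2{2, 2});
    IGatherLayer* gather = network.addGatherV2(*a, *indices, GatherMode::kND);

    // ISliceLayer in kFILL mode: out-of-bounds reads return a fill value.
    ISliceLayer* slice = network.addSlice(*a, Dims3{0, 0, 0}, Dims3{2, 3, 6}, Dims3{1, 1, 1});
    slice->setMode(SliceMode::kFILL);

    // IUnaryLayer kSIGN.
    IUnaryLayer* sign = network.addUnary(*a, UnaryOperation::kSIGN);

    // IIfConditional: the condition is a 0-D boolean tensor; branch subgraphs
    // are built from the conditional's input layers.
    ITensor* cond = network.addInput("cond", DataType::kBOOL, Dims{0, {}});
    IIfConditional* conditional = network.addIfConditional();
    conditional->setCondition(*cond);
    IIfConditionalInputLayer* branchIn = conditional->addInput(*einsum->getOutput(0));
    ITensor* thenOut = network.addUnary(*branchIn->getOutput(0), UnaryOperation::kROUND)->getOutput(0);
    ITensor* elseOut = branchIn->getOutput(0);
    IIfConditionalOutputLayer* branchOut = conditional->addOutput(*thenOut, *elseOut);

    network.markOutput(*gather->getOutput(0));
    network.markOutput(*slice->getOutput(0));
    network.markOutput(*sign->getOutput(0));
    network.markOutput(*branchOut->getOutput(0));
}
```
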
### Removed
- Removed `sampleMLP`.
- Several trtexec flags have been deprecated:
- `--explicitBatch` flag has been deprecated and has no effect. When the input model is in UFF or in Caffe prototxt format, the implicit batch dimension mode is used automatically; when the input model is in ONNX format, the explicit batch mode is used automatically.
- `--explicitPrecision` flag has been deprecated and has no effect. When the input ONNX model contains Quantization/Dequantization nodes, TensorRT automatically uses explicit precision mode.
- `--nvtxMode=[verbose|default|none]` has been deprecated in favor of `--profilingVerbosity=[detailed|layer_names_only|none]` to show its functionality more explicitly.

## [21.10](https://github.com/NVIDIA/TensorRT/releases/tag/21.10) - 2021-10-05
### Added
- Benchmark script for demoBERT-Megatron
@@ -33,7 +72,6 @@
- Mark BOOL tiles as unsupported
- Remove unnecessary shape tensor checks


### Removed
- N/A

4 changes: 3 additions & 1 deletion CMakeLists.txt
@@ -58,9 +58,11 @@ option(BUILD_PLUGINS "Build TensorRT plugin" ON)
option(BUILD_PARSERS "Build TensorRT parsers" ON)
option(BUILD_SAMPLES "Build TensorRT samples" ON)

-set(CMAKE_CXX_STANDARD 11)
+# C++14
+set(CMAKE_CXX_STANDARD 14)
set(CMAKE_CXX_STANDARD_REQUIRED ON)
set(CMAKE_CXX_EXTENSIONS OFF)

set(CMAKE_CXX_FLAGS "-Wno-deprecated-declarations ${CMAKE_CXX_FLAGS} -DBUILD_SYSTEM=cmake_oss")

############################################################################################
47 changes: 23 additions & 24 deletions README.md
@@ -15,12 +15,12 @@ This repository contains the Open Source Software (OSS) components of NVIDIA Ten
To build the TensorRT-OSS components, you will first need the following software packages.

**TensorRT GA build**
-* [TensorRT](https://developer.nvidia.com/nvidia-tensorrt-download) v8.0.3.4
+* [TensorRT](https://developer.nvidia.com/nvidia-tensorrt-download) v8.2.0.6

**System Packages**
* [CUDA](https://developer.nvidia.com/cuda-toolkit)
* Recommended versions:
-* cuda-11.3.1 + cuDNN-8.2
+* cuda-11.4.x + cuDNN-8.2
* cuda-10.2 + cuDNN-8.2
* [GNU make](https://ftp.gnu.org/gnu/make/) >= v4.1
* [cmake](https://github.com/Kitware/CMake/releases) >= v3.13
@@ -34,16 +34,16 @@ To build the TensorRT-OSS components, you will first need the following software
* [Docker](https://docs.docker.com/install/) >= 19.03
* [NVIDIA Container Toolkit](https://github.com/NVIDIA/nvidia-docker)
* Toolchains and SDKs
-* (Cross compilation for Jetson platform) [NVIDIA JetPack](https://developer.nvidia.com/embedded/jetpack) >= 4.6 (July 2021)
+* (Cross compilation for Jetson platform) [NVIDIA JetPack](https://developer.nvidia.com/embedded/jetpack) >= 4.6 (current support only for TensorRT 8.0.1)
* (For Windows builds) [Visual Studio](https://visualstudio.microsoft.com/vs/older-downloads/) 2017 Community or Enterprise edition
* (Cross compilation for QNX platform) [QNX Toolchain](https://blackberry.qnx.com/en)
* PyPI packages (for demo applications/tests)
-* [onnx](https://pypi.org/project/onnx/) 1.8.0
+* [onnx](https://pypi.org/project/onnx/) 1.9.0
* [onnxruntime](https://pypi.org/project/onnxruntime/) 1.8.0
-* [tensorflow-gpu](https://pypi.org/project/tensorflow/) >= 2.4.1
-* [Pillow](https://pypi.org/project/Pillow/) >= 8.1.2
-* [pycuda](https://pypi.org/project/pycuda/) < 2020.1
-* [numpy](https://pypi.org/project/numpy/) 1.21.0
+* [tensorflow-gpu](https://pypi.org/project/tensorflow/) >= 2.5.1
+* [Pillow](https://pypi.org/project/Pillow/) >= 8.3.2
+* [pycuda](https://pypi.org/project/pycuda/) < 2021.1
+* [numpy](https://pypi.org/project/numpy/)
* [pytest](https://pypi.org/project/pytest/)
* Code formatting tools (for contributors)
* [Clang-format](https://clang.llvm.org/docs/ClangFormat.html)
@@ -66,27 +66,27 @@ To build the TensorRT-OSS components, you will first need the following software

Otherwise, download and extract the TensorRT GA build from [NVIDIA Developer Zone](https://developer.nvidia.com/nvidia-tensorrt-download).

-**Example: Ubuntu 18.04 on x86-64 with cuda-11.3**
+**Example: Ubuntu 18.04 on x86-64 with cuda-11.4**

```bash
cd ~/Downloads
-tar -xvzf TensorRT-8.0.3.4.Ubuntu-18.04.x86_64-gnu.cuda-11.3.cudnn8.2.tar.gz
-export TRT_LIBPATH=`pwd`/TensorRT-8.0.3.4
+tar -xvzf TensorRT-8.2.0.6.Linux.x86_64-gnu.cuda-11.4.cudnn8.2.tar.gz
+export TRT_LIBPATH=`pwd`/TensorRT-8.2.0.6
```

-**Example: Windows on x86-64 with cuda-11.3**
+**Example: Windows on x86-64 with cuda-11.4**

```powershell
cd ~\Downloads
-Expand-Archive .\TensorRT-8.0.3.4.Windows10.x86_64.cuda-11.3.cudnn8.2.zip
-$Env:TRT_LIBPATH = '$(Get-Location)\TensorRT-8.0.3.4'
+Expand-Archive .\TensorRT-8.2.0.6.Windows10.x86_64.cuda-11.4.cudnn8.2.zip
+$Env:TRT_LIBPATH = '$(Get-Location)\TensorRT-8.2.0.6'
$Env:PATH += 'C:\Program Files (x86)\Microsoft Visual Studio\2017\Professional\MSBuild\15.0\Bin\'
```


3. #### (Optional - for Jetson builds only) Download the JetPack SDK
1. Download and launch the JetPack SDK manager. Log in with your NVIDIA developer account.
-2. Select the platform and target OS (example: Jetson AGX Xavier, `Linux Jetpack 4.4`), and click Continue.
+2. Select the platform and target OS (example: Jetson AGX Xavier, `Linux Jetpack 4.6`), and click Continue.
3. Under `Download & Install Options` change the download folder and select `Download now, Install later`. Agree to the license terms and click Continue.
4. Move the extracted files into the `<TensorRT-OSS>/docker/jetpack_files` folder.
@@ -98,13 +98,13 @@ For Linux platforms, we recommend that you generate a docker container for build
1. #### Generate the TensorRT-OSS build container.
The TensorRT-OSS build container can be generated using the supplied Dockerfiles and build script. The build container is configured for building TensorRT OSS out-of-the-box.
-**Example: Ubuntu 18.04 on x86-64 with cuda-11.3**
+**Example: Ubuntu 18.04 on x86-64 with cuda-11.4.2 (default)**
```bash
-./docker/build.sh --file docker/ubuntu-18.04.Dockerfile --tag tensorrt-ubuntu18.04-cuda11.3 --cuda 11.3.1
+./docker/build.sh --file docker/ubuntu-18.04.Dockerfile --tag tensorrt-ubuntu18.04-cuda11.4
```
-**Example: CentOS/RedHat 8 on x86-64 with cuda-10.2**
+**Example: CentOS/RedHat 7 on x86-64 with cuda-10.2**
```bash
-./docker/build.sh --file docker/centos-8.Dockerfile --tag tensorrt-centos8-cuda10.2 --cuda 10.2
+./docker/build.sh --file docker/centos-7.Dockerfile --tag tensorrt-centos7-cuda10.2 --cuda 10.2
```
**Example: Ubuntu 18.04 cross-compile for Jetson (aarch64) with cuda-10.2 (JetPack SDK)**
```bash
@@ -114,7 +114,7 @@
2. #### Launch the TensorRT-OSS build container.
**Example: Ubuntu 18.04 build container**
```bash
-./docker/launch.sh --tag tensorrt-ubuntu18.04-cuda11.3 --gpus all
+./docker/launch.sh --tag tensorrt-ubuntu18.04-cuda11.4 --gpus all
```
> NOTE:
1. Use the `--tag` corresponding to the build container generated in Step 1.
@@ -125,7 +125,7 @@ For Linux platforms, we recommend that you generate a docker container for build
## Building TensorRT-OSS
* Generate Makefiles or VS project (Windows) and build.
-**Example: Linux (x86-64) build with default cuda-11.3**
+**Example: Linux (x86-64) build with default cuda-11.4.2**
```bash
cd $TRT_OSSPATH
mkdir -p build && cd build
@@ -156,21 +156,20 @@ For Linux platforms, we recommend that you generate a docker container for build
msbuild ALL_BUILD.vcxproj
```
> NOTE:
-1. The default CUDA version used by CMake is 11.3.1. To override this, for example to 10.2, append `-DCUDA_VERSION=10.2` to the cmake command.
+1. The default CUDA version used by CMake is 11.4.2. To override this, for example to 10.2, append `-DCUDA_VERSION=10.2` to the cmake command.
2. If samples fail to link on CentOS7, create this symbolic link: `ln -s $TRT_OUT_DIR/libnvinfer_plugin.so $TRT_OUT_DIR/libnvinfer_plugin.so.8`
* Required CMake build arguments are:
- `TRT_LIB_DIR`: Path to the TensorRT installation directory containing libraries.
- `TRT_OUT_DIR`: Output directory where generated build artifacts will be copied.
* Optional CMake build arguments:
- `CMAKE_BUILD_TYPE`: Specify whether the generated binaries are release or debug builds (with debug symbols). Values: [`Release`] | `Debug`
-- `CUDA_VERSION`: The version of CUDA to target, for example [`11.3.1`].
+- `CUDA_VERSION`: The version of CUDA to target, for example [`11.4.2`].
- `CUDNN_VERSION`: The version of cuDNN to target, for example [`8.2`].
- `PROTOBUF_VERSION`: The version of Protobuf to use, for example [`3.0.0`]. Note: changing this will not configure CMake to use a system version of Protobuf; it will configure CMake to download and try building that version.
- `CMAKE_TOOLCHAIN_FILE`: The path to a toolchain file for cross compilation.
- `BUILD_PARSERS`: Specify whether the parsers should be built, for example [`ON`] | `OFF`. If turned OFF, CMake will try to find precompiled versions of the parser libraries to use in compiling samples: first in `${TRT_LIB_DIR}`, then on the system. If the build type is Debug, it will prefer debug builds of the libraries over release versions when available.
- `BUILD_PLUGINS`: Specify whether the plugins should be built, for example [`ON`] | `OFF`. If turned OFF, CMake will try to find a precompiled version of the plugin library to use in compiling samples: first in `${TRT_LIB_DIR}`, then on the system. If the build type is Debug, it will prefer debug builds of the libraries over release versions when available.
- `BUILD_SAMPLES`: Specify if the samples should be built, for example [`ON`] | `OFF`.
- `CUB_VERSION`: The version of CUB to use, for example [`1.8.0`].
- `GPU_ARCHS`: GPU (SM) architectures to target. By default we generate CUDA code for all major SMs. Specific SM versions can be specified here as a quoted, space-separated list to reduce compilation time and binary size. A table of compute capabilities of NVIDIA GPUs can be found [here](https://developer.nvidia.com/cuda-gpus). Examples:
- NVIDIA A100: `-DGPU_ARCHS="80"`
- Tesla T4, GeForce RTX 2080: `-DGPU_ARCHS="75"`
2 changes: 1 addition & 1 deletion VERSION
@@ -1 +1 @@
-8.0.3.4
+8.2.0.6
1 change: 0 additions & 1 deletion cmake/modules/set_ifndef.cmake
@@ -13,7 +13,6 @@
# See the License for the specific language governing permissions and
# limitations under the License.
#
-
function (set_ifndef variable value)
if(NOT DEFINED ${variable})
set(${variable} ${value} PARENT_SCOPE)
8 changes: 2 additions & 6 deletions cmake/toolchains/cmake_aarch64-android.toolchain
@@ -20,8 +20,7 @@ set(CMAKE_SYSTEM_PROCESSOR aarch64)
set(CMAKE_C_COMPILER $ENV{AARCH64_ANDROID_CC})
set(CMAKE_CXX_COMPILER $ENV{AARCH64_ANDROID_CC})

-set(CMAKE_C_FLAGS "$ENV{AARCH64_ANDROID_CFLAGS} -pie -fPIE"
-    CACHE STRING "" FORCE)
+set(CMAKE_C_FLAGS "$ENV{AARCH64_ANDROID_CFLAGS} -pie -fPIE" CACHE STRING "" FORCE)
set(CMAKE_CXX_FLAGS "${CMAKE_C_FLAGS}" CACHE STRING "" FORCE)

set(CMAKE_C_COMPILER_TARGET aarch64-none-linux-android)
@@ -37,11 +36,8 @@ set(CMAKE_CUDA_HOST_COMPILER ${CMAKE_CXX_COMPILER} CACHE STRING "" FORCE)
set(CMAKE_CUDA_FLAGS "-I${CUDA_INCLUDE_DIRS} -Xcompiler=\"-fPIC ${CMAKE_CXX_FLAGS}\"" CACHE STRING "" FORCE)
set(CMAKE_CUDA_COMPILER_FORCED TRUE)


set(CUDA_LIBS -L${CUDA_ROOT}/lib64)

-set(ADDITIONAL_PLATFORM_LIB_FLAGS ${CUDA_LIBS} -lcublas -lcudart -lnvToolsExt -lculibos -lcudadevrt -llog)
+set(ADDITIONAL_PLATFORM_LIB_FLAGS ${CUDA_LIBS} -lcudart -lnvToolsExt -lculibos -lcudadevrt -llog)

set(DISABLE_SWIG TRUE)
set(TRT_PLATFORM_ID "aarch64-android")
25 changes: 15 additions & 10 deletions cmake/toolchains/cmake_aarch64.toolchain
@@ -16,22 +16,29 @@

set(CMAKE_SYSTEM_NAME Linux)
set(CMAKE_SYSTEM_PROCESSOR aarch64)

set(TRT_PLATFORM_ID "aarch64")
-set(CUDA_PLATFORM_ID "aarch64-linux")

-set(CMAKE_C_COMPILER /usr/bin/aarch64-linux-gnu-gcc)
-set(CMAKE_CXX_COMPILER /usr/bin/aarch64-linux-gnu-g++)
+if("$ENV{ARMSERVER}" AND "${CUDA_VERSION}" VERSION_GREATER_EQUAL 11.0)
+    set(CUDA_PLATFORM_ID "sbsa-linux")
+else()
+    set(CUDA_PLATFORM_ID "aarch64-linux")
+endif()

+set(CMAKE_C_COMPILER $ENV{AARCH64_CC})
+set(CMAKE_CXX_COMPILER $ENV{AARCH64_CC})

-set(CMAKE_C_FLAGS "" CACHE STRING "" FORCE)
-set(CMAKE_CXX_FLAGS "" CACHE STRING "" FORCE)
+set(CMAKE_C_FLAGS "$ENV{AARCH64_CFLAGS}" CACHE STRING "" FORCE)
+set(CMAKE_CXX_FLAGS "$ENV{AARCH64_CFLAGS}" CACHE STRING "" FORCE)

-set(CMAKE_C_COMPILER_TARGET aarch64)
-set(CMAKE_CXX_COMPILER_TARGET aarch64)
+set(CMAKE_C_COMPILER_TARGET aarch64-linux-gnu)
+set(CMAKE_CXX_COMPILER_TARGET aarch64-linux-gnu)

set(CMAKE_C_COMPILER_FORCED TRUE)
set(CMAKE_CXX_COMPILER_FORCED TRUE)

set(CUDA_ROOT /usr/local/cuda-${CUDA_VERSION}/targets/${CUDA_PLATFORM_ID} CACHE STRING "CUDA ROOT dir")

set(CUDNN_ROOT_DIR /pdk_files/cudnn)
set(BUILD_LIBRARY_ONLY 1)

@@ -46,6 +53,4 @@ set(CMAKE_CUDA_COMPILER_FORCED TRUE)

set(CUDA_LIBS -L${CUDA_ROOT}/lib)

-set(ADDITIONAL_PLATFORM_LIB_FLAGS ${CUDA_LIBS} -lcublas -lcudart -lstdc++ -lm)

set(DISABLE_SWIG TRUE)
+set(ADDITIONAL_PLATFORM_LIB_FLAGS ${CUDA_LIBS} -lcudart -lstdc++ -lm)
5 changes: 3 additions & 2 deletions cmake/toolchains/cmake_ppc64le.toolchain
@@ -19,9 +19,10 @@ set(CMAKE_SYSTEM_PROCESSOR ppc64le)

set(CMAKE_C_COMPILER powerpc64le-linux-gnu-gcc)
set(CMAKE_CXX_COMPILER powerpc64le-linux-gnu-g++)
+set(CMAKE_AR /usr/bin/ar CACHE STRING "" FORCE)

-set(CMAKE_C_COMPILER_TARGET ppc64le)
-set(CMAKE_CXX_COMPILER_TARGET ppc64le)
+set(CMAKE_C_COMPILER_TARGET powerpc64le-linux-gnu)
+set(CMAKE_CXX_COMPILER_TARGET powerpc64le-linux-gnu)

set(CMAKE_CUDA_HOST_COMPILER ${CMAKE_CXX_COMPILER} CACHE STRING "" FORCE)
set(CMAKE_CUDA_FLAGS "-I${CUDA_ROOT}/include -Xcompiler=\"-fPIC ${CMAKE_CXX_FLAGS}\"" CACHE STRING "" FORCE)
10 changes: 4 additions & 6 deletions cmake/toolchains/cmake_qnx.toolchain
@@ -14,7 +14,7 @@
# limitations under the License.
#

-set(CMAKE_SYSTEM_NAME qnx)
+set(CMAKE_SYSTEM_NAME QNX)
set(CMAKE_SYSTEM_PROCESSOR aarch64)

if(DEFINED ENV{QNX_BASE})
@@ -39,8 +39,8 @@ message(STATUS "QNX_TARGET = ${QNX_TARGET}")
set(CMAKE_C_COMPILER ${QNX_HOST}/usr/bin/aarch64-unknown-nto-qnx7.0.0-gcc)
set(CMAKE_CXX_COMPILER ${QNX_HOST}/usr/bin/aarch64-unknown-nto-qnx7.0.0-g++)

-set(CMAKE_C_COMPILER_TARGET aarch64)
-set(CMAKE_CXX_COMPILER_TARGET aarch64)
+set(CMAKE_C_COMPILER_TARGET aarch64-unknown-nto-qnx)
+set(CMAKE_CXX_COMPILER_TARGET aarch64-unknown-nto-qnx)

set(CMAKE_C_COMPILER_FORCED TRUE)
set(CMAKE_CXX_COMPILER_FORCED TRUE)
@@ -54,8 +54,6 @@ set(CMAKE_CUDA_COMPILER_FORCED TRUE)

set(CUDA_LIBS -L${CUDA_ROOT}/lib)

-set(ADDITIONAL_PLATFORM_LIB_FLAGS ${CUDA_LIBS} -lcublas -lcudart)
-#...Disable swig
set(DISABLE_SWIG TRUE)
+set(ADDITIONAL_PLATFORM_LIB_FLAGS ${CUDA_LIBS} -lcudart)

set(TRT_PLATFORM_ID "aarch64-qnx")
3 changes: 1 addition & 2 deletions cmake/toolchains/cmake_x64_win.toolchain
@@ -36,13 +36,12 @@ set(W10_LIBRARY_SUFFIXES .lib .dll)
set(W10_CUDA_ROOT ${CUDA_TOOLKIT_ROOT_DIR})
set(W10_LINKER ${MSVC_COMPILER_DIR}/bin/amd64/link)


set(CMAKE_CUDA_HOST_COMPILER ${CMAKE_NVCC_COMPILER} CACHE STRING "" FORCE)

set(ADDITIONAL_PLATFORM_INCL_FLAGS "-I${MSVC_COMPILER_DIR}/include -I${MSVC_COMPILER_DIR}/../ucrt/include")
set(ADDITIONAL_PLATFORM_LIB_FLAGS ${ADDITIONAL_PLATFORM_LIB_FLAGS} "-LIBPATH:${NV_TOOLS}/ddk/wddmv2/official/17134/Lib/10.0.17134.0/um/x64")
set(ADDITIONAL_PLATFORM_LIB_FLAGS ${ADDITIONAL_PLATFORM_LIB_FLAGS} "-LIBPATH:${MSVC_COMPILER_DIR}/lib/amd64" )
set(ADDITIONAL_PLATFORM_LIB_FLAGS ${ADDITIONAL_PLATFORM_LIB_FLAGS} "-LIBPATH:${MSVC_COMPILER_DIR}/../ucrt/lib/x64")
set(ADDITIONAL_PLATFORM_LIB_FLAGS ${ADDITIONAL_PLATFORM_LIB_FLAGS} "-LIBPATH:${W10_CUDA_ROOT}/lib/x64 cudart.lib cublas.lib")
set(ADDITIONAL_PLATFORM_LIB_FLAGS ${ADDITIONAL_PLATFORM_LIB_FLAGS} "-LIBPATH:${W10_CUDA_ROOT}/lib/x64 cudart.lib")

set(TRT_PLATFORM_ID "win10")
@@ -13,10 +13,17 @@
# See the License for the specific language governing permissions and
# limitations under the License.
#
-set(SAMPLE_SOURCES
-    sampleMLP.cpp
-)

-set(SAMPLE_PARSERS "caffe")
+set(CMAKE_SYSTEM_NAME Linux)
+set(CMAKE_SYSTEM_PROCESSOR x86_64)

-include(../CMakeSamplesTemplate.txt)
+set(CMAKE_C_COMPILER /opt/rh/devtoolset-8/root/usr/bin/gcc)
+set(CMAKE_CXX_COMPILER /opt/rh/devtoolset-8/root/usr/bin/g++)

+if(DEFINED CUDA_ROOT)
+    set(CUDA_TOOLKIT_ROOT_DIR ${CUDA_ROOT})
+endif()

+set(CUDA_INCLUDE_DIRS ${CUDA_ROOT}/include)

+set(TRT_PLATFORM_ID "x86_64")
2 changes: 2 additions & 0 deletions demo/HuggingFace/.gitignore
@@ -0,0 +1,2 @@
*.pyc
__pycache__/
Empty file added demo/HuggingFace/GPT2/.gitkeep