CASK is a tool for exploring customised architectures for sparse algebra. It
focuses on micro-architectural optimisations for iterative solvers and the
sparse matrix-vector multiplication (SpMV) kernel.
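For readers unfamiliar with the kernel, the following is a minimal software sketch of sparse matrix-vector multiplication over the common CSR (compressed sparse row) storage format. It is only an illustration of the computation; the function name `spmv_csr` and the example matrix are assumptions made here, and the sketch says nothing about how CASK's architectures store or process matrices.

```python
def spmv_csr(values, col_idx, row_ptr, x):
    """Compute y = A @ x for a matrix A stored in CSR format.

    values  -- non-zero entries, row by row
    col_idx -- column index of each entry in `values`
    row_ptr -- row_ptr[i]:row_ptr[i + 1] is the slice of `values` for row i
    """
    y = [0.0] * (len(row_ptr) - 1)
    for row in range(len(y)):
        for k in range(row_ptr[row], row_ptr[row + 1]):
            y[row] += values[k] * x[col_idx[k]]
    return y


# Illustrative 3x3 matrix: [[4, 0, 1], [0, 2, 0], [3, 0, 5]]
values = [4.0, 1.0, 2.0, 3.0, 5.0]
col_idx = [0, 2, 1, 0, 2]
row_ptr = [0, 2, 3, 5]

y = spmv_csr(values, col_idx, row_ptr, [1.0, 2.0, 3.0])
print(y)
```

The irregular, data-dependent access to `x` through `col_idx` is what makes the kernel a natural target for customised architectures.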
CASK is released under the MIT license - use it as you please, but we assume
no responsibility. Also, if you find the work interesting and this project
useful, we kindly ask that you cite the following:
Paul Grigoras, Pavel Burovskiy, Wayne Luk, CASK: Open-Source Custom Architectures for Sparse Kernels. FPGA 2016.
@inproceedings{grigoras2015cask,
author = {Grigoras, Paul and Burovskiy, Pavel and Luk, Wayne},
title = {{CASK}: Open-Source Custom Architectures for Sparse Kernels},
booktitle = {FPGA},
year = {2016},
pages = {179--184},
}
The main steps in CASK are:
- Design Space Exploration (DSE) - given a benchmark and a set of architectures find the optimal instantiations for each architecture
- Build - generate an implementation library from the architectures which can be used in an application
- Benchmark - measure performance of the architectures
- Collect and post-process results - collects results of various steps in the process and displays them in a meaningful, user-friendly way
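The DSE step above can be pictured as an exhaustive search over a parameter grid. The sketch below is an assumption-laden illustration, not CASK's actual DSE program: the parameter names `num_pipes` and `input_width` and the cost model `toy_time` are hypothetical and chosen only to make the example runnable.

```python
from itertools import product


def explore(param_ranges, estimate_time):
    """Return (best_params, best_time) over the Cartesian product of ranges."""
    names = sorted(param_ranges)
    best = None
    for combo in product(*(param_ranges[n] for n in names)):
        params = dict(zip(names, combo))
        t = estimate_time(params)
        if best is None or t < best[1]:
            best = (params, t)
    return best


# Hypothetical design parameters (not taken from CASK).
ranges = {"num_pipes": [1, 2, 4], "input_width": [8, 16]}


def toy_time(p):
    # Toy cost model: more parallelism cuts compute time but adds overhead.
    return 1000.0 / (p["num_pipes"] * p["input_width"]) + 5.0 * p["num_pipes"]


best_params, best_time = explore(ranges, toy_time)
print(best_params, best_time)
```

In a real flow the cost model would depend on the benchmark matrix and on the resource limits of the target device, which is why the exploration is run per benchmark and per architecture.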
CASK supports three main flows, available from the top-level makefile:
- mock-flow - generates CPU-based stubs for the architectures; it has minimal external dependencies and can run on a local machine without any FPGA vendor tools; useful for developing the infrastructure, as it skips the most expensive steps (building simulation and hardware implementations)
- sim-flow - builds simulation versions of the hardware implementations; these are useful for checking correctness, as they pick up most functional issues
- hw-flow - builds actual hardware (FPGA) implementations of the designs
Each of these flows takes as input:
- a sparse matrix benchmark
- a range of design parameters to explore
It produces as output:
- a shared library with the implementation
- benchmarking results
All flows run on the local machine and require:
- g++ 4.9.2 or higher (for C++11 support)
- Boost 1.55 or higher
To support sim-flow, the local machine must also have:
- Maxeler MaxCompiler 2013.2.2 or higher
To support hw-flow, the local machine must also have:
- An installed FPGA card (Maxeler Dataflow Engine such as Vectis or Maia)
- MaxelerOS version matching the MaxCompiler version used for the hardware build
Install the Python dependencies:
virtualenv venv/ && source venv/bin/activate
pip install -r requirements.txt
Clone this repository then run:
git submodule update --init
cd cask && mkdir build && cd build && cmake ..
make -C main
First, compile the DSE program:
mkdir build && cd build && cmake .. && make main
You can then use the DSE flow through scripts/spark.py.
bash-4.1$ python spark.py -h
usage: spark.py [-h] [-d] -t {dfe,sim} -p PARAM_FILE -b BENCHMARK_DIR
[-m MAX_BUILDS]
Run Spark DSE flow
optional arguments:
-h, --help show this help message and exit
-d, --dse
-t {dfe,sim}, --target {dfe,sim}
-p PARAM_FILE, --param-file PARAM_FILE
-b BENCHMARK_DIR, --benchmark-dir BENCHMARK_DIR
-m MAX_BUILDS, --max-builds MAX_BUILDS
spark.py automates:
- DSE process based on a set of benchmark matrices
- generating simulation and hardware configurations
- running and benchmarking simulation and hardware configurations
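When scripting several runs, it can help to assemble the spark.py command line programmatically. The sketch below uses only the flags shown in the help output above; the helper name `spark_command` and the file names `params.json` and `benchmarks` are hypothetical, and the format of the parameter file is not specified here.

```python
def spark_command(target, param_file, benchmark_dir, dse=False, max_builds=None):
    """Build the argument list for spark.py from its documented flags."""
    if target not in ("dfe", "sim"):
        raise ValueError("target must be 'dfe' or 'sim'")
    cmd = ["python", "spark.py", "-t", target, "-p", param_file, "-b", benchmark_dir]
    if dse:
        cmd.append("-d")  # enable the DSE step
    if max_builds is not None:
        cmd += ["-m", str(max_builds)]  # cap the number of builds
    return cmd


# Hypothetical file names, for illustration only.
cmd = spark_command("sim", "params.json", "benchmarks", dse=True, max_builds=2)
print(" ".join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd, check=True)` from the scripts directory.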
Once a simulation or hardware library has been generated through the DSE flow, it will be available in lib-generated.
From here you can also use CMake directly to rebuild the client
targets, for example test_spmv_sim. Just build the corresponding Makefile target:
make -C build test_spmv_sim
Tests are provided for software, hardware simulation and hardware runs. These should always pass on the master branch. Beware, though, that it is not practical to test every design parameter configuration, so issues may still arise in untested configurations.
First, compile the test binaries, for example:
make -C build test_spmv_sim
Then you can:
- Run unit tests with ctest -R unit
- Run hardware simulation tests with ctest -R sim
- Run hardware tests with ctest -R hw
Note: some simulation tests may take a long time to run (~60s), particularly if a large architecture is simulated.