Commit 4614a05: Update readme with benchmark results
mikepk committed Mar 26, 2018 (1 parent 55b66cf)
Showing 7 changed files with 40 additions and 17 deletions (README.md: 40 additions, 17 deletions).
Note: this repository has been archived by the owner on Nov 15, 2024 and is now read-only.

This has the net effect of greatly improving the throughput of reading and writing individual datums, since the schema isn't interrogated for every datum. This can be especially beneficial for "compatible" schema reading where both a read and write schema are needed to be able to read a complete data set.
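To make the per-datum cost concrete, here is a minimal sketch (not spavro's actual code) of the work an Avro encoder does for a single `long` datum: zig-zag encode the value, then emit it as a base-128 varint, per the Avro binary encoding spec.

```python
def encode_long(n: int) -> bytes:
    """Encode one Avro long: zig-zag, then base-128 varint (sketch only)."""
    n = (n << 1) ^ (n >> 63)           # zig-zag: small magnitudes -> small codes
    out = bytearray()
    while (n & ~0x7F) != 0:
        out.append((n & 0x7F) | 0x80)  # low 7 bits with continuation bit set
        n >>= 7
    out.append(n)
    return bytes(out)
```

An optimizing implementation does this inner loop in C and, crucially, avoids re-walking the schema for every such datum.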

## Performance / Benchmarks

### Methodology

The benchmarks were run with the `benchmark.py` script in the repository's `/benchmark` directory, so you can run your own tests if you'd like.

Many of the records that led to the creation of spavro were of the form `{"type": "record", "name": "somerecord", "fields": [1 ... n fields, usually typed as a union of 'null' and a primitive type]}`, so the benchmarks were designed to simulate that record structure. I believe this is a _very_ common use case for Avro, so the benchmarks were built around this pattern.
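For illustration only (field names and types here are made up, not the benchmark's actual output), a schema of that shape looks like this:

```python
# A record whose fields are each a union of "null" and a primitive type.
schema = {
    "type": "record",
    "name": "somerecord",
    "fields": [
        {"name": "field_{}".format(i), "type": ["null", t]}
        for i, t in enumerate(["string", "double", "long", "boolean"])
    ],
}
```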

The benchmark creates a random record schema with a mix of string, double, long, and boolean types, plus a random record generator to exercise that schema. The pseudo-random generator is seeded with a fixed string to make the results deterministic (while still producing varied records). The number of fields in the record was varied from one to 500, and the performance of each Avro implementation was tested for each case.
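A sketch of that approach (this is not the actual `benchmark.py` code; the seed string and helpers are hypothetical):

```python
import random

random.seed("spavro-benchmark")  # fixed seed -> deterministic but varied records

PRIMITIVES = ["string", "double", "long", "boolean"]

def make_schema(n_fields):
    """Build a record schema with n_fields randomly typed fields."""
    return {
        "type": "record",
        "name": "benchrec",
        "fields": [{"name": "f{}".format(i), "type": random.choice(PRIMITIVES)}
                   for i in range(n_fields)],
    }

def make_record(schema):
    """Generate one random record conforming to the schema."""
    gen = {
        "string": lambda: "x" * random.randint(1, 20),
        "double": lambda: random.random(),
        "long": lambda: random.randint(0, 2**31),
        "boolean": lambda: random.random() < 0.5,
    }
    return {f["name"]: gen[f["type"]]() for f in schema["fields"]}
```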

The serializer and deserializer benchmarks create an array of simulated records in memory and then process them with each of the three implementations as quickly as possible. This means the maximum working size is limited by memory (a combination of the number of records and the number of fields per simulated record). For these benchmarks, 5M datums were processed per run (so the record count was 5M divided by the number of fields in each record).

Each run of the schema/record/implementation was repeated ten times and the time to complete was averaged.
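A minimal timing harness in the spirit described above (again a sketch, not the actual benchmark script): build all records in memory first, then time how fast a serializer callable can process them, averaged over ten repeats.

```python
import time

def bench(serialize, records, repeats=10):
    """Return average records/second for `serialize` over `records`."""
    times = []
    for _ in range(repeats):
        start = time.perf_counter()
        for rec in records:
            serialize(rec)           # only serialization is inside the timer
        times.append(time.perf_counter() - start)
    avg = sum(times) / len(times)    # average wall-clock time per pass
    return len(records) / avg

# Example with a stand-in serializer (a real run would pass an Avro writer):
rate = bench(lambda rec: str(rec).encode(), [{"f0": i} for i in range(1000)])
```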


### Results

These tests were run on an AWS `m4.large` instance running CentOS 7, with the following versions: `avro-python3==1.8.2`, `fastavro==0.17.9`, `spavro==1.1.10`. Python `3.6.4` was used for the Python 3 tests.

The TL;DR is that spavro has *14-23x* the throughput of the default Apache Avro implementation and *2-4x* the throughput of the fastavro library (depending on the shape of the records).

### Deserialize avro records (read)


Records per second read:

![Read 1 field recs per sec](/benchmark/results/read_1field_rec_per_sec.png?raw=true "Read 1 field recs per sec")
![Read 500 fields recs per sec](/benchmark/results/read_500field_rec_per_sec.png?raw=true "Read 500 fields recs per sec")

Datums per second (individual fields):

![Read datums/fields per second](/benchmark/results/read_datum_per_sec.png?raw=true "Read datums/fields per second")

### Serialize avro records (write)


Records per second written:

![Write 1 field recs per sec](/benchmark/results/write_1field_rec_per_sec.png?raw=true "Write 1 field recs per sec")
![Write 500 fields recs per sec](/benchmark/results/write_500field_rec_per_sec.png?raw=true "Write 500 fields recs per sec")

Datums per second (individual fields):

![Write datums/fields per second](/benchmark/results/write_datum_per_sec.png?raw=true "Write datums/fields per second")



## API

Spavro keeps the default Apache library's API. This allows spavro to be a drop-in replacement for code using the existing Apache implementation.
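A sketch of what that drop-in property means in practice: code written against the classic Apache avro Python API needs only its imports changed. The module layout below follows that API and is not verified here; the guards just keep the example runnable whether or not either library is installed.

```python
# Swap `avro` for `spavro` at import time; downstream code (DatumWriter,
# BinaryEncoder, schema parsing, ...) is unchanged either way.
try:
    from spavro import schema as avro_schema, io as avro_io
except ImportError:
    try:  # fall back to the reference implementation
        from avro import schema as avro_schema, io as avro_io
    except ImportError:
        avro_schema = avro_io = None  # neither library installed
```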

## Tests

Binary file added benchmark/results/read_1field_rec_per_sec.png
Binary file added benchmark/results/read_500field_rec_per_sec.png
Binary file added benchmark/results/read_datum_per_sec.png
Binary file added benchmark/results/write_1field_rec_per_sec.png
Binary file added benchmark/results/write_500field_rec_per_sec.png
Binary file added benchmark/results/write_datum_per_sec.png
