Commit 4614a05: Update readme with benchmark results
mikepk committed Mar 26, 2018 (1 parent 55b66cf)
Showing 7 changed files with 40 additions and 17 deletions (README.md: 40 additions, 17 deletions).
Note: this repository has been archived by the owner on Nov 15, 2024 and is now read-only.

This has the net effect of greatly improving the throughput of reading and writing individual datums, since the schema isn't interrogated for every datum. This can be especially beneficial for "compatible" schema reading where both a read and write schema are needed to be able to read a complete data set.
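To make the per-datum cost concrete, here is a minimal sketch (not spavro's actual code) of the work an Avro encoder does for a single `long` datum: zig-zag encode the value, then emit it as a base-128 varint, per the Avro binary encoding spec.

```python
def encode_long(n: int) -> bytes:
    """Encode one Avro long: zig-zag, then base-128 varint (sketch only)."""
    n = (n << 1) ^ (n >> 63)           # zig-zag: small magnitudes -> small codes
    out = bytearray()
    while (n & ~0x7F) != 0:
        out.append((n & 0x7F) | 0x80)  # low 7 bits with continuation bit set
        n >>= 7
    out.append(n)
    return bytes(out)
```

An optimizing implementation does this inner loop in C and, crucially, avoids re-walking the schema for every such datum.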

## Performance / Benchmarks

### Methodology

The benchmarks were run with the `benchmark.py` script in the repository's `/benchmark` directory, so you can run your own tests if you'd like.

Many of the records that led to the creation of spavro were of the form `{"type": "record", "name": "somerecord", "fields": [1 ... n fields, usually typed as a union of 'null' and a primitive type]}`, so the benchmarks were designed to simulate that record structure. I believe this is a _very_ common use case for Avro, so the benchmarks were built around this pattern.
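For illustration only (field names and types here are made up, not the benchmark's actual output), a schema of that shape looks like this:

```python
# A record whose fields are each a union of "null" and a primitive type.
schema = {
    "type": "record",
    "name": "somerecord",
    "fields": [
        {"name": "field_{}".format(i), "type": ["null", t]}
        for i, t in enumerate(["string", "double", "long", "boolean"])
    ],
}
```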

The benchmark creates a random record schema with a mix of string, double, long, and boolean types, plus a random record generator to exercise that schema. The pseudo-random generator is seeded with a fixed string to make the results deterministic (while still producing varied records). The number of fields in the record was varied from one to 500, and the performance of each Avro implementation was tested for each case.
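A sketch of that approach (this is not the actual `benchmark.py` code; the seed string and helpers are hypothetical):

```python
import random

random.seed("spavro-benchmark")  # fixed seed -> deterministic but varied records

PRIMITIVES = ["string", "double", "long", "boolean"]

def make_schema(n_fields):
    """Build a record schema with n_fields randomly typed fields."""
    return {
        "type": "record",
        "name": "benchrec",
        "fields": [{"name": "f{}".format(i), "type": random.choice(PRIMITIVES)}
                   for i in range(n_fields)],
    }

def make_record(schema):
    """Generate one random record conforming to the schema."""
    gen = {
        "string": lambda: "x" * random.randint(1, 20),
        "double": lambda: random.random(),
        "long": lambda: random.randint(0, 2**31),
        "boolean": lambda: random.random() < 0.5,
    }
    return {f["name"]: gen[f["type"]]() for f in schema["fields"]}
```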

The serializer and deserializer benchmarks create an array of simulated records in memory and then process them with each of the three implementations as quickly as possible. This means the maximum working size is limited by memory (a combination of the number of records and the number of fields per simulated record). For these benchmarks, 5M datums were processed per run (so the record count was 5M divided by the number of fields in each record).

Each run of the schema/record/implementation was repeated ten times and the time to complete was averaged.
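A minimal timing harness in the spirit described above (again a sketch, not the actual benchmark script): build all records in memory first, then time how fast a serializer callable can process them, averaged over ten repeats.

```python
import time

def bench(serialize, records, repeats=10):
    """Return average records/second for `serialize` over `records`."""
    times = []
    for _ in range(repeats):
        start = time.perf_counter()
        for rec in records:
            serialize(rec)           # only serialization is inside the timer
        times.append(time.perf_counter() - start)
    avg = sum(times) / len(times)    # average wall-clock time per pass
    return len(records) / avg

# Example with a stand-in serializer (a real run would pass an Avro writer):
rate = bench(lambda rec: str(rec).encode(), [{"f0": i} for i in range(1000)])
```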


### Results

These tests were run on an AWS `m4.large` instance running CentOS 7, with the following versions: `avro-python3==1.8.2`, `fastavro==0.17.9`, `spavro==1.1.10`. Python `3.6.4` was used for the Python 3 tests.

The TL;DR is that spavro has *14-23x* the throughput of the default Apache Avro implementation and *2-4x* the throughput of the fastavro library (depending on the shape of the records).

### Deserialize avro records (read)


Records per second read:

![Read 1 field recs per sec](/benchmark/results/read_1field_rec_per_sec.png?raw=true "Read 1 field recs per sec")
![Read 500 fields recs per sec](/benchmark/results/read_500field_rec_per_sec.png?raw=true "Read 500 fields recs per sec")

Datums per second (individual fields):

![Read datums/fields per second](/benchmark/results/read_datum_per_sec.png?raw=true "Read datums/fields per second")

### Serialize avro records (write)


Records per second written:

![Write 1 field recs per sec](/benchmark/results/write_1field_rec_per_sec.png?raw=true "Write 1 field recs per sec")
![Write 500 fields recs per sec](/benchmark/results/write_500field_rec_per_sec.png?raw=true "Write 500 fields recs per sec")

Datums per second (individual fields):

![Write datums/fields per second](/benchmark/results/write_datum_per_sec.png?raw=true "Write datums/fields per second")



## API

Spavro keeps the default Apache library's API. This allows spavro to be a drop-in replacement for code using the existing Apache implementation.
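A sketch of what that drop-in property means in practice: code written against the classic Apache avro Python API needs only its imports changed. The module layout below follows that API and is not verified here; the guards just keep the example runnable whether or not either library is installed.

```python
# Swap `avro` for `spavro` at import time; downstream code (DatumWriter,
# BinaryEncoder, schema parsing, ...) is unchanged either way.
try:
    from spavro import schema as avro_schema, io as avro_io
except ImportError:
    try:  # fall back to the reference implementation
        from avro import schema as avro_schema, io as avro_io
    except ImportError:
        avro_schema = avro_io = None  # neither library installed
```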

## Tests

Binary file added benchmark/results/read_1field_rec_per_sec.png
Binary file added benchmark/results/read_500field_rec_per_sec.png
Binary file added benchmark/results/read_datum_per_sec.png
Binary file added benchmark/results/write_1field_rec_per_sec.png
Binary file added benchmark/results/write_500field_rec_per_sec.png
Binary file added benchmark/results/write_datum_per_sec.png
