This document is meant to give an overview of the design of stenographer and stenotype at a medium/high level. For low-level stuff, look at the code :). The architecture described in this document has changed relatively little over the course of the project, and we doubt it will change much in the future.
Stenographer consists of a `stenographer` server, which serves user requests and manages disk, and which runs a `stenotype` child process. `stenotype` sniffs packet data and writes it to disk, communicating with `stenographer` simply by un-hiding files when they're ready for consumption. The user scripts `stenocurl` and `stenoread` provide simple wrappers around `curl`, which allow analysts to request packet data from the `stenographer` server simply and easily.
Stenographer is actually a few separate processes.
Stenographer is a long-running server, the binary that you start up if you want to "run stenographer" on your system. It manages the `stenotype` binary as a child process, watches disk usage and cleans up old files, and serves data to analysts based on their queries.
First off, stenographer is in charge of making sure that `stenotype` (discussed momentarily) starts and keeps running. It starts stenotype as a subprocess, watching for failures and restarting as necessary. It also watches stenotype's output (the files it creates) and may kill/restart stenotype itself if it decides stenotype is misbehaving or not generating files fast enough.
Stenographer watches the disks that stenotype uses and tries to keep them tidy and usable. This includes deleting old files when disk space decreases below a threshold, and deleting old temporary files that stenotype creates, if stenotype crashes before it can clean up after itself.
Stenographer handles disk management in two ways. First, it runs checks whenever it starts up a new stenotype instance to make sure files from an old, possibly crashed instance are no longer around and causing issues. Secondly, it periodically checks disk state for out-of-disk issues (currently every 15 seconds). During that periodic check, it also looks for new files stenotype may have generated that it can use to serve analyst requests (described momentarily).
Stenographer is also in charge of serving any analyst requests for packet data. It watches the data generated by stenotype, and when analysts request packets it looks up their requests in the generated data and returns them.
Stenographer provides data to analysts over TLS. Queries are POST'd to the /query HTTP handler, and responses are streamed back as PCAP files (MIME type application/octet-stream).
Currently, stenographer only binds to localhost, so it doesn't accept remote user requests.
Access to the server is controlled with client certificates. On install, a script, `stenokeys.sh`, is run to generate a CA certificate and use it to create/sign a client and server certificate. The client and server authenticate each other on every request, using the CA certificate as a source of truth. POSIX permissions are used locally to control access to the certs: the `stenographer` user which runs steno has read access to the server key (`steno:root -r--------`), and the `stenographer` group has read access to the client key (`root:steno ----r-----`). Key usage extensions specify that the server key must be used as a TLS server, and the client key must be used as a TLS client.
Due to the file permissions mentioned above, giving steno access to a local user simply requires adding that user to the local `stenographer` group, thus giving them access to `client_key.pem`.
Once keys are created on install, they're currently NEVER REVOKED. Thus, if someone gets access to a client cert, they'll have access to the server ad infinitum. Should a key be leaked, the current best way to handle this is to delete all data in the `/etc/stenographer/certs` directory and rerun `stenokeys.sh` to generate an entirely new set of keys rooted to a new CA.
`stenokeys.sh` will not modify keys/certs that already exist in `/etc/stenographer/certs`. Thus, if you have more complex topologies, you can overwrite these values and they'll happily be used by Stenographer. If, for example, you already have a CA in your organization, you can copy its cert into the `ca_cert.pem` file, then create `{client,server}_{key,cert}.pem` files rooted in that CA and copy them in. This also allows folks to use a single CA cert over multiple stenographer instances, allowing a single client cert to access multiple servers over the network.
Stenotype's sole purpose is to read packet data off the wire, index it, and write it to disk. It uses a multi-threaded architecture, while trying to limit context switching by having most processing on a single core stay within a single thread.
Stenotype tries to be as performant as possible by allowing the kernel to do the vast majority of the work. It uses AF_PACKET, which asks the kernel to place packets into blocks in a shared memory region, then notify stenotype when blocks are available. After indexing the packets in each block, it passes the block directly back to the kernel as an O_DIRECT asynchronous write operation.
Besides indexing, then, stenotype's main job is to wait for the kernel to put packets in a memory region, then immediately ask the kernel to take that region back and write it. An important benefit of this design is that packets are never copied out of the kernel's shared memory space. The kernel writes them from the NIC to shared memory, then the kernel uses that same shared memory for O_DIRECT writes to disk. The packets transit the bus twice and are never copied from RAM to RAM.
As detailed above, the "file format" used by stenotype is actually a direct dump of the data as it's presented by AF_PACKET. Thus, data is written as blocks, with each block containing a small header followed by a linked list of packets. Blocks are large (1MB) and are dumped regularly (every 10 seconds), so there's a good chance that for slow networks we use far more disk than we need. However, as network speed increases past 1MB/minute/thread, this format becomes quite efficient. There will always be some overhead, however.
Stenotype guarantees that a packet file will not exceed 4GB, by rotating files if they reach that size. It also rotates files older than 1 minute. Files are named for the microsecond timestamp they were created at. While a file is being written, it will be hidden (.1422693160230282). When rotating, the file will be renamed to no longer be hidden (.1422693160230282 -> 1422693160230282). This rename only occurs after all data has been successfully flushed to disk, so external processes which see this rename happen (like stenographer) can immediately start to use the newly renamed file.
Stenotype takes advantage of AF_PACKET's excellent load-balancing options to split up the work of processing packets across many CPUs. It uses AF_PACKET's PACKET_FANOUT to create a separate memory region for each of N different threads, then requests that the kernel split up incoming packets across these regions. One stenotype packet reading/writing thread is created for each of these regions. Within that single thread, block processing (reading in a block, indexing it, starting an async write, reading the next block, etc.) happens serially.
After getting a block of packets from the kernel but before passing them back to be written out, stenotype reads through each packet and creates a small number of indexes in memory. These indexes are very simple, mapping a packet attribute to a file seek offset. Attributes we use include ports (src and dst), protocols (udp/tcp/etc) and IPs (v4 and v6). Indexes are dumped to disk when file rotation happens, with a corresponding index file created for each packet file, of the same name but in a different directory. Given the example above, when the `.1422693160230282 -> 1422693160230282` file rotation happens, an index also named `.1422693160230282` will be created and written, then renamed to `1422693160230282` when the index has been fully flushed to disk. Once both the packets directory and index directory have a `1422693160230282` file, stenographer can read both in and use the index to look up packets.
Indexes are leveldb SSTables, a simple, compressed file format that stores key-value pairs sorted by key and provides simple, efficient mechanisms to query individual keys or key ranges. Among other things, leveldb tables give us great compression capabilities, keeping our indexes small while still providing fast reads.
We store each attribute (port number, protocol number, IP, etc) and its associated packet positions in the blockfile using the format:
```
Key:   [type (1 byte)][value (? bytes)]
Value: [position 0 (4 bytes)][position 1 (4 bytes)] ...
```
The type specifies the kind of attribute being indexed (1 == protocol, 2 == port, 4 == IPv4, 6 == IPv6). The value is 1 byte for protocols, 2 bytes for ports, and 4 or 16 bytes respectively for IPv4 and IPv6 addresses. Each position is a seek offset into a packet file (which is guaranteed not to exceed 4GB) and is always exactly 4 bytes long. All values (ports, protocols, positions) are big-endian. Looking up packets involves reading the key for a specific attribute to get all positions for that value, then seeking into the packet files to find the packets in question and returning them. For example, to find all packets with port 80, you'd read in the positions for the key:
```
[\x02 (type=port) \x00\x50 (value=80)]
```
The main stenotype packet sniffing thread tries to very quickly read in packet blocks, index them, then pass them back to the kernel. It does all disk operations asynchronously, in order to keep its CPU busy with indexing, by far the most time-intensive part of the whole operation. It would be extremely detrimental to performance to have this thread block on each file rotation to convert in-memory indexes to on-disk indexes and write out index files. Because of this, index writing is relegated to a separate thread. For each reading/writing thread, an index-writing thread is created, along with a thread-safe producer-consumer queue to link them up. When the reader/writer wants to rotate a file, it simply passes a pointer to its in-memory index over the queue, then creates a new empty index and starts populating it with packet data for its new file.
The index-writing thread sits in an endless loop, watching the queue for new indexes. When it gets a new index, it creates a leveldb table, iterates through the index to populate that table, and flushes that table to disk. Since index writing takes (in our experience) far less time/energy than packet writing, the index-writing thread does all of its operations serially, blocking while the index is flushed to disk, then moving that index into its usable (non-hidden) location.
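The handoff between these two threads can be sketched in Go, with a channel standing in for the thread-safe producer-consumer queue (stenotype itself is C++, so all names here are illustrative):

```go
package main

import (
	"fmt"
	"sync"
)

// Index maps an encoded index key to the packet-file positions of
// matching packets (a simplified stand-in for stenotype's in-memory index).
type Index map[string][]uint32

func main() {
	queue := make(chan Index) // the producer-consumer queue
	var done sync.WaitGroup
	done.Add(1)

	// Index-writing "thread": loops over the queue, handling each index
	// serially (here we just report it, rather than building a leveldb
	// table, flushing it, and renaming it into its non-hidden location).
	go func() {
		defer done.Done()
		for idx := range queue {
			fmt.Printf("flushed index with %d key(s)\n", len(idx))
		}
	}()

	// Packet reading/writing "thread": at file rotation, hand off the
	// finished in-memory index, then start a fresh one for the next file.
	current := Index{"\x02\x00\x50": {0, 4096}}
	queue <- current
	current = Index{} // new empty index for the next file
	_ = current

	close(queue)
	done.Wait()
}
```

The key property is that the reader/writer never blocks on disk: sending a pointer over the queue is cheap, and the slow leveldb flush happens entirely on the consumer side.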
As detailed above in Stenographer's "Access Control" section, we require TLS handshakes in order to verify that clients are indeed allowed access to packet data. To aid in this, the simple shell script `stenocurl` wraps the `curl` utility, adding the various flags necessary to use the correct client certificate and verify against the correct server certificate. `stenoread` is a simple addition to `stenocurl`, which takes in a query string, passes the query to stenocurl as a POST request, then passes the resulting PCAP file through tcpdump in order to allow for additional filtering, writing to disk, printing in a human-readable format, etc.
An analyst that wants to query stenographer calls the `stenoread` script, passing in a query string (see README.md for the query language format). This string is then POST'd (via stenocurl, using TLS certs/keys) to stenographer.
Stenographer parses the query into a Query object, which allows it to decide:
- which index files it should read
- which keys it should read from each index file
- how it should combine packet file positions it gets from each key
To illustrate, for the query string:

```
(port 1 or ip proto 2) and after 3h ago
```

Stenographer would translate:

- `after 3h ago` -> only read index files with microsecond names greater than (now() - 3h)
- within these files, compute the union (because of the `or`) of position sets from:
  - key `\x02\x00\x01` (port == 1)
  - key `\x01\x02` (protocol == 2)
Once it has computed a set of packet positions for each index file, it then seeks in the corresponding packet files, reads the packets out, and merges them into a single PCAP file which it serves back to the analyst.
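The set combination step is simple; here is a Go sketch of the union used for the `or` in the example above (illustrative, not stenographer's actual code; an `and` would intersect instead):

```go
package main

import (
	"fmt"
	"sort"
)

// union merges two position sets (seek offsets into a packet file),
// deduplicating and returning them in sorted order.
func union(a, b []uint32) []uint32 {
	seen := map[uint32]bool{}
	for _, p := range a {
		seen[p] = true
	}
	for _, p := range b {
		seen[p] = true
	}
	out := make([]uint32, 0, len(seen))
	for p := range seen {
		out = append(out, p)
	}
	sort.Slice(out, func(i, j int) bool { return out[i] < out[j] })
	return out
}

func main() {
	portPositions := []uint32{100, 4096, 9000} // positions under key \x02\x00\x01
	protoPositions := []uint32{4096, 12288}    // positions under key \x01\x02
	fmt.Println(union(portPositions, protoPositions)) // [100 4096 9000 12288]
}
```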
This PCAP file comes back via stenocurl as a stream to STDOUT, where stenoread passes it through tcpdump. With no additional options, tcpdump just prints the packet data out in a nice format. With various options, tcpdump could do further filtering (by TCP flags, etc), write its input to disk (-w out.pcap), or do all the other things tcpdump is so good at.
Stenographer has gRPC support that enables secure, remote interactions with the program. Given the sensitive nature of packet data and the requirement of many users to manage a fleet of servers running Stenographer, the gRPC channel only supports encryption with client authentication, and it expects the administrator to use certificates that are managed separately from those generated by `stenokeys.sh` (for easily generating certificates, take a look at Square's certstrap utility). The protobuf that defines Stenographer's gRPC service can be found in `protobuf/steno.proto`.
gRPC support is optional and can be enabled by adding an `Rpc` dictionary of settings to `steno.conf`. An example configuration is shown below:

```
, "Rpc": { "CaCert": "/path/to/rpc/ca/cert"
         , "ServerKey": "/path/to/rpc/key"
         , "ServerCert": "/path/to/rpc/cert"
         , "ServerPort": 8443
         , "ServerPcapPath": "/path/to/rpc/pcap/directory"
         , "ServerPcapMaxSize": 1000000000
         , "ClientPcapChunkSize": 1000
         , "ClientPcapMaxSize": 5000000
         }
```
This call allows clients to remotely retrieve PCAP via `stenoread`. To retrieve PCAP, clients send the service a unique identifier, the size of the PCAP file chunks to stream in return, the maximum size of the PCAP file to return, and the `stenoread` query used to parse packet data. In response, clients receive a stream of messages containing the unique identifier and PCAP file chunks (which need to be reassembled client-side). Below is a minimalist example (shown in Python) of how a client can request PCAP and save it to local disk:
```python
import os
import uuid

import grpc

import steno_pb2
import steno_pb2_grpc

# server and creds are assumed to be defined elsewhere.
with grpc.secure_channel(server, creds) as channel:
    stub = steno_pb2_grpc.StenographerStub(channel)
    pb = steno_pb2.PcapRequest()
    pb.uid = str(uuid.uuid4())
    pb.chunk_size = 1000
    pb.max_size = 500000
    pb.query = 'after 5m ago and tcp'
    pcap_file = os.path.join('.', '{}.pcap'.format(pb.uid))
    with open(pcap_file, 'wb') as fout:
        for response in stub.RetrievePcap(pb):
            fout.write(response.pcap)
```
`RetrievePcap` requires the gRPC server to be configured with the following fields (in addition to any fields required for the server to start up):

- ServerPcapPath: local path to the directory where `stenoread` PCAP is temporarily stored
- ServerPcapMaxSize: upper limit on how much PCAP a client is allowed to receive (used to restrict clients from receiving excessively large PCAPs)
- ClientPcapChunkSize: size of the PCAP chunks to stream to the client (used if the client has not specified a size in the request)
- ClientPcapMaxSize: upper limit on how much PCAP a client will receive (used if the client has not specified a size in the request)
We're pretty scared of stenotype, because:

- We're processing untrusted data: packets
- We've got very strong permissions: the ability to read packets
- It's written in a memory-unsafe language: C++
- We're not perfect.
Because of this, we've tried to use security best practices to minimize the risk of running these binaries, with the following methods:

- Running as an unprivileged user: `stenographer`
- We `setcap` the stenotype binary to just have the ability to read raw packets.
- If you DON'T want to use `setcap`, we also offer the ability to drop privileges with `setuid`/`setgid` after starting `stenotype`... you can start it as `root`, then drop privs to an untrusted user (that user must still be able to open/write files in the index/packet directories).
- `seccomp` sandboxing: `stenotype` sandboxes itself after opening up sockets for packet reading. This sandbox isn't particularly granular, but it should stop us from doing anything too crazy if the `stenotype` binary is compromised.
- Fuzzing: We've extracted the most concerning bit of code (the indexing code that processes packet data) and fuzzed it as best we can, using the excellent AFL fuzzer. If you'd like to run your own fuzzing, install AFL, then run `make fuzz` in the `stenotype/` subdirectory, and watch your CPUs become forced-air heaters.
- We're considering AppArmor, and may add some configs to use it for locking down stenotype as well.
We're slightly less concerned about `stenographer`, since it doesn't actually process packet information. It also has a smaller attack surface, especially when bound to localhost. Our major attack vector in `stenographer` is queries coming in over TLS. However, TLS certificate handling is all done with the Go standard library (which we trust pretty well ;), so our code only ever touches queries that come from a user in the `stenographer` group. Since we run it as user `stenographer`, if someone in the `stenographer` group does achieve a shell, they'll be able to... read packets. The big concern here is that they'll be able to read more packets than allowed by default (say we've passed a BPF filter to stenotype, for example). Our primary defenses, then, are:

- Running as an unprivileged user: `stenographer`
- Using Go's standard library TLS to reject requests not coming from relatively trusted users
- Using Go, which is much more memory-safe (runtime array bounds checks, etc)
- We're considering AppArmor here, too, and will update this doc if we come up with good configs.
Some of Stenographer's design decisions make it perform poorly in certain environments or give it strange performance characteristics. This section aims to point these out in advance, so folks have a better understanding of some of the idiosyncrasies they may see when deploying Stenographer.
Stenographer is optimized for fast links, and some of those optimizations give it strange behavior on slow links. The first of these is file size. You may notice that on a network link that's REALLY slow, you'll still see 6MB files created every minute. This is because currently, Stenographer will:
- Store packets in 1MB blocks
- Flush one block every 10 seconds
Of course, if your link generates over 1MB every 10 seconds, this doesn't matter to you at all. If it does, though, you're going to waste disk space. We're considering flushing one block a minute or every thirty seconds.
With `stenotype` writing files and `stenographer` reading them, a packet won't show up in a request's response until it's on disk, its index is on disk, and `stenographer` has noticed both of these things occurring. This means that packets are generally 1-2 minutes behind real-time, since:

- Packets are stored by the kernel for up to 10 seconds before being written to disk
- Packet files flush every minute
- Index files are created/flushed starting when packet files are written
- `stenographer` looks for new files on disk every 15 seconds

Altogether, this means that there's a maximum 100-120 second delay between `stenotype` seeing a packet and `stenographer` being able to serve that packet based on analyst requests.
Note that for fast links, this time is reduced slightly, since:
- Stenotype flushes a block whenever it gets 1MB of packets, reducing the initial 10-second wait for the kernel.
- `stenotype` flushes at 1 minute OR at 4GB, whichever comes first, so if you get over 4GB/min, you'll flush files/indexes faster than once a minute.