dnstapir/datasets

DNS TAPIR Datasets

Datasets generated and processed within the DNS TAPIR data collection and analysis components.

Events

Events are sent for analysis as they happen.

New Domain events are sent from TAPIR Edge whenever a domain is new locally. In practice, the Edge DNSTAP Minimiser (EDM) keeps a lookup table of the domains it has processed (excluding the Well-known set, see below) and sends an event when a domain is not yet in that table.
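The local lookup described above can be sketched as follows. This is a minimal illustration, not the EDM implementation: the class name `LocalSeenTable`, the event fields, and the name normalisation are assumptions made for the example.

```python
from typing import Optional


class LocalSeenTable:
    """Hypothetical per-node table of domains already processed."""

    def __init__(self, well_known: set[str]) -> None:
        self.well_known = well_known  # assumed to be covered by reports, not events
        self.seen: set[str] = set()

    def observe(self, qname: str) -> Optional[dict]:
        """Return a New Domain event the first time a domain is seen, else None."""
        domain = qname.rstrip(".").lower()
        if domain in self.well_known or domain in self.seen:
            return None
        self.seen.add(domain)
        # Event layout is illustrative only
        return {"type": "new_domain", "domain": domain}


table = LocalSeenTable(well_known={"example.com"})
first = table.observe("Unusual.Example.NET.")   # new locally, yields an event
second = table.observe("unusual.example.net")   # already seen, yields None
known = table.observe("example.com")            # well-known, yields None
```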

Globally New Domain

TAPIR Core receives the "New Domain" events from TAPIR Edge and compares them to a central lookup table that summarises new domains across all edge nodes. Patterns in how these events propagate are also analysed, for instance co-occurrence across multiple nodes. The lookup is implemented using a key-value database containing the "Previously Seen" set (see below), where non-existence indicates a globally new domain.

This data is sent to Edge Policy Processor as part of other observations, described in "Aggregated Observations" below.

The data format for transmitting this data outside of DNS TAPIR is evolving (see "Detailed Observations" below).
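The "Globally New Domain" check described above could look roughly like the sketch below. An in-memory dictionary stands in for the key-value database, and the class name `PreviouslySeen`, its methods, and the node identifiers are assumptions made for illustration.

```python
class PreviouslySeen:
    """Hypothetical stand-in for the key-value "Previously Seen" set."""

    def __init__(self) -> None:
        # domain -> ids of the edge nodes that have reported it
        self.nodes_by_domain: dict[str, set[str]] = {}

    def handle_event(self, domain: str, node_id: str) -> bool:
        """Record a New Domain event; return True if the domain is globally new."""
        globally_new = domain not in self.nodes_by_domain
        self.nodes_by_domain.setdefault(domain, set()).add(node_id)
        return globally_new

    def co_occurrence(self, domain: str) -> int:
        """Number of distinct edge nodes that have reported the domain."""
        return len(self.nodes_by_domain.get(domain, set()))


store = PreviouslySeen()
a = store.handle_event("unusual.example.net", "edge-1")  # absent from the table
b = store.handle_event("unusual.example.net", "edge-2")  # already present
count = store.co_occurrence("unusual.example.net")
```

Keeping the reporting nodes per domain is one simple way to support the co-occurrence analysis mentioned above; a real deployment would also need to expire entries to respect the time window of the data.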

Reports

Reports are collections of data that are sent for analysis at timed intervals.

This report is generated by Edge DNSTAP Minimiser (EDM) for the Well-known Domain data collected from DNSTAP data. The Edge local analysis, when fully implemented, will also aggregate data into this format for less well-known domains after ensuring privacy levels are met.

Reports from TAPIR Edge instances are aggregated and summarised into (currently) 5-minute windows in this report.

Vectors

Vectors are encoded sequences of queries, mainly for machine learning purposes. All tokenisation and encoding is done in TAPIR Edge and is a work in progress, strongly dependent on the implementation of the edge analysis engine. The following can be seen as a rough example of one such strategy:

Vectorised query information from Edge - Work in progress
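Since the encoding is still a work in progress, the following is only a generic illustration of what "tokenise and encode" can mean, not the strategy used in TAPIR Edge. It assumes label-level tokens and a growing integer vocabulary; `tokenise` and `Vocabulary` are hypothetical names.

```python
def tokenise(qname: str) -> list[str]:
    """Split a query name into lower-case labels (assumed token unit)."""
    return qname.rstrip(".").lower().split(".")


class Vocabulary:
    """Hypothetical token-to-integer mapping; id 0 is reserved for unknown tokens."""

    def __init__(self) -> None:
        self.ids: dict[str, int] = {"<unk>": 0}

    def encode(self, tokens: list[str], grow: bool = True) -> list[int]:
        encoded = []
        for token in tokens:
            if grow and token not in self.ids:
                self.ids[token] = len(self.ids)  # assign the next free id
            encoded.append(self.ids.get(token, 0))
        return encoded


vocab = Vocabulary()
queries = ["www.example.com.", "mail.example.com."]
vectors = [vocab.encode(tokenise(q)) for q in queries]
unknown = vocab.encode(tokenise("ftp.example.com"), grow=False)
```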

Sets

Sets are, in essence, lookup tables. These are generally viewed as either known good or known bad baselines, but without actual ground truth the labels are more accurately described as probably good or probably bad. Note that sets are not necessarily implemented as separate tables, and can very well be a unified lookup table with multiple sets.
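A unified lookup table holding several sets could be sketched with one bit per set, as below. The set names and the `in_set` helper are hypothetical and chosen only to illustrate the idea.

```python
from enum import IntFlag


class SetMembership(IntFlag):
    """Hypothetical set names; one bit per set."""
    NONE = 0
    WELL_KNOWN = 1
    ALLOW_LISTED = 2
    PROBABLY_BAD = 4


# One table, several sets: each domain maps to the sets it belongs to
unified: dict[str, SetMembership] = {
    "example.com": SetMembership.WELL_KNOWN | SetMembership.ALLOW_LISTED,
    "bad.example.net": SetMembership.PROBABLY_BAD,
}


def in_set(domain: str, wanted: SetMembership) -> bool:
    """True if the domain belongs to the given set in the unified table."""
    return bool(unified.get(domain, SetMembership.NONE) & wanted)
```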

TAPIR Edge uses a number of sets to map incoming data into categories. The most central of these is the list of well-known domains from which to generate summarised statistics.

Examples of such a list (or lists, for exact and wildcard) can be found here:

This dataset is generated by TAPIR Core based on inputs such as OpenPageRank as well as internal research, and is used by EDM to categorise and minimise data sent to Core.
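Matching a query name against an exact list and a wildcard list could be sketched as follows. The list contents and the wildcard semantics (a wildcard entry matches any name strictly below it) are assumptions; the actual matching rules in EDM may differ.

```python
EXACT = {"example.com"}              # hypothetical exact-match entries
WILDCARD_SUFFIXES = {"example.org"}  # each entry stands for "*.example.org"


def is_well_known(qname: str) -> bool:
    """True if the name is in the exact list or below a wildcard entry."""
    domain = qname.rstrip(".").lower()
    if domain in EXACT:
        return True
    labels = domain.split(".")
    # Compare every proper parent of the name against the wildcard list
    return any(
        ".".join(labels[i:]) in WILDCARD_SUFFIXES
        for i in range(1, len(labels))
    )
```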

Since some aspects of the data only exist in TAPIR Edge, any categorisation based on those parameters needs to happen at the edge. The following are examples of such sets with which to tag the data:

To highlight data for Edge Analyse, data from known client addresses that exhibit suspicious behaviour can be tagged. This is helpful as metadata for generating vectors of query streams that may contain malicious domains.

  • Suspect Clients: Tag local data from known suspicious clients - Work in progress
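Tagging queries from suspect clients could be sketched with Python's standard `ipaddress` module. The network list, the `tag_query` helper, and the record layout are hypothetical; the addresses come from documentation-only ranges.

```python
import ipaddress

# Hypothetical suspect client networks (documentation address ranges)
SUSPECT_NETWORKS = [
    ipaddress.ip_network("192.0.2.0/28"),
    ipaddress.ip_network("2001:db8::/64"),
]


def tag_query(client_ip: str, qname: str) -> dict:
    """Attach a suspect-client flag to a query record; this stays at the edge."""
    addr = ipaddress.ip_address(client_ip)
    # Membership tests across IPv4 and IPv6 simply evaluate to False
    suspect = any(addr in net for net in SUSPECT_NETWORKS)
    return {"qname": qname, "suspect_client": suspect}
```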

Edge Policy Manager also requires some datasets for generating policy decisions, such as allow-lists for domains to be excluded from policy. These can be local or received from TAPIR Core, for example

TAPIR Core maintains a global list of seen domains, used to assert that (within the time window of that data) a domain is new.

Edge DNSTAP Minimiser also requires datasets that ensure some data is never processed. These cannot, for obvious reasons, come from or be handled by TAPIR Core. Some examples are:

Transformations

Transformations are processes that turn one dataset into a different dataset, typically for the purpose of feature extraction, aggregation and/or privacy enhancement.

Classification tags are different across the system. This is deliberate, since aligning the requirements of the different components proved very hard. Edge DNSTAP Minimiser uses a 64-bit integer to represent 64 different tags on the incoming data. TAPIR Core uses a UTF-8 string where Unicode glyphs can represent a large (practically infinite) number of observable traits. Finally, Edge Policy Processor is limited to a 32-bit integer, representing 32 different "meta-tags" created by joining Core traits into observations.

  • From EDM to TAPIR Core
  • From TAPIR Core to POP
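The three tag encodings described above, and the two translations between them, can be illustrated with the sketch below. The tag names, bit positions, glyphs, and the rule that a meta-tag requires all of its Core traits are assumptions made for the example, not the actual mapping tables.

```python
# Hypothetical tables; real bit positions and glyphs are not specified here
EDM_TAG_BITS = {"new_locally": 0, "suspect_client": 1}        # bits 0..63
CORE_GLYPHS = {"new_locally": "\u2605", "suspect_client": "\u26a0"}
META_TAGS = {0: {"\u2605", "\u26a0"}}  # meta-tag bit -> Core traits it joins


def edm_encode(tags: set[str]) -> int:
    """Pack EDM tags into a 64-bit integer bitmask."""
    value = 0
    for tag in tags:
        value |= 1 << EDM_TAG_BITS[tag]
    return value


def edm_to_core(value: int) -> str:
    """Translate an EDM bitmask into a string of Core trait glyphs."""
    return "".join(
        CORE_GLYPHS[name]
        for name, bit in EDM_TAG_BITS.items()
        if (value >> bit) & 1
    )


def core_to_meta(traits: str) -> int:
    """Join Core traits into a 32-bit meta-tag bitmask."""
    present = set(traits)
    meta = 0
    for bit, required in META_TAGS.items():
        if required <= present:  # all required traits are present
            meta |= 1 << bit
    return meta & 0xFFFFFFFF
```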

TAPIR Core implements automated processes to refine the incoming data. Currently, the incoming histograms are grouped into an extended histogram (see "Well-known Histogram" above) spanning a 5-minute window. The transform can be found in this Jupyter Notebook example:

DataLoad
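Independently of the notebook, the grouping of incoming histograms into 5-minute windows can be sketched in a few lines. The report shape (a Unix timestamp plus a domain-to-count mapping) is an assumption; the real histogram format carries more fields.

```python
from collections import Counter, defaultdict

WINDOW_SECONDS = 300  # the (currently) 5-minute window


def window_start(timestamp: int) -> int:
    """Round a Unix timestamp down to the start of its window."""
    return timestamp - timestamp % WINDOW_SECONDS


def group_histograms(reports):
    """Merge (timestamp, {domain: count}) reports per window."""
    windows = defaultdict(Counter)
    for timestamp, histogram in reports:
        windows[window_start(timestamp)].update(histogram)  # adds counts
    return dict(windows)


reports = [
    (600, {"example.com": 3}),
    (660, {"example.com": 2, "example.org": 1}),
    (905, {"example.com": 5}),
]
merged = group_histograms(reports)
```

Here the first two reports fall into the window starting at 600 and are summed, while the third lands in the window starting at 900.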

Filters

Filters primarily serve to minimise data by removing uninteresting data or noise, such as known single-label queries and other artefacts that cannot resolve. These filters act on the collected data and are different from filters acting on the DNS query-response process.
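As the filters themselves are still to be defined, the following is only a sketch of the single-label case mentioned above. The list of known single labels and the function names are hypothetical.

```python
# Hypothetical examples of single-label names treated as noise
KNOWN_SINGLE_LABELS = {"localhost", "wpad"}


def is_noise(qname: str) -> bool:
    """True for known single-label query names."""
    labels = [label for label in qname.rstrip(".").lower().split(".") if label]
    return len(labels) == 1 and labels[0] in KNOWN_SINGLE_LABELS


def apply_filter(qnames: list[str]) -> list[str]:
    """Drop noise from collected query names, keeping the original order."""
    return [q for q in qnames if not is_noise(q)]
```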

TBD

Observations

Observations are publicised domains that pass a threshold. There may be multiple thresholds signifying an estimation of reliability or risk.
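Multiple thresholds could be modelled as an ordered list of limits, as in this sketch. The numeric score, the limits, and the level names are all assumptions; the text does not specify what is measured or how levels are named.

```python
from typing import Optional

# Hypothetical thresholds, highest first: (minimum score, level name)
THRESHOLDS = [(0.9, "high"), (0.5, "medium")]


def observation_level(score: float) -> Optional[str]:
    """Return the highest level whose threshold the score passes, else None."""
    for limit, level in THRESHOLDS:
        if score >= limit:
            return level
    return None  # below every threshold: not publicised
```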

To receive Events and generate Observations, two example Jupyter Notebooks can be found below. One uses a one-shot mechanism to send continuous MQTT messages to Edge Policy Processor. The other implements a server that prints out incoming Events and, if a domain arrives as "something.something.foo.example.com", generates an observation for "something.something.foo".
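The name handling in the second example can be read as stripping a fixed suffix from the incoming domain. The sketch below shows that reading only; the function name, the default suffix, and the behaviour for non-matching names are assumptions, and the MQTT transport is left out entirely.

```python
from typing import Optional


def observation_for(domain: str, suffix: str = "example.com") -> Optional[str]:
    """Strip a trailing ".<suffix>" from the domain, or return None if absent."""
    name = domain.rstrip(".").lower()
    tail = "." + suffix
    if name.endswith(tail) and len(name) > len(tail):
        return name[: -len(tail)]
    return None
```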