DAS Implementation Details

abowd001 edited this page Sep 16, 2021 · 9 revisions

Measurement

We use the term "measurement" to describe a noisy estimator of a function of the enumerated confidential data; that is, a noisy estimator of a function computed on the 2020 Census Edited File. In the DAS, queries are generalized marginal queries: individual attributes may be coarsened (aggregating some of their levels together, and optionally dropping some levels), and arbitrary Cartesian products of these coarsened attributes may then be used as predicates for the query. This procedure yields a special class of counting queries in which the levels within each query are mutually exclusive (adding or deleting a single record can alter at most the count corresponding to one scalar level in each query), so the unbounded (add-delete-a-record) L1 sensitivity of each query is at most 1. Although the theory underlying the DAS (paper to be linked here, when a public link is available) can support queries with arbitrary finite sensitivity, these special "coarsened marginal" queries correspond closely to the queries specified in the published tables from a decennial census, and they have been the primary focus of development and tuning of the DAS. Coarsened marginal queries correspond to the SumOverQuery and SumOverGroupedQuery classes in programs/queries/querybase.py, which are the only two query types returned by the makeTabularGroupQuery staticmethod in the same file, which in turn is the only query type recognized by the redistricting production repository's implementation of Schema.buildQuery in programs/schema/schema.py. For the redistricting production release, users must limit queries to the coarsened marginal type.

To give a concrete example, measuring the variable 'cenrace' for a particular county results in a query whose values are any of the 63 census race combinations and whose counts are the number of individuals in the county who reported each particular combination of races. This query can be implemented by fully coarsening all dimensions except CENRACE, and not coarsening the CENRACE dimension at all. Compound queries indicating multiple attributes involve Cartesian products of the individual attributes. For example, the query cenrace x hispanic has 126 possible values (63 race combinations times the 2 hispanic levels, yes or no).
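The marginalization and coarsening steps above can be sketched in a few lines of NumPy. This is a toy illustration, not code from the DAS repository: the attribute sizes (4 race levels instead of 63, 3 age levels) and the race groupings are made up for brevity.

```python
import numpy as np

# Toy schema: 4 race levels (instead of 63), 2 hispanic levels, 3 age levels.
rng = np.random.default_rng(0)
hist = rng.integers(0, 10, size=(4, 2, 3))  # stand-in for a confidential histogram

# "race x hispanic" marginal: keep the race and hispanic axes,
# fully coarsen age by summing over its axis.
marginal = hist.sum(axis=2)  # shape (4, 2); 126 cells in the real cenrace x hispanic query

# Coarsening race into two hypothetical groups {0,1} and {2,3} before marginalizing:
groups = [[0, 1], [2, 3]]
coarsened = np.stack([marginal[g].sum(axis=0) for g in groups])  # shape (2, 2)

# Every record is counted exactly once in each query, so totals are preserved.
assert marginal.sum() == hist.sum() == coarsened.sum()
```

Because each record falls in exactly one cell of each such query, adding or deleting one record changes at most one count per query, which is the sensitivity-1 property described above.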

In the redistricting production implementation, evaluation and management of these queries are typically further optimized through efficient Kronecker-product representations, which are especially convenient for representing Cartesian products of individual coarsened attributes as matrices.
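A minimal sketch of the Kronecker-product idea, with toy attribute sizes assumed as before: each attribute gets its own small query matrix (identity to keep all levels, a row of ones to coarsen fully), and the compound query matrix is the Kronecker product of the per-attribute matrices.

```python
import numpy as np

# Per-attribute query matrices: rows are (coarsened) output levels,
# columns are raw input levels.
Q_race = np.eye(4)        # keep all 4 toy race levels
Q_hisp = np.eye(2)        # keep both hispanic levels
Q_age = np.ones((1, 3))   # fully coarsen age (sum over its 3 levels)

# Compound query matrix for the Cartesian-product query race x hispanic.
Q = np.kron(np.kron(Q_race, Q_hisp), Q_age)  # shape (8, 24)

rng = np.random.default_rng(0)
hist = rng.integers(0, 10, size=(4, 2, 3))

# Applying Q to the C-order-flattened histogram answers the marginal query.
answer = Q @ hist.ravel()
assert np.array_equal(answer, hist.sum(axis=2).ravel())
```

The factored form never needs to materialize Q in practice: matrix-vector products can be applied one attribute at a time, which is what makes this representation efficient for large schemas.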

Noise Injection

Noise injection consists of adding the random output of the Discrete Gaussian Mechanism to the count query generated by the process described in the measurement section. The result of adding the random perturbations to the queries on the confidential data is called a "noisy measurement." The noisy measurements provably protect the original data (in the sense described on the Background page of this wiki). The confidential attributes of the individuals are protected with a guarantee provided for each attribute singly, and for each combination of attributes, depending on the amount and distribution of the privacy-loss budget. The noisy measurements may be negative and mutually inconsistent. In the redistricting production run on the 2020 Census Edited File, there are 16.6 billion noisy measurements. These noisy measurements are post-processed into a single, consistent microdata representation that is used for subsequent tabulations published as the P.L. 94-171 Redistricting Data Summary File.
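The mechanism can be sketched as follows. This is not the DAS's sampler: production implementations use an exact rejection sampler for the discrete Gaussian, whereas this sketch simply normalizes the discrete Gaussian probability mass function on a truncated integer support and samples from it directly. The variance parameter and the example counts are made-up values.

```python
import numpy as np

def sample_discrete_gaussian(sigma, size, rng):
    """Approximate discrete Gaussian sampler: PMF proportional to
    exp(-x^2 / (2 sigma^2)) over integers, truncated at +/- 12 sigma.
    A sketch only; exact samplers avoid truncation entirely."""
    tail = int(np.ceil(12 * sigma))
    support = np.arange(-tail, tail + 1)
    pmf = np.exp(-support.astype(float) ** 2 / (2 * sigma ** 2))
    pmf /= pmf.sum()
    return rng.choice(support, size=size, p=pmf)

rng = np.random.default_rng(0)
true_counts = np.array([120, 45, 0, 3])  # a toy marginal query's answer
noise = sample_discrete_gaussian(2.0, true_counts.size, rng)
noisy = true_counts + noise
# Noisy measurements stay integer-valued but may be negative (e.g. for the 0 cell).
```

Integer-valued noise is one reason the discrete Gaussian is preferred here over its continuous counterpart: the noisy measurements remain in the same (integer count) domain as the confidential answers.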

Microdata-as-Histograms from Noisy Measurements

Mathematical optimization is used to infer from the noisy measurements a "best-fitting" set of microdata, which can be conveniently represented as a histogram. These same optimization procedures also ensure that designated invariants and constraints over the resulting MDF are satisfied.

Optimization is organized in the DAS into Nonnegative Least Squares (NNLS) passes, which generally return float-valued estimates of histograms, and Rounder passes, which target a less rich set of queries and use the L1 rather than L2 norm, but, under suitable technical conditions, efficiently and provably generate nonnegative integer-valued histograms. These nonnegative integer-valued histograms can then be interpreted as counts of the number of records of each type in the current geographic unit. They are essentially identical to microdata, but written in a form more convenient for matrix algebra than for human reading and review.
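A toy sketch of the NNLS step, using scipy's off-the-shelf solver rather than the DAS's own optimizer, and with plain rounding standing in for the Rounder pass (the real Rounder solves an L1 optimization under constraints; rounding here is only a placeholder). The query matrices and noisy values are invented for illustration.

```python
import numpy as np
from scipy.optimize import nnls

# Toy setup: estimate a 3-cell histogram from noisy answers to two queries,
# the total (sum of all cells) and the detailed cells.
Q_total = np.ones((1, 3))
Q_detail = np.eye(3)
A = np.vstack([Q_total, Q_detail])            # stacked query matrix, shape (4, 3)
noisy = np.array([101.0, 40.0, 35.0, 30.0])   # noisy measurements: total, then cells

# NNLS pass: float-valued, nonnegative least-squares fit to the noisy measurements.
x, residual = nnls(A, noisy)

# Placeholder for the Rounder pass (which actually uses an L1 objective).
rounded = np.round(x).astype(int)
```

Note that the float NNLS solution reconciles the inconsistency between the noisy total (101) and the noisy cells (40 + 35 + 30 = 105), which is exactly the role post-processing plays at scale.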

Note that, because the input to the mathematical optimization that generates histograms from the noisy measurements is already privacy-protected, and the mathematical optimization only has access to the true, sensitive data indirectly through the noisy measurements (and select invariants), the resulting histograms are also privacy-protected. This property is a reflection of the "post-processing" property of differentially private algorithms (including zero-Concentrated Differentially Private Algorithms), suitably adapted to a setting where some non-trivial invariants are present.

Histograms to Microdata

Once the privacy-protected histograms have been generated down to the census block geographic level, they are used to generate a privacy-protected microdata file. This transformation is just a change in format: each index in the numpy ndarray representing each histogram corresponds to a record type, and its nonnegative count indicates the number of records of that type. The DAS writers are responsible for performing this conversion, generating a number of microdata records identical to the count in each ndarray index. The privacy-protected microdata records generated in this way can then be used for conventional statistical tabulations, or read directly by humans.
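The format change described above can be sketched in a few lines; the tiny two-attribute histogram is a made-up example, not DAS writer code.

```python
import numpy as np

# A privacy-protected block-level histogram: axes are (race, hispanic),
# entries are nonnegative integer counts of records of each type.
hist = np.array([[2, 0],
                 [1, 3]])

# One index tuple per histogram cell, in the same C order as hist.ravel().
cells = np.array(list(np.ndindex(hist.shape)))

# Emit one microdata record (a tuple of attribute indices) per unit of count.
records = np.repeat(cells, hist.ravel(), axis=0)
# records has hist.sum() == 6 rows: two (0, 0) records, one (1, 0), three (1, 1).
```

Summing indicator counts over the emitted records recovers the histogram exactly, which is why the two representations carry identical information.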