DAS Implementation Overview
This is an overview of the implementation of the DAS. More details can be found here.
The DAS is a Python Spark (PySpark) application that runs on Amazon Web Services (AWS) Elastic MapReduce (EMR). The DAS cluster employs a single Master node, on which the DAS PySpark driver application is typically deployed, and multiple Core and Task nodes that provide both data storage and computation; it is on these nodes that the noisy measurements are taken for each geographic unit and that the mathematical optimization to generate microdata for each parent-children set of geographic units occurs. Processing proceeds in parallel across the Core and Task nodes, managed by the Master node and the EMR infrastructure. Inputs are provided and outputs are saved in AWS S3 at locations specified as part of the configuration of a given run of the DAS.
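This run structure can be sketched, in heavily simplified and hypothetical form, as a PySpark driver that reads its inputs from S3, distributes per-geographic-unit work across the Core and Task nodes, and writes its outputs back to S3. The paths, field positions, and per-unit work below are illustrative stand-ins, not the actual DAS code.

```python
from pyspark.sql import SparkSession

# Hypothetical S3 locations; in the real system these come from the run's
# configuration files rather than being hard-coded.
CEF_PATH = "s3://example-bucket/cef/"
OUTPUT_PATH = "s3://example-bucket/mdf/"

# The driver typically runs on the Master node; executors on the Core and
# Task nodes perform the per-geographic-unit work in parallel.
spark = SparkSession.builder.appName("das-structure-sketch").getOrCreate()

cef = spark.read.csv(CEF_PATH, sep="|", header=False)

per_unit_results = (
    cef.rdd
       .map(lambda row: (row[0], row))           # key each record by a (hypothetical) geocode field
       .groupByKey()                             # one group per geographic unit
       .mapValues(lambda recs: len(list(recs)))  # placeholder for the per-unit measurement/optimization work
)

per_unit_results.saveAsTextFile(OUTPUT_PATH)
```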
For the P.L. 94-171 redistricting data release, the inputs to the DAS consist of:
- 2020 Census enumeration data contained in 52 Census Edited Files (CEFs)--one for each state, the District of Columbia, and the Commonwealth of Puerto Rico--as described here.
- Geographic data (GRF-C files) indicating geographic IDs in a hierarchy: nation, state, county, census tract, block group, and block.
- A query strategy that specifies the queries for which noisy measurements (and invariants) are taken.
- A query workload that specifies the publication table structure as a separate set of queries, used to evaluate system performance relative to the 2010 CEF (terminology following the strategy-workload distinction made in the Matrix Mechanism papers).
- Key supporting parameters (see the illustrative sketch after this list):
- a privacy-loss budget associated with each query at each geographic level (computed based on related config parameters)
- query orderings specifying the order in which strategy queries are added to each of the optimization problems used to generate well-formed (non-negative integral) histogram counts
- a set of invariants and constraints which the output histograms (used to generate protected microdata) must satisfy
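To make the shape of these supporting parameters concrete, here is a purely illustrative Python sketch. The key names, query names, and numeric values are hypothetical stand-ins, not the DAS configuration schema.

```python
# Hypothetical stand-in for the supporting parameters described above.
supporting_parameters = {
    # Share of the privacy-loss budget spent at each geographic level
    # (illustrative proportions; they must sum to 1).
    "geolevel_budget_proportions": {
        "US": 0.10, "State": 0.25, "County": 0.15,
        "Tract": 0.15, "Block_Group": 0.15, "Block": 0.20,
    },
    # Order in which strategy queries are added to the optimization passes.
    "query_ordering": ["total", "votingage", "hhgq", "cenrace", "detailed"],
    # Quantities held at their enumerated values, plus structural constraints
    # the output histograms must satisfy.
    "invariants": ["state_total_population", "housing_unit_counts"],
    "constraints": ["nonnegativity", "integrality", "parent_child_consistency"],
}
```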
Before the DAS runs, the input CEFs and GRF-C are validated to check the numeric ranges of CEF fields and to confirm that the geographic locations referenced in the CEF are defined in the GRF-C data. Note that for the redistricting production run, the DAS used a subset of the CEF fields; other fields support other Census Bureau business processes.
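A minimal PySpark sketch of those two checks, assuming hypothetical column names (QAGE for age, GEOCODE for the geographic identifier), illustrative file locations, and an illustrative age range, might look like this:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("input-validation-sketch").getOrCreate()

# Hypothetical locations and formats for the already-ingested inputs.
cef = spark.read.parquet("s3://example-bucket/cef-parsed/")
grfc = spark.read.parquet("s3://example-bucket/grfc-parsed/")

# Range check: flag CEF records whose age field falls outside an expected range.
bad_ranges = cef.filter((F.col("QAGE") < 0) | (F.col("QAGE") > 115)).count()

# Referential check: every geographic code referenced in the CEF must be
# defined in the GRF-C geography data (a left anti-join finds the exceptions).
undefined_geocodes = (
    cef.select("GEOCODE").distinct()
       .join(grfc.select("GEOCODE").distinct(), on="GEOCODE", how="left_anti")
       .count()
)

assert bad_ranges == 0 and undefined_geocodes == 0, "CEF/GRF-C validation failed"
```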
For the PL94-171 redistricting data release, the DAS produced four privacy-protected microdata detail files (MDFs):
- Individual person microdata for the US
- Individual person microdata for Puerto Rico
- Housing unit microdata for the US
- Housing unit microdata for Puerto Rico
Each run also saved the query histograms for each geographic region (down to the block level) as noisy measurements for future use. The TopDown Algorithm (TDA) was run four times to generate the privacy-protected MDFs listed above.
Each run shares the following steps:
- The input files are combined to create an internal representation of the relevant subset of the approximately 330 million people and 140 million households in the United States and its territories (e.g., all US states and state-equivalents, but not Puerto Rico, are combined for run #1 above)
- For each geographic level and each region (node) within each level, a measurement for each strategy query is generated using the discrete Gaussian mechanism; these are the noisy measurements, which can be negative (and often will be, if the enumerated count in the CEF was 0 or near 0) and which provide many inconsistent estimators (e.g., summing over the CENRACE query and summing over the VOTINGAGE query each yields an estimate of the TOTAL population, but these will generally be different estimates); a simplified sampler is sketched after this list
- To avoid negative counts and impose consistency, the noisy measurements are optimized to produce well-formed, arithmetically consistent histograms that satisfy the designated invariants and constraints. This is done by translating the constraints into a system of linear equations and inequalities, which are appended to optimization problems (with measures of distance to the query count estimates in the objective function) and solved by the Gurobi optimizer, accessed via its Python API; a toy version is sketched after this list. To avoid certain kinds of bias, queries are added to the linear system in a particular order (which may vary across geographic levels) and solved in multiple passes, often targeting queries expected to have large true counts in earlier passes (though this expectation is formed without examining the to-be-protected data, to avoid leaking unprotected information)
- The histograms generated via optimization are then used to generate a set of records for the MDF, which is then used downstream to produce the published tabulations
- These records are then written to S3 in the MDF format. The noisy measurements are also saved for later release
- For tuning experiments using the 2010 CEF, the output microdata may be compared against the original inputs to quantify their accuracy
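The noise addition in the second step can be illustrated with a deliberately simplified sampler. This is not the production sampler--the DAS samples the discrete Gaussian exactly rather than from a truncated support--but it shows the essential behavior: integer-valued, zero-centered noise that can push small counts negative.

```python
import numpy as np

def truncated_discrete_gaussian(sigma: float, bound: int = 200, rng=None) -> int:
    """Sample an integer with probability proportional to exp(-k^2 / (2*sigma^2)),
    restricted to [-bound, bound] purely to keep this illustration short."""
    rng = rng or np.random.default_rng()
    support = np.arange(-bound, bound + 1)
    weights = np.exp(-support.astype(float) ** 2 / (2.0 * sigma ** 2))
    probs = weights / weights.sum()
    return int(rng.choice(support, p=probs))

# Example: a true block-level count of 0 can yield a negative noisy measurement.
true_count = 0
noisy_measurement = true_count + truncated_discrete_gaussian(sigma=10.0)
print(noisy_measurement)
```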
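The optimization in the third step can likewise be sketched at toy scale. The example below, which assumes gurobipy is installed and licensed, adjusts a handful of noisy child counts to nonnegative integers that sum exactly to an invariant parent total; the real DAS builds far larger multi-pass models over full histograms, so this only shows the pattern of expressing constraints and a distance objective through Gurobi's Python API.

```python
import gurobipy as gp
from gurobipy import GRB

noisy_children = [12.3, -2.1, 7.8, 4.0]   # noisy measurements (may be negative)
parent_total = 22                          # invariant the children must sum to

m = gp.Model("toy-child-adjustment")
x = m.addVars(len(noisy_children), lb=0, vtype=GRB.INTEGER, name="child")

# Consistency constraint: the children sum to the (invariant) parent total.
m.addConstr(gp.quicksum(x[i] for i in range(len(noisy_children))) == parent_total)

# Objective: squared distance between adjusted counts and the noisy measurements.
m.setObjective(
    gp.quicksum((x[i] - noisy_children[i]) * (x[i] - noisy_children[i])
                for i in range(len(noisy_children))),
    GRB.MINIMIZE,
)
m.optimize()
print([int(x[i].X) for i in range(len(noisy_children))])
```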
The injection of noise in the second step and the subsequent use of those noisy measurements in the remaining steps ensure that any derived products cannot be used to infer more about an individual's attributes, beyond what could be inferred if that individual's data were not present, than is admitted by the privacy-loss budget and its distribution.
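One common way to make this quantitative for the discrete Gaussian mechanism is zero-concentrated differential privacy (zCDP): a query with sensitivity Δ answered with discrete Gaussian noise of scale σ satisfies ρ-zCDP, and the per-query values compose additively into the overall budget whose distribution across levels and queries is referred to above. In sketch form:

```math
\rho_{\text{query}} = \frac{\Delta^{2}}{2\sigma^{2}},
\qquad
\rho_{\text{total}} = \sum_{\text{levels}} \sum_{\text{queries}} \rho_{\text{query}}
```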
For the PL94-171 redistricting data release, the workload consisted of 10 queries (with 2,477 possible values) across about 6 million geographic regions at 6 levels (5 for Puerto Rico). Thus, the noisy measurements consisted of 16.6 billion cells.
After the four MDF files were generated, they were copied to AWS S3 together with the noisy measurements and other artifacts from the DAS run. These include configuration files, execution logs, and saved application sources and libraries. The contents of this source release were extracted from these logs for the officially published 2020 Census P.L. 94-171 Redistricting Data Summary Files.