Changes since the 2019 Release

abowd001 edited this page Sep 16, 2021 · 8 revisions

In late 2019, the Census Bureau released a version of the DAS source code as part of the first "Demonstration Data Product"--one in a series of working code releases intended to help external users evaluate and provide feedback on the usability of DAS-generated data at candidate privacy-loss budget settings, and to identify specific areas where they felt improvement was important or where the DAS appeared to have introduced algorithmic artifacts. Numerous significant changes have been made to the DAS since that late 2019 release (several driven directly by public feedback on it); this page summarizes them:

  • In the 2019 release, a single histogram (or fully saturated contingency table; that is, a vector of counts with each entry corresponding to the number of records that satisfy one set of attributes in the list of all legal attribute values) was used to generate the microdata for each product. In the current repository, in programs/schema/schemas/schemamaker.py, the single histogram has been replaced with a multi-histogram representation. For the redistricting production release, the second histogram is degenerate in the sense that it is used only for invariant calculations. The DHCH schema (not fully implemented in the redistricting data release) contains examples of non-degenerate use of multiple histograms.
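To make the "vector of counts" picture concrete, here is a minimal toy sketch of a fully saturated contingency table built over the Cartesian product of attribute levels. The attribute names and values are hypothetical illustrations, not the production schema:

```python
from itertools import product
from collections import Counter

# Hypothetical attributes (illustrative only, not the production schema).
ATTRS = {
    "voting_age": ["under18", "18plus"],
    "hispanic": ["not_hisp", "hisp"],
}

# Each cell of the fully saturated contingency table is one combination
# of legal attribute values, enumerated in a fixed order.
cells = list(product(*ATTRS.values()))

def histogram(records):
    """Flatten a list of records into a vector of counts, one per cell."""
    counts = Counter(records)
    return [counts[cell] for cell in cells]

records = [("18plus", "hisp"), ("18plus", "hisp"), ("under18", "not_hisp")]
hist = histogram(records)
# The vector has one entry per legal attribute combination, and its
# entries sum to the number of input records.
```

A multi-histogram schema simply carries several such vectors side by side; in the redistricting release the second one is degenerate and serves only invariant calculations.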

  • In the redistricting production release, anticipating the DHC production run, several individual attributes were combined (flattened) into a single complex attribute called "dhch" to achieve a more computationally efficient representation of the publication tables by removing many of the impossible combinations that arise from taking the Cartesian product of the simple attributes. Although the schema using dhch is present in this release, this attribute was not used in the P.L. 94-171 redistricting production run.
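The efficiency gain from flattening can be sketched with a toy example: the full Cartesian product of simple attributes contains structurally impossible cells, and the combined complex attribute keeps only the feasible ones. The attribute names and the feasibility rule below are hypothetical illustrations:

```python
from itertools import product

# Hypothetical simple attributes (illustrative, not the production dhch attribute).
hhtype = ["married", "cohabiting", "alone"]
size = [1, 2, 3, 4]

# The full Cartesian product contains impossible cells, e.g. a
# married-couple household of size 1.
full = list(product(hhtype, size))

def possible(ht, sz):
    # Toy structural rule: couple households need at least 2 people.
    return not (ht in ("married", "cohabiting") and sz < 2)

# The flattened complex attribute enumerates only feasible combinations,
# shrinking the histogram the optimizer must handle.
flattened = [(ht, sz) for ht, sz in full if possible(ht, sz)]
```

In the real schema the savings are much larger, because many attributes are combined at once and the impossible fraction of the Cartesian product grows with each one.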

  • The redistricting production release uses the randomgen library to access the Intel RDRAND instruction for thermal-noise-based random number generation, whereas the 2019 release used MKLRandom.

  • In the 2019 release, the geographic spine on which the TopDown Algorithm operated was essentially a fixed data structure. It mirrored the Census Bureau Geography Division spine as illustrated in the geography definitions displayed in tabular summary documentation [XXXX insert link to PL.94-171 Redistricting Data Summary File PDF documentation on census.gov], with limited modification based on truncating geocode strings at different lengths to alleviate run-time/memory-use bottlenecks. The redistricting production version of the DAS code features a new approach to the geographic spine that includes options to integrate AIAN entities onto or near the spine, and an optional heuristic for explicitly minimizing the number of on-spine geographic entities that must be added to or subtracted from one another to estimate a target off-spine Census Place or Minor Civil Division (when available, a public link will be provided with more technical details). For a fixed privacy-loss budget and fixed allocation between queries and geographic levels, the DAS post-processing error in off-spine entities increases the more distant (in the add/subtract sense used above) they are from the DAS geographic spine. Therefore, explicitly minimizing this distance for some target off-spine entities gives the DAS more control over its off-spine performance. (Note: the spine optimization also re-allocates privacy-loss budget between parent and child geographies. In some cases, this requires a generalization of the privacy accounting argument, which is the subject of an unpublished working paper, to be linked here when publicly available.)
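The add/subtract distance can be illustrated with a toy sketch: an off-spine place is estimated by adding every overlapping on-spine unit and subtracting back out the blocks of partially covered units that fall outside the place. The data structures and the counting rule below are simplified illustrations, not the production heuristic:

```python
# Toy illustration of the add/subtract distance of an off-spine entity.
def add_subtract_distance(place_blocks, spine_units):
    """spine_units: dict mapping an on-spine unit name to its set of blocks.

    Count one term for each overlapping on-spine unit that is added, plus
    one term for each stray block that must be subtracted back out.
    """
    terms = 0
    for unit, blocks in spine_units.items():
        if blocks & place_blocks:                # unit overlaps the place
            terms += 1                           # add the whole unit...
            terms += len(blocks - place_blocks)  # ...subtract stray blocks
    return terms

spine = {"tract1": {"b1", "b2"}, "tract2": {"b3", "b4"}}
place = {"b1", "b2", "b3"}   # off-spine: splits tract2
# tract1 and tract2 are each added (2 terms) and block b4 is subtracted
# (1 term), so the place sits at distance 3 from this spine.
```

Because each added or subtracted term carries its own post-processing error, moving target entities closer to the spine (fewer terms) directly reduces their expected error.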

  • In the 2019 release, the DAS used pure differential privacy as its privacy framework. The 2019 code implemented the Geometric Mechanism based on numpy samplers, which use inexact floating-point operations to approximate target distributions. To reduce the incidence of outliers, and to remove inexact floating-point operations (which can introduce subtle privacy vulnerabilities, as initially discovered by Mironov (2012)) while still maintaining a precise, formally private guarantee, the redistricting production DAS code uses zero-Concentrated Differential Privacy (zCDP) as its primary framework via a config option that selects the active DP framework. zCDP is implemented using an exact Discrete Gaussian mechanism as its primitive noise algorithm (based on the work of Canonne, Kamath, and Steinke). (Note: formal privacy guarantees bound how much more an attacker can learn than could have been learned had some or all of an arbitrary respondent's data been replaced with dummy values, with few assumptions about the attacker's side information or computational power. This, including how invariants weaken such guarantees, will be detailed in a forthcoming semantics paper (public link to be provided here when available).)
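Two properties make zCDP convenient for a system like the DAS: the privacy parameter rho composes additively across queries and geographic levels, and a total rho can be converted to a conventional (epsilon, delta) guarantee. A minimal sketch of that accounting arithmetic, using the standard conversion bound of Bun and Steinke (2016) (the rho values below are made-up illustrations, not production budget settings):

```python
import math

def compose(rhos):
    """rho-zCDP composes additively across mechanisms/queries."""
    return sum(rhos)

def zcdp_to_approx_dp(rho, delta):
    """Standard conversion: rho-zCDP implies (eps, delta)-DP with
    eps = rho + 2*sqrt(rho * ln(1/delta))  (Bun & Steinke, 2016)."""
    return rho + 2 * math.sqrt(rho * math.log(1 / delta))

# Illustrative per-query budgets; the production allocations differ.
total_rho = compose([0.1, 0.1, 0.05])
eps = zcdp_to_approx_dp(total_rho, delta=1e-10)
```

The exact Discrete Gaussian sampler itself (which avoids floating-point approximation entirely by working over the integers) is more involved; see the Canonne-Kamath-Steinke paper for the rejection-sampling construction.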

  • In the late 2019 DAS release, the Rounder (the second part of each major TopDown optimization step) only used the "detailed query" (fully saturated contingency table) to generate integer-valued estimates close to the previously estimated Nonnegative Least Squares solution. At the time, the DAS team only had theory to support use of this single query in the Rounder; on discovering that use of only the detailed query in the Rounder generated significant increases in post-processing error and non-monotonic behavior (accuracy sometimes decreased as the privacy-loss budget was increased), the DAS team revisited this theory and proved that two trees of queries could be supported in the Rounder. The redistricting production Rounder therefore supports a wider range of target queries, specified in the configuration files. Note that one tree is implicitly taken up by the geographic structure of the DAS's TopDown Algorithm; if more than one additional tree is specified in the config file, intractable or infeasible Rounder optimization problems may result.
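The Rounder's core task — producing integers near a fractional NNLS solution while preserving an exact total — can be sketched with a simple largest-remainder rule. This is an illustrative stand-in, not the production Rounder, which solves a constrained integer optimization over trees of queries:

```python
import math

def controlled_round(values, total):
    """Round nonnegative fractional counts to integers that sum to `total`,
    moving each entry by less than one unit (largest-remainder rule)."""
    floors = [math.floor(v) for v in values]
    residual = total - sum(floors)
    # Give the leftover units to the cells with the largest fractional parts.
    order = sorted(range(len(values)),
                   key=lambda i: values[i] - floors[i], reverse=True)
    for i in order[:residual]:
        floors[i] += 1
    return floors

nnls_solution = [2.4, 1.3, 0.3]   # illustrative fractional NNLS estimates
rounded = controlled_round(nnls_solution, total=4)
```

The production problem is harder precisely because the rounded values must simultaneously respect many overlapping query totals (the "trees"), which is why adding too many trees makes the integer program intractable or infeasible.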

  • After the May 2020 demonstration data release, one of our external collaborators pointed out a significant artifact in the released privacy-protected microdata file: the number of record types occurring exactly 5 times was anomalously large. This artifact was caused by a tolerance parameter used in the DAS nonnegative least squares optimizations. Inequality constraints were imposed only up to this tolerance, and a constant tolerance of 5 was found to yield stable optimization solves in all geographic units. To address this artifact while preserving stable optimizations, the DAS team integrated an optional "optimal-tolerance" optimization that solves an auxiliary optimization problem between each major nonnegative least squares step to determine the smallest possible tolerance (up to a "fudge factor", specifiable in the config) for which the subsequent nonnegative least squares (NNLS) problems would remain numerically solvable, yielding a different, dynamically-chosen tolerance for each optimization problem. (Note: if the "fudge factor" is set to be too small, the optimal-tolerance auxiliary problems can generate exactly the kind of numerical difficulties warned about in the Gurobi numerical issues documentation, so this factor should be modified with care.)
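The production code poses the optimal-tolerance search as an auxiliary Gurobi optimization; a bisection stand-in conveys the idea of finding the smallest tolerance at which the subsequent solve remains stable, then inflating it by the fudge factor. Everything below (the feasibility predicate, the parameter names) is an illustrative sketch, not the production implementation:

```python
def smallest_feasible_tolerance(is_solvable, fudge=1e-2, lo=0.0, hi=5.0, iters=40):
    """Bisect for the smallest tolerance at which the NNLS solve stays
    numerically feasible, then inflate it by a multiplicative fudge factor."""
    for _ in range(iters):
        mid = (lo + hi) / 2
        if is_solvable(mid):
            hi = mid      # feasible: try a tighter tolerance
        else:
            lo = mid      # infeasible: loosen
    return hi * (1 + fudge)

# Toy stand-in: pretend the solve is stable whenever tolerance >= 0.37.
tol = smallest_feasible_tolerance(lambda t: t >= 0.37)
```

Because the tolerance is re-derived for each optimization problem rather than fixed at 5, the "exactly 5 occurrences" artifact disappears; and because the fudge factor keeps the chosen tolerance strictly away from the feasibility boundary, the solves stay numerically stable.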

  • The current code release includes support for "multipass" and "interleaved" optimization: multipass refers to allowing sequential optimization of different queries in distinct NNLS or Rounder passes. Initially, all NNLS passes were required to occur before all Rounder passes; the "interleaved" sequentialOptimizer parameter relaxes this constraint, so that NNLS and Rounder passes can also be placed in arbitrary sequence. (Note: the interleaved sequentialOptimizer class was used in the redistricting production run, but all NNLS passes were executed before all Rounder passes.)
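The distinction between multipass and interleaved ordering can be sketched as a pass schedule: each pass optimizes one set of queries, and under the interleaved optimizer NNLS and Rounder passes may appear in any order. The schedule structure and query names below are hypothetical illustrations, not the production configuration format:

```python
# Toy pass scheduler: each entry is (pass kind, queries optimized in that pass).
def run_passes(passes):
    log = []
    for kind, queries in passes:
        # In the real DAS each pass is a Gurobi solve; here we just record it.
        log.append(f"{kind}:{'+'.join(queries)}")
    return log

# Multipass AND interleaved: a Rounder pass sits between two NNLS passes,
# which the original (non-interleaved) sequencing would not allow.
schedule = [
    ("nnls", ["total", "voting_age"]),
    ("rounder", ["total"]),
    ("nnls", ["detailed"]),
]
log = run_passes(schedule)
```

As the note above says, the production redistricting run used the interleaved optimizer class but happened to schedule all NNLS passes before all Rounder passes.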

  • Generation of schemas differs between the redistricting production code release and the 2019 code release. Rather than explicitly defining each schema, the DAS code now defines individual attributes and then creates a schema from a list of desired attributes. The 2019 code base contained some examples of this approach to schema generation, but in the redistricting production release, all schemas are built from individual attributes.
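The attribute-first approach can be sketched as a small registry of attributes from which a schema (its histogram shape and cell enumeration) is assembled on demand. The registry contents and function name are hypothetical illustrations of the pattern, not the actual schemamaker.py API:

```python
from itertools import product

# Hypothetical attribute registry (illustrative names, not the real modules).
ATTRIBUTES = {
    "sex": ["male", "female"],
    "voting_age": ["under18", "18plus"],
    "hispanic": ["not_hisp", "hisp"],
}

def make_schema(attr_names):
    """Assemble a schema from a list of desired attributes: the histogram
    shape plus the ordered enumeration of its cells."""
    levels = [ATTRIBUTES[name] for name in attr_names]
    shape = tuple(len(lv) for lv in levels)
    cells = list(product(*levels))
    return shape, cells

shape, cells = make_schema(["sex", "voting_age"])
```

Defining attributes once and composing them keeps every schema consistent and makes adding a new product a matter of listing its attributes rather than hand-writing a new schema.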