Background

Phil Leclerc edited this page Sep 14, 2021 · 14 revisions

In 2020, the United States Census Bureau conducted the 2020 Decennial Census. Known formally as the Decennial Census of Population and Housing, the census aims to enumerate every person residing in the United States, covering all 50 states, the District of Columbia, and Puerto Rico. All persons alive on April 1, 2020 residing in these places, according to residency criteria finalized in 2018, must be counted.

Following completion of the census, the Census Bureau must submit state population totals to the United States President. The United States Constitution mandates this decennial enumeration be used to determine each state’s Congressional representation.

Public Law 94-171 directs the Census Bureau to provide data to the governors and legislative leadership in each of the 50 states for redistricting purposes. This product is the first of several large Decennial Census data product releases, and includes tabulations calculated for demographic and housing characteristics in detailed geographic areas (with Census Blocks forming the finest degree of geographic resolution).

As part of its collection activities, the Census Bureau by statute must ensure that the Decennial Census data products meet the legal requirements of Title 13, Sections 8(b) and 9 of the U.S. Code, and, specifically, must not "make any publication whereby the data furnished by any particular establishment or individual under this title can be identified." The 2020 Disclosure Avoidance System embodied in this code release formalizes and quantifies this goal: by using zero-Concentrated Differential Privacy primitives, the 2020 DAS is able to provide provable bounds on how much more an "attacker" could learn about a target individual than would have been possible even if most features of the individual's data had been replaced with placeholder values.1 The DAS also guarantees bounds on the amount that can be learned about a specific feature of an individual respondent, beyond what could already have been guessed about that feature even if the respondent had never reported a value for it.
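As a heavily simplified illustration of the kind of primitive involved: adding Gaussian noise with variance σ² to a count of sensitivity 1 satisfies ρ-zCDP with ρ = 1/(2σ²). The sketch below uses Python's continuous `random.gauss` purely for illustration; the production DAS instead samples a discrete Gaussian using exact arithmetic, and the function name and parameter values here are not drawn from this repository.

```python
import math
import random

def gaussian_mechanism(true_count: int, rho: float) -> float:
    """Add continuous Gaussian noise to a sensitivity-1 count.

    With sigma^2 = 1 / (2 * rho), this satisfies rho-zCDP.
    Illustrative only: the production DAS uses a discrete
    Gaussian sampled with exact (rational) arithmetic.
    """
    sigma = math.sqrt(1.0 / (2.0 * rho))
    return true_count + random.gauss(0.0, sigma)

# Hypothetical use: protect a block-level count with rho = 0.1
# (sigma^2 = 5), a purely illustrative budget value.
noisy = gaussian_mechanism(true_count=1234, rho=0.1)
```

Smaller ρ means more noise and stronger protection; larger ρ means the opposite. This trade-off is exactly the accuracy/privacy dial discussed throughout this documentation.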

In previous decennial censuses, a variety of techniques were used to protect the confidentiality of responses, including the use of synthetic data and, most notably, household swapping (as documented in Section 7-6 of the 2010 PL94-171 technical documentation, and on page 10 of "Disclosure Avoidance Techniques Used for the 1970 through 2010 Decennial Censuses of Population and Housing"). For the 2020 Census, the Census Bureau applied the latest science in developing the 2020 Census Disclosure Avoidance System (DAS). Following the instructions of the Data Stewardship Executive Policy Committee (DSEP), the Census Bureau implemented a sequence of algorithms whose privacy-conferring primitives were differentially private, including in particular the DAS contained in the current repository. Differential privacy allows for a quantitative, general accounting of the privacy loss to respondents associated with any set of tabulation releases, and so allows the Census Bureau to bound how much privacy loss is consistent with its Title 13 responsibilities.

This public release of the 2020 Census P.L. 94-171 Redistricting Data Summary File DAS source code is intended to help sophisticated external users by providing a transparent view into the source code used in the 2020 PL94-171 production run of the DAS. This transparency is a feature specifically enabled by the use of differentially private algorithms: the reader will note that the 2010 documentation on swapping linked above is generally terse, and does not include algorithmic pseudo-code or a public code release. This lack of detail was deliberate: swapping provides no mathematical proof of privacy guarantees against general attackers that holds even if the attacker knows the algorithm in use. Differential privacy (and zero-Concentrated Differential Privacy) does have this property, allowing the transparent publication of comprehensive algorithmic detail, including this code repository, without endangering the worst-case privacy loss bounds.

Overview

Article I, Section 2 of the U.S. Constitution directs the U.S. Government to conduct an “actual enumeration” of the population every ten years. In 2020, the Census Bureau conducted the 24th Decennial Census of Population and Housing with reference date April 1, 2020, and has begun producing public-use data products that conform to the requirements of Title 13 of the U.S. Code. The goal of the census is to count everyone once, only once, and in the right place. All residents must be counted. After the data have been collected by the Census Bureau, but before the data are tabulated to produce data products for dissemination, the confidential data must undergo statistical disclosure limitation so that the impact of statistical data releases on the confidentiality of individual census responses can be quantified and controlled.

In the 2010 Census, the trade-off between accuracy and privacy protection was viewed as a technical matter to be determined by disclosure avoidance statisticians. Disclosure avoidance was performed primarily using household-level record swapping and was supported by maintaining the secrecy of key disclosure avoidance parameters.

However, there is a growing recognition in the scientific community that record-level household swapping fails to provide provable confidentiality guarantees when the side information and computational power available to attackers are unknown. In the absence of such guarantees, rapid growth outside the Census Bureau in access to high-powered computing, sophisticated algorithms, and external databases has caused growing concern that it may be possible to reconstruct a significant portion of the confidential data underlying the census data releases using a so-called database reconstruction attack, as originally outlined by Dinur and Nissim (2003). Once reconstructed, microdata can easily be used to attempt to re-identify individual respondents' records, and to attempt to infer features about individuals that may only be learnable because they participated in the Census2. Indeed, in 2019 the Census Bureau announced that it had performed a database reconstruction attack using just the publicly available 2010 decennial census publications and had been able to reconstruct microdata that were overwhelmingly consistent with the 2010 confidential microdata.
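A toy example (entirely hypothetical data, not from any census product) shows why exact tabulations invite reconstruction: with enough published marginals over a small population, only one microdata set can be consistent with all of them, and an attacker can recover it by simple enumeration.

```python
from itertools import product

# Hypothetical "block" of 3 people, each with an age group in
# {"child", "adult"} and a sex in {"F", "M"}.  The "publication"
# releases exact marginal counts over these attributes.
published = {
    "total": 3,
    "adults": 2,
    "females": 2,
    "adult_females": 1,
}

domain = list(product(["child", "adult"], ["F", "M"]))

def consistent(db):
    """Check a candidate database against every published table."""
    return (len(db) == published["total"]
            and sum(a == "adult" for a, s in db) == published["adults"]
            and sum(s == "F" for a, s in db) == published["females"]
            and sum(a == "adult" and s == "F" for a, s in db)
                == published["adult_females"])

# Enumerate all multisets of 3 records; keep those matching the tables.
candidates = {tuple(sorted(db)) for db in product(domain, repeat=3)
              if consistent(db)}
```

Here `candidates` contains exactly one multiset: the published counts force one adult female, one adult male, and one female child, so the confidential microdata are fully determined. Real attacks scale this idea up with linear or integer programming rather than brute force.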

In order to fulfill its requirements to produce an accurate count and to protect personally identifiable information, the Disclosure Avoidance System (DAS) for the 2020 Census implements mathematically rigorous disclosure avoidance controls based primarily on the set of mathematical techniques known as differential privacy (and, specifically, zero-Concentrated Differential Privacy). In its production use, the DAS reads the Census Edited File (CEF) and applies formally private algorithms to produce a Microdata Detail File (MDF).3

The DAS can be thought of as a filter that allows some aspects of data to pass through the filter with high fidelity, while controlling the leakage of confidential data to no more than the level permitted by the differential privacy parameters. By policy, all data that are publicly released by the U.S. Census Bureau based on the 2020 Census must go through some form of mathematically defensible formal privacy mechanism.
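The budget arithmetic behind this framing is simple under zCDP: the ρ parameters of successive releases add under sequential composition, so a global privacy-loss budget can be split across releases and accounted for exactly. A minimal accounting sketch follows; the numeric allocations are hypothetical illustrations, not production values:

```python
def compose_zcdp(rhos):
    """Sequential composition for zCDP: the total privacy-loss
    parameter is the sum of the per-release rho values."""
    return sum(rhos)

# Hypothetical split of a global budget across geographic levels
# (illustrative numbers only; actual allocations were DSEP policy).
allocation = {"US": 0.05, "State": 0.10, "County": 0.10,
              "Tract": 0.10, "Block Group": 0.10, "Block": 0.05}
total_rho = compose_zcdp(allocation.values())
# total_rho is the overall zCDP privacy-loss parameter: 0.50 here.
```

This additivity is what makes a "budget" metaphor literal: every noisy query spends part of the total, and the sum over all releases can be audited against the policy-set cap.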

DAS Design Decisions

Many of the principal features, requirements, and parameters of the Census Bureau’s implementation of the DAS for production of the 2020 Census P.L. 94-171 Redistricting Data Summary File were policy decisions made by the Census Bureau’s Data Stewardship Executive Policy Committee (DSEP). These policy decisions impacting DAS design include: the list of invariants (those data elements to which no noise is added); the overall privacy-loss budget; and the allocation of the privacy-loss budget across geographic levels and across queries. While DSEP is responsible for significant decisions, actions, and accomplishments of the 2020 Census Program, the Associate Director for Decennial Programs publicly documents these policies in the 2020 Census Decision Memorandum Series for the purpose of informing stakeholders, coordinating interdivisional efforts, and documenting important historical changes. This memorandum series is available at the 2020 Census Memorandum Series public website.

1: Although provable guarantees are also possible when comparing to a 'counterfactual' world in which the individual respondent's entire record was replaced by placeholder values, these guarantees are made more complicated by the presence of 'invariants' -- statistics that the DAS may not perturb.

2: This distinction is important: even if a respondent does not participate in the Decennial Census, it may be easy for an attacker to use published Decennial Census tabulations to infer some of their attributes. For example, an attacker may know that Person A resides in the state of Montana. In the 2010 Decennial Census, nearly 90% of Montana residents were tabulated as 'White Alone': hence, even if Person A had not participated in the 2010 Census, just by observing the published tabulations and knowing that Person A lives in Montana, an attacker could form a high-probability guess about the respondent's census-reported race. This kind of inference is not something the DAS is intended to defend against; the DAS is intended primarily to control how much more an attacker could have learned than if a respondent had not participated in the Decennial Census, or had some of their data replaced with placeholder values. Or, put somewhat differently, the DAS takes as its definition of 'private information': information which is unique to a respondent, in the sense of only being learnable because of their data's presence. Information that can be learned without a respondent participating is instead classified as scientific or sociological inference (which may be desirable or undesirable, but is outside the scope of the DAS to control).

3: Generation of privacy-protected microdata is not necessary, and is arguably unusual, when working within a differentially private framework, but was regarded as an important design requirement by DSEP for the PL94-171 data product.