Product: Population Factfinder
Welcome to the db-factfinder wiki!
In the following sections, you will find documentation about the methodology used to create the data driving the Population Fact Finder application. The package we've created for this purpose serves other purposes, too, such as calculating ACS data for various community district geographies for the community profiles dataset.
- Calculating Estimates, MOEs, and CVs, which relies on:
  - Parsing metadata
  - Downloading raw data from the API
  - Transforming census variables into PFF variables
    - Which in some cases requires computing medians from binned data
  - Aggregating smaller geographies into larger geographies
    - Occasionally with the additional step of converting 2010 geographies into 2020 geographies
- Calculating Percent Estimates and Percent MOEs
- Cleaning, formatting, and rounding the results
In order to facilitate time-series analysis of demographic trends, data originally released in 2010 census geographies need to be converted to 2020 census geographies. This involves allocating count values in 2010 census tracts to 2020 tracts, accounting for tract splits or merges. In cases of tract splits, counts from the 2010 tract-level data are distributed to the multiple 2020 tracts in a way that is proportional to the 2010 population distribution within the tract. DCP uses a one-to-one relationship between 2010 blocks and 2020 blocks in order to estimate the proportion of 2010 population contained within each new tract.
For example:
- 2010 tract 1 (containing 8 blocks) split into 2020 tracts 1.1 (containing blocks 1, 2, 3 and 4) and 1.2 (containing blocks 5, 6, 7, and 8)
- In 2010, tract 1 had a total population of 4000, made up of:
- Block 1: 1000
- Block 2: 500
- Block 3: 1000
- Block 4: 500
- Block 5: 200
- Block 6: 200
- Block 7: 500
- Block 8: 100
- The 2010 population contained in the blocks now associated with each of the 2020 tracts is:
- Tract 1.1 (blocks 1-4): 1000 + 500 + 1000 + 500 = 3000
- Tract 1.2 (blocks 5-8): 200 + 200 + 500 + 100 = 1000
- The proportion of total 2010 population in the blocks now associated with each of the 2020 tracts is:
- Tract 1.1: 75%
- Tract 1.2: 25%
These proportions are contained in `ratio.csv`, in the following format (using the above example for demonstration):
2020 Tract | 2010 Tract | ratio |
---|---|---|
1.1 | 1 | .75 |
1.2 | 1 | .25 |
For cases of merges, ratios are 1 if the entireties of multiple 2010 tracts are combined into a new, larger 2020 tract.
These ratios are used to proportionately allocate count values from 2010 to 2020 tracts. Tract-to-tract conversion is the first step before higher-level spatial aggregation. For more information about aggregating census tracts into larger geographies, see the vertical aggregation documentation page.
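As a minimal sketch of how the ratios in the example could be derived from block-level populations (this is not the package's code; `block_pop_2010`, `block_to_tract_2020`, and `allocation_ratios` are hypothetical names, and the numbers are the example values above):

```python
from collections import defaultdict

# 2010 block populations within 2010 tract 1 (example values from above)
block_pop_2010 = {1: 1000, 2: 500, 3: 1000, 4: 500, 5: 200, 6: 200, 7: 500, 8: 100}

# Hypothetical lookup: the 2020 tract each block now belongs to
block_to_tract_2020 = {1: "1.1", 2: "1.1", 3: "1.1", 4: "1.1",
                       5: "1.2", 6: "1.2", 7: "1.2", 8: "1.2"}


def allocation_ratios(block_pop, block_to_tract):
    """Share of the 2010 tract's population that falls in each 2020 tract."""
    totals = defaultdict(int)
    for block, pop in block_pop.items():
        totals[block_to_tract[block]] += pop
    tract_total = sum(block_pop.values())
    return {tract: pop / tract_total for tract, pop in totals.items()}


ratios = allocation_ratios(block_pop_2010, block_to_tract_2020)
print(ratios)
```

The resulting shares correspond to the `ratio` column of `ratio.csv` for the example split.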
Conversion of 2010 tract-level estimates and MOEs occurs in the `AggregateGeography` class for year `2010_to_2020`. This class contains a method `ct2010_to_ct2020`, which takes a DataFrame of 2010 tract-level data and returns a DataFrame of 2020 tract-level data (as estimated using the proportional allocation of total population).
Consider the example tract split described above, along with the following example 2010 tract-level estimates:
2010 Tract | Workers Under 16 Estimate | Workers Under 16 MOE |
---|---|---|
1 | 1000 | 100 |
In order to estimate the number of workers under 16 in 2020 tracts 1.1 and 1.2, we assume that the spatial distribution of workers under 16 is well-approximated by the spatial distribution of total people within the tract.
First, we merge the 2010 data with the ratios described in the previous section, yielding:
2020 Tract | 2010 Tract | Ratio | Estimate (2010) | MOE (2010) |
---|---|---|---|---|
1.1 | 1 | .75 | 1000 | 100 |
1.2 | 1 | .25 | 1000 | 100 |
2020 tract-level estimates are simply the 2010 estimate multiplied by the ratio:
2020 Tract | 2010 Tract | Ratio | Estimate (2010) | MOE (2010) | Estimate (2020) |
---|---|---|---|---|---|
1.1 | 1 | .75 | 1000 | 100 | 1000 * .75 = 750 |
1.2 | 1 | .25 | 1000 | 100 | 1000 * .25 = 250 |
Calculating 2020 MOEs depends on an empirically derived formula, `convert_moe`:
- If the ratio is 1 (not a tract split), the 2020 MOE is the same as the 2010 MOE
- If the 2020 estimate is 0 (prior to any rounding), the 2020 MOE is NULL
- If `((ratio * 100)^(0.56901)) * 7.96309 >= 100`, the 2020 MOE is the same as the 2010 MOE
- Otherwise, the 2020 MOE is equal to: `((((ratio * 100)^(0.56901)) * 7.96309) / 100) * (2010 MOE)`
This formula comes from an empirical model capturing the relationships between published block group MOEs as a percent of published tract MOEs and block group estimates as a percent of tract estimates, with R-squared of 0.81:
(block group MOEs as a percent of tract MOEs) = 7.96309 * (block group estimates as a percent of tract estimates)^0.56901
This formula is based on 10 selected variables, for 314 random NYC block groups.
- Males 85 years and older
- Non-Hispanic of 2 or more races
- Single female household with children
- 65 years and older living alone
- Household income $200,000 or more
- Worked from home
- Employed civilians 16 years and older
- Occupied housing with a mortgage
- Vacant housing units
- GRAPI 30% to 34.9%
The nested relationship of block groups within tracts mimics the relationship of 2020 tracts within 2010 tracts in cases of a tract split.
Using the example above, MOE is calculated as follows:
2020 Tract | 2010 Tract | Ratio | Estimate (2010) | MOE (2010) | Estimate (2020) | MOE (2020) |
---|---|---|---|---|---|---|
1.1 | 1 | .75 | 1000 | 100 | 750 | (7.96309 * (75)^0.56901 / 100) * 100 = 92.8988 |
1.2 | 1 | .25 | 1000 | 100 | 250 | (7.96309 * (25)^0.56901 / 100) * 100 = 49.7191 |
MOEs and estimates are rounded in the final cleaning and rounding step.
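The MOE rules for tract splits can be sketched as a small function. This is an illustration under stated assumptions, not the package's `convert_moe`: the function below is a stand-in written from the rules listed above, `None` stands in for NULL, and the order in which the rules are checked is assumed to follow the order in which they are listed.

```python
def convert_moe_sketch(ratio, estimate_2020, moe_2010):
    """Allocate a 2010 tract MOE to a 2020 tract using the empirical formula."""
    if ratio == 1:
        # Not a tract split: the MOE carries over unchanged
        return moe_2010
    if estimate_2020 == 0:
        # Zero estimate (before rounding): MOE is NULL
        return None
    # Modeled 2020 MOE as a percent of the 2010 MOE
    pct_of_moe = 7.96309 * (ratio * 100) ** 0.56901
    if pct_of_moe >= 100:
        return moe_2010
    return pct_of_moe / 100 * moe_2010


# Example split from the table above: 2010 estimate 1000, 2010 MOE 100
for tract, ratio in [("1.1", 0.75), ("1.2", 0.25)]:
    estimate_2020 = 1000 * ratio
    print(tract, estimate_2020, convert_moe_sketch(ratio, estimate_2020, 100))
```

Note that the allocated MOEs shrink more slowly than the estimates, so smaller pieces of a split tract end up with proportionally larger MOEs.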
Cases of tract merges are much simpler, and generally follow the same logic as other small-to-large spatial aggregation.
Consider the following example, representing a complete merge of 2.1 and 2.2 into 2:
2020 Tract | 2010 Tract | ratio |
---|---|---|
2 | 2.1 | 1 |
2 | 2.2 | 1 |
In this case, joining with an example 2010 tract-level dataset would produce:
2020 Tract | 2010 Tract | ratio | Estimate (2010) | MOE (2010) | Estimate (2020) | MOE (2020) |
---|---|---|---|---|---|---|
2 | 2.1 | 1 | 100 | 10 | 100 * 1 = 100 | 10 |
2 | 2.2 | 1 | 200 | 20 | 200 * 1 = 200 | 20 |
At this point, rows of the joined table are aggregated to get 2020 tract-level data. Estimates are summed, and MOEs are aggregated using the square root of the sum of squares, in `agg_moe`.
2020 Tract | Estimate (2020) | MOE (2020) |
---|---|---|
2 | 100 + 200 = 300 | SQRT(10^2 + 20^2) = 22.3607 |
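A minimal sketch of this merge aggregation, using the example rows above. The helper name `root_sum_of_squares` is hypothetical and only mirrors what the text says `agg_moe` does:

```python
import math
from collections import defaultdict


def root_sum_of_squares(moes):
    """Combine MOEs as the square root of the sum of their squares."""
    return math.sqrt(sum(m ** 2 for m in moes))


# (2020 tract, 2020 estimate, 2020 MOE) rows from the joined table above
rows = [("2", 100, 10), ("2", 200, 20)]

grouped = defaultdict(list)
for tract_2020, estimate, moe in rows:
    grouped[tract_2020].append((estimate, moe))

merged = {
    tract: (sum(e for e, _ in pairs), root_sum_of_squares(m for _, m in pairs))
    for tract, pairs in grouped.items()
}
print(merged)
```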
If the requested geography type is not a tract, but is instead another 2020 geography type (NTA, CDTA, etc.), other methods in the `AggregateGeography` class first call `ct2010_to_ct2020` to estimate 2020 tract-level data. From there, aggregation proceeds using the same techniques as other data years. For example, the method `tract_to_cdta`:
- Converts 2010 tracts to 2020 tracts using the workflow described above
- Joins the resulting 2020 tract-level data with the `2010_to_2020 lookup_geo` (for more information about spatial lookups, see here)
- Groups by the 2020 NTA field, `nta_geoid`, using aggregation techniques defined in `create_output`
- Renames `nta_geoid` as `geoid` and sets `geotype` to "NTA" to standardize format
The following image shows the entire workflow for converting 2010 to 2020 geographies. The example shown is transforming 2019 ACS data into 2020 NTA-level data for the PFF variable capturing the population of South Asian origin, or `asnsouth`.
Input data for all population factfinder calculations come from the US Census Bureau's API, as accessed using the `census` Python wrapper package.
The `Download` class accesses this input data by formatting a geoquery and calling the appropriate US Census Bureau API endpoint. When initialized, this class contains the following properties, all necessary for selecting endpoints and creating queries:
- The census API access key, contained in a .env file
- The year of data to access (in the case of 5-year ACS data, this is the final year; for example, 5-year data from the 2015-2019 rolling sample would correspond with `year = 2019`)
- The source type (e.g. acs, decennial)
- Necessary state and county FIPS codes, set by default to the five NYC counties within NY state
The `geoqueries` method uses state and county FIPS codes to generate an appropriate query for the requested spatial unit. For example, calling `geoqueries('tract')` will return the string query expected by the US Census Bureau API (via the `census` Python wrapper) to download all tracts within the five NYC counties.
The `download_variable` method then calls either `download_e_m` or `download_e_m_p_z`. These methods set the `census` client based on the specified source, identify the census variable codes associated with the pff_variable name using Metadata, identify the appropriate geoquery for the requested geotype, then call `client.get` to store data in a pandas DataFrame. Upon download, types are enforced (set to float64), outliers are replaced with NULLs, and MOEs for zero estimates are set to zero.
In order to improve performance, the `Download` class writes the results of each call to a cache (via `utils.write_to_cache`). Prior to re-downloading, the `Download` class checks the cache for previously stored results.
The `Calculate` class method `calculate_c_e_m_p_z` is the entry function into all logic for calculating non-rounded c, e, m, p, and z for a given PFF variable.
This method first creates an instance of the Variable class, so that all of the metadata associated with the PFF variable is easily accessible. For more information about metadata, see the metadata documentation page.
The most straightforward workflow for calculating estimates and MOEs of a given PFF variable is defined in the `calculate_e_m` method. This is the workflow for non-median, non-special variables.
1. Determine from & to geography types: In cases where the requested geography type is not a standard Census geography, block group- or tract-level data get aggregated to produce the requested estimates and MOEs. Logic and lookups necessary for geographic aggregation are in year-specific Python files in the geography directory. Each of these files defines an `AggregatedGeography` class, which contains an `options` property of the form:

    {
        "decennial": {
            "tract": {"NTA": self.tract_to_nta, "cd": self.tract_to_cd},
            "block": {
                "cd_fp_500": self.block_to_cd_fp500,
                "cd_fp_100": self.block_to_cd_fp100,
                "cd_park_access": self.block_to_cd_park_access,
            },
        },
        "acs": {
            "tract": {"NTA": self.tract_to_nta, "cd": self.tract_to_cd},
            "block group": {
                "cd_fp_500": self.block_group_to_cd_fp500,
                "cd_fp_100": self.block_group_to_cd_fp100,
                "cd_park_access": self.block_group_to_cd_park_access,
            },
        },
    }

   These lookups determine the necessary geography to download from the Census API in order to produce output for the requested geotype. For example, calculating ACS data at the NTA level (the "to geography") requires raw data at the tract level (the "from geography"), while calculating ACS data for the irregular park access region within each community district (`cd_park_access`) requires raw data at the block group level.
2. Download input data: Once the necessary raw data geography format is identified, all necessary census variables are downloaded using the `Download` class. For more information on downloading data from the Census API using this class, see the "Downloading data from the API" documentation page.
3. Aggregate horizontally: If a pff_variable is a sum of multiple, mutually exclusive census variables, the data downloaded in step 2 gets aggregated "horizontally." For example, if `PFF Variable = Input 1 + Input 2`, horizontal aggregation first combines the two input census variables to calculate a PFF variable estimate and MOE for each row of the input data. For more information on this form of aggregation, see the "Horizontal aggregation" documentation page.
4. Aggregate vertically: In cases where the requested geography is not a Census geography, the results of step 3 undergo "vertical" aggregation. For example, rows containing tract-level estimates and MOEs for a given PFF variable get combined to produce NTA-level estimates and MOEs. For more information on this form of aggregation, see the "Vertical aggregation" documentation page.
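To make step 1 concrete, here is a minimal sketch of how a lookup shaped like `options` could be searched to find the "from geography" for a requested "to geography". It is an assumption-laden illustration: method references are replaced by plain strings, `find_from_geotype` is a hypothetical helper, and the fallback for standard Census geographies is assumed rather than taken from the package.

```python
# Simplified stand-in for the `options` property (method references replaced by names)
options = {
    "decennial": {
        "tract": {"NTA": "tract_to_nta", "cd": "tract_to_cd"},
        "block": {
            "cd_fp_500": "block_to_cd_fp500",
            "cd_fp_100": "block_to_cd_fp100",
            "cd_park_access": "block_to_cd_park_access",
        },
    },
    "acs": {
        "tract": {"NTA": "tract_to_nta", "cd": "tract_to_cd"},
        "block group": {
            "cd_fp_500": "block_group_to_cd_fp500",
            "cd_fp_100": "block_group_to_cd_fp100",
            "cd_park_access": "block_group_to_cd_park_access",
        },
    },
}


def find_from_geotype(options, source, to_geotype):
    """Return (from geography, aggregation method name) for a requested geotype."""
    for from_geotype, targets in options[source].items():
        if to_geotype in targets:
            return from_geotype, targets[to_geotype]
    # Assumed fallback: a standard Census geography is downloaded as-is
    return to_geotype, None


print(find_from_geotype(options, "acs", "NTA"))
print(find_from_geotype(options, "acs", "cd_park_access"))
```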
Several variables require slight modifications to the workflow above.
- Medians: For medians, estimate and MOE calculations occur in the `calculate_e_m_median` method, rather than `calculate_e_m`. When downloading data (step 2 above), all necessary variables of counts within bins get downloaded. Horizontal and vertical aggregation (steps 3 & 4) are handled by the `Median` class. For more information about medians, see the "Median calculation" documentation page.
- Special variables: Several PFF variables are combinations of census variables, but are not simple sums. In these cases, horizontal aggregation relies on variable-specific formulas contained in special.py. For more information about special variable calculation, which occurs in `calculate_e_m_special`, see the "Special variables" section of the Horizontal aggregation documentation page.
- Profile-only variables: For some PFF variables, estimates and MOEs are available both in reference to a count and as a percent of the larger population. In these cases, the downloading step also includes the download of associated percent estimate and percent MOE data. Estimate and MOE calculations for profile-only variables occur in `calculate_e_m_p_z`. There is no vertical aggregation associated with these cases. For more information about profile-only calculations for non-aggregated geography types, see the exceptions section of the "Percent Estimate and Percent MOE" documentation page.
In order to improve performance, both raw data and calculated estimate and MOE data get cached locally. When downloading data from the Census API, the `Download` class first checks to see if the same variables for the same geographies exist in the local cache. If so, the raw data is read from the cache and is not re-downloaded. Otherwise, raw data is obtained via the API and saved to the cache for future calls, using the `write_to_cache` utility function.
Caching also occurs after raw data is transformed into PFF variable estimates and MOEs. The method `calculate_e_m`, described above, first checks to see if previously calculated data are saved in the cache. If so, estimate and MOE data are read from local files rather than being recalculated. First-time calculations (ones not already in the cache) are added to the cache.
In cases where a single PFF variable is a combination of multiple input PFF variables (such as the binned data used to calculate medians), inputs are calculated in parallel. The method `calculate_e_m_multiprocessing` is a wrapper function that calls either `calculate_e_m` or `calculate_e_m_special` over a list of input PFF variables.
After e and m are calculated in `calculate_c_e_m_p_z`, c is calculated using the function `get_c`.
If the estimate is 0 or the MOE is NULL, then c is NULL. Otherwise, `c = m / 1.645 / e * 100`.
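The coefficient of variation rule can be sketched as follows. This is a stand-in written from the rule above, not the package's `get_c`; `None` stands in for NULL, and 1.645 is the z-score for the 90% confidence level used by ACS MOEs.

```python
def get_c_sketch(e, m):
    """Coefficient of variation (percent) from an estimate e and its MOE m."""
    if e == 0 or m is None:
        return None
    # MOE / 1.645 recovers the standard error; dividing by e makes it relative
    return m / 1.645 / e * 100


print(get_c_sketch(1000, 100))
print(get_c_sketch(0, 100))
print(get_c_sketch(1000, None))
```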
After c, e, m, p, and z are calculated using `calculate_c_e_m_p_z`, values are rounded and cleaned based on the rules contained in the methods `cleaning` and `rounding`.
The utility function `rounding` rounds estimates and MOEs to the number of digits specified in the metadata. All c, p, and z values are rounded to a single decimal place, regardless of the number of digits specified in the metadata. Note that the logic used to clean data (described in the next section) refers to the rounded values rather than the raw values.
The following rules modify the rounded results of `calculate_c_e_m_p_z`, in the order listed. The purpose of these cleaning steps is to remove invalid values.
Invalid values
- If c, e, m, p, or z are negative, they are overwritten by NULL
- If p is greater than 100, it is overwritten by NULL
- If p is 100 or NULL (including values overwritten by the above rule), z is set to NULL
Zero estimates
- If e is 0, c, m, p, and z are set to NULL
Base variables
- If the variable is a base variable, the geography type is either borough or city, and c is NULL, c is set to 0
- If the variable is a base variable, the geography type is either borough or city, and m is NULL, m is set to 0
- If the variable is a base variable and the variable is not a median variable, p is set to 100
- If the variable is a base variable and the variable is not a median variable, z is set to NULL
Inputs to median variables (binned variables)
- If the variable is an input to a median (with the exception of median rooms inputs), m is set to NULL
- If the variable is an input to a median (with the exception of median rooms inputs), p is set to NULL
- If the variable is an input to a median (with the exception of median rooms inputs), z is set to NULL
- If the variable is an input to a median (with the exception of median rooms inputs), c is set to NULL
Special variables
- If the variable is a special variable, p is set to NULL
- If the variable is a special variable, z is set to NULL
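As a partial illustration, the first two groups of rules ("Invalid values" and "Zero estimates") can be sketched on a single record. This is not the package's `cleaning` method: `clean_invalid` is a hypothetical helper, `None` stands in for NULL, and the base, median-input, and special-variable rules are left out because they depend on metadata.

```python
def clean_invalid(record):
    """Apply the 'Invalid values' and 'Zero estimates' rules to one record.

    record: dict with keys c, e, m, p, z (already rounded); None means NULL.
    """
    r = dict(record)
    # Negative values are overwritten by NULL
    for key in ("c", "e", "m", "p", "z"):
        if r[key] is not None and r[key] < 0:
            r[key] = None
    # Percent estimates above 100 are overwritten by NULL
    if r["p"] is not None and r["p"] > 100:
        r["p"] = None
    # z is NULL whenever p is 100 or NULL
    if r["p"] is None or r["p"] == 100:
        r["z"] = None
    # Zero estimates: c, m, p, and z are all NULL
    if r["e"] == 0:
        for key in ("c", "m", "p", "z"):
            r[key] = None
    return r


print(clean_invalid({"c": 5.0, "e": 10, "m": 2, "p": 120.0, "z": 3.0}))
print(clean_invalid({"c": 0.0, "e": 0, "m": 4, "p": 0.0, "z": 1.0}))
```

Applying the rules in the listed order matters: a p value nulled by the "greater than 100" rule also nulls z in the following rule.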
The method `labs_geoid` translates census geoids into the format displayed in the PFF application. Because the list of geography types changed in 2020, this method relies on year-specific functions, `format_geoid`, contained in the `AggregatedGeography` classes.
These translations primarily involve replacing FIPS county codes with borough abbreviations or codes.
In order to calculate estimates and margins of error for pff variables, input census variables undergo up to two forms of aggregation. We describe these as "horizontal" and "vertical" aggregation, referring to summing tables over either columns or rows. For example, refer to the simplified example below. The two input variables represent data as downloaded from the Census API. If `PFF Variable = Input 1 + Input 2`, the estimate columns need to be combined to derive the PFF variable estimate, and MOE columns need to be combined to derive the PFF MOE. Each of these steps is described in the following sections.
geoid | Input Estimate 1 | Input MOE 1 | Input Estimate 2 | Input MOE 2 |
---|---|---|---|---|
tract 1 | 10 | 1 | 20 | 2 |
tract 2 | 30 | 3 | 40 | 4 |
Horizontal aggregation (excluding aggregations described in the "Exceptions" section below) happens in the `aggregate_horizontal` method of the `Calculate` class.
Several population factfinder variables are sums of more granular, mutually exclusive inputs. For example, counts representing a population under 18 might come from the aggregated counts of several childhood age bins (e.g. 0-4, 5-9, 10-14, 15-17). Or, a variable might reflect a sum over male- and female-specific counts. This form of aggregation is "horizontal." In our simplified case, the tract-level counts for a variable composed of two inputs would be:
geoid | PFF Variable Estimate |
---|---|
tract 1 | 10 + 20 = 30 |
tract 2 | 30 + 40 = 70 |
In general, PFF variable estimate (for a row) = Sum of Input Estimates (for that row)
The margin of error for an aggregation is a simple root sum-of-squares of the input margins of error. This is based on an assumption that input variables are independent.
geoid | PFF Variable MOE |
---|---|
tract 1 | sqrt(1^2 + 2^2) = sqrt(5) |
tract 2 | sqrt(3^2 + 4^2) = sqrt(25) |
In general, PFF variable MOE (for a row) = Square root of the sum of squared input MOEs (for that row)
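The two general rules can be sketched on the simplified table above. This is an illustration only; `sum_inputs` is a hypothetical helper, not the package's `aggregate_horizontal`.

```python
import math

# geoid -> list of (input estimate, input MOE), from the simplified table above
inputs = {
    "tract 1": [(10, 1), (20, 2)],
    "tract 2": [(30, 3), (40, 4)],
}


def sum_inputs(pairs):
    """Row-wise aggregation: sum the estimates, root-sum-of-squares the MOEs."""
    estimate = sum(e for e, _ in pairs)
    moe = math.sqrt(sum(m ** 2 for _, m in pairs))
    return estimate, moe


pff_variable = {geoid: sum_inputs(pairs) for geoid, pairs in inputs.items()}
print(pff_variable)
```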
Not all pff variables are simple sums of census variables. There are two types of non-sum combinations of census variables: medians and special variables.
PFF variables that are non-sum, non-median combinations of census variables are referred to as "special variables". These include:
- `hovacrtm`
- `percapinc`
- `mntrvtm`
- `mnhhinc`
- `avghhsooc`
- `avghhsroc`
- `avghhsz`
- `avgfmsz`
- `hovacrt`
- `rntvacrt`
- `wrkrnothm`
Estimate and MOE calculation for special variables occurs in the method `calculate_e_m_special` of the `Calculate` class. After downloading the estimates and MOEs of necessary input variables, this function then calls one of the pff variable-specific functions in `special.py` to combine inputs.
Several PFF variables are medians, rather than counts. These include:
- `mdage`
- `mdhhinc`
- `mdfaminc`
- `mdnfinc`
- `mdewrk`
- `mdemftwrk`
- `mdefftwrk`
- `mdrms`
- `mdvl`
- `mdgr`
Estimate and MOE calculation for medians occurs in the method `calculate_e_m_median` of the `Calculate` class. This method calculates medians by:
1. Extracting ranges, design factors, and booleans indicating whether top and bottom coding are appropriate from the metadata class (see metadata documentation for more information).
2. Downloading and calculating the estimate and MOE for all input variables. For medians, input variables are counts within a given bin. For example, a count of people ages 5 to 9 is an input for median age.
3. Pivoting the outputs of step 2 to create a table with each row representing a geoid, where each input pff variable corresponds with two columns (one estimate column and one MOE column).
4. Combining columns (a form of horizontal aggregation), using formulas contained in the `Median` class.
For more detail on median calculation, as implemented in the `Median` class, see the median calculation documentation page.
Methods for calculating the median estimate and MOE for a given geography are in the `Median` class.
Median estimates are calculated from count estimates of binned data. For example, median household income estimates are calculated from estimated count of households with incomes in various ranges (under 10k, 10-14k, 15-19k, etc.).
Below is an example of tract-level input variable estimates for median household income:
Input variable (range) | Estimate of count in range |
---|---|
mdhhiu10 (0 to 9999) | 20.0 |
mdhhi10t14 (10000 to 14999) | 0.0 |
mdhhi15t19 (15000 to 19999) | 12.0 |
mdhhi20t24 (20000 to 24999) | 9.0 |
mdhhi25t29 (25000 to 29999) | 0.0 |
mdhhi30t34 (30000 to 34999) | 0.0 |
mdhhi35t39 (35000 to 39999) | 0.0 |
mdhhi40t44 (40000 to 44999) | 0.0 |
mdhhi45t49 (45000 to 49999) | 0.0 |
mdhhi50t59 (50000 to 59999) | 0.0 |
mdhhi60t74 (60000 to 74999) | 0.0 |
mdhhi75t99 (75000 to 99999) | 0.0 |
mdhi100t124 (100000 to 124999) | 0.0 |
mdhi125t149 (125000 to 149999) | 0.0 |
mdhi150t199 (150000 to 199999) | 0.0 |
mdhhi200pl (200000 to 9999999) | 0.0 |
This, in turn, corresponds with a cumulative count distribution of:
Input variable (range) | Cumulative count |
---|---|
mdhhiu10 (0 to 9999) | 20.0 |
mdhhi10t14 (10000 to 14999) | 20.0 |
mdhhi15t19 (15000 to 19999) | 32.0 |
mdhhi20t24 (20000 to 24999) | 41.0 |
mdhhi25t29 (25000 to 29999) | 41.0 |
mdhhi30t34 (30000 to 34999) | 41.0 |
mdhhi35t39 (35000 to 39999) | 41.0 |
mdhhi40t44 (40000 to 44999) | 41.0 |
mdhhi45t49 (45000 to 49999) | 41.0 |
mdhhi50t59 (50000 to 59999) | 41.0 |
mdhhi60t74 (60000 to 74999) | 41.0 |
mdhhi75t99 (75000 to 99999) | 41.0 |
mdhi100t124 (100000 to 124999) | 41.0 |
mdhi125t149 (125000 to 149999) | 41.0 |
mdhi150t199 (150000 to 199999) | 41.0 |
mdhhi200pl (200000 to 9999999) | 41.0 |
And a cumulative percent distribution of:
Input variable (range) | Cumulative percent |
---|---|
mdhhiu10 (0 to 9999) | 48.78048780487805 |
mdhhi10t14 (10000 to 14999) | 48.78048780487805 |
mdhhi15t19 (15000 to 19999) | 78.04878048780488 |
mdhhi20t24 (20000 to 24999) | 100.0 |
mdhhi25t29 (25000 to 29999) | 100.0 |
mdhhi30t34 (30000 to 34999) | 100.0 |
mdhhi35t39 (35000 to 39999) | 100.0 |
mdhhi40t44 (40000 to 44999) | 100.0 |
mdhhi45t49 (45000 to 49999) | 100.0 |
mdhhi50t59 (50000 to 59999) | 100.0 |
mdhhi60t74 (60000 to 74999) | 100.0 |
mdhhi75t99 (75000 to 99999) | 100.0 |
mdhi100t124 (100000 to 124999) | 100.0 |
mdhi125t149 (125000 to 149999) | 100.0 |
mdhi150t199 (150000 to 199999) | 100.0 |
mdhhi200pl (200000 to 9999999) | 100.0 |
Calculating median estimates from binned data occurs in the `median` method of the `Median` class. This method calculates the median estimate by:
1. Calculating the sum of all counts (N) within all input bins
2. Using the cumulative distribution of counts within bins to identify which bin contains N/2
3. Using linear interpolation to estimate where within the bin identified in step 2 the median lies
   - First, the difference between N/2 and the total count in all lower bins represents how far within the median-containing bin the median lies.
   - The median is then the lower boundary of that bin, plus that difference times the width of the bin divided by the count in that bin.
Median = (Lower boundary of the median-containing bin)
+ (N/2 - (Total count in all bins below median-containing bin))
* (Difference between min and max value of the median-containing group) / (Count within the median-containing group)
For a video demonstration of median linear interpolation, see here
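The interpolation formula can be sketched with the example household income counts above. This is not the `Median` class itself: `interpolated_median` is a hypothetical helper, the treatment of N/2 falling exactly on a bin edge is an assumption, and bins above 29999 are omitted because their counts are all zero in the example.

```python
def interpolated_median(bins):
    """Median from binned counts via linear interpolation.

    bins: list of (lower, upper, count), ordered from lowest to highest range.
    """
    n = sum(count for _, _, count in bins)
    if n == 0:
        return None  # no counts in any bin
    half = n / 2
    below = 0  # total count in all bins below the current one
    for lower, upper, count in bins:
        if count > 0 and below + count >= half:
            # Assume counts are spread evenly across the bin
            return lower + (half - below) * (upper - lower) / count
        below += count
    return None


# (lower, upper, count) for the first five bins of the example
income_bins = [
    (0, 9999, 20.0),
    (10000, 14999, 0.0),
    (15000, 19999, 12.0),
    (20000, 24999, 9.0),
    (25000, 29999, 0.0),
]
median_income = interpolated_median(income_bins)
print(median_income)
```

Here N = 41, so N/2 = 20.5 falls in the 15000 to 19999 bin, 0.5 counts past the 20 households in lower bins.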
Some medians undergo top- or bottom-coding, as described in the top_coding and bottom_coding sections of the median metadata.
If `top_coding` is True, medians falling within the bottom bin are set to the max value of the bottom bin. For example, if a geography's median household income is between 0 and 9999 based on the calculations described in the previous section, the median gets set to 9999.
Similarly, if `bottom_coding` is True, medians falling within the top bin are set to the min value of the top bin. For example, if a geography's median household income is above 200000 based on the calculations described in the previous section, the median gets set to 200000.
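A minimal sketch of these two adjustments, following the flag names exactly as the text describes them. `apply_coding` is a hypothetical helper, and the list of ranges is shortened to the bottom bin, one middle bin, and the top bin of the household income example.

```python
def apply_coding(median, ranges, top_coding, bottom_coding):
    """ranges: list of (min, max) per bin, ordered from bottom bin to top bin."""
    bottom_max = ranges[0][1]
    top_min = ranges[-1][0]
    if top_coding and median <= bottom_max:
        return bottom_max  # median falls within the bottom bin
    if bottom_coding and median >= top_min:
        return top_min  # median falls within the top bin
    return median


income_ranges = [(0, 9999), (10000, 14999), (200000, 9999999)]
print(apply_coding(5000, income_ranges, True, True))
print(apply_coding(250000, income_ranges, True, True))
print(apply_coding(12000, income_ranges, True, True))
```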
Margins of error for medians are estimated by calculating a 1 standard error interval around a 50% proportion estimate. First, the `Median` class calculates the standard error of a 50% proportion (`se_50`) as:
(Design Factor) * ((93 / (7 * Base)) * 2500) ^ 0.5
where the Base is the sum of counts in all bins. Design factors are values that account for the fact that the ACS does not use a simple random sample. These values are a ratio of observed standard errors for a variable to the standard errors that would be obtained from a simple random sample of the same size, and come from the Census Bureau.
This standard error is added to and subtracted from 50, creating a 1SE interval around a 50% estimate (with boundaries `p_lower` and `p_upper`).
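A minimal sketch of this step, assuming the standard error formula groups as `93 / (7 * Base)` so that the standard error shrinks as the base grows. The function names are hypothetical, and the design factor and base in the usage lines are made-up illustration values.

```python
import math


def se_50(design_factor, base):
    """Standard error of a 50% proportion (assumed grouping: 93 / (7 * base))."""
    return design_factor * math.sqrt(93 / (7 * base) * 2500)


def interval_50(design_factor, base):
    """Return (p_lower, p_upper): a 1SE interval around 50%."""
    se = se_50(design_factor, base)
    return 50 - se, 50 + se


# Illustration values only: design factor 1.5, base of 400 households
p_lower, p_upper = interval_50(1.5, 400)
print(p_lower, p_upper)
```

Because the base enters under a square root, quadrupling the base halves the standard error.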
Then, `p_lower` and `p_upper` are compared to a cumulative percent distribution (see above), `cumm_dist`, to determine which bins contain the boundaries for a 1SE interval around a 50% proportion. These bins are saved as `lower_bin` and `upper_bin`.
For both `lower_bin` and `upper_bin`, the next step is to get the following values using the cumulative percent distribution of all input bins:
- A1: The min value of the bin
- A2: The min value of the next highest bin
- C1: The cumulative percentage of counts strictly less than A1 (total counts in bins up to the one containing the boundary)
- C2: The cumulative percentage of counts strictly less than A2 (total counts in bins up to and including the one containing the boundary)
Calculation of A1, A2, C1, C2 for a given p occurs in the method `base_case`.
A1, A2, C1, and C2 get calculated relative to both `lower_bin` and `upper_bin` by calling `base_case` where `_bin = lower_bin` and then `_bin = upper_bin`. These calls happen in the `lower_bound` and `upper_bound` methods, respectively.
There are several exceptions in which A1, A2, C1, and C2 do not follow the base case. To account for exceptions, the methods `lower_bound` and `upper_bound` subsequently modify the base case results according to the following:
- `lower_bin` is in the bottom bin: C1 of the `lower_bin` is 0, and C2 of the `lower_bin` is the percent of counts in the lowest bin.
- `lower_bin` is the first bin with a count more than zero: A1 of the `lower_bin` is 0, and A2 of `lower_bin` is the lower boundary of the second bin.
- `upper_bin` is the top bin: A1 of `upper_bin` and A2 of `upper_bin` are both the lower boundary of the top bin.
- `upper_bin` and `lower_bin` are both in the first bin with a count more than zero: A1 of `upper_bin` is 0, and A2 of `upper_bin` is the lower boundary of the second bin.
Once A1, A2, C1, and C2 are set, the method `get_bound` converts these values into a boundary for the confidence interval around the median.
CI boundary = (p - C1) * (A2 - A1) / (C2 - C1) + A1
This equation is similar to the linear interpolation used in estimate calculation, but uses percent cumulative distributions rather than count cumulative distributions. In estimate calculation, we determined where within a given bin an estimate lies, assuming that all frequencies within that bin are uniformly distributed between the min and max values of the bin. If the median was in the bin 1000 to 1499, which contained 45 counts, we assumed that these 45 counts were evenly distributed between 1000 and 1499.
Estimating where within a bin the boundary for a median confidence interval lies is similar. We first identify which bin contains the percent that is 1SE away from 50%. From here, we assume that the cumulative percentage of counts contained within that bin is evenly distributed between its two extremes, i.e. if the bin 1000 to 1499 accounts for 30% to 40% of the cumulative counts, we assume that those 10% of total counts are evenly distributed between 1000 and 1499.
The various components of the CI boundary calculation are:
- (p - C1): The difference between the 1SE boundary for the 50% proportion and the percent of counts that are in all bins below the one containing this boundary
- (A2 - A1): The width of the bin containing the 1SE boundary for the 50% proportion CI
- (C2 - C1): The percent of counts that are in the bin containing boundary for 50% proportion CI
- (A1): The lowest value of the bin containing the 1SE boundary for the 50% proportion
The method `get_bound` is used to calculate both the upper and lower boundaries of the confidence interval. When calculating the lower boundary of the median confidence interval, p refers to `p_lower`, and A1, A2, C1, C2 are all in reference to `p_lower`. Similarly, when calculating the upper boundary of the median confidence interval, p refers to `p_upper`.
The median MOE is calculated from the median confidence interval determined above. This occurs in the `median_moe` class property.
MOE of the median = (Width of CI around the median) * 1.645 / 2
In the following exceptions, the median MOE is set to NULL:
- The median is in the top bin
- There are no non-zero counts in any bin
- The SE of 50% is greater than 50, meaning p_lower is negative
- `lower_bin` is the top bin
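The boundary interpolation and the MOE formula can be sketched together. These are stand-ins written from the two formulas above, not the `Median` class's own `get_bound` and `median_moe`, and the exceptions that set the MOE to NULL are not handled. The inputs reuse the bin from the text (1000 to 1499, covering 30% to 40% of cumulative counts), with A2 taken as 1500, the min value of the next bin; the values of p and of the lower boundary are made up for illustration.

```python
def get_bound_sketch(p, a1, a2, c1, c2):
    """CI boundary = (p - C1) * (A2 - A1) / (C2 - C1) + A1."""
    return (p - c1) * (a2 - a1) / (c2 - c1) + a1


def median_moe_sketch(lower_bound, upper_bound):
    """MOE of the median = (width of CI around the median) * 1.645 / 2."""
    return (upper_bound - lower_bound) * 1.645 / 2


# p = 35 sits halfway through a bin spanning cumulative percents 30 to 40
upper = get_bound_sketch(35, 1000, 1500, 30, 40)
lower = 1050  # made-up lower boundary, for illustration only
print(upper, median_moe_sketch(lower, upper))
```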
The pff-factfinder package relies on a series of json metadata files. The primary function of these files is to relate a given fact finder variable to input census variables. Because ACS and census tables can change slightly from year-to-year as the Census Bureau adds, drops, or modifies included variables, the metadata files are specific to a release year. The metadata also contains any other pff-variable level information necessary for calculations.
Metadata for a given factfinder variable are structured as follows:
{
    "pff_variable": "lgoenlep1",
    "base_variable": "lgbase",
    "census_variable": [
        "C16001_005",
        "C16001_008",
        "C16001_011",
        "C16001_014",
        "C16001_017",
        "C16001_020",
        "C16001_023",
        "C16001_026",
        "C16001_029",
        "C16001_032",
        "C16001_035",
        "C16001_038"
    ],
    "domain": "community_profiles",
    "rounding": 0,
    "category": "Language Spoken at Home"
}
pff_variable: The field name, as it appears in the final Population FactFinder data files (not a display name) and as it gets called by the main `Calculate` class
base_variable: The pff_variable name of the associated base variable. This is the denominator when calculating percent estimates and percent MOEs
census_variable: A list containing all input census variables for a given factfinder variable. These are listed without an "E" or "M" suffix (these suffixes are included in the Census API variable documentation and column headings of downloaded ACS data).
domain: Used for filtering of final outputs. For variables used in Population FactFinder, these are "housing", "demographic", "economic", or "social". For variables used only in the Community Profiles datasets, the domain is "community_profiles".
rounding: Variable-specific number of digits for rounding final output estimates and MOEs
category:
Variables that are medians have additional metadata, as follows.
"mdage": {
"design_factor": 1.1,
"top_coding": true,
"bottom_coding": true,
"ranges": {
"mdpop0t4": [
0,
4.9999
],
"mdpop5t9": [
5,
9.9999
],
...
"mdpop85pl": [
85,
115
]
}
}
- `design_factor`: Design factor values that account for the fact that the ACS does not use a simple random sample. These values are a ratio of observed standard errors for a variable to the standard errors that would be obtained from a simple random sample of the same size, and come from the Census Bureau.
- `top_coding`: If true, medians falling within the top bin are set to the lower bound of the top bin. For example, if a geography's median age is between 85 and 115 based on the example above, the median gets set to 85.
- `bottom_coding`: If true, medians falling within the bottom bin are set to the upper bound of the bottom bin. For example, if a geography's median age is between 0 and 4.9999 based on the example above, the median gets set to 4.9999.
- `ranges`: The upper and lower values associated with each input pff_variable, where the inputs are counts of either people or households with a characteristic falling in a particular range
The `Metadata` class parses and reads the metadata JSON files as a whole. This class contains properties distinguishing the different types of population factfinder variables. These lists inform which methodology is appropriate when aggregating census variables (either horizontally or vertically) to calculate a pff_variable.
- `median_variables` is a list of all pff_variable names referring to medians
- `median_inputs` is a list of all pff_variables that are inputs to median calculations
- `median_ranges` is a dictionary containing the value ranges associated with each median input variable
- `special_variables` is a list of all pff_variables that require special calculations upon aggregation. These calculations are variable-specific functions contained in special.py.
- `profile_only_variables` is a list of pff_variables for which percent and percent MOE are available from the Census API, and so are downloaded rather than calculated for geotypes available in the census API (i.e. not NYC-specific geographies). These variables are ones pulled from the census profile tables. The census variable names of profile variables have the prefix "DP".
- `profile_only_exceptions` is a list of pff_variables pulled from profile tables (their census variable names have a "DP" prefix), but for which percent and percent MOE are calculated for all geotypes.
- `base_variables` is a list of all pff_variables that serve as the base (denominator) for the percent and percent MOE calculation of another pff_variable.
Other methods include:
- `get_design_factor`, which returns the design factor associated with a given median pff_variable.
- `get_special_base_variable`, which returns a list of the input pff_variables needed to calculate a given special pff_variable.
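To make a few of these lists concrete, the sketch below derives them from metadata shaped like the examples shown earlier. It is not the `Metadata` class itself: the helper functions are hypothetical, and it assumes that per-variable records and median records are available as a list and a dictionary respectively, which may not match how the package stores them.

```python
from typing import Dict, List

# One record and one median entry, abbreviated from the examples above
records = [
    {"pff_variable": "lgoenlep1", "base_variable": "lgbase", "domain": "community_profiles"},
]
medians = {
    "mdage": {
        "design_factor": 1.1,
        "ranges": {"mdpop0t4": [0, 4.9999], "mdpop5t9": [5, 9.9999], "mdpop85pl": [85, 115]},
    },
}


def base_variables(records: List[dict]) -> List[str]:
    """Distinct base variables, skipping records whose base_variable is "nan"."""
    found = {r.get("base_variable") for r in records}
    return sorted(v for v in found if v not in (None, "nan"))


def median_variables(medians: Dict[str, dict]) -> List[str]:
    """Names of pff_variables that are medians."""
    return list(medians)


def median_inputs(medians: Dict[str, dict]) -> List[str]:
    """Names of all binned-count variables feeding any median."""
    return [name for meta in medians.values() for name in meta["ranges"]]


def median_ranges(medians: Dict[str, dict]) -> Dict[str, list]:
    """Value range for each median input variable."""
    merged: Dict[str, list] = {}
    for meta in medians.values():
        merged.update(meta["ranges"])
    return merged


def get_design_factor(medians: Dict[str, dict], pff_variable: str) -> float:
    """Design factor for a given median pff_variable."""
    return medians[pff_variable]["design_factor"]
```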
The `Variable` class reads and parses the metadata files with reference to a particular pff_variable. The `census_variables` method returns the census variable names associated with estimate (E), margin of error (M), percent estimate (PE), and percent margin of error (PM) of a given pff_variable. The `create_census_variables` method splits a given list of partial census variable names (e.g. `["B01001_044", "B01001_045"]`) into a tuple of estimate and MOE census variable names (e.g. `["B01001_044E", "B01001_045E"], ["B01001_044M", "B01001_045M"]`). Other `Variable` properties include the domain of the variable (e.g. "economic"), the base_variable (the name of the pff_variable serving as a denominator when calculating percent and percent MOE, where applicable), the number of decimal places to retain in the final rounded estimate and margin of error, and the category assigned to a variable by Labs' front-end application.
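The suffixing step can be sketched as follows. The function name `make_e_m_names` is hypothetical and this is not the package's `create_census_variables` implementation; it only reproduces the input and output described above.

```python
from typing import List, Tuple


def make_e_m_names(census_variable: List[str]) -> Tuple[List[str], List[str]]:
    """Append "E" (estimate) and "M" (margin of error) suffixes to partial names."""
    e_variables = [f"{name}E" for name in census_variable]
    m_variables = [f"{name}M" for name in census_variable]
    return e_variables, m_variables


e_names, m_names = make_e_m_names(["B01001_044", "B01001_045"])
```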
Maintaining the metadata files is a largely manual process. Metadata undergoes several updates between yearly releases of data. These include:
- If a Census Bureau table has a change in schema resulting in shifted columns, the `census_variable` portion of the metadata will likely need updates to reflect the new column numbers
- If Census Bureau tables containing median inputs change to include either more or fewer binned counts, the `ranges` portion of the median metadata will need to be updated
- If the Census Bureau releases new design factors associated with median input tables, the `design_factor` portion of the median metadata will need to be updated
- If PFF variables are either discontinued or introduced (due to upstream Census Bureau changes or otherwise), these variables will need to be removed from or added to the metadata
The `Calculate` class method `calculate_c_e_m_p_z` first calculates the estimate and margin of error of a variable, then calculates percent and percent MOE as appropriate.
Calculating percent estimates and MOEs requires the estimate and MOE of the numerator (the PFF variable of interest), as well as the estimate and MOE of the denominator (the variable representing the population of which the PFF variable is a subset). For example, if we were to calculate the estimated percent of workers 16+ who commute by walking, the PFF variable is the estimated count of workers 16+ who commute by walking, while the denominator is the estimated count of workers 16+. In our metadata and code, the denominator variables are referred to as "base variables". Base variables for each PFF variable are stored in the metadata, and are accessible as the `base_variable` property of an instance of the `Variable` class.
In the most basic case, the percent and percent MOE are calculated by first calling `calculate_e_m` to calculate the estimate and MOE of the pff_variable. Then, `calculate_e_m` gets called again, this time calculating the estimate and MOE of the associated base variable.
As with other pff_variables, a base variable can be a special variable or a median. If the base variable is a special variable, `calculate_e_m_special` gets called in place of `calculate_e_m`. Similarly, if the base variable is a median, `calculate_e_m_median` gets called in place of `calculate_e_m`. For more information on exceptions when calculating estimates and MOEs, see here.
Once estimates and MOEs are calculated for both the pff_variable and its base_variable, the two resulting DataFrames get merged, with `e` and `m` of the base variable renamed as `e_agg` and `m_agg`.
Once merged, the percent estimate is calculated using `get_p`:
Percent Estimate = (Estimate of PFF variable) / (Estimate of Base Variable) * 100 if (Estimate of Base Variable) is not NULL
The percent MOE is calculated using `get_z`, which is based on the methodology outlined in the Census Bureau's guidance on calculating MOEs of derived estimates.
If the PFF percent estimate is 0 or 100, percent MOE is NULL
If the base variable estimate is 0, percent MOE is NULL
Otherwise,
Percent MOE = (Square root of
(Squared PFF MOE - Squared (PFF estimate * Base MOE / Base Estimate))
) / Base Estimate * 100
In cases where the value under the square root is negative, the ratio MOE formula is used instead:
Percent MOE = (Square root of
(Squared PFF MOE + Squared (PFF estimate * Base MOE / Base Estimate))
) / Base Estimate * 100
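The percent estimate and percent MOE rules can be sketched together as below. These are not the package's `get_p` and `get_z` functions: the function names are hypothetical, the percent estimate is assumed to be on a 0 to 100 scale (consistent with the `* 100` in the MOE formulas and the 0-or-100 check), and returning NULL for the percent estimate when the base estimate is 0 is an assumption made to avoid division by zero.

```python
import math
from typing import Optional


def percent_estimate(e: Optional[float], e_agg: Optional[float]) -> Optional[float]:
    """Estimate as a percent of the base estimate, or None (NULL)."""
    if e is None or e_agg is None or e_agg == 0:
        return None
    return e / e_agg * 100


def percent_moe(
    e: Optional[float],
    m: Optional[float],
    e_agg: Optional[float],
    m_agg: Optional[float],
) -> Optional[float]:
    """Percent MOE, falling back to the ratio formula for a negative radicand."""
    p = percent_estimate(e, e_agg)
    if p is None or p == 0 or p == 100:
        return None
    if m is None or m_agg is None:
        return None
    scaled_base_moe = (e * m_agg / e_agg) ** 2
    radicand = m ** 2 - scaled_base_moe
    if radicand < 0:
        # Ratio MOE formula: add the terms instead of subtracting
        radicand = m ** 2 + scaled_base_moe
    return math.sqrt(radicand) / e_agg * 100


# Estimate 50 (MOE 10) out of a base of 200 (MOE 20)
p = percent_estimate(50, 200)
z = percent_moe(50, 10, 200, 20)
```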
Profile-only variables: There are several variables where percent and percent MOE are available directly from the Census API. The Census Bureau variable documentation and table column headers indicate these with suffixes "PE" and "PM". If available, we do not calculate the base estimate and MOE, and instead pull estimate, MOE, percent estimate, and percent MOE from the API directly using `calculate_e_m_p_z`, as called by `calculate_c_e_m_p_z`. This is only possible for census geography types (i.e. non-aggregated geography types), and for variables from the DP-prefixed profile tables. The variables that have percent and percent MOEs available directly from the Census API are listed in the `profile_only_variables` property of the Metadata class. There are 10 DP-prefixed profile variables that do not have percent estimates and MOEs available, which are listed in the `profile_only_exceptions` property of the Metadata class. For aggregated geography types, the percent and percent MOE are calculated using a base variable as described above.
Poverty variables: There are three poverty-related variables where the percent and percent MOE are available directly from the Census API (after 2010): "pbwpv", "pu18bwpv", and "p65plbwpv". Unlike the profile-only variables above, the percent and percent MOE are not indicated with suffixes "PE" and "PM", but are instead stored as estimates ("E" suffix) and MOEs ("M" suffix) of a separate variable. These separate percent variables are in the PFF metadata as `{pff_variable}_pct`. The function `calculate_poverty_p_z` calls `calculate_e_m` on `{pff_variable}_pct`, and renames `e` and `m` of the results as `p` and `z` respectively. The function `calculate_poverty_p_z` is called by `calculate_c_e_m_p_z` in place of calling `calculate_e_m` for the base variable. For aggregated geography types, the percent and percent MOE are calculated using a base variable as described above.
Several variables do not have `p` or `z` values. This is indicated in the metadata where `base_variable` is "nan". These include variables that are means, rent/cost values or burdens, and variables that already represent a percent of a population. When `calculate_c_e_m_p_z` is called for these variables, `p` and `z` are set to NULL.
For base variables, `p` is set to 100 if the geography type is either a city or borough. Otherwise, `p` is NULL. In both cases, `z` is NULL.
If the requested geography type is not a Census geography, tract- or block-group level PFF data are aggregated to calculate e, m, p, and z for larger geographies. For example, rows containing tract-level estimates and MOEs for a given PFF variable get combined to produce NTA-level estimates and MOEs.
In general, aggregate geography types include:
- NTA
- CDTA (post 2020)
- Portion of Community Districts within 100 year floodplain
- Portion of Community Districts within 500 year floodplain
- Portion of Community Districts within walking distance of a park
Note: when converting data published in 2010 geographies to 2020 geographies, 2020 census tracts function as an aggregate geography type.
Relationships between geographic areas are maintained in the directory `data/lookup_geo`. Lookups are specific to the latest decennial census year, since tract boundaries change each decade. When the `Calculate` class is initialized, the specified geography year determines which spatial lookup is referenced.
Each decennial year corresponds with a different version of the `AggregateGeography` class. These classes are defined in year-specific Python files in the geography directory.
`lookup_geo`: The `AggregateGeography` class (for both years) contains a property `lookup_geo`. This property is a DataFrame with parsed columns from the geographic lookups in the data directory.
`options`: This property contains a lookup between Census geography types, aggregate geography types they can be combined into, and the function necessary for converting raw geography types to aggregate geography types. For example, one record in the 2010 lookup is:
"tract": {"NTA": self.tract_to_nta, "cd": self.tract_to_cd}
Both NTA- and CD-level data are built from tract-level raw data. The function for aggregating tract-level data into NTA-level data is `tract_to_nta`, and the function for aggregating tract-level data into CD-level data is `tract_to_cd`.
`aggregated_geography`: This property is a list of all aggregated geography types for a given year, e.g. ["nta", "cd", "cd_fp_500", "cd_fp_100", "cd_park_access"].
`format_geoid` and `format_geotype`: These methods convert FIPS census geoids and types into the format displayed in Planning Labs' application, as implemented in `labs_geotype`. See the final output cleaning documentation page for more info.
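The `options` lookup described above works as a dispatch table: given a target aggregate geography type, it identifies both the raw geography type to download and the method that performs the aggregation. A minimal sketch of that pattern follows. The class `AggregateGeographySketch`, the `find_source` and `aggregate` helpers, and the placeholder method bodies are all hypothetical; only the shape of the `options` record comes from the example above.

```python
from typing import Callable, Dict, List


class AggregateGeographySketch:
    """Hypothetical stand-in; the aggregation methods are placeholders."""

    def tract_to_nta(self, rows: List[dict]) -> str:
        return f"NTA-level output from {len(rows)} tract rows"

    def tract_to_cd(self, rows: List[dict]) -> str:
        return f"CD-level output from {len(rows)} tract rows"

    @property
    def options(self) -> Dict[str, Dict[str, Callable[[List[dict]], str]]]:
        # Raw geography type -> aggregate geography type -> aggregation method
        return {"tract": {"NTA": self.tract_to_nta, "cd": self.tract_to_cd}}

    def find_source(self, to_geotype: str) -> str:
        """Return the raw geography type needed to build to_geotype."""
        for source, targets in self.options.items():
            if to_geotype in targets:
                return source
        raise KeyError(f"No aggregation defined for {to_geotype!r}")

    def aggregate(self, rows: List[dict], to_geotype: str) -> str:
        """Look up and call the aggregation method for to_geotype."""
        source = self.find_source(to_geotype)
        return self.options[source][to_geotype](rows)


geo = AggregateGeographySketch()
source_geotype = geo.find_source("cd")
result = geo.aggregate([{"geoid": "tract 1"}, {"geoid": "tract 2"}], "NTA")
```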
The majority of methods in the `AggregateGeography` class aggregate tract- or block group-level data into larger geographies. While the methods are specific to the geography type, they follow a similar structure. Consider the example:
Tract-level data
geoid | Estimate | MOE |
---|---|---|
tract 1 | 1 | 2 |
tract 2 | 3 | 4 |
tract 3 | 5 | 6 |
tract 4 | 7 | 8 |
Geo-lookup
tract_geoid | nta_geoid | ... |
---|---|---|
tract 1 | NTA 1 | ... |
tract 2 | NTA 1 | ... |
tract 3 | NTA 2 | ... |
tract 4 | NTA 3 | ... |
- Join tract/block-group data with lookup_geo (which defines how small geographies nest within larger ones) on geoid_tract or geoid_block_group. Following the example data above, this would produce:
geoid | Estimate | MOE | nta_geoid |
---|---|---|---|
tract 1 | 1 | 2 | NTA 1 |
tract 2 | 3 | 4 | NTA 1 |
tract 3 | 5 | 6 | NTA 2 |
tract 4 | 7 | 8 | NTA 3 |
- Call the function `create_output` in order to group by the aggregate geography geoid. Within each group, estimates get summed. MOEs are aggregated using the square root of sum of squares, defined in `agg_moe`.
nta_geoid | Estimate | MOE |
---|---|---|
NTA 1 | (1 + 3) = 4 | SQRT(2^2 + 4^2) = SQRT(20) |
NTA 2 | 5 | 6 |
NTA 3 | 7 | 8 |
- Rename GEOID column to standardize output
census_geoid | Estimate | MOE |
---|---|---|
NTA 1 | (1 + 3) = 4 | SQRT(2^2 + 4^2) = SQRT(20) |
NTA 2 | 5 | 6 |
NTA 3 | 7 | 8 |
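The join, group, and aggregate steps can be sketched without pandas using the example tract data and lookup above. This is an illustration rather than the package's `create_output`: the `aggregate_rows` helper and the dictionary-based lookup are assumptions, and `agg_moe` here is a stand-in that only implements the square root of the sum of squares.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def agg_moe(moes: Iterable[float]) -> float:
    """Square root of the sum of squared MOEs."""
    return math.sqrt(sum(m ** 2 for m in moes))


def aggregate_rows(
    rows: List[Tuple[str, float, float]],
    lookup: Dict[str, str],
) -> Dict[str, Tuple[float, float]]:
    """Join (geoid, estimate, MOE) rows to the lookup, then group and aggregate."""
    grouped: Dict[str, List[Tuple[float, float]]] = defaultdict(list)
    for geoid, estimate, moe in rows:
        grouped[lookup[geoid]].append((estimate, moe))  # join on tract geoid
    return {
        agg_geoid: (sum(e for e, _ in members), agg_moe(m for _, m in members))
        for agg_geoid, members in grouped.items()
    }


tract_rows = [("tract 1", 1, 2), ("tract 2", 3, 4), ("tract 3", 5, 6), ("tract 4", 7, 8)]
tract_to_nta = {"tract 1": "NTA 1", "tract 2": "NTA 1", "tract 3": "NTA 2", "tract 4": "NTA 3"}
nta_rows = aggregate_rows(tract_rows, tract_to_nta)
```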
When converting 2010 input geographies to 2020 outputs, the methods described in the previous section contain an additional step. Prior to step one, in which tract-level data are joined with lookup_geo, 2010 tracts are converted to 2020 tracts using the method `ct2010_to_ct2020`. The following steps proceed as described above, using 2020 tract-level data as the input to further vertical aggregation.
For more information about tract-to-tract conversion, see the 2010 to 2020 geography conversion documentation page.