Methodology
In our Migration Patterns dataset, we combine USPS data and GPS location data using our proprietary algorithms. This section describes their main elements.
There are six steps in our pipeline:
- Step 1: We ingest near real-time change-of-address data (at the ZIP code level) from the US Postal Service.
- Step 2: To achieve high accuracy when deriving migration estimates on a set of selected geographies, we first interpolate the raw data to H3 hexagons before we aggregate up.
- Step 3: In this step, we compute inflows, outflows and net flows on a selected set of geometries (ZIP areas, counties, CBSAs, states).
- Step 4: To build the graph of typical migration flows, we use historical GPS data to train a model that estimates the probabilities of moves between all origin-destination pairs.
- Step 5: By combining the probabilities of origin-destination flows (step 4) and observed inflows (step 3), we compute the number of people migrating between all origin-destination pairs.
- Step 6: We enrich Migration Patterns with additional context that:
1) quantifies the demographic shifts;
2) provides useful information about the locations.
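As an illustration of Step 3, the net flow of an area is its inflow minus its outflow. The following is a minimal sketch, not the production pipeline: the function name `net_flows` and all counts are hypothetical and chosen only to show the arithmetic.

```python
def net_flows(counts):
    """Map each area to its inflow, outflow and net flow.

    `counts` maps an area name to an (inflow, outflow) pair.
    """
    return {
        area: {"inflow": inflow, "outflow": outflow, "net": inflow - outflow}
        for area, (inflow, outflow) in counts.items()
    }


# Hypothetical county-level counts, for illustration only.
county_counts = {
    "Palm Beach County, FL": (500, 300),
    "Orange County, CA": (400, 650),
}

flows = net_flows(county_counts)
for area, summary in flows.items():
    print(area, summary)
```

A positive net flow means the area gained residents over the period; a negative one means it lost residents.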
Combining USPS data with GPS location data
In our model, Area Migration (Steps 1 to 3) is based on USPS data, one of the most reliable sources of information about people moving from one address to another in the US. For example: 500 people moved into Palm Beach County, Florida, and 300 moved out.
Origin-Destination flows and additional context such as demographics (Steps 4 to 6) are based on GPS location data. This allows us to provide insights such as: 250 people moved from Orange County, California, to Palm Beach County, Florida.
To understand the origin and destination of moves, we use our proprietary home assignment algorithm and historical data in a machine learning model.
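One way to picture how the two sources fit together is to split an observed inflow across candidate origins in proportion to the modeled probabilities. This is a sketch under stated assumptions: we assume the model output can be read as the probability of each origin given the destination, and the function `allocate_inflow` and all probabilities below are hypothetical.

```python
def allocate_inflow(inflow, origin_probs):
    """Split an observed inflow across origins in proportion to their probabilities.

    `origin_probs` maps origin -> probability that a mover into the
    destination came from that origin (assumed interpretation).
    """
    total = sum(origin_probs.values())
    if total <= 0:
        raise ValueError("origin probabilities must sum to a positive value")
    # Normalise so the allocated flows always add up to the observed inflow.
    return {origin: inflow * p / total for origin, p in origin_probs.items()}


# Hypothetical values: observed inflow from USPS-based Area Migration,
# origin probabilities from the GPS-based model.
palm_beach_inflow = 500
origin_probs = {
    "Orange County, CA": 0.5,
    "Cook County, IL": 0.25,
    "Kings County, NY": 0.25,
}

od_flows = allocate_inflow(palm_beach_inflow, origin_probs)
print(od_flows["Orange County, CA"])  # 250.0
```

Normalising by the sum of the probabilities keeps the origin-destination flows consistent with the observed inflow even if the model's probabilities do not sum exactly to one.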
Home Assignment Algorithm
On a weekly basis, for each device, we aggregate time spent in a given Uber H3 hexagon. Then, looking back over an eight-week observation window, we estimate which hexagon most likely contains the home of the device based on several criteria that are evaluated during this observation period.
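The general idea can be sketched as follows. The actual criteria are proprietary and not listed here, so this example assumes just two: total dwell time within the window, and a minimum number of distinct weeks in which the device was seen in the hexagon. The function `assign_home`, the `min_weeks` threshold and the hexagon IDs are all hypothetical placeholders (the IDs are not real H3 indexes).

```python
from collections import defaultdict


def assign_home(weekly_dwell, current_week, window_weeks=8, min_weeks=4):
    """Pick the most likely home hexagon for one device.

    `weekly_dwell` is an iterable of (week_index, hex_id, hours) rows.
    Returns the hexagon with the most dwell time in the observation
    window among those seen in at least `min_weeks` distinct weeks,
    or None if no hexagon qualifies.
    """
    first_week = current_week - window_weeks + 1
    hours_by_hex = defaultdict(float)
    weeks_by_hex = defaultdict(set)
    for week, hex_id, hours in weekly_dwell:
        if first_week <= week <= current_week:
            hours_by_hex[hex_id] += hours
            weeks_by_hex[hex_id].add(week)
    candidates = [h for h in hours_by_hex if len(weeks_by_hex[h]) >= min_weeks]
    if not candidates:
        return None
    return max(candidates, key=lambda h: hours_by_hex[h])


# Hypothetical device: regular time in hex_a and hex_b,
# plus a single long stay in hex_c during week 8.
observations = (
    [(week, "hex_a", 20.0) for week in range(1, 8)]
    + [(week, "hex_b", 10.0) for week in range(1, 8)]
    + [(8, "hex_c", 150.0)]
)

print(assign_home(observations, current_week=8))  # hex_a
```

Requiring presence across several weeks is one simple way to keep a one-off trip from being mistaken for a home move; a real implementation would likely use further signals.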
Our historical GPS data, which spans multiple years and terabytes, contains information about home moves once it is combined with the home assignments. We use it to train a robust ML model that estimates the flow probabilities between all origins and destinations of moves.
Assigning demographics to moves
We use US Census ACS data to understand the makeup of different neighborhoods, including their age and income profiles. Moves are tagged with the demographics of their origin, and the median income and age for these moves are added to the Area Migration table.
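A minimal sketch of this tagging step is shown below. How the medians are actually computed is not specified here, so the example assumes a weighted median in which each origin's ACS figure is weighted by the number of moves from that origin. The helper names and all figures are hypothetical.

```python
def weighted_median(values_and_weights):
    """Lower weighted median: each value counts `weight` times."""
    pairs = sorted(values_and_weights)
    total = sum(weight for _, weight in pairs)
    if total <= 0:
        raise ValueError("total weight must be positive")
    cumulative = 0
    for value, weight in pairs:
        cumulative += weight
        if cumulative * 2 >= total:
            return value


def destination_profile(moves_by_origin, profiles):
    """Summarise moves into one destination using origin demographics."""
    return {
        key: weighted_median(
            [(profiles[origin][key], n) for origin, n in moves_by_origin.items()]
        )
        for key in ("median_income", "median_age")
    }


# Hypothetical ACS-style figures per origin, for illustration only.
origin_profiles = {
    "Orange County, CA": {"median_income": 95000, "median_age": 38},
    "Cook County, IL": {"median_income": 60000, "median_age": 45},
    "Kings County, NY": {"median_income": 72000, "median_age": 33},
}
# Hypothetical number of moves into one destination, by origin.
moves_into_destination = {
    "Orange County, CA": 250,
    "Cook County, IL": 150,
    "Kings County, NY": 120,
}

profile = destination_profile(moves_into_destination, origin_profiles)
print(profile)
```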
Mapping between geographies
Not all geographic boundary systems are compatible with each other. Our interpolation system based on H3 hexagons helps us to transform between geometries without loss of precision.
Compatible Geographical Hierarchies
Some geographies, such as the geographic units used by the US Census Bureau, are very easy to work with due to their standardized format, and can be easily aggregated from more granular to less granular: Tract < County < MSA < State.
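To illustrate why nested Census units are easy to aggregate: a tract GEOID is 11 digits, of which the first two are the state FIPS code and the first five are the county FIPS code, so rolling tract values up to counties or states only requires grouping by a prefix. The sketch below uses real county codes (12099 for Palm Beach County, 12086 for Miami-Dade County, 06059 for Orange County, California), but the tract suffixes and the values are hypothetical.

```python
from collections import defaultdict


def roll_up(tract_values, prefix_length):
    """Sum tract-level values by a GEOID prefix (2 = state, 5 = county)."""
    totals = defaultdict(int)
    for geoid, value in tract_values.items():
        totals[geoid[:prefix_length]] += value
    return dict(totals)


# Hypothetical tract-level inflows keyed by 11-digit tract GEOID.
tract_inflows = {
    "12099000100": 40,
    "12099000200": 25,
    "12086000100": 30,
    "06059000100": 55,
}

county_inflows = roll_up(tract_inflows, 5)
state_inflows = roll_up(tract_inflows, 2)
print(county_inflows)  # {'12099': 65, '12086': 30, '06059': 55}
print(state_inflows)   # {'12': 95, '06': 55}
```

Metropolitan areas are not encoded in the GEOID, so rolling counties up to that level would need a separate county-to-area lookup table rather than a prefix.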
Incompatible Geographical Hierarchies
Other geographies, such as ZIP code areas, are not compatible with the Census-defined areas because their purpose and design criteria were different. As a result, ZIP code boundaries can often cross city, county, and even state lines, and may not align with Census geography boundaries. This makes it difficult to use, for instance, ZIP-level migration data as a base for calculating migration at the county level.
How we solve for this:
We leverage a hexagonal grid to build our migration patterns and then aggregate up to various spatial dimensions: State, CBSA, County, Tract and ZIP Code. We account for varying population density by weighting individual hexagons by their population, using data from the WorldPop 100m Population Grid (Bondarenko et al.).
Incompatible geographies
We wish to calculate the unknown value for Geo C based on known values for Geo A and Geo B.
Step 1. Introduction of hexagonal grid
We divide the space of interest into a uniform grid of H3 hexagons.
Step 2. Interpolation
We map the known values of Geo A and Geo B to each unit of the underlying hex grid.
Step 3. Aggregation
Once we know the value that is assigned to each hexagon, calculating the final value for Geo C is a matter of aggregating values that are associated with this geometry.
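The three steps can be sketched in a few lines. This is a simplified illustration rather than our production code: it assumes each hexagon belongs wholly to one source area and one target area (for example, by its centroid), spreads each source value over its hexagons in proportion to population, and then sums per target area. The hexagon IDs are placeholders rather than real H3 indexes, and all values and populations are hypothetical.

```python
from collections import defaultdict


def interpolate_to_hexes(source_values, hex_source, hex_population):
    """Step 2: spread each source area's value over its hexagons by population."""
    pop_by_source = defaultdict(float)
    for hex_id, source in hex_source.items():
        pop_by_source[source] += hex_population[hex_id]
    hex_values = {}
    for hex_id, source in hex_source.items():
        total_pop = pop_by_source[source]
        share = hex_population[hex_id] / total_pop if total_pop else 0.0
        hex_values[hex_id] = source_values[source] * share
    return hex_values


def aggregate_hexes(hex_values, hex_target):
    """Step 3: sum hexagon values per target area."""
    totals = defaultdict(float)
    for hex_id, value in hex_values.items():
        totals[hex_target[hex_id]] += value
    return dict(totals)


# Step 1: a (tiny) grid of hexagons with hypothetical populations.
hex_population = {"h1": 300, "h2": 100, "h3": 50, "h4": 150}
# Known values for the source areas Geo A and Geo B.
source_values = {"Geo A": 100, "Geo B": 60}
hex_source = {"h1": "Geo A", "h2": "Geo A", "h3": "Geo B", "h4": "Geo B"}
# Geo C straddles the border between Geo A and Geo B.
hex_target = {"h1": "Geo D", "h2": "Geo C", "h3": "Geo C", "h4": "Geo D"}

hex_values = interpolate_to_hexes(source_values, hex_source, hex_population)
target_values = aggregate_hexes(hex_values, hex_target)
print(target_values["Geo C"])  # 40.0
```

Because every source value is fully distributed over its hexagons, the totals are preserved: the target areas together sum to the same 160 as Geo A and Geo B.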