The Visitation datasets include visitation, trade area, and demographic insights based on location data in the US. Two methodologies are used to build the datasets.
- Our latest machine learning model, which combines multiple data sources to estimate visitation.
- An aggregation model that aggregates and extrapolates GPS data for all other metrics. We gather GPS data from smartphones and mobile apps with opt-in consent from users (all of our data is privacy-compliant), then analyze and contextualize the data to provide accurate location insights.
We wholeheartedly believe that privacy is a concept to which every human that uses a connected device has a fundamental right — and it’s built into every part of our business.
One of the drawbacks when working with location data is the possibility of reverse engineering and, thus, violating someone's privacy.
To avoid such a scenario, the Visitation Datasets use a machine learning model trained on population data.
This approach makes reverse engineering almost impossible.
In addition, metrics like Trade Areas or Cross Visitation are aggregated over a longer period of time (one quarter) and, soon, will be replaced by a machine learning model as well. With this, Unacast ensures the highest standards of an individual's privacy with SmartMobility. If you are interested in a more detailed walkthrough of how Unacast handles privacy, please take a look at our privacy statement.
Visitation Datasets & Metrics
Dataset: Foot Traffic
Foot Traffic is a collection of metrics providing additional context about what is happening at a given location. It comes in various time aggregations (weekly, monthly, or quarterly) and is provided on a four-day lag.
Visitation describes a time series of visit counts for various aggregation periods - weekly, monthly, and quarterly. The visitation estimates are derived from Unacast's proprietary machine learning model.
Visitation estimates the distinct count of people who visited a location. Available weekly, monthly and quarterly.
Here is a story of how machine learning can change the location data industry for the better.
Visitation Methodology (machine learning methodology)
We estimate visitation to a location by using a machine learning model. The model learns the relationships between various types of context and visitation of that location.
Unlike typical aggregated products that rely solely on aggregating the underlying GPS device-level supply, our machine learning model is more robust and less dependent on GPS data fluctuations because it draws on a multitude of different contexts to estimate visitation.
Using machine learning, we are able to overcome these supply problems and create a more robust product.
Underlying contexts in the model
We use multiple sources of contexts as features in our machine learning model to estimate foot-traffic. Some of these are:
- Number of people in the vicinity
- Venue square footage
- Local demographics (such as income)
- Day of the week
- Industry codes
- Historical data
- People at a location
In total, the model comprises more than 50 features and learns these relationships based on our long history of high-quality location data.
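The idea of learning a mapping from contextual features to visit counts can be sketched as follows. This is a hypothetical illustration only: a simple least-squares fit stands in for the real (much richer, 50+ feature) model, and all feature names and numbers are invented.

```python
# Hypothetical sketch: learn a relationship between contextual features
# and visit counts. All data here is synthetic, for illustration only.
import numpy as np

rng = np.random.default_rng(0)
n = 200
X = np.column_stack([
    rng.integers(100, 5000, n).astype(float),   # people in the vicinity
    rng.integers(500, 20000, n).astype(float),  # venue square footage
    rng.integers(0, 7, n).astype(float),        # day of the week
    np.ones(n),                                 # intercept term
])
# Toy target: visits loosely driven by nearby population plus noise
y = 0.05 * X[:, 0] + rng.normal(0, 5, n)

# Fit a linear model by least squares and predict for a few locations
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
predicted_visits = X[:5] @ coef
```

Because the model estimates visits from context rather than counting raw devices, a dip in GPS supply does not directly translate into a dip in estimated visitation.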
For our demographics context, we determine demographics of people observed at a location based on US census data.
Demographics (age, income, education and race) can help identify what type of people visited a given location.
Demographics Methodology (aggregation methodology)
To determine the demographics of a location's visitors, we use a home-derived methodology based on Unacast's Home & Work algorithm. Simply put, we look up the home area for every device that visits a location. Based on census data for the home Census Block Group (CBG) of each device seen at the location, we can determine the demographics of people visiting that location.
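The home-derived idea can be illustrated as a weighted average: census figures for each visitor's home CBG, weighted by that CBG's share of visitors. The CBG ids, visitor counts, and census numbers below are made up for illustration.

```python
# Toy example of home-derived demographics: weight census data for each
# home CBG by that CBG's share of a location's visitors (invented numbers).
visitors_by_home_cbg = {"cbg_a": 60, "cbg_b": 40}
census_median_age = {"cbg_a": 34.0, "cbg_b": 52.0}

total = sum(visitors_by_home_cbg.values())
estimated_median_age = sum(
    census_median_age[cbg] * count / total
    for cbg, count in visitors_by_home_cbg.items()
)
# 34.0 * 0.6 + 52.0 * 0.4 = 41.2
```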
Metric: Return Rate
Return Rate provides a deeper understanding of visitor loyalty. It captures the share of people who return to a given location month-over-month.
The estimated fraction of total visitors (excluding those with a home or work location at the site) seen in the previous month who are also seen in the current month.
Return Rate Methodology (aggregation methodology)
The logic behind the Return Rate is fairly straightforward. We take all devices seen at a given location in a specific month and correct for supply and population bias. Then, we take the previous month and do the same (get all devices seen at the location and correct for supply + bias). Lastly, we compare these two months and calculate the visitor overlap describing the share of visitors this month compared to the previous month.
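The overlap step can be sketched as a set intersection. This is a simplified illustration with invented device ids; the supply and population bias corrections described above are omitted, and the denominator here follows the metric definition (previous period's visitors).

```python
# Illustrative Return Rate overlap (bias corrections omitted): share of
# last month's visitors who were also seen this month. Device ids invented.
previous_month = {"dev1", "dev2", "dev3", "dev4"}
this_month = {"dev2", "dev3", "dev5"}

returning = previous_month & this_month
return_rate = len(returning) / len(previous_month)
# 2 of last month's 4 visitors returned -> 0.5
```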
Metric: Capture Rate
Capture Rate estimates the pull of a location. This metric informs how many people visit a given location compared to the total traffic within a 150m or 300m radius.
Capture Rate describes the percentage of people visiting a location in relation to the total traffic in the surrounding area.
Capture Rate Methodology (aggregation methodology)
To calculate the Capture Rate, we compute the person count within a 150m and a 300m radius surrounding a location. The person count allocated to the location is then expressed as a fraction of the total for each radius. Simply put, Capture Rate describes the person count at a location as a share of the person count in the surrounding area.
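Numerically, the calculation is a simple ratio per radius. The counts below are invented for illustration.

```python
# Toy Capture Rate: visitors to the location as a fraction of the total
# traffic within each surrounding radius (all numbers invented).
persons_at_location = 120
persons_within_150m = 800
persons_within_300m = 2000

capture_rate_150m = persons_at_location / persons_within_150m  # 0.15
capture_rate_300m = persons_at_location / persons_within_300m  # 0.06
```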
Metric: Visit Length
Visit Length allows users to estimate how long visitors stay on average at a given location. This metric is especially of interest when combined with additional context, like transactional data (because length of stay correlates better with cash flow than absolute traffic).
Visit Length represents the median dwelling time across all visitors at a given location.
Visit Length Methodology (aggregation methodology)
Visit Length is based on the median dwell time derived from Unacast's potential duration estimate. In detail, the potential duration is the estimated duration between the previous dwell and the next dwell (without having data in-between those dwells). This logic is useful for avoiding cases where our data density of dwell events is sparser and we don't have a full picture of how long a dwell at a location actually lasted. By taking probabilities of travel between dwells into account, we can more accurately estimate the time spent at a given location.
We do allow dwell events to span across 12am UTC. However, since this can create edge cases with dwell durations longer than 24 hours, we cap such individual dwell events at 24 hours.
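The aggregation step, take per-visitor dwell durations, cap each at 24 hours, and report the median, can be sketched like this. Durations are invented for illustration.

```python
# Sketch of the Visit Length aggregation: cap individual dwell durations
# (minutes) at 24 hours, then take the median. Numbers are invented.
from statistics import median

dwell_minutes = [15, 25, 40, 55, 2000]  # last event spans 12am UTC
CAP = 24 * 60  # 24-hour cap, in minutes
capped = [min(d, CAP) for d in dwell_minutes]
visit_length = median(capped)  # median of [15, 25, 40, 55, 1440] -> 40
```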
Dataset: Dynamic Trade Areas
Dynamic Trade Areas describe the origin of visitors to a specific location of interest based on their home or work location. The origin area is defined as a Census Block Group (CBG) and the metric is person_fraction, which shows the percentage breakdown of visitors from each CBG.
Dynamic Trade Areas describe the home and work CBGs for all the visitors to a location.
Dynamic Trade Areas Methodology (aggregation methodology)
Dynamic Trade Areas are based on our sophisticated aggregation algorithm.
To define the person_fraction, we calculate the distinct count of devices seen at each location on a given day.
As a next step we utilize our Home & Work algorithm to get the home and work origins for the visiting devices.
Unacast's proprietary supply correction and extrapolation is applied to the data to get the number of visitors from each corresponding home and work CBG. Thereafter, we divide the visitor count from each home or work CBG by the total number of visitors to a location to create the person_fraction.
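The final division step can be illustrated with toy numbers. The CBG ids and counts below are invented, and the real pipeline's supply correction is assumed to have already been applied to them.

```python
# Toy person_fraction computation: visitor counts per home CBG (assumed
# already supply-corrected) divided by the location's total visitors.
visitors_per_home_cbg = {"cbg_a": 50, "cbg_b": 30, "cbg_c": 20}

total_visitors = sum(visitors_per_home_cbg.values())
person_fraction = {
    cbg: count / total_visitors
    for cbg, count in visitors_per_home_cbg.items()
}
# fractions sum to 1.0 across all origin CBGs
```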
Dataset: Dynamic Trade Area Distance
Dynamic Trade Area Distance describes the distance of visitors to a specific location of interest based on their home or work location. The origin area is defined as a Census Block Group (CBG), and the metrics are person_count and person_fraction.
Dynamic Trade Area Distance describes the distance visitors travel from their home or work CBGs to a location.
Dynamic Trade Area Distance Methodology (aggregation methodology)
We derive the distance between a location and home or work CBGs of visitors. We provide this distance in three aggregations:
- _p25: 25th percentile distance from habit area
- _p50: median distance from habit area
- _p75: 75th percentile distance from habit area
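The three aggregations are straightforward percentiles over the per-visitor distances. The distances below are invented for illustration.

```python
# Illustrative percentile aggregation for trade-area distances (km).
# The distances are invented; the real pipeline derives them from home
# or work CBGs of actual visitors.
import numpy as np

distances_km = np.array([1.2, 2.5, 3.0, 4.8, 7.1, 9.9, 15.0, 22.4])
p25, p50, p75 = np.percentile(distances_km, [25, 50, 75])
# p50 is the median travel distance from the habit area
```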
Dataset: Cross Visitation
Cross Visitation helps you understand the customer behavior in terms of their brand shopping preferences. It provides insights into how many visitors to a specific location also visit different competitive brands or other locations. The metrics of interest here are the absolute person count, the fraction of people that visit other brands, and the resulting rank.
Cross Visitation indicates which other brands or locations visitors of a specific location go to.
Cross Visitation Methodology (aggregation methodology)
To determine Cross Visitation, we aggregate the person count for all possible combinations of locations to other locations. For example, we take a specific location A and calculate the person count based on devices that have been seen at location A and at locations of all other brands or points of interest. This leads to a total person count for each location-to-location combination.
As a next step, we calculate a fraction based on the location-to-location combination normalized to the total cross-visitation to that location.
More formally, for every given location l and brand b_i, the cross-visitation fraction f(l, b_i) between l and b_i is defined as:

f(l, b_i) = c(l, b_i) / sum_{j=1}^{N} c(l, b_j)

where c(l, b_i) denotes the cross-visitation person count between l and b_i, and N denotes the total number of brands.
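A numeric illustration of this normalization, with invented brand names and counts:

```python
# Toy cross-visitation fractions: counts for one location against each
# brand, normalized by the total. Brands and counts are invented.
cross_counts = {"brand_x": 30, "brand_y": 50, "brand_z": 20}

total = sum(cross_counts.values())
cross_fraction = {brand: c / total for brand, c in cross_counts.items()}

# The rank follows the fractions in descending order
rank = {
    brand: i + 1
    for i, (brand, _) in enumerate(
        sorted(cross_fraction.items(), key=lambda kv: -kv[1])
    )
}
```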
Dataset: Visitation Patterns
Visitation Patterns represent the signal, or periodicity, aggregated over a longer time interval for day-of-week and hour combinations. The pattern is represented as a fraction of total observations.
Visitation Patterns indicate the traction over time (e.g., for days of the week and hours) for a given location.
Visitation Patterns Methodology (aggregation methodology)
Visitation Patterns are calculated over longer aggregation windows (quarter). Over that period, the average person count is calculated per day of week and hour. This person count is then normalized by the total person count and, thus, reflects a fraction of the total.
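The normalization step can be sketched with a few toy cells: average person counts per (day-of-week, hour) combination divided by the grand total. All counts below are invented.

```python
# Sketch of the Visitation Patterns normalization: average person counts
# per (day-of-week, hour) cell over a quarter, divided by the grand total.
avg_counts = {
    ("monday", 9): 40,
    ("monday", 17): 60,
    ("saturday", 12): 100,
}

total = sum(avg_counts.values())
pattern = {cell: c / total for cell, c in avg_counts.items()}
# each value is that cell's fraction of the total observations
```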
Analysis of Machine Learning Model
Validation of Target Value
One important concept of machine learning is that a model can only be as good as the target it is trained on. Previous validations show that Unacast's historical data is of high quality and sufficient to use as a target for modeling:
Training our model resulted in the following evaluation scores on the validation set:
- Median Absolute Error:
- R squared:
Analysis of Estimated Visitation
We evaluated our estimated visitation by comparing Unacast Visitation Dataset with ground truth from a major sporting goods retailer. For that retailer, we have ground truth data based on sensors at physical locations counting how many people entered the store.
Comparing ground truth to Unacast visitation shows a high correlation: ().