The SmartMobility dataset includes visitation, trade area, and demographic insights based on location data in the US. Two methodologies are used in SmartMobility to build the dataset.
- Our latest machine learning model, which estimates Visitation by combining multiple data sources.
- An aggregation model that aggregates and extrapolates GPS data for all other metrics. We gather GPS data from smartphones and mobile apps with opt-in consent from users (all of our data is privacy-compliant). We then analyze and contextualize the data to provide accurate location insights.
We wholeheartedly believe that privacy is a concept to which every human that uses a connected device has a fundamental right — and it’s built into every part of our business.
One of the drawbacks when working with location data is the possibility of reverse engineering and, thus, violating someone's privacy.
To avoid such a scenario, SmartMobility Visitation uses a machine learning model trained on population data.
This approach makes reverse engineering almost impossible.
In addition, metrics like Trade Areas or Cross Visitation are aggregated over a longer period of time (one quarter) and will soon be replaced by a machine learning model as well. With this, Unacast ensures the highest standards of individual privacy in SmartMobility. If you are interested in a more detailed walk-through of how Unacast handles privacy, please take a look at our privacy statement.
Foot Traffic is a collection of metrics providing additional context about what is happening at a given location. It comes in various time aggregations (weekly, monthly, or quarterly) and is provided on a 4-day lag.
Visitation describes a time series of visit counts for various aggregation periods - weekly, monthly, and quarterly. The visitation estimates are derived from Unacast's proprietary machine learning model.
Visitation estimates the distinct count of people who visited a location on a given day. It is available on weekly, monthly and quarterly aggregations.
Here is a story about how machine learning can change the location data industry for the better.
Methodology Visitation (machine learning methodology)
We estimate visitation to a location by using a machine learning model. The model learns the relationships between various types of context and visitation of that store.
Unlike typical aggregated products that rely solely on the underlying GPS device-level supply, our machine learning model is more robust and less dependent on GPS supply fluctuations, because it estimates visitation from a multitude of different contexts rather than from raw device counts alone.
Underlying contexts in the model
We use multiple sources of contexts as features in our machine learning model to estimate foot-traffic. Some of these are:
- Number of people in the vicinity
- Venue square footage
- Local demographics (such as income)
- Day of the week
- Industry codes
- Historical data
- People at a location
In total, the model comprises more than 50 features to learn those relationships, based on our long history of high-quality location data.
For our demographics context, we determine demographics of people observed at a location based on US census data.
Demographics (age, income, education, and race) can help identify what type of people visited a given location.
Home-derived Methodology (aggregation methodology)
To determine the demographics of location visitors, we use a home-derived methodology based on Unacast's Home & Work algorithm. Simply put, we look up the home area for every device seen at the location and, using the census data of that home Census Block Group (CBG), determine the demographics of the people visiting that location.
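The home-derived lookup described above can be sketched as follows. This is a minimal illustration, not Unacast's implementation: the function name, the input shapes, and the use of a simple visitor-weighted average over census attributes are all assumptions made for the example.

```python
def visitor_demographics(visitor_home_cbgs, census_by_cbg):
    """Estimate visitor demographics from the home CBG of each observed device.

    visitor_home_cbgs: list of home CBG ids, one per device seen at the location.
    census_by_cbg: census attributes (e.g. median income) keyed by CBG id.
    Returns the visitor-weighted average of each census attribute.
    """
    totals, count = {}, 0
    for cbg in visitor_home_cbgs:
        if cbg not in census_by_cbg:
            continue  # skip devices whose home CBG has no census record
        count += 1
        for attribute, value in census_by_cbg[cbg].items():
            totals[attribute] = totals.get(attribute, 0.0) + value
    return {attr: total / count for attr, total in totals.items()}

# Hypothetical example: two visitors each from two home CBGs.
census = {"cbg1": {"median_income": 50000}, "cbg2": {"median_income": 70000}}
demo = visitor_demographics(["cbg1", "cbg1", "cbg2", "cbg2"], census)
```

In this toy case the visitor-weighted median income averages out between the two home CBGs.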
Metric: Return Rate
Return Rate provides a deeper understanding of visitor loyalty. It captures the share of people who visit a given location month over month.
Return Rate estimates the percentage of visitors in the current month who also visited that location in the previous month.
Return Rate Methodology (aggregation methodology)
The logic behind Return Rate is fairly straightforward. We take all devices seen at a given location in a specific month and correct for supply and population bias. Then, we do the same for the previous month (get all devices seen at the location and correct for supply and bias). Lastly, we compare the two months and calculate the visitor overlap, i.e. the share of this month's visitors who were also seen in the previous month.
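The overlap step above can be sketched in a few lines. This is a simplified illustration under the assumption that the supply and population-bias corrections have already been applied to the device sets; the function name and input format are hypothetical.

```python
def return_rate(current_month_devices, previous_month_devices):
    """Share of this month's visitors who were also seen in the previous month.

    Inputs are sets of (already supply- and bias-corrected) device IDs
    observed at one location in each month.
    """
    if not current_month_devices:
        return 0.0
    overlap = current_month_devices & previous_month_devices
    return len(overlap) / len(current_month_devices)

# Hypothetical example: 3 of this month's 4 visitors were also seen last month.
rate = return_rate({"a", "b", "c", "d"}, {"a", "b", "c", "x"})
```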
Metric: Capture Rate
Capture Rate estimates the pull of a location. This metric indicates how popular a given location is compared to the total traffic within a 150 m or 300 m radius.
Capture Rate describes the share of people visiting a location in relation to the traffic in the surrounding area.
Capture Rate Methodology (aggregation methodology)
To calculate the Capture Rate, we compute the daily person count within a 150 m and a 300 m radius surrounding a location. The person count allocated to the location itself is then expressed as a fraction of the radius count. Finally, the median and the lower/upper percentiles are calculated across the quarter, per radius and venue.
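The daily-fraction and quarterly-percentile steps can be sketched as below. This is an illustrative simplification for one venue and one radius; the function name, the list-based inputs, and the choice of `statistics.quantiles` are assumptions of the example, not the production pipeline.

```python
import statistics

def capture_rate_summary(daily_location_counts, daily_radius_counts):
    """Quarterly capture-rate summary for one venue and one radius.

    Each input is a list of daily person counts: at the location itself,
    and within the surrounding radius (e.g. 150 m or 300 m).
    """
    daily_fractions = [
        loc / radius
        for loc, radius in zip(daily_location_counts, daily_radius_counts)
        if radius > 0  # skip days with no observed surrounding traffic
    ]
    # Quartile cut points across the quarter: p25, median, p75.
    p25, p50, p75 = statistics.quantiles(daily_fractions, n=4)
    return {"p25": p25, "p50": p50, "p75": p75}

# Hypothetical example: four days, constant surrounding traffic.
summary = capture_rate_summary([10, 20, 30, 40], [100, 100, 100, 100])
```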
Metric: Visit Length
Visit Length allows users to estimate how long visitors stay on average at a given location. This metric is especially of interest when combined with additional context, like transactional data (because length of stay correlates better with cash flow than absolute traffic).
Visit Length represents the median dwell time across all visitors at a given location.
Visit Length Methodology (aggregation methodology)
Visit Length is based on the median dwell time derived from Unacast's potential duration estimate. In detail, the potential duration is the estimated duration between the previous dwell and the next dwell (without having data in between those dwells). This logic avoids cases where our dwell-event data density is sparse and we do not have a full picture of how long a dwell at a location actually lasted. By taking the probabilities of travel between dwells into account, we can more accurately estimate the time spent at a given location.
We do allow dwell events to span across 12am UTC. However, since this can create edge cases with dwell durations longer than 24 hours, we cap such individual dwell events at 24 hours.
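The capping and median steps above can be sketched directly. This is a minimal illustration assuming dwell durations (in hours) have already been estimated per visitor; the function name and input format are hypothetical.

```python
import statistics

MAX_DWELL_HOURS = 24  # dwells spanning 12am UTC are capped at 24 hours

def visit_length_hours(dwell_durations_hours):
    """Median dwell time across all visitor dwell events at a location."""
    capped = [min(d, MAX_DWELL_HOURS) for d in dwell_durations_hours]
    return statistics.median(capped)

# Hypothetical example: the 30-hour outlier is capped to 24h before the median.
median_stay = visit_length_hours([0.5, 1.0, 2.0, 30.0])
```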
Dynamic Trade Areas
Dynamic Trade Areas describe the origin of visitors to a specific location of interest based on their HOME or WORK location. The origin area is defined as a Census Block Group (CBG) and the metric is person_fraction, which shows the % breakdown of visitors from each CBG.
Dynamic Trade Areas describe the home and work CBGs for all the visitors to a location.
Methodology Dynamic Trade Areas (aggregation methodology)
Dynamic Trade Areas are based on our sophisticated aggregation algorithm.
To define the person_fraction, we calculate the distinct count of devices seen at each location on a given day.
As a next step, we utilise our Home & Work algorithm to get the home and work origins for the visiting devices.
Then, Unacast's proprietary supply correction and extrapolation is applied to the data to get the number of visitors from each corresponding home and work CBG. Thereafter, we divide the visitor count from each home or work CBG by the total number of visitors to the location to create the person_fraction.
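The final normalisation step can be sketched as follows. This assumes the supply-corrected, extrapolated visitor counts per origin CBG are already available; the function name and input format are illustrative.

```python
def person_fraction(visitor_counts_by_cbg):
    """% breakdown of a location's visitors by home (or work) origin CBG.

    visitor_counts_by_cbg: corrected and extrapolated visitor count per CBG.
    Returns each CBG's share of the location's total visitors.
    """
    total = sum(visitor_counts_by_cbg.values())
    return {cbg: count / total for cbg, count in visitor_counts_by_cbg.items()}

# Hypothetical example: 100 total visitors from three origin CBGs.
fractions = person_fraction({"cbgA": 60, "cbgB": 30, "cbgC": 10})
```

By construction, the fractions for a location sum to 1.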
Dynamic Trade Area Distance
Dynamic Trade Area Distance describes the distance of visitors to a specific location of interest based on their HOME or WORK location. The origin area is defined as a Census Block Group (CBG), and the metrics are person_count and person_fraction.
Dynamic Trade Areas Distance describes the distance visitors travel from their home or work CBGs to a location.
Methodology Dynamic Trade Area Distance (aggregation methodology)
We then derive the distance between the location and the home or work CBGs. We provide this distance in three aggregations:
- _p25: 25th percentile distance from habit area
- _p50: median distance from habit area
- _p75: 75th percentile distance from habit area
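The three distance aggregations can be sketched as one percentile computation. This is an illustrative example assuming per-visitor travel distances have already been derived; the function name, the kilometre unit, and the use of `statistics.quantiles` are assumptions.

```python
import statistics

def distance_percentiles(distances_km):
    """_p25 / _p50 / _p75 of visitor travel distances from home/work CBGs."""
    p25, p50, p75 = statistics.quantiles(distances_km, n=4)
    return {"_p25": p25, "_p50": p50, "_p75": p75}

# Hypothetical example: seven visitors travelling 1 to 7 km.
result = distance_percentiles([1, 2, 3, 4, 5, 6, 7])
```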
Cross Visitation helps you understand customer behaviour in terms of brand shopping preferences. To do so, Cross Visitation shows how many visitors to a specific location also visit competing brands. The metrics of interest here are the absolute person count, the fraction of people compared to other cross-visited brands, and the resulting rank.
Cross Visitation indicates which other brands visitors of a specific location go to.
Cross Visitation Methodology (aggregation methodology)
To determine Cross Visitation, we aggregate the person count for all possible combinations of locations and brands. For example, we take a specific location A and calculate the person count of devices that have been seen both at location A and at locations of each other brand. This leads to a total person count for each location-to-brand combination.
As a next step, we calculate a fraction based on the location-to-brand combination normalised to the total cross-visitation to that location.
More formally, for every given location $\ell$ and brand $b$, the cross-visitation fraction $f_{\ell,b}$ between $\ell$ and $b$ is defined as:

$$f_{\ell,b} = \frac{c_{\ell,b}}{\sum_{b'=1}^{B} c_{\ell,b'}}$$

where $c_{\ell,b}$ denotes the cross-visitation person count between $\ell$ and $b$, and $B$ denotes the total number of brands.
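The normalisation and ranking can be sketched as below. This is an illustrative example for one location; the function name, input format, and tie-breaking behaviour of the rank are assumptions.

```python
def cross_visitation_fractions(person_counts):
    """Normalise location-to-brand cross-visit counts into fractions and ranks.

    person_counts: cross-visitation person count per brand for one location.
    Returns (fraction, rank) per brand, where rank 1 is the most
    cross-visited brand.
    """
    total = sum(person_counts.values())
    fractions = {brand: count / total for brand, count in person_counts.items()}
    # Brands ordered from highest to lowest fraction.
    ranked = sorted(fractions, key=fractions.get, reverse=True)
    return {brand: (fractions[brand], ranked.index(brand) + 1)
            for brand in fractions}

# Hypothetical example: 100 cross-visits split across three brands.
result = cross_visitation_fractions({"BrandX": 50, "BrandY": 30, "BrandZ": 20})
```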
Traffic Patterns represent the periodicity of the signal aggregated over a longer time interval, for day-of-week and hour combinations. The pattern is represented as a fraction of the total observations.
Patterns indicate the traffic over time (e.g., per day of the week and hour) for a given location.
Visitation Patterns Methodology (aggregation methodology)
Traffic Patterns are calculated over a longer aggregation window (a quarter). Over that period, the average person count is calculated per day of week and hour. This person count is then normalised by the total person count and thus reflects a fraction of the total.
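The normalisation described above can be sketched as follows. This is a minimal illustration assuming the quarterly average person counts per (day-of-week, hour) slot are already computed; the function name and input format are hypothetical.

```python
def traffic_pattern(avg_counts_by_slot):
    """Fraction of total traffic per (day-of-week, hour) slot over a quarter.

    avg_counts_by_slot: average person count keyed by (dow, hour) pair.
    """
    total = sum(avg_counts_by_slot.values())
    return {slot: count / total for slot, count in avg_counts_by_slot.items()}

# Hypothetical example: three slots, with Saturday noon the busiest.
pattern = traffic_pattern({("Mon", 9): 10, ("Mon", 12): 30, ("Sat", 12): 60})
```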
Analysis of Machine Learning Model
Validation of target value
One important concept of machine learning is that a model can only be as good as the target it is trained on. Previous validations show that Unacast's historical data is of high quality and sufficient to use as a target for modelling:
Ground Truth Analysis of Unacast Visitation Data
After training, the model achieved the following evaluation scores on the validation set:
- Median Absolute Error:
- R squared:
Analysis of estimated visitation
We evaluated our estimated visitation by comparing Unacast SmartMobility visitation with ground truth from a major sporting goods retailer. For that retailer, we have ground truth data based on sensors at physical locations counting how many people entered the store.
Comparing ground truth to Unacast visitation shows a high correlation.
As we develop and improve our machine learning model throughout the year, we will replace metrics based on our aggregation methodology with the new ML methodology. This ensures a robust, high-quality product as the location data space changes (e.g., less and less data will be available in the coming years).