
Frequently Asked Questions

Data Source

How does Unacast obtain location data for machine learning products?

The data that we process is sourced from our first-party data partners via SDKs embedded in multiple phone applications, with opt-in consent from all users. This data is then processed according to Unacast specifications and passed to us as anonymized, aggregated person counts.

What percentage of the US population do you observe in your GPS data (market share)?

The 1st Party GPS aggregates are at present based on between 800k and 1.5M daily active users. It is important to keep the following in mind when interpreting this number:

  1. Because our data comes from a first party source, we have extremely low occurrence of fraudulent, spoofed, or otherwise junk data.
  2. GPS is not the only data source feeding our model. The Unacast machine learning (ML) algorithm takes into account many different data sources.
  3. We have an understanding of visitation to locations going back to 2018. The daily GPS aggregates are used to understand how today’s visitation reflects or alters historical trends, not as the sole predictor of person counts.

Methodology

What machine learning algorithms do you use for your predictions?

Our ML predictions are produced by using ensemble learning, which is the combination of machine learning models. We call our ensemble SIRIUS after the extremely bright binary star system. SIRIUS is made up of two models, each using a different algorithm:

  1. The main contributor is a Boosted Decision Tree model.
  2. The secondary contributor is a Neural Network. After extensive experimentation, we found that a wide, shallow network combined with the SELU activation function was the best fit. Model training uses the Adaptive Moment Estimation (Adam) optimizer and a stepwise learning rate (LR) scheduling algorithm. (A simplified sketch of this ensemble follows below.)

The advantage of an ensemble approach is that each model covers for the weaknesses of the other. The boosted tree algorithm was found to be the most accurate in terms of Mean Squared Error (MSE); however, decision tree models have a large but finite number of possible outputs, which can result in the model predicting the same value given very similar inputs. Because it is composed of continuous functions, the Neural Network can differentiate between very similar inputs, giving it more flexibility. The tradeoff is that it is slightly less accurate in terms of MSE.
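
For readers who want a concrete picture of such an ensemble, below is a minimal sketch assuming scikit-learn for the boosted tree and PyTorch for the wide, shallow SELU network trained with Adam and a stepwise LR schedule. The network width, learning rate, schedule parameters, and the equal-weight averaging of the two members are illustrative assumptions, not the actual SIRIUS configuration.

```python
# Minimal sketch of a two-member ensemble in the spirit of SIRIUS:
# a boosted decision tree plus a wide, shallow SELU network trained with
# Adam and a stepwise LR schedule. Illustrative only -- the real model,
# features, and member weighting are Unacast-internal.
import numpy as np
import torch
import torch.nn as nn
from sklearn.ensemble import GradientBoostingRegressor


class WideShallowNet(nn.Module):
    """One wide hidden layer with a SELU activation."""
    def __init__(self, n_features: int, width: int = 512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_features, width),
            nn.SELU(),
            nn.Linear(width, 1),
        )

    def forward(self, x):
        return self.net(x).squeeze(-1)


def train_ensemble(X: np.ndarray, y: np.ndarray, epochs: int = 50):
    # Boosted decision tree: the main contributor.
    tree = GradientBoostingRegressor().fit(X, y)

    # Neural network: the secondary contributor.
    net = WideShallowNet(X.shape[1])
    opt = torch.optim.Adam(net.parameters(), lr=1e-3)
    sched = torch.optim.lr_scheduler.StepLR(opt, step_size=10, gamma=0.5)  # stepwise LR
    loss_fn = nn.MSELoss()
    Xt = torch.tensor(X, dtype=torch.float32)
    yt = torch.tensor(y, dtype=torch.float32)
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(net(Xt), yt)
        loss.backward()
        opt.step()
        sched.step()

    def predict(X_new: np.ndarray) -> np.ndarray:
        # Simple average of the two members; the actual weighting is not public.
        with torch.no_grad():
            nn_pred = net(torch.tensor(X_new, dtype=torch.float32)).numpy()
        return (tree.predict(X_new) + nn_pred) / 2

    return predict
```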

What machine learning features or variables are used along with location data?

We analyze and contextualize our aggregated location data using machine learning models. Our models incorporate more than 150 features, and are trained on our long history of high-quality location data. The data falls into five general categories:

  1. 1st Party GPS Related: Aggregated person counts at an H3 hexagon level. Example: the sum of the person counts observed for all hexagons intersecting with the polygon.
  2. Auto-Regressive: Features relating to historical person counts. Example: the average person count observed in this polygon over the preceding 7 days.
  3. Time Related: Variables related to the date. Example: day of the week.
  4. Geometry Related: Features related to the shape of the polygon. Example: the average distance from the vertices of the polygon to the center.
  5. Area Related: Features related to the place where the polygon is located. Example: the percentage of the polygon assigned to different land use codes. (A hypothetical feature-row sketch follows this list.)
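
To illustrate how these five categories could come together for a single polygon/date pair, here is a hypothetical sketch. The function name, column names, input types, and the land-use lookup are invented for illustration and do not reflect Unacast’s internal schema.

```python
# Hypothetical assembly of one feature row from the five categories above.
from datetime import date

import numpy as np
import pandas as pd
from shapely.geometry import Point, Polygon


def build_feature_row(poly: Polygon,
                      target_date: date,
                      hex_counts: pd.Series,   # person counts of intersecting H3 hexagons
                      history: pd.Series,      # past daily person counts, indexed by date
                      land_use_shares: dict) -> dict:
    centroid = poly.centroid
    return {
        # 1st Party GPS related
        "gps_person_count_sum": hex_counts.sum(),
        # Auto-regressive
        "avg_person_count_7d": history.tail(7).mean(),
        # Time related
        "day_of_week": target_date.weekday(),
        # Geometry related
        "avg_vertex_to_center_dist": np.mean(
            [centroid.distance(Point(v)) for v in poly.exterior.coords]
        ),
        # Area related (e.g. {"retail": 0.6, "residential": 0.4})
        **{f"land_use_{k}": v for k, v in land_use_shares.items()},
    }
```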

What are the data preprocessing steps involved in your machine learning pipeline?

Before the features are fed into the model, there is a final normalization and transformation step that depends on the data type. Numeric features are scaled using a standard scaler, Boolean features are mapped to integers, and string features are transformed into categorical inputs using one-hot encoding. The main preprocessing of our data is the clustering and dwelling step. This process is complex and technical, but in short, the data is cleaned, grouped, categorized, and aggregated. The most trustworthy and consistent data we have is then used to build foot traffic counts at a hexagon grid level.
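
A minimal sketch of this per-type normalization, using scikit-learn, might look like the following. The column names are placeholders; the actual feature set and pipeline are Unacast-internal.

```python
# Per-type normalization: scale numerics, map Booleans to integers, one-hot
# encode strings. Column names are assumed for illustration.
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import FunctionTransformer, OneHotEncoder, StandardScaler

numeric_cols = ["gps_person_count_sum", "avg_person_count_7d"]  # assumed names
boolean_cols = ["is_weekend"]
string_cols = ["land_use_primary"]

preprocess = ColumnTransformer([
    ("numeric", StandardScaler(), numeric_cols),                              # scale numerics
    ("boolean", FunctionTransformer(lambda df: df.astype(int)), boolean_cols),  # bool -> int
    ("string", OneHotEncoder(handle_unknown="ignore"), string_cols),          # one-hot encode
])

# preprocess.fit_transform(features_df) would yield the model-ready matrix.
```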

How do you handle missing or incomplete data in your machine learning training process?

We have a very low incidence of missing or incomplete data. Given that every prediction request must contain a date and a polygon, we can guarantee that these two classes of features (time related and geometry related) will always be present. Likewise, we have a complete map of all of our area related features across the United States. When it comes to the GPS related and the dependent autoregressive features, missing data actually represents valuable information for our model. It tells us that we did not observe any devices in these hexagons. Therefore, when we have missing GPS features, we fill in the missing values with a zero.
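
As a rough illustration of this zero-fill rule (with assumed column names), the GPS-derived and autoregressive features could be filled as follows:

```python
# Zero-fill for GPS-derived features: a missing value means no devices were
# observed, so zero is the semantically correct fill. Column names are placeholders.
import pandas as pd


def fill_missing_gps_features(features: pd.DataFrame) -> pd.DataFrame:
    gps_cols = ["gps_person_count_sum", "avg_person_count_7d"]  # assumed names
    return features.assign(**{c: features[c].fillna(0) for c in gps_cols})
```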

How do you address bias concerns in your machine learning predictions?

Bias is always a concern when dealing with machine learning models. Because they can often be black boxes, it can be difficult to detect biases after training, so it is important to consider these problems beforehand. We have identified a few potential biases and attempted to mitigate them where possible.

  1. Sampling Bias: GPS data is, by its very nature, biased towards people with cellphones, and specifically smartphones. It can therefore have a sampling bias towards younger, more affluent groups, which has the potential to be reflected in poor predictions for older or lower-income areas. Unacast has attempted to address this bias in two ways. Firstly, our 1st party data provider has SDKs in more than one application, meaning that we are more likely to capture a wider array of users. Secondly, we incorporate features other than simply daily GPS pings into our model. For example, using autoregressive features means that we consider data over a larger time window, and are therefore more likely to capture activity from rarer groups. That being said, this bias can never be fully eliminated, and the potential impact of sampling bias should always be accounted for.
  2. Location Bias: Unacast has identified a few different kinds of location bias. The first is attention: some types of locations do not require the full attention of a visitor, for example waiting in line at the Department of Motor Vehicles. In this situation, a user is more likely to access their phone, and therefore generate more GPS signals, than in a location that requires them to be engaged at all times. Unacast attempts to address this by getting data from both foreground- and background-collecting apps. The second location bias is length of stay, specifically around short-stay locations. For instance, we know that gas stations are particularly hard to predict, as customers often do not stay long enough to generate data. This is addressed via longer collection windows. Finally, larger polygons provide a larger collection area, meaning there is a bias towards larger locations. We tackle this by excluding very small polygons, although our model still performs well on small polygons. Like sampling bias, location bias cannot be fully eliminated.
  3. Population Bias: The final bias we have identified is a bias towards more populated areas. For privacy purposes, we remove identifiers with homes inside of hexagons that have too few residents. This is to prevent the re-identification of devices even after aggregation, and it means our model does not perform as well in areas with extremely low, widespread populations. This bias is intentional and will not be removed, in order to protect the identities of the identifiers in our sample. (A simplified sketch of this population filter follows below.)

As with any model or aggregation method, there are likely many more biases. Unacast’s attitude towards biases is to treat them as problems to be identified and mitigated, rather than concealed. If you think you have detected a bias in our data, please reach out and our data team will attempt to create mitigating measures.
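
As a rough illustration of the population filter mentioned under Population Bias, a minimum-residents cutoff could be applied along the lines below. The threshold value, function name, and column names are assumptions; Unacast’s actual cutoff is not public.

```python
# Hedged sketch: drop identifiers whose home hexagon has fewer than some
# minimum number of residents. Threshold and column names are illustrative.
import pandas as pd

MIN_RESIDENTS_PER_HEX = 30  # illustrative threshold, not Unacast's actual value


def apply_population_filter(identifiers: pd.DataFrame,
                            hex_population: pd.Series) -> pd.DataFrame:
    """identifiers: one row per identifier with a 'home_hex' column.
    hex_population: residents per hexagon, indexed by hexagon id."""
    resident_counts = identifiers["home_hex"].map(hex_population)
    return identifiers[resident_counts >= MIN_RESIDENTS_PER_HEX]
```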

What is the dwell time to be counted as a visitor in machine learning products?

In order to count as a dwell, we must have:

  1. Multiple pings from the same identifier
  2. Within a 50m radius
  3. Lasting longer than 60 seconds

This is about the speed at which one might walk around a grocery store. Dwell assignment is the same no matter the polygon (small venue, big venue, CBG, etc.). A simplified sketch of this rule follows.
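
Below is a hedged sketch of the dwell rule: multiple pings from one identifier, all within a 50 m radius, spanning more than 60 seconds. The data structure, the use of the first ping as the radius anchor, and the haversine helper are simplifications for illustration, not Unacast’s pipeline.

```python
# Simplified dwell check: multiple pings, within ~50 m, lasting > 60 s.
from dataclasses import dataclass
from math import asin, cos, radians, sin, sqrt
from typing import List


@dataclass
class Ping:
    lat: float
    lon: float
    timestamp: float  # unix seconds


def haversine_m(a: Ping, b: Ping) -> float:
    """Great-circle distance between two pings in meters."""
    dlat, dlon = radians(b.lat - a.lat), radians(b.lon - a.lon)
    h = sin(dlat / 2) ** 2 + cos(radians(a.lat)) * cos(radians(b.lat)) * sin(dlon / 2) ** 2
    return 2 * 6_371_000 * asin(sqrt(h))


def is_dwell(pings: List[Ping], radius_m: float = 50.0, min_seconds: float = 60.0) -> bool:
    if len(pings) < 2:                                  # need multiple pings
        return False
    anchor = pings[0]                                   # simplification: first ping as anchor
    within_radius = all(haversine_m(anchor, p) <= radius_m for p in pings)
    duration = max(p.timestamp for p in pings) - min(p.timestamp for p in pings)
    return within_radius and duration > min_seconds
```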

How does Unacast identify home vs. work locations?

We identify home and work locations via a confidence-based method. Every week, we assign each individual identifier a home location based on where they resided overnight most often, and a work location based on where they spent most of their time during weekday working hours. These initial assignments are given a confidence score based on the following:

  1. How often were they in the assigned home or work location during the appropriate hours? Were they home every night, or only 4/7 nights?
  2. How often did we see the identifier? Is the assignment based on only one observation or many?
  3. Do the assigned home and work locations match the home and work locations this identifier was given in previous weeks? As a rule of thumb, we must observe the same identifier with the same home and work location for multiple weeks before we confidently assign them a home and work location. (A simplified sketch of this weekly assignment follows below.)
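
Below is a speculative sketch of what a confidence-scored weekly home assignment could look like for a single identifier. The scoring weights, the seven-night normalization, and the field names are invented; Unacast’s actual method also covers work locations and is more involved.

```python
# Illustrative confidence-based weekly home assignment for one identifier.
from collections import Counter
from typing import List, Optional, Tuple


def assign_home(overnight_hexes: List[str],      # hexagon observed each night this week
                prior_week_homes: List[str]) -> Tuple[Optional[str], float]:
    """Return (home_hexagon, confidence) for one identifier for one week."""
    if not overnight_hexes:
        return None, 0.0
    counts = Counter(overnight_hexes)
    home, nights = counts.most_common(1)[0]
    share_of_nights = nights / 7                                   # e.g. home 4/7 nights
    observation_factor = min(len(overnight_hexes) / 7, 1.0)        # how often we saw them
    consistency = prior_week_homes.count(home) / max(len(prior_week_homes), 1)  # past weeks
    confidence = 0.5 * share_of_nights + 0.2 * observation_factor + 0.3 * consistency
    return home, confidence
```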

Validation

How does Unacast validate the accuracy of machine learning powered datasets?

We have developed a three-step system to validate our models.

  1. Model Loss: The first step in ensuring we have an accurate model is to minimize the mean squared error loss during model training.
  2. Exit Criteria: This refers to additional testing we perform after a model has finished training, but before it goes into production. These are tests that cannot be used as a model loss function, but that still need to pass before we can deploy a model. For example, in an ill-fitting model the best option for a venue might be to predict the mean value for that venue every day. We call these flat-lines, and we require that fewer than 1% of venues flat-line before moving to production. Another example is the ranking of zip codes in our Trade Areas dataset: beyond just the difference between target and prediction, we also require that the relative ranking of zip-code importance be strongly correlated between our predictions and the target values. (A simplified sketch of these checks appears after this list.)
  3. Story Based: Story based validation is performed after a model has reached the production-ready stage, and involves testing to see how that model performs in specific situations where we know what sort of behavior to expect. We may not always know the exact foot traffic numbers, but we know how the values should behave. For example, we should show a clear and obvious drop in visitation at the start of the COVID-19 pandemic. In locations that are only open on specific days, such as stadiums and concert venues, we should see visitors on those days and no others. One of our favorite internal tests is the “Chick-Fil-A test” (Chick-Fil-A is notoriously closed on Sundays).

Finally, when available, Unacast employs truth sets. When validated against ground truth data, such as the foot traffic to a popular nationwide clothing retailer, Unacast’s models recorded an R-Squared of 91.6% or higher, widely considered to be best-in-class. It is possible to use more than just foot traffic as ground truth. For example, sales data, card swipes, and loyalty program information can all be used as ground truth. If you have a dataset you would like to see evaluated against Unacast data, please reach out and we will be happy to perform the analysis.
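
To make the exit-criteria step above more concrete, here is a hedged sketch of two such checks: the share of flat-lining venues and the rank correlation of zip-code importance. The tolerance, thresholds, and data layout are assumptions for illustration.

```python
# Sketch of two exit-criteria style checks: flat-line share and zip-code rank
# correlation. Thresholds and data layout are illustrative assumptions.
import pandas as pd
from scipy.stats import spearmanr


def flat_line_share(preds: pd.DataFrame, tol: float = 1e-6) -> float:
    """preds: rows = days, columns = venues. A venue flat-lines if its daily
    predictions never move away from its own mean."""
    deviations = (preds - preds.mean()).abs().max()
    return float((deviations < tol).mean())


def zip_rank_correlation(predicted_importance: pd.Series,
                         target_importance: pd.Series) -> float:
    """Spearman correlation of zip-code rankings between prediction and target."""
    rho, _ = spearmanr(predicted_importance, target_importance)
    return float(rho)


# Example deployment gate (thresholds illustrative):
# assert flat_line_share(preds) < 0.01
# assert zip_rank_correlation(pred_imp, target_imp) > 0.9
```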

How do you ensure the reliability of machine learning models?

Unlike typical aggregated products that rely solely on aggregating the underlying GPS device-level supply, our machine learning model is more robust because it is less dependent on GPS data fluctuations.

Why not just use a GPS based approach?

While privacy-friendly 1st party GPS data is an important input, GPS data alone has some weaknesses:

  1. Sampling Group: Some location data providers are overly reliant on a single app. If that data is high quality and dense, it could be enough to power a product, but if that app happens to mostly be used by one specific group, then you will end up with sampling error. For example, if your data comes from a family security app, you might see more observed visitation to schools and less to bars and clubs.
  2. Supply Fluctuation: Even if you have a broad supply base, there are still fluctuations in the number of devices seen on a daily basis. It can be difficult to tell from pure GPS data if what you are observing is a drop in actual visitation, or just a drop in supply for that day. As more and more states begin to pass consumer privacy laws, this problem will only grow.
  3. Fraud: In recent years there has been an increase in untrustworthy or fake GPS data in the marketplace. This often takes the form of “replay” data: data which reflects real activity from real people, but shifted in time. This data can be hard to detect, but it shows up in heavily seasonal locations, such as a spike of foot traffic on a ski slope in July. (Fun fact: Unacast was the first to discover this trend and to develop ways of detecting it.)

Because of these issues, Unacast believes that a machine learning approach, combined with 1st party data, is the best option. Because our data comes from a 1st party, we know that it comes from a broad selection of apps. We can also be sure that our data is not fraudulent, since we know the exact source. Because we process all the data on the app owner’s side, we can apply supply correction algorithms and aggregate the data before it reaches us; we never receive precise individual locations, which keeps us outside the scope of many privacy laws. The downside of this approach is a lower market share, since we do not purchase from large aggregators. Luckily, our ML approach allows us to operate with a much smaller, more trustworthy dataset.

How do we ensure accuracy in dense or stacked locations? (dense urban areas/malls/etc.)

  1. In our Safegraph dataset, we fence individual stores within shopping malls and similar shopping venues. We also fence shopping venues in their entirety so that we can report on visitors to the mall in general. Note: for non-Safegraph customers, we only report the total for the mall.
  2. We assign a polygon confidence score to each location to indicate whether our predictions are likely to be specific to the context described by the Place or to the general location. Most store locations exist outside of multi-floor buildings, so we are confident that our audiences still accurately capture visitors while eliminating potential noise. Unacast is continually enhancing its platform with new capabilities; as technologies supporting improved vertical accuracy become available, Unacast may evaluate and add such technologies to its platform.

What is our location signal precision/accuracy for the machine learning product?

The following are the most important aspects in relation to the quality of our predictions:

  1. We source signal data only from trusted partners. This means we have good visibility into details of collection methodology on devices, which is important for us in interpreting the precision and accuracy of each signal observation.
  2. Location data collected on mobile devices provides probabilistic information about signal accuracy, which we use to ensure the signal is compatible with the resolution/fidelity of the specific behavior we aim to detect. We never assume this to be of higher resolution than 10-20m.
  3. We use proprietary models on signal data to detect details about the location and context to which a visitation/dwelling activity was related.
  4. The number of people doing an activity impacts precision/accuracy: with lower activity levels, we are more likely to absorb larger shares of false positives in our predictions. The likelihood of noise increases with density and uncorrelated behaviors.
  5. Our precision/accuracy for understanding what happens inside a polygon will be higher than for understanding actions related to a specific context within the polygon. Example: went into the Starbucks vs. only talked to someone outside the entrance.

Privacy

Does Unacast have the right to use this data?

Yes, we have full rights to use this data. We abide by all applicable laws and regulations and hold consumer data privacy in high regard. All of Unacast’s data providers have warranted that the data they collect and provide to Unacast is done in accordance with all applicable laws.

What is Unacast’s Privacy Policy?

Unacast processes and aggregates data from multiple providers. This assures the breadth and veracity of the data we supply to our customers, who use Unacast’s data to power everything from advertising, consumer insights, and competitive intelligence, to business development strategies and operational efficiencies. We take consumer privacy seriously and ensure that our data platform remains fully transparent and compliant with industry and legal requirements. Critical to this endeavor is ensuring that our data suppliers comply with all applicable privacy laws.

How does Unacast protect consumer privacy?

  1. Data in aggregate: All Unacast products are built on aggregated data from multiple sources, location signals, and devices.
  2. No real-time tracking: Unacast processes its location data with a 48-72 hour delay. No location data is received (or processed) in real-time.
  3. No universal tracking: Unacast creates and maintains latitude/longitude based geo-fences around commercial locations of interest only – like shopping, dining, entertainment or event destinations.
  4. No sensitive places: Unacast applies its own privacy-enhancing technology, PrivacyCheck, to all of its products to ensure that any data generated by consumer mobile devices while visiting sensitive locations (like schools, healthcare facilities, and churches) is never used, shared, or resold.
  5. Consumer opt-out: Unacast honors mobile users’ requests to not accept location data from their mobile devices – and we share those requests with our partners.
  6. CCPA: The CCPA refers to the California Consumer Privacy Act, which went into effect on January 1, 2020. The CCPA grants California residents certain rights with respect to the collection and sale of their Personal Information.
  7. GDPR: The GDPR, or General Data Protection Regulation, is currently in effect. It applies to non-EU organizations to the extent they offer goods or services to EU residents or monitor the behavior of EU residents.