Data Normalization & Best Practices


Research Visits Feeds Only

The information in this document pertains to Foursquare's Research Visits Feeds only.

How Normalization Works

Foursquare projects ‘normalized visits’ based on observed visits from the millions of consumers in our always-on foot traffic panel.

We apply a weighting to each observed visit. That weighting is based on state, age, and gender, and is therefore referred to as a ‘SAG’ score. By accounting for age, gender, and regional skews within the panel, we are able to estimate real-world trends.

We recommend using normalized visits rather than ‘raw’ observed visits. The raw panel visitation in our feeds will show demographic skew and is susceptible to step changes in the scale of our panel. Weightings are adjusted to account for fluctuations in the size of our panel, as well as other technical factors that occur periodically (such as changes to how we calculate a visit in our SDK due to an OS update). Normalization therefore de-biases our data and controls for fluctuations in panel size.

Why It's Important

Normalized visits should always be used when compiling visit trend analyses, to mitigate the effects of changes to our panel user base (both increases and decreases). Raw visits will show significant changes in the volume of data over time, due to a variety of factors including:

  • Changes to our check-in methodology and the attributes used in our Pilgrim SDK
  • Monthly active user (MAU) changes in our partners’ apps due to new version releases, new features, user acquisition efforts such as press and marketing, etc.
  • OS updates such as iOS 13
  • Changes in our panel (e.g. new apps contributing to our panel)
  • Changes in consumer behaviors related to world events like COVID-19

Example

On a given day, we see three panel visits to a Starbucks in New York from users who are female, aged 20-24, and living in New York State (Cohort A), and four visits from users who are male, aged 35-39, and also living in New York State (Cohort B).

We have 500 active panelists in Cohort A, and the United States Census tells us there are 50,000 people in that demographic. All panelists in that demographic will carry a SAG score of 100 (50,000 / 500). We have 200 active panelists in Cohort B, and the Census tells us there are 100,000 people in that demographic. Panelists in Cohort B will each carry a SAG score of 500 (100,000 / 200).

Each of Cohort A’s visits will constitute 100 population visits, and each of Cohort B’s will constitute 500.
So, total normalized visit volume on that day to that Starbucks in New York would be:
(100 × 3) + (500 × 4) = 2,300
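
The worked example above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not Foursquare's implementation; the function names `sag_score` and `normalized_visits` and the data layout are assumptions made for this sketch.

```python
def sag_score(census_population: int, active_panelists: int) -> float:
    """Number of real-world people each panelist in a cohort represents."""
    if active_panelists <= 0:
        raise ValueError("cohort has no active panelists")
    return census_population / active_panelists


def normalized_visits(observed: list[tuple[int, float]]) -> float:
    """Sum raw visit counts, each weighted by its cohort's SAG score."""
    return sum(raw_visits * score for raw_visits, score in observed)


score_a = sag_score(50_000, 500)    # Cohort A: 100.0
score_b = sag_score(100_000, 200)   # Cohort B: 500.0

# Three raw visits from Cohort A and four from Cohort B.
total = normalized_visits([(3, score_a), (4, score_b)])
print(total)  # 2300.0
```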

Because of how SAG scores are derived, raw visits and normalized visits will not necessarily follow the same trend at a given point in time. For instance, it is possible for normalized visits to increase or remain stable while raw visits decrease. SAG scores not only de-bias our data; they also scale and stabilize total normalized visit volume. With any backend panel change, we may see an increase or decrease in raw active user volume, which is usually met with a corresponding increase or decrease in raw visit volume. Accordingly, the remaining users’ SAG scores will increase when the panel shrinks, or decrease when it grows, in order to stabilize our projected normalized visit volume and counteract the panel-specific changes in visit volume.
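
The stabilizing effect can be illustrated with a small Python sketch. It reuses Cohort A's census population of 50,000 from the example and makes two simplifying assumptions that are not from the source: every panelist makes the same number of visits, and that behavior does not change when the panel changes. The function name `project_cohort` is hypothetical.

```python
CENSUS_POPULATION = 50_000  # Cohort A's population from the example


def project_cohort(active_panelists: int, visits_per_panelist: int):
    """Return (raw visits, SAG score, normalized visits) for one cohort."""
    raw_visits = active_panelists * visits_per_panelist
    sag = CENSUS_POPULATION / active_panelists
    return raw_visits, sag, raw_visits * sag


# Assumed behavior: each panelist makes 2 visits in the period.
before = project_cohort(active_panelists=500, visits_per_panelist=2)
after = project_cohort(active_panelists=250, visits_per_panelist=2)

print(before)  # (1000, 100.0, 100000.0)
print(after)   # (500, 200.0, 100000.0)
```

When the panel halves, raw visits halve and the SAG score doubles, so the projected normalized volume is unchanged under these assumptions.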

Best Practices for Monthly Address-level Analysis

Thresholds and Aggregation

Normalized visits may show some volatility due to the nature of our normalization process. For example, we may see a visit from a user with a large weighting (SAG score) because that user is part of a highly underrepresented demographic in our panel. This can lead to occasional anomalous spikes in visits, sometimes even for venues with a large number of raw visits. The volatility is particularly evident at granular reporting levels (e.g. venue-level, daily), where visit volume may be sparser. For reasonably stable patterns, Foursquare recommends analyzing visits by venue and month rather than using daily or weekly metrics.

Minimum Raw Visit Thresholds

Minimum raw visit thresholds should be used to determine feasibility. At low raw visit volumes, normalized visit trends can become noisier and less reliable. By assessing variation in venue level patterns against overall chain and category level patterns, we can further expand the number of feasible venues with a lower raw minimum, while excluding venues with potentially volatile, anomalous monthly trending over time.
The following thresholds are derived by assessing deviations of foot traffic indices at the venue-level against broader chain-level indices, flagging venues that have unreasonably high average deviations from chain patterns over the last 12 months.

Monthly Visit Feeds* | Minimum Raw Visits | Maximum Variation
High Confidence | > 75 average raw visits/month | Raw Visits: monthly standard deviation < 100% of mean; SAG Score: standard deviation < 350
Medium Confidence | > 30 and <= 75 average raw visits/month | Raw Visits: monthly standard deviation < 70% of mean; SAG Score: mean < 300 and standard deviation < 300
Low Confidence | Venues that do not meet the above criteria |
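
As a rough sketch, the thresholds in the table above could be applied per venue as follows. The function name `confidence_tier`, the input layout, and the use of the sample standard deviation are assumptions made for illustration; the source does not specify how the statistics are computed. The sketch also assumes that a venue with more than 75 average raw visits that fails the High Confidence variation limits falls to Low Confidence.

```python
from statistics import mean, stdev


def confidence_tier(monthly_raw_visits: list[float], sag_scores: list[float]) -> str:
    """Classify a venue as 'high', 'medium', or 'low' confidence."""
    if len(monthly_raw_visits) < 2 or len(sag_scores) < 2:
        return "low"  # too little data to measure variation

    raw_mean = mean(monthly_raw_visits)
    if raw_mean <= 0:
        return "low"
    # Monthly standard deviation of raw visits as a fraction of the mean.
    raw_variation = stdev(monthly_raw_visits) / raw_mean
    sag_mean, sag_std = mean(sag_scores), stdev(sag_scores)

    if raw_mean > 75 and raw_variation < 1.00 and sag_std < 350:
        return "high"
    if 30 < raw_mean <= 75 and raw_variation < 0.70 and sag_mean < 300 and sag_std < 300:
        return "medium"
    return "low"


print(confidence_tier([100, 110, 90], [100, 120, 80]))   # high
print(confidence_tier([40, 50, 60], [100, 150, 200]))    # medium
print(confidence_tier([10, 12, 8], [100, 120, 80]))      # low
```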

These thresholds are provided as a general guideline for a broad set of large chains and categories. When analyzing a specific set of venues, we may recommend applying different thresholds, depending on the vertical and geographic skews of those venues. As a final filter, Foursquare also recommends omitting any individual SAG scores that exceed 1,000 before aggregating to the venue-month level.
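
The final filter might look like the following sketch, which drops visits whose SAG score exceeds 1,000 and then sums the remaining scores by venue and month. The record layout (venue ID, ISO date string, SAG score) and the function name `monthly_normalized_visits` are hypothetical and are not the feed's actual schema.

```python
from collections import defaultdict

MAX_SAG_SCORE = 1_000  # omit individual scores above this value


def monthly_normalized_visits(visits):
    """Aggregate (venue_id, 'YYYY-MM-DD', sag_score) records to venue-month totals."""
    totals = defaultdict(float)
    for venue_id, visit_date, sag in visits:
        if sag > MAX_SAG_SCORE:
            continue  # skip outlier weights before aggregating
        month = visit_date[:7]  # 'YYYY-MM'
        totals[(venue_id, month)] += sag
    return dict(totals)


sample = [
    ("venue_1", "2020-03-01", 100.0),
    ("venue_1", "2020-03-15", 500.0),
    ("venue_1", "2020-03-20", 1500.0),  # exceeds 1,000, so it is omitted
    ("venue_1", "2020-04-02", 200.0),
]
print(monthly_normalized_visits(sample))
# {('venue_1', '2020-03'): 600.0, ('venue_1', '2020-04'): 200.0}
```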

