Documentation

Google Correlate is an experimental new tool on Google Labs which lets you use the same methodology and data as Google Flu Trends.

Google Correlate is like Google Trends in reverse. With Google Trends, you type in a query and get back a series of its frequency (over time, or in each US state). With Google Correlate, you enter a data series (the target) and get back queries whose frequency follows a similar pattern.

Trends: “mittens” → [time series]

Correlate: [time series] → “mittens”

## Correlated Queries

When you upload a data set (a time series, for instance), Google Correlate will compute the Pearson Correlation Coefficient (r) between your time series and the frequency time series for every query in our database. Correlation coefficients range from r=-1.0 to r=+1.0. The queries that Google Correlate shows you are the ones with the highest correlation coefficient (i.e. closest to r=1.0).
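The ranking step described above can be sketched in a few lines of Python. This is only an illustration of the Pearson formula, not Google Correlate's implementation; the function names `pearson_r` and `rank_queries` and the toy query data are made up for the example.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length, at least 2 long")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    cov = sum(a * b for a, b in zip(dx, dy))
    var_x = sum(a * a for a in dx)
    var_y = sum(b * b for b in dy)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_x * var_y)

def rank_queries(target, query_series):
    """Return (r, query) pairs sorted from most to least correlated."""
    scored = [(pearson_r(target, series), query)
              for query, series in query_series.items()]
    return sorted(scored, reverse=True)

# Hypothetical toy data: a target series and three invented query series.
target = [0.9, 0.5, 0.1, 0.5, 0.9, 0.5, 0.1, 0.5]
query_series = {
    "ski boot": [9, 6, 1, 4, 8, 5, 2, 5],
    "boat trailer": [1, 5, 9, 6, 2, 4, 8, 5],
    "tax forms": [3, 3, 4, 9, 3, 3, 4, 3],
}

ranking = rank_queries(target, query_series)
for r, query in ranking:
    print(f"{r:+.4f}  {query}")
```

The query at the top of `ranking` is the one whose series moves most closely with the target; a series that moves in the opposite direction ends up at the bottom with a negative r.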

For example, suppose your data set is a sine wave with a one-year period running from 2003 to 2010, scaled so that it reaches 0.0 at each summer solstice and 1.0 at each winter solstice.

| Week | Value |
|---|---|
| 1/05/03 | 0.983 |
| 1/12/03 | 0.965 |
| 1/19/03 | 0.939 |
| 1/26/03 | 0.907 |
| 2/02/03 | 0.869 |
| 2/09/03 | 0.826 |
| 2/16/03 | 0.778 |
| ... | ... |

This could be generated in Excel using the SIN or COS function. You can download a complete spreadsheet for this data set here.
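If you would rather script the series than build it in a spreadsheet, the following Python sketch generates a similar weekly wave. The reference date (December 21, 2002), the 365.25-day period, and the names `winter_wave` and `weekly_series` are assumptions chosen for illustration; with these choices the first rows agree with the sample values shown above to three decimal places.

```python
import math
from datetime import date, timedelta

WINTER_SOLSTICE = date(2002, 12, 21)  # assumed date on which the wave peaks at 1.0
PERIOD_DAYS = 365.25                  # assumed length of one yearly cycle

def winter_wave(day):
    """Value in [0, 1]: 1.0 at the winter solstice, 0.0 half a year later."""
    phase = 2 * math.pi * (day - WINTER_SOLSTICE).days / PERIOD_DAYS
    return (1 + math.cos(phase)) / 2

def weekly_series(first_sunday, last_day):
    """One (date, value) row per week, starting at first_sunday."""
    rows = []
    day = first_sunday
    while day <= last_day:
        rows.append((day, round(winter_wave(day), 3)))
        day += timedelta(days=7)
    return rows

rows = weekly_series(date(2003, 1, 5), date(2010, 12, 31))

# Print tab-separated rows without a header, ready to paste into a spreadsheet.
for day, value in rows[:7]:
    print(f"{day.month}/{day.day:02d}/{day:%y}\t{value:.3f}")
```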

To get your data into Google Correlate, you have two main options:

1. Copy/paste from a spreadsheet program (Excel, OpenOffice, …)

We’ll use the copy/paste technique here. To get started, select all the cells of interest in your spreadsheet program and copy them. Then click “Enter your own data” on Google Correlate and switch to the “Time Series” tab. Click on the spreadsheet there and hit Control-V (Command-V on a Mac) to paste in the data:

Copy data from a spreadsheet program and paste it into Google Correlate. Be sure to leave off the header row and give your data series a name!

Then click “Search correlations” at the bottom to find queries whose time series are correlated. Here are the top queries that come back when you search for this time series:

0.9483  alpine touring

0.9439  nordica

0.9381  volkl

0.9339  colds

0.9329  hockey arena

0.9329  fritschi

0.9290  obermeyer

0.9289  telemark boot

0.9270  wedding soup

0.9267  ski boot

The numbers are the Pearson Correlation Coefficients. r=0.9483 indicates a very good fit between the target data and the query time series. Here’s what ‘alpine touring’ looks like next to the time series that we uploaded:

A few things to note here:

• Any query containing ‘alpine touring’ will contribute to the time series for ‘alpine touring’. This includes queries like ‘alpine touring skis’, ‘alpine touring vacations’, etc.
• The data is aggregated to weekly counts. Each week runs from Sunday through Saturday. The point for 2006/01/01, for example, includes queries from the start of Sunday, January 1, 2006 to the end of Saturday, January 7, 2006. Google Correlate contains data starting from January 5, 2003 (the first Sunday of 2003).
• The vertical grid lines mark the beginning of each year.
• The units on the y-axis are standard deviations away from the mean. Each time series is normalized so that its mean is 0.0 and its standard deviation is 1.0. This puts all series on the same scale so that they’re easier to compare. This also explains why the ‘Winter Wave’ time series ranges from -1.4 to +1.4, even though the input series only ranged from 0 to 1.
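The normalization in the last point is easy to reproduce. The sketch below rescales a series to mean 0.0 and standard deviation 1.0; the use of the population standard deviation and the name `normalize` are assumptions for illustration. A wave that ranges from 0 to 1 has a standard deviation of about 0.35, so after normalization it ranges from roughly -1.4 to +1.4.

```python
import math

def normalize(values):
    """Rescale a series to mean 0.0 and standard deviation 1.0 (z-scores)."""
    n = len(values)
    mean = sum(values) / n
    # Population standard deviation (divide by n); this choice is an assumption.
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    if std == 0:
        raise ValueError("a constant series cannot be normalized")
    return [(v - mean) / std for v in values]

# A wave ranging from 0 to 1, sampled at 52 points per cycle over 8 cycles.
wave = [(1 + math.cos(2 * math.pi * k / 52)) / 2 for k in range(52 * 8)]
z = normalize(wave)

print(round(min(z), 2), round(max(z), 2))
```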

## Negative Correlations

Google Correlate only shows you positive correlations. But sometimes the negative correlations can be just as interesting. If you want to see queries which are negatively correlated with your data, just multiply your input data by -1 in your spreadsheet program before uploading it to Google Correlate.

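Here are the top results for the negated ‘Winter Wave’ time series. The values shown are correlations with the negated data, so each corresponds to a negative correlation of the same magnitude with the original series: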

0.9729  boat trailer

0.9664  trumpet vine

0.9630  golf course

0.9626  rotary mower

0.9618  gary fisher

0.9603  deck railing

0.9597  used bikes

0.9590  pig roast

0.9578  bike carrier

0.9577  course rating

So the time series for the query ‘boat trailer’ had a correlation of r=-0.9729 with the original ‘Winter Wave’ time series. As you might expect, the queries which are negatively correlated with winter are summer queries.

## Holdouts and Missing Data

Sometimes you don’t have a complete time series or would prefer to hold out a portion of your data for testing. You can accomplish this in Google Correlate by putting blank values in your data when you upload it:

For example, here is the Winter Wave time series with 2006 and 2007 withheld:

If you look closely, you can see that the blue line has a gap between the end of 2005 and the start of 2008. When computing correlations, Google Correlate ignores these weeks in the time series for candidate queries as well. This means that if you build a model for your time series using query data, you can use the held-out portion of the time series as a test set.

Removing selected weeks from uploaded data sets is a general technique which can be used for other purposes as well. For instance, if your uploaded data has a large spike over a small time period, that spike may have a large (and unwanted) influence on the results. If you withhold the spiking weeks from your data set, you can remove their influence entirely.
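The effect of withholding weeks can be illustrated with a small Python sketch. Blank weeks are represented as `None` here, which is an encoding chosen for the example, and `pearson_r_with_holdout` is a hypothetical helper, not part of Google Correlate. The correlation is computed only over weeks where the uploaded series has a value.

```python
import math

def pearson_r_with_holdout(target, candidate):
    """Pearson r using only the weeks where the target has a value.

    Held-out (blank) target weeks are represented as None; this encoding
    is an assumption made for illustration.
    """
    pairs = [(t, c) for t, c in zip(target, candidate) if t is not None]
    if len(pairs) < 2:
        raise ValueError("need at least two non-blank weeks")
    n = len(pairs)
    mean_t = sum(t for t, _ in pairs) / n
    mean_c = sum(c for _, c in pairs) / n
    cov = sum((t - mean_t) * (c - mean_c) for t, c in pairs)
    var_t = sum((t - mean_t) ** 2 for t, _ in pairs)
    var_c = sum((c - mean_c) ** 2 for _, c in pairs)
    if var_t == 0 or var_c == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / math.sqrt(var_t * var_c)

# Made-up data: the uploaded series has a one-week spike that the
# candidate query series does not share.
candidate = [1, 2, 3, 4, 5, 6, 7, 8]
with_spike = [1, 2, 3, 4, 100, 6, 7, 8]
spike_withheld = [1, 2, 3, 4, None, 6, 7, 8]

r_spike = pearson_r_with_holdout(with_spike, candidate)
r_withheld = pearson_r_with_holdout(spike_withheld, candidate)
print(f"with spike: {r_spike:.3f}, spike withheld: {r_withheld:.3f}")
```

In this toy case the single spiking week drags the correlation down sharply, while withholding it lets the remaining weeks match almost perfectly.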

## Building a Model with Query Data

Note: Statistical modeling is a fine art. This example is presented simply as a demonstration of what’s possible, not as a demonstration of good modeling techniques.

Having found queries which are correlated with the winter, we can use them to build a model. Using the Winter Wave with holdout, we get a list of queries whose time series are correlated with the winter. If you click “Export data as CSV” on that page, you’ll get a CSV file containing weekly time series for the top few results.

You can import this data into a spreadsheet or your favorite numerical analysis tool to do the modeling. For example, in this spreadsheet, we built a very simple model by summing up the time series for the 20 most highly correlated queries. We then computed the Pearson Correlation Coefficient between the target time series and the model estimates on the holdout period (2006-2007) and obtained r=0.979. This indicates that the query data was able to predict previously unseen real-world data.
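A rough Python equivalent of this sum-and-evaluate approach is sketched below. The data is synthetic (a wave plus small deterministic wiggles standing in for exported query series), the helper names are hypothetical, and only three series are summed instead of 20, so the resulting score says nothing about real query data.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

def sum_model(query_series):
    """Very simple model: add the query series together, period by period."""
    return [sum(period) for period in zip(*query_series)]

# Synthetic target: 4 cycles of 12 periods each. The middle two cycles
# (indices 12..35) play the role of the held-out test set.
PERIODS = 48
target = [math.cos(2 * math.pi * k / 12) for k in range(PERIODS)]
holdout = range(12, 36)

# Synthetic stand-ins for exported query series: the target plus a small
# deterministic wiggle that differs per series.
queries = [
    [t + 0.2 * math.sin(1.7 * (j + 1) * k) for k, t in enumerate(target)]
    for j in range(3)
]

estimates = sum_model(queries)

# Score the model only on the held-out periods.
r_holdout = pearson_r([target[k] for k in holdout],
                      [estimates[k] for k in holdout])
print(f"holdout r = {r_holdout:.3f}")
```

Because Pearson correlation is unaffected by scale, the sum does not need to be divided by the number of series before it is compared with the target.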

Of course, there are better ways to model whether it’s winter in the United States. But it is interesting that we can do so exclusively with query data. A similar sequence turned influenza data from the CDC into Google Flu Trends and there are no doubt other time series which can be modeled in a similar way.

## Correlate by States

The examples thus far have worked exclusively with time series. Google Correlate can also find queries whose popularity correlates with a data set across space rather than time.

As a simple example, let’s create a data set which is 1 for every state in New England but 0 for all other states:
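One way to produce such a data set is with a short Python script. The use of full state names and tab-separated output is an assumption made for illustration; check which state labels the upload form actually expects before pasting.

```python
NEW_ENGLAND = {
    "Connecticut", "Maine", "Massachusetts",
    "New Hampshire", "Rhode Island", "Vermont",
}

# The 50 states plus the District of Columbia.
STATES = [
    "Alabama", "Alaska", "Arizona", "Arkansas", "California", "Colorado",
    "Connecticut", "Delaware", "District of Columbia", "Florida", "Georgia",
    "Hawaii", "Idaho", "Illinois", "Indiana", "Iowa", "Kansas", "Kentucky",
    "Louisiana", "Maine", "Maryland", "Massachusetts", "Michigan",
    "Minnesota", "Mississippi", "Missouri", "Montana", "Nebraska", "Nevada",
    "New Hampshire", "New Jersey", "New Mexico", "New York",
    "North Carolina", "North Dakota", "Ohio", "Oklahoma", "Oregon",
    "Pennsylvania", "Rhode Island", "South Carolina", "South Dakota",
    "Tennessee", "Texas", "Utah", "Vermont", "Virginia", "Washington",
    "West Virginia", "Wisconsin", "Wyoming",
]

# 1 for every New England state, 0 everywhere else.
rows = [(state, 1 if state in NEW_ENGLAND else 0) for state in STATES]

for state, value in rows:
    print(f"{state}\t{value}")
```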

Here are the queries whose popularity is most highly correlated with this New England data set:

0.9903  gorges grant hotel

0.9850  neasc

0.9846  boston dirt dog

0.9829  new england association of schools and colleges

0.9815  new england map

0.9805  hood ice cream

0.9800  map of new england

0.9799  new england inns

0.9794  new england recruiting report

As before, these are Pearson Correlation values. But what does it mean for a query to be correlated with this US states data set? Let’s look at the maps for our New England set and the query “map of new england” side-by-side:

Left: our “New England” data set. Right: the popularity of the query “map of new england”.

The maps indicate that the query ‘map of new england’ is popular in states where our data set has a 1 and not popular in states where our data has a 0. Clicking the “Scatter plot” link on the result makes this more explicit:

The six points on the top right are the six states in New England. The smattering of dots on the lower left are the other 44 states and the District of Columbia. This makes it clear that the query ‘map of new england’ is popular in the six states in New England and nowhere else.

For the New England data set, Google Correlate brings back queries which are characteristic of the New England region. If you have a data set which can be broken down by state, uploading it to Google Correlate may give you insight into some of the driving factors behind your data.

The techniques discussed for the time series examples, such as negating the data or holding out values, also apply to states correlation. If you leave a state’s value blank, that state will be held out. In particular, it is often useful to hold out the District of Columbia, which is an outlier in many data sets.

## Filtering

Google Correlate makes an attempt to filter out queries which are unlikely to be interesting. These include:

• Queries with a low correlation value (less than r=0.6)
• Misspelled queries
• Pornographic queries
• Rare queries
• Queries which only correlate with a small portion of the time series