Google Correlate Tutorial
Google Correlate is an experimental new tool on Google Labs
which lets you use the same methodology and data as Google Flu Trends.
What is Google Correlate?
Google Correlate is like Google Trends in reverse. With
Google Trends, you type in a query and get back a series of its frequency (over
time, or in each US state). With Google Correlate, you enter a data series (the
target) and get back queries whose frequency follows a similar pattern.
Google Trends: “mittens” → [time series chart]
Google Correlate: [time series chart] → “mittens”

Correlated Queries
When you upload a data set (a time series, for instance), Google Correlate will
compute the
Pearson Correlation Coefficient (r) between your time series and the
frequency time series for every query in our database. Correlation coefficients
range from r=-1.0 to r=+1.0. The queries that Google Correlate
shows you are the ones with the highest correlation coefficient (i.e. closest to
r=1.0).
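To make the ranking step concrete, here is a minimal sketch in plain Python. The `pearson` and `rank_queries` helpers and the toy candidate series are illustrative assumptions; they are not Google Correlate's actual implementation, which searches a far larger database.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient r between two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("series must have the same length (at least 2)")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    cov = sum(a * b for a, b in zip(dx, dy))
    norm = math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))
    if norm == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / norm

def rank_queries(target, query_series, top_n=10):
    """Score every candidate against the target; highest r first."""
    scored = [(pearson(target, series), query)
              for query, series in query_series.items()]
    scored.sort(reverse=True)
    return scored[:top_n]

# Toy data (hypothetical): three candidate "queries" and one target.
target = [1.0, 2.0, 3.0, 4.0, 5.0]
candidates = {
    "rises with target": [2.0, 4.1, 5.9, 8.2, 10.0],
    "falls with target": [5.0, 4.0, 3.0, 2.0, 1.0],
    "unrelated": [3.0, 1.0, 4.0, 1.0, 5.0],
}
ranking = rank_queries(target, candidates)
for r, query in ranking:
    print(f"{r:+.4f} {query}")
```

The candidate closest to r=+1.0 comes first, and the candidate that moves opposite to the target lands at the bottom with r close to -1.0.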
For example, say your data set was a sine wave from 2003 to 2010, so that 0.0 is
the summer solstice and 1.0 is the winter solstice.
Week      Value
1/05/03   0.983
1/12/03   0.965
1/19/03   0.939
1/26/03   0.907
2/02/03   0.869
2/09/03   0.826
2/16/03   0.778
...
This could be generated in Excel using the SIN or COS function. You can download
a complete spreadsheet for this data set
here.
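If you would rather script the data set than build it in a spreadsheet, the following sketch produces a similar weekly series. The formula is an assumption: a cosine with a 365.25-day period that peaks on December 21, 2002, which happens to reproduce the first few values shown above. The names `winter_wave` and `weekly_series` are made up for this example.

```python
import math
from datetime import date, timedelta

FIRST_WEEK = date(2003, 1, 5)          # first Sunday of 2003
WINTER_SOLSTICE = date(2002, 12, 21)   # assumed peak of the wave (value 1.0)
DAYS_PER_YEAR = 365.25                 # assumed period

def winter_wave(week_start):
    """Value between 0.0 (summer solstice) and 1.0 (winter solstice)."""
    days = (week_start - WINTER_SOLSTICE).days
    return 0.5 + 0.5 * math.cos(2 * math.pi * days / DAYS_PER_YEAR)

def weekly_series(first_week, last_week):
    """One (week start, value) row per week, stepping 7 days at a time."""
    rows = []
    week = first_week
    while week <= last_week:
        rows.append((week, winter_wave(week)))
        week += timedelta(days=7)
    return rows

rows = weekly_series(FIRST_WEEK, date(2010, 12, 26))

# Print tab-separated rows with no header row, ready to copy and paste.
for week, value in rows[:7]:
    print(f"{week.month}/{week.day:02d}/{week.year % 100:02d}\t{value:.3f}")
```

Every row starts on a Sunday because the series begins on one and advances in steps of seven days.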
To get your data into Google Correlate, you have two main options:
 Copy/paste from a spreadsheet program (Excel, OpenOffice, …)
 Export CSV and upload it to Google Correlate.
We’ll use the copy/paste technique here. To get started, select all the
cells of interest in your spreadsheet program and copy them. Then click
“Enter your own data” on Google Correlate and switch to the
“Time Series” tab. Click on the spreadsheet there and hit Control-V
(Command-V on a Mac) to paste in the data:
Copy data from a spreadsheet program...
...and paste it into Google Correlate. Be sure to leave off the header row and give your data series a name!
Then click “Search correlations” at the bottom to find queries whose time
series are correlated. Here are the top queries that come back when you search
for this time series:
0.9483 alpine touring
0.9439 nordica
0.9381 volkl
0.9339 colds
0.9329 hockey arena
0.9329 fritschi
0.9290 obermeyer
0.9289 telemark boot
0.9270 wedding soup
0.9267 ski boot
The numbers are the Pearson Correlation Coefficients. r=0.9483 indicates
a very good fit between the target data and the query time series. Here’s what
‘alpine touring’ looks like next to the time series that we uploaded:
A few things to note here:
 Any query containing ‘alpine touring’ will contribute to the time series for
‘alpine touring’. This includes queries like ‘alpine touring skis’, ‘alpine
touring vacations’, etc.
 The data is aggregated to weekly counts. Each week runs from Sunday through Saturday. The point for 2006/01/01, for example, includes queries from the start of Sunday, January 1, 2006 to the end of Saturday, January 7, 2006. Google Correlate contains data starting from January 5, 2003 (the first Sunday of 2003).
 The vertical grid lines mark the beginning of each year.
 The units on the y-axis are standard deviations away from the mean. Each time series is normalized so that its mean is 0.0 and its standard deviation is 1.0. This puts all series on the same scale so that they’re easier to compare. This also explains why the ‘Winter Wave’ time series ranges from -1.4 to +1.4, even though the input series only ranged from 0 to 1.
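The normalization in the last point can be checked with a short sketch. It assumes the population standard deviation (dividing by n); whether Google Correlate divides by n or n-1 is not stated here, and the `normalize` helper is illustrative only.

```python
import math

def normalize(series):
    """Rescale a series to mean 0.0 and standard deviation 1.0."""
    n = len(series)
    mean = sum(series) / n
    # Population standard deviation (assumption: divide by n, not n - 1).
    std = math.sqrt(sum((v - mean) ** 2 for v in series) / n)
    if std == 0:
        raise ValueError("cannot normalize a constant series")
    return [(v - mean) / std for v in series]

# One full cycle of a wave between 0 and 1, sampled at 52 weekly points.
wave = [0.5 + 0.5 * math.cos(2 * math.pi * k / 52) for k in range(52)]
z = normalize(wave)

print(f"input range:      {min(wave):.1f} to {max(wave):.1f}")
print(f"normalized range: {min(z):+.3f} to {max(z):+.3f}")
```

A wave with amplitude 0.5 has a standard deviation of 0.5/√2 ≈ 0.354, so its extremes sit at ±0.5/0.354 ≈ ±1.414 standard deviations from the mean, which is where the ±1.4 on the chart comes from.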
Negative Correlations
Google Correlate only shows you positive correlations. But sometimes the negative correlations can be just as interesting. If you want to see queries which are negatively correlated with your data, just multiply your input data by -1 in your spreadsheet program before uploading it to Google Correlate.
Here are the negative correlations for the seasons time series:
0.9729 boat trailer
0.9664 trumpet vine
0.9630 golf course
0.9626 rotary mower
0.9618 gary fisher
0.9603 deck railing
0.9597 used bikes
0.9590 pig roast
0.9578 bike carrier
0.9577 course rating
So the time series for the query ‘boat trailer’ had a correlation of r=-0.9729 with the original ‘Winter Wave’ time series. As you might expect, the queries which are negatively correlated with winter are summer queries.
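The reason the multiply-by-minus-one trick works is that negating one series flips the sign of r without changing its magnitude. Here is a small sketch of that property, using a made-up summer-peaking series as a stand-in for a real query:

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient r between two equal-length series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    cov = sum(a * b for a, b in zip(dx, dy))
    return cov / math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))

# Two years of weekly data: a winter-peaking target and a hypothetical
# query that peaks in summer, slightly out of phase with the target.
weeks = range(104)
winter = [0.5 + 0.5 * math.cos(2 * math.pi * k / 52) for k in weeks]
summer_query = [0.5 - 0.5 * math.cos(2 * math.pi * k / 52 + 0.3) for k in weeks]

flipped = [-v for v in winter]  # the series you would upload instead

r_original = pearson(winter, summer_query)
r_flipped = pearson(flipped, summer_query)
print(f"original target: r={r_original:+.4f}")
print(f"negated target:  r={r_flipped:+.4f}")
```

A query with a strongly negative r against the original data therefore shows up with a strongly positive r against the negated data.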
Holdouts and Missing Data
Sometimes you don’t have a complete time series or would prefer to hold out a portion of your data for testing. You can accomplish this in Google Correlate by putting blank values in your data when you upload it:
For example, here is the Winter Wave time series with 2006 and 2007 withheld:
If you look closely, you can see that the blue line has a gap between the end of 2005 and the start of 2008. When computing correlations, these weeks will be ignored in the time series for candidate queries. This means that, if you build a model for your time series using query data, you can use this held out portion of the time series as a test set.
Removing selected weeks from uploaded data sets is a general technique which can be used for other purposes as well. For instance, if your uploaded data has a large spike over a small time period, that spike may have a large (and unwanted) influence on the results. If you withhold the spiking weeks from your data set, you can remove their influence entirely.
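The effect of withholding weeks can be sketched as follows. Blank values are represented by `None`, and the correlation is computed only over weeks where the target has a value. The helper name and the toy numbers are assumptions for illustration, not Google Correlate's code.

```python
import math

def pearson_ignoring_blanks(target, candidate):
    """Pearson r over the weeks where the target is not blank (None)."""
    pairs = [(t, c) for t, c in zip(target, candidate) if t is not None]
    if len(pairs) < 2:
        raise ValueError("need at least two non-blank weeks")
    n = len(pairs)
    mean_t = sum(t for t, _ in pairs) / n
    mean_c = sum(c for _, c in pairs) / n
    cov = sum((t - mean_t) * (c - mean_c) for t, c in pairs)
    var_t = sum((t - mean_t) ** 2 for t, _ in pairs)
    var_c = sum((c - mean_c) ** 2 for _, c in pairs)
    return cov / math.sqrt(var_t * var_c)

candidate = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]

# The target follows the candidate except for one large spike in week 5.
with_spike = [1.0, 2.0, 3.0, 4.0, 100.0, 6.0, 7.0, 8.0]
# The same target with the spiking week withheld (left blank).
spike_withheld = [1.0, 2.0, 3.0, 4.0, None, 6.0, 7.0, 8.0]

r_with_spike = pearson_ignoring_blanks(with_spike, candidate)
r_withheld = pearson_ignoring_blanks(spike_withheld, candidate)
print(f"with spike:     r={r_with_spike:.3f}")
print(f"spike withheld: r={r_withheld:.3f}")
```

In this toy case a single spiking week is enough to hide an otherwise perfect match, and leaving that week blank restores it.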
Building a Model with Query Data
Note: Statistical modeling is a fine art. This example is presented simply as a demonstration of what’s possible, not as a demonstration of good modeling techniques.
Having found queries which are correlated with the winter, we can use them to build a model. Using the Winter Wave with holdout, we get a list of queries whose time series is correlated with the winter. If you click “Export data as CSV” on that page, you’ll get a CSV file containing weekly time series for the top few results.
You can import this data into a spreadsheet or your favorite numerical analysis tool to do the modeling. For example, in this spreadsheet, we built a very simple model by summing up the time series for the 20 most highly correlated queries. We then computed the Pearson Correlation Coefficient between the target time series and the model estimates on the holdout period (2006-2007), which was r=0.979. This indicates that the query data was able to predict previously unseen real-world data.
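The same kind of sum-of-queries model can be sketched outside a spreadsheet. Everything below is synthetic: the 20 "query" series are the target wave plus random noise, standing in for the exported CSV columns, so the resulting number is not comparable to the r=0.979 reported above. The helper names are made up for this example.

```python
import math
import random

def pearson(xs, ys):
    """Pearson correlation coefficient r between two equal-length series."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    dx = [x - mean_x for x in xs]
    dy = [y - mean_y for y in ys]
    cov = sum(a * b for a, b in zip(dx, dy))
    return cov / math.sqrt(sum(a * a for a in dx) * sum(b * b for b in dy))

def sum_model(series_list):
    """Model estimate for each week: the sum of all query series that week."""
    return [sum(values) for values in zip(*series_list)]

rng = random.Random(0)   # fixed seed so the sketch is repeatable
WEEKS = 8 * 52           # roughly eight years of weekly data

# Synthetic target and 20 noisy stand-ins for the top correlated queries.
target = [0.5 + 0.5 * math.cos(2 * math.pi * k / 52) for k in range(WEEKS)]
queries = [[v + rng.gauss(0, 0.3) for v in target] for _ in range(20)]

# Hold out the fourth and fifth years, analogous to withholding 2006-2007.
holdout = range(3 * 52, 5 * 52)

estimate = sum_model(queries)
r_holdout = pearson([target[k] for k in holdout],
                    [estimate[k] for k in holdout])
print(f"holdout correlation: r={r_holdout:.3f}")
```

With real data, the queries would be selected using only the weeks that were not withheld, so the holdout period acts as a genuine test of the model.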
Of course, there are better ways to model whether it’s winter in the United States. But it is interesting that we can do so exclusively with query data. A similar sequence turned influenza data from the CDC into Google Flu Trends and there are no doubt other time series which can be modeled in a similar way.
Correlate by States
The examples thus far have worked exclusively with time series. Google Correlate can also find queries whose popularity correlates with a data set across space rather than time.
As a simple example, let’s create a data set which is 1 for every state in New England but 0 for all other states:
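One way to generate such a data set is sketched below. It uses two-letter postal abbreviations as state labels, which is an assumption; check which labels the “US States” tab expects before pasting.

```python
# The 50 states plus the District of Columbia, by postal abbreviation.
STATES = [
    "AL", "AK", "AZ", "AR", "CA", "CO", "CT", "DE", "DC", "FL", "GA",
    "HI", "ID", "IL", "IN", "IA", "KS", "KY", "LA", "ME", "MD", "MA",
    "MI", "MN", "MS", "MO", "MT", "NE", "NV", "NH", "NJ", "NM", "NY",
    "NC", "ND", "OH", "OK", "OR", "PA", "RI", "SC", "SD", "TN", "TX",
    "UT", "VT", "VA", "WA", "WV", "WI", "WY",
]

# The six New England states.
NEW_ENGLAND = {"CT", "ME", "MA", "NH", "RI", "VT"}

# 1 for every state in New England, 0 for everything else.
data_set = {state: 1 if state in NEW_ENGLAND else 0 for state in STATES}

# Tab-separated rows with no header row, ready to copy and paste.
for state in STATES:
    print(f"{state}\t{data_set[state]}")
```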
Here are the queries whose popularity is most highly correlated with this New England data set:
0.9903 gorges grant hotel
0.9863 england basketball
0.9850 neasc
0.9846 boston dirt dog
0.9829 new england association of schools and colleges
0.9815 new england map
0.9805 hood ice cream
0.9800 map of new england
0.9799 new england inns
0.9794 new england recruiting report
As before, these are Pearson Correlation values. But what does it mean for a query to be correlated with this US states data set? Let’s look at the maps for our New England set and the query “map of new england” side by side:
Left: our “New England” data set. Right: the popularity of the query “map of new england”.
The maps indicate that the query ‘map of new england’ is popular in states where our data set has a 1 and not popular in states where our data has a 0. Clicking the “Scatter plot” link on the result makes this more explicit:
The six points on the top right are the six states in New England. The smattering of dots on the lower left are the other 44 states and the District of Columbia. This makes it clear that the query ‘map of new england’ is popular in the six states in New England and nowhere else.
For the New England data set, Google Correlate brings back queries which are characteristic of the New England region. If you have a data set which can be broken down by state, uploading it to Google Correlate may give you insight into some of the driving factors behind your data.
The same techniques discussed for the time series examples also apply to states correlation. If you don’t specify a value for a state, that state will be held out. In particular, it is often useful to hold out the District of Columbia, which is an outlier in many data sets.
Filtering
Google Correlate makes an attempt to filter out queries which are unlikely to be interesting. These include:
 Queries with a low correlation value (less than r=0.6)
 Misspelled queries
 Pornographic queries
 Rare queries
 Queries which only correlate with a small portion of the time series
For more information about the filtering operations performed by Google Correlate, please refer to the Google Correlate Whitepaper.
Protecting User Privacy
At Google, we are keenly aware of the trust our users place in us, and of our responsibility to protect their privacy. Google Correlate can never be used to identify individual users because we rely on anonymized, aggregated counts of how often certain search queries occur each week. We rely on millions of search queries issued to Google over time, and the patterns we observe in the data are only meaningful across large populations of Google search users. You can learn more about how this data is used and how Google protects users' privacy at our Privacy Center.