Discovering what people really think has often proved elusive. Just ask the pollsters who assured us that Donald J. Trump had no chance of becoming the 45th president of the USA. I am fascinated by the gap between what people say and what they really think or do. Sometimes we lie to ourselves – we have good intentions, but we lack self-awareness. Other times we hide views that are socially unacceptable. What social scientists need are ways to get around these biases. I have already blogged about big data and its potential. One key area is Google search. The omniscient Google is our friend, confidante and confessor. We have all googled something that we wouldn’t dream of asking somebody in the flesh. Such search queries are anonymous, or at least that’s how it feels. Every time we type in a search we reveal something about ourselves. It is like a societal x-ray of our collective hopes, fears and desires. In particular, Google’s anonymous, aggregate data can tell us about the dark sides of our thoughts and behaviours. This tool is the subject of a new book by Seth Stephens-Davidowitz, Everybody Lies: Big Data, New Data, and What the Internet Can Tell Us About Who We Really Are.
Seth Stephens-Davidowitz created a map of racism based on searches containing racial slurs and racist jokes. He then used it to analyse how racism affected voting across the United States in the 2012 presidential election. What he discovered was that in those areas of the country with the highest number of racist searches, Obama’s results were markedly worse than those of John Kerry, the unsuccessful white Democratic candidate in 2004. This variable was far more relevant than education levels, age, church attendance, or gun ownership. Although Obama won, the effect was important. Obama lost roughly 4 percentage points nationwide just from explicit racism. He was, however, able to get back 1 or 2 points from higher African-American turnout. In 2012 the conditions were favourable to the Democrats. But this data is also germane to what happened in 2016 and the rise of Trump. According to Nate Cohn, the biggest predictor of Trump support in the Republican primaries was the volume of racist searches. We need to be careful – correlation is not causation. Nevertheless, it does provide a partial explanation of the Trump phenomenon.
What can Google tell us about sex, lies and videos? One revealing fact is that in 2015, 2.5 billion hours of porn were watched on Pornhub, the largest pornography site on the Internet. To put this number in perspective, it is longer than our species has existed on Planet Earth. In surveys, 2.5% of men say they are gay. But Google tells another story: 5% of male porn searches are for gay porn. There are more gay searches in tolerant areas, such as California, than in places like Mississippi. But the difference is not that large: 5.2% compared to 4.8%.
We parents tend to project things onto our kids. In fact, of all Google searches starting “Is my 2-year-old…” the most common next word is “gifted.” We like to think that as parents we have equivalent expectations and dreams for our sons and daughters. But this question is not asked equally about young boys and young girls. Parents are two and a half times more likely to ask “Is my son gifted?” than “Is my daughter gifted?” They show similar biases when using other phrases related to intelligence. Stephens-Davidowitz asks if the parents are simply picking up on legitimate differences between young girls and boys. In fact, at this age girls tend to have larger vocabularies and use more complex sentences. In American schools, girls are 11% more likely than boys to be in gifted programs. Nevertheless, parents seem to find that their male progeny are the gifted ones. With their daughters, their concerns are more about appearance. “Is my daughter overweight?” is googled approximately twice as much as “Is my son overweight?” This is despite the fact that 30% of girls are overweight, whereas the corresponding figure for boys is a little higher: 33%.
A team of researchers from Columbia University and Microsoft analysed data from tens of thousands of anonymous users of Bing, Microsoft’s search engine. They coded a user as having recently been given a diagnosis of pancreatic cancer based on unmistakable searches, such as “just diagnosed with pancreatic cancer” or “I was told I have pancreatic cancer, what to expect.” The researchers wanted to discover which symptoms were strong predictors of a diagnosis. They examined the searches that had been made before the actual diagnosis, comparing the few who were eventually diagnosed with the cancer to those who weren’t. Here’s how Stephens-Davidowitz explains the remarkable results:
“Searching for back pain and then yellowing skin turned out to be a sign of pancreatic cancer; searching for just back pain alone made it unlikely someone had pancreatic cancer. Similarly, searching for indigestion and then abdominal pain was evidence of pancreatic cancer, while searching for just indigestion without abdominal pain meant a person was unlikely to have it. The researchers could identify 5 to 15 percent of cases with almost no false positives. Now, this may not sound like a great rate; but if you have pancreatic cancer, even a 10 percent chance of possibly doubling your chances of survival would feel like a windfall.”
In Everybody Lies, Stephens-Davidowitz talks about the digital truth serum. The truth about what we think is so hard to find that we need every tool at our disposal. As a sceptic of opinion polls and surveys, I like the idea of using proxies. However, as I pointed out in my post about big data, there is a danger of finding spurious correlations, with cherry picking on an industrial scale. Nevertheless, I do feel that this book has hit on something.