You Are Not the Average

How most user ratings websites get it all wrong

Amazon, Yelp!, Rotten Tomatoes, and Reddit are some of the most highly trafficked sites on the web. They also all fail at their core purpose: showing you things you are most likely to want. They fail because they rely on the inherently flawed technique of averaging user ratings. The flaw behind all of these exceptionally popular websites is both obvious and surprising. It’s this: you are not like the average.

After nearly two decades of thriving consumer ratings websites and perhaps billions of ratings collected, it's past time for consumers to demand better than the oversimplified average. That starts with an understanding of how we’re being duped by aggregated user ratings, how we got to this point, and a better way forward. An alternative that relies on ratings data that has already been collected, personalization algorithms, and a renewed valuing of expert opinion would greatly improve content discovery and consumer satisfaction.

The rise and fall of the expert opinion model

Prior to the media dominance of the internet, consumers relied on the opinions of a narrow pool of experts published in centralized media such as newspapers, guide books, and TV. Before Rotten Tomatoes, Siskel and Ebert’s two thumbs dominated public perception of what to watch. While the early web followed this same one-to-many model, by the late 1990s a new phenomenon had emerged. Dubbed “web 2.0”, the user-generated content revolution swept the internet, birthing most of the aggregated consumer ratings websites that still dominate the web today: Amazon, Yelp!, and Rotten Tomatoes. The relative newcomers, Reddit and Quora, follow the same model, taking advantage of the internet as a many-to-many medium.

While the old model of product and media discovery is almost offensively unfashionable today, its reliance on experts had some clear advantages. True experts are worthy of our trust because they have wide and deep experience within their field, they are trained to judge and differentiate, and they have learned to articulate their opinions clearly. Expert opinion is based on a wide base of information for comparison. A non-expert doesn’t have the time or resources to sample nearly as many restaurants, movies, or products in any other category as an expert does. Expert opinion is cultivated through education, research, and practice. A non-expert hasn't cultivated the keen sense for nuance, or the ability to spot a diamond in the rough. Not least importantly, by virtue of having to find an audience to sell their opinions, experts develop the vocabulary needed to articulate and share those nuances, such as the comparison of wine to the flavors of other foods, or the comparison of a movie's editing style to that of classic films.

Yet the old model also has clear flaws, and the consumer backlash against these flaws, exposed by the advent of the internet, has led to the near extinction of expert-guided media. Most strikingly, the expert media pool is incomplete. Consumers understand that the world contains many more educated and experienced experts than can be supported by centralized media. Prior to the internet, only so many TV time-slots or newspaper inches were available. Also, many individuals - people we all know - have reached an expert level in evaluating some arena of consumer products or services, yet for various reasons haven’t made a career of sharing their advice. Additionally, published experts are vulnerable to certain kinds of biases that don’t affect unpublished experts and non-experts, such as cultural elitism and pressure from sponsors. Lastly, it’s clear that taste is not universal, and that experts often strongly disagree with one another. The experts of the limited media pool maintained authority by posturing as the arbiters of taste. However, as aggregate user ratings emerged as credible, that weight of authority has vanished.

The "Wisdom of Crowds” - misunderstood and misapplied

With limitless “channels” and low operating costs, the web is famously said to have “democratized” publishing. Initially, it gave voice to many dedicated connoisseurs, in the form of blogs and small websites, who were free from the traditional biases of centralized media (yet vulnerable to others). But the value of web publishing turned out to be much greater than simply multiplying the one-to-many channels. The user-generated content revolution, exemplified by aggregated ratings websites such as Yelp! and Rotten Tomatoes, proved that online publishing could be as democratic as an election by functioning in exactly the same way. Today, fifteen years later, most consumers eschew both expert and connoisseur opinion in favor of websites that express the majority opinion. But are they wiser for it?

Aggregated user ratings are often referred to as the “wisdom of crowds”. By averaging dozens to thousands of real user opinions, this ostensible wisdom is revealed by smoothing out individual biases. Whether from an expert with a potential for elitism, a connoisseur who may be prone to sophomoric assumptions, or an amateur with unusual personal tastes, on today’s most popular websites all opinions are given equal weight. In theory, with enough opinions, the individual biases in any one direction will counterbalance the others, and the resulting opinion at the center of the bell curve will resemble that of the majority of website visitors seeking product and media advice. So why doesn’t this work?

In fact, the “wisdom of crowds” has been shown to be highly accurate at estimating an objective, quantifiable property, but not at prognosticating an individual's subjective opinion. In his famous experiment, statistician Francis Galton observed that the aggregate guess of 800 onlookers as to the weight of a slaughtered ox was accurate to within 1% of the actual weight. While any one guess was likely to be inaccurate or biased, the inaccuracies fell equally on either side of the actual weight and therefore balanced each other out. However, the meme of crowd wisdom has since been misapplied to consumer ratings. Critically, the slaughtered ox has a single true weight, which will be the same for each person who weighs it. In contrast, a restaurant, film, or product has a different level of appeal for each person.
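
To see why error cancellation works for objective quantities, consider a minimal simulation - not Galton’s actual data, and with an assumed “true” weight and noise level - in which many noisy but unbiased guesses about a single fixed value average out very close to that value:

```python
# A minimal simulation (not Galton's data): many noisy but unbiased guesses
# of a single objective quantity average out close to the true value.
import random

random.seed(0)
true_weight = 1198  # an assumed "true" ox weight in pounds, for illustration only
guesses = [true_weight + random.gauss(0, 100) for _ in range(800)]

crowd_estimate = sum(guesses) / len(guesses)
error_pct = abs(crowd_estimate - true_weight) / true_weight * 100
print(f"{crowd_estimate:.0f} lbs, off by {error_pct:.2f}%")  # well within 1%
```

No such single “true value” exists for how much you will enjoy a restaurant, which is why the same trick fails for subjective ratings.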

You are not like the average

This average of many opinions is misleading in two important ways. The first is that a single rating from each user is already itself an average of several opinions about various attributes. For example, we evaluate a restaurant based on taste, ambiance, service, and value, among other attributes. We must each invent our own method of averaging these marks into a single number rating. If we feel the restaurant deserves very high marks for taste but very low marks for value, our final number must either ignore an important attribute or land on a middling number which reflects neither of our true opinions. Even if we have a generally positive opinion about each attribute, each of us weights these attributes differently. Where one person values taste above all else, another cares only about value. While this first problem is also true for experts, an individual expert can explain to her audience her opinion for each attribute and her weighting methodology.
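
A small sketch makes the point concrete. The attribute scores and weights below are invented, but they show how the same set of impressions collapses into very different single-number ratings depending on the rater’s weighting methodology:

```python
# A minimal sketch (hypothetical numbers): the same attribute scores collapse
# into different single-number ratings depending on how a rater weights them.

def overall_rating(scores, weights):
    """Weighted average of per-attribute scores on a 1-5 scale."""
    total_weight = sum(weights.values())
    return sum(scores[attr] * weights[attr] for attr in scores) / total_weight

# One diner's impressions of a restaurant.
scores = {"taste": 5, "ambiance": 4, "service": 4, "value": 1}

# Two raters, two weighting methodologies.
taste_first = {"taste": 0.6, "ambiance": 0.1, "service": 0.1, "value": 0.2}
value_first = {"taste": 0.2, "ambiance": 0.1, "service": 0.1, "value": 0.6}

print(overall_rating(scores, taste_first))  # 4.0 stars
print(overall_rating(scores, value_first))  # 2.4 stars
```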

The second important way that the average misleads is that the vital cultural make-up of respondents is lost in the average. Large minority groups of people have both strongly divergent opinions and strongly divergent weighting methods from other groups. For example, the large majority of people identify strongly as either men or women, and consumer opinions are often sharply divided along gender lines, especially when it comes to entertainment and fashion. Additionally, each respondent may belong to one or several specific minority cultural buckets related to age, wealth, ethnic background, education, religion, political leaning, etc. While a majority opinion exists within any one bucket, many buckets hold opinions opposite to the others. We can count on Baby Boomers vs. Millennials, Christians vs. Atheists, liberals vs. conservatives, etc. to have opposing tastes. Seen together on a graph, these opposing opinions would form an inverted bell curve, meaning that the majority of all respondents sit at the extremes of like and dislike while only a minority hold a middling opinion. Yet the average of all opinions would be exactly that middling number that few agree with.
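
A toy example (with invented ratings) illustrates that inverted bell curve: when two groups sit at opposite extremes, the average lands on a score that almost nobody actually gave.

```python
# A minimal sketch with invented ratings: two groups with opposite tastes
# produce a U-shaped distribution whose average matches almost no one.
from collections import Counter
from statistics import mean

group_a = [5, 5, 5, 4, 5]   # e.g. one demographic loves the film
group_b = [1, 1, 2, 1, 1]   # another demographic hates it
all_ratings = group_a + group_b

print(Counter(all_ratings))  # most ratings sit at the extremes
print(mean(all_ratings))     # 3.0 -- a "middling" score nobody actually gave
```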

The simple and wonderful fact is that even though, on average, your opinion is likely to be similar to the average, you are not like the average in the individual cases that matter to you most. This is especially true with opinion and online content discovery websites such as Reddit and Quora. When exploring these sites, visitors hope to learn whether or not they will enjoy their item of curiosity, but all they can learn is, at best, the majority opinion, and at worst, a nonexistent opinion resulting from the average of opposing viewpoints. Additionally, online ratings averages belie the fact that respondent demographics often skew heavily away from census demographics. The online world still skews towards non-Hispanics under 50 years old who earn over $50,000 per year. Yet because opinions can be expressed numerically (e.g. 4 out of 5 stars), and because numbers seem scientific and absolute, online ratings are often misunderstood as objective measurements of item quality.

A better approach – the wisdom of peers

The user-generated revolution freed us from the constraints of old media, but threw the baby out with the stale bathwater of the opinion elites. It brought us huge datasets of public opinion, but fails to mine that data for personalized insights. It’s time to change that by replacing aggregated ratings with personalized, data-driven, segmented, and expertly informed recommendations. Netflix once took steps toward solving this complex problem, and Google has moved in the right direction with its acquisition of Zagat. In the offline world, we place special value on the opinions of our peers: the people we understand best and who understand us best. It’s time for the whole web to utilize peer advice, from e-commerce to online content discovery websites.

While you are not identical to the crowd, you do have many peers within the crowd. Using existing data, the members of each peer group can be determined based on their voting similarities and their specific vectors of similarity. Peer groups are not static pools but rather a shifting network of relationships and weights. Unlike offline peers, peers in this sense aren't people who know each other, but people who are most similar to each other. This similarity can be determined not only by virtue of liking or disliking the same items within a category (be it websites, restaurants, or films), but also according to the attributes of an item that each person weights most heavily. For example, an individual’s own ratings of restaurants can reveal that she prizes value over ambiance. When she searches for the best restaurant nearby, the recommendation can be personalized by giving more weight to peers who also prize value, even when they haven’t rated the same items.
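
As a rough sketch of how such a linkage might work - this is not any site’s actual algorithm, and every name and number here is hypothetical - a simple similarity score can blend agreement on co-rated items with agreement on attribute weightings:

```python
# A minimal sketch of "peers by similarity". Similarity blends (a) agreement on
# co-rated items with (b) how closely two users' attribute weightings match.
# All names and numbers are hypothetical.
from math import sqrt

def cosine(u, v):
    """Cosine similarity over the keys two dictionaries share."""
    shared = set(u) & set(v)
    if not shared:
        return 0.0
    dot = sum(u[k] * v[k] for k in shared)
    norm = sqrt(sum(u[k] ** 2 for k in shared)) * sqrt(sum(v[k] ** 2 for k in shared))
    return dot / norm if norm else 0.0

def peer_similarity(ratings_a, ratings_b, weights_a, weights_b, alpha=0.5):
    """Blend item-rating similarity with attribute-weight similarity."""
    return alpha * cosine(ratings_a, ratings_b) + (1 - alpha) * cosine(weights_a, weights_b)

alice = {"Cafe X": 5, "Bistro Y": 2, "Diner Z": 4}
bob = {"Cafe X": 4, "Bistro Y": 1, "Pizzeria W": 5}     # only two restaurants in common
alice_weights = {"value": 0.7, "ambiance": 0.3}
bob_weights = {"value": 0.6, "ambiance": 0.4}           # similar priorities despite few shared items

print(round(peer_similarity(alice, bob, alice_weights, bob_weights), 2))
```

Even with few co-rated items, matching attribute weightings can still link two users as peers, which is what allows the "prizes value over ambiance" signal to travel across unrated restaurants.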

Our online peer groups contain not only other amateurs but also published experts and connoisseurs. The algorithms that link peers should weight the latter more heavily. Expert opinion is lost to mere aggregation because amateurs far outnumber experts and connoisseurs, yet amateurs aren’t practiced enough to identify subtlety, nor experienced with a wide enough sample to make well-informed reviews. Beyond numeric ratings, text excerpts from reviews by experts should be highlighted, because experts are much better able to articulate the reasoning behind their opinions.
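
One way to express that extra weight - again a hedged sketch, with an invented boost factor and invented numbers - is to scale an expert peer’s contribution when predicting a rating:

```python
# A minimal sketch (hypothetical weights): when predicting a user's rating from
# her peers, an expert peer's opinion counts for more than an amateur's.

def predicted_rating(peer_opinions, expert_boost=2.0):
    """peer_opinions: list of (rating, similarity, is_expert) tuples."""
    numerator = denominator = 0.0
    for rating, similarity, is_expert in peer_opinions:
        weight = similarity * (expert_boost if is_expert else 1.0)
        numerator += weight * rating
        denominator += weight
    return numerator / denominator if denominator else None

peers = [
    (4.5, 0.9, True),    # a published critic with very similar taste
    (3.0, 0.8, False),   # amateurs with similar taste
    (2.5, 0.7, False),
]
print(round(predicted_rating(peers), 2))  # 3.71 -- pulled toward the expert's view
```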

This doesn’t mean that amateur opinion isn’t of value. Linking peers requires a robust data set. Fortunately, unlike mere aggregation, peer-based recommendation algorithms give individuals a better reason to contribute their opinions and ratings. Currently, the only motivations for providing ratings are self-expression and community clout (online karma). A powerful additional motivation comes from knowing that the more opinions a user provides, the better her personalized recommendations become. Peer-based recommendations are self-reinforcing.

The wisdom of peers is especially relevant to content discovery websites such as Reddit and Quora. The old media of network television and mainstream cinema suffer from their limited distribution channels. The small number of channels necessitates that all of their selected content appeal to the widest possible audience. This leads to, for example, the dull humor of many television sitcoms and the eye-rolling plot tropes of many Hollywood movies. Early on, online content aggregators claimed a more sophisticated and specialized audience, but that is no longer the case, if it ever was. The aggregated ratings of Reddit work in exactly the same way as Nielsen ratings: users with both discriminating and non-discriminating tastes are blended together, and only items that appeal to everyone achieve top ranking. The fact that Reddit offers more channels and more content doesn’t mitigate the fact that content ranking isn’t personalized.

Content discovery websites face a vicious cycle of ever-decreasing ranking quality as they grow. This is because the early adopters of any aggregation website - or any sub-channel within that website - will tend to be experts and connoisseurs. Experts care most about their particular category, and are therefore most likely to discover and contribute new content to a website or channel dedicated to that topic. But as the website or channel reaches a wider audience, late adopters will not only necessarily be less expert, but will also come from more diverse cultural backgrounds. As amateurs outnumber experts, the aggregated ratings of content shift toward the opinion of the wider population and no longer match those of the early adopters. Thus the people most likely to discover new content within a category are driven away.

Personalized ratings algorithms are not simple to create and maintain. Determining peer networks is a complex mathematical challenge. However, even small steps in this direction are a huge improvement over blanket averaging. The average waist size for US men is 36” (91 cm), but no one would settle for 36” pants if their waist is 38”. In just the same way, the mere averaging of opinion isn’t helpful enough. With just one pants purchase, it’s possible to provide a much more accurate recommendation for appropriately sized pants for that individual. Since many users have already contributed a rich history of their opinions across the web, the data already exists to improve popular online ratings websites. I’m looking at you, Yelp!, Rotten Tomatoes, and Reddit. With more sophisticated recommendation systems such as MovieLens, the personalized “smart online agents” theorized in the mid-1990s are now possible to realize. However, the leading websites with the most existing data don’t have enough incentive to revamp their recommendation engines because they earn money from pageviews rather than sales. Despite the incumbents’ head start, startups that can offer accurate and personalized content discovery may soon topple their dominance.