Bill James Explains

My friend Bob Page spotted a great Q&A with one of the most interesting "numbers guys" around, Bill James. His specialty is baseball statistics, but what makes him special is a rare combination of quantitative depth and the ability to communicate that depth in accessible, interesting ways.

Bob's blog post pulled a few choice quotes, which I won't repeat here. However, I'll add a few more:

People who think that they know when a manager should bunt and when a manager should pitch out and when a manager should make a pitching change are amateurs. People who have actually studied these issues know that the answer disappears in a cloud of untested variables.

On quantification of defense versus offense:

The interesting question is why defense is so much more difficult to quantify than offense in all sports. Perhaps defense by its nature involves more interaction between individuals than individual actions, and perhaps the way to get past that is to embrace the concept and measure combinations of players.

And finally, here's a quote from a recent article by James in Slate, about his theory for when a college basketball game is decided. It involves a percentage for how safe a team's lead is, based on time left, point differential, and ball possession. A 100% safe lead is one that cannot be overcome.

The theory of a safe lead is that to overcome it requires a series of events so improbable as to be essentially impossible. If the "dead" team pulls back over the safety line, that just means that they got some part of the impossible sequence—not that they have a meaningful chance to run the whole thing.

That's classic Bill James: explaining a theory of his own making, which involves a subtle statistical point, in two sentences that anyone can understand.
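For the curious, here is a rough sketch of how such a percentage could be computed. It follows the way the Slate rule is commonly summarized, not anything spelled out in this post: the 3-point cushion, the half-point possession adjustment, and the squaring are assumptions to check against the article, and the function name is made up.

```python
def safe_lead_pct(lead: int, seconds_left: int, leader_has_ball: bool) -> float:
    """Percent 'safe' for a lead, per the commonly cited version of the rule."""
    if seconds_left <= 0:
        # Game over: any lead is final.
        return 100.0 if lead > 0 else 0.0
    # Assumed constants: subtract 3 points, then +/- half a point for possession.
    adjusted = lead - 3 + (0.5 if leader_has_ball else -0.5)
    adjusted = max(adjusted, 0.0)
    # The squared figure is compared with the seconds remaining, capped at 100%.
    return min(100.0, 100.0 * adjusted ** 2 / seconds_left)


# An 11-point lead with nine minutes left and the ball.
print(round(safe_lead_pct(lead=11, seconds_left=540, leader_has_ball=True), 1))
```

Under these assumptions, a team that claws back below 100% has only undone part of the "impossible sequence," which is the point of the quote.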

Correcting for the Human Factor in Movie Ratings

A recent Wired article, This Psychologist Might Outsmart the Math Brains Competing for the Netflix Prize, is about Gavin Potter, a retired management consultant who is singlehandedly yet effectively competing against corporate and academic research teams in the top tier of the Netflix Prize.

(The Netflix Prize is a $1 million challenge to anyone who can exceed the performance of Netflix's movie-recommendation algorithm by 10%. Netflix provides a big database of its users' movie ratings as grist for the contestants' mills. It also provides a means to test contestants' predicted ratings against users' actual ratings, thus measuring accuracy. Although a 10% improvement may not sound like much, I've previously discussed why it is not easy.)
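The contest scored accuracy by root-mean-square error (RMSE), so a 10% improvement means an RMSE 10% lower than the baseline's. A minimal sketch of that measurement follows; the ratings and predictions are made up purely for illustration.

```python
import math


def rmse(predicted, actual):
    """Root-mean-square error between predicted and actual ratings."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need two equal-length, non-empty lists")
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical data: five actual star ratings and two sets of predictions.
actual = [4, 3, 5, 2, 4]
baseline = [3.5, 3.5, 4.0, 3.0, 3.5]
challenger = [3.6, 3.4, 4.2, 2.8, 3.6]

improvement = 1 - rmse(challenger, actual) / rmse(baseline, actual)
print(f"improvement over baseline: {improvement:.1%}")
```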

The leading research teams are each exploring variations of statistical/machine-learning approaches, looking for new refinements to relatively well-understood algorithms. While Potter no doubt uses one or more of the standard algorithms, he has apparently gotten a long way with few resources by correcting for well-known behavioral quirks that affect how people rate things. As he puts it, "The fact that these ratings were made by humans seems to me to be an important piece of information that should be and needs to be used."

The article provides an example:

One such phenomenon is the anchoring effect, a problem endemic to any numerical rating scheme. If a customer watches three movies in a row that merit four stars — say, the Star Wars trilogy — and then sees one that's a bit better — say, Blade Runner — they'll likely give the last movie five stars. But if they started the week with one-star stinkers like the Star Wars prequels, Blade Runner might get only a 4 or even a 3. Anchoring suggests that rating systems need to take account of inertia — a user who has recently given a lot of above-average ratings is likely to continue to do so. Potter finds precisely this phenomenon in the Netflix data; and by being aware of it, he's able to account for its biasing effects and thus more accurately pin down users' true tastes.

Admirably, the article goes on to consider the obvious pushback:

Couldn't a pure statistician have also observed the inertia in the ratings? Of course. But there are infinitely many biases, patterns, and anomalies to fish for. And in almost every case, the number-cruncher wouldn't turn up anything. A psychologist, however, can suggest to the statisticians where to point their high-powered mathematical instruments. "It cuts out dead ends," Potter says.

Potter's approach reminds me of ELIZA, a computer program from the 1960s that used simple psychological tricks to impersonate a human—for example, repeating someone's statement back as a question ("My boyfriend made me come here." "Why did your boyfriend make you come here?"). Although ELIZA did not know what it was talking about, it often did better at engaging people than far more sophisticated programs that actually tried to understand and respond to what was being said.

While I'm not suggesting that Potter's work is the algorithmic sleight-of-hand that ELIZA was, he is nevertheless tapping the same success factor: exploiting the humanness of the humans in the system. Not only does it work, but in a contest like the Netflix Prize, it is particularly effective because the other leading contestants apparently were not doing it.

[Update: The morning after I published this post, I see that the New York Times has an obituary for Joseph Weizenbaum, creator of ELIZA.]

Vampires versus Math

In an act of monster-slaying unlikely to make the movies or TV, physicists Costas J. Efthimiou and Sohang Gandhi show mathematically why vampires do not exist.

Their thesis:

Anyone who has seen John Carpenter’s Vampires, Dracula, Blade, or any other vampire film is already quite familiar with the vampire legend. The vampire needs to feed on human blood. After one has stuck his fangs into your neck and sucked you dry, you turn into a vampire yourself and carry on the blood-sucking legacy. The fact of the matter is, if vampires truly feed with even a tiny fraction of the frequency that they are depicted as doing in the movies and folklore, then humanity would have been wiped out quite quickly after the first vampire appeared.

The math is simple. Every time a vampire bites a human, the human becomes a vampire, reducing the human population by one and increasing the vampire population by one.

Let's say there are 99 humans and 1 vampire. The vampire claims its first victim. Now there are 2 vampires and 98 humans.

The two vampires each claim a new victim. That would make 4 vampires and 96 humans. The four vampires each claim a new victim, leaving 8 vampires and 92 humans.

Because the number of vampires doubles at each step, the vampires eliminate all the original 99 humans four steps later, at step seven.

What if we started with 1 vampire and 99,999 humans? It would take 17 steps to eliminate all the humans. What about 999,999 humans? Only 20 steps.

The authors provide a scenario in which the first vampire appeared in 1600, and each vampire claimed one victim a month. The world population at the time would have been vampirized in less than three years.
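The doubling argument is simple enough to check by brute force. Here is a minimal sketch; the function name is mine, and the last starting population is the authors' 2^29 total less the one vampire.

```python
def steps_to_wipe_out(humans: int, vampires: int = 1) -> int:
    """Count feeding steps until no humans remain.

    At each step every vampire bites one human (if any are left),
    and each bitten human becomes a vampire.
    """
    steps = 0
    while humans > 0:
        bitten = min(vampires, humans)
        humans -= bitten
        vampires += bitten
        steps += 1
    return steps


for starting_humans in (99, 99_999, 999_999, 536_870_911):
    print(f"{starting_humans:>11,} humans: {steps_to_wipe_out(starting_humans)} steps")
```

With one victim per vampire per month, the step count for the last line is the number of months in the authors' scenario.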

A few comments:

  • The authors conveniently assert the year 1600's total population (humans plus one vampire) to be a number exactly in the 2^n series (536,870,912, which is 2^29). This enables a tidy last step where 268,435,456 vampires have 268,435,456 victims.
  • The authors define vampires by way of the movies. However, the authors do not model the fact that, in the movies, the humans usually fight back and vanquish the vampires. If we're using movies as the guide, perhaps this better explains why vampires do not exist. ;)
  • Where did that first vampire come from?

If you read the full article, the vampire section is half-way down the page, under the subheading "Vampires."

Analytics That Explain Themselves

As computers have grown more powerful, the analytics they can do have grown more complex. But this power often brings a paradox: Complex and interesting analytics can go unused because few people know how to interpret the results.

When it happens, this failure is rarely due to the people. It's usually due to a shortsighted view of analytics, a view that focuses on the underlying data processing at the expense of making the results understandable.

That's the bad news. The good news is, computers can be used not just to "do" analytics but to explain them. For example, long ago at Personify, reports had a footer called "How to Read This Report." It was a plain-English sentence that described the data using the top-left cell as an example. It was simple but effective.

The latest thing to remind me of this topic was one of the best display ads I've ever seen on the Internet. I was looking up Apple on a stock-quote site, and one of the ads was this:

[Image: Scottrade SmartText ad]

The ad explains Apple's current performance on a popular technical-analysis indicator for stocks. Traditionally, technical analyses are rendered as charts that require expertise to interpret. With this ad, Scottrade is demonstrating its SmartText technology that interprets the charts for you, using current data. (The reason I think it's a great ad: Most ads promise something; this ad actually does it, in context.)

Would technical-analysis pros find the explanations simplistic? Probably. But for the casual user, do the explanations begin to make sense of something that might be useful? I'd say yes.

I don't follow the field of technical analysis for stocks, so I don't know how unique or effective SmartText is—although it's apparently unique enough to warrant its own ad campaign. What I do know is this: In business analytics the equivalent of SmartText's functionality is a rarity. Analytics results that people don't understand are not so rare.

In other words: Analytics, you've got some explaining to do.

Yield Management for Metered Street Parking

What if parking meters were priced more like airline seats?

The backstory: "Yield management" is what most airlines do when they sell you a seat. The price you pay might be different from what it was yesterday, or will be an hour from now. It depends primarily on the current and expected demand for the seat (usually, the seating class) you want.

By increasing or decreasing prices with demand, the airline can maximize the revenue from a flight's inventory of seats. The goal is to avoid empty seats that generate no revenue while getting the highest rate possible on filled seats. Because many different, constantly changing factors are involved, managing the yield is a complex task.

With that intro, let's transition from airline seats to metered street parking. Like airline seats, metered parking spaces are perishable goods: If they go unused, the potential revenue is lost. Also, as a flight has a limited number of seats, a geographic area has a limited number of metered parking spaces.

Cities raise revenue from parking meters and thus have an incentive to manage the yield upward. However, the usual rule is one price fits all. Some cities have different prices in different areas, but that's a long way from active yield management.

A major obstacle has been the traditional parking meter. The closest it comes to measuring demand is a coin count when the meter is emptied. But even if it could continuously measure demand ("Hey, my space has been empty 35 minutes!"), the traditional meter does not have a way to adjust its pricing automatically.

Enter new technologies. Today, digital meters exist that can change pricing depending on the day and time. Also, a variety of technologies exist to detect when a car enters and leaves a parking space; as a result, demand is measurable not just by meter but by day of the week, time of day, and so on.

Using such new technologies, the Port of San Francisco recently ran a test to understand how its meters were being used. The test involved multiple vendors of next-generation parking meters, with measurement by Streetline Networks, a San Francisco start-up. Streetline has a wireless sensor system that tracks when cars come and go from spaces.

Here is an example of data collected by Streetline at a particular meter:

[Image: Streetline graph for 200 Embarcadero]

The graphic shows December 2006 metered hours (in gray), occupied hours (in blue) and paid hours (in red) for meters along the even side of 200 Embarcadero, broken out by day of the week.

Among the findings of the study:

  • Demand varied significantly at the same meter during the day, often predictably so. (In the graphic above, note how uneven demand is within each day yet similar across weekdays.)
  • Demand could vary widely on a block-to-block basis.
  • Higher pricing did not affect usage. Meters priced at $3 an hour were used at the same rate as when they were priced at $2 an hour.

The test's one attempt at varying prices involved progressive pricing. Meters were $3 an hour for the first two hours, $4 for the third hour, and $5 for the fourth hour. The idea was, instead of having parking cops enforce a two-hour limit, let the pricing system enforce it by making people pay more the longer they stay. (If you're trying to imagine who would pump $15 worth of quarters into a meter, you'll be relieved to know the test allowed payment by credit card.)
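As a quick check on the arithmetic, the progressive schedule can be written as a list of hourly rates. This is only a sketch: it assumes whole-hour stays and covers just the four hours listed above.

```python
# Dollars charged for the 1st, 2nd, 3rd, and 4th hour of a stay.
PROGRESSIVE_RATES = [3, 3, 4, 5]


def progressive_cost(hours: int) -> int:
    """Total cost of a stay of the given number of whole hours."""
    if not 0 <= hours <= len(PROGRESSIVE_RATES):
        raise ValueError("this sketch only covers stays of up to four hours")
    return sum(PROGRESSIVE_RATES[:hours])


for hours in range(1, 5):
    print(f"{hours} hour(s): ${progressive_cost(hours)}")  # 4 hours comes to $15
```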

However, progressive pricing was a poor tool for managing yield at peak times, such as at lunch. Quoting from the minutes of a Port Commission meeting where the findings were presented:

What makes it a peak is that most people arrive just before it and most people leave just after it. What you end up having with a progressive rate system is that most people pay the lower rate during the highest usage hour. People [staying] a little longer ended up paying higher rates when demand is lower. This is the opposite of what you want to see if you're trying to balance usage over the day.
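The mismatch is easy to see by asking what rate a parker pays during the peak hour itself. In this sketch the noon-to-1 p.m. peak and the arrival times are assumptions for illustration; only the rate schedule comes from the test.

```python
# Dollars charged for the 1st, 2nd, 3rd, and 4th hour of a stay.
PROGRESSIVE_RATES = [3, 3, 4, 5]
PEAK_HOUR = 12  # assumed lunch peak, noon to 1 p.m. (24-hour clock)


def rate_paid_at(clock_hour: int, arrival_hour: int) -> int:
    """Rate charged for the hour starting at clock_hour, given the arrival hour."""
    hours_already_parked = clock_hour - arrival_hour
    if not 0 <= hours_already_parked < len(PROGRESSIVE_RATES):
        raise ValueError("car is not parked during that hour")
    return PROGRESSIVE_RATES[hours_already_parked]


# Parkers arriving at 11 a.m. or at noon both pay the lowest rate at the peak.
for arrival in (11, 12):
    print(f"arrive {arrival}:00, peak-hour rate ${rate_paid_at(PEAK_HOUR, arrival)}")

# The top rate lands on the 2 p.m. hour of an 11 a.m. arrival, after the peak.
print(f"arrive 11:00, 2 p.m. rate ${rate_paid_at(14, 11)}")
```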

Going forward, the Port of San Francisco will be trying other pricing policies:

Block-by-block, there’s a huge variation in demand for on-street parking which means that we need to have pricing policies that are more specific to a specific block or a geographic areas as opposed to Portwide pricing....

When people are parking in the middle of the day between 11a.m.-2p.m and the pricing is just for two hours and the third hour and the fourth hour. We learned that if we want to deal with congestion we have to deal with time-of-day pricing as opposed to straight per hour rate.

In other words, they need to price parking meters more like airline seats. Other thoughts: Vary the prices of the meters near the ballpark around game time. Allow longer time limits in areas/times with low demand.

These pricing schemes are not as dynamic as airline-seat pricing, but they go a significant distance in that direction. Because parking meters are not reserved ahead of time, and checking the price requires some form of stopping, there are practical limits to how dynamic the pricing can be, notwithstanding future visions of parking meters auctioning their spaces wirelessly to cars cruising the area.

You may ask, is this all a good thing? No one likes paying more for parking, and unlike with airlines and airline seats, a city has a monopoly on metered parking spaces. What is in the public interest?

I don't have a definitive answer, but from time to time I've noticed the work of Donald Shoup, a professor at UCLA who specializes in public policy related to parking. Having studied the subject for decades, he says metered parking is usually underpriced, and for that matter, there's too much free street parking. In this New York Times opinion piece, he makes his case.

Independent of the public-policy debate, it's safe to say that elements of yield management will increasingly apply to metered parking, simply because it's now feasible and thus will be tried. Per the old adage "If you can't measure it, you can't manage it," I expect cities to find that if they can actively measure street-parking demand, they can manage pricing better for a wide range of public-policy goals.

The New York Times, Nielsen, and Margin of Error

The April 8, 2007, New York Times had an extraordinary self-indictment of numbers abuse:

Every Monday, a Times ranking of the top 10 prime time broadcast television programs uses a Nielsen rating that indicates how many households watched each show the previous week. On March 26, "60 Minutes" ranked No. 8 with a 9.2 Nielsen rating. (Each rating point represents 1.1 million homes.) With a margin of error of 0.3-rating point...there was no statistically significant difference between the rating of "60 Minutes" and any of the three programs above it in the ranking, or either of the two below it. With no mention of the margin of error, however, Times readers were left to believe the rankings really meant something.
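To see what that margin means in practice, here is a rough sketch. Treating two ratings as a tie when their error intervals overlap is a simplification of a proper significance test, and the competing ratings of 9.6 and 10.5 are hypothetical, not the actual numbers from that week.

```python
HOMES_PER_POINT = 1_100_000  # from the Times piece
MARGIN_OF_ERROR = 0.3        # rating points, from the Times piece


def homes(rating_points: float) -> int:
    """Convert rating points into an estimated number of homes."""
    return round(rating_points * HOMES_PER_POINT)


def is_statistical_tie(rating_a: float, rating_b: float,
                       margin: float = MARGIN_OF_ERROR) -> bool:
    """Rough check: the two ratings' error intervals overlap."""
    return abs(rating_a - rating_b) <= 2 * margin


print(f"{homes(9.2):,} homes, give or take {homes(MARGIN_OF_ERROR):,}")
print(is_statistical_tie(9.2, 9.6))   # hypothetical show rated 9.6
print(is_statistical_tie(9.2, 10.5))  # hypothetical show rated 10.5
```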

Turns out omitting the margin of error is not new:

Over the past 25 years, only two of the 3,124 archived articles that mentioned Nielsen and "ratings" included a reference to the margin of error.

The piece was by Byron Calame, until recently the Times' Public Editor. As "readers' representative," Calame independently investigated reader questions and complaints. In this case, he contacted Nielsen and questioned Times editors responsible for running the numbers.

The Nielsen spokesperson said the numbers were "estimates," "should not be construed literally," and lacked margin of error data due to resource constraints on Nielsen's side.

Is that a problem? Paraphrasing a Times editor's response: No one else shows margins of error, so what's the problem?

Calame asked another editor why the Times did not at least tell readers that Nielsen does not provide the margin of error. The explanation is telling: "If we run a large disclaimer saying, in effect, this company is withholding a critical piece of information, I imagine many readers would simply turn the page."

Okay, thanks for clarifying the priorities.

Calame's piece called on the Times to do better, and if nothing else, the Times deserves credit for encouraging this criticism from within.

Is Predicting Hit Songs Futile?

I recently covered Columbia Professor Duncan Watts' "cumulative advantage" experiment, in which similar groups of people started with the same selection of songs but ended up with different choices for which songs were hits. If people were just judging the songs on content, the groups' choices for hits should have been similar. However, there was also a social factor: Except for a control group, each group's members could see the popularity of songs within their group but not within other groups.

Professor Watts proposed that the divergent choices for hits were due to each group's piling-on to whatever happened to be initially popular within that group. See the original post for details.

In my praising the experiment, I held back on some questions about the strongest claim in Professor Watts' New York Times Magazine article. In essence, he claimed that predicting hits was futile due to the inherent randomness of social systems like the word-of-mouth that affects entertainment choices.

Because the long-run success of a song depends so sensitively on the decisions of a few early-arriving individuals, whose choices are subsequently amplified and eventually locked in by the cumulative-advantage process, and because the particular individuals who play this important role are chosen randomly and may make different decisions from one moment to the next, the resulting unpredictability is inherent to the nature of the market.

This effect was true of Professor Watts' experiment, but is it realistic to have the early-arriving individuals "chosen randomly"? Isn't there a relatively small percentage of people who act as tastemakers: people who are into new stuff first and who influence others with their knowledgeable opinions? If these people have non-random qualities, and in the real world they often end up in that key first-arriver role, shouldn't there be a lot more predictability?

The Limits of an Individual Influential

After some email back-and-forth with Professor Watts, I was surprised to find that the role of "influentials" is potentially a lot less than is commonly believed. In a draft of a paper due for publication later this year, Watts and collaborator Peter Dodds detailed their mathematical simulations of various scenarios involving influentials. The results were summarized in the Harvard Business Review's Breakthrough Ideas for 2007:

Our work shows that the principal requirement for what we call "global cascades"—the widespread propagation of influence through networks—is the presence not of a few influentials but, rather, of a critical mass of easily influenced people, each of whom adopts, say, a look or a brand after being exposed to a single adopting neighbor. Regardless of how influential an individual is locally, he or she can exert global influence only if this critical mass is available to propagate a chain reaction.

To be fair, we found that in certain circumstances, highly influential people have a significantly greater chance of triggering a critical mass—and hence a global cascade—than ordinary people. Mostly, however, cascade size and frequency depend on the availability and connectedness of easily influenced people, not on the characteristics of the initiators—just as the size of a forest fire often has little to do with the spark that started it and lots to do with the state of the forest.

The researchers' forthcoming paper makes a compelling case for these conclusions, exploring influentials' role under many different scenarios. However, its various social-network models all start with the single "spark" of an individual discovering and communicating something. It does not consider a scenario where a large number of simultaneous and non-random sparks occur throughout the network. That is, if a single, random spark can cause a forest fire under the right conditions, how about a bunch of sparks purposely set at once, across that same forest?

The Potential of Coordinated Influence

The coordinated, multi-spark scenario matters because it is how certain social-marketing companies supposedly work: unleashing a small army of "on message" people to tell their friends about some great new thing. One might argue that a favorable newspaper review, radio airplay, or other one-to-many media do something similar, simultaneously "sparking" many consumers at once.

The key point: Instead of having a single line of sentiment that needs to propagate enough times to reach critical mass, the multi-spark scenario has many lines propagating, each of which could randomly run into other lines, thereby accelerating toward a critical mass.

Bringing this all back to the original New York Times Magazine article and its assertion that hits cannot be predicted, the multi-spark scenario is a way for hits to be predicted. In essence, it increases the prediction reliability by manipulating the system.

You may say this is unfair, like loading the dice, but it's how entertainment marketing works. Companies spend marketing dollars in proportion to what they think will be popular, thereby making what they think will be popular more popular. Economically, the question is whether the cost of manipulating the word-of-mouth system is worth the increase in probability of a hit.

Note that predicting a hit doesn't mean being right all the time; it just means that across many attempts the gain is greater than the cost. Thus, even if you only went from a 3% hit rate to a 5% hit rate, predicting was worthwhile if it cost less than the benefit from those extra two percentage points.
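The break-even logic is a one-line expected-value calculation. In this sketch the $1,000,000 payoff per hit is a hypothetical figure; only the 3% and 5% hit rates come from the example above.

```python
def max_worthwhile_spend(base_hit_rate: float, boosted_hit_rate: float,
                         payoff_per_hit: float) -> float:
    """Largest average cost per attempt that the extra hits would still cover."""
    return (boosted_hit_rate - base_hit_rate) * payoff_per_hit


payoff = 1_000_000  # hypothetical average payoff from one hit
limit = max_worthwhile_spend(0.03, 0.05, payoff)
print(f"worth spending up to about ${limit:,.0f} per attempt")
```

Anything spent below that limit, averaged over many attempts, leaves a gain even though most individual attempts still fail.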

So, I'm not ready to conclude that it's futile for entertainment companies to predict hits. If the companies were merely acting as pure observers, then Professor Watts' case would be strong enough for me. However, because entertainment companies' predictions are often entangled with manipulating the systems being predicted, there may still be reason to try to pick winners.

Whether the benefits outweigh the costs...well, that's another experiment to do.

Do You Like What You Like Because You Like What I Like?

An experiment:

  • Have a large group of people rate songs they've never heard before. Each person listens and rates privately so no one knows what others have done. If a person likes a song, he or she can download it. Call this group the "independent group."
  • Now have another large group of people do the same thing with the same songs, except members of this group can see how popular the songs are with others. Call it the "social-influence" group.
  • Split the social-influence group into eight subgroups ("worlds"). Every world has the same songs, but a song's popularity is counted only within that world. Thus, the social-influence group is split into eight parallel popularity contests.

The 4/15/2007 New York Times Magazine had a piece by Duncan Watts, professor of sociology at Columbia University, about this experiment. It was conducted via the Web with more than 14,000 participants. Professor Watts' summary of the expectations and results follows.

First, if people know what they like regardless of what they think other people like, the most successful songs should draw about the same amount of the total market share in both the independent and social-influence conditions—that is, hits shouldn't be any bigger just because the people downloading them know what other people downloaded. And second, the very same songs—the "best" ones—should become hits in all [eight] social-influence worlds.

What we found, however, was exactly the opposite. In all the social-influence worlds, the most popular songs were much more popular (and the least popular songs were less popular) than in the independent condition. At the same time, however, the particular songs that became hits were different in different worlds....

So does a listener's own independent reaction to a song count for anything? In fact, intrinsic "quality," which we measured in terms of a song's popularity in the independent condition, did help to explain success in the social-influence condition. When we added up downloads across all eight social-influence worlds, "good" songs had higher market share, on average, than "bad" ones. But the impact of a listener's own reactions is easily overwhelmed by his or her reactions to others. The song "Lockdown," by 52metro, for example, ranked 26th out of 48 in quality; yet it was the No. 1 song in one social-influence world, and 40th in another. Overall, a song in the Top 5 in terms of quality had only a 50 percent chance of finishing in the Top 5 of success.

And why did this happen?

[W]hen people tend to like what other people like, differences in popularity are subject to what is called "cumulative advantage," or the "rich get richer" effect. This means that if one object happens to be slightly more popular than another at just the right point, it will tend to become more popular still. As a result, even tiny, random fluctuations can blow up, generating potentially enormous long-run differences among even indistinguishable competitors—a phenomenon that is similar in some ways to the famous "butterfly effect" from chaos theory. Thus, if history were to be somehow rerun many times, seemingly identical universes with the same set of competitors and the same overall market tastes would quickly generate different winners: Madonna would have been popular in this world, but in some other version of history, she would be a nobody, and someone we have never heard of would be in her place.

I've quoted at length because I think it's an ingenious and compelling experiment, well explained by Professor Watts. Although we all intuitively know the bandwagon effect, this experiment quantifies its importance in the context of judging unfamiliar music. In this context, the results suggest we—the notorious average "we"—are quick to let what's popular tell us what's good.
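The rich-get-richer dynamic is easy to play with in a toy simulation. The sketch below is not the experiment's actual design: the quality scores, the 2,000 listeners per world, and the rule that each listener downloads exactly one song with probability proportional to quality times (1 + downloads so far) are all assumptions chosen for illustration.

```python
import random


def simulate_world(qualities, n_listeners, social, seed):
    """Return download counts per song for one simulated world."""
    rng = random.Random(seed)
    downloads = [0] * len(qualities)
    songs = range(len(qualities))
    for _ in range(n_listeners):
        if social:
            # Listeners see popularity, so earlier downloads boost a song's odds.
            weights = [q * (1 + d) for q, d in zip(qualities, downloads)]
        else:
            # Independent condition: only the song's own quality matters.
            weights = qualities
        pick = rng.choices(songs, weights=weights)[0]
        downloads[pick] += 1
    return downloads


def top_song(downloads):
    """Return (index of the most-downloaded song, its market share)."""
    best = max(range(len(downloads)), key=downloads.__getitem__)
    return best, downloads[best] / sum(downloads)


# 48 hypothetical songs whose quality rises slightly with the index.
qualities = [1.0 + 0.02 * i for i in range(48)]

print("independent:", top_song(simulate_world(qualities, 2000, False, seed=0)))
for world in range(1, 9):
    print(f"world {world}:", top_song(simulate_world(qualities, 2000, True, seed=world)))
```

Running it with different seeds is a cheap way to rerun history and compare both the winners and how large their shares get across worlds.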

The Joshua Bell Experiment

As "a test of whether, in an incongruous context, ordinary people would recognize genius," the Washington Post deployed violin virtuoso Joshua Bell as an anonymous street musician in a Washington, DC, commuter plaza. At his feet, open for donations, was the case of his $3.5 million Stradivarius.

In 43 minutes, Bell played six classical masterpieces. 1,070 people passed by with little to no affect. Seven people stopped for at least a minute. 27 people donated a total of $32, not counting the twenty-dollar bill Bell got from the single person who recognized him.

A fine piece of writing, the Post article describes the experiment and ruminates at length about what it might mean. I can't improve on that, but I'll add some commentary about a few of the numbers.

First, and this isn't pretty, Bell's response rate of around 2.5% is similar to response rates for direct-mail solicitations of credit cards, loan refinancings, and such. Ouch.

Second, the article tells us that Bell plays concerts where the cheap seats go for $100. It's easy to read that detail as an implied value for his commuter-plaza performance, as if the 97.5% of people who ignored him might as well have been walking past a hundred-dollar bill on the ground.

This presumes that because some people would pay $100 to see Joshua Bell, then that's the value. It's not. It's the value to the people who paid $100, not the average passer-by on the street. Based on the experiment, the value of seeing Joshua Bell to the average passer-by was roughly three cents. (Don't believe me? Divide the $32 Bell made by the 1,097 people who passed by: the 1,070 who ignored him plus the 27 who donated.)
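For anyone who wants to redo the division, here is the arithmetic as a short sketch. It takes the total number of passers-by to be the 1,070 who walked on plus the 27 who donated, which assumes the seven who stopped to listen are counted among the donors.

```python
ignored = 1_070       # passed by without reacting
donors = 27           # gave money (excluding the one who recognized Bell)
total_donated = 32.0  # dollars, excluding the twenty-dollar bill

passers_by = ignored + donors  # assumption: the seven listeners are among the donors

response_rate = donors / passers_by
value_per_passer_by = total_donated / passers_by

print(f"response rate: {response_rate:.1%}")
print(f"value per passer-by: ${value_per_passer_by:.3f}")
```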

Of course, the people paying $100 are doing so for a formal performance, at a concert hall, with an admission fee, at a convenient time, knowing that Joshua Bell is the player. All of that is missing from the experiment. So how surprised should we be that most people ignored him?

Anyway, the article is still a good, thought-provoking read. It's gotten a lot of play in the blogosphere, suggesting that if the public can't recognize anonymized genius, it can at least recognize interesting commentary about the public's inability to recognize anonymous genius.

How Big Was That Squid?

Sometimes it takes a non-numeric explanation to make numbers hit home. For example:

A fishing crew has caught a colossal squid that could weigh a half-ton and prove to be the biggest specimen ever landed, a fisheries official said Thursday. The squid, weighing an estimated 990 lbs and about 39 feet long, took two hours to land in Antarctic waters, New Zealand Fisheries Minister Jim Anderton said.

I don't know about you, but I can't easily imagine a 990-lb, 39-foot squid. The numbers are so far off my conceptual squid scale that they don't mean much beyond "huge."

I could try to relate the numbers to something more familiar, but a squid expert saved me the effort with this deft analogy: "If calamari rings were made from the squid they would be the size of tractor tires."

Ah, that big.

[The quotes are from an AP article about the squid in question.]

CMU's TrafficSTATS

Carnegie Mellon University's Center for the Study & Improvement of Regulation recently introduced TrafficSTATS (STAtistics on Travel Safety), a Web tool for analyzing traffic-fatality data. It is based on two U.S. government databases, one that tracks all accidents that involved a fatality, and another that estimates the total amount of driving done in the U.S. by characteristics such as region and vehicle type. The time range covered was 1999 to 2004.

I took a quick look, and here's what caught my eye:

  • Accidents with fatalities were extremely rare—much rarer than I would have guessed—in terms of the amount of driving done. On average, an accident with a fatality occurred once in 100 million "person miles." (A person mile counts one mile for each person that traveled that mile; if a car with three passengers travels one mile, that's three person miles.) A fatality could be anyone involved, including nonmotorists.
  • That said, people in the United States traveled so many person miles that even this tiny rate added up to something on the order of 40,000 traffic fatalities per year.
  • A motorcycle was 30 times as likely to be in a fatal accident as the average personal vehicle. (That's in terms of person miles traveled, so the metric is something like the risk per mile traveled.)
  • Females were less than half as likely as males to be in a fatal accident.
  • The average weekend day had 35% more fatal accidents than the average weekday (although Friday was a well-above-average weekday, enough to look more like Saturday and Sunday than its weekday peers).
  • In terms of person miles, people in the "East South Central" region (Alabama, Kentucky, Mississippi, Tennessee) were more than twice as likely to be in a fatal accident as people in the "New England" region (Connecticut, Massachusetts, Maine, New Hampshire, Rhode Island, Vermont). Those were the two regional extremes.
  • In terms of person miles, age groups 16-20 and 75-84 were each more than twice as likely as average to be in a fatal accident. 21-24 was slightly less than twice the average. Between 25 and 64, everything was under the average.
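
A back-of-the-envelope check of how the first two bullets fit together, as a minimal Python sketch. It treats "one fatal accident per 100 million person miles" as roughly one fatality per 100 million person miles, which is a simplification (one accident can involve more than one death). The round population figure and all of the names are my assumptions, not anything from TrafficSTATS.

```python
# Rounded figures quoted in the bullets above.
PERSON_MILES_PER_FATALITY = 100_000_000   # about one fatality per 100 million person miles
FATALITIES_PER_YEAR = 40_000              # "on the order of 40,000"

# Assumption for illustration only: a round U.S. population figure.
ASSUMED_US_POPULATION = 300_000_000


def implied_person_miles(fatalities, miles_per_fatality):
    """Total person miles implied by a fatality count and a per-mile rate."""
    return fatalities * miles_per_fatality


total_miles = implied_person_miles(FATALITIES_PER_YEAR, PERSON_MILES_PER_FATALITY)
miles_per_person = total_miles / ASSUMED_US_POPULATION

print(f"Implied person miles per year: {total_miles:,}")
print(f"Implied person miles per resident per year: {miles_per_person:,.0f}")
```

In other words, a very small per-mile risk and a very large national total are two views of the same numbers; the difference is the trillions of person miles in between.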

There are hints of messiness in the underlying data, but for simple analyses like the above, it's probably not a big deal. The raw data sets are available for those who want to work with them.

Stock-Price Milestones

Most milestones have an arbitrary quality, relying on the roundness of a number to make 500 seem more meaningful than, say, 493. But stock-price milestones have an extra layer of meaninglessness. If you know why, you can stop reading. However, the huge amount of media coverage for Google's stock price breaking $500 last week suggests to me that some people might want to read on.

Here's the problem: A company's stock price is not comparable to other companies' stock prices, nor is it necessarily comparable to itself over time. This is because the stock price represents the market value of the company divided by the number of outstanding shares. So changing the number of outstanding shares can change the price without changing the value of the company.

An example: Microsoft's share price is currently about 6% of Google's share price, yet Microsoft's market value (share price x shares outstanding) is roughly twice Google's. The difference is, Microsoft has a lot more shares outstanding.

In fact, since going public in 1986, Microsoft has split its stock nine times. A split is when a company issues multiple new shares per existing share, often 2 for 1 but sometimes with other ratios.

At this point, a single share of Microsoft IPO stock (that is, before any of the splits) would equal 288 shares of current Microsoft stock. As of today's closing price ($29.36), that original share would now be worth $8,455.68. But something tells me that the press is not readying articles about Microsoft's breaking the $8,500 barrier.
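
The split arithmetic is easy to sketch in Python. The post gives only the number of splits (nine) and the combined multiplier (288); the particular mix of ratios below, seven 2-for-1 splits and two 3-for-2 splits, is my assumption and is simply one combination that multiplies out to 288. The function names are illustrative.

```python
from fractions import Fraction

# Assumption: seven 2-for-1 splits and two 3-for-2 splits, one mix of nine
# splits that is consistent with the 288x multiplier quoted above.
SPLIT_RATIOS = [Fraction(2, 1)] * 7 + [Fraction(3, 2)] * 2


def shares_per_original_share(ratios):
    """Multiply the split ratios to get today's share count per pre-split share."""
    total = Fraction(1)
    for ratio in ratios:
        total *= ratio
    return total


def split_adjusted_value_cents(price_cents, ratios):
    """Value today, in cents, of one pre-split share at the given current price."""
    return price_cents * shares_per_original_share(ratios)


multiplier = shares_per_original_share(SPLIT_RATIOS)
value_cents = split_adjusted_value_cents(2936, SPLIT_RATIOS)  # $29.36 close
print(f"{multiplier} shares, worth ${float(value_cents) / 100:,.2f}")
```

Working in cents with exact fractions avoids floating-point rounding until the final print. Note that the company's market value never enters the calculation: a split changes the share count and the price in opposite directions and leaves their product alone.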

By contrast, Google has never split its stock. Combine that with Google's fantastic run of financial performance, and you get $500 per share, a number rarely seen with tech companies. That sounds like news until you realize the number is rarely seen because other high fliers have chosen to split their stocks well before reaching $500.

And we haven't yet mentioned the reverse split, where a company reduces the number of outstanding shares, thus raising the price. Some of the dot-com-era survivors did reverse splits to get their stocks up to respectable-sounding prices. Reverse-split at a high enough ratio, and you've got a $500-per-share stock.

You get the idea. An actual milestone measures the distance of a mile. With a stock-price milestone, your mileage may vary.

Election Night 2006 Graphics

On election night 2006, all the U.S. news organizations were working the same story: Can the Democrats take the House of Representatives and Senate?

Each organization had more or less the same data. But how they showed the data was different.

I took the following screenshots around 8:15pm PDT on November 7th. All were from news organizations' home pages or otherwise accompanied headline-level information. That is, these graphics were meant to convey the essence of the story; they were not "drill downs" of detailed data.

The Washington Post

[Image: Washington Post election graphic]
The story is about a relatively small number of seats potentially changing hands, so the reference to net gains is good. The graphic is conceptually clever, but I don't know how many people will understand it at a glance. The white space in the middle of each bar represents undecided seats; thus, the pink versus blue bars are racing to be first across the centerline.

CNN
[Image: CNN election graphics]
Let's start with the lower graphic, for the House: I get it, although the scale is odd. The bars are configured as if both sides are racing to 435. Yet as noted at the bottom of the graphic, 218 is the meaningful number. This awkwardness shows the wisdom of the Washington Post's "race to the centerline" approach, which more clearly reflects that a seat gained for one side is a seat lost for the other.

As for CNN's upper graphic, for the Senate: Why does each bar have two shades? The shading is not only inconsistent with the House graphic but also appears to go unexplained.

Fox News
[Image: Fox News election graphic]

I had to shrink this one to fit because it ran most of the way across Fox News' home page.

At first glance, I'd assume that the distribution of red versus blue represents the relative percentage of seats held. However, the amount of red and blue is the same for both House and Senate, even though the numbers are different (Democrats well ahead in the House, Republicans slightly ahead in the Senate).

So apparently the red and blue don't move with the numbers, and thus do nothing beyond ornamentation.

MSN
[Image: MSN election table]

Perhaps in response to Fox's ornamental graphics, MSN went for an old-school table o' numbers. MSN's Microsoft heritage is evident in the design, which appears to be inspired by a PowerPoint 97 template.

The weird thing about the table is how it feels like it should add up to 100, but the "6 undecided" are not in there. Also, do I really care about "Seats not at stake"? I'm forced to care, because it's the only way to understand the rest of the table, which is a problem. Just tell me what's changing and how it affects the balance of power.

MSNBC
[Image: MSNBC election graphic]

Now this is an interesting attempt. You need to perceive that the gauge's pointer can go left or right, and that leftward is Democrat, and rightward is Republican. If you get that, it's a good quick-glance view, assuming your eyes are sharp enough to read the gray-on-white numbers.

Nitpicking: Why does the Senate "decided" column have white in it and the House "decided" column does not? And what about independent candidates? For the Senate, where the balance of power turned on only a few seats, two independents won. Seems like that should be part of the visual story.

New York Times
[Image: New York Times election map]

Using a map as a visual metaphor is often a good idea, but not when you distort the map to the point where its lack of fidelity is a distraction. In addition, six color codes are probably too many.

In the Times' defense, this graphic was doing double duty as a user interface. You could click a square to get more detail on that district. Thus, each square arguably needed to be a minimum size for clickability. Or, counter-arguably, if the above graphic was the result of each square needing to be a minimum size, then they needed to do something different in the first place.

ABC News
[Image: ABC News election graphic]

This is my favorite. It's not about 100 Senate seats; it's just about the change in balance of power. It tells us the magnitude and direction of change, and it provides the context for how many seats are necessary for the Democrats to take control.

And that's all it does. Works for me.

Round-Up and Wrap-Up
If you scroll back up through the various graphics, I think you'll find that other than ABC's (and, to a lesser extent, MSNBC's and the Washington Post's), they made the story more complex than necessary. They each did one or more of the following:

  • They gave nonessential numbers (for example, MSN's "Seats not at stake")
  • The numbers they gave were anchored in total seat counts when the real story was the change in a small number of seats (for example, CNN's race to 435)
  • They used graphics that confused more than enlightened (for example, Fox's unchanging red versus blue, the New York Times' abstract map)

All this goes to say, it's not easy to create these graphics, especially in the TV news field, where more information on the screen is often mistaken for better information.

Congratulations to those who managed to keep the numbers, as they say in Washington, DC, "on message."

One Person, One Vote, Many Voting Systems

In an election, winners and losers are sometimes determined as much by the voting system as the voters. For example, the United States' Electoral College allows a candidate to win the U.S. presidency without winning the popular vote, as happened in 1824, 1876, 1888, and 2000.

In San Francisco, we have a special voting system, Ranked Choice Voting, for certain local elections. Instead of voting for a single candidate, voters rank their choices.

Given a population of voter preferences, Ranked Choice Voting can not only produce different results from traditional voting, but its various implementations can also produce different results from one another.

The implementation of Ranked Choice Voting that San Francisco uses, Instant Runoff Voting (IRV), works like this:

  • You rank multiple candidates for an office, indicating your first choice, second choice, and so on.
  • If no candidate attains a majority of first-choice votes, the candidate with the fewest first-choice votes is eliminated.
  • Those who voted for the eliminated candidate have their second-choice votes added to the remaining candidates' totals.
  • If that reallocation does not create a majority for one candidate, the process continues until a majority is reached.

The process is called Instant Runoff Voting because it resembles a series of run-offs. Whereas traditional run-offs happen over time, IRV gets all the necessary information up front, allowing all elimination stages to occur immediately.
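
The elimination rounds described above can be sketched in Python. This is a minimal sketch, not San Francisco's actual tabulation rules: the function name, the alphabetical tie-break, and the handling of ballots that run out of ranked candidates are my assumptions. The ballot counts are borrowed from the Wikipedia example quoted later in this post.

```python
from collections import Counter

# (number of voters, ranking from first choice to last), taken from the
# Wikipedia example quoted later in this post.
BALLOTS = [
    (39, ["Andrew", "Brian", "Catherine"]),
    (12, ["Brian", "Andrew", "Catherine"]),
    (7, ["Brian", "Catherine", "Andrew"]),
    (42, ["Catherine", "Brian", "Andrew"]),
]


def instant_runoff(ballots):
    """Return (winner, rounds), where rounds holds each round's vote tallies."""
    remaining = {name for _, ranking in ballots for name in ranking}
    rounds = []
    while True:
        tally = Counter({name: 0 for name in remaining})
        for count, ranking in ballots:
            # Each ballot counts for its highest-ranked surviving candidate.
            top = next((name for name in ranking if name in remaining), None)
            if top is not None:
                tally[top] += count
        rounds.append(dict(tally))
        total = sum(tally.values())
        leader, leader_votes = max(tally.items(), key=lambda item: item[1])
        if leader_votes * 2 > total or len(remaining) == 1:
            return leader, rounds
        # Eliminate the candidate with the fewest votes. Ties are broken
        # alphabetically, an arbitrary choice made for this sketch.
        loser = min(tally, key=lambda name: (tally[name], name))
        remaining.remove(loser)


winner, rounds = instant_runoff(BALLOTS)
for number, tally in enumerate(rounds, start=1):
    print(f"Round {number}: {sorted(tally.items())}")
print("Winner:", winner)
```

Because every ballot already carries a full ranking, the loop never has to go back to the voters, which is the "instant" part of the name.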

Wikipedia's entry on the subject gives an interesting example of how the same voter preferences can have different results depending on the voting system. (I've added some definitions, in brackets, to the Wikipedia text.)

Imagine an election in which there are three candidates: Andrew, Brian and Catherine. There are 100 voters and they vote as follows...

#     39 voters   12 voters   7 voters    42 voters
1st   Andrew      Brian       Brian       Catherine
2nd   Brian       Andrew      Catherine   Brian
3rd   Catherine   Catherine   Andrew      Andrew

In a plurality election [where the winner is the candidate with the most first-choice votes], Catherine would be elected.

In a [standard] runoff election, the voters would choose in a second round between Catherine and Andrew.

In [a San Francisco-like Ranked Choice Voting] election Andrew will be elected.

Under Condorcet's method [each ballot's rankings are converted into pairwise preferences, such as A beats B but C beats A, which are then tallied across all ballots] or the Borda count [each candidate gets points in proportion to his/her rank on a ballot, such as first-choicers get 5 points and fifth-choicers get 1 point] Brian would win.

Don't worry about processing the details. Let's cut to the implications (also quoted from the Wikipedia article):

[Instant Runoff Voting] may be less likely to elect centrist candidates than some other preferential systems, such as Condorcet's method and the Borda count. For this reason it can be considered a less consensual system than these alternatives. Some IRV supporters consider this a strength, because an off-center candidate, with the enthusiastic support of many voters, may be preferable to a consensus candidate and that this candidate still must be accepted by a majority of voters.

IRV produces different results to Condorcet and the Borda count because it does not consider the lower preferences of all voters, only of those whose higher choices have been eliminated, and because of its system of sequential exclusions. IRV's process of excluding candidates one at a time can lead to the elimination, early in the count, of a candidate who, if they had remained in the count longer, would have received enough transfers to be elected.

You get the idea: same voter preferences, different results.
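
For anyone who does want to process the details, the plurality, Borda, and Condorcet outcomes in the Andrew/Brian/Catherine example can be checked with a short Python sketch. The function names are mine, and the Borda scale (3, 2, 1 points for first, second, and third choice) is an assumption, since the quoted text does not say which point scale it used.

```python
from itertools import combinations

# (number of voters, ranking from first choice to last), from the example above.
BALLOTS = [
    (39, ["Andrew", "Brian", "Catherine"]),
    (12, ["Brian", "Andrew", "Catherine"]),
    (7, ["Brian", "Catherine", "Andrew"]),
    (42, ["Catherine", "Brian", "Andrew"]),
]


def plurality(ballots):
    """Winner is whoever has the most first-choice votes."""
    totals = {}
    for count, ranking in ballots:
        totals[ranking[0]] = totals.get(ranking[0], 0) + count
    return max(totals, key=totals.get), totals


def borda(ballots):
    """With n candidates, a first choice earns n points and a last choice 1."""
    totals = {}
    for count, ranking in ballots:
        n = len(ranking)
        for position, name in enumerate(ranking):
            totals[name] = totals.get(name, 0) + count * (n - position)
    return max(totals, key=totals.get), totals


def condorcet(ballots):
    """Return the candidate who wins every head-to-head contest, or None."""
    candidates = sorted({name for _, ranking in ballots for name in ranking})
    wins = {name: 0 for name in candidates}
    for a, b in combinations(candidates, 2):
        a_votes = sum(c for c, r in ballots if r.index(a) < r.index(b))
        b_votes = sum(c for c, r in ballots if r.index(b) < r.index(a))
        if a_votes > b_votes:
            wins[a] += 1
        elif b_votes > a_votes:
            wins[b] += 1
    for name in candidates:
        if wins[name] == len(candidates) - 1:
            return name
    return None


print("Plurality:", plurality(BALLOTS))
print("Borda:", borda(BALLOTS))
print("Condorcet:", condorcet(BALLOTS))
```

The ballots never change between the three calls; only the counting rule does.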

And for a final twist, does the scenario with Brian, Andrew, and Catherine work in the real world? It assumes that the voter preferences and the voting systems are independent—or, put another way, that different voting systems would elicit the same preferences.

But in a real-world election, candidates know which voting system will be used, and they target their campaign spending to shape the preferences of specific segments of voters. Depending on the voting system, it could make sense to target different voters, thus leading to potentially different preferences.

The takeaway: A lot of potential complexity lurks behind "one person, one vote."

The Netflix Prize: Research Project as Product

Several people have asked what I think of the Netflix Prize, a $1 million contest to improve Netflix's movie recommendations by 10%. For those expecting an "analyze the analytics" posting like Pandora vs. Last.fm, I'm going to throw you a curveball. I think the more interesting story here is about product marketing—and the Netflix Prize itself is the product.

Productizing a Research Project
From Netflix's perspective, better recommendations mean higher profits. For those interested in the economics, Chris Anderson (author of The Long Tail) explains them.

But how do you make better recommendations? The usual approach would be to put some researchers on an internal project. Netflix had been doing that for years, but their researchers appear to have hit the point of diminishing returns.

Then somebody had the idea of throwing open the problem to the rest of the world—something to the effect of, "There must be thousands of people with the skills, motivation, and computing hardware to tackle this problem. We just need them to work for us."

There are indeed lots of experts in fields like statistical computing, machine learning, and artificial intelligence. There are even more dabblers who know just enough to be dangerous and could come up with answers the pros would never consider. The more people involved, the better the chance of success.

So from Netflix's perspective, the problem evolved from creating a better algorithm to creating something (the Netflix Prize) that in turn would create a better algorithm for Netflix. In essence, they built the Netflix Prize as a product: The "customers" were the prospective researchers; the challenge was to design and market something that would get these customers' buy-in to participate.

Getting Attention: Eyes on the Prize
The $1 million prize is the most obvious feature. Having noticed the success (and now proliferation) of science-based prizes like the Ansari X Prize, Netflix no doubt liked the combination of free publicity such a prize generates along with the competitive dynamic that real money generates. In other words, don't just get thousands of virtual researchers, but put them in a race. The press and blogosphere were duly abuzz.

Making It Real: Heavy-Duty Data
Netflix offered up a huge, real-world data set of people's movie ratings. This alone would have been enough to get lots of smart people playing with the data, since the typical data miner (who does not happen to work at Netflix, Amazon.com, or other data-rich players) rarely if ever gets a crack at data like this.

That said, Netflix slightly tainted this feature by "perturbing" an unspecified amount of the data "to prevent certain inferences being drawn about the Netflix customer base." It's not a big issue because a built-in limit exists to Netflix's messing with the data: If the perturbed data ends up differing from the original data in important ways, Netflix could end up with a nightmare scenario where the winning algorithm exploits those differences and thus is not applicable to the original data. If that happened, Netflix would pay $1 million for an algorithm they can't use on their actual data. As a result, we can safely assume the perturbed data is faithful to the original.

Talking Right: The Web Site
The Netflix Prize has its own Web site with a voice that is well tuned to its "customers," the researcher types. The Rules and FAQ pages are not written in legalese, academic jargon, or various marketing dialects that no one speaks but that nevertheless appear in written form everywhere. The text is smart but informal, technical where necessary but not gratuitously so. To whoever wrote it, I salute you.

The Web site also includes a simple but effective leaderboard and community forum.

Giving Back: Winner Tells the World
Anticipating that most prospective researchers would immediately look for a catch—like what happens to the intellectual property you submit—Netflix summarizes the relevant terms in plain English: "To win...you must share your method with (and non-exclusively license it to) Netflix, and you must describe to the world how you did it and why it works." I expected something far more dire. Besides adding a touch of idealism to the proceedings, the bit about telling the world talks to the likeliest suspects for contestants: academics or corporate researchers who have strong professional incentives to publish their work.

Selling the Goal: It's Only 10%

"10% improvement" is a clever packaging of the goal, because it's a lot harder than it sounds. According to the FAQ, Netflix's own algorithm—the one you're trying to beat by 10%—is only 10% better than "if you just predicted the average rating for each movie." In other words, a naive approach works pretty well. And while there is still a significant amount of distance between Netflix's algorithm and perfection, anything close to perfection is impossible because people are not consistent raters, neither among each other nor individually over time. Thus, a major unknown is how much headroom exists to do better before one hits the wall of rating noise. Yet it is known that achieving the first 10% over a naive approach was far from trivial.
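
To make "10% better" concrete: the contest scored entries by root-mean-square error (RMSE) between predicted and actual ratings, and improvement is the percentage reduction in that error relative to a baseline. Below is a minimal Python sketch of that calculation. The ratings and predictions are invented for illustration (they are not Netflix data), and the function names are mine.

```python
import math


def rmse(predicted, actual):
    """Root-mean-square error between two equal-length lists of ratings."""
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    squared_errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared_errors) / len(actual))


def percent_improvement(baseline_rmse, new_rmse):
    """Percentage reduction in RMSE relative to the baseline."""
    return 100.0 * (baseline_rmse - new_rmse) / baseline_rmse


# Hypothetical data: five ratings of a single made-up movie.
actual = [4, 3, 5, 2, 4]
movie_average = sum(actual) / len(actual)
naive_guess = [movie_average] * len(actual)    # predict the movie's average
better_guess = [4.0, 3.2, 4.5, 2.5, 3.8]       # an invented "smarter" model

baseline = rmse(naive_guess, actual)
improved = rmse(better_guess, actual)
print(f"Naive RMSE: {baseline:.4f}")
print(f"Better RMSE: {improved:.4f}")
print(f"Improvement: {percent_improvement(baseline, improved):.1f}%")
```

The toy gap is far larger than anything achievable on real ratings. The point is only the mechanics: each successive 10% is measured against an error that has already been squeezed once.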

The Results So Far
Three weeks into the competition, more than 10,000 contestants have registered. Twelve contestants have cleared the 1% improvement mark, seven have cleared 2%, three have cleared 3%, and two have cleared 4%. The current leader is at 4.67% improvement, almost halfway to the $1 million prize.

Given that Netflix was ready to let the contest run for ten years, and included yearly "Progress Prizes" for contestants who could exceed the best score by 1%, I'd say the Netflix Prize has exceeded expectations so far. And that does not factor in the positive public relations and consumer awareness that came with the various press hits.

If the progress continues at the current rate, the contest will be over at the three-month minimum that Netflix has set. However, extrapolating from the current pace is risky. Every additional point of improvement will be harder, and we don't know where the practical limit is.

Why It's Different
There have been various other data-mining competitions. I'll hazard a guess that Netflix's is the first to be covered as a feature story in the New York Times and will easily be the largest ever in terms of participation. (The New York Times story is already behind the pay wall, but a syndicated version is available at News.com.)

The comparison with previous competitions is not fair, because other competitions have tended to be academic affairs, providing a little collegial competition at conferences. Yet Netflix's success underlines how much more can be done when a data-mining competition becomes a means to do business.

By treating the Netflix Prize as a product, complete with features designed to maximize "customer" buy-in, Netflix created something far better than spending $1 million on its own researchers' salaries over time. In that sense, the Netflix Prize is more interesting as a business method—spearheaded by spot-on product marketing—than a "Which algorithm will win?" story.

So I say to Netflix: Great idea, great execution. And to the contestants: May the best algorithm win.

CarMax Does Data Better

The September 2006 issue of Business 2.0 has an article, "The Wal-Mart of Used Cars," about CarMax, an analytics-driven chain of superstores for used cars.

In the same way that Wal-Mart revolutionized the logistics of retailing, CarMax set out to nail the perfect mix of inventory and pricing through exhaustive analysis of sales data. Its homegrown software helps CarMax determine which models to sell and when consumer demand is shifting. Each car is fitted with an RFID tag to track how long it sits and when a test-drive occurs....

Without the data, stocking CarMax lots would be a logistical nightmare. Each store carries 300 to 500 cars at any given time, and unlike Wal-Mart, the company has no vendors to stock its "shelves." Instead, CarMax depends on 800 car buyers, who draw on the company's reams of data to appraise vehicles.

The article doesn't mention it, but I suspect that CarMax's situation is one where the analytics appear to be the competitive advantage yet the real advantage is the data feeding the analytics. That is, analyzing sales and inventory data a la CarMax involves a mature set of techniques and tools; it's highly unlikely that CarMax has found a new analytics secret sauce. Far likelier is that CarMax collects more and better data than the competition, allowing those mature analytical techniques to yield better results.

For example, consider two big advantages CarMax has in data collection:

  • It's a network of superstores, each of which carries many more cars than a typical dealership. This scale means CarMax can sample the marketplace better than other used-car dealers.
  • CarMax's car buyers act as a data-normalizing force, ensuring that the details of cars in CarMax's database are classified in a complete and consistent way. This advantage is key compared to the obvious alternative of scraping eBay and other online sources of used cars, which together would comprise a sample even better than CarMax's. The problem is, the greater quantity of data comes at the cost of much lower quality. There would be no common definition of key attributes like "good condition" or, for that matter, no standards for what attributes to include. That means noisy, messy data, just the thing to make otherwise good analytics look bad.

So let CarMax be a reminder: Amid all the attention Internet-based businesses get for their unprecedented data opportunities, traditional businesses like used-car lots can be networked and data-intensified to compete in new ways as well.

Wine Ratings: Drunk on Numbers?

In "Wine Ratings Might Not Pass the Sobriety Test," Gary Rivlin of the New York Times examines the 100-point rating systems that have become pervasive in the wine business. Some highlights:

A rating system that draws a distinction between a cabernet scoring 90 and one receiving an 89 implies a precision of the senses that even many wine critics agree that human beings do not possess. Ratings are quick judgments that a single individual renders early in the life of a bottle of wine that, once expressed numerically, magically transform the nebulous and subjective into the authoritative and objective.

When pressed, critics allow that numerical ratings mean little if they are unaccompanied by corresponding tasting notes (“hints of blackberry,” “a good nose”). Yet in the hands of the marketers who have transformed wine into a multibillion-dollar industry, The Number is often all that counts. It is one of the wheels that keep the glamorous, lucrative machinery of the wine business turning, but it has become so overused and ubiquitous that it may well be meaningless — other than as an index of how a once mystical, high-end product for the elite has become embroidered with the same marketing high jinks as other products peddled to the masses.

Although four- or five-star rating systems for wine existed before, Robert Parker originated the modern 100-point system in 1978. Since then, it has inspired many imitators, to the point where a single wine may be rated by a dozen different 100-point systems.

Cork dorks say that even today, the only scores that count are those of the first two publications to embrace the 100-point score: Mr. Parker’s Wine Advocate and Mr. Shanken’s Wine Spectator. That has not stopped retailers from cherry-picking high scores no matter who comes up with them. Wine.com uses no less than seven sources when fishing for members of the 90+ club, including The Wine News, the Connoisseurs Guide and the International Wine Cellar. And in a pinch, Wine.com is not above turning to an eighth source.

When promoting Capcanes 2001 Costers del Gravet, a Spanish wine, for instance, Wine.com quoted a well-regarded publication, International Wine Cellar, written by Stephen Tanzer, in its review. But the source of the 91 that earned the 2001 Costers a place on its 90+ list was Wine.com itself. (The company did not return a call seeking comment.)

Not only are these systems open to overt manipulation, but even the most respected and systematic raters communicate their biases, if inadvertently:

Mr. Parker and the critics from Wine Spectator tend to save their highest ratings for robust-tasting, more intense wines....“That’s another way numbers are misguiding people,” said Mr. Tisherman, the former Wine Enthusiast editor who now calls himself a “recovering critic” and helps clients sponsor wine-tasting parties. “A 96 is better than an 86, but not if you want a light-bodied wine, and Americans tend to prefer light-bodied wines. Yet those are also the wines least likely to get a good score.”

Although I've provided several tastings from the article, I'd recommend you quaff the whole thing. It has precision, balance, concentration, power and finesse, with plush layers of currant, mocha, berry, mineral and spice—oh wait, that last part is not about the article; it's from the description of Wine Spectator's 2005 wine of the year, Joseph Phelps Insignia Napa Valley 2002.

Did I mention it scored a 96?

Vanity Sizing

As part of my day job, I receive various news about the retailing industry, from which I bring you the following abuse of numbers, apparently particular to women's clothing.

ABC News' Good Morning America recently reported about "vanity sizes" in women's clothing:

[C]onsidering pop culture's obsession with thinness, for many women no size is too small.

"I had, one time, a client who said, 'I get into a 10 now,' " said Bridgette Raes, a fashion consultant. "She was originally a size 14. When she could get into a 10, and then into an 8, she was like, 'I know that it was a lie, I know that this really isn't a 10, but I love the fact that the label says 10.'"

That may be the thinking behind vanity sizing — which means clothes are cut bigger, but sized smaller.

"Manufacturers and brands are trying to really make women feel good about buying their brand," said Marshall Cohen, a retail industry analyst. "If you were worried about being a size 14 or 16, I can make you feel great by a size 10 or 12."

One size 0 could have a waistline of 28 inches, which is, according to American Society of Textile and Material, a size 10.

It's not a new topic. This article, from the Arizona Republic in 2004, indicates that vanity sizing has been around a long time, and when efforts periodically emerged to (re)standardize women's sizing, the apparel manufacturers ignored them. By contrast, men's clothing sizes have largely stayed the same over time.

I suspect most women understand vanity sizing, and per the article, many appreciate it. So among the sins of misusing numbers, stretching the standard-sizing truth is like a white lie that everyone's in on. After all, if the scale doesn't lie, clothes can at least fib.

Why You'll Probably Outlive the Average Life Expectancy

The average life expectancy in the United States is roughly 75 years for males, 80 years for females. Chances are, you will exceed the number that applies to you.

Because life-expectancy numbers are often based on recent mortality rates, you might be thinking that future advances in medicine will give you an edge. While that may be true, the surprise is that you already have an advantage over the original numbers just by being alive to read this.

Think of 100,000 people born the same year as you. A certain percentage of that original population will die each year, as represented by the distribution below. (The original numbers are from the U.S. Social Security Administration, from which I derived the measures and charts on this page; click any chart for a larger version.)

[Chart: death rate by age]

It's not a happy thing, but each bar in the chart indicates the percentage of the original 100,000 people that died, or are projected to die, in each year. You don't know which future bar has your name on it, but you do know that all the bars to the left of your age no longer apply to you. As a result, your current life expectancy is computed against the average of the remaining population.

In turn, that means your expected age at death keeps rising as you get older, and that you have exceeded the original average practically from the beginning, as illustrated below (based again on the same Social Security data).

[Chart: actual versus expected age]

Note that if you are a 40-year-old male, you're already up more than two years from the original average. If you are a 40-year-old female, you're up about a year and a half. And males who live into their mid-80s will have closed most of the gap in life expectancy versus females.
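
The conditioning logic behind these charts can be sketched in a few lines of Python. The cohort below is a toy distribution that I invented for illustration; it is not the Social Security data, and the function name is mine. The idea is simply to drop the "bars to the left of your age" and average what remains.

```python
# Hypothetical cohort of 100 people, invented for illustration:
# age at death -> number of people who die at that age.
TOY_COHORT = {1: 10, 50: 20, 70: 30, 90: 40}


def expected_age_at_death(deaths_by_age, current_age=0):
    """Average age at death among cohort members who outlive current_age."""
    survivors = {age: n for age, n in deaths_by_age.items() if age > current_age}
    total = sum(survivors.values())
    if total == 0:
        raise ValueError("no one in the cohort outlives current_age")
    return sum(age * n for age, n in survivors.items()) / total


for age in (0, 40, 60, 80):
    expected = expected_age_at_death(TOY_COHORT, age)
    print(f"Alive at {age}: expected age at death {expected:.1f}")
```

Each time an early-death bar falls behind you, it stops dragging down the average, so the expected age at death can only hold steady or rise as you age.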

Of course, all this is based only on the male and female averages. Your life expectancy will rise or fall based on other important attributes. For example, if you are a chain-smoking alcoholic who lives on a Superfund site, you might want to lower your expectations.

Nevertheless, everything else being equal, this is a subject where it's nice to know the odds are with you.

Data Visualization as Art

Will there be a future Rembrandt whose medium is data visualization? I was thinking about this after encountering Jesse Bachman's "Death and Taxes: A visual look at where US tax dollars go."

According to the summary, Bachman spent close to a year researching and creating this visualization of where the U.S. government spends money. I have reproduced a small version below...

[Image: "Death and Taxes"]

...but I highly recommend you scroll around the big version to appreciate the piece's detail, clarity, and artistry.

I use the word artistry with the idea that some data visualizations qualify as art. Bachman's piece clearly has artistic intent, from its political message to the name of the site it's on (deviantART). And independent of the data's message, the visual design and rendering are...well, artistic.

By comparison, below is an infographic on a similar subject. It is nicely done but feels more like good craft than art. (See here for the full-page version.)

[Image: war infographic]

(Yes, I realize at this point that we are ankle-deep in the "What is art?" swamp. Maybe Bachman's stuff is really "graphic design"? Or can graphic design be art? And so on. For the rest of this post, I promise to restrict myself to sloshing around the edge of the swamp rather than going deeper.)

Bachman is selling posters of "Death and Taxes," so you can hang it on your wall, art-like. Similarly, data-visualization titan Edward Tufte's Web site has a "Fine Art" section where you can order large, high-resolution prints of his work.

And then there's Mark Lombardi, whose work I saw a few years ago in an art gallery. He researched and created highly detailed graphs showing the connections between people and events. Here's an example of one of his works, "george w. bush, harken energy, and jackson stevens c.1979-90, 5th version."

Lombardibig

Here is a close-up of one little part:

Lombardidetail

This piece is "only" 20 x 44 inches. Lombardi's work got as big as 5 feet by 12 feet, dense with connections. Everything he did was researched and drawn by hand. Despite working in the computer age (up until his death in 2000), he used index cards for the research and pencil/graphite on paper for the pieces. See here for more examples as well as, at the bottom of the page, Lombardi's commentary.

The schematic-diagram look of Lombardi's work was an artistic choice, a visual antiseptic that left only facts on the page. Because many of his pieces involved scandals, the connections often intersected the famous (George W. Bush and Bill Clinton each got caught in a Lombardi web) with the infamous, leaving the viewer to decide the significance.

I bring up Lombardi because his work and Bachman's "Death and Taxes" strike me as opposite ends of the "data visualization as art" spectrum. While both render data clearly and with a message—that is, they are not using data to drive abstract art (a whole other category)—Bachman does so with overt artistic technique whereas Lombardi employs the covert artistry of minimalism.

So if certain data visualizations can be art, we might as well ask whether history will judge a future data-viz artist as a master, on par with a Rembrandt. I think it could happen because, when anointing great artists, art historians often pick artists whose work is representative of their time. This being the information age, data-viz art looks suspiciously representative to me.

[I originally found "Death and Taxes" via a write-up on visualcomplexity.com.]

GM Gets Shifty With Numbers

General Motors (GM) gets shifty with numbers in a recent print ad titled "Change is in the air." I saw it on page 53 of the Economist magazine's U.S. edition dated May 13-19, 2006.

The ad begins:

We're changing a lot of things at GM these days. Even people's minds. Take the environment. Today we lead the industry in the number of models that get an EPA estimated 30 mpg or better on the highway. More than Toyota or Honda.

The ad does not mention that GM also leads the industry in the number of models that get an EPA estimated 29 miles per gallon or less. What? It turns out GM can win either side of this issue because GM has significantly more models than any other car company. In other words, "We have the most models above 30 mpg! We have the most models below 30 mpg! How? Because we have, by far, the most models!"

To make a more meaningful comparison, let's look at the percentage of each car company's models that get 30 mpg or better on the highway. Using the Environmental Protection Agency's data for model-year 2006 cars, we find that 14.1% of GM's models get 30 miles per gallon or better on the highway. That's less than half the 30-mpg+ percentage of either Toyota (36.7%) or Honda (36.4%). It's also less than the average 30-mpg+ percentage across all car models in the database (17.3%).

Thus, GM's "leadership" doesn't look so good from the more meaningful angle. (For those who remember the Arizona State University ad that claimed superiority over Stanford and several Ivy League schools in the number of freshmen who were top-10% high schoolers, this GM ad is abusing numbers in a similar way.)

Anyway, I have no problem with GM cars, just this particular ad. Next time, GM, don't rely on a dodgy number to make your point.

And for the record, below is a ranking of car brands by the percentage of models that get 30 mpg or better on the highway. You can create this analysis from the 2006 EPA data file using an Excel PivotTable. The original data represents each GM brand separately, but I have added a line at the end that totals the GM brands listed in the ad (Buick, Cadillac, Chevrolet, GMC, Hummer, Pontiac, Saab, and Saturn).

2006_mpg
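For readers who would rather script this than build a PivotTable, here is a minimal Python sketch of the same count-versus-percentage comparison. The rows and the brand names (BigMaker, SmallMaker) are made up for illustration; they are not the EPA data, and the 30 mpg threshold is the only thing taken from the ad.

```python
from collections import defaultdict

# Hypothetical (brand, highway_mpg) rows. NOT the real EPA data.
ROWS = [
    ("BigMaker", 34), ("BigMaker", 31), ("BigMaker", 30),
    ("BigMaker", 27), ("BigMaker", 25), ("BigMaker", 24),
    ("BigMaker", 22), ("BigMaker", 21), ("BigMaker", 19), ("BigMaker", 17),
    ("SmallMaker", 36), ("SmallMaker", 32), ("SmallMaker", 28), ("SmallMaker", 26),
]


def summarize(rows, threshold=30):
    """Return {brand: (models_at_or_above_threshold, total_models, share)}."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for brand, mpg in rows:
        totals[brand] += 1
        if mpg >= threshold:
            hits[brand] += 1
    return {b: (hits[b], totals[b], hits[b] / totals[b]) for b in totals}


summary = summarize(ROWS)

# Rank brands by share of models at 30 mpg or better, highest first.
ranked = sorted(summary.items(), key=lambda kv: kv[1][2], reverse=True)
for brand, (n_hits, n_total, share) in ranked:
    print(f"{brand:<12}{n_hits:>3} of {n_total:>2} models ({share:.1%})")
```

In this toy data the brand with the most models wins on count (3 versus 2) and loses on percentage (30% versus 50%), which is the same pattern as in the ad.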

Where the U.S. Poverty-Rate Metric Came From

How to measure poverty is a subject of continual debate. In the process of reviewing the issue, a recent New Yorker piece included an interesting tale about Mollie Orshansky, the statistician who devised the method for calculating the U.S. poverty rate:

From 1945 to 1958, she worked in the Department of Agriculture's Bureau of Human Nutrition and Home Economics, where she worked on a series of diets designed to provide poor American families with adequate nutrition at minimal cost. In painstaking detail, the food plans laid out the amount of meat, bread, potatoes, and other staples that families needed in order to eat healthily.

In 1958, Orshansky joined the research department of the Social Security Administration, and decided to try to estimate the incidence of child poverty.... Orshansky used her food plans to calculate a subsistence budget for families of various sizes. For a mother and father with two children, she estimated the expense of a "low cost" plan at $3.60 a day, and of an even more frugal "economy plan" at $2.80 a day. Rather than trying to calculate the price of other items in the family budget, such as rent, heat, and clothing, Orshansky relied on a survey by the Agriculture Department, which showed that the typical American family spent about a third of its income on food. Thus, to determine the minimum income a family needed in order to survive, she simply multiplied the annual cost of the food plans by three. Families on the low-cost plan needed to earn at least $3,955 a year; families on the economy plan needed to earn $3,165.

Orshansky compared these figures with the Census Bureau's records on pre-tax family incomes and concluded that twenty-six per cent of families with children earned less than the upper poverty threshold and eighteen per cent earned less than the lower poverty threshold. In total, she estimated that between fifteen million and twenty-two million children were living in poverty, a disproportionate number of them in single-parent households and minority neighborhoods.

[In 1964], Congress created the Office of Economic Opportunity, which used Orshansky's method to determine eligibility for new anti-poverty programs, such as Head Start. Other federal agencies followed suit, and in 1969 the White House adopted a slightly modified version of Orshansky's lower threshold—the one based on the economy food plan—as the official poverty line.

The article notes that, to this day, the same general method is in use, not because it's the best—it has numerous, obvious problems—but because changing the measurement would change the poverty numbers, which would be politically unacceptable.
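
The multiply-by-three rule in the excerpt is simple enough to sketch in a few lines of Python. The 365-day year and the function name are my assumptions; the daily costs and the multiplier come from the quoted passage.

```python
DAYS_PER_YEAR = 365          # assumption: the excerpt does not say how the year was counted
FOOD_SHARE_MULTIPLIER = 3    # food is about a third of income, so income = 3 x food cost


def poverty_threshold(daily_food_cost):
    """Annual income threshold implied by a daily food-plan cost (hypothetical helper)."""
    return daily_food_cost * DAYS_PER_YEAR * FOOD_SHARE_MULTIPLIER


low_cost = poverty_threshold(3.60)   # "low cost" plan for a family of four
economy = poverty_threshold(2.80)    # "economy" plan for a family of four

print(f"Low-cost plan threshold: ${low_cost:,.0f}")
print(f"Economy plan threshold:  ${economy:,.0f}")
```

This back-of-the-envelope version lands near, but not exactly on, the thresholds quoted in the article ($3,955 and $3,165), so the original calculation evidently used inputs or adjustments that the excerpt does not spell out.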

For the full article, see the New Yorker, "Relatively Deprived: How Poor is Poor?"

When Counts and Percents Contradict

I'm requesting an immediate dispatch of the Numeracy Police to Phoenix Sky Harbor Airport, Terminal 4, downward escalator to baggage claim—code 319: material contradiction between count and percent, in progress:

Asu

The sign says, "1,732 ASU freshmen graduated in the top 10 percent of their high school class. That's more than Harvard, Princeton, Stanford or Yale. And one more reason to choose ASU."

ASU is Arizona State University. Despite its many merits, ASU is rarely grouped with these other elite schools. So how does it happen here?

According to Wikipedia, ASU has 48,955 undergraduates. That compares to the other colleges mentioned as follows (again, with numbers from Wikipedia):

College      Undergraduates
ASU                  48,955
Harvard               6,655
Stanford              6,654
Yale                  5,300
Princeton             4,365

I could not quickly find the numbers for each college's freshman class. So for the sake of simplicity, let's say that 25% of undergraduates at each college are freshmen (no doubt an underestimate but we're applying it across the board). That would be 12,239 freshmen at ASU and an average of 1,436 freshmen at the others.

At 1,732 out of 12,239, ASU freshmen who graduated in the top 10% of their high school class are a modest share (about 14%) of all ASU freshmen. But in absolute numbers, it is literally no contest: ASU's 1,732 top-10% freshmen outnumber any of the other schools' entire freshman class. And even if our numbers are off a little, it won't change the fact that this comparison's playing field is tilted heavily toward ASU.
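
The arithmetic above is easy to check with a short Python sketch. The enrollment figures and the 1,732 count come from this post; the 25% freshman share is the simplifying assumption stated earlier, not a real enrollment statistic.

```python
# Undergraduate enrollments as listed in the post (from Wikipedia).
UNDERGRADS = {
    "ASU": 48_955,
    "Harvard": 6_655,
    "Stanford": 6_654,
    "Yale": 5_300,
    "Princeton": 4_365,
}
FRESHMAN_SHARE = 0.25   # simplifying assumption applied to every school
ASU_TOP10 = 1_732       # count from the airport sign

# Estimated freshman class size per school.
freshmen = {college: n * FRESHMAN_SHARE for college, n in UNDERGRADS.items()}

others = [n for college, n in freshmen.items() if college != "ASU"]
avg_other = sum(others) / len(others)
largest_other = max(others)
asu_share = ASU_TOP10 / freshmen["ASU"]

print(f"Estimated ASU freshmen:            {freshmen['ASU']:,.0f}")
print(f"Average freshmen at the others:    {avg_other:,.0f}")
print(f"Largest other freshman class:      {largest_other:,.0f}")
print(f"ASU top-10% share of its freshmen: {asu_share:.1%}")
```

Under these assumptions, ASU's top-10% count exceeds the largest estimated freshman class among the other four schools, while making up only about one in seven of ASU's own freshmen.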

That said, the sign is factually correct, and it has a legitimate underlying message: "If you want to be around top 10% students, ASU has a lot of them." The problem is, the sign invokes the elite schools in a way that implies ASU belongs to a group of colleges where top 10% students are not only numerous but also the vast majority. It does not indicate that ASU is the exception in the group, achieving its large number of smart students by having a disproportionately large number of students overall.

In other words, the airport sign uses a count but implies a percentage. That's bad if the actual percentage would undermine the point, as is the case here: If the "1,732" in the airport sign were expressed as a percentage of ASU freshmen, the rest of the statement would not be true, or even in the ballpark of true.

I doubt ASU's advertising people thought twice about this; they simply wanted a catchy sign. I'd like to think it won't persuade many of this year's top 10% high schoolers.

How Not To Use Numbers: High Schools Withhold Student Rankings

A recent New York Times article, "Schools Avoid Class Ranking, Vexing Colleges," illustrates how not to use numbers—in both senses of the phrase.

The core problem is this: Grade point averages are subjective and minimally standardized. Some high schools give out "A"s more easily than others. This causes colleges to request not only a student's grade point average but also the student's class rank. The rank provides a reference point for what the student's grade point average is worth at the student's school.

But if we learn from student rankings that a school has double the average percentage of "A" students, should the achievements of the lower half of those "A" students be devalued? Maybe the school just has more smart students than average. The usual problems with using averages aside, many high schools' argument boils down to: "My school's 85th-percentile-ranked student might be better than another school's 95th-percentile student."

So far we have reasonable points on both sides. But where you'd think both sides would be working to agree on more and/or better measurements, an increasing number of high schools have concluded they should just withhold student rankings.

The schools have two justifications, both weak:

  • Withholding rankings will cause colleges to evaluate the "total child." Ignoring the Orwellian aspects of withholding information in the name of the total child, the article states that most colleges end up compensating for the lack of information by either attempting to estimate the student's rank or by adding emphasis to standardized test scores like the SAT. In other words, the colleges end up doing more of what the high schools are trying to prevent.
  • Withholding rankings will protect children from feeling bad. As a high school co-principal put it, "Only one person is happy when you hand out rank—the person who is No. 1." I won't argue the pros or cons of this point, because it's a distraction: Withholding rankings from students is not the same as withholding rankings from colleges. The article notes that some high schools keep confidential rankings that are available to institutions that "absolutely" require them. If kids' self-esteem were the crux of the matter, high schools and colleges could no doubt expand the practice of conveying confidential rankings.

I don't mean to imply that students should be evaluated only by numbers. Rather, in a world where college admissions are part qualitative and part quantitative, fixing the rankings issue should be about making the quantitative part better. Instead, the high schools that withhold rankings are simply undermining the quantitative part, in many cases hoping to make the numbers less important.

It’s not working, and the students are the losers. Although a "distinct minority" of colleges are OK without rankings, the article suggests that most are not, citing the following data point near the end: "[A]n internal review showed that the admission rate at Vanderbilt was highest for students with a class rank, and lowest for those whose schools provided neither a rank nor general data about grades."

BusinessWeek Does the Math

In "Math Will Rock Your World," BusinessWeek covers analytics and data mining efforts at various companies. The positioning of everything as "math" is strange, but maybe that's the hook the writer or editor needed. Whatever.

As a long-form commercial for my line of work, it's great. And those not into "math" might even find it interesting because it focuses on the applications, not the algorithms.

Data Visualization Lessons from Gapminder

Having been around business analytics for more than a decade, I have seen many attempts at innovative data visualizations: techniques for graphically representing data that go beyond the bar, pie, and other charting classics. By now, the typical business analyst was supposed to be flying through 3D datascapes. But alas, the virtual jetpacks have not yet taken off.

Where I have seen progress is in a simpler form of data visualization that extends the charting classics with animation. By making chart elements active, a story can be told as the elements change—for example, by showing how a stacked chart's layers build up.

A recent posting on TEDblog by June Cohen pointed me to a good example: Gapminder provides visualizations of United Nations data about various countries' income and health levels. The graphic below is from the first presentation, which you can view at the Gapminder home page. Click its title ("1 Income") in the green box in the middle of the page; when it appears, click the big arrow button at the bottom right to go forward through the screens.

Gapminder

Critics might dismiss these types of animations as eye candy, somehow below serious analytics. However, by that standard, charting itself could be called eye candy, since the underlying numbers are all you need—an argument that would find few takers.

What most critics actually fear is not animation itself but pointless animation such as PowerPoint transitions gone amok. Yet when it is done well, animation can do for charting what charting does for numbers: provide a more approachable and impactful view. That sounds a lot like the promise of better data visualization. So even if Gapminder-style animations seem like baby steps compared to 3D datascapes, the data-viz field may need to accept and build on baby steps to get to the long-promised leaps and bounds.

Anyway, decide for yourself. You already know that income and health are distributed unevenly throughout the world. See if Gapminder's presentation brings the point home in a stronger way than you've seen before.

By and Largely Smaller

"By and large, we brown-bag it: 44% of Americans bring lunch from home to eat at work." (Parade magazine, November 13, 2005, page 4)

I was disappointed that this statement did not have an accompanying graphic, so I made one.

Parade_brown_bag2

The Flaw of Averages

Nothing like a life-or-death issue to illustrate an analytical problem:

[A new paper concludes] there are fundamental flaws in the way researchers usually analyze and report the results of medical studies, especially randomized clinical trials that are seen as the "gold standard" method for studying the effectiveness and safety of new treatments....

"Most studies currently emphasize the average risk and average benefit found in the study, but the average trial participant might get much less benefit than average, or even be harmed," says lead author Rodney Hayward, M.D. "If nine people are in a room with Bill Gates, the average net worth of people in the room will be several billion dollars even if everyone else in the room is in serious debt."

The authors argue for a more sophisticated form of analysis, risk stratification, which they found in only 4% of papers reviewed from prominent medical journals. To make their point, they cite a major 1993 study that showed the clot-busting drug tPA to be, on average, significantly effective for heart-attack patients.

But when Hayward's colleague David M. Kent, M.D., M.Sc., now at Tufts University, analyzed the data from this study in a risk-stratified way, he found major differences in effectiveness of tPA. In fact, his analysis shows that 25 percent of the patients in the original study accounted for more than 60 percent of all the benefit in the entire study. Meanwhile, half the patients received little or no benefit — and some had such a high risk of brain bleeding from tPA that there was net harm.

The full write-up about the paper is here. Those who know marketing analytics will recognize that risk stratification is similar to segmentation. Just as smart marketers no longer pursue a singular, average customer, the paper's authors are urging the medical establishment to be wary of studies about the average patient.
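
To see the Bill Gates effect in numbers, here is a small Python sketch. The net worths are hypothetical round figures I chose for illustration, and the comparison with the median is my addition, not something taken from the paper.

```python
from statistics import mean, median

# Hypothetical net worths in dollars: nine people in debt plus one billionaire.
net_worths = [-20_000] * 9 + [50_000_000_000]

avg = mean(net_worths)     # pulled up to billions by a single outlier
mid = median(net_worths)   # reflects the typical person in the room

print(f"Average net worth: ${avg:,.0f}")
print(f"Median net worth:  ${mid:,.0f}")
```

The average describes nobody in the room. Splitting the group into "the billionaire" and "everyone else" before averaging is the same move as risk stratification in a trial or segmentation in marketing.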

Bio

VP Analytic Products, CNET Channel (current); CEO and co-founder, ExactChoice; CTO and co-founder, Personify; researcher and co-founder, iVALS and Media Futures Program (both at SRI International); based in West Hartford, Connecticut, and San Francisco, California.

This is my personal blog. It speaks for me, not my employer.
