July 2, 2008

Data vs. Science

It has been a while since I have posted on the DonorCast NewsWatch. Alex is so in tune with the data mining world that I have had little to add. However, as a long-time Wired subscriber, I could not let the latest issue, "The End of Science," go unmentioned.

Chris Anderson sets out a premise about the modern use of data, which several other contributors then build on. One of the most provocative points of the feature is that the scientific method can actually get in the way of data exploration. I believe Chris is correct.

However, as with most debates (endogeneity vs. exogeneity, in-house vs. outsourcing, prospect-based tracking vs. project-based tracking), it is not as simple as one or the other. James Cheng, the brilliant data miner at MIT, presented a compelling case for the scientific method at the APRA data mining symposium this past April. Before changing strategies based on analysis, I use control group tests whenever possible. Kate Chamberlin and Michelle Paladino at Memorial Sloan-Kettering use controlled-study principles very effectively in testing the validity and effectiveness of both models and development strategies.
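
For the curious, here is a minimal sketch of the kind of control group comparison I mean, in Python. All of the counts are hypothetical; the point is simply to compare response rates between a group that received the model-driven strategy and a control group that did not, and to ask whether the difference is larger than chance alone would explain.

```python
import math

def two_proportion_z(success_a, n_a, success_b, n_b):
    """Two-sided z-test for a difference in response rates."""
    p_a, p_b = success_a / n_a, success_b / n_b
    p_pool = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / se
    # Two-sided p-value from the standard normal CDF.
    p_value = 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))
    return z, p_value

# Hypothetical counts: the treatment group received the new,
# model-driven strategy; the control group received the old one.
z, p = two_proportion_z(success_a=112, n_a=1000, success_b=85, n_b=1000)
print(f"z = {z:.2f}, p = {p:.4f}")
```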

There are times when the method can get in the way, too. Chris Anderson points out that Google does not try to understand "why" before implementing the results of an analysis; it simply moves ahead with it. In the writer's own words:

Google's founding philosophy is that we don't know why this page is better than that one: If the statistics of incoming links say it is, that's good enough. No semantic or causal analysis is required. That's why Google can translate languages without actually "knowing" them (given equal corpus data, Google can translate Klingon into Farsi as easily as it can translate French into German). And why it can match ads to content without any knowledge or assumptions about the ads or the content.

When I first began to build predictive models, I always started with a hypothesis. This influenced my data selection as well as my model selection. The more models I build, the more I allow the data to guide the process. In fact, the most challenging part of CRISP-DM is the "data understanding" step. If I let my data decisions be guided entirely by what I understand the business question to be, I might miss a hidden pattern.
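
To make the contrast concrete, here is a hedged sketch of what "letting the data guide the process" might look like in Python with pandas and scikit-learn. The file name, the outcome flag, and the assumption that the candidate fields are numeric are all mine, not a prescription; the idea is simply to screen every available field rather than only the handful a hypothesis would have pre-selected.

```python
import pandas as pd
from sklearn.feature_selection import SelectKBest, f_classif

# Hypothetical constituent file; every name here is an assumption.
donors = pd.read_csv("donors.csv")
target = donors["made_major_gift"]                 # hypothetical 0/1 outcome
candidates = donors.drop(columns=["made_major_gift", "donor_id"])

# Screen every available field (assumed numeric here) instead of only
# those a hypothesis would have nominated in advance.
selector = SelectKBest(score_func=f_classif, k=10).fit(candidates, target)
print(list(candidates.columns[selector.get_support()]))
```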

What is the risk of letting the data guide the process? Let me use a major giving model as an example. As I have discussed before, if your goal is to predict giving likelihood in order to risk-manage a gift pyramid, you may want a large degree of endogeneity (predictors, such as past giving behavior, that are themselves expressions of the behavior you are predicting). Like a credit score, you really would want to know the probability of the behavior. If your goal is to find new people who might be good major giving prospects, you might choose to minimize endogeneity. The result would be less predictive, but it would minimize the identification of names already known to you.
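
Continuing the hypothetical sketch above, the tradeoff might look like this in code. The field lists are invented for illustration: the likelihood model keeps the endogenous giving-history fields, while the identification model drops them.

```python
from sklearn.linear_model import LogisticRegression

# Reuses the hypothetical `donors` frame and `target` from the sketch
# above. Endogenous fields describe the behavior itself; exogenous
# fields do not.
endogenous = ["lifetime_giving", "years_of_giving", "largest_gift"]
exogenous = ["age", "zip_median_income", "event_attendance"]

# Likelihood model: endogeneity welcome, like a credit score.
likelihood_model = LogisticRegression(max_iter=1000).fit(
    donors[endogenous + exogenous], target)

# Identification model: endogeneity minimized, so the scores surface
# names that merely fit the profile rather than names already giving.
identification_model = LogisticRegression(max_iter=1000).fit(
    donors[exogenous], target)
```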

In this scenario, if you were to allow too much endogeneity in the identification model, the risk is small. You would likely exclude researched and assigned names before starting your qualification process anyway, so what remains are names not already known to you. But you might have missed some names that would have fit the profile if you had more data about them.
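
And, continuing the same sketch, that exclusion step is just a filter. The is_researched and is_assigned flags are hypothetical stand-ins for however your shop tracks research and portfolio assignment.

```python
# Score everyone with the identification model, then drop researched
# and assigned names before qualification begins.
donors["score"] = identification_model.predict_proba(donors[exogenous])[:, 1]
known = donors["is_researched"] | donors["is_assigned"]   # hypothetical flags
new_prospects = donors[~known].sort_values("score", ascending=False)
print(new_prospects[["donor_id", "score"]].head(25))
```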

Sometimes I have Marianne Pelletier's voice in my head: "Well, did you find more prospects?...That's good--isn't it!?" It is very similar to Google's "Did we make more revenue on that ad?...That's good--isn't it!?" Sometimes "why" can get in the way.

Maybe this is my long way of saying, "Buy this magazine and read the feature." It can be confusing at times, since it refers to "models" in the sense of "ways of doing things" rather than statistical models. But I think you will see Chris Anderson's point. It is worth the read.

Read The End of Theory
The link goes to the first essay in the feature, by Chris Anderson. See the links on the left side of the page for the other brief essays.