Fish leg counts: What the web knows and doesn’t know

Michael Giberson

David Pennock hears another tick of the clock in the countdown to web sentience.

[In 2003] we trained a computer to answer questions from the then-hit game show by querying Google. We combined words from the questions with words from each answer in mildly clever ways, picking the question-answer pair with the most search results. For the most part (see below), it worked.

It was a classic example of “big data, shallow reasoning” and a sign of the times. Call it Google’s Law. With enough data nothing fancy can be done, but more importantly nothing fancy need be done: even simple algorithms can look brilliant. When it comes to, say, identifying synonyms, simple pattern matching across an enormous corpus of sentences beats the most sophisticated language models developed meticulously over decades of research.

Our Millionaire player was great at answering obscure and specific questions … It failed mostly on the warm-up questions that people find easy — the truly trivial trivia. The reason is simple. Factual answers like the year that Mozart was born appear all over the web. Statements capturing common sense for the most part do not. Big data can only go so far.
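The approach Pennock describes (combine question words with each candidate answer, then pick the pair with the most search results) can be sketched roughly as follows. This is only an illustration under stated assumptions: the quote does not spell out the "mildly clever" word combinations, so the scoring here is a plain co-occurrence count, and `TOY_CORPUS`, `toy_hit_count`, and `pick_answer` are hypothetical names standing in for the web, a search-engine hit count, and the player.

```python
import re
from typing import Callable, Sequence

# Hypothetical stand-in for the web: three tiny "pages".
# The system in the quote queried Google instead.
TOY_CORPUS = [
    "Wolfgang Amadeus Mozart was born in 1756 in Salzburg.",
    "Mozart, born 1756, wrote his first symphony as a child.",
    "In 1770 Beethoven was born in Bonn.",
]


def tokenize(text: str) -> set:
    """Lowercase a string and split it into a set of word tokens."""
    return set(re.findall(r"\w+", text.lower()))


def toy_hit_count(query_words: Sequence[str]) -> int:
    """Count toy pages that contain every query word (an assumed scoring rule)."""
    wanted = {w.lower() for w in query_words}
    return sum(wanted <= tokenize(page) for page in TOY_CORPUS)


def pick_answer(
    question_keywords: Sequence[str],
    candidates: Sequence[str],
    hit_count: Callable[[Sequence[str]], int] = toy_hit_count,
) -> str:
    """Return the candidate whose words co-occur with the question's on the most pages."""
    scores = {
        cand: hit_count(list(question_keywords) + cand.split())
        for cand in candidates
    }
    # Ties resolve to the earliest candidate, so an all-zero score is not a real answer.
    return max(scores, key=scores.get)
```

With this toy corpus, `pick_answer(["Mozart", "born"], ["1756", "1770", "1791", "1827"])` returns `"1756"`, because two pages mention Mozart, "born", and 1756 together. A question whose answer no page states scores zero for every candidate, which is the failure mode the quote attributes to common-sense questions.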

In 2003 their best example of a question that they could not answer via web search was “How many legs does a fish have?”

Now, on the other hand, Pennock said:

I was recently explaining all this to a colleague. To make my point, we Googled that question. Lo and behold, there it was: asked and answered — verbatim — on Yahoo! Answers. How many legs does a fish have? Zero. Apparently Yahoo! Answers also knows the number of legs of a crayfish, rabbit, dog, starfish, mosquito, caterpillar, crab, mealworm, and “about 133,000” more.

Pennock links to Lance Fortnow’s related comments on IBM’s effort to build a Jeopardy-playing computer. Fortnow points to something that is likely to remain hard for computers for a while: making sense of natural language in context. Fortnow, part of the group that wrote the Millionaire paper, said:

Humans have little trouble interpreting the meaning of the “answers” in Jeopardy; they are being tested on their knowledge of that material. The computer has access to all that knowledge but doesn’t know how to match it up to simple English sentences.