“Got kitesurfing on the mind, mixed with some search & classification tech, and a dab of political ranting”

Search engines are stupid

Posted by direwolff on August 2, 2006

Have I got your attention with that title? Good. So why do I say this and what do I mean by this statement?

Today search engines don’t really do anything novel other than keyword matching. Google’s innovation wasn’t so much a search innovation as a sorting innovation: “what order should the results be displayed in?” PageRank was all about ranking pages to determine relevance on the basis of the number of references that the pages matching the keywords had received. OK, there’s some novelty in that, but when I think about search, sorting by relevance is a “nice to have” and not necessarily a good proxy for a material improvement in search algorithms. What I mean by this is that if I’m doing a search for “executives in Fortune 500 corporations” on a general search engine (Google, Yahoo!, Ask, Microsoft, et al.), which in theory has spidered the whole Web and has this information, I will only find results where the words “executives” and “Fortune 500” appear. Interesting: such a simple request totally stumps all modern-day search engines. Boy, technology really is stupid when you consider that it doesn’t take long to teach a child what either of those two terms refers to before they can recognize appropriate matches to that simple five-word request.

Part of the problem with modern-day search engines is their lack of extensibility and their inability to be taught or have new concepts added to them (I’m being deliberately simplistic in my analysis here because it’s suitable). For search engines, it’s just about pattern matching, with some being able to do basic stemming (i.e. the ability to match “play” to “plays”, “player”, “playing”, etc.). More important is the fact that while one could spend the time to carefully build a sophisticated query to get at this information (e.g. (‘executive’ OR ‘president’ OR ‘CEO’ OR ‘vice president’ OR ‘VP’ OR…) AND (“Fortune 500” OR ‘IBM’ OR ‘Exxon-Mobil’ OR ‘Walmart’ OR…)), this would effectively be a slow search run after the content has been indexed. Not only that, it would have no sense of word proximity and so on.
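
To make that hand-built query concrete, here is a minimal Python sketch of how it might be evaluated at query time. The concept groups, the sample documents and the function names are all made up for illustration; this is not how any particular search engine (or Readware) does it.

```python
import re

# Hypothetical concept groups; the member terms are illustrative only.
CONCEPTS = {
    "executive": ["executive", "president", "ceo", "vice president", "vp"],
    "fortune 500": ["fortune 500", "ibm", "exxon-mobil", "walmart"],
}

def contains_term(text, term):
    """True if `term` appears in `text` as a whole word or phrase."""
    pattern = r"(?<!\w)" + re.escape(term) + r"(?!\w)"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None

def matches(text, concept_names):
    """AND across concepts, OR across the terms inside each concept."""
    return all(
        any(contains_term(text, term) for term in CONCEPTS[name])
        for name in concept_names
    )

def search(docs, concept_names):
    # Every document is rescanned for every query, which is what makes
    # this kind of after-the-fact search slow on a large collection.
    return [doc_id for doc_id, text in docs.items() if matches(text, concept_names)]

docs = {
    "a": "The CEO of IBM spoke to investors on Tuesday.",
    "b": "Our club president organised a kitesurfing trip.",
    "c": "Walmart named a new vice president of logistics.",
}
print(search(docs, ["executive", "fortune 500"]))
```

Note that the sketch shares the weaknesses described above: the term lists have to be maintained by hand, every query rescans the text, and nothing checks how close the matching words are to each other.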

Well, there’s a company called Readware which has built a flexible model that is not only extensible but also provides building blocks that can do very sophisticated indexing without adding too much processing overhead to that operation. Out of the box, the system comes equipped with what Readware calls its ConceptBase, which is the equivalent of synonym groupings (though actually more sophisticated than that) with the groupings’ semantic relations already computed. This supports disambiguation (e.g. the ‘bond’ between people vs. the bond as debenture), which allows it to return much more precise results with significantly less effort. While today the technology cannot really be compared to the Googles and Yahoo!s of the world, since Readware isn’t currently spidering the Web, it does play a significant role for enterprises that have either purchased or are considering solutions like the Google Appliance or search and categorization products from Autonomy.

Readware enables even non-programmers to build taxonomies for classifying intranet or external content (HTML, PDF, .doc, etc.). The taxonomy components, including categories and topics, are not only easily created but can be reused as building blocks for ever more sophisticated components. One can imagine creating topics for baseball, football, soccer and basketball that encompass the various things associated with each. Once these topics are created, they could all be used as part of a more encompassing topic called “sports”. The beautiful thing about all of this is that these topics are identified when the content is being indexed, in effect creating metadata immediately during the content processing phase. As you can imagine, this greatly accelerates the processing of end-user searches. Hence, whether it’s a classification task or just simple search, Readware brings some real innovation that today even efforts like those of the Semantic Web only aspire to reach. I almost like to think of Readware as providing a means of identifying, and providing handles for manipulating, unstructured content.
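
A rough Python sketch of the building-block idea, under heavy assumptions: the topic definitions, the naive substring matching and every name below are invented for illustration and say nothing about how Readware actually represents or matches topics. It only shows topics being composed from other topics and documents being tagged once, at indexing time.

```python
# Hypothetical topic definitions: a topic is a set of terms plus other topics.
TOPICS = {
    "baseball": {"terms": {"pitcher", "inning", "home run"}, "includes": set()},
    "soccer": {"terms": {"goalkeeper", "penalty kick"}, "includes": set()},
    "sports": {"terms": {"athlete"}, "includes": {"baseball", "soccer"}},
}

def topic_terms(name):
    """All terms for a topic, including those inherited from sub-topics."""
    topic = TOPICS[name]
    terms = set(topic["terms"])
    for child in topic["includes"]:
        terms |= topic_terms(child)
    return terms

def tag_document(text):
    """Topics whose terms occur in the text (naive substring match)."""
    lowered = text.lower()
    return {name for name in TOPICS if any(t in lowered for t in topic_terms(name))}

def build_index(docs):
    """Tag each document once, at indexing time, and invert the result."""
    index = {name: set() for name in TOPICS}
    for doc_id, text in docs.items():
        for name in tag_document(text):
            index[name].add(doc_id)
    return index

docs = {
    1: "The pitcher threw a perfect inning.",
    2: "A late penalty kick decided the match.",
    3: "Bond yields rose again this week.",
}
index = build_index(docs)
print(sorted(index["sports"]))    # a lookup only, no rescanning of text
print(sorted(index["baseball"]))
```

Because the topic metadata is created while the content is processed, an end-user search for “sports” becomes a dictionary lookup instead of a scan over every document.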

Oh yeah, and Readware can do stupid keyword matching search too :-)



2 Responses to “Search engines are stupid”

  1. A few notes.

    Cool stuff…

    It seems that for this to work, once you define a new ‘semantic’ concept you’d have to re-index your document corpus. This probably means you’ll have to maintain a backup of your crawl in an indexable format on disk so that you can process the blogosphere/intraweb locally. Then you can move all hits with the new semantic into another index for searching.

    This might be a bit hard for the larger web… Google is indexing 8B documents. The Internet Archive would probably be able to help out Readware here. They have 15B last time I checked and they’re totally willing to partner with people (they’ve done it in the past).

    More and more OSS tools are starting to exist to build large/cheap distributed filesystems which can handle this amount of data.

    This is one area where the RDF and semantic web people can’t follow. It’s about information retrieval not data markup. I guess when you have a hammer everything is a nail.

    Tailrank, for example, doesn’t trust RSS- and HTML-based language categorization. People lie and sometimes don’t configure their blog correctly, so a blog written in French comes across as English. We have a (very) robust language classifier which can assign a document to any one of the major languages (I think it supports 25 right now).

    You can’t lie about your text… if you’re in Cantonese but the blog metadata says you’re in English, we ignore this and assume Cantonese.
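
    A toy Python sketch of the “trust the text, not the metadata” idea. It guesses the language from stopword counts, which is an assumption made purely for illustration: nothing here describes Tailrank’s actual classifier, the word lists are tiny and made up, and the approach only suits space-delimited languages (telling Cantonese from English would need something like character-based features instead).

```python
import re

# Tiny, illustrative stopword lists; a real classifier would cover far more.
STOPWORDS = {
    "en": {"the", "and", "is", "of", "to", "in", "that", "it"},
    "fr": {"le", "la", "les", "et", "est", "de", "un", "une", "dans", "que"},
    "es": {"el", "los", "y", "es", "de", "un", "una", "en", "que", "por"},
}

def guess_language(text):
    """Return the language whose stopwords occur most often, or None."""
    words = re.findall(r"[^\W\d_]+", text.lower())
    scores = {lang: sum(w in stops for w in words) for lang, stops in STOPWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

def effective_language(declared, text):
    """Prefer what the text says over what the feed metadata claims."""
    return guess_language(text) or declared

# The feed claims English, but the text is French.
post = "Le chat est dans la maison et il regarde les oiseaux."
print(effective_language("en", post))
```

    The declared language is only used as a fallback when the text gives no signal at all.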

    Readware is basically doing the same thing… they can figure out semantics from the raw text, which, while a bit more difficult technically, provides a lot more flexibility.



  2. Ken Ewell said

    In the extreme case that one would define a semantic concept of such new-found importance that it becomes necessary to reform the analysis itself, the text would have to be retrieved to re-run the analysis and identify the new concept.

    If there were a local mirror that would make it easier. The Readware API supports keeping local content.

    The first question that comes to my mind though is what is the purpose of reading everything there is– who reads everything?

    There are many other pragmatic factors, including whether it makes sense to go into the archives to make such changes. If you are dealing with 8 billion documents it may not make much sense. On that same theme, it may not make sense to apply a deep semantic search over 8 billion documents.

    Search may be stupid, but I think Pierre meant in terms of the way you must deal with search engines for the job that they do. You have to simplify what you want into one word, preferably, and two or three at most.

    Search engines can handle more complex inquiries than that, but a name composed of one or two nouns works best for most search tasks. Not many people know how to interface with search engines in more advanced ways.

    What makes the Internet search engines really great is that they make it possible to tap into those billions of documents scattered everywhere with a very narrow and very specific window.

    If you are looking for things like restaurants, or a store, any address, people, self-help, or to buy a service or product– then the search engines pretty much do everything you need. What’s more, they give you plenty of choices, no one can deny that.

    Now if you want *to learn* something substantial, you find that search engines have limits because they are stupid as Pierre has pointed out. They can let you down if you expect anything from them after they deliver those links.

    Like “Artificial Intelligence”, the expression ‘Information retrieval’ is overloaded, but I think your point that, in general, it is not and should not be about data markup is valid.

    Data markup is going wild and I predict right here that there will be an enormous data explosion. Unlike the explosion of web sites and email with ‘data’ informative to people, the new data will be mostly meaningless to people.

    Research is different from search. Research implies study. Study implies the use of semiotic patterns comprising syntax, semantics and pragmatics, the identification of relations and the extraction of the entities that are the objects of the study.

    If you are doing research and automating the processes of collecting research material, organizing it for better perception and understanding, and for collaboration with others, you are not working with 8 billion documents.

    If you want to study and learn about the entities (people and organizations) involved in social politics, there is no need to crawl 8 billion sites or collect 8 billion documents.

    Readware makes use of the Internet search engines by performing a meta-search over several search engines at once. The results from all of them are merged and smartly re-ranked into a much reduced list without any duplicates.

    If we meta-search ‘social politics’ on seven search engines retrieving 100 hits from each, Readware meta-search would reduce the 700 to about 200 just by removing duplicates and irrelevant hits.

    Now we are talking about 200 documents in which we are sure (Readware made sure) that social and politics (or their forms) appear. So up to here, Readware has not really improved on the search engine, except by making the user’s search more productive, or smarter. But that is how I see tackling 8 billion documents.
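
    The merge, de-duplicate and filter steps can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the URL normalisation, the reciprocal-rank scoring (a common rank-fusion heuristic, swapped in here because the re-ranking method is not described) and the `fetch_text` callback are all hypothetical and are not Readware’s actual meta-search.

```python
from urllib.parse import urlsplit

def normalize(url):
    """Crude key for duplicate detection: ignores scheme, 'www.' and trailing '/'."""
    parts = urlsplit(url)
    host = parts.netloc.lower()
    if host.startswith("www."):
        host = host[4:]
    query = "?" + parts.query if parts.query else ""
    return host + parts.path.rstrip("/") + query

def merge(result_lists, query_terms, fetch_text):
    """Merge ranked URL lists, drop duplicates and hits missing a query term."""
    scores, first_seen = {}, {}
    for results in result_lists:
        for rank, url in enumerate(results, start=1):
            key = normalize(url)
            first_seen.setdefault(key, url)
            # Reciprocal-rank sum: hits ranked highly by several engines float up.
            scores[key] = scores.get(key, 0.0) + 1.0 / rank
    kept = []
    for key, url in first_seen.items():
        text = fetch_text(url).lower()
        if all(term in text for term in query_terms):
            kept.append((scores[key], url))
    kept.sort(key=lambda pair: pair[0], reverse=True)
    return [url for _, url in kept]

# Made-up pages and engine results, standing in for real fetches.
pages = {
    "http://example.org/social-politics": "Social movements and politics in cities",
    "http://www.example.org/social-politics/": "Social movements and politics in cities",
    "http://example.net/recipes": "Ten quick dinner recipes",
    "http://example.com/policy": "Politics and social policy explained",
}
engine_a = ["http://example.org/social-politics", "http://example.net/recipes"]
engine_b = ["http://example.com/policy", "http://www.example.org/social-politics/"]
merged = merge([engine_a, engine_b], ["social", "politics"], pages.get)
print(merged)
```

    Four hits from two engines shrink to two: one duplicate is folded into its twin and one hit is dropped because the query words never appear in it.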

    Now once you have formed a research collection, a semantic search capability becomes necessary.

    To learn who and where the people and organizations are that are associated with social politics, you would normally have to click each link, look for and recognize the names of entities.

    You have to mentally sort more abstract concepts from the names of things, and locate them. Even though it’s not 8 billion, it is not easy to read 200 documents, to make lists of the names and organizations, or use the information in other ways. It would be tedious work and take a long time, to read 200 documents.

    To help us read the documents and recognize the names we now have Readware. Readware would only take a few minutes to read those two hundred documents, and then you could send a single command that would list all the documents.

    When you click on one of them, the source is retrieved and the names of people, places and organizations will be located and highlighted. Though unlikely, any of the 200 documents not in the Readware result list would not have any entities mentioned.

    This is really a trivial example, but it is not stupid to be able to recognize the difference between a proper name and other abstract word types.

    -Ken Ewell
