OK, a long, winding post about some half-baked thoughts I’ve been having. There’s a ring of impossibility to it. I guess that’s what makes it fun to think about.
There’s a really interesting discussion taking place on Slashdot on “The Need For A Tagging Standard”. While I’d hardly call myself qualified to engage in a technical standards discussion, my involvement with several companies doing innovative work in the RSS aggregation and search spaces has forced me to do a fair bit of thinking about it. My opinion right now is that standardized tagging, much like the topic-map initiatives in the Semantic Web, would be only an evolutionary change from the normalized databases shared between partners in a supply chain, or from what companies like QRS (which developed products for global data synchronization and is now part of JDA Software) provided to their customers. The same need for data synchronization is visible in the efforts to standardize microformats. It’s all about making it easy for applications to determine what a piece of content is. Is it a restaurant recommendation, a purchase order, a product listing, event information, or a legal notice? What is this document? Answering that lets the application (a) know whether the content is appropriate for its uses, and (b) find the necessary data within the document to accomplish its task.
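To make the “what is this document?” question concrete, here is a minimal sketch of how an application might classify a page by looking for microformat-style class markers. The marker-to-type table is my own illustrative mapping, not a standard; real microformats include root class names like hReview, hCard, and hCalendar.

```python
from html.parser import HTMLParser

# Hypothetical marker-to-type table for illustration only.
MARKERS = {
    "hreview": "restaurant recommendation",
    "hlisting": "product listing",
    "vevent": "event information",
}

class TypeSniffer(HTMLParser):
    """Scans class attributes and records the first known marker seen."""
    def __init__(self):
        super().__init__()
        self.doc_type = None

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name == "class" and value:
                for cls in value.split():
                    if cls in MARKERS and self.doc_type is None:
                        self.doc_type = MARKERS[cls]

def sniff(html):
    """Return the inferred document type, or 'unknown'."""
    parser = TypeSniffer()
    parser.feed(html)
    return parser.doc_type or "unknown"
```

The point of the sketch is how little the application needs to know: given an agreed marker vocabulary, it can decide relevance without understanding the document’s prose at all.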
We see this manifest itself again in market intelligence applications that try to differentiate between commercial RSS feeds from mainstream media providers, blogs from independent individuals, and professional blogs (e.g. Seeking Alpha in finance). The problem of determining what content represents is multi-dimensional. Yikes.
It gets even nuttier when you consider that different companies may want the same information labeled in different ways. One company might want a customer name and a contact name for its accounting and CRM systems, while another calls the same pair a company name and a customer name…and wouldn’t you know it, both companies want to exchange this information with each other for other applications. Double yikes!!!
First, a brief anecdote. During the time of my first venture, a software development consultancy, from 1987 through 2000, our expertise was in database applications. We supported development in Oracle, Sybase, dBaseIII, R:base, Foxbase, Clipper, and Informix. I remember a client asking us what we knew about warehouse and shipping operations, to determine whether we were qualified to develop the custom application they were seeking. We candidly responded that we actually knew nothing about warehouse and shipping operations, but that from a systems perspective everything normalizes to data flows, and our analysis phase would bear those out. As it related to data flows, we knew a lot (hence our company name, DataWorks ;), and we felt comfortable that we could develop any system the client needed. Sure enough, the inventory & shipping system we developed ran L’Oreal’s main distribution warehouse, at the time located in New Jersey. The system could route packages to UPS versus U.S. Mail and interfaced with scanners, much the way luggage routing is done today. Indeed, it was a data-flow problem, nothing more. The applications drove what happened to the data.
My point here is that applications were built on databases. The databases provided logically structured storage of the data, and the applications above them decided what to store, what to look up, what to change, what to remove, and when to perform these activities. I now believe that search engines have emerged as the new application platform (as I have previously mentioned here): the new storage facility, with much less structure because complexity requires looser organization, but also with more granular identification of the stored items. Unlike the database days, where normalization was applied uniformly across the stored data, this identification is now flexible and dynamic, letting documents that are totally unrelated in format or otherwise all live in the same vessel, the search engine. By indexing documents, the decision has been made that words (minus stop words, i.e. “a”, “the”, “he”, etc.) are what needs to be identified in those documents (though we also see that this is somewhat short-sighted, since keyword search sucks).
Other systems, like word processors, decided that formatting elements needed to be identified as well. Today MS-Word goes even further and can identify dates, addresses, letter formats, etc. Once documents and spreadsheets began moving between applications, the location of elements had to be identified too. HTML, much like Word, facilitated the identification of presentation. XML goes further still by enabling users and systems to identify entities within documents.
So perhaps it’s a naive perspective I’m taking in saying that categorization and standardization of tags, or topic maps, or what have you, is all about sparing the application from having to figure out what content it’s looking at, by grouping the content into rigid structures. Now, as I watch the efforts going on in personalization, I’m noticing something interesting: what might be a query result for me may not be one for you. In other words, a movie is not a movie is not a movie, unless all of us do the exact same things and use the exact same services at the exact same times, online. A movie that Amazon recommends to me may not be recommended to you. It’s even possible that you get a movie recommendation and I get none at all, especially since Amazon sells many kinds of products and doesn’t have to restrict itself to movie recommendations. What’s important is that they’re enabling a personal view of their content. The data is presented for my needs. My cookie or login to Amazon determines the behavior and the information that I will see.
Well, if that metaphor is good enough for personalized content, why can’t it be extended to applications, which would make customized requests and be presented with exactly that information? Two companies might request the same data elements from a search engine but name them differently (e.g. “customer name”/“company name”). What’s important is giving any application the tools to request the specific data element it needs, regardless of what that element is called within the application.
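One way to picture this is a per-application alias table that resolves each application’s local field names to the service’s canonical elements. Everything here is hypothetical (the canonical key names, the sample values, the registry shape); it is only meant to show two applications asking for the same thing under different names.

```python
# Hypothetical canonical schema kept by the search service.
CANONICAL = {"org_name": "Acme Corp", "contact_name": "J. Smith"}

# Each application registers its own names for the canonical elements.
# Note both apps use "customer name" -- but for different elements.
ALIASES = {
    "app_a": {"customer name": "org_name", "contact name": "contact_name"},
    "app_b": {"company name": "org_name", "customer name": "contact_name"},
}

def fetch(app, field):
    """Resolve an application's local field name, then fetch the element."""
    canonical_key = ALIASES[app][field]
    return CANONICAL[canonical_key]
```

The aliasing lives entirely at the service edge, so neither application has to change its vocabulary to interoperate.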
This starts to speak to the idea of a big search engine built for business consumption and use only: one where a set of markers provides the identification underpinnings for simple or complex requests to be addressed precisely. Because of the importance of the markers and their potential effect on applications, control over them (their creation or removal) would remain with the search engine provider. But the ability to identify more sophisticated ideas, specific data elements, or document categories would remain with the application developers; these would effectively be queries against the content corpus. We already know that companies like to create their own taxonomies for managing information in ways that fit their business. Why not let them apply those taxonomies to any content they want to interact with, without forcing them to build yet another search engine for this content?
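The division of labor described above can be sketched as: the provider controls a small set of base markers on each document, and a company’s private taxonomy is just a stored query composed from those markers. The marker names, corpus shape, and category definitions below are all invented for illustration.

```python
# Base markers are controlled by the (hypothetical) provider.
CORPUS = [
    {"id": 1, "markers": {"finance", "blog"}, "words": {"earnings", "forecast"}},
    {"id": 2, "markers": {"retail", "press-release"}, "words": {"product", "launch"}},
]

def taxonomy_query(required_markers, required_words):
    """Build a reusable category: a query over provider markers and words.

    The company owns this definition; it never touches the corpus itself.
    """
    def run(corpus):
        return [d["id"] for d in corpus
                if required_markers <= d["markers"]
                and required_words <= d["words"]]
    return run

# A company's private category, defined without building its own engine.
market_intel = taxonomy_query({"finance"}, {"forecast"})
```

Two companies could hold entirely different taxonomies over the same corpus, since each taxonomy is nothing more than a bundle of such queries.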
What I’m really starting to find myself talking about is similar to entity extraction, but goes further: providing application tools for defining and locating entities as simple as names, countries, and companies, and as complex as trends and forecasts, controversies, and government spending. Imagine being able to define access to content discussing the “bank of a river” and treat it as different from a discussion of “plans for the new Wells Fargo bank being built by the river”. Those of you who pay attention to the search engine world may recognize this as one of the claims made by the eagerly anticipated (some time in Q4 of 2007) Powerset. The subtle difference is that Powerset seems focused on this as a natural language query problem for humans seeking information, whereas I’m describing a solution for application developers, who are not likely to write code that turns their requests into grammatically formulated English queries. More importantly, it’s about allowing applications to interact with raw content and apply their own (personalized) perspectives to derive information. This is more than the usual search engine model of spidering items at regular intervals; it has to include an RSS search engine that also keeps up with real-time information releases. All content is at once relevant and irrelevant to this system, since it’s the applications accessing it that make the final judgment.
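To see why the river-bank/Wells-Fargo-bank distinction is more than keyword matching, here is the crudest possible disambiguation sketch: score each sense of a word by how many of its hand-picked context words appear nearby. The sense inventory is a toy; a real system would learn these distinctions rather than hard-code them.

```python
# Toy sense inventory -- hand-picked context words, for illustration only.
SENSES = {
    "bank": {
        "river bank": {"river", "shore", "water", "erosion"},
        "financial bank": {"loan", "branch", "deposit", "wells", "fargo"},
    }
}

def disambiguate(term, sentence):
    """Pick the sense whose context words best overlap the sentence."""
    words = set(sentence.lower().replace(",", " ").split())
    best, best_score = None, 0
    for sense, context in SENSES[term].items():
        score = len(words & context)
        if score > best_score:
            best, best_score = sense, score
    return best
```

An application developer would call something like this through an API with its own sense definitions, rather than phrasing an English question the way a human searcher would.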
There would be fees associated with access to this search engine, which would most likely be through APIs; even desktop or Web apps would use the APIs. The search engine could have a distributed architecture to address scalability, but the logical representation would be that of a single search engine containing as much data as Google or Yahoo! or MSN, except that access would be programmatic rather than through human interaction. Because usage would be metered, there would be some inherent controls against the most onerous abuse. Spam, like all other content, would still need to be identified, though not necessarily removed, since it may be useful to some applications.
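The metering idea is simple enough to sketch as a thin client wrapper that counts queries and accumulates a bill. The class name, per-query rate, and method shape are all made up; the point is only that per-call accounting gives the provider both revenue and a built-in brake on abusive usage.

```python
class MeteredClient:
    """Sketch of a metered, API-only front end to a shared index.

    The rate and interface are illustrative, not any real service's API.
    """
    def __init__(self, rate_per_query=0.001):
        self.rate = rate_per_query   # hypothetical dollars per query
        self.queries = 0

    def query(self, index, word):
        """Run one lookup against a word -> doc-id index, counting the call."""
        self.queries += 1
        return index.get(word, set())

    @property
    def bill(self):
        """Accumulated charges so far."""
        return self.queries * self.rate
```

Under this model a runaway scraper simply runs up its own bill, which is the “inherent control” mentioned above.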
An access-based fee structure a la Amazon’s EC2 and S3 would work, because this is the sort of premium any company would otherwise have to pay to run such an operation for its own proprietary uses. Effectively, it would be cheaper for companies to interact with this system for their content needs than to develop their own large-scale search engine to meet continuously evolving data needs. When I look around today at the number of search applications that effectively have the same data and just sort the results in novel ways, or apply proprietary applications on top, I realize that the real need is simply access to information shaped by the application’s own perspective. Some may need information sorted by most viewed, others by date, and others may need to locate much more precise information. All of these differences, in my opinion, can be addressed at the application level.
Now who’s gonna build this thing? More importantly, did I just describe a search engine utility?