“Got kitesurfing on the mind, mixed with some search & classification tech, and a dab of political ranting”

Simple vs. Complex usually wins…are you listening W3C?!

Posted by direwolff on October 26, 2006

Nothing like bringing a large group of smart engineering types together in a standards group to really deliver on new orders of complexity. So it feels when I try to read the GRDDL Primer or the GRDDL Specification by the W3C. I’m not going to claim to be the most technical guy in the world, but I can hold my own reading and understanding technical manuals, protocols and the like. Not that I can implement any of this, just that I’m not uncomfortable reading and understanding APIs or other standards protocols. First off, the name GRDDL should be cause enough for concern. It stands for Gleaning Resource Descriptions from Dialects and Languages.

Let me digress for a moment, as this all reminds me of a start-up I was involved with named Kinecta (previously named ShiftKey, later acquired by Stellent [Nasdaq: STEL]), where in 1999 we were delivering on the vision of widespread syndication by content providers using the Information and Content Exchange (ICE) protocol. This was back when RSS looked like it might be stillborn due to its lack of robustness and security. HA! Look who’s laughing now. In those days, syndicators like Reuters, the Financial Times, Red Herring and other professional content creators would not be caught dead syndicating their content via something as inherently insecure as RSS (Really Simple Syndication). Go figure. ICE had everything one could want in a protocol: a client-server architecture, content delivery scheduling, the ability to secure the syndicated content, delivery confirmation, and a host of other must-haves according to the large content providers of the day. Looking back on this, it really was quite humorous. At the time, I was able to construct a very large story about why syndication would take off and grow in a manner similar to the Web. Heck, I used the history of the Web as my example for why ICE would lead the way to the next iteration of content distribution platforms. But alas, I was wrong.

What I forgot in all my excitement about ICE and this brave new world of syndication was something self-evident in the very example I was using: HTML succeeded in becoming completely ubiquitous where SGML had not. In other words, the same type of complexities that plagued SGML in its fight for supremacy as the standard mark-up language would also plague ICE in its fight against RSS. What this demonstrated to me is that simplicity always triumphs over complexity. I’d further submit that simplicity has a way of finding its way to complexity at some point, which is why starting with complexity quickly becomes unmanageable and inevitably fails.

My first point in all of this is that the W3C’s GRDDL solution already smells of too much complexity to scale smoothly. The irony is not lost on me that the W3C supported ICE too, and Vignette was its corporate champion.

Seth Goldstein’s blog post today about APIs touches on some history that I want to address further. Specifically he talks about the following:

In a memorandum dated July 15, 1949, Warren Weaver, who held the position of director of the division of natural sciences at the Rockefeller Foundation from 1932 – 1955, wrote about the possibility of language translation by an electronic computer. It was the first suggestion most had seen that such a thing might be possible, and as he draws the memorandum to a close, his words preview the emergence of the API:

Think, by analogy, of individuals living in a series of tall closed towers, all erected over a common foundation. When they try to communicate with one another, they shout back and forth, each from his own closed tower. It is difficult to make the sound penetrate even the nearest towers, and communication proceeds very poorly indeed. But, when an individual goes down his tower, he finds himself in a great open basement, common to all the towers. Here he establishes easy and useful communication with the persons who have also descended from their towers.

Thus may it be true that the way to translate from Chinese to Arabic, or from Russian to Portuguese, is not to attempt the direct route, shouting from tower to tower. Perhaps the way is to descend, from each language, down to the common base of human communication – the real but as yet undiscovered universal language – and then re-emerge by whatever particular route is convenient. Such a program involves a presumably tremendous amount of work in the logical structure of languages before one would be ready for any mechanization.

What struck me about Weaver’s quote was that it really touches on a number of issues related to understanding meaning, that none of the classical theories for categorization technologies (including the efforts around the W3C) or linguistics seem to have ever considered or taken seriously. In those few instances where I’ve heard ideas similar to Weaver’s raised, the linguists have been quick to dismiss this as a possible solution for tackling the problem around identifying meaning in unstructured content.

Now, I’m even less of a linguist than I am a technologist, but I do speak three modern languages fluently and a dialect. OK, for qualification’s sake, I also took two years of a fourth modern language, though my vocabulary there is severely stunted. There are a few things I can say. First, translation is not about words, it’s about ideas and concepts. Sometimes word-for-word translation is easier to do when explaining to someone what a person speaking another language is saying, but the good translators take their time to understand what is being said, because what they are trying to convey is the meaning of what was said, not simply a word mapping. Second, what people like to talk about in one language is similar to what people who speak another language like to talk about too. In other words, whether they be from different cultures, different countries, or different backgrounds, people still speak about similar concepts.

So what does this have to do with Weaver’s quote? Well, it seems odd to me that way back in 1949, someone of his calibre could suggest the idea of finding the normalized language to tie all languages together, and still the academic, research and professional establishment decided to ignore this and pursue other paths (i.e., Bayesian frameworks, keyword search, synonym mapping, etc.) to solving language understanding problems that further investigation into his method might have solved. Let me go further and tie this in with the beginning of this post about GRDDL. Today, one of the problems with all of this RDF stuff is that taxonomies address the needs of each problem or industry they were tailored for. In some cases, similar attributes are used but they mean different things relative to the different problems or industries (taxonomies), and in some cases the attributes are different but refer to similar things across the different taxonomies. How does all of this get reconciled? Today, it doesn’t.
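To make that reconciliation problem concrete, here is a minimal Python sketch. Everything in it is hypothetical: the two taxonomies, their attribute names, and the shared concept labels are invented for illustration and do not come from RDF, GRDDL, or any real vocabulary. It only shows the two failure cases described above, and how a common layer of meanings would let you tell them apart.

```python
# Two made-up industry taxonomies. Each maps an attribute name to a label
# in a hypothetical shared layer of meanings (assumed for illustration).
STEEL = {
    "plate": "flat_rolled_steel_product",
    "gauge": "material_thickness",
}
KITCHENWARE = {
    "plate": "dish_for_serving_food",
    "thickness": "material_thickness",
}


def same_meaning(attr_a, taxonomy_a, attr_b, taxonomy_b):
    """Compare two attributes by the concept they point to, not by name."""
    return taxonomy_a[attr_a] == taxonomy_b[attr_b]


def shared_concepts(taxonomy_a, taxonomy_b):
    """Return the concepts that both taxonomies refer to, under any name."""
    return set(taxonomy_a.values()) & set(taxonomy_b.values())


# Same attribute name, different meaning.
print(same_meaning("plate", STEEL, "plate", KITCHENWARE))
# Different attribute names, same meaning.
print(same_meaning("gauge", STEEL, "thickness", KITCHENWARE))
print(shared_concepts(STEEL, KITCHENWARE))
```

Comparing by name alone gets both cases wrong; comparing by the underlying concept gets both right. The hard part, of course, is where that shared layer comes from, which is exactly what Weaver was pointing at.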

Well, when I consider how I learn about new things, it starts with getting an explanation of the new domain in a chosen language based on where I am when I’m learning this. For example, when I worked for U.S. Steel (now part of USX) back in the late ’90s, most of what I learned about steel manufacturing happened in Mexico. There was some new vocabulary to learn, not because the words were themselves new, but because they were being applied to a domain I was not previously familiar with. I later learned a lot of this in English too. In steel mills there are furnaces and ladles and rolling mills and plates. All of these words are English words. In Mexico, they had the Spanish terms for these. What’s important to understand is that the ontology used for describing everything about the domain lay within the language used. In other words, it was tied to the meanings of these words. What’s also important is that ultimately, in either language, people talking in the steel business were talking about the same things. What I guess I’m basically saying here is that there is a meta-meaning language that both English and Spanish could be linked to, and that, once understood, could become the ontology by which all things are interrelated.
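Here is a rough Python sketch of that idea, in the spirit of Weaver’s "descend to the common basement, then re-emerge" image. The concept identifiers are invented, the Spanish terms are my own approximations of what one might hear in a mill, and none of this is meant to describe how any particular product works. It simply routes a term through a shared meaning instead of mapping word to word.

```python
# Hypothetical "basement": each language maps its own terms to shared
# concept identifiers (the identifiers are assumptions for illustration).
TERM_TO_CONCEPT = {
    "en": {"furnace": "FURNACE", "ladle": "LADLE", "plate": "PLATE"},
    "es": {"horno": "FURNACE", "cuchara": "LADLE", "placa": "PLATE"},
}


def to_concept(term, language):
    """Descend the tower: from a term in one language to its shared meaning."""
    return TERM_TO_CONCEPT[language][term]


def from_concept(concept, language):
    """Re-emerge: find the term a given language uses for a shared meaning."""
    for term, mapped in TERM_TO_CONCEPT[language].items():
        if mapped == concept:
            return term
    raise KeyError(f"no {language} term for concept {concept}")


def translate(term, source, target):
    """Translate by way of the shared concept, never directly word to word."""
    return from_concept(to_concept(term, source), target)


print(translate("furnace", "en", "es"))
print(translate("placa", "es", "en"))
```

Adding a third language here means adding one more mapping to the shared concepts, not a new dictionary for every pair of languages.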

Consider the idea that for all domains, language, the same language that we use to speak and greet each other with, is used to define the terms of those domains. It’s like the meanings being referred to in any language are the atoms from which any domain’s vocabulary is constructed. Since we already understand the semantic relationship of meanings inherently from our education, there’s no need to define semantics at the domain level as it’s already being taken care of at the atomic or language level.

OK, in a sense I’m teasing with all of this because I have already seen such a technology and am working with it. It’s called Readware, and we hope to soon be demonstrating it in various applications so that others may begin to understand the advantage of using and leveraging simple ideas to solve complex meaning problems. This will not be easy for people to buy into, much as, back in the day, there were plenty of people who didn’t buy into the fact that HTML and RSS would lead the way to new paradigms. But I do believe that simplicity will triumph over complexity, and the W3C needs to (as the Apple commercial espouses) “think different”.
