“Got kitesurfing on the mind, mixed with some search & classification tech, and a dab of political ranting”

Traditional Categorization Providers May Not be Ready for SemanticWeb prime time?

Posted by direwolff on December 2, 2006

There’s recently been a lot of “jumping on the [SemanticWeb] band wagon” by many of the old guard categorization technology providers. Many of these companies have traditionally priced their products so high that the average developer or user has never really been exposed to them.

In the cryptography world, it’s common knowledge that companies that keep their algorithms secret should not have their technology trusted. The reason is that knowing how the cryptography is applied should not make it more breakable (or defeatable, if that’s even a term). Exposing it actually helps to make it more secure, because peer review and attacks on the technology can unearth its weaknesses (before it’s put to serious use) or prove that it is indeed very secure. Well, in the categorization space vendors have used a different method of obscuring the validity of their technologies, and that’s called “make it real fucken expensive”. If you can afford a starting price of between $50,000 and $250,000, going up to millions of dollars, not to mention the expense of the labor required to make it operational and useful, then you’re ready to use these. Many of the companies in this space are no longer growing, as they’ve tapped out the market of companies willing or able to make the kind of investment required.

Convera has a product called RetrievalWare, but they have basically been on a sub-$20m annual revenue run rate, with over 50% of their customers being in government. Convera has recently introduced a new direction given its lackluster performance to date in its current markets. Government, as it turns out, is quite a big buyer in this space because they’re happy to buy raw technology with which to experiment. This is so much the case that if you look at In-Q-Tel’s portfolio (In-Q-Tel is the CIA’s venture fund, which is used to bypass the bureaucratic process for bringing new technologies into government), it’s been a who’s who in the automated categorization space. A little known industry secret is that Autonomy made their fortune from selling categorization to corporate customers, but these customers mostly deployed the search technology and abandoned the categorization projects after realizing their complexity and, in many cases, their poor results. Well, now some of these players are stepping into the (dare I say it?) Web 2.0 limelight.

While on the ProgrammableWeb blog, I noticed an ad for another company in this space, ClearForest. I hate to pick on them, but in the 5 simple lines of text of a small ad they claimed, “Semantic Web Services, Try our SWS to, Transform text into valuable information”…here’s the actual ad:

So I clicked on it to learn more about Doc Bob’s Miracle Cure. What they were really talking about with this service is content extraction. In their case, they claim to seek out people, organizations, geographies and events. Not quite the categorization dream come true or the SemanticWeb mantra that we’ve come to expect. After some well-written text explaining the goodness behind their method and how this could help solve real problems, they provided a place to go test this out…yippee, that’s always my favorite part.

After going to Google News, I picked a story and clipped out the following text:

Taking a thrilling flight on a 4-year-old filly named Butterfly Belle, jockey Russell Baze made horse racing history Friday.

In fourth place at the top of the stretch, Baze and Belle charged through a hole on the rail and won the fourth race at Bay Meadows — the record 9,531st victory of Baze’s 32-year career.

“I’m not saying I’m the greatest rider ever, but I’m the winningest rider ever,” said Baze, 48, after his family and other jockeys joined him for a ceremony in the winner’s circle.

The jockey whose record he surpassed, Laffit Pincay Jr., said he made sure that Baze would win the fourth race by betting all of the other six horses in the field. “I’m a jinx, so I bet everybody else,” he said.

With Baze on the verge of tying the record, Pincay had traveled from his home in Southern California on Saturday to be on hand for the record-breaking race. Baze’s parents, Joe and Beverly, flew Saturday from their home in Montana. And Baze’s three grown daughters came up from Southern California. Of course, his wife, Tami, and 16-year-old son Gable were also on hand. All of them had a long wait this week because their man suddenly went cold.

After winning only one race Wednesday, Baze didn’t tie the record until Thursday. Then the likely odds-on favorite in the fourth race Friday, T’s So Shy, scratched because of an injury. “We got a break there,” said Baze’s long-time agent, Ray Harris.

Baze called the long-awaited victory “a big relief, a lot of weight off. I thought some of these horses this week would run a little bit better. It was a little aggravating that they didn’t perform as well as I hoped.

“Going into this race, I thought it looked like a good shot. The race looked like it set up well for us. I had a wall of horses in front of us turning for home, but they told me she’s got a good kick. I saw the hole developing on the rail.”

At first, Butterfly Belle — trailing Empress Justice, Normandy Princess and Out of Sugar — stumbled, startling Baze momentarily. “But she jumped right back up into the race,” he said. “When I asked her to bear down, she just ran by those horses easily. I was pretty sure at that point I was the winner.”

Nothing too crazy, and if the software is as good as their claims, this should be cake. So here were the results I got:

My reaction to the above screen shot was that the results were inconsistent. One of the hits was for “Butterfly Belle”, which is a horse, not a person. Given that it didn’t highlight the other three horse names, we can presume either that they know the difference, or that they don’t know the difference and missed those other three horse names, “Empress Justice, Normandy Princess and Out of Sugar”, altogether. Separately, while the system nailed “Montana”, it missed California (see the two references above for “Southern California”). While these may seem like nits, when doing data extraction or categorization such nits can become very important, especially in business contexts, and it’s this lack of consistency that has often kept this sort of extraction technology from being applied more broadly. Except of course, in government applications that relate to intruding on our citizens’ privacy to save us from “potential terrorists” ;-)

Note that this insight can help people understand why categorization technologies also frequently miscategorize unstructured content. When you consider permutations of the mistakes here, the likelihood of accurate categorization starts to drop significantly. I did some other tests with other business texts and found that the name of a company material to a transaction was missed in one case, and events were missed in another.
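To put a rough number on how quickly those permutations of mistakes add up, here is a back-of-the-envelope sketch. It assumes, purely for illustration, that every entity in a document is extracted correctly with the same probability and that the errors are independent of one another; neither the 90% figure nor the function name comes from ClearForest or any other vendor.

```python
def all_correct_probability(per_entity_accuracy: float, entity_count: int) -> float:
    """Chance that every entity in a document is extracted correctly,
    assuming (hypothetically) independent errors with equal accuracy."""
    return per_entity_accuracy ** entity_count


if __name__ == "__main__":
    # 0.9 is an illustrative per-entity accuracy, not a measured one.
    for count in (1, 5, 10, 20):
        chance = all_correct_probability(0.9, count)
        print(f"{count:2d} entities -> {chance:.3f} chance of a clean extraction")
```

Under those assumptions, a system that is right nine times out of ten on each entity gets a document with ten entities entirely right only about a third of the time, and any categorization that depends on several of those entities inherits the same decay.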

The reality is that I’m not too troubled by some of these inconsistencies, so long as the user of the technology can easily extend the system to catch these misses in the future. However, my experience has been that this usually requires a greater effort than is worthwhile, creating further challenges for adopting these technologies. Let’s hope companies like ClearForest start addressing the maintainability issue for their systems if they’re hoping for prime time attention.
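For a sense of what “extending the system” could look like in its simplest form, here is a minimal sketch of a user-maintained lookup list (a gazetteer) that catches names an extractor has missed. This is an assumption about one possible approach, not how ClearForest or any other product actually works; the `GAZETTEER` contents and the `find_known_entities` helper are hypothetical.

```python
import re

# Hypothetical, hand-maintained list of names the extractor missed or mistyped.
GAZETTEER = {
    "Butterfly Belle": "horse",
    "Empress Justice": "horse",
    "Normandy Princess": "horse",
    "Out of Sugar": "horse",
    "California": "geography",
}


def find_known_entities(text, gazetteer=GAZETTEER):
    """Return (position, name, type) for every gazetteer name found in text."""
    hits = []
    for name, kind in gazetteer.items():
        # Word boundaries keep a name from matching inside a longer word.
        pattern = r"\b" + re.escape(name) + r"\b"
        for match in re.finditer(pattern, text):
            hits.append((match.start(), name, kind))
    return sorted(hits)


if __name__ == "__main__":
    sample = (
        "At first, Butterfly Belle, trailing Empress Justice, "
        "Normandy Princess and Out of Sugar, stumbled."
    )
    for position, name, kind in find_known_entities(sample):
        print(position, name, kind)
```

The catch is visible right in the sketch: the list only grows when somebody notices a miss and adds it by hand, which is exactly the kind of ongoing effort that tends to outweigh the benefit.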
