Back in late 2005, several book publishing stakeholders decided to sue Google over the company’s Print Library Project. While at first I thought of this as just another old industry resisting change, it soon occurred to me that much more was at stake here and that it was worth further review. The implications also become important when we begin discussing other seemingly unrelated issues, such as those raised by entities like the AttentionTrust about who owns users’ clickstream exhaust and, more recently, those raised by several news publishers in Belgium.
So first let me start with the idea that search engines, from companies like Excite, Lycos, AltaVista, Inktomi, Google, etc., have been spidering the Web pretty much since the mid-90s, as far as I can remember. John Battelle’s book, The Search: How Google and Its Rivals Rewrote the Rules of Business and Transformed Our Culture, provides a good account of all of this. The role of spiders is to go out on the Net, bring back the content on Web pages (though in some cases they also pick up content from Office documents and PDF files) and index it in their search engines. In some cases, all that the search engine saves is the indexed information; in others it actually keeps an archived version of the page for future reference (I may be wrong on the specifics here, but that’s the general idea). However, from a search perspective, all of the search engines keep within the ‘fair use’ doctrine: they display only a sentence or two where the search term occurred and provide a link to the actual content for access to the complete text. Adhering to ‘fair use’ in effect keeps search engines on the right side of copyright law. As well, from the Web site publishers’ perspective, because people use the search engines to find information and so much of the publishers’ visitor traffic comes from the search engines as a result, it’s a mutually beneficial arrangement. My key point here, however, is that although the entire Web site is spidered in order to index its full content, the whole site is never rendered and so no substitute for the actual site is provided. That is why search engines do not appear to be violating copyright.
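To make the index-then-excerpt idea concrete, here is a minimal sketch in Python. The page URLs, page text, and function names are all made up for illustration, and real search engines are of course far more elaborate. The point is only that the whole page is read to build the index, while a search returns just one sentence and a link.

```python
import re
from collections import defaultdict

# Hypothetical pages standing in for spidered content (not real sites).
PAGES = {
    "https://example.com/a": "Search engines send spiders out to fetch pages and index their words.",
    "https://example.com/b": "Publishers rely on search traffic. A spider reads the whole page to index it.",
}


def build_index(pages):
    """Map every word to the set of URLs whose full text contains it."""
    index = defaultdict(set)
    for url, text in pages.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(url)
    return index


def snippet(text, term):
    """Return only the first sentence containing the term, not the whole page."""
    for sentence in re.split(r"(?<=[.!?])\s+", text):
        if re.search(rf"\b{re.escape(term)}\b", sentence, re.IGNORECASE):
            return sentence
    return ""


def search(term, pages, index):
    """Return (link, short excerpt) pairs for pages that contain the term."""
    urls = sorted(index.get(term.lower(), ()))
    return [(url, snippet(pages[url], term)) for url in urls]


INDEX = build_index(PAGES)
for url, excerpt in search("spider", PAGES, INDEX):
    print(url, "->", excerpt)
```

The searcher only ever sees the excerpt and the link; to read the rest, they have to visit the publisher's page.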
On the Net, spidering is an automated process, but dealing with physical books requires a manual or semi-automated approach. There are now some robotic machines that handle the scanning and digitizing of bound books. These in effect act in a manner somewhat analogous to spidering content, with the exception that a human being has to manually feed in the book(s) to be scanned. The purpose of doing this is to help people identify and find books that contain the content of interest to them. However, there are externalities to this.
For one thing, search engines can do more with the content than simply make it findable. They can analyze this content and determine its semantic relationship or relevance to other content. They aggregate analytics about the page: how many people doing a search clicked on the search result for that content? How many times did a particular piece of content come up in the search results? How many links exist to that content page from other content pages (can you say PageRank?)? And so on. These externalities could be regarded as the exhaust off of the content. This other stuff you can do and learn about the content by virtue of aggregating and analyzing it can also bring a tremendous amount of value, all from the use of this copyrighted content, though none of that value goes to the copyright owner. Today, that value resides within search engines like Google, Yahoo!, MSN Live, and Ask.com. It’s almost like the search engines are parasites on the aggregated content. They live on by the will of the content, which is itself only found online by the will of the search engines. Talk about a conundrum.
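As a rough illustration of this kind of exhaust, here is a small Python sketch. The link graph, the search log, and the function names are all hypothetical. It counts raw inbound links and click-through rates, which is much simpler than PageRank itself, but it shows how value gets derived from content the aggregator does not own.

```python
from collections import Counter

# Hypothetical link graph: each page maps to the pages it links to.
LINKS = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}

# Hypothetical search log: (page shown in results, whether it was clicked).
SEARCH_LOG = [("c", True), ("c", False), ("a", True), ("c", True)]


def inbound_link_counts(links):
    """Count how many distinct pages link to each page."""
    counts = Counter()
    for targets in links.values():
        counts.update(set(targets))
    return counts


def click_through_rates(log):
    """For each page, the share of its search impressions that were clicked."""
    shown, clicked = Counter(), Counter()
    for page, was_clicked in log:
        shown[page] += 1
        clicked[page] += int(was_clicked)
    return {page: clicked[page] / shown[page] for page in shown}


print(inbound_link_counts(LINKS))
print(click_through_rates(SEARCH_LOG))
```

None of these numbers appear anywhere in the pages themselves; they exist only once the content has been aggregated.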
Well, one important capability that exists online is that a site can have a robots.txt file indicating the site’s spidering policy, including not allowing the site to be spidered at all. It is this simple capability for which an off-line or book world equivalent must be found: effectively, a way for book publishers or copyright owners to be able to “opt out”. While opting out may not be a smart business decision for them, it should be a capability in much the same way that it exists as a capability for web pages.
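For anyone who hasn't seen one, here is roughly what such a policy looks like and how a well-behaved spider would check it, sketched with Python's standard urllib.robotparser module. The policy text, the bot names, and the URLs are invented for the example; ExampleBookBot is not a real crawler.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt policy: everyone is kept out of /private/,
# and one hypothetical crawler is excluded from the whole site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/

User-agent: ExampleBookBot
Disallow: /
"""


def build_parser(robots_txt):
    """Parse robots.txt text into a policy object a spider can query."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# A generic spider may fetch public pages but not the private area.
print(parser.can_fetch("SomeSearchBot", "https://example.com/articles/1"))
print(parser.can_fetch("SomeSearchBot", "https://example.com/private/notes"))

# The excluded crawler has been opted out of everything.
print(parser.can_fetch("ExampleBookBot", "https://example.com/articles/1"))
```

Note that nothing here forces a spider to run this check. The file only states the site's wishes, and it is up to the spider to honor them.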
The exhaust from aggregated content that I refer to seems similar to that which is generated by users’ clicking activity and aggregated by advertising networks. Hence, where the AttentionTrust is promoting the idea of users being able to opt out of being tracked (or cookied), this is functionally equivalent to users having their own robots.txt file to say they do not want to be tracked. Note that this would create added incentive for the ad networks to offer real value to users in exchange for allowing such tracking to continue. Hence the quality of the offerings made to tracked users should also be commensurately higher, the conversions for advertisers should then increase, and the fees to the ad networks should increase as well. Seems like a win-win-win all around.
Not being an attorney myself, nor having any authority to assess how the book publishers’ lawsuit might go, I’d say that if the courts rule indiscriminately for the plaintiffs (in this case the Authors Guild), without considering how these issues are addressed online today, then they’d in effect also be ruling that search engines could no longer spider content without the explicit consent of the copyright owners. I believe this could be a bad precedent to set, as it could be very impractical to require this inclusion process if it were not automated as it is with the robots.txt file. Effectively, what should be facilitated are tools for copyright owners to put all of their works up online (even if not visible to browsers), leaving them the option for these to be included in the search engines with a legally enforced robots.txt file. Where today compliance with the terms of that file is more voluntary than required, the file could be given more powerful legal standing.
Just thinking aloud here, but it seems like these issues are gaining some momentum and have to be addressed sooner rather than later.
Tags: google, authors guild, copyright, book search, spiders, robot.txt, attentiontrust, john battelle, print library project, yahoo!, excite, lycos, msn live, ask, inktomi