Wednesday, August 09, 2006

Comment Concerning AOL Data Release

You may have read on several blogs and in the mainstream press about the AOL data release. See http://www.nytimes.com/2006/08/09/technology/09aol.html, for example.

As a researcher who has employed search engine transaction logs in research projects for nearly a decade, I believe the concerns about the AOL data release are out of proportion to reality. Note, from the example in the NYT story, that even with 3 months of query data, including geographical data, the reporter wasn't sure that this was the person. (BTW, the reporter, Saul Hansell, obviously didn't mind publishing the lady's queries for the entire world to see -- with her name. I hope he adequately explained to the lady the ramifications of what she was agreeing to.)

It is VERY difficult to identify a particular searcher using just query terms, which is why researchers have been struggling with personalization for nearly two decades. In the DOJ vs. Google case, which is mentioned in the story, Google had to provide the queries to the DOJ (a statistically significant sample of about 5,000 instead of the larger number the DOJ was asking for). The privacy concerns were weighed against other factors, which is what should be done here.

There is no way to get real-world interaction data from a significant sample of Web users unless the search engine companies provide it to academic researchers. Many search engine companies provide or have provided this type of data (including Excite, AltaVista, AlltheWeb, Lycos, AOL, Yahoo!, MSN, and Google, among others -- they all do it or have done it). Many search engine companies post this data on their Web pages, provide it to researchers or the government, or sell it to commercial research companies.

Are there potential privacy concerns with such data releases? Yes. Are there potentially great benefits from such data releases? Yes.

As a good road ahead, both search engine companies and the research community need to work on ways to preserve privacy in such data releases and to ensure that a balanced voice is heard in these debates.

Thursday, August 03, 2006

Logging Traces of Web Activity: The Mechanics of Data Collection

I attended the WWW 2006 workshop Logging Traces of Web Activity: The Mechanics of Data Collection, organized by Andy Edmonds of Microsoft, Kirstie Hawkey of Dalhousie University, Melanie Kellar of Dalhousie University, and Don Turnbull of the University of Texas at Austin. It was a really nicely done workshop with some interesting papers presented.

Collecting data about users of Web sites and services at an appropriate level of granularity is difficult. There were several papers on different approaches; unfortunately, many of the approaches were narrow and difficult to reuse (i.e., if you are replicating the study exactly, fine; if not, then the tools would take a lot of rework).

I presented on a tool that students (George Kroner, Chris Catalano, and Raghavan Ramadoss) and I have worked on -- the Wrapper -- for Web information studies. The Wrapper is easy to use, quick to install (you can be up and running in about two minutes), and collects most user-browser interactions. We have two versions available for download and a third (designed for naturalistic studies) in testing. Versions 1 and 2 are available here.