Thursday, July 31, 2008

Isaac Asimov’s Psychohistory and the Data from Search Logs

There is a lot of data recorded by search engines and one can do a lot with it. I've been working on predictive models, where one can use this data to infer what a person will do next based on records of previous and current actions.

This is really similar to Asimov’s psychohistory from his Foundation book series.

Basically, the idea of psychohistory is that one can’t predict what a single person will do. An individual is just too unpredictable. But for a really large number of people, a law of mass action takes over, and one can predict what this mass of people will do. This is a really intriguing concept. And interestingly, it is playing out with some pretty good success even at this incipient stage of research.

For some more background: in Asimov’s book series, a mathematician named Hari Seldon develops a branch of mathematics known as psychohistory. Using psychohistory techniques, one can predict the future for a very large number of people. The more people, the more predictability. In the books, the number of people needed to be in the billions to make worthwhile predictions. However, research is showing that maybe one can build predictive models using numbers in the millions.

There are applications of this, of course, in Web search, such as determining user intent (I’ve other stuff in the works that is showing promise). Naturally, the search engine companies are doing work on this. See work by Google using large data sets and MSN on determining a variety of aspects of the user.

There is an interesting discussion that these large data sets are the end of theory, but maybe they will lead to newer psychohistory-like theories. The large social networking sites, like SecondLife, can make possible the study of economic, political, and social theories (if the data is available).

Certainly, there are some concerns with this ability to predict future actions. When one can predict, it is just a short step to control, which is what happens in the Foundation books. Using models from psychohistory, Seldon sets humanity on a predetermined course of events. And, naturally, governments are interested in psychohistory for their own purposes. So, I am aware of the possible risks of such predictive model development.

However, this is the direction that my research is taking, particularly with predictive approaches to reformulation, desired content, and future clickthrough. There are a host of potential upsides for Web search.

Wednesday, July 30, 2008

Effectiveness of Sponsored and Non-sponsored Results

Search engines offer two general categories of search results: organic (a.k.a. algorithmic, natural, or non-sponsored) and sponsored (a.k.a. ads or sponsored links).

Certainly, the organic content is what brings folks to the search engine. But it is the sponsored links that pay the bills for the search engines. Maintaining the massive infrastructure of modern Web search engines is not cheap. And the search engines provide a wonderful service (for free). The impact has been incalculable, but certainly the Web as we know it would look very different today without the services that the search engines provide.

Sponsored links get anywhere from 15 to 30% of the clicks, depending on who’s counting. I wondered why it was so low? Maybe the sponsored links were not as relevant? Maybe users just didn’t trust them. So, I conducted a research study to examine whether or not sponsored links were as relevant as non-sponsored links.

In the study, I used 108 ecommerce queries and 8,256 retrieved links for these queries from the three major Web search engines: Yahoo!, Google, and MSN. Blinded as to whether the link was organic or sponsored, judges evaluated each link. The results show that average relevance ratings for sponsored and non-sponsored links are almost the same. The relevance ratings for sponsored links are even statistically higher than non-sponsored for the same query.

In addition to relevance evaluations, I qualitatively analyzed the e-commerce queries, deriving five categorizations of underlying information needs. Product-specific queries are the most prevalent (48%).

Title (62%) and summary (33%) are the primary basis for evaluating sponsored links with URL a distant third (2%).

To gauge the effectiveness of sponsored search campaigns, we analyzed the sponsored links from the various perspectives of marketing campaigns. Sponsored links from organizations with large sponsored search campaigns are more relevant than the average sponsored link.

Read the complete manuscript on research comparing sponsored and non-sponsored links

Tuesday, July 29, 2008

Defining a Session on Web Search Engines

The basic components of Web searching are hierarchical. One or more key terms form a query. One or more queries form a session. One or more sessions form an episode.

There is a lot of research work in the information searching and information retrieval areas at the term and query level. In fact, one can say that the whole construct of information retrieval is focused on the query (i.e., optimize algorithms to provide the best possible results for a given query).

However, from the user perspective, the most critical level is probably the session. Think about it. You use a search engine and submit three queries looking for some particular content. Afterwards, do you think, “Wow, the response to query 2 was great!”? Generally, no. Instead, it is more “Wow, I found what I was looking for.” The focus is on the session, not the individual query.

However, it is rather difficult to define a session using the data recorded in transaction logs. This may be why there is little research in this area and much more at the query level.

The de facto standard for a session is some temporal (i.e., time) cut-off, generally 30 minutes without any activity on a given system.

I led a research study examining three different ways to define a Web searching session using data from a search log. We defined a session using (a) Internet Protocol address and cookie; (b) Internet Protocol address, cookie, and a temporal limit; and (c) Internet Protocol address, cookie, and change in terms.
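
As a rough illustration of approaches (a) and (b), a minimal Python sketch might look like the following. The record layout (IP address, cookie, timestamp, query), the function name, and the sample records are all assumptions made for illustration; they are not the format of the Dogpile log or the code used in the study.

```python
from collections import defaultdict
from datetime import datetime, timedelta


def split_sessions(records, timeout=timedelta(minutes=30)):
    """Group (ip, cookie, timestamp, query) records into sessions.

    Records are first grouped by (ip, cookie). If `timeout` is given, a new
    session starts whenever the gap between two consecutive records of the
    same user exceeds it. With timeout=None, each (ip, cookie) pair is one
    session.
    """
    by_user = defaultdict(list)
    for ip, cookie, ts, query in records:
        by_user[(ip, cookie)].append((ts, query))

    sessions = []
    for user, events in by_user.items():
        events.sort(key=lambda e: e[0])  # order each user's records by time
        current = [events[0]]
        for prev, event in zip(events, events[1:]):
            if timeout is not None and event[0] - prev[0] > timeout:
                sessions.append((user, current))
                current = []
            current.append(event)
        sessions.append((user, current))
    return sessions


# Hypothetical sample records, invented for this sketch.
log = [
    ("10.0.0.1", "c1", datetime(2005, 5, 6, 9, 0), "cheap flights"),
    ("10.0.0.1", "c1", datetime(2005, 5, 6, 9, 5), "cheap flights boston"),
    ("10.0.0.1", "c1", datetime(2005, 5, 6, 11, 0), "pizza recipe"),
    ("10.0.0.2", "c2", datetime(2005, 5, 6, 9, 1), "weather"),
]
sessions = split_sessions(log)
```

With the default 30-minute limit, the sample yields three sessions; calling `split_sessions(log, timeout=None)` (approach (a)) yields two. Approach (c) would replace the time test with a check on whether consecutive queries share any terms.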

The data set was 2,465,145 interactions from 534,507 users of Dogpile.com on May 6, 2005. The research results show that defining sessions by a change in terms along with Internet Protocol address and cookie provided the best measure.

Interestingly, regardless of the method used, the mean session length was fewer than three queries, and the mean session duration was less than 30 min.

Web searchers typically modified their query by changing query terms (nearly 23% of all query modifications) rather than adding or deleting terms.

The implications for search engines and advertisers interested in measuring searching traffic? Sessions may be a better indicator than the common metric of unique visitors.

Read the complete research paper on defining Web searching sessions

Saturday, July 26, 2008

Factors relating to the decision to click on a sponsored link

Sponsored search (a.k.a. pay-per-click) is the revenue engine that finances Web search as we know it. As in any pay-per-click (PPC) model, getting searchers to click is important, as well as ensuring the click is relevant. That is, an advertiser doesn’t want to pay for a click unless there is a reasonable chance of a conversion (i.e., the visitor does something of value once at the Website).

I ran a research project investigating what causes a searcher to click on a link, both organic and sponsored. In this research, 56 participants each engaged in six e-commerce Web searching tasks. This approach allowed for both investigating the bias toward sponsored links and controlling for quality of content. Data included 2,453 interactions with result page links, 961 utterances evaluating these links, and 102 results from a post-study survey.

The results of the research indicate that there is a statistically significant preference for non-sponsored links with searchers viewing these results first more than 82% of the time.

However, more than 73% of the searchers did view sponsored links at least once during the six searching sessions.

Interestingly, in post session surveys, most participants reported being unconcerned about whether or not the link was sponsored. They just wanted useful content.

See the complete research paper on factors affecting clickthrough of search engine results.

Friday, July 25, 2008

Expertise in Your Own Backyard

Sometimes one doesn’t have to travel far to find folks working on cutting edge projects. Had a meeting with Cole Camplese the other day. Cole is Director of Education Technology Services for Penn State University (http://www.colecamplese.com/) and is pushing several really cool projects. Cole is in that interesting space linking practitioner and academic researcher.

The reason for the meeting was the research work on search and branding that I am conducting; specifically, we are leveraging Twitter as an instrument to manage brands.

Cole is really pushing and leveraging Twitter to build closer communities, even within groups that do and can interact with each other in the ‘real world’. Cole has also used Twitter in the classroom to good results. Cole maintains an excellent blog at http://camplesegroup.com/blog/

Thursday, July 24, 2008

Impressions of the Googleplex

Recently visited Google headquarters. What can I say but really impressive. Beautiful campus and excellent working conditions. Dedicated people. I was there fairly late in the day, and there were still a lot of folks working, or at least there interacting (which also has advantages).

Some overall impressions
- Definite food culture. Since I am a foodie, I really appreciated this.
- Office Assignments. Everyone that I saw shared an office. Think this is great. As an academic, I have seen the ‘my office is bigger than your office’ child’s play too often. A shared office keeps folks humble and facilitates the sharing of ideas.
- View into the Mind of the World. I have heard a lot about the projecting of searching queries on the wall, as a snapshot into the mind of the world or the database of intentions. I've even talked about it before as Preserving the Collective Expressions of the Human Consciousness. Not so impressed with this. Maybe my expectations were too high. Was a real let down and not as insightful as I had thought.
- People. As I walked down the hallways, the names on the offices were some of the most well-known folks in their respective areas. The people that Google has recruited over a decade-long hiring spree are top-notch. Google has basically hired a collection of some of the best folks in the world in a variety of disciplines and brought them together in one organization.

Although I am a fan of Yahoo!, certainly, Google is THE place to work today in the search engine business. Google has been the best company so far at developing its business model to leverage the Web as infrastructure. Their ad platform is stellar and easy to use. Plus, they swing for the fence on projects, which is great.

As for the competition, I have also been really surprised at the performance of Microsoft in the searching area. I like Microsoft, and they have made some great software for the desktop that makes individuals productive. When they entered the market with their LiveSearch, I really thought that ‘the 800 pound gorilla is here’, look out! However, even after sinking a lot of resources into their product, they just haven't yet cut it.

I go back to Google's ability to hone their business plan to leverage the Web. Obviously, there were (are) some structural or organizational issues at Microsoft that prevented this same level of performance (the technology is similar). Which may be what resulted in this leadership shake-up at Microsoft Live Search.

Wednesday, July 23, 2008

E-Survey Methodology to Supplement Search Logs

Although transaction logs are really powerful for studying user – system interactions, it is best to triangulate with data collected from other methods. With computer networks, an e-survey is an excellent way to get more data (and different data than in transaction logs) to provide richer results.

Exploitation of these e-survey techniques requires careful consideration of conceptual and methodological issues associated with their use. I co-authored a chapter where we identify and explore these issues by defining and developing a typology of “e-survey” techniques for Web research.

The chapter examines the strengths, weaknesses, and threats to reliability, validity, sampling, and generalizability of e-survey approaches. The chapter also discusses issues of security, privacy, and ethics associated with the design and implications of e-survey methodology.

You can read the complete chapter on e-survey methodology.

Tuesday, July 22, 2008

The Effect of Information Technology: The 100

One of my favorite books is The 100: A Ranking of the Most Influential Persons in History by Michael H. Hart.

In this book, Hart ranks the most influential people in human history. Influence is defined as the number of people affected multiplied by the time span of that impact. Number 7 on the list is Ts'ai Lun, who is credited with making modern paper.

What is really interesting in this chapter on Ts'ai Lun is Hart’s discussion of the geographical clustering of people on the list. The people on the list are clustered in both time and location.

The author credits this clustering to the ease with which a given civilization or society is able to communicate written information. So, when clay tablets were the cutting edge in written communication, the Middle Eastern civilizations were the most advanced. When papyrus was invented, the Egyptians had the most advanced civilization. Paper comes along – the Chinese. Block printing – again, the Chinese. Moveable type – the Europeans.

This may smack of technological determinism to some; however, if one views technology as just one of the key elements in a given society, I believe Hart’s hypothesis has a lot of merit. Naturally, technology interacts with other factors. With moveable type, for example, the Europeans had an alphabet that facilitated its use, while the Chinese did not.

An interesting question is what impact the Internet will have as a medium for communication in modern societies. Or have we advanced beyond this into ‘one global communication culture’, so that the Internet is not an advantage for any individual nation or region?

I am betting that the nations with the technology, policy, money, and drive to leverage the Internet to communicate and disseminate information will have a HUGE advantage. Within each nation, regions, states, cities, towns, and individuals who can leverage the Internet will also have an advantage.

Reference
Hart, Michael H. (1992) The 100: A Ranking of the Most Influential Persons in History, Revised and Updated for the Nineties. New York: Carol Publishing Group/Citadel Press.

Disclaimer: Michael H. Hart is a controversial figure. However, The 100 is still a good book.

Monday, July 21, 2008

The Societal Need to Preserve Search Logs from the Major Search Engines

Much of my research has focused on using Web search logs, including co-authoring one of the first published studies investigating Web searching. I have a real interest and appreciation for the benefits, as well as the shortcomings and risks, of analyzing such logs.

These logs are really the records of the collective expressions of the human consciousness. Search logs contain the immediate expressions of our society. These search submissions are records of our wants, desires, and interests, both big and small. John Battelle calls it the database of intentions and Peter Day calls search logs looking into the mind of the world.

Unfortunately, a lot of these records are disappearing at an alarming rate. There are only small snippets of logs remaining from some of the original and major search engines on the Web (Excite, AltaVista, AlltheWeb, AskJeeves, etc.). I make some of these search logs available for the research community.

I understand the privacy thing, but certainly some archiving mechanism could be arranged.

I presented a position paper on preserving Web logs at a World Wide Web Conference workshop on search logs along with two other panelists, Judit Bar-Ilan and John Morris. The other panelists took alternative positions. It was a really exciting panel, with some great discussion and Q&A!

See my position paper on preserving search logs and my presentation on preserving search logs

Sunday, July 20, 2008

Google Online Marketing Challenge Results

The results from the Google Online Marketing Challenge were just announced. The Google Online Marketing Challenge was certainly one of the biggest, if not the biggest, in-class academic competitions ever undertaken.

There were 1,620 teams that finished all requirements (approximately 44% from EMEA [Europe-Middle East-Africa], 40% from the Americas, and 16% from APAC [Asia – Pacific]).

At the conclusion of the Challenge, Google first did an algorithmic cut of the campaign results, narrowing the field to 150 teams (9% of the 1,600 teams).

Then, expert Googlers in AdWords did a manual review of the campaigns, trimming the 150 to 15 (0.9% of the 1,600).

Then, an Academic Panel of Professors evaluated the campaign reports prepared by these 15 teams, selecting 3 teams from the 15 (0.2% of the 1,600), one from each region.

From these three teams, one was selected as the Global Winner (0.06% of the 1,600).
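
The share of teams surviving each stage can be recomputed from the counts given here. A small Python sketch, using the 1,620 finishing teams as the base (the stage labels and function name are my own, chosen for illustration):

```python
TOTAL_TEAMS = 1620  # teams that finished all requirements

# (stage label, teams remaining after that stage)
stages = [
    ("algorithmic cut", 150),
    ("manual review", 15),
    ("regional winners", 3),
    ("global winner", 1),
]


def share_of_total(count, total=TOTAL_TEAMS):
    """Return `count` as a percentage of `total`."""
    return 100.0 * count / total


for name, count in stages:
    pct = share_of_total(count)
    print(f"{name}: {count} of {TOTAL_TEAMS} teams = {pct:.2f}%")
```

Against the 1,620 finishing teams, the single Global Winner works out to roughly 0.06% of the field.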

Really impressive! For winning, the students get numerous prizes including a laptop and a trip to visit Google HQ.

Congrats to the winners of the 2008 Google Online Marketing Challenge! I am also really proud of the team of Daehee Park, Caroline Furey, Joe Lewis, Matt Maisel, and Tonya Podkuiko for winning the Americas Region and proud of all the Penn State students who took the Challenge.

See the press release on the winning Penn State team in The Google Online Marketing Challenge.

A big thanks to Lee Hunter (from Google) and Jamie Murphy (University of Western Australia) who were the main leads on getting the Challenge conceived, developed, and executed.

Tuesday, July 15, 2008

Query Reformulation During Web Searching

One aspect of Web searching that generates a lot of interest is query reformulation. Not really surprising, given that the query is the primary expression that one can glean from the searcher - system interactions.

I collaborated on research examining query reformulation in Web searching, with some really interesting findings. We developed six basic classifications of any query. These six classifications are:
  • Assistance: the current query was generated by the searcher’s use of a system help feature.
  • Content Change: the current query is identical but executed on another content collection.
  • Generalization: the current query is on the same topic as the searcher’s previous query, but the searcher is now seeking more general information.
  • New: the query is on a new topic.
  • Reformulation: the current query is on the same topic as the searcher’s previous query and both queries contain common terms.
  • Specialization: the current query is on the same topic as the searcher’s previous query, but the searcher is now seeking more specific information.

We used these to classify a multi-million record transaction log. Our findings show that there are three strong query reformulation transition patterns: (1) specialization to generalization, (2) video to audio, and (3) content change to system assistance.

The implication of these findings is that if one can predict future query reformulations, the system can infer the future action and just do it for the searcher, thereby (hopefully) improving searching performance. This is especially true with the use of system help. If one can determine where in the search process the user is looking for help, then the system can interject itself into the process at this point, seemingly at the time the user is looking for help. Otherwise, the system hangs back. This timed automated assistance reduces interruptions.
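
To make the idea of predicting the next reformulation concrete, here is a minimal Python sketch that counts transitions between classification labels and picks the most frequent successor. Treating this as a simple first-order (previous label only) model is my assumption for the sketch, and the toy sessions are invented; neither is the study's data or necessarily its method.

```python
from collections import Counter, defaultdict


def transition_counts(sessions):
    """Count label-to-label transitions across sessions.

    `sessions` is a list of label sequences, one sequence per session.
    """
    counts = defaultdict(Counter)
    for labels in sessions:
        for current, following in zip(labels, labels[1:]):
            counts[current][following] += 1
    return counts


def predict_next(counts, current):
    """Return (most likely next label, estimated probability), or None
    if `current` was never observed with a successor."""
    if current not in counts:
        return None
    following, n = counts[current].most_common(1)[0]
    return following, n / sum(counts[current].values())


# Invented toy sessions using the six classification labels.
toy_sessions = [
    ["New", "Specialization", "Generalization"],
    ["New", "Specialization", "Generalization", "Reformulation"],
    ["New", "Content Change", "Assistance"],
    ["New", "Specialization", "Reformulation"],
]
counts = transition_counts(toy_sessions)
prediction = predict_next(counts, "Specialization")
```

On this toy data, `prediction` is `("Generalization", 2/3)`: two of the three observed transitions out of Specialization lead to Generalization. A system could use such an estimate to decide when to offer help and when to hang back.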

See the full research paper on Web query reformulation

Monday, July 14, 2008

The Unexplainability of the Long Tail

I’ve been thinking about issues concerning ‘the long tail’ for some time now, as it is a near-constant occurrence when doing empirical Web research.

The long tail is mathematically a power law distribution, and power-law relations describe a near unbelievable number of patterns in a wide variety of fields. A power law is any polynomial relationship that exhibits the property of scale invariance (i.e., rescaling the input only rescales the output by a constant factor, so the shape of the relationship stays the same). If we take the log of both the x and y axes, the graph of the power law is linear with a constant slope. Most typically, a power law distribution is a rank – frequency plot of a large set of data in a given context.
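
The log-log property is easy to check numerically. If frequency follows f(r) = C * r^(-a) for rank r, then log f = log C - a * log r, which is a straight line with slope -a. The Python sketch below builds a rank – frequency list and fits that slope by ordinary least squares; the function names and the synthetic data are assumptions for illustration, and a least-squares fit on logs is only a quick check, not a rigorous way to estimate a power-law exponent.

```python
import math
from collections import Counter


def rank_frequency(items):
    """Return [(rank, frequency)], with rank 1 for the most frequent item."""
    freqs = sorted(Counter(items).values(), reverse=True)
    return list(enumerate(freqs, start=1))


def loglog_slope(points):
    """Least-squares slope of log(frequency) against log(rank)."""
    xs = [math.log(rank) for rank, _ in points]
    ys = [math.log(freq) for _, freq in points]
    n = len(points)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


# Synthetic frequencies that follow f(r) = 1000 * r**-1 exactly.
synthetic = [(r, 1000 * r ** -1.0) for r in range(1, 101)]
slope = loglog_slope(synthetic)  # very close to -1
```

For real data, such as queries from a search log, one would pass the list of queries to `rank_frequency` and then inspect how well a straight line describes the resulting points.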

This relationship has been discovered numerous times in a variety of disciplines, including the Pareto distribution, Zipf's law, Bradford's law, Zeta distribution, and Benford's law. The idea has also been noted in numerous other areas, so much so that I am surprised whenever someone says that they are surprised that the power law occurs. I am surprised when the long tail DOESN'T describe a natural pattern!

In fact, the power law distribution explains so much, it actually explains too much. Which makes me think that perhaps the resulting distribution is just a methodological by-product of taking the rank – frequency of large sets of naturally occurring numbers. That is, there is some underlying mathematical pattern that we have yet to discover.

One interesting book that I believe looked at this from a novel aspect was The Long Tail by Chris Anderson. One of the most interesting insights from Chris was the extension of the long tail into a third dimension. Really insightful and will (or should) generate a lot of research.

There have been some counter arguments concerning Chris’s main conclusion that the Web is lowering the cost of distribution and therefore making the tail profitable. For example, Anita Elberse did some research that called the idea of the long tail lowering distribution costs into question. However, from my read of the article, we are splitting hairs. Generally, results from both works appear to be similar, at least to me.

Here is a link to a short version of Anita’s original piece and discussion, including a response from Chris.

Here is a paper that I did that shows the long tail in Web searching queries

Here is a book review that I did of The Long Tail

Saturday, July 12, 2008

Click through in Web Searching (Patterns in Viewing of Search Engine Results Pages and the Links off of these Pages)

One of the most important measures of search engine performance, from the perspective of both organic and sponsored listings, is click through – the viewing of Webpages linked from a search engine results page (SERP).

I co-authored a chapter reporting a comprehensive review of click through using search engine transaction logs, along with a temporal analysis of Webpage viewing. The analysis illustrates that the searcher – Webpage interaction is extremely short (about 30 seconds!), and that most searchers don’t go beyond the first SERP and click on 8 or fewer links.

In general, results for SERPs show that approximately 60 percent of users enter only one query and approximately 60 percent of users view only one SERP.

As for total time on the search engine, 52 percent of the sessions spent less than 15 minutes with nearly 30 percent being less than 5 minutes.

For viewing the Webpages off of the SERP, while 10 links are typically displayed, the analysis shows that more than 66 percent of searchers examine fewer than 5 Webpages in a typical session and almost 30 percent view only one document in a given session. The average was 8.

You can read the complete chapter on viewing SERP and Webpages

Friday, July 11, 2008

Scirus Topic Page

As a scientific niche search engine, Scirus has started something called Topic Pages.

I was interested in the idea at first but didn’t really see what could be new about it. However, I have grown to like it.

Scirus Topic Page is a no cost, wiki-like platform service for the research community where experts summarize specific scientific topics. In addition to a write-up on a subject, the Topic Page also provides links to journal literature and Web sources.

On the page, Scirus also automatically presents relevant journal articles, Websites, and news articles from a search based on the topic page. Really quite nifty!

Read my Scirus page on sponsored search.

Thursday, July 10, 2008

Sponsored Search as Information Searching

It is no surprise to most that the success of sponsored search (a.k.a., pay-for-click or keyword advertising) has radically affected how people interact with search engines, Websites, and product services on the Web.

For several years now, sponsored search has provided the necessary revenue streams for the major Web search engines and numerous third party sites. Sponsored search is critical to the success of many online businesses, plus a whole search engine marketing industry. However, there has been little academic examination of sponsored search, outside of research into online auctions.

I co-authored an article that frames sponsored search as an information searching process, where we replace advertiser with provider to remove the label of sponsored search as just an advertising medium. The components of this framework are:

  • Provider. A person or organization interested in generating user traffic to a particular website for some specific purpose. We use the term ‘provider’ rather than advertiser to highlight that one can view sponsored search as a version of providing relevant content to a searcher, and not solely as an advertising medium.
  • Provider content. A set of keywords (representing concepts) along with the associated Uniform Resource Locators (URLs), titles, and descriptions, typically referred to as an advertisement or a sponsored link. Although these terms have a heavy commercial interpretation, their use has become commonplace within the sponsored search domain. Therefore, we use them in this paper when referring to the provider content displayed on the SERP.
  • Provider bids. Bids for specified keywords that are a monetary valuation of traffic to a particular website by a provider.
  • Search engine. A search engine that serves the advertisement in response to user queries on SERP, relevant websites, or email.
  • Search engine review process. A method utilized by a search engine to ensure that the provider’s content is relevant to the targeted keyword or contextual material.
  • Search engine keyword and content index. A mechanism that matches provider’s keywords to user queries or to contextual material.
  • Search engine user interface. An application for displaying provider content as links in rank order to a searcher. Typically, the interface displays the sponsored links with non-sponsored links on a SERP, within email messages, or alongside content on a web page.
  • Search engine tracking. A means of matching keywords to queries, gathering provider’s content, bids, metering clicks, and charging providers based on searcher clicks on their displayed links.
  • Searcher. An agent (i.e., human or automated surrogate) that actually clicks on a sponsored link that is deemed relevant.
Additionally, we provide a history of sponsored search and an extensive assessment of the technology that makes sponsored search possible.

It’s a good paper for those that want a complete (concept, technology, and marketing) overview of sponsored search.

See the full manuscript on sponsored search

Wednesday, July 09, 2008

i-conference workshop on publishing multi-disciplinary research

I was the lead on a workshop at the i-conference, a conference loosely for the i-schools. The workshop was on publishing multi-disciplinary research (note, the emphasis on ‘publishing’ as opposed to conducting or teaching).

I recruited panelists who intentionally offered diverse opinions concerning this issue (even if they really didn't hold these opinions), which made for a thought provoking session. The three panelists were Elizabeth D. Liddy (position: quit whining and just publish good research), Howard Rosenbaum (position: publishing multidisciplinary research is really hard), and Mark S. Ackerman (position: middle of the road). We had a really interesting and lively discussion.

My take away from the panelists’ conversation, audience comments, and discussion? Publishing multi-disciplinary research is hard for a number of reasons so don’t bother trying it. (Note: my take-away, not that of other panelists, audience members, etc.)

The position paper is here - http://www.ischools.org/oc/conference08/pc/WC11_iconf08.pdf

Tuesday, July 08, 2008

Automated Assistance (i.e. system help or searching help) for Web Searching

One of the most potentially useful, yet difficult to achieve, technologies has been sophisticated and useful system assistance during the searching process.

Certainly, there has been some system assistance that has been really useful. Probably the most successful and worthwhile has been the spelling suggestions for query terms.

Several Web search engines from Excite onward have attempted various forms of query reformulation or query suggestions. There have been some other attempts, such as relevance feedback. However, these have not shown much advantage.

Much of the system assistance research has focused on personalization, usually employing some type of implicit feedback (i.e., drawing inferences from user – system interactions). Implicit feedback has some advantages over explicit feedback approaches (such as document ratings and profiles), namely the participation rates are much better.

However, the results from this line of research have been less than stellar. Most research results show little to no improvement in searching performance. Personalizing at the individual level may be just too difficult or a dead end.

A more worthwhile avenue of investigation may be to (1) identify what the user is seeking in terms of content and (2) personalize at this aggregate level. This approach seems to have promise.

Here are three papers that I have done on using implicit feedback for automated assistance / system help:

http://ist.psu.edu/faculty_pages/jjansen/academic/pubs/jansen_cacm_2006.pdf
http://ist.psu.edu/faculty_pages/jjansen/academic/pubs/jansen_assistance_jasist2005.pdf
http://ist.psu.edu/faculty_pages/jjansen/academic/pubs/jansen_assistance_IPM05.pdf

Monday, July 07, 2008

Mahalo - a human-powered search engine

Recently visited Mahalo (http://www.mahalo.com/), a ‘human powered’ search engine founded by Jason Calacanis. Gave an invited presentation to the Mahalo employees, including Jason, on search engine branding (see a short paper on the topic of search engine branding here).

It was a really interesting discussion, and the combined presentation and Q&A lasted a long time (about 2 hours!). Luckily, we did it over lunch! Also, I got to spend some time with several of the folks at Mahalo, including Eric Stephens, the director of user experience and Dan Zinngrabe, a research analyst.

The niche Mahalo is aiming for is really interesting with some obvious applications where it could be successful. Of course, they have a hard row to hoe competing in the search engine market, with several tough obstacles, including the branding effect of the major search engines, which was the topic of my talk. Plus, there is the long tail of Web queries and Web topics, so obviously, human power can't do it all.

When asked what I thought it would take to ‘break into’ the search engine market in a big way, I said some type of disruptive technology that improves search results so much that searchers can’t ignore the difference. We can see precedent for this with what AltaVista and Google did when they entered the search engine market. Of course, there is some ‘survivor bias’ in such a view point (i.e., there are some search engines that had disruptive technologies that didn’t make it).

Mahalo has an interesting idea that appears beneficial for the more popular topics, some good people, good financing, and, of course, Jason is a marketer who could sell you your own left shoe. :-)

Sunday, July 06, 2008

The Scientific Method of Research

This is one of my favorite quotes concerning research and the scientific method:

For the true scientific method is this: to make no unnecessary hypotheses, to trust no statements without verification, to test all things as rigorously as possible, to keep no secrets, to attempt no monopolies, to give out one’s best modestly and plainly, serving no other end but knowledge (Wells, 1920).

I consider it a guiding principle for the conduct of an academic researcher.

Reference
Wells, H.G. (1920). The outline of history: Being a plain history of life and mankind (Vol. II). Garden City, New York: Garden City Books.

Saturday, July 05, 2008

A lot of data can tell you a lot of things

Attended a talk by Peter Norvig, Google's research director, where he discussed the use of *very large* amounts of data for certain tasks that limited the need for models. The talk was entitled, Practice makes perfect: How billions of examples lead to better models of language, pictures, and other things (http://gradsymp.ist.psu.edu/archive/2008/keynote).

Peter’s talk at this venue and other places has caused several interesting posts, including an article by Chris Anderson (The End of Theory: The Data Deluge Makes the Scientific Method Obsolete - http://www.wired.com/science/discoveries/magazine/16-07/pb_theory/) and a response by Kevin Kelly (The Google Way of Science - http://www.kk.org/thetechnium/archives/2008/06/the_google_way.php).

As an empirical researcher, there is a lot that I agree with in the ‘data is king’ view. Specifically, that (1) a lot of data can tell you a lot of things, (2) if you are primarily interested in what people are doing, why do you care why? They did it. (3) that a lot of models that deal with people are inaccurate and often wrong, and (4) most models are not models at all but are little more than paradigms created by ‘us’ that impose our biases on a process in a given context.

Chris Anderson takes this to another level, saying that theories are out now also. That this is the end of the scientific method.

Theories and models are not the only scientific structures that we use. In fact, a lot of current scientific research is only loosely connected to theory and models. Plus, grounded theory is based on finding patterns in data, so this data-first approach is not really that new. It is new in that the quantity of data has exponentially increased, providing new opportunities.

However, in addition to theory, there are also theoretical constructs, the small, simple, narrow statements or laws on which theories and then models are built. Every human endeavor, even the ones mentioned by Peter, Chris, and Kevin, has these implicit constructs at its core. Some theoretical construct, either explicit or implicit, concerning the importance of an approach or even an action is at the core of every scientific project or human endeavor.

Finally, there are some theories for which no or limited empirical data exists, such as string theory.

So, the quantity of data available provides new opportunities in some fields. Rather than killing anything, it enriches the scientific domain.

Friday, July 04, 2008

Summize, a comparison search engine

Recently visited Summize (http://summize.com/), a search engine for topics and attitudes expressed within online conversations. Really interesting product, especially their Twitter searching. Had some good conversations with Abdur Chowdhury (http://www.ir.iit.edu/~abdur/), CEO and co-founder of Summize.

A lot of our conversation centered on the changing nature of privacy. In Twitter, LinkedIn, Facebook, and other social networking applications, the default is wide open. A lot of users know this going in or find out quickly, and they see no reason to change.

Tweets get posted to the Twitter Website by default. I have noticed on LinkedIn that most folks leave their connections viewable. While some on Facebook use privacy settings to avoid cyber stalking, most allow their profiles to be openly viewed by others. That said, I have noticed a generational split with this on Facebook (i.e., the older folks using tighter privacy settings). … Also, my kid did ‘de-friend’ me on Facebook because it was way un-cool to be friends with dad. :-)

To be honest, I have found the open Tweets quite useful and interesting. The ability to view others’ LinkedIn connections has been quite helpful in facilitating business. I have my profile open on Facebook and have reconnected with some old friends.

There appears to be an openness with the younger folks concerning the use of and posting of online media, with an attitude of “if I didn’t want people to know this, I would not be using a computer”. This is a whole different view of privacy than what existed previously.

I am currently working on a research project with Summize and the use of Twitter for brand management. See the press release here - http://live.psu.edu/story/31198

Thursday, July 03, 2008

Click Fraud

Was reviewing the Wikipedia page on click fraud (http://en.wikipedia.org/wiki/Click_fraud). When I looked at it, it really needed some updating. I may carve out some time and edit the page.

Click fraud has been the nemesis of online marketing for the search engines, especially in the content networks. There have been some lawsuits brought against the major search engine companies, both Google and Yahoo! See Tuzhilin's report concerning the Google case (http://googleblog.blogspot.com/pdf/Tuzhilin_Report.pdf) and this news article concerning the Yahoo! case (http://www.imediaconnection.com/content/10294.asp).

Ben Elgin also did a nice Business Week article on click fraud (http://www.cis.upenn.edu/~mkearns/teaching/SponsoredSearch/BizWeek.pdf).

The general consensus seems to be that the click fraud on the search engine results page is not much of a problem. Does it happen? Sure, but it is just a cost of doing business; much like there will be some shoplifting in a brick and mortar store. One just figures it into the day-to-day expenses.

The real problem is with the content networks on both the Google and Yahoo! platforms, as well as the third parties. The easy money is just too much of a temptation for most folks.

However, it would be nice to get an accurate measurement of what the rate of click fraud actually is, both on the search engine results page and on the content networks.

It would be simple to do with the cooperation of the search engine companies. Just set up an experiment.

Basically, get a trusted third party to set up storefronts and accounts with the search engines. Maybe target a low, medium, and high keyword domain. Then, have the trusted third party hack (i.e., execute click fraud on) their own storefronts. Do this for a period of time. At the end of the experiment period, get the logs from the Websites, the search engine advertising accounts, and the logs from the hacker computers. With all three data sets, one can then tell how much click fraud is slipping through the search engine identification systems. This is the key metric: “how much click fraud is occurring that the search engines aren’t catching?” One needs all three data sets to do this.
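As a rough illustration of how the three data sets could be combined, here is a minimal Python sketch. It assumes (hypothetically) that every click carries a shared identifier that shows up in all three logs, and that the advertising account log lists only the clicks the search engine actually billed, that is, the ones it did not filter out as invalid. The function name, the click IDs, and the sample logs are made up for illustration; real logs would need to be matched on timestamps, IP addresses, or similar fields.

```python
def undetected_fraud_rate(attacker_clicks, site_clicks, billed_clicks):
    """Share of known-fraudulent clicks that the search engine still billed.

    attacker_clicks: click IDs generated by the trusted third party's "hacker" machines
    site_clicks:     click IDs recorded in the storefront Website logs
    billed_clicks:   click IDs charged in the search engine advertising account
    (A shared click ID across all three logs is an assumption of this sketch.)
    """
    attacker = set(attacker_clicks)
    # Only fraudulent clicks that actually reached the storefront are counted.
    delivered = attacker & set(site_clicks)
    if not delivered:
        return 0.0
    # Fraudulent clicks that were billed are the ones the filters missed.
    missed = delivered & set(billed_clicks)
    return len(missed) / len(delivered)


# Made-up example logs: c1-c4 are fraudulent, c5-c6 are legitimate visitors.
attacker_log = ["c1", "c2", "c3", "c4"]
site_log = ["c1", "c2", "c3", "c5", "c6"]
billing_log = ["c1", "c5", "c6"]

rate = undetected_fraud_rate(attacker_log, site_log, billing_log)
print(f"Undetected click fraud rate: {rate:.2f}")
```

Running the same calculation separately for the low, medium, and high keyword domains, and for the results page versus the content networks, would show where the filters are weakest.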

I wrote a short article providing an overview of click fraud and defining some key terms. See http://ist.psu.edu/faculty_pages/jjansen/academic/jansen_click_fraud.pdf

Wednesday, July 02, 2008

Classifying Digital Image Searching on the Web

I recently published a paper on Searching for Digital Images on the Web (http://ist.psu.edu/faculty_pages/jjansen/academic/jansen_image_retrieval.pdf). The paper didn’t get a lot of press, but the research results are really interesting.

Basically, I wanted to see if existing image classification schemes (i.e. Enser and McGregor, Jorgensen, and Chen) ‘fit’ with the way Web searchers were looking for images.

The findings showed that these three common image classification schemes didn’t accurately classify the way Web searchers were looking for images.

The findings also showed that Web searchers most commonly look for images of objects and people. Cost is a factor for Web searching (i.e., they want it free). And, there is a lot of searching for image collections.

Another example of how Web searching is different, in some respects, from traditional searching approaches.

Tuesday, July 01, 2008

Search Log Data Again at Center of Legal Case

The use of search log data is once again at the center of a legal case, this time dealing with pornography, specifically to help define what “community standards” are (see http://machinist.salon.com/blog/2008/06/24/orgy_apple_pie/).

Basically, the defense lawyer wanted to subpoena Google for local search statistics, especially pornographic searching topics, submitted by local residents of Pensacola, Florida. From this, the lawyer hoped to make the case of what actually are the local standards.

Of course, the most famous legal case involving the use of search log data was the Justice Department’s use of logs in their efforts to combat child pornography (http://builder-news.com.com/Feds+take+porn+fight+to+Google/2100-1030_3-6028701.html). For some other interesting documents on the Department of Justice vs. Google case, see http://www.cdt.org/security/20060224doj-reply-google.pdf and http://blog.searchenginewatch.com/blog/pdf/Google_NoticeofStarkDeclaration.pdf.

This Florida case is interesting in its use of log data, and I believe we will see more and more of these types of cases in the legal system. Search log data is very specific. Therefore, it can provide insight into a variety of social-level legal issues, such as the community standard one in Florida.

In fact, with the increased use of search engine toolbars and the use of logs for personal information management, I would bet that we will see more use of this data even in individual cases such as divorce, as well as in organizational lawsuits.

Of course, as any researcher that uses search logs will tell you, search logs have methodological limitations in that they are records of what occurred but not records of 'why' some action occurred. Therefore, one should generally use other data sources to validate and provide richness to the results from search logs.

I ran a panel at the 2007 ASIST conference (http://www.asis.org/Conferences/AM07/) on log analysis (http://www.asis.org/Conferences/AM07/panels/24.html) where we touched on some of these issues.