I found a very good article on click fraud written by Dmitri Eroshenko, the CEO of Clicklab. I like the article because it gives both a technical and a business perspective on the subject. He notes that his company has developed a system that assigns weights to certain types of sessions, in much the same way that email spam filters do. I think this is the right approach, and regret that it wasn't possible for me to do something like that at AV, due to a lack of time and resources.
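To make the spam-filter analogy concrete, here is a rough sketch of what weighted session scoring might look like. The feature names, weights, and threshold are all invented for illustration; I have no idea what Clicklab's actual model uses.

```python
# Rough sketch of spam-filter-style session scoring for click fraud.
# Feature names and weights are illustrative, not Clicklab's actual model.

SUSPICION_WEIGHTS = {
    "no_javascript": 0.3,     # client never executed JavaScript
    "zero_dwell_time": 0.5,   # clicked the ad and left immediately
    "repeat_ip_clicks": 0.7,  # many clicks on one ad from a single IP
    "known_proxy": 0.4,       # request came through an anonymizing proxy
    "blank_referrer": 0.2,    # no referrer header
}

FRAUD_THRESHOLD = 1.0  # sessions scoring at or above this are flagged

def score_session(features):
    """Sum the weights of the suspicious traits a session exhibits."""
    return sum(SUSPICION_WEIGHTS[f] for f in features if f in SUSPICION_WEIGHTS)

def is_fraudulent(features):
    return score_session(features) >= FRAUD_THRESHOLD

# A session combining several suspicious traits crosses the threshold:
print(is_fraudulent({"zero_dwell_time", "repeat_ip_clicks"}))  # True
print(is_fraudulent({"blank_referrer"}))                       # False
```

The appeal of this design, as with spam filters, is that no single signal has to be conclusive; weights can be tuned as fraudsters adapt, without rewriting the detection logic.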
I also found an open-source project called hypKNOWsys that attempts to identify certain types of usage patterns in user sessions. Their code probably would not scale to a site with heavy traffic volume, such as a search engine. They also rely primarily on page identification via the Apache log format, which is cumbersome and requires extra processing to determine whether two pages are identical. I think the gladiator approach of identifying pages by numerical codes corresponding to the domain name (siteid) and rendered page (pageid) was much more efficient. It's unfortunate the implementation was not as good as the ideas. I sometimes wonder if we should have continued to use Apache but modified the page-serving code to emit numeric codes. I suppose the gladiator team, having had experience with Resin and Java, felt more comfortable with those, but I think they underestimated the amount and nature of AV traffic.
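The siteid/pageid idea is easiest to see side by side with URL comparison. The sketch below is mine, not gladiator's code, and the lookup tables and URLs are made up; the point is just that once pages carry numeric codes, "same page?" becomes an integer comparison instead of URL normalization.

```python
# Illustrates identifying pages by numeric (siteid, pageid) codes rather
# than comparing raw URLs from Apache-style logs. Mappings are invented.

from urllib.parse import urlsplit

SITE_IDS = {"www.example.com": 7}
PAGE_IDS = {("www.example.com", "/results"): 42}

def page_code(url):
    """Map a logged URL to a compact (siteid, pageid) pair.

    With raw Apache logs you must normalize each URL (strip query
    strings, session tokens, trailing slashes, ...) before you can tell
    whether two hits are the same page; with numeric codes, equality is
    a cheap integer comparison.
    """
    parts = urlsplit(url)
    host = parts.hostname
    path = parts.path.rstrip("/") or "/"
    return SITE_IDS.get(host), PAGE_IDS.get((host, path))

# Two superficially different URLs resolve to the same (siteid, pageid):
a = page_code("http://www.example.com/results?q=foo&sid=abc123")
b = page_code("http://www.example.com/results/?q=bar")
print(a == b)  # True
```

In a real system the server would stamp the codes into the log at page-serving time, as suggested above, so the log analyzer never has to parse URLs at all.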