Learning from Server Logs

Chapter 7: Learning from Server Logs

by Philip Greenspun, part of Database-backed Web Sites

Note: this chapter has been superseded by its equivalent in the new edition

What do you want to know? Figure that out and then affirmatively devise a logging strategy. If your goal is to fill up your hard disk, then any Web server program will do that quite nicely with its default logs. However, this information might not be what you want or need.

Here are some examples of starting points:

"I want to know how many users requested non-existent files and where they got the bad file names."
"I want to know how many people are looking at Chapter 3 of http://webtravel.org/samantha/."
"I want to know how long the average reader of Chapter 3 spends before moving on to Chapter 4."
"I sold a banner ad to Sally's Sad Saab Shop. I want to know how many people clicked on the banner and went over to her site."
"I want to know the age, sex, and zip code of every person who visited my site so that I can prepare a brochure for advertisers."

Let's take these one at a time.

Case Studies

"I want to know how many users requested non-existent files and where they got the bad file names."

You can configure your Web server program to log every access by writing a line into a Unix file system file (I've broken it up here into multiple lines for readability, but it is actually written with no newlines):

ip248.providence.ri.pub-ip.psi.net - - [28/Jan/1997:12:35:54 -0500] 
"GET /sammantha/travels-withsammantha.html HTTP/1.0" 
404 170 
- 
"Mozilla/3.0 (Macintosh; I; 68K)"

The first field is the name of the client machine. It looks like someone who connects via PSI in Providence, Rhode Island. My AOLserver has written the date and time of the connection and then the request line that the user's browser sent: "GET /sammantha/travels-withsammantha.html HTTP/1.0". This says "get me the file named /sammantha/travels-withsammantha.html and return it to me via the HTTP/1.0 protocol." This is close to /samantha/travels-with-samantha.html but not close enough for the Unix file system, which tells the AOLserver that it can't find the file. The AOLserver returns a 404 File Not Found message to the user. We see the 404 status code in the log file and then the number of bytes sent (170). The dash after the 170 normally contains the value of the "referer header" (yes, it is misspelled in the standard). In this case that field is empty, meaning either that the user has typed in the URL directly from "Open Location" or that the user's browser did not supply a referer header, indicating the page from which the user clicked. I instructed my AOLserver to log the user-agent header so I know that this person is using Netscape (Mozilla) 3.0 on the Macintosh. Netscape 3.0 definitely does supply the referer header. So unless I can drive down to Providence and teach this user how to spell, we're both out of luck.

Moving on to the next 404 . . .

hd07-097.compuserve.com - - [28/Jan/1997:12:42:53 -0500] 
"GET /photo/canon-70-200.html HTTP/1.0" 404 170 http://www.cmpsolv.com/photozone/easy.htm
"Mozilla/2.0 (compatible; MSIE 2.1;  Windows 3.1)"

Here's a user from CompuServe. This person is asking for my review of the Canon EOS 70-200/2.8L lens (a delicious $1,500 piece of glass) and gets the same 404 and 170 bytes. But this user's referer header was "http://www.cmpsolv.com/photozone/easy.htm". There is a link from this page to a non-existent file on my server. Does that mean the photozone folks at cmpsolv.com are losers? No, it means I'm a loser.

I didn't think carefully enough about my file system organization. The file used to be at /photo/canon-70-200.html but then I created a canon subdirectory and moved this review into it. So now a correct request would be for /photo/canon/canon-70-200.html.

What should I do about this? I could try to find the maintainers of photozone and send them an e-mail message asking them to update their link to me. Creating extra work for them because of my incompetence-that seems like a nice way to pay them back for doing me a favor by linking to me in the first place. Alternatively, I could reconfigure my AOLserver to redirect requests for the old file name to the new file. I have already installed quite a few of these redirects, a testament to my inability to learn from experience. Finally, I could be relatively user-unfriendly and put an HTML file at the old location saying "please click to the new location." That's not really any less trouble than installing the redirect, though, so there wouldn't much point to doing it unless I was using someone else's Web server where I wasn't able to install redirects.

Note: An interesting side note about this server log entry is the user-agent header: "Mozilla/2.0 (compatible; MSIE 2.1; Windows 3.1)". The first part says "I'm Netscape 2.0." The second part says "I'm Microsoft Internet Explorer 2.1." A couple of years ago, Web publishers with too much time and money to spend programmed their services to deliver a frames-based site to Netscape 2.0 Achievers and a non-frames site to other browsers. The CGI or API scripts made the decision of which site to display based on whether the user-agent header contained the string "Mozilla/2.0." Microsoft, anxious that its users not be denied the wondrous user interface experience of frames, programmed Internet Explorer to pretend to be Netscape 2.0 so that publishers wouldn't have to rewrite their code.

"I want to know how many people are looking at Chapter 3 of http://webtravel.org/samantha/."

My answer here could be adapted from an article in Duh magazine: Search the server log for "GET /samantha/samantha-III.html". Here's a typical log entry:

ld20-147.compuserve.com - - [30/Jan/1997:18:28:50 -0500] 
"GET /samantha/samantha-III.html HTTP/1.0" 200 17298 
http://www-swiss.ai.mit.edu/samantha/samantha-II.html
"Mozilla/2.01E-CIS (Win16; I)"

The host name tells us that this person is a CompuServe user. The document requested was Chapter 3 and it was successfully delivered (status code of 200; 17298 bytes served). The referer header is "samantha-II.html", meaning that this reader was reading Chapter II and then clicked on "next chapter." Finally, we learn that the reader is running Netscape 2.01 on a Windows 3.1 box.

What are the subtleties here? First, the user might be coming through a caching proxy server. America Online, for example, doesn't let most of its users talk directly to your Web server. Why not? For starters, AOL doesn't use Internet protocols so their users don't have software that understands TCP/IP or HTTP. Even if their users had Internet software, AOL only has a limited connection to the rest of the world. When 100 of their users request the same page, say, http://www.playboy.com, at around the same time, AOL would rather than only one copy of the page be dragged through their Internet pipe. So all of their users talk first to the proxy server. If the proxy has downloaded the page from the Internet recently, the cached copy is delivered to the AOL user. If there is no copy in the cache, the proxy server requests the page from your Web server and finally passes it on to the AOL customer.

A lot of companies require proxy connections for reasons of security. This was already an issue in pre-Java days but Java provides the simplest illustration of the security problem. A badly implemented Java-enhanced browser will permit an applet to read files on a user's computer and surreptitiously upload them to a foreign Web server. At most companies, the user's computer has the authority to read files from all over the internal company network. So one downloaded Java applet could potentially export all of a company's private data. On the other hand, if the company uses a firewall to force proxy connections, it can enforce a "no Java applet" policy. Computers with access to private information are never talking directly to foreign computers. Internal computers talk to the proxy server and the proxy server talks to foreign computers. If the foreign computer manages to attack the proxy server, that may interfere with Web browsing for employees but it won't compromise any internal data since the proxy server is outside of the company's private network.

Company security proxies distort Web server stats just as badly as AOL's private protocol-Internet bridge proxies. In my server's early days, there was one computer at Digital that downloaded about 250 .html files a day. I thought, "Wow, this guy at DEC must be my biggest fan; I really must find out whose machine this is." I eventually did find out. The computer was not sitting on a bored engineer's desktop; it was the proxy server for the entire corporation.

If proxy servers result in statistical understatements of traffic, user behavior can result in overstatements. Suppose a user with a flaky connection is trying to read Chapter 3 of Travels with Samantha and two of the 18 in-line images don't load properly. The user clicks "reload" in hopes of getting a fully illustrated page, adding 19 spurious hits to the server log (one for the .html file and 18 for the in-lines).

These statistical inaccuracies troubled me until I realized, "Hey, I'm not launching the Space Shuttle here." On average, more downloads equals more readers. The number of people reading Chapter 3 is pretty well correlated with the number of "GET /samantha/samantha-III.html" requests. I'll just collect that number and be happy.

Suppose you aren't as easy-going as I; what can you do to get more accurate data? See the next section.

"I want to know how long the average reader of Chapter 3 spends before moving on to Chapter 4."

I remember when my Web site was new, back in the winter of 1993-94. Every day or two I'd load the whole HTTPD access log into Emacs and lovingly read through the latest lines, inferring from the host name which of my friends it was, tracing users' paths through Travels with Samantha, seeing where they gave up for the day.

Lately, my server gets 25 hits a second during peak hours. Emacs is happy to display a 25MB log file so there is no reason why volume per se should keep me from doing "visual thread analysis." A big problem, though, is that these hits are coming from dozens of simultaneous users whose threads are intertwined. Worse yet, I don't see the readable host names that I've printed in this book. A Web server only gets IP addresses which are 32-bit integers. You have to explicitly do a "reverse Domain Name System (DNS)" lookup to turn an IP address into a readable name, for example "129.34.139.30" turns into "ibm.com". Reverse DNS consumes network and CPU resources and can lead to the server process hanging. My physical server is a pathetic old Unix box so I turned it off. My even more pathetic and older brain is then completely unable to untangle the threads and figure out which users are getting which files in which sequence.

One approach to tracking an individual reader's surfing is to reprogram your Web server to issue a magic cookie to every user of your site. Every time a user requests a page, your server will check to see if a cookie header has been sent by his browser. If not, your server program will generate a unique ID and return the requested page with a Set-Cookie header. The next time the user's browser requests a page from your server, it will set the cookie header so that your server program can log "user with browser ID #478132 requested /samantha/samantha-III.html."

This gives you a very accurate count of the number of users on your Web site and it is easy to write a program to grind over the server log and print out actual user click streams.

Problems with this approach? Not all browsers support the Netscape Magic Cookie protocol (introduced with Netscape 1.0; see http://webtools.com/wtr/ for a link to the spec). AOL's proprietary browser has been the perennial exception. And some users set their browsers to warn them before setting cookies. If they reject the cookie that you try to set, their browser will never give it back to your server program. So you keep issuing cookies to users unable or unwilling to accept them. If such a user requests 50 documents from your server, casually implemented reporting software will see him as 50 distinct users requesting one document each.

"I sold a banner ad to Sally's Sad Saab Shop. I want to know how many people clicked on the banner and went over to her site."

The number of click-throughs is information that is contained only in Sally's server log. She can grind through her server log and look for people who requested "/index.html" with a referer header of "http://yoursite.com/page-with-banner-ad.html". Suppose your arrangement with Sally is that she pays you ten cents per click-through. And further suppose that she has been hanging around with Internet Entrepreneurs and has absorbed their philosophy. Here's how your monthly conversation would go:

You: How many click-throughs last month, Sally?

Sally: Seven.

You: Are you sure? I had 28,000 hits on the page with the banner ad.

Sally: I'm sure. We're sending you a check for 70 cents.

You: Can I see your server logs?

Sally: Those are proprietary!

You: I think something may be wrong with your reporting software; I'd like to check.

Sally [sotto voce to her sysadmin]: "How long would it take to write a Perl script to strip out all but seven of the referrals from this loser's site? An hour? OK."

Sally [to you]: "I'll put the log on my anonymous FTP site in about an hour."

Of course, Sally doesn't have to be evil-minded to deny you this information or deliver it in a corrupted form. Her ISP may be running an ancient Web server program that doesn't log referer headers. Some of your readers may be using browsers that don't supply referer headers. Sally may lack the competence to analyze her server log in any way.

What you need to do is stop linking directly to Sally. Link instead to a "click-through server" that will immediately redirect the user to Sally's site but keep a thread alive to log the click-through. If you have a low-tech site, your click-through script could dump an entry to a Unix file. Alternatively, have the thread establish a connection to a relational database and record the click-through there.

What if you have a really low-tech site? You are hosted by an ISP that doesn't know how to spell "relational database." Fill out a form on my click-through server and establish a realm for your site. My server will log click-throughs and prepare reports for you. See Figure 7-1 for the architecture of this system.

*** wouldn't it be nice if I had the time to convert these figures? Anyway, if you buy a real dead trees copy then you'll get them. ****

Figure 7-1: My click-through monitoring service, allows low-tech Web publishers to measure traffic going from their site to foreign sites. This can be useful if you are selling banner ads. Instead of linking directly to http://their.com, you link to http://clickthrough.net/your?send_to=http://their.com. A user clicking through will first make a request of my click-through server. The request is logged in a relational database and then an HTTP 302 redirect is returned to the user's browser, causing it to make a final request from http://their.com.

In addition to being the author of the click-through server software, I also use the software to collect statistics on my personal site. Here's a portion of one of my report pages:

http://harmoniamundi.com/ (from materialism/stereo.html) : 1 
http://nz.com/webnz/flying_kiwi/ (from nz/nz-mtn-bike.html) : 10 
http://nz.com/webnz/flying_kiwi/ (from nz/iwannago.html) : 2 
http://nz.com/webnz/flying_kiwi/ (from nz/wellington-to-milford.html) : 8 
http://photoarts.com/banning/gallery/evans.html (from photo/walker-evans.html) : 2 
http://www.acura.com/ (from philg/cars/nsx.html) : 50 
http://www.adiweb.com/ (from photo/credits.html) : 1 
http://www.adiweb.com/ (from webtravel/vietnam/) : 3 
http://www.adiweb.com/ (from photo/speed-graphic.html) : 8 
http://www.adiweb.com/ (from photo/labs.html) : 27 
http://www.alanet.com/ (from summer94/french-quarter.html) : 5 
http://www.alanet.com/ (from summer94/new-orleans-zoo.html) : 13 
http://www.architext.com/ (from photo/credits.html) : 2 
http://www.audioadvisor.com/ (from materialism/stereo.html) : 1 
http://www.bhphotovideo.com/ (from photo/edscott/spectsel.htm) : 5 
http://www.bhphotovideo.com/ (from photo/where-to-buy.html) : 386 
http://www.bhphotovideo.com/ (from photo/nature/atkins-primer.html) : 10 
http://www.bhphotovideo.com/ (from photo/labs.html) : 5 
http://www.bostonphoto.com/ (from photo/labs.html) : 40 
http://www.bostonphoto.com/ (from photo/travel/foliage.html) : 16 
http://www.bostonphoto.com/ (from photo/credits.html) : 1 
http://www.calphalon.com/ (from materialism/kitchen.html) : 2 
http://www.canon.com/ (from photo/canon/canon-reviews.html) : 47 
http://www.chesky.com/chesky/ (from materialism/stereo.html) : 2 
http://www.cool.co.cr/crexped.html (from cr/central-valley.html) : 29 
http://www.cool.co.cr/crexped.html (from cr/tour-operators.html) : 37 
http://www.cool.co.cr/crexped.html (from cr/baru.html) : 5 
http://www.cool.co.cr/crexped.html (from cr/internet-resources.html) : 2 
http://www.cool.co.cr/crexped.html (from cr/corcovado-tent-camp.html) : 7 
http://www.cool.co.cr/crexped.html (from cr/tortuga-lodge.html) : 9 
http://www.cool.co.cr/crexped.html (from cr/monteverde-lodge.html) : 12 
http://www.cool.co.cr/toping/tur/faq.html (from cr/index.html) : 13 
http://www.cris.com/~tnv2001/ (from materialism/stereo.html) : 3 
http://www.cuisinart.com/ (from materialism/kitchen.html) : 2 
http://www.dacorappl.com/ (from materialism/kitchen.html) : 3 
http://www.fujifilm.co.jp/usa/aps/smartcity/f-better.html (from photo/aps.html) : 10 
http://www.ge.com/appliances/ge_profile_products/jgbp79wevww.htm (from materialism/kitchen.html) : 1 
http://www.goodnet.com/~rmnsx (from philg/cars/nsx.html) : 45 
http://www.hasselblad.com/ (from photo/rollei-6008.html) : 11 
http://www.hearstnewmedia.com/ (from photo/credits.html) : 2 
http://www.hp.com/ (from photo/credits.html) : 1 
http://www.intel.com/ (from photo/credits.html) : 1 
http://www.keh.com/ (from photo/credits.html) : 3 
http://www.keh.com/ (from photo/alex.html) : 7 
http://www.keh.com/ (from photo/where-to-buy.html) : 137 
http://www.klt.co.jp/nikon/ (from photo/nikon/nikon-reviews.html) : 67 
http://www.kodak.com:80/ciHome/APS/APS.shtml (from photo/aps.html) : 22 
http://www.kyocera.com/kai/input.html (from materialism/kitchen.html) : 3 
http://www.lcs.mit.edu/ (from photo/credits.html) : 1 
http://www.lightroom.com/ (from photo/labs.html) : 11 
http://www.lightroom.com/masking.html (from photo/labs.html) : 9 
http://www.monstercable.com/ (from materialism/stereo.html) : 1 
http://www.moon.com/ (from cr/moon/index.html) : 27 
http://www.mpex.com/ (from photo/where-to-buy.html) : 59 
http://www.novanet.co.cr/milvia/index.html (from cr/milvia.html) : 10 
http://www.p-c-d.com/ (from materialism/kitchen.html) : 4 
http://www.portphoto.com (from samantha/gift-shop.html) : 11 
http://www.portphoto.com (from photo/labs.html) : 23 
http://www.portphoto.com/ (from photo/labs.html) : 1 
http://www.pricehunter.com/interpro/index.html (from photo/where-to-buy.html) : 111 
http://www.sheffieldlab.com/ (from materialism/stereo.html) : 2 
http://www.theabsolutesound.com/ (from materialism/stereo.html) : 1 
http://www.thermador.com/ (from materialism/kitchen.html) : 2 
http://www.vanguard.com/ (from materialism/money.html) : 3 
http://www.wweb.com/cayman (from webtravel/cayman.html) : 50 
http://www.zzyzxworld.com/ (from photo/credits.html) : 1 
http://www.zzyzxworld.com/ (from photo/labs.html) : 11

Take a look at the second line. It shows that ten people clicked from my New Zealand mountain biking story to http://nz.com/webnz/flying_kiwi/ (a tour company's home page). About 15 lines down, we see that 386 people were referred to http://www.bhphotovideo.com/ from my photo.net magazine's "where to buy" page. Hmm . . . That's getting to be an interesting number. What if we click on it?

February 03, 1997 : 34 
February 02, 1997 : 41 
February 01, 1997 : 45 
January 31, 1997 : 44 
January 30, 1997 : 42 
January 29, 1997 : 54 
January 28, 1997 : 60 
January 27, 1997 : 66

Hmmm . . . 45 people a day, people who were reading my reviews of cameras in http://photo.net/photo, then decided to click on "where to buy," and then decided to click on my link to B&H Photo's home page. I've always liked B&H Photo, but my evil twin wonders how much another camera shop would pay us to make just a few changes to the text of http://photo.net/photo/where-to-buy.html.

Anyway, not to bore you too thoroughly by walking you statically through a dynamic site, but it seems like a good time to show off the advantages of using an RDMBS for this. The click-through server can slice and dice the reports in all kinds of interesting ways. Suppose I want to lump together all referrals from my personal site to B&H Photo regardless of which page they are from. Click:

February 03, 1997 : 38 
February 02, 1997 : 42 
February 01, 1997 : 49 
January 31, 1997 : 48 
January 30, 1997 : 45 
January 29, 1997 : 55 
January 28, 1997 : 61 
January 27, 1997 : 68

What about that day when there were 48 click-throughs? Where did they come from? Oh, "48" seems to be a hyperlink. Let me click on it:

from photo/edscott/spectsel.htm : 1 
from photo/labs.html : 1 
from photo/nature/atkins-primer.html : 2 
from photo/where-to-buy.html : 45

Hmmm . . . This "where to buy" page seems to be crushing the competition as far as links out. Can I see a report of all the click-throughs from this page to others? February 03, 1997 : 60 February 02, 1997 : 71 February 01, 1997 : 94 January 31, 1997 : 81 January 30, 1997 : 76 January 29, 1997 : 100 January 28, 1997 : 101 January 27, 1997 : 111

What about February 1? Where did those 94 click-throughs go?

to http://www.bhphotovideo.com/ : 45 
to http://www.keh.com/ : 22 
to http://www.mpex.com/ : 12 
to http://www.pricehunter.com/interpro/index.html : 15

Oooh! I'm in RDBMS heaven now. And all I had to do was

fill out a Web form to establish a realm on the click-through server
replace things like "http://www.bhphotovideo.com" with things like "http://clickthrough.photo.net/ct/philg/photo/where-to-buy.html?send_to=http://www.bhphotovideo.com/" in my static .html files

If you want to get these reports for your own Web site, just visit http://webtools.com/wtr to get started.

"I want to know the age, sex, and zip code of every person who visited my site so that I can prepare a brochure for advertisers."

The traditional answer to this request is "all you can get is IP address; HTTP is an anonymous peer-to-peer protocol." Then Netscape came out with the Magic Cookie protocol in 1994. It looked pretty innocent to me. The server gives me a cookie. My browser gives it back to the server. Now I can have a shopping basket. My friends all said, "This is the end of privacy on the Internet, Greenspun, and you're a pinhead if you can't figure out why."

So I thought about it for a while. Then I started adding some code to my click-through server.

Suppose I add an invisible GIF to my photo.net page:

<img width=1 height=1 border=0 
src="http://clickthrough.photo.net/blank/philg/photo/index.html">

This is a coded reference to my click-through server. The first part of the URL, "blank", tells the click-through server to deliver a 1-pixel blank GIF. The second part, "philg", says "this is for the philg realm, whose base URL is http://photo.net/". The last part is a URL stub that specifies where on the philg server this blank GIF is appearing.

[Note: the above URL looks a little confusing unless you have read Chapter 9 and are familiar with quality Web server program like the AOLserver that allow a publisher to register a whole range of URLs, e.g., those starting with "blank", to be passed to a Tcl program. So this reference looks like it is grabbing a static .html page but actually it is running a program.]

Suppose that http://photo.net/photo/index.html is the first page that Joe User has ever requested with one of these GIFs from clickthrough.photo.net. In that case, his browser won't offer a cookie to clickthrough.photo.net. My program sees the cookie-less request and says, "Ah, new user, let's issue him a new browser_id and log this request with his IP address and user-agent header." Suppose Joe is the sixth user that clickthrough.photo.net has ever seen. My program then issues a

Set-Cookie:  ClickthroughNet=6; path=/;
expires=Fri, 01-Jan-2010 01:00:00 GMT

This code tells Joe's browser to return the string "ClickthroughNet=6" in the cookie header every time it requests any URL from clickthrough.photo.net (that's the "path=/" part). This cookie would normally expire when Joe terminated his browser session. However, I'd really like to track Joe for a while so I explicitly set the expiration date to January 1, 2010. I could have made it last longer, but I figured that by 2010 Joe ought to have abandoned all of his quaint notions about privacy and will be submitting his name, address, home phone number, and VISA card number with every HTTP GET.

Every time Joe comes back to http://photo.net/photo, his browser will see the IMG reference to the click-through server again. Normally, his browser would say, "Oh, that's a GIF that I cached two days ago so I won't bother to rerequest it." However, I wrote my program to include a "Pragma: no-cache" header before the blank GIF. This instructs proxy servers and browser programs not to cache the reference. They aren't required to obey this instruction, but most do.

So Joe's browser will request the blank GIF again. This time, though, his browser will include a cookie header with his browser ID so my click-through server can just return the blank GIF and then keep a thread alive to log the access.

Now I can ask questions like "What are all the times that the Netscape with browser_id 6 requested tagged pages from my server?" and "What percentage of users return to http://photo.net/photo more than twice a week?"

To make life a little more interesting, suppose I add a little bit of code to http://www.webho.com/WealthClock (Bill Gates Personal Wealth Clock):

<img width=1 height=1 border=0 
src="http://clickthrough.photo.net/blank/webho/WealthClock">

Note that www.webho.com is a different server from photo.net. If photo.net had issued Joe User's browser a cookie, his browser would not offer that cookie up to www.webho.com. But photo.net did not issue Joe a cookie; clickthrough.photo.net did. And that is the same server being referenced by the in-line IMG on the Wealth Clock. So my click-through server will be apprised of the access (see Figure 12-2).

Figure 12-2: Magic cookies mean the end of privacy on the Internet. Suppose that three publishers cooperate and agree to serve all of their banner ads from http://noprivacy.com. When Joe User visits search-engine.com and types in "acne cream", the page comes back with an IMG referencing noprivacy.com. Joe's browser will automatically visit noprivacy.com and ask for "the GIF for SE9734". If this is Joe's first time using any of these three cooperating services, noprivacy.com will issue a Set-Cookie header to Joe's browser. Meanwhile, search-engine.com sends a message to noprivacy.com saying "SE9734 was a request for acne cream pages." The "acne cream" string gets stored in noprivacy.com's database along with "browser_id 7586." When Joe visits bigmagazine.com, he is forced to register and give his name, e-mail address, Snail mail address, and credit card number. There are no ads in bigmagazine.com. They have too much integrity for that. So they include in their pages an IMG referencing a blank GIF at noprivacy.com. Joe's browser requests "the blank GIF for BM17377" and, because it is talking to noprivacy.com, the site that issued the Set-Cookie header, includes a cookie header saying "I'm browser_id 7586." When all is said and done, the noprivacy.com folks know Joe User's name, his interests, and the fact that he has downloaded 6 spanking JPEGs from kiddieporn.com.

Finally, I added an extra few lines of code to my click-through stats collector. IF there was a browser_id AND detailed logging was enabled, THEN also write a log entry for the click-through.

After all of this evil work is done, what do we get?

Realm where originally logged: philg 
original IP address: 18.23.10.101 
browser used initially: Mozilla/3.01 (WinNT; I) 
email address: 
CLICK STREAM
1997-01-30 01:44:36 Page View: philg/photo/index.html 
1997-01-30 01:46:11 Page View: philg/photo/where-to-buy.html 
1997-01-30 01:46:17 Clickthrough from text ref: philg/photo/where-to-buy.html to http://www.bhphotovideo.com/ 
1997-01-30 02:30:46 Page View: webho/WealthClock 
1997-01-31 13:13:17 Page View: webho/WealthClock 
1997-02-01 08:04:15 Page View: philg/photo/index.html 
1997-02-01 18:33:17 Page View: philg/photo/index.html 
1997-02-03 12:46:18 Page View: philg/photo/where-to-buy.html 
1997-02-03 14:53:56 Page View: webho/WealthClock

We know that this guy was originally logged at 18.23.10.101 (my home computer) and that he is using Netscape 3.01 on Windows NT. We don't yet know his e-mail address, but only because he hasn't yet visited a guestbook page served by clickthrough.photo.net.

Then there is the click stream. We know that he downloaded the photo.net home page at 1:44 a.m. on January 30, 1997. Two minutes later, he downloaded the "where to buy" page. Six seconds later, he clicked through to B&H Photo. Forty-five minutes later, he showed up on another server (the webho realm) viewing the Wealth Clock. The next day at 1:30 p.m., this guy checks the Wealth Clock again. On February 1, 1997, he visits photo.net at 8:04 a.m. and then again at 6:33 p.m. He's back on the "where to buy" page on February 3. Two hours after that, he's checking the Wealth Clock once more . . .

If I get enough Web sites to cooperate in using one click-through server and even one of those sites requires registration, offers a contest, or does anything else where users type in names and e-mail addresses, it is only a matter of time before I can associate browser_id 6 with "philg@mit.edu; Philip Greenspun; 5 Irving Terrace, Cambridge, MA 02138."

Of course, I don't have to use this information for evil. I can use it to offer users a page of "new documents since your last visit." Suppose someone comes to the photo.net home page for the fourth time. I find that he has looked at my travel page but not read Travels with Samantha. I probably ought to serve him a banner that says "You might like Travels with Samantha; click here to read it."

Does this all sound too futuristic and sinister to be really happening? Have a look at your browser's cookies file. With Netscape Navigator, you'll find this as "cookies.txt" in the directory where you installed it. With Internet Explorer, you can find one file/cookie by doing View -> Options -> Advanced -> View (Temporary) Files. See if there is an entry that looks like this:

ad.doubleclick.net FALSE / FALSE 942191940 IAF 248bf21

Then go to http://www.doubleclick.net/ and see the long list of companies (including AltaVista) that are sharing this ad server so that your activity can be tracked. Of course, Double Click assures everyone that your privacy is assured.

Case Studies Conclusions

Here are the conclusions that we can draw from these case studies:

Your readers have no privacy and haven't had any ever since late 1994 when the Netscape Magic Cookie protocol came out.
Vital information for most Web publishers, such as number of click-throughs, is unobtainable from standard server logs and traditional linking practices.
With a little bit of RDBMS programming or a visit to http://webtools.com/wtr/, you're on your way to collecting the information that you need.

Let's Back Up for a Minute

Suppose that the preceding talk about click-throughs and cookies has overloaded your brain. You don't want to spend the rest of your life programming Tcl and SQL. You don't even want to come to http://webtools.com/wtr and fill out a form. You just want to analyze the server logs that you've already got.

Is that worth doing?

Well, sure. As discussed in the first case above, you certainly want to find out which of your URLs are coughing up errors. If you have hundreds of thousands of hits per day, casual inspection of your logs isn't going to reveal the 404 File Not Found errors that make users miserable. This is especially true if your Web server program logs errors and successful hits into the same file.

You can also use the logs to refine content. My very first log summary revealed that half of my visitors were just looking at the slide show for Travels with Samantha. Did that mean they thought my writing sucked? Well, maybe, but it actually looked like my talents as a hypertext designer were lame. The slide show was the very first link on the page. Users had to scroll way down past a bunch of photos to get to the Chapter 1 link. I reshuffled the links and traffic on the slide show fell to 10 percent.

You can also discover "hidden sites." You might have read Dave Siegel's book and spent $20,000 designing http://yourdomain.com/entry-tunnel.html. But somehow the rest of the world has discovered http://yourdomain.com/old-text-site.html and is linking directly to that. You're getting 300 requests a day for the old page, whose information is badly organized and out of date. That makes it a hidden site. You'd ceased spending any time or money maintaining it because you thought there weren't any users. You probably want to either bring the site up to date or add a redirect to your server to bounce these guys to the new URL.

Finally, once your site gets sufficiently popular, you will probably turn off host name lookup. As mentioned above, Unix named is slow and sometimes causes odd server hangs. Anyway, after you turn lookup off, your log will be filled up with just plain IP addresses. You can use a separate machine to do the nslookups offline and at least figure out whether your users are foreign, domestic, internal, or what.

Enter the Log Analyzer

The first piece of Web "technology" that publishers acquire is the Web server program. The second piece is often a log analyzer program. Venture capitalists demonstrated their keen grasp of technology futures by funding at least a dozen companies to write and sell commercial log analyzer programs. This might have been a great strategy if the information of importance to Web publishers were present in the server log to begin with. Or if there weren't a bunch of more reliable freeware programs available. Or if companies like Netscape hadn't bundled log analyzers into their Web server programs.

Anyway, be thankful that you don't have money invested in any of these venture funds and that you have plenty of log analyzer programs from which to choose. These programs can be categorized along two dimensions:

Source code availability
Stand-alone or substrate-based

Whether or not the source code is available is extremely important in a new field like Web publishing. As with Web server programs, software authors can't anticipate your needs or the evolution of Web standards. If you don't have the source code, you are probably going to be screwed in the long run. Generally the free public domain packages come with source code and the commercial packages don't.

A substrate-based log analyzer makes use of a well-known and proven system to do storage management and sometimes more. Examples of popular substrates for log analyzers are Perl and relational databases. A stand-alone log analyzer is one that tries to do everything by itself. Usually these programs are written in primitive programming languages like C and do storage management in an ad hoc manner. This leads to complex source code that you might not want to tangle with and ultimately core dumps on logs of moderate size.

Here's my experience with a few programs . . .

wwwstat

This is an ancient public-domain Perl script, available for download and editing from http://www.ics.uci.edu/WebSoft/wwwstat/. I found that it doesn't work very well on my sites for the following reasons:

There are at least three URLs to get to many of my pages: http://webtravel.org/~philg/samantha/travels-with-samantha.html was what I used initially. Then I discovered the "index.html" religion so http://webtravel.org/~philg/samantha/ is another gateway to the same page. Then I got a little symlink-happy and made http://webtravel.org/samantha/ work. But I also made http://webtravel.org/philg/samantha/ work too. So there are six URLs for the same file. They are all reported separately by wwwstat and there was no easy way to group them together. I think the latest version is beginning to have grouping capabilities.
wwwstat doesn't count "distinct hosts" like a lot of other tools. In the era of proxy servers, counting distinct hosts is a mighty crude way to gauge number of users, but it is better than nothing.
wwwstat doesn't understand the extra information that modern Web servers log, such as browser and referer. These extra items don't interfere with wwwstat, but you don't get a "90% using Netscape" report either.
wwwstat had no built-in facility for doing host name lookup, though it now does.

I've fed wwwstat 50MB and larger log files without once seeing it fail. There is a companion tool called gwstat that makes pretty graphs from the wwwstat output. It is free but you have to be something of a Unix wizard to make it work.

There are a lot of newer public domain tools than wwwstat listed in http://www.yahoo.com/Computers_and_Internet/Software/Internet/World_Wide_Web/Servers/Log_Analysis_Tools/. A lot of wizards seem to like analog (referenced from Yahoo), but I haven't tried it.

WebReporter

This is a stand-alone commercial product, written in C and sold for $500 by OpenMarket. It took me two solid days of reading the manual and playing around with the tail of a log file to figure out how to tell WebReporter 1.0 what I wanted. For a brief time, I was in love with the program. It would let me lump certain URLs together and print nice reports saying "Travels with Samantha cover page." I fell out of love when I realized that

If you want cumulative reports, you have to freeze your groups when you start accumulating data. That makes WebReporter a great tool for sites where files are never added, subtracted, or moved. Yeah.
It dumped core trying to show ASCII histograms of activity from a puny little log file (just an hour or two's worth of data for a popular site).

When I complained about the core dumps, they said, "Oh yes, we might have a fix for that in the next release. Just wait four months." So I waited and let some friends at a commercial site beta test the new release. How do you like it? I asked. They responded quickly: "It dumps core."

More on OpenMarket

An interesting aside to this experience with the software was an opportunity to see how OpenMarket is making Internet commerce a reality:

Their demo site for the new version was down. So I couldn't try the software on their machines.
When I asked for a new copy of the software to fix the core dumping problem, they said I'd have to pay for it. There was apparently no warranty on the broken version I'd bought a couple of months earlier.
Several e-mail messages asking how much the new copy would cost went unanswered. Finally, I was told to telephone (yes, telephone) a particular saleswoman for a quote.
I managed to get the saleswoman's e-mail address and sent her mail. She never answered.

That's when it occurred to me that I'd managed to serve several billion hits with software from Netscape and NaviSoft (AOLserver) without ever having to telephone either company.

My experience with WebReporter has made me wary of stand-alone commercial products in general. Cumulative log data may actually be important to you. Why do you want to store it in a proprietary format accessible only to a C program for which you do not have source code? What guarantee do you have that the people who made the program will keep it up to date? Or even stay in business?

Relational Database-backed Tools

What are the characteristics of our problem anyway? Here are some obvious ones:

We need to maintain a data set over many years.
We can't know now what kinds of queries we're going to do into this data set.
We may end up with many gigabytes of data.
We don't trust one vendor to serve all of our needs. We can't afford to lose access to our data if an application code vendor folds.

Do these sound like the problems that IBM thought they were solving in the early 1970s with the relational model? Call me an Oracle whore but it seems apparent that the correct tool is a relational database management system.

So brilliant and original was my thinking on this subject that the net.Genesis guys (http://www.netgenesis.com/) apparently had the idea a long time ago. They make a product called net.Analysis that purports to stuff server logs into a (bundled) Informix RDBMS in real time.

Probably this is the best of all possible worlds. You do not surrender control of your data. With a database-backed system, the data model is exposed. If you want to do a custom query or the monolithic program dumps core, you don't have to wait four months for the new version. Just go into SQL*PLUS and type your query. Any of these little Web companies might fold and/or decide that they've got better things to do than write log analyzers, but Oracle, Informix, and Sybase will be around. Furthermore, SQL is standard enough that you can always dump your data out of Brand O into Brand I or vice versa.

More importantly, if you decide to get a little more ambitious and start logging click-throughs and/or sessions, you can use the same RDBMS installation and do SQL JOINs on the vanilla server log tables and the tables from your more sophisticated logs.

Caveats? Maintaining a relational database is not such a thrill, though using it for batch inserts isn't too stressful. If you don't want the responsibility of keeping the RDBMS up 7x24 then you can have your Web server log to a file as usual and insert the data into your RDBMS in batches.

If you do decide to go for real-time logging, be a little bit thoughtful about how many inserts per second your RDBMS can handle. I maintain some RDBMS benchmarks at http://webtools.com/wtr but the bottom line is that you should start to worry if you need to log more than ten hits per second on a standard Unix box with one disk.

Summary

Here's what you should have learned from reading this chapter:

You can collect a metric buttload of data about user activity on your site without too much effort.
You have to think and work if you want to collect data that is truly more interesting than what a 10-year-old could get from his home Web server and a freeware stats package.
You can use free software and services from http://webtools.com/wtr to do a lot of the fancy tricks that well-funded Web publishers use.

What do I do with my server logs? My server is programmed to delete them every few hours. The logs are about 50MB a day on a machine with only 4GB of disk space. I got tired of struggling with WebReporter. I realized that I didn't have any advertisers who cared. Collecting gigabytes of useless information is probably good preparation for a career in Fortune 500 middle management but I don't really want a job where I couldn't take my dog to work.

It would be nice to get a rough idea of who is reading what on my site, but not if maintaining a complicated program is going to keep me from writing more content.

Note: If you like this book you can move on to the other chapters.

philg@mit.edu