A failed experiment

It seems that in the past month or so, newspapers have been making noise about starting to charge for at least some of their content. One report I heard about boston.com was that people were considering charging for the content found in the Boston Globe and leaving the rest available to all readers.

That got me thinking about how much of what gets read on boston.com is Globe content, and about how that may have changed over the years. I started thinking about how traffic gets to the article pages anyway, and the biggest traffic driver to an article page seems to be the homepage. This was something I could measure: how much of the boston.com homepage is allocated to links to Boston Globe articles, how much of it goes to other sources, and whether that has changed over time. (My assumption was that Globe content would increase around 2006, when the boston.com and Boston Globe editorial departments became more closely integrated and Globe reporters started writing more directly for online. Before that, the sense I got was that Globe content was frequently highlighted for its unique view, but when the news of the day changed from what was published the night before, the site put up more up-to-the-minute wire content from AP or Reuters.)

So I grabbed every copy of the boston.com homepage that existed on the Internet Archive Wayback Machine. I wrote a small script that would read each file and judge each link to be either a Globe link, a non-Globe link, or one that didn’t count (I omitted things like links to other section fronts such as the news and sports pages). I called a link a Globe link if it was either within a section of Globe content (/dailyglobe2/world/…) or the tease associated with the link had an attribution of “(Boston Globe)” or “(Today’s Globe)”. I ran my script over all the homepages, put the results into a spreadsheet to graph them and found….
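
The original script isn’t reproduced here, but a minimal Python sketch of that kind of classifier might look something like the following. Several things in it are assumptions rather than what was actually run: the names `classify_link`, `LinkCollector` and `count_links` are made up for illustration, the `SECTION_FRONTS` set is a hypothetical stand-in for whatever section-front links were skipped, the “tease” is taken to be all the text from a link up to the next link, and the hrefs are assumed to be plain boston.com URLs rather than Wayback-rewritten ones.

```python
from collections import Counter
from html.parser import HTMLParser
from urllib.parse import urlparse

# Path segment that marks Globe content, per the rule described above.
GLOBE_PATH_SEGMENT = "/dailyglobe2/"
# Attributions that mark a tease as Globe content.
GLOBE_ATTRIBUTIONS = ("(Boston Globe)", "(Today's Globe)")
# Hypothetical list of section fronts that don't count either way.
SECTION_FRONTS = {"/", "/news/", "/sports/"}


def classify_link(href, tease):
    """Return 'globe', 'other', or 'skip' for one homepage link."""
    path = urlparse(href).path or "/"
    if path in SECTION_FRONTS:
        return "skip"
    # Treat curly and straight apostrophes the same way.
    text = tease.replace("\u2019", "'")
    if GLOBE_PATH_SEGMENT in path:
        return "globe"
    if any(attribution in text for attribution in GLOBE_ATTRIBUTIONS):
        return "globe"
    return "other"


class LinkCollector(HTMLParser):
    """Collect each href plus the text that follows it (assumed tease)."""

    def __init__(self):
        super().__init__()
        self.links = []  # each entry is [href, list_of_text_chunks]

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append([href, []])

    def handle_data(self, data):
        # Text is attributed to the most recent link until the next one starts.
        if self.links:
            self.links[-1][1].append(data)


def count_links(html):
    """Count globe / other / skip links in one saved homepage."""
    collector = LinkCollector()
    collector.feed(html)
    collector.close()
    counts = Counter()
    for href, chunks in collector.links:
        counts[classify_link(href, "".join(chunks))] += 1
    return counts
```

Running `count_links` over each saved homepage gives one set of counts per snapshot, and the Globe share for that snapshot would be `counts["globe"] / (counts["globe"] + counts["other"])`, leaving the skipped links out of the ratio.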

Practically no difference in the ratio of Globe content to other content from about 2001 through 2008 (the last date for which the Wayback Machine has data). Now I’m wondering what to do next. There may be something that I’m missing (I’m not giving any link greater weight than another, even though links “above the fold” tend to get clicked on more than links near the bottom). It could be that the script I wrote to sort the homepage links into Globe and non-Globe isn’t counting things accurately. Or maybe the data is right, but I should start playing around with it in R anyway, just to learn it.

Oh well, I was looking forward to publishing results that supported my hypothesis. Saying that I can’t support it isn’t nearly as much fun, but I figure it’s as much of a story to tell as the other.
