EOY 2007 Data Analysis

It's a new year and time for some dumb data analysis.

The most interesting thing to me this year is that most of the traffic
to this site and my other sites (notably epcostello.net) comes from
automated agents: search engines, random web crawlers, and SEO link injectors.

HTTP Status Codes:

Requests  Status
  155127  200
   64662  304
   11400  301
    6352  302
    5705  404
    5308  202
    2784  401
    1609  405
     126  400
      63  414
      31  500
      30  403
       3  501
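
A tally like the one above can be produced with a few lines of Python. This is only a minimal sketch, assuming the server writes Common or Combined Log Format; the regular expression, the function name, and the sample lines are illustrative and not taken from the actual logs.

```python
import re
from collections import Counter

# Matches the quoted request line followed by the status code and response size,
# as they appear in Common/Combined Log Format (assumed format).
LOG_RE = re.compile(r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+)')

# Hypothetical sample lines, made up for illustration.
SAMPLE_LINES = [
    '66.249.73.200 - - [31/Dec/2007:23:59:57 -0800] "GET /robots.txt HTTP/1.1" 200 120 "-" "Googlebot/2.1"',
    '81.52.143.16 - - [31/Dec/2007:23:59:58 -0800] "GET /202/ HTTP/1.1" 200 5120 "-" "crawler"',
    '64.92.162.210 - - [31/Dec/2007:23:59:59 -0800] "GET /favicon.ico HTTP/1.1" 304 - "-" "browser"',
    'this line is not a log entry and is skipped',
]


def count_status_codes(lines):
    """Return (status, count) pairs, most frequent first."""
    counts = Counter()
    for line in lines:
        match = LOG_RE.search(line)
        if match:  # silently skip lines that do not look like log entries
            counts[match.group("status")] += 1
    return counts.most_common()


if __name__ == "__main__":
    for status, count in count_status_codes(SAMPLE_LINES):
        print(f"{count:>8}  {status}")
```

On a real log, passing the open file object as `lines` would work the same way, since the function only iterates over it.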

Top Ten Hosts

Requests  IP address       Hostname
    4551  66.249.73.200    crawl-66-249-73-200.googlebot.com
    2234  81.52.143.16     natcrawlbloc03.net.m1.fti.net
    1890  81.52.143.15     natcrawlbloc01.net.m1.fti.net.
    1492  64.152.34.36     jfk-lv3-n4.panthercdn.com
    1380  38.99.203.110    Panscient_Data_Services.demarc.cogentco.com
     991  128.194.135.94   web-crawler.irl.cs.tamu.edu
     885  216.240.154.103
     884  66.249.73.148    crawl-66-249-73-148.googlebot.com
     800  64.92.162.210
     773  72.30.177.225    wm509310.inktomisearch.com.

Raw Top Ten Requests

Requests  Path
   21547  /robots.txt
    8030  /favicon.ico
    6801  /articles/2005/12/27/a_practically_u/
    6062  /articles/nav-commenters.gif
    6055  /g/Google_logo_transparent.png
    5582  /d/4/js/ajax/
    5290  /202/2006/06/disabling_trackbacks_in_movabl/
    4445  /
    4217  /g/feed-icon-16x16.png
    3993  /g/by-sa-3.0-88x31.png

Filtered Top Ten Requests

Requests  Path
   21547  /robots.txt
    6801  /articles/2005/12/27/a_practically_u/
    5290  /202/2006/06/disabling_trackbacks_in_movabl/
    4445  /
    2812  /202/
    2664  /202/2006/12/google_reader_annoyances/
    2121  /202/2006/10/social-bookmarking-and-attention/
    1278  /202/2006/11/bloglines_new_features_playlis/
     912  /articles/
     890  /202/2006/07/yet_another_spam_retaliation_t/
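
One way to get from a raw request list to a filtered one is to drop static assets and script endpoints. The Python sketch below is a guess at such a filter: the extension set and the `/d/4/js/` prefix are assumptions based on which paths disappear between the two lists, not the actual rules used here.

```python
import posixpath
from collections import Counter

# Assumed filter rules (hypothetical): file types and path prefixes to ignore.
STATIC_EXTENSIONS = {".ico", ".gif", ".png", ".jpg", ".css", ".js"}
EXCLUDED_PREFIXES = ("/d/4/js/",)


def is_page_request(path):
    """True if the path looks like a page rather than a static asset."""
    if path.startswith(EXCLUDED_PREFIXES):
        return False
    extension = posixpath.splitext(path)[1].lower()
    return extension not in STATIC_EXTENSIONS


def filter_requests(counts):
    """Take a path -> count mapping and return the page requests, busiest first."""
    pages = Counter({path: n for path, n in counts.items() if is_page_request(path)})
    return pages.most_common()


if __name__ == "__main__":
    raw = {
        "/robots.txt": 21547,
        "/favicon.ico": 8030,
        "/articles/2005/12/27/a_practically_u/": 6801,
        "/articles/nav-commenters.gif": 6062,
        "/d/4/js/ajax/": 5582,
        "/": 4445,
    }
    for path, n in filter_requests(raw):
        print(f"{n:>8}  {path}")
```

Note that `/robots.txt` survives because `.txt` is not in the assumed extension set, which matches the filtered list.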

Top 20 non-caching requestors of Robots.txt:

Requests  Hostname [IP address]
    2353  crawl-66-249-73-200.googlebot.com [66.249.73.200]
    1404  natcrawlbloc03.net.m1.fti.net [81.52.143.16]
    1192  natcrawlbloc01.net.m1.fti.net [81.52.143.15]
     770  wm509310.inktomisearch.com [72.30.177.225]
     736  ct501085.crawl.yahoo.net [74.6.86.230]
     688  wm509458.inktomisearch.com [74.6.74.202]
     498  crawl-66-249-73-148.googlebot.com [66.249.73.148]
     496  livebot-65-55-213-74.search.live.com [65.55.213.74]
     491  wm508816.inktomisearch.com [74.6.69.173]
     342  lj512274.crawl.yahoo.net [74.6.19.77]
     342  wm511001.inktomisearch.com [72.30.252.135]
     327  natcrawlbloc02.net.s1.fti.net [193.252.149.15]
     288  lm502044.crawl.yahoo.net [72.30.226.173]
     262  ct501101.crawl.yahoo.net [74.6.86.207]
     233  67.110.56.45.ptr.us.xo.net [67.110.56.45]
     224  crawl-66-249-73-132.googlebot.com [66.249.73.132]
     208  wm511565.inktomisearch.com [72.30.226.209]
     206  c02.entireweb.com [89.150.197.130]
     199  wm509426.inktomisearch.com [74.6.75.46]
     197  ip67-95-51-86.z51-95-67.customer.algx.net [67.95.51.86]

Last time robots.txt changed: 23 March 2007

Top Ten Referrers (filtered):

Count  Referrer
97488  "-"
  533  "http://neworder.box.sk/forum.php?page=last&did=multSecurity%20and%20Networking&thread=251392"
  244  "http://www.zenatode.org.uk/ian/internet/hotmail.xhtml"
  212  "http://my.yahoo.com/"
  142  "http://www.google.com/search?hl=en&q=phx.gbl"
  122  "http://www.stumbleupon.com/refer.php?url=http%3A%2F%2Fartific.com%2Farticles%2F2005%2F12%2F27%2Fa_practically_u%2F"
  117  "http://www.google.com/search?q=phx.gbl&ie=utf-8&oe=utf-8&aq=t&rls=org.mozilla:en-US:official&client=firefox-a"
   88  "http://www.google.com/search?hl=en&q=phx.gbl&btnG=Google+Search"
   68  "http://neworder.box.sk/forum.php?did=multSecurity%20and%20Networking&thread=251392"
   55  "http://www.dslreports.com/shownews/DNS-Hacks-Phishing-20-90182"

Internal referrers and obviously junk referrers have been filtered out.
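
As a rough illustration of that kind of filtering, here is a minimal Python sketch. The internal host list is inferred from the domains mentioned in this post, and the junk list is a made-up placeholder; neither is the actual list used for the numbers above.

```python
from collections import Counter
from urllib.parse import urlsplit

# Assumed set of "internal" hosts, inferred from the domains named in this post.
INTERNAL_HOSTS = {"artific.com", "www.artific.com", "epcostello.net", "www.epcostello.net"}
# Hypothetical junk hosts; a real list would be built by eyeballing the logs.
JUNK_HOSTS = {"spam.example"}


def keep_referrer(referrer):
    """True for referrers that are neither internal nor known junk."""
    if referrer == "-":
        return True  # "no referrer" is kept, as in the table above
    host = urlsplit(referrer).hostname  # lowercased by urlsplit, None if missing
    if host is None:
        return False  # not a parseable URL, treat it as junk
    return host not in INTERNAL_HOSTS and host not in JUNK_HOSTS


def top_referrers(referrers, n=10):
    """Return the n most common referrers that survive the filter."""
    return Counter(r for r in referrers if keep_referrer(r)).most_common(n)


if __name__ == "__main__":
    sample = ["-", "http://artific.com/202/", "http://my.yahoo.com/", "-"]
    print(top_referrers(sample))
```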

Top Ten Raw Search Requests:

Count  Query
 1474  q=phx.gbl
  260  q=phx.gbl"
  122  q=.phx.gbl
  107  q=phx.gbl%3A1863
   98  q=phx%2egbl"
   76  q=crawler.bloglines.com
   62  q=phx.gbl+domain
   59  q=gbl+domain
   53  q=gbl+tld
   42  q=%22phx.gbl%22
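
The raw counts treat q=phx.gbl, q=phx%2egbl and q=%22phx.gbl%22 as different searches even though they are the same term with different encoding or quoting. A minimal Python sketch of how the variants could be merged, assuming the search term always arrives in a `q` parameter of the referrer URL (the function names and the normalization rules are illustrative choices):

```python
from collections import Counter
from urllib.parse import parse_qs, urlsplit


def search_term(referrer):
    """Return the decoded, normalized q= term of a referrer URL, or None."""
    values = parse_qs(urlsplit(referrer).query).get("q")
    if not values:
        return None
    # parse_qs has already decoded %XX escapes and turned "+" into spaces;
    # strip surrounding quotes and whitespace, then lowercase.
    return values[0].strip(' "').lower() or None


def count_search_terms(referrers):
    """Return (term, count) pairs for all referrers that carry a q= term."""
    terms = (search_term(r) for r in referrers)
    return Counter(t for t in terms if t).most_common()


if __name__ == "__main__":
    sample = [
        "http://www.google.com/search?hl=en&q=phx.gbl",
        "http://www.google.com/search?q=%22phx.gbl%22",
        "http://www.google.com/search?q=phx%2egbl",
    ]
    print(count_search_terms(sample))
```

Whether merging is desirable is a judgment call: the raw form shows how people actually typed the query, while the merged form gives a better idea of how many searches were for the same thing.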

Top 20 phx.gbl searches:

phx.gbl is a pseudo-domain used by Microsoft for a variety of services. I wrote about it in On The Importance of Reverse DNS, which I now realize is still using the previous design system for this site.

Count  Query
 1474  q=phx.gbl
  260  q=phx.gbl"
  122  q=.phx.gbl
  107  q=phx.gbl%3A1863
   62  q=phx.gbl+domain
   42  q=%22phx.gbl%22
   30  q=.phx.gbl%3A1863
   29  q=@phx.gbl
   26  q=%40phx.gbl
   23  q=phx.gbl+netstat
   22  q=phx.gbl+1863
   22  q=.phx.gbl"
   19  q=phx.gbl%3A1863"
   18  q=by2msg2204708.phx.gbl
   17  q=what+is+phx.gbl
   15  q=netstat+phx.gbl
   15  q=by1msg4176104.phx.gbl
   14  q=phx.gbl+msn
   13  q=by2msg2204912.phx.gbl
   13  q=%22phx.gbl%22"

Top 20 Non-phx.gbl searches:

Count  Query
   76  q=crawler.bloglines.com
   38  q=artific
   16  q=infobackground
   15  q=ed+costello
   15  q=207.46.108.36
   15  q=202+Accepted
   13  q=crawler.bloglines.com"
   13  q=207.46.111.86
   13  q=202+accepted
   11  q=spam+retaliation
   11  q=InfoBackground
   10  q=google+reader+rename+folder
   10  q=Reverse+DNS
    9  q=kb05474
    9  q=importance+of+reverse+dns
    9  q=crawler.bloglines.com+
    8  q=nokia+espionage
    8  q=iab+ad+units
    8  q=hotmail+reverse+dns
    7  q=tvpath.com

In March 2007 I wrote my own trackback endpoint in PHP, which logs all of the trackback data to a file instead of beating up my Movable Type installation and MySQL database.
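
For a sense of how little such an endpoint needs to do, here is a minimal sketch in Python (not the PHP actually running on this site). It assumes the ping arrives as a form-encoded POST body with the standard TrackBack fields, appends one JSON record per ping to a log file, and answers with the success response defined by the TrackBack specification. The function name, the log format, and the choice to always report success are assumptions for illustration.

```python
import json
from datetime import datetime, timezone
from urllib.parse import parse_qs

# Fields defined by the TrackBack specification for a ping.
TRACKBACK_FIELDS = ("title", "excerpt", "url", "blog_name")

# Success response body from the TrackBack specification.
SUCCESS_RESPONSE = (
    '<?xml version="1.0" encoding="utf-8"?>\n'
    "<response>\n<error>0</error>\n</response>\n"
)


def log_trackback(body, remote_addr, log_path):
    """Append one trackback ping to log_path and return the XML reply.

    body is the raw form-encoded POST body; nothing touches a database.
    """
    form = parse_qs(body, keep_blank_values=True)
    record = {
        "received": datetime.now(timezone.utc).isoformat(),
        "remote_addr": remote_addr,
    }
    for field in TRACKBACK_FIELDS:
        # Missing fields are logged as empty strings.
        record[field] = form.get(field, [""])[0]
    with open(log_path, "a", encoding="utf-8") as log:
        log.write(json.dumps(record) + "\n")
    return SUCCESS_RESPONSE


if __name__ == "__main__":
    reply = log_trackback("title=Tramadol.&url=http%3A%2F%2Fspam.example%2F",
                          "207.234.131.237", "trackbacks.log")
    print(reply)
```

With one record per line, counting pings by source or by title becomes a simple pass over the file.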

Trackbacks Received since 21 March 2007: 10258

Number of Valid Trackbacks: 0

Top Ten Trackback Sources:

Count  Hostname [IP address]
  447  207-234-131-237.ptr.primarydns.com [207.234.131.237]
  156  movinglabs.com [195.242.99.80]
  150  u15250532.onlinehome-server.com [74.208.14.63]
  144  218.189.232.72.static.reverse.ltdomains.com [72.232.189.218]
  124  [206.123.73.15] [206.123.73.15]
  122  server.camelotwealthcreation.com [69.50.210.8]
  113  giantlogic.net [208.101.35.52]
   99  89-149-195-161.internetserviceteam.com [89.149.195.161]
   96  u15251680.onlinehome-server.com [74.208.14.215]
   95  210.219.232.72.static.reverse.ltdomains.com [72.232.219.210]

Top Fifteen Trackback Titles:

Count  Title
  169  "Tramadol."
  151  "Phentermine."
  119  "Xanax."
   94  "Cialis."
   62  "Lexapro."
   56  "Ephedra."
   52  "Valium."
   52  "Ultram."
   47  "Zoloft."
   43  "Ambien."
   42  "Fioricet."
   37  "Percocet."
   37  "Cheapphentermine."
   37  "Adderall."
   34  "Soma."

Posted in Webmastery

Copyright 2002–2008 Artific Consulting LLC.

Unless otherwise noted, content is licensed for reuse under the Creative Commons Attribution-ShareAlike 3.0 License. Please read and understand the license before repurposing content from this site.