Tuesday, January 5, 2010

Using awk to Process Website Access Logs [Part 1]

For any program, logging is a powerful weapon: it helps programmers find bugs quickly, analyze performance bottlenecks, and so on. You can even use it to show off some elegant code in the technical community; truly a must-have for the die-hard coder, at home or on the road. And beyond helping developers fix defects and sharpen their skills, a website's access log is also first-hand material for operations staff analyzing user behavior.

This blog plans a series of posts on how to put access logs to good use, both for improving the program and for analyzing user behavior :) In other words: examine the program's running state from the programmer's point of view, and analyze user traffic by the numbers.

Since everything starts with the access log, let's parse the web log first and make the data a bit more readable, heh, and naturally easier to process as well.
The log format here is as follows:
124.205.30.210 - - [14/Dec/2009:16:02:46 +0800] "GET /HLRelationLog/ent.hunantv.com/y/l/20090306/225172.html HTTP/1.1" 404 252
124.205.30.210 - - [14/Dec/2009:16:04:51 +0800] "GET /HLRelationLog/ent.hunantv.com/y/l/20090901/410652.html HTTP/1.1" 404 252
211.152.32.122 - - [14/Dec/2009:16:15:10 +0800] "GET /manager/html HTTP/1.1" 404 210
124.205.30.210 - - [14/Dec/2009:17:02:18 +0800] "GET / HTTP/1.1" 200 44
124.205.30.210 - - [14/Dec/2009:17:02:19 +0800] "GET /favicon.ico HTTP/1.1" 404 209
124.205.30.210 - - [14/Dec/2009:17:24:59 +0800] "GET /HLRelationLog/ent.hunantv.com/x/20090616/331092.html"

The format is simple: the client IP, the local time, and the requested path. In practice there is often a User-Agent as well, but this post ignores it. First, let's pull these fields out and store them in a database.

awk -F ' ' '/HLRelationLog/ {printf("insert into logs (url,ip,time) values (\"%s\",\"%s\",\"%s\");\n",$7,$1,substr($4,2))}' access.log > target.sql
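These INSERT statements assume a table named logs already exists; the post never shows its schema, so here is a minimal sketch (the column types are my assumptions, and it must run against whatever database the import below targets; time is kept as text because the log's timestamp isn't a native MySQL DATETIME format):
mysql -u root -p -h 8.8.8.8 <<'EOF'
CREATE TABLE logs (
    url  VARCHAR(255),  -- request path, awk's $7
    ip   VARCHAR(15),   -- client address, awk's $1
    time VARCHAR(32)    -- e.g. 14/Dec/2009:16:02:46
);
EOF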

Then import the data:
mysql -u root -p -h 8.8.8.8 < target.sql

Now for a quick look at the overall traffic:
awk -F ' ' '/HLRelationLog/ {print $7}' access.log |sort|uniq -c|sort -nr > sort.txt

An excerpt of the results:
  13102 /HLRelationLog/ent.hunantv.com/e/h/20080917/49514.html
   7112 /HLRelationLog/ent.hunantv.com/e/h/20080719/23308.html
   6539 /HLRelationLog/ent.hunantv.com/m/20091231/536520.html
   6451 /HLRelationLog/ent.hunantv.com/m/20090427/281233.html
   5739 /HLRelationLog/ent.hunantv.com/m/20091222/527487.html
   5248 /HLRelationLog/ent.hunantv.com/x/20091224/529517.html
   4863 /HLRelationLog/ent.hunantv.com/y/20091219/524512.html
   4597 /HLRelationLog/ent.hunantv.com/m/20091221/526802.html
   4168 /HLRelationLog/ent.hunantv.com/m/20091221/526782.html
   4162 /HLRelationLog/ent.hunantv.com/y/20081028/87956.html
   3951 /HLRelationLog/ent.hunantv.com/m/20091222/527342.html
   3729 /HLRelationLog/ent.hunantv.com/e/20080716/22250.html
   3631 /HLRelationLog/ent.hunantv.com/e/h/20080728/26581.html
   3561 /HLRelationLog/ent.hunantv.com/m/20091222/527785.html
   3372 /HLRelationLog/ent.hunantv.com/e/h/20080903/42076.html
   3339 /HLRelationLog/ent.hunantv.com/y/20091216/522056.html
   3163 /HLRelationLog/ent.hunantv.com/y/l/20091221/525879.html
   2998 /HLRelationLog/ent.hunantv.com/y/20091216/521082.html
   2914 /HLRelationLog/ent.hunantv.com/x/20091221/526583.html
   2889 /HLRelationLog/ent.hunantv.com/y/20081231/154998.html
   2851 /HLRelationLog/ent.hunantv.com/z/20091225/530766.html

Next, the traffic distribution across channels, identified by the third-from-last segment of each link (x, y, e, ...). First use awk to extract each link's channel identifier and hit count and dump them into a file; a sketch of that step follows.
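The post doesn't show this awk step, so here is a minimal sketch of it, assuming the sort.txt produced above ("count URL" per line); channels.txt is a made-up name, and each output line is a "channel count" pair:
awk '{n = split($2, a, "/"); print a[n-2], $1}' sort.txt > channels.txt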

Here t is assumed to be the list of [channel, count] string pairs read from that file; for each distinct channel, this Python 2 one-liner sums up its counts:
>>> map(lambda x:[x,sum(map(lambda y:int(y[1]),filter(lambda z:z[0]==x,t)))],set(map(lambda x:x[0],t)))
[['x\n', 48173], ['y\n', 266146], ['d\n', 4035], ['z\n', 40631], ['e\n', 98009], ['ent\n', 558], ['m\n', 174542], ['t\n', 59660]]
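For comparison, the same per-channel sum is a one-liner in awk as well (again using the hypothetical channels.txt):
awk '{sum[$1] += $2} END {for (c in sum) print c, sum[c]}' channels.txt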

Let's look at the concurrency, i.e. how many requests land on the same second:
tail -10000 access.log | awk '{print $4;}' | sort | uniq -c | sort -nr | head

===== divider =====

     33 [31/Dec/2009:22:03:18
     16 [25/Dec/2009:18:16:07
     14 [25/Dec/2009:18:16:06
      9 [31/Dec/2009:22:03:19
      8 [25/Dec/2009:18:13:35
      7 [31/Dec/2009:22:06:38
      7 [31/Dec/2009:19:22:40
      7 [25/Dec/2009:19:55:54
      7 [25/Dec/2009:14:39:06
      7 [02/Jan/2010:18:37:54
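Per-second buckets are fairly noisy; as a variation (not in the original post), grouping by minute gives a smoother picture. substr($4,2,17) strips the leading "[" and keeps dd/Mon/yyyy:HH:MM:
tail -10000 access.log | awk '{print substr($4,2,17)}' | sort | uniq -c | sort -nr | head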