重剑无锋大巧不工: 使用awk处理网站访问日志[上]

对于一个程序来说,Loging是一件非常有利的武器,其可以帮助程序员快速的找到BUG,分析性能瓶颈等等...甚至还可以在技术社区彰显一番代码的华丽,实乃死coder居家旅行必备之宝.而网站访问日志除了对开发者有修改缺陷,提升功力之良效以外,对待运营人员,也是分析用户行为的第一手宝贵资料.

本博打算通过若干系列文章来讲讲如何有效运用访问日志来改善程序和用户行为分析:)也就是说:站在程序员的角度来审视程序的运行状态;通过数字手段来分析用户访问状况

既然一切都从访问日志开始,那么首先就来parser weblog吧.让数据变得更加可读些,嘿嘿,自然也更方便处理些.
这里的日志格式如下:

205.30.210 - - [14/Dec/2009:16:02:46 +0800] "GET /HLRelationLog/ent.hunantv.com/y/l/20090306/225172.html H

TTP/1.1" 404 252

205.30.210 - - [14/Dec/2009:16:04:51 +0800] "GET /HLRelationLog/ent.hunantv.com/y/l/20090901/410652.html H

TTP/1.1" 404 252

152.32.122 - - [14/Dec/2009:16:15:10 +0800] "GET /manager/html HTTP/1.1" 404 210

205.30.210 - - [14/Dec/2009:17:02:18 +0800] "GET / HTTP/1.1" 200 44

205.30.210 - - [14/Dec/2009:17:02:19 +0800] "GET /favicon.ico HTTP/1.1" 404 209

205.30.210 - - [14/Dec/2009:17:24:59 +0800] "GET /HLRelationLog/ent.hunantv.com/x/20090616/331092.html"

格式很简单:来路IP,本地时间,以及访问的路径.实际情况下往往还有一个User-Agent.但是在本篇我们无视这个参数.先将这些数据"取"出来存入数据库再说.

awk -F ' ' '/HLRelationLog/ {printf("insert into logs (url,ip,time) values (\"%s\",\"%s\",\"%s\");",$7,$1,substr($4,2))}' access.log > target.sql

然后将数据导入

mysql -u root -p -h 8.8.8.8 < target.sql

这里简单再来看看大致的访问情况:

awk -F ' ' '/HLRelationLog/ {print $7}' access.log |sort|uniq -c|sort -nr > sort.txt

结果摘抄如下:

 13102 /HLRelationLog/ent.hunantv.com/e/h/20080917/49514.html

   7112 /HLRelationLog/ent.hunantv.com/e/h/20080719/23308.html

   6539 /HLRelationLog/ent.hunantv.com/m/20091231/536520.html

   6451 /HLRelationLog/ent.hunantv.com/m/20090427/281233.html

   5739 /HLRelationLog/ent.hunantv.com/m/20091222/527487.html

   5248 /HLRelationLog/ent.hunantv.com/x/20091224/529517.html

   4863 /HLRelationLog/ent.hunantv.com/y/20091219/524512.html

   4597 /HLRelationLog/ent.hunantv.com/m/20091221/526802.html

   4168 /HLRelationLog/ent.hunantv.com/m/20091221/526782.html

   4162 /HLRelationLog/ent.hunantv.com/y/20081028/87956.html

   3951 /HLRelationLog/ent.hunantv.com/m/20091222/527342.html

   3729 /HLRelationLog/ent.hunantv.com/e/20080716/22250.html

   3631 /HLRelationLog/ent.hunantv.com/e/h/20080728/26581.html

   3561 /HLRelationLog/ent.hunantv.com/m/20091222/527785.html

   3372 /HLRelationLog/ent.hunantv.com/e/h/20080903/42076.html

   3339 /HLRelationLog/ent.hunantv.com/y/20091216/522056.html

   3163 /HLRelationLog/ent.hunantv.com/y/l/20091221/525879.html

   2998 /HLRelationLog/ent.hunantv.com/y/20091216/521082.html

   2914 /HLRelationLog/ent.hunantv.com/x/20091221/526583.html

   2889 /HLRelationLog/ent.hunantv.com/y/20081231/154998.html

   2851 /HLRelationLog/ent.hunantv.com/z/20091225/530766.html

下面来处理下频道的访问分布,每个link的倒数第三段[x,y,e...]:首先通过awk得到每个link的频道的识别符和访问数量,丢进一个文件

map(lambda x:[x,sum(map(lambda y:int(y[1]),filter(lambda z:z[0]==x,t)))],set(map(lambda x:x[0],t))) 

>>>

[['x\n', 48173], ['y\n', 266146], ['d\n', 4035], ['z\n', 40631], ['e\n', 98009], ['ent\n', 558], ['m\n', 174542], ['t\n', 59660]]

来看看并发的情况:

tail -10000 access.log | awk '{print $4;}' | sort | uniq -c | sort -nr | head

=====我是分割线====

33 [31/Dec/2009:22:03:18

     16 [25/Dec/2009:18:16:07

     14 [25/Dec/2009:18:16:06

      9 [31/Dec/2009:22:03:19

      8 [25/Dec/2009:18:13:35

      7 [31/Dec/2009:22:06:38

      7 [31/Dec/2009:19:22:40

      7 [25/Dec/2009:19:55:54

      7 [25/Dec/2009:14:39:06

      7 [02/Jan/2010:18:37:54

重剑无锋大巧不工

Tuesday, January 5, 2010

使用awk处理网站访问日志[上]

No comments:

Post a Comment