Scrapy Web Crawling Framework: Tutorial and Worked Example

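The Scrapy Web Crawling Framework

Scrapy is a fast, high-level screen-scraping and web-crawling framework written in Python, used to crawl websites and extract structured data from their pages. It has a wide range of uses, including data mining, monitoring, and automated testing. Part of Scrapy's appeal is that it is a framework, so anyone can adapt it to their own needs. It also provides base classes for several kinds of spiders, such as BaseSpider (named Spider in current releases) and the sitemap spider. To "scrape" means to scratch or pull something off a surface, which is presumably where the name Scrapy comes from, so we might nickname it "Little Scraper".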

I. Installation

Scrapy documentation:

http://doc.scrapy.org/en/latest/

For the detailed installation procedure, see:

http://doc.scrapy.org/en/latest/intro/install.html#intro-install-platform-notes

That page covers installation on each platform.

Once installation finishes, run scrapy in your terminal:

    localhost:mySplit ldb$ scrapy
    Scrapy 1.0.5 - project: mySplit

    Usage:
      scrapy <command> [options] [args]

    Available commands:
      bench         Run quick benchmark test
      check         Check spider contracts
      commands
      crawl         Run a spider
      edit          Edit spider
      fetch         Fetch a URL using the Scrapy downloader
      genspider     Generate new spider using pre-defined templates
      list          List available spiders
      parse         Parse URL (using its spider) and print the results
      runspider     Run a self-contained spider (without creating a project)
      settings      Get settings values
      shell         Interactive scraping console
      startproject  Create new project
      version       Print Scrapy version
      view          Open URL in browser, as seen by Scrapy

    Use "scrapy <command> -h" to see more info about a command

Output similar to the above means Scrapy is installed correctly.

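Crawling a website with Scrapy takes four steps:

Create a project (Project): set up a new crawler project
Define the target (Items): decide what data you want to scrape
Write the spider (Spider): write the spider that crawls the pages
Store the content (Pipeline): design a pipeline to store the scraped content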

II. Creating a Project

    scrapy startproject mySpider

Here mySpider is the project name.

This creates a mySpider folder with the following directory structure:

    .
    └── mySpider
        ├── mySpider
        │   ├── __init__.py
        │   ├── items.py
        │   ├── pipelines.py
        │   ├── settings.py
        │   └── spiders
        │       └── __init__.py
        └── scrapy.cfg

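Here is a brief description of what each file does:

scrapy.cfg: the project's configuration file
mySpider/: the project's Python module, from which your code is imported
mySpider/items.py: the project's items file
mySpider/pipelines.py: the project's pipelines file
mySpider/settings.py: the project's settings file
mySpider/spiders/: the directory that holds the spiders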

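III. Defining the Target

In Scrapy, items are the containers that hold the scraped data. They behave much like a Python dict, but add some extra protection that reduces errors.

An item is normally created by subclassing scrapy.item.Item, with its attributes declared as scrapy.item.Field objects (you can think of this as similar to an ORM mapping).

Next, let's build the item model.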

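Edit items.py in the mySpider directory and add our own class after the existing one.

Because we want to scrape the information for every instructor listed at http://www.itcast.cn/channel/teacher.shtml#ac, we can name the class ItcastItem:

    import scrapy


    class ItcastItem(scrapy.Item):
        name = scrapy.Field()   # instructor name
        level = scrapy.Field()  # instructor title
        info = scrapy.Field()   # instructor bio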

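This may look opaque at first, but declaring these items lets the other components know exactly what your items contain.

You can think of an Item simply as a ready-made class that wraps your data.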
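To make the "extra protection" idea concrete, here is a small standard-library sketch of a dict that only accepts declared field names. It is not Scrapy's implementation: FieldCheckedItem is a hypothetical class invented for illustration, and the sample values are placeholders.

```python
class FieldCheckedItem(dict):
    """Hypothetical stand-in: a dict that rejects undeclared field names."""

    # The declared fields, mirroring name / level / info from the item above.
    fields = ("name", "level", "info")

    def __setitem__(self, key, value):
        if key not in self.fields:
            raise KeyError("unsupported field: %s" % key)
        super().__setitem__(key, value)


item = FieldCheckedItem()
item["name"] = "Example Teacher"  # declared field: accepted

try:
    item["nmae"] = "typo"  # misspelled field name: rejected
    typo_rejected = False
except KeyError:
    typo_rejected = True

print(dict(item), typo_rejected)
```

A plain dict would silently store the misspelled key; failing early like this is the kind of mistake a declared item is meant to catch.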

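IV. Writing the Spider

Writing a spider takes two steps: first crawl, then extract.

1. Crawl

Fetch all of the data in the page.

A Spider is a class you write yourself to scrape information from a domain (or a group of domains).

It defines the list of URLs to download, how links are followed, and how page content is parsed in order to extract items.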

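To create a Spider, subclass scrapy.spiders.Spider (called scrapy.spider.BaseSpider in older releases) and define three required members:

name: the spider's identifier. It must be unique, so every spider needs a different name.
start_urls: the list of URLs to crawl. The spider starts here, so the first pages downloaded come from these URLs; further URLs are derived from the pages fetched from these start URLs.
parse(): the parsing method. It is called with the Response object returned for each URL as its only argument, and is responsible for parsing the response, extracting the scraped data (as items), and finding more URLs to follow.

Now let's define a spider and save it as itcast_spider.py in the mySpider/spiders directory.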

The code for itcast_spider.py is as follows:

    import scrapy


    class ItcastSpider(scrapy.spiders.Spider):
        name = "itcast"
        # allowed_domains takes bare domain names, not full URLs
        allowed_domains = ["itcast.cn"]
        start_urls = [
            "http://www.itcast.cn/channel/teacher.shtml#ac",
        ]

        def parse(self, response):
            # Save the raw page body (bytes) to a local file
            filename = "teacher.html"
            with open(filename, "wb") as f:
                f.write(response.body)

allowed_domains is the set of domains the spider may search, in other words its boundary: the spider only crawls pages under these domains.
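As a rough illustration of what such a domain restriction means, the sketch below checks whether a URL's host falls under an allowed domain. It uses only the standard library and is not Scrapy's own filtering code; is_allowed is a hypothetical helper, and it assumes the entries are bare domain names such as itcast.cn rather than full URLs.

```python
from urllib.parse import urlparse


def is_allowed(url, allowed_domains):
    """Return True if the URL's host is an allowed domain or one of its subdomains."""
    host = urlparse(url).hostname or ""
    return any(host == d or host.endswith("." + d) for d in allowed_domains)


allowed = ["itcast.cn"]
print(is_allowed("http://www.itcast.cn/channel/teacher.shtml", allowed))
print(is_allowed("http://example.com/page", allowed))
```

Matching on "." + domain rather than a plain suffix keeps an unrelated host such as notitcast.cn from slipping through.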

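As the parse method shows, the content of the crawled page is saved to a file named "teacher.html".

Now run it. From the mySpider directory, execute:

    scrapy crawl itcast

Here itcast is the name attribute of the ItcastSpider class.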

Note: if you run into encoding problems, add the following to itcast_spider.py (this workaround exists only in Python 2):

    import sys
    reload(sys)
    sys.setdefaultencoding("utf-8")

or:

    import sys
    reload(sys)
    sys.setdefaultencoding("gb2312")
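The underlying issue is that a page body arrives as bytes in some specific encoding, and has to be decoded with the matching codec. Here is a minimal Python 3 sketch of that idea; the sample text is made up rather than taken from the crawled site.

```python
text = "中文"  # sample text, not taken from the crawled site

utf8_bytes = text.encode("utf-8")  # 3 bytes per character for this text
gb_bytes = text.encode("gb2312")   # 2 bytes per character for this text

# Decoding with the matching codec restores the original text.
utf8_roundtrip = utf8_bytes.decode("utf-8")
gb_roundtrip = gb_bytes.decode("gb2312")

# Decoding with the wrong codec is what produces errors or garbled output.
try:
    gb_bytes.decode("utf-8")
    wrong_codec_failed = False
except UnicodeDecodeError:
    wrong_codec_failed = True

print(len(utf8_bytes), len(gb_bytes), wrong_codec_failed)
```

So when output looks garbled, the first thing to check is which encoding the page actually uses.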

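After the run, if the log output contains the word "finished", the crawl succeeded.

A teacher.html file then appears in the current folder, containing the full source of the page we just crawled.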

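2. Extract

With the whole page crawled, the next step is extraction.

Storing an entire page is not enough on its own.

In a basic crawler, this step could be done with regular expressions.

Scrapy instead uses a mechanism called XPath selectors, which is based on XPath expressions.

To learn more about selectors and the other mechanisms, consult the documentation (or send me a private message).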

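Here are some example XPath expressions and what they mean:

/html/head/title : selects the <title> element inside the <head> element of the HTML document.

/html/head/title/text() : selects the text content of that <title> element.

//td : selects all <td> elements.

//div[@class="mine"] : selects all div elements that have the attribute class="mine".

These are only a few simple examples; XPath is in fact far more powerful.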
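To try these expressions without running a crawl, the sketch below applies their closest equivalents using the standard library's xml.etree.ElementTree. It supports only a limited subset of XPath (for example, there is no text() step, so the .text attribute is used instead), and the sample document is hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed sample document (not the real page).
doc = ET.fromstring(
    "<html><head><title>Teachers</title></head>"
    "<body><div class='mine'>first</div><div>second</div>"
    "<table><tr><td>a</td><td>b</td></tr></table></body></html>"
)

# /html/head/title/text(): ElementTree paths are relative to the root <html>.
title_text = doc.find("head/title").text

# //td: every <td> element anywhere in the document.
cells = [td.text for td in doc.findall(".//td")]

# //div[@class="mine"]: every <div> whose class attribute equals "mine".
mine = [div.text for div in doc.findall(".//div[@class='mine']")]

print(title_text, cells, mine)
```

Scrapy's selectors accept the full expressions exactly as listed above; the rewritten paths here are only an ElementTree limitation.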

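Selectors have four basic methods:

xpath(): returns a list of selectors, each representing a node selected by the given XPath expression
css(): returns a list of selectors, each representing a node selected by the given CSS expression
extract(): returns the selected data as a Unicode string (a list of strings when called on a list of selectors)
re(): returns a list of Unicode strings extracted by applying the given regular expression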

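Let's look again at the source of the page we want to crawl:

http://www.itcast.cn/channel/teacher.shtml#ac

All of the instructor information turns out to be contained in <div class="li_txt"> tags.

So the first step is to select those div tags:

    response.xpath('//div[@class="li_txt"]')

This uses XPath syntax to select every div of that class.

The return value is a list of selectors, one for each matching div.

Then, from each div, we extract the instructor's name (teacher_name), title (teacher_level), and bio (teacher_info).

We match again with XPath rules, pulling out the contents of the <h3>, <h4>, and <p> elements respectively:

    teacher_name = site.xpath('h3/text()').extract()
    teacher_level = site.xpath('h4/text()').extract()
    teacher_info = site.xpath('p/text()').extract()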

If the Chinese text comes out garbled, you can normalize everything to UTF-8:

    # Python 2 syntax, matching the Scrapy 1.0.5 install shown earlier
    def parse(self, response):
        for site in response.xpath('//div[@class="li_txt"]'):
            # Each extract() call returns a list of matched strings
            teacher_name = site.xpath('h3/text()').extract()
            teacher_level = site.xpath('h4/text()').extract()
            teacher_info = site.xpath('p/text()').extract()

            # print teacher_name
            # print teacher_level
            # print teacher_info

            # For non-ASCII text this relies on the setdefaultencoding workaround
            unicode_teacher_name = teacher_name[0].decode("utf-8")
            unicode_teacher_level = teacher_level[0].decode("utf-8")
            unicode_teacher_info = teacher_info[0].decode("utf-8")

            print unicode_teacher_name
            print unicode_teacher_level
            print unicode_teacher_info

After saving the modified itcast_spider.py, run it again:

    scrapy crawl itcast

All of the data is now printed to the terminal.
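The same select-then-extract pattern can be tried offline. The following sketch uses the standard library's xml.etree.ElementTree in place of Scrapy's selectors, on hypothetical markup modeled on the li_txt structure described above; the names and text are placeholders, not data from the real site.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup with the same shape as the instructor blocks.
page = ET.fromstring(
    "<div>"
    "<div class='li_txt'><h3>Teacher A</h3><h4>Senior Lecturer</h4><p>Bio A</p></div>"
    "<div class='li_txt'><h3>Teacher B</h3><h4>Lecturer</h4><p>Bio B</p></div>"
    "</div>"
)

teachers = []
for site in page.findall(".//div[@class='li_txt']"):
    # Relative lookups inside each block, like site.xpath('h3/text()').
    teachers.append(
        {
            "name": site.findtext("h3"),
            "level": site.findtext("h4"),
            "info": site.findtext("p"),
        }
    )

for teacher in teachers:
    print(teacher["name"], teacher["level"], teacher["info"])
```

The key point is the two-level lookup: first select each container div, then run relative paths inside it so that each name stays paired with its own title and bio.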

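V. Storing the Content

A spider should put the data it scrapes into Item objects. To return the scraped data, the final spider code looks like the following.

Earlier we defined an ItcastItem class in mySpider/items.py.

Import it here:

    from mySpider.items import ItcastItem

Then we wrap the data we obtained in an ItcastItem object.

Finally, we return a list of ItcastItem objects to the framework: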

    # Python 2 syntax, matching the Scrapy 1.0.5 install shown earlier
    def parse(self, response):
        items = []
        for site in response.xpath('//div[@class="li_txt"]'):
            item = ItcastItem()
            print "--------------"
            teacher_name = site.xpath('h3/text()').extract()
            teacher_level = site.xpath('h4/text()').extract()
            teacher_info = site.xpath('p/text()').extract()

            unicode_teacher_name = teacher_name[0].decode("utf-8")
            unicode_teacher_level = teacher_level[0].decode("utf-8")
            unicode_teacher_info = teacher_info[0].decode("utf-8")
            print unicode_teacher_name
            print unicode_teacher_level
            print unicode_teacher_info

            item["name"] = unicode_teacher_name
            item["level"] = unicode_teacher_level
            item["info"] = unicode_teacher_info
            items.append(item)
            print "--------------"

        # Hand the collected items back to the framework
        return items

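Finally, we save the scraped data locally.

The simplest way to save it is with Feed exports, which support four main formats: JSON, JSON lines, CSV, and XML.

We export the results as JSON, the most commonly used format, with this command:

    scrapy crawl itcast -o itcast_teachers.json -t json

An itcast_teachers.json file is then generated in the current directory. It is a standard data file whose contents can be read by any tool or API that can parse JSON.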
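To show how two of these formats differ, the sketch below writes the same records as JSON and as JSON lines using only the standard json module, then reads both back. The records are placeholders shaped like the items above, not real scraped data, and the exact layout of the file Scrapy writes may differ.

```python
import json

# Placeholder records shaped like the items above (not real scraped data).
items = [
    {"name": "Teacher A", "level": "Senior Lecturer", "info": "Bio A"},
    {"name": "Teacher B", "level": "Lecturer", "info": "Bio B"},
]

# JSON: the whole result set is a single array.
json_text = json.dumps(items, ensure_ascii=False)

# JSON lines: one standalone JSON object per line.
jsonl_text = "\n".join(json.dumps(item, ensure_ascii=False) for item in items)

# Reading either format back yields the same records.
from_json = json.loads(json_text)
from_jsonl = [json.loads(line) for line in jsonl_text.splitlines()]

print(from_json == from_jsonl)
```

A JSON array has to be parsed as a whole, while JSON lines can be processed one record at a time, which tends to suit large crawls better.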

One example is http://www.bejson.com/jsonview2/, an online JSON viewer: paste the contents of the generated itcast_teachers.json into it to inspect the data we scraped.