The two most important objects in Scrapy: Request and Response

Request

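Request is the request object. Its signature is Request(url[, callback, method='GET', headers, body, cookies, meta, encoding='utf-8', priority=0, dont_filter=False, errback])

url: required, the target URL of the request

callback: the callback function. It receives the response as its first argument; any other request-related information can be passed to it through meta

method: the HTTP method, GET by default

headers: the request headers, as a dict

body: the request body, bytes or str

cookies: the request cookies, either a list of cookies or a dict of cookies

meta: a dict holding arbitrary metadata for this request. The dict is empty for a new request and is usually filled in by various Scrapy components (extensions, middlewares and so on). The built-in keys are listed below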

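dont_redirect: if Request.meta has this key set to True, RedirectMiddleware ignores the request
dont_retry: if Request.meta has this key set to True, RetryMiddleware ignores the request
handle_httpstatus_list: the response status codes to allow for this particular request
handle_httpstatus_all: set to True to allow any response status code for the request
dont_merge_cookies: set to True to avoid merging the request's cookies with the existing ones
cookiejar: lets a single spider track several cookie sessions. By default one cookie jar (session) is used; pass an identifier to use more than one
dont_cache: set to True to avoid caching the response under any cache policy
redirect_urls: the URLs the request went through while being redirected, as recorded by RedirectMiddleware
bindaddress: the outgoing IP address to use for the request
dont_obey_robotstxt: if set to True, RobotsTxtMiddleware ignores the request even when ROBOTSTXT_OBEY is enabled
download_timeout: how long (in seconds) the downloader waits before timing out
download_maxsize: the maximum response size (in bytes) that the downloader will download
download_latency: the time spent fetching the response, counted from the moment the request was started, that is, when the HTTP message was sent over the network. This key is only available once the response has been downloaded. Most other meta keys control Scrapy's behaviour, but this one is meant to be read-only
download_fail_on_dataloss: whether to fail on broken (incomplete) responses
proxy: sets a proxy for this request, with a value like http://some_proxy_server:port
ftp_user: the user name for FTP connections
ftp_password: the password for FTP connections
referrer_policy: the Referrer Policy to apply to this request
max_retry_times: the number of retries for this request. When set, it takes precedence over the RETRY_TIMES setting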

Of these, proxy (for setting a proxy) is the one used most often; look up the other keys in the documentation when you need them.

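encoding: the encoding of this request (utf-8 by default)

priority: the priority of this request (0 by default). Requests with a higher priority value are executed earlier

dont_filter: False by default, which means the request goes through the duplicate filter. Set it to True to exempt this URL from filtering

errback: the function called when an exception is raised while processing the request. This includes pages that fail with 404 and other HTTP errors.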

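The Request subclass FormRequest

FormRequest is a built-in Request subclass for working with forms (you can also subclass Request yourself to implement custom functionality). The original documentation reads as follows: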

FormRequest objects

The FormRequest class extends the base Request with functionality for dealing with HTML forms. It uses lxml.html forms to pre-populate form fields with form data from Response objects.

class scrapy.http.FormRequest(url[, formdata, ...])

The FormRequest class adds a new argument to the constructor. The remaining arguments are the same as for the Request class and are not documented here.

Parameters:
formdata (dict or iterable of tuples): is a dictionary (or iterable of (key, value) tuples) containing HTML Form data which will be url-encoded and assigned to the body of the request.

The FormRequest objects support the following class method in addition to the standard Request methods:

classmethod from_response(response[, formname=None, formid=None, formnumber=0, formdata=None, formxpath=None, formcss=None, clickdata=None, dont_click=False, ...])

Returns a new FormRequest object with its form field values pre-populated with those found in the HTML <form> element contained in the given response. For an example see Using FormRequest.from_response() to simulate a user login.

The policy is to automatically simulate a click, by default, on any form control that looks clickable, like a <input type="submit">. Even though this is quite convenient, and often the desired behaviour, sometimes it can cause problems which could be hard to debug. For example, when working with forms that are filled and/or submitted using javascript, the default from_response() behaviour may not be the most appropriate. To disable this behaviour you can set the dont_click argument to True. Also, if you want to change the control clicked (instead of disabling it) you can also use the clickdata argument.

Parameters:
response (Response object): the response containing a HTML form which will be used to pre-populate the form fields
formname (string): if given, the form with name attribute set to this value will be used.
formid (string): if given, the form with id attribute set to this value will be used.
formxpath (string): if given, the first form that matches the xpath will be used.
formcss (string): if given, the first form that matches the css selector will be used.
formnumber (integer): the number of form to use, when the response contains multiple forms. The first one (and also the default) is 0.
formdata (dict): fields to override in the form data. If a field was already present in the response <form> element, its value is overridden by the one passed in this parameter.
clickdata (dict): attributes to lookup the control clicked. If it's not given, the form data will be submitted simulating a click on the first clickable element. In addition to html attributes, the control can be identified by its zero-based index relative to other submittable inputs inside the form, via the nr attribute.
dont_click (boolean): If True, the form data will be submitted without clicking in any element.

The other parameters of this class method are passed directly to the FormRequest constructor.

New in version 0.10.3: The formname parameter.
New in version 0.17: The formxpath parameter.
New in version 1.1.0: The formcss parameter.
New in version 1.1.0: The formid parameter.

Request usage examples

Using FormRequest to send data via HTTP POST

If you want to simulate a HTML Form POST in your spider and send a couple of key-value fields, you can return a FormRequest object (from your spider) like this:

    return [FormRequest(url="http://www.example.com/post/action",
                        formdata={'name': 'John Doe', 'age': '27'},
                        callback=self.after_post)]

Using FormRequest.from_response() to simulate a user login

It is usual for web sites to provide pre-populated form fields through <input type="hidden"> elements, such as session related data or authentication tokens (for login pages). When scraping, you'll want these fields to be automatically pre-populated and only override a couple of them, such as the user name and password. You can use the FormRequest.from_response() method for this job. Here's an example spider which uses it:

    import scrapy

    class LoginSpider(scrapy.Spider):
        name = 'example.com'
        start_urls = ['http://www.example.com/users/login.php']

        def parse(self, response):
            # Hidden fields are pre-populated from the page; only the credentials are overridden
            return scrapy.FormRequest.from_response(
                response,
                formdata={'username': 'john', 'password': 'secret'},
                callback=self.after_login
            )

        def after_login(self, response):
            # check login succeed before going on
            # (response.text is a str; response.body is bytes and cannot be searched with a str)
            if "authentication failed" in response.text:
                self.logger.error("Login failed")
                return

            # continue scraping with authenticated session...

If you are interested, have a look at the original documentation (https://doc.scrapy.org); concrete cases will be covered later.

Response

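response is the response object. The concrete subclasses are TextResponse, HtmlResponse and XmlResponse, all derived from the Response base class (HtmlResponse and XmlResponse extend TextResponse). Scrapy picks the class according to the Content-Type of the response, and HtmlResponse is the one used most often.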

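HtmlResponse has the following attributes and methods:

url: the URL of the HTTP response, a str
status: the status code of the HTTP response, an int
headers: the headers of the HTTP response, a dict-like object; use its get or getlist methods to read values
body: the body of the HTTP response, as bytes
text: the body of the HTTP response as text, a str. It is obtained by decoding response.body with response.encoding:
response.text = response.body.decode(response.encoding)
encoding: the encoding of the HTTP response body. Its value may be parsed from the HTTP response headers or from the body itself
request: the Request object that produced this HTTP response
meta: a shortcut for response.request.meta. When constructing a Request, pass the information needed by the callback through the meta argument; the callback then reads it back from response.meta while handling the response
selector: a Selector object used to extract data from the response
xpath(query): extracts data from the response with an XPath selector; it is a shortcut for response.selector.xpath
css(query): extracts data from the response with a CSS selector; it is a shortcut for response.selector.css
urljoin(url): builds an absolute URL. When the url argument is a relative address, the absolute URL is computed from response.url.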