Python Crawler Study Notes...

GitHub source code: Ingram7/LagouSpider

Notes

On starting the Splash service: Scrapy-Splash Crawling JD.com - an article by 晚來天御雪 on Zhihu - zhuanlan.zhihu.com/p/57

On setting a random UA: Setting a Random User_Agent in Scrapy - an article by 晚來天御雪 on Zhihu - zhuanlan.zhihu.com/p/55

On logging in with browser cookies: Logging in with browsercookie in a Scrapy Spider - an article by 晚來天御雪 on Zhihu - zhuanlan.zhihu.com/p/64
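
The three articles above cover the environment setup. For quick reference, the standard scrapy-splash wiring in settings.py (taken from the scrapy-splash README; the SPLASH_URL assumes a default local Splash instance) looks like this:

# settings.py -- standard scrapy-splash wiring; SPLASH_URL assumes a local
# Splash instance started with: docker run -p 8050:8050 scrapinghub/splash
SPLASH_URL = 'http://localhost:8050'

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}
SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}
DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCache'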


import scrapy
from scrapy_splash import SplashRequest

from ..items import LagouspiderItem  # assumption: items.py sits in the usual scrapy project layout


class LagouspdSpider(scrapy.Spider):
    name = 'lagouspd'
    allowed_domains = ['lagou.com']
    start_urls = ['http://lagou.com/']

    def start_requests(self):
        yield scrapy.Request(url=self.start_urls[0], callback=self.start_parse_job,
                             meta={'cookiejar': 'chrome'})

The request carries the meta={'cookiejar': 'chrome'} parameter so the page is visited with a logged-in session.
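
A minimal sketch of how that cookiejar value can be honored, assuming the browsercookie approach from the article linked above; the class name BrowserCookieMiddleware and the domain filter are illustrative, not from the source:

import browsercookie


class BrowserCookieMiddleware(object):  # hypothetical name; enable it in DOWNLOADER_MIDDLEWARES
    def process_request(self, request, spider):
        if request.meta.get('cookiejar') == 'chrome':
            # browsercookie.chrome() reads cookies from the local Chrome profile
            jar = browsercookie.chrome()
            request.cookies.update(
                {c.name: c.value for c in jar if 'lagou.com' in c.domain})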

    def start_parse_job(self, response):
        for url_job in response.xpath('//div[contains(@class, "menu_sub dn")]//dd/a'):
            classify_href = url_job.xpath('@href').extract_first()
            classify_name = url_job.xpath('text()').extract_first()
            # print(classify_name)
            url = classify_href + '1/?filterOption=3'

            yield SplashRequest(url,
                                endpoint='execute',
                                meta={'classify_name': classify_name, 'classify_href': classify_href},
                                callback=self.parse_total_page,
                                dont_filter=True,
                                args={'lua_source': lua_script},
                                cache_args=['lua_source'])

This grabs every job category (Java, C++, PHP, data mining, ...) along with its link, then requests each link and passes the response to the next function.
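
The requests above pass a lua_source named lua_script that is not shown in this excerpt (the real one is in the GitHub source). A minimal stand-in for Splash's execute endpoint, which loads the page, waits for JavaScript rendering, and returns the HTML, might look like:

# A minimal stand-in for lua_script; the actual script is in the repo
lua_script = '''
function main(splash, args)
    splash:go(args.url)
    splash:wait(2)
    return splash:html()
end
'''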

    def parse_total_page(self, response):
        try:
            total_page = response.xpath('//*[@id="order"]/li/div[4]/div/span[2]/text()').extract()[0]
        except Exception:
            total_page = 0

        classify_href = response.meta['classify_href']
        for i in range(1, int(total_page) + 1):
            url = classify_href + '{}/?filterOption=3'.format(i)

            yield SplashRequest(url,
                                endpoint='execute',
                                meta={'classify_name': response.meta['classify_name']},
                                callback=self.parse_item,
                                dont_filter=True,
                                args={'lua_source': lua_script},
                                cache_args=['lua_source'])

This enters each job category's link, e.g. the Java page.

It reads the total page count for a category page like Java, then requests the link for every page and passes each response to the next function.
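
As an illustration of the URL being built (the href shape here is an assumption; the real values come from the category menu scraped earlier):

# Illustrative only: assumed shape of a category href
classify_href = 'https://www.lagou.com/zhaopin/Java/'
url = classify_href + '{}/?filterOption=3'.format(2)
# -> 'https://www.lagou.com/zhaopin/Java/2/?filterOption=3'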

    def parse_item(self, response):
        for node in response.xpath('//li[@class="con_list_item default_list"]'):
            # `or ''` guards against extract_first() returning None
            job_name = (node.xpath(
                './div[@class="list_item_top"]/div[@class="position"]/div[@class="p_top"]/a/h3/text()'
            ).extract_first() or '').strip()
            money = (node.xpath(
                './div[@class="list_item_top"]/div[@class="position"]/div[@class="p_bot"]/div[@class="li_b_l"]/span/text()'
            ).extract_first() or '').strip()
            company = (node.xpath(
                './div[@class="list_item_top"]/div[@class="company"]/div[@class="company_name"]/a/text()'
            ).extract_first() or '').strip()
            job_info_url = node.xpath(
                './div[@class="list_item_top"]/div[@class="position"]/div[@class="p_top"]/a/@href').extract_first()

            yield SplashRequest(url=job_info_url,
                                endpoint='execute',
                                meta={'job_name': job_name,
                                      'money': money,
                                      'company': company,
                                      'classify_name': response.meta['classify_name']},
                                callback=self.parse_info,
                                dont_filter=True,
                                args={'lua_source': lua_script},
                                cache_args=['lua_source'])

Inside a category like Java, every page yields each listing's job_name, money, and company (forwarded to the next function through the meta parameter) as well as the detail-page link, which is requested and passed to the next function.

    def parse_info(self, response):
        # print(response.request.headers['User-Agent'])

        item = LagouspiderItem()
        item['job_name'] = response.meta['job_name']
        item['money'] = response.meta['money']
        item['company'] = response.meta['company']
        item['classify_name'] = response.meta['classify_name']
        item['advantage'] = ''.join(response.css('.job-advantage p::text').extract()).strip()
        item['requirements'] = ''.join(response.css('.job_bt p::text').extract()).strip()
        item['info'] = ''.join(response.css(
            '.position-head .position-content .position-content-l .job_request p'
        ).xpath('./span/text()').extract()).strip()
        print('item: ' + str(item))
        yield item

On the detail page, the remaining fields advantage, requirements, and info are extracted and saved into the item together with the information passed in via meta.
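
The LagouspiderItem definition is not shown in this post; reconstructed from the field names assigned in parse_info above (the authoritative version lives in the repo's items.py), it would look like:

# items.py -- fields inferred from the assignments in parse_info
import scrapy


class LagouspiderItem(scrapy.Item):
    job_name = scrapy.Field()
    money = scrapy.Field()
    company = scrapy.Field()
    classify_name = scrapy.Field()
    advantage = scrapy.Field()
    requirements = scrapy.Field()
    info = scrapy.Field()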

For the other settings, see the GitHub source: Ingram7/LagouSpider
