Python web-scraping study notes...
GitHub source: Ingram7/LagouSpider
Notes
Setting up the Splash service: "Scrapy-Splash 爬取京東" (Zhihu article by 晚來天御雪) https://zhuanlan.zhihu.com/p/57374645
Setting a random User-Agent: "Scrapy中設置隨機User_Agent" (Zhihu article by 晚來天御雪) https://zhuanlan.zhihu.com/p/55081179
Logging in with browser cookies: "Scrapy爬蟲使用browsercookie登錄" (Zhihu article by 晚來天御雪) https://zhuanlan.zhihu.com/p/64366444
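Before the spider below can issue SplashRequests, scrapy-splash has to be wired into the project settings. A typical settings.py fragment looks roughly like this (the Splash URL assumes a local Docker instance; adjust it to your setup):

```python
# settings.py fragment for scrapy-splash (SPLASH_URL assumes a local Docker instance)
SPLASH_URL = 'http://localhost:8050'

DOWNLOADER_MIDDLEWARES = {
    'scrapy_splash.SplashCookiesMiddleware': 723,
    'scrapy_splash.SplashMiddleware': 725,
    'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
}

SPIDER_MIDDLEWARES = {
    'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
}

DUPEFILTER_CLASS = 'scrapy_splash.SplashAwareDupeFilter'
HTTPCACHE_STORAGE = 'scrapy_splash.SplashAwareFSCacheStorage'
```

These middleware names and priorities follow the scrapy-splash README; see the first linked article for starting the Splash container itself.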
import scrapy
from scrapy_splash import SplashRequest

class LagouspdSpider(scrapy.Spider):
    name = 'lagouspd'
    allowed_domains = ['lagou.com']
    start_urls = ['http://lagou.com/']

    def start_requests(self):
        yield scrapy.Request(url=self.start_urls[0],
                             callback=self.start_parse_job,
                             meta={'cookiejar': 'chrome'})
Requesting the page with meta={'cookiejar': 'chrome'} so the request carries the login cookies.
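The 'cookiejar' meta key tells Scrapy's cookies middleware which named jar to use for this request chain; the 'chrome' jar is pre-filled with the browser's cookies via browsercookie (see the third linked article). A stdlib-only sketch of the loading idea, using a Netscape-format cookies.txt (the file contents and cookie values here are made up for illustration):

```python
import http.cookiejar
import os
import tempfile

# A tiny Netscape-format cookie file (domain and values are made-up examples).
cookie_txt = ("# Netscape HTTP Cookie File\n"
              ".lagou.com\tTRUE\t/\tFALSE\t2147483647\tuser_trace_token\tabc123\n")

path = os.path.join(tempfile.mkdtemp(), 'cookies.txt')
with open(path, 'w') as f:
    f.write(cookie_txt)

# browsercookie.chrome() returns a CookieJar much like the one loaded here.
jar = http.cookiejar.MozillaCookieJar(path)
jar.load()
names = [c.name for c in jar]
```

In the real spider a custom middleware copies such a jar into Scrapy's 'chrome' cookiejar before the first request goes out.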
    def start_parse_job(self, response):
        for url_job in response.xpath('//div[contains(@class, "menu_sub dn")]//dd/a'):
            classify_href = url_job.xpath('@href').extract_first()
            classify_name = url_job.xpath('text()').extract_first()
            # print(classify_name)
            url = classify_href + '1/?filterOption=3'
            yield SplashRequest(url,
                                endpoint='execute',
                                meta={'classify_name': classify_name,
                                      'classify_href': classify_href},
                                callback=self.parse_total_page,
                                dont_filter=True,
                                args={'lua_source': lua_script},
                                cache_args=['lua_source'])
Collects each job category (Java, C++, PHP, data mining, ...) together with its link, then requests each link, passing the response on to the next function.
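Outside Scrapy, the same extraction can be mimicked with the standard library's html.parser; the markup below is a hypothetical, trimmed-down version of Lagou's category menu, and the URL suffix matches what the spider appends:

```python
from html.parser import HTMLParser

class CategoryParser(HTMLParser):
    """Collect (name, href) for every <a> inside the 'menu_sub dn' category menu."""
    def __init__(self):
        super().__init__()
        self.in_menu = False
        self.in_link = False
        self.href = None
        self.links = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == 'div' and 'menu_sub dn' in attrs.get('class', ''):
            self.in_menu = True          # entered the category menu div
        elif self.in_menu and tag == 'a':
            self.in_link = True
            self.href = attrs.get('href', '')

    def handle_endtag(self, tag):
        if tag == 'a':
            self.in_link = False
        elif tag == 'div':
            self.in_menu = False         # good enough for this flat snippet

    def handle_data(self, data):
        if self.in_link and data.strip():
            self.links.append((data.strip(), self.href))

# Hypothetical, trimmed-down version of the category menu markup.
html = ('<div class="menu_sub dn"><dl>'
        '<dd><a href="https://www.lagou.com/zhaopin/Java/">Java</a></dd>'
        '<dd><a href="https://www.lagou.com/zhaopin/C%2B%2B/">C++</a></dd>'
        '</dl></div>')

parser = CategoryParser()
parser.feed(html)
# Build the same paged/filtered URL the spider requests for each category.
urls = [href + '1/?filterOption=3' for _, href in parser.links]
```

This is only a sketch of the selector logic; in the spider itself, Scrapy's XPath selectors handle the real (much messier) page.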