Anyone who reads blogs regularly knows that cnblogs shows at most 200 pages per category — no more than 4,000 posts — and silently drops everything beyond that. How to get as much of the blog data as possible is the little core question of this article; how much you manage to get out of it is up to you~
4000 posts
Crawling each category page by page won't surface everything, so change the approach: look at the search page — it has a date filter~ a date filter!
Note the URL:
https://zzk.cnblogs.com/s/blogpost?Keywords=python&datetimerange=Customer&from=2019-01-01&to=2019-01-01
With this link in hand, a fairly simple idea gets you every Python-related article: iterate over the dates.
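Before the full spider, here is a minimal sketch of that idea in plain Python — the five-day range and the variable names are purely illustrative, not part of the original code:

import datetime

# Walk a date range one day at a time; from == to pins the search to a single day.
base = ("https://zzk.cnblogs.com/s/blogpost?Keywords=python"
        "&datetimerange=Customer&from={day}&to={day}")

day = datetime.date(2010, 1, 1)
end = datetime.date(2010, 1, 5)
while day <= end:
    print(base.format(day=day.isoformat()))
    day += datetime.timedelta(days=1)

The spider below does the same thing inside Scrapy callbacks.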
import scrapy
from scrapy import Request, Selector
import time
import datetime
class BlogsSpider(scrapy.Spider):
    name = "Blogs"
    allowed_domains = ["zzk.cnblogs.com"]
    start_urls = ["http://zzk.cnblogs.com/"]
    from_time = "2010-01-01"
    end_time = "2010-01-01"
    keywords = "python"
    page = 1
    url = "https://zzk.cnblogs.com/s/blogpost?Keywords={keywords}&datetimerange=Customer&from={from_time}&to={end_time}&pageindex={page}"
    custom_settings = {
        "DEFAULT_REQUEST_HEADERS": {
            "HOST": "zzk.cnblogs.com",
            "TE": "Trailers",
            "referer": "https://zzk.cnblogs.com/s/blogpost?w=python",
            "upgrade-insecure-requests": "1",
            "user-agent": "Mozilla/5.0 Gecko/20100101 Firefox/64.0",
        }
    }
    def start_requests(self):
        cookie_str = "obtain this yourself"  # paste the Cookie header value from your browser
        self.cookies = {item.split("=")[0]: item.split("=")[1]
                        for item in cookie_str.split("; ")}
        yield Request(
            self.url.format(keywords=self.keywords, from_time=self.from_time,
                            end_time=self.end_time, page=self.page),
            cookies=self.cookies,
            callback=self.parse,
        )
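The dict comprehension simply splits a raw Cookie header value on "; " and then on "=". With a made-up cookie string it behaves like this:

cookie_str = "a=1; b=2"  # made-up values; use your real Cookie header here
cookies = {item.split("=")[0]: item.split("=")[1] for item in cookie_str.split("; ")}
print(cookies)  # -> {'a': '1', 'b': '2'}

If your cookie values can themselves contain "=", splitting with item.split("=", 1) is the safer variant.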
Once a page has been fetched it needs parsing: read the result count to schedule the pagination requests, then move the date forward by one day. In the code below, pay particular attention to the date-increment part.
    def parse(self, response):
        print("crawling", response.url)
        # the result count tells us whether this day has any data
        # (default="0" guards against the element being missing)
        count = int(response.css("#CountOfResults::text").extract_first(default="0"))
        if count > 0:
            # 10 results per page, so schedule every page for this day
            for page in range(1, int(count / 10) + 2):
                yield Request(
                    self.url.format(keywords=self.keywords, from_time=self.from_time,
                                    end_time=self.end_time, page=page),
                    cookies=self.cookies,
                    callback=self.parse_detail,
                    dont_filter=True,  # bypass the duplicate-URL filter
                )

        time.sleep(2)
        # advance to the next day and issue the search again
        d = datetime.datetime.strptime(self.from_time, "%Y-%m-%d")
        delta = datetime.timedelta(days=1)
        d = d + delta
        self.from_time = d.strftime("%Y-%m-%d")
        self.end_time = self.from_time
        yield Request(
            self.url.format(keywords=self.keywords, from_time=self.from_time,
                            end_time=self.end_time, page=self.page),
            cookies=self.cookies,
            callback=self.parse,
            dont_filter=True,
        )
There is nothing complicated about the logic in this part; just follow the flow and write it out, then run the spider and give MongoDB some time to collect data.
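The post doesn't show the storage step itself; a minimal Scrapy item pipeline built on pymongo might look like the sketch below — the connection URI, the "cnblogs" database name, and the "dict" collection name (chosen to match the query that follows) are all assumptions:

import pymongo

class MongoPipeline:
    # Minimal sketch: write every yielded item into MongoDB.
    # URI / database / collection names are assumptions, adjust to taste.
    def open_spider(self, spider):
        self.client = pymongo.MongoClient("mongodb://localhost:27017")
        self.collection = self.client["cnblogs"]["dict"]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        self.collection.insert_one(dict(item))
        return item

Enable it in settings.py with something like ITEM_PIPELINES = {"myproject.pipelines.MongoPipeline": 300}. After the spider has run for a while, count what has landed: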
db.getCollection("dict").count({})
returns
372352 records.

The detail-page parser that produces those items:

    def parse_detail(self, response):
        items = response.xpath('//div[@class="searchItem"]')
        for item in items:
            title = item.xpath('h3[@class="searchItemTitle"]/a//text()').extract()
            title = "".join(title)
            author = item.xpath('.//span[@class="searchItemInfo-userName"]/a/text()').extract_first()
            public_date = item.xpath('.//span[@class="searchItemInfo-publishDate"]/text()').extract_first()
            pv = item.xpath('.//span[@class="searchItemInfo-views"]/text()').extract_first()
            if pv:
                pv = pv[3:-1]  # strip the label and parentheses, keeping only the number
            url = item.xpath('.//span[@class="searchURL"]/text()').extract_first()
            # print(title, author, public_date, pv)
            yield {
                "title": title,
                "author": author,
                "public_date": public_date,
                "pv": pv,
                "url": url,
            }
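For reference, the pv[3:-1] slice assumes the raw view-count text looks like 浏览(1024): three leading characters plus a closing parenthesis wrapped around the number.

raw_pv = "浏览(1024)"   # hypothetical raw text from the searchItemInfo-views span
print(raw_pv[3:-1])     # -> 1024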
One fierce flurry of operations and the data is in hand~ Some simple data analysis can follow later — see you in that post!