How do I fix garbled text when scraping a web page with Beautiful Soup?

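# Imports assumed from the calls used below (Python 2, Beautiful Soup 3).
import cookielib
import urllib2
from BeautifulSoup import BeautifulSoup

def scoreScanner(username, password):
    # Keep the login cookie so the second request is sent as the logged-in user.
    a_jar = cookielib.CookieJar()
    b_jar = urllib2.build_opener(urllib2.HTTPCookieProcessor(a_jar))
    urllib2.install_opener(b_jar)

    # Pieces of the login URL.
    dust1 = "http://xxx"
    dust2 = username
    dust3 = "mm="
    dust4 = password
    dust5 = "yhlbdm=01"
    response = urllib2.urlopen(dust1 + dust2 + dust3 + dust4 + dust5)
    next = urllib2.urlopen("http://xxxx/xxx")
    doc = next.read()

    soup = BeautifulSoup(''.join(doc))
    soup.originalEncoding
    soup = str(soup)
    score_raw = str(soup)
    #print score_raw
    #print zzz
    saveFile(username, score_raw)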

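The encoding was probably misdetected. I'd suggest passing the correct encoding explicitly when you create the soup object. Websites in mainland China mostly use gb2312 or utf8, and for most Chinese sites that are not utf8 you can use gb18030 as a catch-all, since it is a superset of gb2312 and gbk:

>>> encoding = "gb18030"
>>> soup = BeautifulSoup(page, fromEncoding=encoding)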
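
To see why naming the encoding matters, here is a small standard-library sketch in Python 3 (the thread itself uses Python 2 and Beautiful Soup 3). The sample HTML and the variable names are made up for illustration; it only shows that bytes written as gb2312 turn into garbage when read as utf-8 and come back intact when read as gb18030.

```python
# Stand-in for the bytes returned by a download; the content is invented.
raw = "<html><body>成绩：90</body></html>".encode("gb2312")

# Wrong guess: the Chinese characters become replacement characters.
wrong = raw.decode("utf-8", errors="replace")

# Explicit gb18030: decodes gb2312 (and gbk) pages correctly.
right = raw.decode("gb18030")

print(wrong)
print(right)

# With the newer bs4 package the keyword is spelled from_encoding, e.g.
#   BeautifulSoup(raw, "html.parser", from_encoding="gb18030")
```

Passing the already decoded `right` string to Beautiful Soup should also work, because no encoding detection is needed once the text is unicode.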

I ran into this problem before as well. I later asked about it on StackOverflow and found a solution.

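The encoding issue described in the answer above is only one part of the problem, and using GB18030 does solve it. The other cause of garbled output is compression. Many larger websites serve their pages gzip-compressed, so before parsing with Beautiful Soup you need to check whether the page was compressed and, if it was, decompress it first. The complete code is as follows: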
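
As a rough standalone illustration of that check (a Python 3 sketch using only the standard library, not the answerer's original listing): the helper name `maybe_gunzip`, its parameters, and the sample page are all assumptions made for this example. It decompresses when the `Content-Encoding` header says gzip, or when the body starts with the gzip magic bytes.

```python
import gzip

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of every gzip stream


def maybe_gunzip(body, content_encoding=None):
    """Return body unchanged, or decompressed if it looks gzip-compressed.

    Hypothetical helper: content_encoding is the value of the response's
    Content-Encoding header, if any.
    """
    declared = (content_encoding or "").lower() == "gzip"
    if declared or body[:2] == GZIP_MAGIC:
        return gzip.decompress(body)
    return body


# Simulate a server that sends a gb18030 page with gzip compression.
html = "<html><body>成绩：90</body></html>".encode("gb18030")
compressed = gzip.compress(html)

page = maybe_gunzip(compressed, "gzip")
print(page.decode("gb18030"))
```

With `urllib.request`, the two arguments would typically come from `resp.read()` and `resp.headers.get("Content-Encoding")`; the decompressed bytes can then be decoded or handed to Beautiful Soup as usual.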