[XPath] - Getting Started with XPath

1 Using XPath in Python

Method 1:

# -*- coding: utf-8 -*-
from lxml import etree

html = 'this is a html 內容'
# With lxml, sel.xpath(...) returns its results directly, without .extract()
sel = etree.HTML(html)

Method 2:

from scrapy.selector import Selector

html = 'this is a html 內容'
sel = Selector(text=html)

2 Basic XPath semantics

Sample text

"""
<html>
 <head>
  <base href="http://example.com/" />
  <title>Example website</title>
 </head>
 <body>
  <div id="url">
   <a href="https://www.baidu.com" title="baidu">BaiDu</a>
   <a href="https://www.zhihu.com" title="zhihu">ZhiHu</a>
  </div>
 </body>
</html>
"""

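Basic semantics 1

//  selects matching nodes anywhere in the document, regardless of their position
/   steps down one level at a time, matching nodes in order along the path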

Example

>>> sel.xpath('/html/head/title/text()').extract()
>>> sel.xpath('//title/text()').extract()
['Example website']
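
The difference between the two forms can also be tried without installing anything. The following is only a sketch: it assumes the sample text is well-formed enough to parse as XML and uses the standard library's xml.etree.ElementTree, which supports just a subset of XPath (paths are relative to the element they are called on, and there is no text() step). The variable names are made up for this illustration.

```python
import xml.etree.ElementTree as ET

SAMPLE = """
<html>
 <head>
  <base href="http://example.com/" />
  <title>Example website</title>
 </head>
 <body>
  <div id="url">
   <a href="https://www.baidu.com" title="baidu">BaiDu</a>
   <a href="https://www.zhihu.com" title="zhihu">ZhiHu</a>
  </div>
 </body>
</html>
"""

root = ET.fromstring(SAMPLE.strip())  # root is the <html> element

# "/" style: walk down one level at a time (html -> head -> title).
step_by_step = root.find("head/title")

# "//" style: find <title> at any depth below the root.
anywhere = root.find(".//title")

# <a> is not a direct child of <html>, so a one-level path finds nothing,
# while the "//" form reaches both links.
direct_links = root.findall("a")
all_links = root.findall(".//a")

print(step_by_step.text, anywhere.text, len(direct_links), len(all_links))
```

Both lookups reach the same title element; only the "//" form finds the links, because they sit three levels below the root.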

Basic semantics 2

/text()  the text content directly under the current node
/@x      the value of attribute x on the current node

Example

>>> sel.xpath('//base/@href').extract()
['http://example.com/']
>>> sel.xpath('//*[@id="url"]/a/@title').extract()    # //* matches any node
>>> sel.xpath('//div[@id="url"]/a/@title').extract()  # equivalent form
['baidu', 'zhihu']
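
For comparison, here is a rough standard-library counterpart of the attribute and text lookups. It is a sketch, not the Scrapy API: xml.etree.ElementTree has no /@x or /text() steps, so the element is located with a path and the value is then read with .get() or .text. The sample is assumed to be well-formed XML, and the variable names are hypothetical.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<html>
 <head>
  <base href="http://example.com/" />
  <title>Example website</title>
 </head>
 <body>
  <div id="url">
   <a href="https://www.baidu.com" title="baidu">BaiDu</a>
   <a href="https://www.zhihu.com" title="zhihu">ZhiHu</a>
  </div>
 </body>
</html>"""

root = ET.fromstring(SAMPLE)

# Counterpart of //base/@href: locate the element, then read the attribute.
base_href = root.find(".//base").get("href")

# Counterpart of //*[@id="url"]/a/@title: "*" matches any tag name.
titles_any = [a.get("title") for a in root.findall(".//*[@id='url']/a")]

# Counterpart of //div[@id="url"]/a/@title: same nodes, tag spelled out.
titles_div = [a.get("title") for a in root.findall(".//div[@id='url']/a")]

# Counterpart of /text(): the text directly inside each <a>.
link_texts = [a.text for a in root.findall(".//div[@id='url']/a")]

print(base_href, titles_any, titles_div, link_texts)
```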

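3 XPath functions

3.1 Getting the content after a specific string

# sel is built from the html string 'this is a html 內容' in section 1
>>> sel.xpath("substring-after(string(.), 'html ')").extract_first()
'內容'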

Explanation:

# substring-after(haystack, needle)
# haystack: the source string; part of this string is returned.
# needle:   the substring to search for; everything after its first occurrence is returned.
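
To make the semantics concrete, here is a plain-Python sketch of what substring-after computes. The helper name substring_after is hypothetical and no XPath engine is involved; the handling of an empty needle follows my reading of the XPath 1.0 definition and should be treated as an assumption.

```python
def substring_after(haystack: str, needle: str) -> str:
    """Plain-Python sketch of XPath's substring-after(haystack, needle)."""
    if needle == "":
        # Assumption: an empty needle counts as found at the very start,
        # so the whole source string comes back. str.partition would raise
        # ValueError on an empty separator, hence the special case.
        return haystack
    # partition splits at the FIRST occurrence of needle; when needle is
    # absent, the third item is the empty string.
    return haystack.partition(needle)[2]


print(substring_after("this is a html 內容", "html "))
print(substring_after("a=b=c", "="))
print(repr(substring_after("abc", "x")))
```

Note that only the first occurrence matters: for "a=b=c" and "=", the result still contains the second "=".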

3.2 Selecting nodes whose attribute starts with a given string

"""
<body>
 <div id="test-1">string 1</div>
 <div id="test-2">string 2</div>
 <div id="test-3">string 3</div>
</body>
"""

>>> sel.xpath('//div[starts-with(@id, "test")]/text()').extract()
['string 1', 'string 2', 'string 3']

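The same filter can be approximated with the standard library alone. This is a sketch under assumptions: xml.etree.ElementTree does not implement starts-with(), so the prefix test is done in Python with str.startswith, and the extra div with id "other" is added here only to show that non-matching nodes are skipped.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<body>
 <div id="test-1">string 1</div>
 <div id="test-2">string 2</div>
 <div id="test-3">string 3</div>
 <div id="other">not selected</div>
</body>"""

root = ET.fromstring(SAMPLE)

# Visit every <div> and keep those whose id attribute begins with "test",
# mirroring //div[starts-with(@id, "test")]/text().
matches = [
    div.text
    for div in root.iter("div")
    if div.get("id", "").startswith("test")
]

print(matches)
```
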
3.3 Extracting all the text under a tag

<body>
 <div id="test-1">
  string 1
  <span id="1">
   string 2
   <ul>string 3
    <li>string 4</li>
   </ul>
  </span>
 </div>
</body>

Method

>>> div = sel.xpath('//div[@id="test-1"]')
>>> div.xpath('string(.)').extract_first().replace('\n', '').replace(' ', '')
'string1string2string3string4'
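
What string(.) does here is concatenate every text node below the element, in document order. A standard-library sketch of the same idea, assuming the markup parses as XML: itertext() stands in for string(.), and str.split() removes all whitespace, which is slightly broader than the two replace calls above. The variable names are hypothetical.

```python
import xml.etree.ElementTree as ET

SAMPLE = """<body>
 <div id="test-1">
  string 1
  <span id="1">
   string 2
   <ul>string 3
    <li>string 4</li>
   </ul>
  </span>
 </div>
</body>"""

root = ET.fromstring(SAMPLE)
div = root.find(".//div[@id='test-1']")

# itertext() yields the text of the element and of all its descendants
# in document order, much like string(.) in XPath.
full_text = "".join(div.itertext())

# Drop every run of whitespace (spaces, newlines, tabs).
compact = "".join(full_text.split())

print(compact)
```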

3.4 Other

<?xml version="1.0" encoding="ISO-8859-1"?>
<bookstore>
<book category="COOKING">
<title lang="en">Everyday Italian</title>
<author>Giada De Laurentiis</author>
<year>2005</year>
<price>30.00</price>
</book>

<book category="CHILDREN">
<title lang="en">Harry Potter</title>
<author>J K. Rowling</author>
<year>2005</year>
<price>29.99</price>
</book>

<book category="WEB">
<title lang="en">XQuery Kick Start</title>
<author>James McGovern</author>
<author>Per Bothner</author>
<author>Kurt Cagle</author>
<author>James Linn</author>
<author>Vaidyanathan Nagarajan</author>
<year>2003</year>
<price>49.99</price>
</book>