Python | Extract the text inside every occurrence of a given tag in an HTML file with a few lines of code

Suppose you have an HTML file named html.html and want to extract the text inside every <h4></h4> tag. The following Python code does it:

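import re

# Read the whole file as text (the old 'rU' mode was removed in Python 3.11)
with open("html.html") as src:
    html = src.read()

# The lookbehind and lookahead keep <h4> and </h4> out of the match
pattern = r'(?<=<h4>).*?(?=</h4>)'
matches = re.findall(pattern, html)

# Write one match per line; text mode translates "\n" for the platform
with open("new.txt", "w") as out:
    for text in matches:
        out.write(text + "\n")
        print(text)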

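In the regular expression r'(?<=<h4>).*?(?=</h4>)', the two parenthesized groups are a lookbehind and a lookahead. They act as boundaries: <h4> must come immediately before the match and </h4> immediately after it, but neither string is included in the match itself.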
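To see what the lookbehind and lookahead do, here is a small sketch that runs the same pattern on an inline string. The sample markup and the helper name extract_h4 are made up for illustration. It also shows one limitation worth knowing about: by default "." does not match a newline, so a heading that spans several lines is only picked up when re.DOTALL is passed.

```python
import re

PATTERN = r"(?<=<h4>).*?(?=</h4>)"

# Hypothetical sample input, standing in for the contents of html.html
SAMPLE = "<h4>First</h4><p>skip</p><h4>Second</h4>\n<h4>multi\nline</h4>"


def extract_h4(html, flags=0):
    """Return the text between every <h4> and </h4> pair."""
    return re.findall(PATTERN, html, flags=flags)


# Default behaviour: the heading that contains a newline is missed
print(extract_h4(SAMPLE))

# With re.DOTALL, "." also matches newlines
print(extract_h4(SAMPLE, re.DOTALL))
```

Note that the pattern only matches a bare <h4>; a tag with attributes, such as <h4 class="title">, would not be found.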

Running the script writes the text inside every <h4></h4> tag to the file new.txt.

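Do not attempt to parse HTML or XML with sed, awk, grep, and similar tools (it leads to unpredictable results). In many cases, your best option is to write in a programming language that has support for XML data. If you have to use a shell script, there are a few tools built specifically for parsing HTML and XML files.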
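As an illustration of the "use a language with markup support" advice, here is a minimal sketch based on Python's standard html.parser module. The class name TagTextCollector, the helper extract_tag_text, and the sample markup are all hypothetical; a real document may need more careful handling of malformed markup.

```python
from html.parser import HTMLParser


class TagTextCollector(HTMLParser):
    """Collect the text found inside every occurrence of one tag."""

    def __init__(self, tag):
        super().__init__()
        self.tag = tag
        self.depth = 0    # greater than 0 while inside the target tag
        self.texts = []   # one entry per target element

    def handle_starttag(self, tag, attrs):
        if tag == self.tag:
            self.depth += 1
            if self.depth == 1:
                self.texts.append("")

    def handle_endtag(self, tag):
        if tag == self.tag and self.depth > 0:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth > 0:
            self.texts[-1] += data


def extract_tag_text(html, tag):
    parser = TagTextCollector(tag)
    parser.feed(html)
    parser.close()
    return parser.texts


# Hypothetical sample: attributes, upper-case tags and nested markup
sample = '<h4 class="x">First</h4><p>skip</p><H4>Second <b>bold</b></H4>'
print(extract_tag_text(sample, "h4"))
```

Unlike a regular expression, the parser copes with attributes, tag-name case, and nested inline elements without any extra work.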

Lynx

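You may know Lynx as a terminal-mode web browser with extreme limitations. It is that, but it is also a scriptable HTML parser. It is particularly good at extracting links from a document and printing them.

If you want to include image links as well, add the -image_links option. Filtering the links to suit your needs should then be fairly simple, since each link is on a line by itself with no HTML tags in the way: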

$ lynx -dump -listonly -nonumbers http://mywiki.wooledge.org/
http://mywiki.wooledge.org/EnglishFrontPage?action=rss_rc&unique=1&ddiffs=1
http://mywiki.wooledge.org/EnglishFrontPage?action=edit
http://mywiki.wooledge.org/EnglishFrontPage
http://mywiki.wooledge.org/EnglishFrontPage?action=raw
http://mywiki.wooledge.org/EnglishFrontPage?action=print
http://mywiki.wooledge.org/EnglishFrontPage?action=AttachFile&do=view&target=Greg's-wiki.zip
[...]
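Where Lynx is not installed, the same kind of link listing can be sketched in Python with the standard library. This is not what Lynx does internally; LinkCollector, list_links, and the sample page below are assumptions made for the example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag as an absolute URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    # Resolve relative links against the page URL
                    self.links.append(urljoin(self.base_url, value))


def list_links(html, base_url):
    collector = LinkCollector(base_url)
    collector.feed(html)
    collector.close()
    return collector.links


# Hypothetical page content; a real script would download it first
page = '<a href="/EnglishFrontPage">Home</a> <a href="?action=raw">Raw</a> <img src="x.png">'
for link in list_links(page, "http://mywiki.wooledge.org/EnglishFrontPage"):
    print(link)
```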

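You might think wget would be good at this too, right? I mean, it has a recursive mirroring mode, so it obviously does this kind of thing internally. Good luck trying to find a way to make wget print the URLs instead of downloading all the files.

I gave it a try and found a way. It is not thoroughly tested. We can use --rejected-log together with a --reject-regex that always matches, and add the --spider option so that no files are saved.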

$ wget -q --spider -r --rejected-log=rejected --reject-regex=^ http://mywiki.wooledge.org/
$ cat rejected
REASON  U_URL   U_SCHEME        U_HOST  U_PORT  U_PATH  U_PARAMS        U_QUERY U_FRAGMENT      P_URL   P_SCHEME        P_HOST  P_PORT  P_PATH  P_PARAMS        P_QUERY P_FRAGMENT
REGEX   http%3A//mywiki.wooledge.org/moin_static198/common/js/common.js SCHEME_HTTP     mywiki.wooledge.org     80      moin_static198/common/js/common.js      http%3A//mywiki.wooledge.org/   SCHEME_HTTP     mywiki.wooledge.org     80                          
REGEX   http%3A//mywiki.wooledge.org/moin_static198/modernized/css/common.css   SCHEME_HTTP     mywiki.wooledge.org     80      moin_static198/modernized/css/common.css                                http%3A//mywiki.wooledge.org/   SCHEME_HTTP     mywiki.wooledge.org     80                          
REGEX   http%3A//mywiki.wooledge.org/moin_static198/modernized/css/screen.css   SCHEME_HTTP     mywiki.wooledge.org     80      moin_static198/modernized/css/screen.css                                http%3A//mywiki.wooledge.org/   SCHEME_HTTP     mywiki.wooledge.org     80                          
REGEX   http%3A//mywiki.wooledge.org/moin_static198/modernized/css/print.css    SCHEME_HTTP     mywiki.wooledge.org     80      moin_static198/modernized/css/print.css                         http%3A//mywiki.wooledge.org/   SCHEME_HTTP     mywiki.wooledge.org     80                          
REGEX   http%3A//mywiki.wooledge.org/moin_static198/modernized/css/projection.css       SCHEME_HTTP     mywiki.wooledge.org     80      moin_static198/modernized/css/projection.css                            http%3A//mywiki.wooledge.org/   SCHEME_HTTP     mywiki.wooledge.org     80                          
[...]

To extract the links to standard output:

$ wget -q --spider -r --rejected-log=/dev/stdout --reject-regex=^ http://mywiki.wooledge.org/ | tail -n +2 | cut -f 2
http%3A//mywiki.wooledge.org/moin_static198/common/js/common.js
http%3A//mywiki.wooledge.org/moin_static198/modernized/css/common.css
http%3A//mywiki.wooledge.org/moin_static198/modernized/css/screen.css
http%3A//mywiki.wooledge.org/moin_static198/modernized/css/print.css
http%3A//mywiki.wooledge.org/moin_static198/modernized/css/projection.css
[...]
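The URLs in the log are percent-encoded (http%3A//...). As a rough sketch, the tab-separated log could be post-processed in Python to pull out the U_URL column and decode it. The function name rejected_urls is hypothetical, and the sample log below is shortened to the first five columns of the header shown above.

```python
import csv
import io
from urllib.parse import unquote

# Abbreviated sample of a wget --rejected-log file (tab-separated)
SAMPLE_LOG = (
    "REASON\tU_URL\tU_SCHEME\tU_HOST\tU_PORT\n"
    "REGEX\thttp%3A//mywiki.wooledge.org/moin_static198/common/js/common.js"
    "\tSCHEME_HTTP\tmywiki.wooledge.org\t80\n"
    "REGEX\thttp%3A//mywiki.wooledge.org/moin_static198/modernized/css/print.css"
    "\tSCHEME_HTTP\tmywiki.wooledge.org\t80\n"
)


def rejected_urls(log_text):
    """Return the decoded U_URL column of a rejected-log dump."""
    reader = csv.DictReader(
        io.StringIO(log_text, newline=""),
        delimiter="\t",
        quoting=csv.QUOTE_NONE,  # treat quote characters as plain text
    )
    return [unquote(row["U_URL"]) for row in reader]


for url in rejected_urls(SAMPLE_LOG):
    print(url)
```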

xmllint

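xmllint is the best choice for most XML processing. Unfortunately, using it requires learning XPath, and I do not know of any reasonable XPath introduction. Here are a few simple tricks. They are demonstrated using the following input file (foo.xml in the commands below):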

<staff>
<person name="bob"><salary>70000</salary></person>
<person name="sue"><salary>90000</salary></person>
</staff>

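Note that xmllint does not add a trailing newline to its output. If you capture the output with a command substitution, this is not a problem. But if you are testing in an interactive shell, it quickly becomes annoying. You may want to write a wrapper function, such as: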

xmllint() { command xmllint "$@"; echo; }

Simple tricks:

Print the first salary tag:

$ xmllint --xpath 'string(//salary)' foo.xml
70000

Print all the salary tags (note that the values run together, so this form is not particularly useful):

$ xmllint --xpath '//salary/text()' foo.xml
7000090000

Count the number of person tags:

$ xmllint --xpath 'count(//person)' foo.xml
2

Print each person's salary separately:

$ xmllint --xpath '//person[1]/salary/text()' foo.xml
70000
$ xmllint --xpath '//person[2]/salary/text()' foo.xml
90000

Print bob's salary:

$ xmllint --xpath '//person[@name="bob"]/salary/text()' foo.xml 
70000

Print the second person's name:

$ xmllint --xpath 'string(//person[2]/@name)' foo.xml
sue
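For comparison, the same queries can be sketched with Python's xml.etree.ElementTree, which supports a limited subset of XPath. The variable names are made up for the example, and the input is the staff file from above embedded as a string.

```python
import xml.etree.ElementTree as ET

STAFF_XML = """<staff>
<person name="bob"><salary>70000</salary></person>
<person name="sue"><salary>90000</salary></person>
</staff>"""

root = ET.fromstring(STAFF_XML)

# All salaries, one list entry each (no run-together output)
salaries = [s.text for s in root.iter("salary")]

# Number of person elements
person_count = len(root.findall("person"))

# bob's salary, selected by attribute value
bob_salary = root.findtext("person[@name='bob']/salary")

# Name of the second person (positions start at 1, as in XPath)
second_name = root.find("person[2]").get("name")

for salary in salaries:
    print(salary)
print(person_count, bob_salary, second_name)
```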

Namespaces

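The examples above show that parsing XML is fairly easy when you have a decent XML parser, but that defeats the purpose of XML, which is to make everyone miserable. So some clever people introduced XML namespaces.

A typical maven build file (called pom.xml) is an example of this. It looks roughly like this: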

<project xmlns="http://maven.apache.org/POM/4.0.0"
  xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
  xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
                      http://maven.apache.org/xsd/maven-4.0.0.xsd">
  <modelVersion>4.0.0</modelVersion>
 
  <groupId>org.codehaus.mojo</groupId>
  <artifactId>my-project</artifactId>
  <version>1.0-SNAPSHOT</version>
</project>

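There would usually be a few hundred more lines for dependencies, but let's keep it short.

Based on the examples in the previous section, you would expect that extracting the version from this file is just a matter of using the XPath /project/version/text():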

$ xmllint --xpath '/project/version/text()' pom.xml
XPath set is empty

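Well, no, it is not, because the author cleverly added a default namespace with xmlns="http://maven.apache.org/POM/4.0.0". Now you first have to specify that exact URL before you can say that you want the text inside the version element.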
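One way to see why the unqualified path finds nothing is to look at the element names a parser actually reports. The sketch below uses Python's xml.etree.ElementTree on a trimmed-down copy of the POM (the POM_XML string is an abbreviated stand-in for pom.xml); ElementTree writes a namespaced name as "{uri}localname".

```python
import xml.etree.ElementTree as ET

# Trimmed-down stand-in for pom.xml
POM_XML = """<project xmlns="http://maven.apache.org/POM/4.0.0">
  <modelVersion>4.0.0</modelVersion>
  <version>1.0-SNAPSHOT</version>
</project>"""

root = ET.fromstring(POM_XML)

# Every element name carries the default namespace
print(root.tag)
print([child.tag for child in root])

# A lookup by the bare name therefore finds nothing
bare_lookup = root.find("version")
print(bare_lookup)
```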

xmllint --shell

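xmllint's --xpath option has no way to specify a namespace, so it is of no use here (unless you edit the file and remove the namespace declaration). Its shell feature, however, does allow setting a namespace.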

xmllint --shell pom.xml << EOF
setns ns=http://maven.apache.org/POM/4.0.0
cat /ns:project/ns:version/text()
EOF
/ > / >  -------
1.0-SNAPSHOT
/ >

Yay! We got the version number... plus some prompts and junk from the xmllint shell, which have to be stripped out afterwards.
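If the output of xmllint --shell has been captured as text, the prompts could be filtered out along these lines. This is only a rough sketch based on the output shown above: the function name strip_shell_noise is hypothetical, and it assumes that every unwanted line either starts with the "/ >" prompt or consists solely of dashes.

```python
def strip_shell_noise(output):
    """Keep only the lines that are neither prompts nor separators."""
    values = []
    for line in output.splitlines():
        line = line.strip()
        if not line:
            continue                 # blank line
        if line.startswith("/ >"):
            continue                 # xmllint shell prompt
        if set(line) == {"-"}:
            continue                 # separator made only of dashes
        values.append(line)
    return values


# The captured output from the example above
captured = "/ > / >  -------\n1.0-SNAPSHOT\n/ > "
print(strip_shell_noise(captured))
```

A value that itself begins with "/ >" would be dropped by this filter, so it is only suitable for simple cases like a version string.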

xmlstarlet

xmlstarlet is a bit easier to use for this task:

$ xmlstarlet sel -N ns=http://maven.apache.org/POM/4.0.0 -t -v /ns:project/ns:version -n pom.xml
1.0-SNAPSHOT

python

Python also comes with an XML parser, and it is generally more widely available than xmllint and xmlstarlet. It can also deal with namespaces, in an awkward way.

$ python -c 'import xml.etree.ElementTree as ET;print(ET.parse("pom.xml").find("{http://maven.apache.org/POM/4.0.0}version").text)'
1.0-SNAPSHOT
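As an alternative to spelling out the "{uri}" prefix, ElementTree's find methods accept a namespaces mapping from a prefix of your choice to the namespace URI. The sketch below assumes the prefix "m" and a helper named pom_version, both invented for the example, and uses an abbreviated POM as input.

```python
import xml.etree.ElementTree as ET

# Prefix "m" is arbitrary; only the URI has to match the document
NS = {"m": "http://maven.apache.org/POM/4.0.0"}

# Abbreviated stand-in for pom.xml
POM_XML = """<project xmlns="http://maven.apache.org/POM/4.0.0">
  <modelVersion>4.0.0</modelVersion>
  <groupId>org.codehaus.mojo</groupId>
  <artifactId>my-project</artifactId>
  <version>1.0-SNAPSHOT</version>
</project>"""


def pom_version(xml_text):
    """Return the text of the top-level <version> element, or None."""
    root = ET.fromstring(xml_text)
    return root.findtext("m:version", namespaces=NS)


print(pom_version(POM_XML))
```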

xsltproc

xsltproc happens to be installed on most Linux systems. For example, to extract the titles and URLs of a podcast:

xslt() {
cat << 'EOX'
<?xml version="1.0"?>
<x:stylesheet version="1.0" xmlns:x="http://www.w3.org/1999/XSL/Transform">
<x:output method="text" />
<x:template match="/">
<x:for-each select="//item">
        <x:text># </x:text>
        <x:value-of select="./title/text()" /><x:text>
<!-- newline --></x:text>
        <x:value-of select="enclosure/@url" /><x:text>
</x:text>
</x:for-each>
</x:template>
</x:stylesheet>
EOX
}

curl -s http://podcasts.files.bbci.co.uk/p02nq0lx.rss | xsltproc <(xslt) -
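The same extraction can be sketched in Python with xml.etree.ElementTree. The feed below is a made-up two-item sample rather than the real BBC feed, and podcast_entries is a hypothetical helper name; the output format mirrors the stylesheet above ("# title" followed by the enclosure URL).

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature RSS feed used in place of the downloaded one
FEED = """<rss version="2.0"><channel>
<item><title>Episode one</title>
<enclosure url="http://example.com/1.mp3" type="audio/mpeg"/></item>
<item><title>Episode two</title>
<enclosure url="http://example.com/2.mp3" type="audio/mpeg"/></item>
</channel></rss>"""


def podcast_entries(xml_text):
    """Return a (title, enclosure URL) pair for every <item>."""
    root = ET.fromstring(xml_text)
    entries = []
    for item in root.iter("item"):
        title = item.findtext("title", default="")
        enclosure = item.find("enclosure")
        url = enclosure.get("url", "") if enclosure is not None else ""
        entries.append((title, url))
    return entries


for title, url in podcast_entries(FEED):
    print(f"# {title}")
    print(url)
```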
