The Python module BeautifulSoup: extracting data from HTML or XML files

1. Installation

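Beautiful Soup is an HTML/XML parser whose main job is parsing HTML/XML documents and extracting data from them.

lxml traverses a document only locally, whereas Beautiful Soup is based on the HTML DOM: it loads the whole document and parses the entire DOM tree. Its time and memory costs are therefore much higher, and its performance is lower than that of lxml.

BeautifulSoup makes parsing HTML fairly simple and has a very friendly API. It supports CSS selectors and the HTML parser in the Python standard library, and it also supports lxml's XML parser.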

pip install beautifulsoup4

2. Usage example

from bs4 import BeautifulSoup
import requests
import asyncio
import functools
import re

house_info = []

# Asynchronously request one page of Lianjia listings
async def get_page(page_index):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/116.0.0.0 Safari/537.36'
    }
    request = functools.partial(requests.get, f'https://sh.lianjia.com/ershoufang/pudong/pg{page_index}/',
                                headers=headers)
    loop = asyncio.get_running_loop()
    response = await loop.run_in_executor(None, request)
    return response


# Extract the listing details with CSS selectors (not XPath)
def get_house_info(soup):
    house_info_list = soup.select('.info')  # elements that hold each listing's details
    reg = re.compile(r'\s+')  # \s already matches newlines
    for html in house_info_list:
        house_info.append({
            'title': reg.sub('', html.select('.title a')[0].getText()),
            'house_pattern': reg.sub('', html.select('.houseInfo')[0].getText()),
            'price': reg.sub('', html.select('.unitPrice')[0].getText()),
            'location': reg.sub('', html.select('.positionInfo')[0].getText()),
            'total': reg.sub('', html.select('.totalPrice')[0].getText())
        })

# Fetch the first page asynchronously, collect its listings and print them
async def get_first_page():
    response = await get_page(1)
    soup = BeautifulSoup(response.text, 'lxml')
    get_house_info(soup)
    print(house_info)


if __name__ == '__main__':
    asyncio.run(get_first_page())
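
The run_in_executor pattern used in get_page can be tried without touching the network. The following is a minimal sketch, assuming a hypothetical blocking function slow_fetch in place of requests.get. The use of asyncio.gather is an addition that does not appear in the example above. It shows one way several pages could be requested concurrently.

```python
import asyncio
import functools
import time


def slow_fetch(page_index, delay=0.05):
    """Hypothetical stand-in for a blocking call such as requests.get."""
    time.sleep(delay)
    return f"page {page_index}"


async def get_page(page_index):
    # run_in_executor only forwards positional arguments, so
    # functools.partial is used to bind the keyword argument up front.
    call = functools.partial(slow_fetch, page_index, delay=0.05)
    loop = asyncio.get_running_loop()
    # None selects the event loop's default thread pool executor.
    return await loop.run_in_executor(None, call)


async def get_pages(indexes):
    # gather returns results in the same order as its arguments.
    return await asyncio.gather(*(get_page(i) for i in indexes))


pages = asyncio.run(get_pages([1, 2, 3]))
print(pages)
```

Because each blocking call runs in its own worker thread, the three calls overlap instead of running one after another.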

3. Creating the soup object

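soup = BeautifulSoup(markup="", features=None, builder=None, parse_only=None, from_encoding=None, exclude_encodings=None, element_classes=None)

markup: the HTML or XML document to parse, given either as a string or as an open file object.
features: the name or type of the parser to use, such as "html.parser", "lxml", "xml" or "html5lib". The default is None, in which case Beautiful Soup picks the best parser that is installed and warns that none was named, so it is better to pass one explicitly.
builder: a TreeBuilder subclass or instance used to build the document tree. The default is None, meaning the builder is looked up from features.
parse_only: a SoupStrainer that restricts which parts of the document are parsed.
from_encoding: the character encoding of the document. The default is None, meaning the encoding is detected automatically.
exclude_encodings: a list of encodings to rule out during automatic encoding detection.
element_classes: a dictionary mapping Beautiful Soup classes such as Tag and NavigableString to the classes that should be instantiated instead. The default is None, meaning the default classes are used.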

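Parser	Usage	Advantages	Disadvantages
Python standard library	BeautifulSoup(markup, "html.parser")	Built into Python; moderate speed; tolerant of malformed documents	Less tolerant of malformed documents in versions before Python 2.7.3 or 3.2.2
lxml HTML parser	BeautifulSoup(markup, "lxml")	Fast; tolerant of malformed documents	Requires a C library
lxml XML parser	BeautifulSoup(markup, "xml")	Fast; the only XML parser supported	Requires a C library
html5lib	BeautifulSoup(markup, "html5lib")	The most tolerant; parses documents the way a browser does; produces HTML5-style documents	Slow; external Python dependency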
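
To make the constructor arguments concrete, here is a minimal sketch that names the parser explicitly and uses parse_only with a SoupStrainer to keep only the link tags. The HTML string is made up for illustration, and "html.parser" is chosen only because it needs no extra installation beyond beautifulsoup4.

```python
from bs4 import BeautifulSoup, SoupStrainer

# Made-up document used only for this illustration.
html = (
    '<html><head><title>Demo</title></head>'
    '<body><a href="/a">A</a><p>text</p><a href="/b">B</a></body></html>'
)

# Naming the parser keeps the result independent of what else is installed.
soup = BeautifulSoup(html, features="html.parser")
title_text = soup.title.string

# parse_only takes a SoupStrainer; here only <a> tags are parsed into the tree.
only_links = BeautifulSoup(html, "html.parser", parse_only=SoupStrainer("a"))
hrefs = [a["href"] for a in only_links.find_all("a")]

print(title_text, hrefs)
```

Restricting the parse this way can save memory on large pages when only a few tags are of interest.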

4. The soup object

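soup.prettify(encoding=None, formatter="minimal"): returns the HTML or XML document as a formatted string (or as bytes when encoding is given). Each tag is placed on its own line and indented according to its nesting depth, which makes the document easier to read.
soup.title: returns the document's <title> tag.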
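
A short sketch of these two members, using a made-up document and the standard library parser. The variable names are illustrative only.

```python
from bs4 import BeautifulSoup

# Made-up document used only for this illustration.
html = "<html><head><title>Demo</title></head><body><p>Hi</p></body></html>"
soup = BeautifulSoup(html, "html.parser")

title_tag = soup.title               # a Tag object, not a plain string
title_text = title_tag.string        # the text inside the tag

pretty = soup.prettify()             # str, one tag per line, indented
pretty_bytes = soup.prettify(encoding="utf-8")  # bytes when encoding is given

print(title_tag)
print(pretty)
```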

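Problem

We need to access various services over HTTP as a client, for example to download data or to interact with a REST-based API.

Solution

For simple tasks, the urllib.request module is usually enough. For example, to send a simple HTTP GET request to a remote server, you can do this: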

from urllib import request, parse
# Base URL being accessed
url = 'http://httpbin.org/get'
# Dictionary of query parameters (if any)
parms = {
 'name1' : 'value1',
 'name2' : 'value2'
}
# Encode the query string
querystring = parse.urlencode(parms)
# Make a GET request and read the response
u = request.urlopen(url+'?' + querystring)
resp = u.read()
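
To see what the encoding step produces without sending a request, the following sketch prints the query string and decodes it again. The second value is made up to show how a space and an ampersand are escaped.

```python
from urllib import parse

# The second value is made up to show escaping of ' ' and '&'.
parms = {'name1': 'value1', 'name2': 'value 2&more'}

querystring = parse.urlencode(parms)
print(querystring)   # name1=value1&name2=value+2%26more

# parse_qs reverses the encoding; each key maps to a list of values.
decoded = parse.parse_qs(querystring)
print(decoded)       # {'name1': ['value1'], 'name2': ['value 2&more']}
```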

If you need to send the query parameters in the request body using the POST method, encode them and pass them to urlopen() as the optional data argument, like this:

from urllib import request, parse
# Base URL being accessed
url = 'http://httpbin.org/post'
# Dictionary of query parameters (if any)
parms = {
 'name1' : 'value1',
 'name2' : 'value2'
}
# Encode the query string
querystring = parse.urlencode(parms)
# Make a POST request and read the response
u = request.urlopen(url, querystring.encode('ascii'))
resp = u.read()

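If you need to supply custom HTTP headers in the outgoing request, such as changing the user-agent field, create a dictionary containing the header values, build a Request instance from it and pass that to urlopen(). For example: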

from urllib import request, parse
...
# Extra headers
headers = {
 'User-agent' : 'none/ofyourbusiness',
 'Spam' : 'Eggs'
}
req = request.Request(url, querystring.encode('ascii'), headers=headers)
# Make a request and read the response
u = request.urlopen(req)
resp = u.read()
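
A Request object can be inspected before it is sent, which is a convenient way to check the method and headers offline. In this minimal sketch nothing is transmitted, and the variable names get_req and post_req are illustrative only.

```python
from urllib import request, parse

url = 'http://httpbin.org/post'
body = parse.urlencode({'name1': 'value1'}).encode('ascii')
headers = {'User-agent': 'none/ofyourbusiness', 'Spam': 'Eggs'}

get_req = request.Request(url)                          # no body
post_req = request.Request(url, body, headers=headers)  # body supplied

# urllib chooses the method from the presence of a body.
print(get_req.get_method())               # GET
print(post_req.get_method())              # POST
print(post_req.get_header('User-agent'))  # none/ofyourbusiness
print(post_req.data)                      # b'name1=value1'
```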

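If the service you need to interact with is more complicated than in the examples above, you should probably look at the requests library (http://pypi.python.org/pypi/requests). For example, the following sample reimplements the operations above using requests: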

import requests
# Base URL being accessed
url = 'http://httpbin.org/post'
# Dictionary of query parameters (if any)
parms = {
 'name1' : 'value1',
 'name2' : 'value2'
}
# Extra headers
headers = {
 'User-agent' : 'none/ofyourbusiness',
 'Spam' : 'Eggs'
}
resp = requests.post(url, data=parms, headers=headers)
# Decoded text returned by the request
text = resp.text

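One notable feature of the requests library is the several ways in which it returns the content of a response. In the code above, resp.text gives the response text decoded to Unicode. Accessing resp.content gives the raw binary data instead, and calling resp.json() gives the response content parsed from JSON.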
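
The relationship between the three views can be sketched with the standard library alone. This is only an analogy under stated assumptions: the raw body is made up, and UTF-8 is assumed as the encoding, whereas requests works out the encoding from the response itself.

```python
import json

# Made-up raw body, as a server might send it (UTF-8 encoded JSON).
raw = '{"name": "Dave", "city": "Zürich"}'.encode('utf-8')

content = raw                 # analogous to resp.content: raw bytes
text = raw.decode('utf-8')    # analogous to resp.text: decoded str
data = json.loads(text)       # analogous to resp.json(): Python objects

print(type(content).__name__, type(text).__name__, type(data).__name__)
print(data['city'])
```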

The following sample uses requests to issue a HEAD request and extract a few HTTP header fields from the response:

import requests
resp = requests.head('http://www.python.org/index.html')
status = resp.status_code
last_modified = resp.headers['last-modified']
content_type = resp.headers['content-type']
content_length = resp.headers['content-length']

The following sample uses requests to log in to the Python Package Index (PyPI) with basic authentication:

import requests
resp = requests.get('http://pypi.python.org/pypi?:action=login',
 auth=('user','password'))

The following sample uses requests to pass the HTTP cookies received from a first request on to the next request:

import requests
# First request
resp1 = requests.get(url)
...
# Second requests with cookies received on first requests
resp2 = requests.get(url, cookies=resp1.cookies)

Last but not least, the following sample uses requests to upload content:

import requests
url = 'http://httpbin.org/post'
files = { 'file': ('data.csv', open('data.csv', 'rb')) }
r = requests.post(url, files=files)

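Discussion

For really simple HTTP client code, the built-in urllib module is usually enough. But if you need to do more than simple GET or POST requests, you really cannot rely on it any longer. That is where a third-party module such as requests shines.

For example, if we decided to stick with the standard library and not consider a third-party library like requests, we might have to implement our code with the low-level http.client module. The following code, for instance, shows how to perform a HEAD request: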

from http.client import HTTPConnection
from urllib import parse
c = HTTPConnection('www.python.org', 80)
c.request('HEAD', '/index.html')
resp = c.getresponse()
print('Status', resp.status)
for name, value in resp.getheaders():
 print(name, value)
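
The HEAD request above needs a reachable www.python.org. The following is a minimal sketch for trying the same http.client calls offline, assuming a throwaway local server built with http.server. HeadHandler and the port selection are illustrative and not part of the original example.

```python
import threading
from http.client import HTTPConnection
from http.server import BaseHTTPRequestHandler, HTTPServer


class HeadHandler(BaseHTTPRequestHandler):
    """Illustrative handler that answers every HEAD request with 200."""

    def do_HEAD(self):
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", "0")
        self.end_headers()

    def log_message(self, format, *args):
        pass  # keep the example output quiet


# Port 0 asks the operating system for any free port.
server = HTTPServer(("127.0.0.1", 0), HeadHandler)
port = server.server_address[1]
threading.Thread(target=server.serve_forever, daemon=True).start()

conn = HTTPConnection("127.0.0.1", port, timeout=5)
conn.request("HEAD", "/index.html")
resp = conn.getresponse()
status = resp.status
headers = dict(resp.getheaders())
resp.read()  # a HEAD response has no body; this finishes the exchange
conn.close()

server.shutdown()
server.server_close()

print("Status", status)
print(headers["Content-Type"])
```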

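Likewise, if you have to write code involving proxies, authentication, cookies and other such details, using urllib is awkward and verbose. For example, the following sample authenticates against the Python Package Index: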

import urllib.request
auth = urllib.request.HTTPBasicAuthHandler()
auth.add_password('pypi','http://pypi.python.org','username','password')
opener = urllib.request.build_opener(auth)
r = urllib.request.Request('http://pypi.python.org/pypi?:action=login')
u = opener.open(r)
resp = u.read()
# From here. You can access more pages using opener
...

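Frankly, all of these operations are much simpler in requests.

Testing HTTP client code during development is often frustrating because of all the tricky details to consider (cookies, authentication, HTTP headers, encodings and so on). For such tasks, consider using the httpbin service (http://httpbin.org). This site receives the requests you send and echoes the information back as a JSON response. Here is an interactive example: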

>>> import requests
>>> r = requests.get('http://httpbin.org/get?name=Dave&n=37',
... headers = { 'User-agent': 'goaway/1.0' })
>>> resp = r.json()
>>> resp['headers']
{'User-Agent': 'goaway/1.0', 'Content-Length': '', 'Content-Type': '',
'Accept-Encoding': 'gzip, deflate, compress', 'Connection':
'keep-alive', 'Host': 'httpbin.org', 'Accept': '*/*'}
>>> resp['args']
{'name': 'Dave', 'n': '37'}
>>>

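Experimenting on a site such as httpbin.org is often preferable to experimenting on a real site. It is especially useful when you face a risk such as an account being locked after three failed login attempts (do not try to write your own HTTP authentication client to log in to your bank account).

Although it is not covered in this section, requests also supports many more advanced HTTP client protocols, such as OAuth. The documentation for requests (http://docs.python-requests.org) is excellent (frankly, better than anything that could be provided in this short space), so refer to it for more information.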

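Calling a Python script from PHP to convert Word to HTML, and handling call failures

Background: our company ran into a publishing problem. Many people like to write in Word and then publish to the website. PHP has some usable libraries, but they never handled images well, so I thought of Python. After some investigation, converting Word to HTML with Python turned out to be really convenient. But how could it be integrated with the server? Our server is developed in PHP.

1: The Python script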

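#!/usr/bin/python
# -*- coding: UTF-8 -*-
# Python 2 script: reload(sys) and sys.setdefaultencoding() are Python 2 idioms.
import sys
from pydocx import PyDocX

reload(sys)
sys.setdefaultencoding('utf8')

FileName = sys.argv[1]   # file name argument: the Word file to convert
ShortName = sys.argv[2]  # short name argument (not used below)
html = PyDocX.to_html(FileName)
# f = open("/www/wwwroot/micuer.com/pythoncode/runtime/99.txt", 'w')  # full path on the server
# f.write(html)
# f.close()
print(html)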

2: The PHP handler

public function uploadword()
{
    try {
        $file = request()->file("file");
        // Upload to the local server
        $savename = \think\facade\Filesystem::disk('upload')->putFile('word', $file);
        $shotrname = time() . ".txt"; // short name
        $savename = "/www/wwwroot/micuer.com/data/upload/" . $savename; // Request::domain().
        $python_file_name = "/www/wwwroot/micuer.com/pythoncode/WordToHtml.py";
        // Assemble the command
        $cmd = "python {$python_file_name} " . $savename . " {$shotrname}  2>error.txt 2>&1";
        $res = exec($cmd, $array, $ret);
        return json(["code" => 200, "msg" => "success", "data" => $savename, "cmd" => $cmd, "array" => $array]);
    } catch (think\exception\ValidateException $e) {
        return json(["code" => 40000, "msg" => $e->getMessage()]);
    }
}


The handler uses PHP's exec function to call the Python script and returns the resulting HTML to the front end.


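In fact, I ran into quite a few problems while working on this solution. For example, the script succeeded when run from the command line but failed when run through the exec function.
I consulted this reference: https://my.oschina.net/u/4427610/blog/3155816
That is: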

exec("python python_test.py 2>error.txt 2>&1", $array, $ret);

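In bash, the numbers 0, 1 and 2 stand for STDIN_FILENO, STDOUT_FILENO and STDERR_FILENO: standard input (usually the keyboard), standard output (usually the screen, or more precisely the user's terminal) and standard error (error messages).
The 2>&1 in the command above redirects standard error to standard output, so error messages are also saved into $array. Redirections are applied from left to right, so the later 2>&1 takes effect and error.txt is created but stays empty.
Printing $array showed that the failure was a permissions problem. So I switched to direct output, that is, the print(html) call in the Python script.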
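
The effect of merging standard error into standard output can be reproduced from Python. This is a minimal sketch, not the PHP code from this post: the child command is made up, and stderr=subprocess.STDOUT plays the role of 2>&1.

```python
import subprocess
import sys

# Made-up child process that writes one line to each stream.
child = "import sys; print('to stdout'); print('to stderr', file=sys.stderr)"
cmd = [sys.executable, "-c", child]

# Streams captured separately: error messages do not appear in stdout.
separate = subprocess.run(cmd, capture_output=True, text=True)

# stderr=subprocess.STDOUT is the equivalent of the shell's 2>&1.
merged = subprocess.run(
    cmd, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True
)

print(repr(separate.stdout), repr(separate.stderr))
print(repr(merged.stdout))
```

The order of the two lines in the merged output depends on buffering in the child process, so code that reads it should not rely on that order.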

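Points to note:
1: Check execution permissions.
2: In exec("python python_test.py 2>error.txt 2>&1", $array, $ret); the $array variable receives the value output by print(html).
3: Use full paths in every script wherever possible.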
