Processing web data in Python: learning to use the html2text and readability modules

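Today, while processing HTML data, I came across a few interesting Python libraries. I am noting them here so I can study and use them properly when I have time. The two covered in this post are html2text and readability. html2text processes HTML and returns the text it contains, formatted as Markdown; readability extracts specific pieces of information from HTML, such as the article title and the main content.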

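Previously, when processing crawled HTML, I mostly wrote my own regular expressions or XPath rule sets to extract the fields I needed. That approach is flexible, but writing the rules is time-consuming, and it becomes hard to keep up with when the data is complex. A ready-made module that handles part of the job is a welcome alternative, and these two libraries do it well. This post only tries them out briefly; in similar projects later on, part of the work can be handed over to them.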
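
For comparison, the hand-written approach described above might look like the following minimal sketch, which pulls the page title out of raw HTML with a regular expression. The function name extract_title and the sample markup are illustrative assumptions, not code from the original project.

```python
import re

# Matches the contents of the first <title> element, across newlines.
TITLE_RE = re.compile(r"<title[^>]*>(.*?)</title>", re.IGNORECASE | re.DOTALL)


def extract_title(html):
    """Return the text of the first <title> tag, or None if there is none."""
    match = TITLE_RE.search(html)
    if match is None:
        return None
    # Collapse runs of whitespace so multi-line titles come back on one line.
    return " ".join(match.group(1).split())


sample = "<html><head><title>\n  Hello,   world \n</title></head><body></body></html>"
print(extract_title(sample))
```

Every new field needs another pattern like this one, which is the maintenance cost the paragraph above refers to.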

The hands-on code is below:

#!/usr/bin/env python
# encoding:utf-8

'''
__Author__:沂水寒城
Purpose: using the html2text and readability modules
'''

import urllib.request

import requests
import html2text
from readability import Document


def test_func():
    '''
    Example from the official documentation: clean up HTML
    '''
    print(html2text.html2text("<p>Hello, world.</p>"))
    h = html2text.HTML2Text()
    h.ignore_links = True  # keep the link text, drop the link target
    # The link in the original snippet was lost; example.com is a placeholder
    print(h.handle("<p>Hello, <a href='https://example.com/'>world</a>!</p>"))


def test_func2(url):
    '''
    Fetch the HTML at the given URL and convert it to text
    '''
    # urlopen() returns bytes; decode them before handing them to html2text
    html = urllib.request.urlopen(url).read().decode('utf-8', errors='replace')
    h = html2text.HTML2Text()
    h.ignore_links = True
    print(h.handle(html))


def test_func3(url):
    '''
    Extract the title and main content from the given URL
    '''
    response = requests.get(url)
    doc = Document(response.text)
    print(doc.title())
    # summary() returns the main article content, still with HTML tags
    readable_article = doc.summary()
    readable_title = doc.short_title()
    print('readable_article: ', readable_article)
    print('readable_title: ', readable_title)


if __name__ == '__main__':
    url = 'https://mbd.baidu.com/newspage/data/landingsuper?context=%7B%22nid%22%3A%22news_2588586383061242738%22%7D&n_type=0&p_from=1'
    test_func()
    print('-|' * 50)
    test_func2(url)
    print('-|' * 50)
    test_func3(url)

Running the script prints the output of the three functions in turn, separated by lines of -| characters.

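As everyone knows, Python's greatest strength is the rich collection of third-party libraries in its community, and because it is open source, more and more developers keep contributing improvements to it.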

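In the years ahead, work in artificial intelligence, big data, and blockchain can be expected to lean heavily on Python.

Ahem! That is starting to sound like an advertisement.

In today's era of shared information on the internet, what matters most? Data. What is most valuable? Data. What most directly shows technical skill? Still data.

So today's topic is: how to extract the text from files in various formats.

Common file formats include: txt plain text, doc Word documents, html web pages, excel spreadsheets, and the less common mht files.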

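I. Handling HTML pages with Python

An HTML file is text made up of front-end markup tags plus the text content. It can be opened directly in Chrome, which renders the formatted text clearly.

Reading an HTML file in Python works the same way as reading a txt file: just open the file and read it.

The code for reading is: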

with open(html_path, "r", encoding="utf-8") as f:
 file = f.read()

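file holds the text of the HTML file, that is, the page's markup together with its content.

II. Handling Excel spreadsheets with Python

Python has third-party libraries for working with Excel spreadsheets directly: xlrd for reading and xlwt for writing. Calling the corresponding methods is enough to read or write spreadsheet data.

The code for reading an Excel file is: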

import xlrd
from write_to_excel import WriteToExcel  # the author's own helper module

# Raw string, so the backslashes in the Windows path are not treated as escapes
filepath = r"C:\Users\Administrator\Desktop\新建文件夾\笨笨 前程6份 武漢.xls"
sheet_name = "UserList"
rb = xlrd.open_workbook(filepath)
sheet = rb.sheet_by_name(sheet_name)
# clox_list = [0, 9, 14, 15, 17]
# Start at row 1 to skip the header row
for row in range(1, sheet.nrows):
    w = WriteToExcel()
    # for clox in clox_list:
    name = sheet.cell(row, 0).value
    phone = sheet.cell(row, 15).value
    address = sheet.cell(row, 9).value
    major = sheet.cell(row, 14).value
    age = sheet.cell(row, 8).value

Here row is the row index in the sheet, and cell(row, col) returns the cell at the given row and column; its value attribute holds the data.
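
The row-and-column lookup above can be wrapped in a small helper that returns one dict per row. This is only a sketch: the helper name read_rows and the FakeSheet stand-in are hypothetical, added so the example runs without an .xls file. The only thing assumed about a real xlrd sheet is the nrows attribute and the cell(row, col).value access already used in the snippet above.

```python
from collections import namedtuple

# Column indexes taken from the snippet above; adjust them for your own sheet.
COLUMNS = {"name": 0, "age": 8, "address": 9, "major": 14, "phone": 15}


def read_rows(sheet, columns=COLUMNS, first_row=1):
    """Return one dict per data row, skipping the header row."""
    records = []
    for row in range(first_row, sheet.nrows):
        records.append({field: sheet.cell(row, col).value
                        for field, col in columns.items()})
    return records


# Hypothetical stand-in that mimics the two sheet features used above.
Cell = namedtuple("Cell", "value")


class FakeSheet:
    def __init__(self, rows):
        self._rows = rows
        self.nrows = len(rows)

    def cell(self, row, col):
        return Cell(self._rows[row][col])


header = ["col%d" % i for i in range(16)]
data = [""] * 16
data[0], data[8], data[9], data[14], data[15] = "Alice", 30, "Wuhan", "Physics", "555-0100"
records = read_rows(FakeSheet([header, data]))
print(records)
```

With a real workbook, the object returned by sheet_by_name would be passed in place of FakeSheet.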

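III. Reading doc files with Python

Reading doc files is the most troublesome case in Python. The processing logic is complicated, and there are many ways to go about it.

I did not find a third-party Python library that handles doc files directly, but there is one that handles docx. So the approach is to convert the doc file to docx first, then use the third-party library pydocx to read the content.

Note that you should not simply rename the file extension from doc to docx. pydocx cannot read a docx file produced by renaming alone.

Instead, we can use another library to do the conversion (here win32com, which drives Microsoft Word and therefore needs Windows with Word installed).

The code is: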

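def doSaveAas(self, doc_path):
    """
    Convert a doc file to a docx file
    :rtype: object
    """
    # Swap only the extension; str.replace("doc", "docx") would also
    # rewrite any other "doc" that appears in the path
    docx_path = os.path.splitext(doc_path)[0] + ".docx"
    word = wc.Dispatch('Word.Application')
    doc = word.Documents.Open(doc_path)  # the source file
    # 12 is wdFormatXMLDocument, i.e. the docx format
    doc.SaveAs(docx_path, 12, False, "", True, "", False, False, False, False)  # the converted file
    doc.Close()
    word.Quit()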
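
One detail worth care when deriving the output path: a plain string replace of "doc" with "docx" rewrites every occurrence of "doc" in the path, not just the extension. The sketch below contrasts that with changing only the suffix. The helper name docx_path_for and the sample path are illustrative assumptions; PureWindowsPath is used so the example behaves the same on any operating system.

```python
from pathlib import PureWindowsPath


def docx_path_for(doc_path):
    """Return doc_path with its extension changed to .docx."""
    return str(PureWindowsPath(doc_path).with_suffix(".docx"))


path = r"C:\docs\report.doc"
naive = path.replace("doc", "docx")  # also rewrites the "docs" folder name
safe = docx_path_for(path)
print(naive)
print(safe)
```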

Imports required by the code:

import os
import zipfile
from win32com import client as wc
import xlrd
from bs4 import BeautifulSoup
from pydocx import PyDocX
from lxml import html
from xpath_content import XpathContent
from write_to_excel import WriteToExcel

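There are many ways to process docx files in Python; which one to use depends on your own needs.

No.1 Unzipping the docx file

A docx file is essentially a zip archive. Once it is unzipped, the individual parts of the original document are accessible.

Among the files in the unzipped archive, the text content is stored in word/document.xml.

So the first method is to treat the docx as a zip archive, unzip it, and read the contents of word/document.xml.

The code is: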

def get_content(self):
    """
    Get the text content of a docx file
    :rtype: object
    """
    os.chdir(r"C:\Users\Administrator\Desktop\新建文件夾")  # change to the directory containing the file
    # Rename to .ZIP (optional: zipfile can also open the .docx as it is)
    os.rename("51 2014.09.12 1份Savannah.docx", "51 2014.09.12 1份Savannah.ZIP")
    with zipfile.ZipFile('51 2014.09.12 1份Savannah.ZIP', 'r') as f:  # open the archive
        xml = f.read("word/document.xml")
    wordObj = BeautifulSoup(xml.decode("utf-8"), "html.parser")
    # print(wordObj)
    texts = wordObj.find_all("w:t")  # each <w:t> element holds a run of text
    content = []
    for text in texts:
        content.append(text.text)
    content_str = "".join(content)
    return content_str

What you end up with is all the text in the docx file.
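
The same idea can be sketched with the standard library alone. The version below reads word/document.xml straight from the archive, without renaming the file, and collects the w:t text runs with xml.etree.ElementTree instead of BeautifulSoup. The function name docx_text is an assumption, and the tiny in-memory archive stands in for a real docx file so the example is runnable as it is.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace that the w: prefix refers to.
W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"


def docx_text(source):
    """Return the text of every <w:t> run in word/document.xml, concatenated."""
    with zipfile.ZipFile(source) as archive:
        xml_bytes = archive.read("word/document.xml")
    root = ET.fromstring(xml_bytes)
    return "".join(node.text or "" for node in root.iter("{%s}t" % W_NS))


# Build a minimal archive in memory; a real .docx path works the same way.
document_xml = (
    '<?xml version="1.0" encoding="UTF-8"?>'
    '<w:document xmlns:w="%s"><w:body>'
    "<w:p><w:r><w:t>Hello, </w:t></w:r><w:r><w:t>world</w:t></w:r></w:p>"
    "</w:body></w:document>" % W_NS
)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("word/document.xml", document_xml)
buffer.seek(0)
print(docx_text(buffer))
```

Like the code above, this joins the runs without separators, so paragraph breaks are not preserved.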

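No.2 Converting the docx file to a text format Python can handle

The first method follows the internal structure of a docx file, and the process is a bit tedious. Is there a way to read the content of a docx file in one step? I did not find one at the time, so I set that idea aside.

If reading docx directly is off the table, is there a way to convert a docx file into a text format that Python handles easily?

That one exists. As mentioned earlier, Python has a huge number of third-party libraries, and after a good deal of searching I found one that converts docx files: pydocx. Its PyDocX.to_html() method converts a docx file straight to HTML. Surprised? Pleased?

The code for this second method, converting the format, is: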

def docx_to_html(self, docx_path):
    """
    Convert a docx file to HTML
    :rtype: object
    """
    # docx_path = "C:\\Users\Administrator\Desktop\新建文件夾\\51 2014.09.12 1份Savannah.docx"
    response = PyDocX.to_html(docx_path)
    return response  # the HTML content

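The response obtained is the HTML content of the document.

IV. Handling mht files with Python

An mht file (an MHTML web archive) is a format traditionally opened in Internet Explorer; when I opened one in Chrome, it showed up as garbled text.

No.1 Requesting the mht content while posing as IE

The most basic way to read an mht file is to fake an IE browser request.

Use the requests library to send a GET request to the page URL, with request headers that mimic IE.

In theory this works. In practice it is not recommended, for reasons that are probably obvious.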

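No.2 Converting the file format

Now for the serious method. Can an mht file be renamed to some other format and then read directly?

docx: no. html: no. excel: certainly not.

There is only one truth!

A docx produced by simply changing the extension cannot be read.

So what is left? That's right: changing it to a doc file.

It is an odd trick, but it came from a flash of inspiration.

An mht file can be turned into a doc file just by changing its extension, and the text can then be read with the doc method described above.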
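
As an aside, and as a different route from the rename-to-doc trick above: an mht file is a MIME multipart document, so Python's standard email package can usually parse it and hand back the embedded HTML. The sketch below is not the method used in this article. The function name mht_html and the hand-built sample are assumptions, and real files saved by a browser or by Word may need extra handling.

```python
import email
from email import policy


def mht_html(raw_bytes):
    """Return the decoded HTML of the first text/html part, or None."""
    message = email.message_from_bytes(raw_bytes, policy=policy.default)
    for part in message.walk():
        if part.get_content_type() == "text/html":
            # get_content() undoes the transfer encoding and the charset.
            return part.get_content()
    return None


# A tiny hand-written MHTML document standing in for a real .mht file.
sample = "\r\n".join([
    "MIME-Version: 1.0",
    'Content-Type: multipart/related; boundary="PART"; type="text/html"',
    "",
    "--PART",
    'Content-Type: text/html; charset="utf-8"',
    "Content-Transfer-Encoding: quoted-printable",
    "Content-Location: http://example.com/page.html",
    "",
    "<html><body><p>Hello=2C mht</p></body></html>",
    "--PART--",
    "",
]).encode("ascii")

html = mht_html(sample)
print(html)
```

For a file on disk, the bytes would come from opening the .mht file in binary mode and calling read().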

How do we get the text out of HTML?

HTML content is tag-structured page data; the text can be pulled out with regular expressions (re) or XPath.

If readers are interested, a follow-up chapter will cover the usage rules of re and XPath in detail.
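
Until then, here is a minimal sketch of stripping the tags with the standard library's html.parser module, named plainly as an alternative to the re and XPath routes mentioned above. The class name TextExtractor and the helper html_to_text are assumptions made for illustration.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep non-empty text that is not inside a skipped element.
        if not self._skip_depth and data.strip():
            self.chunks.append(data.strip())


def html_to_text(html):
    """Return the text chunks of an HTML string, joined by single spaces."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


sample = ("<html><head><style>p {color: red;}</style></head>"
          "<body><h1>Title</h1><p>Hello, world.</p></body></html>")
print(html_to_text(sample))
```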


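Hello everyone, I am IT共享者, also known as Pipi. This article covers CSS text styles.

I. Text color (color)

The color property sets the color of text.

In CSS, a color is most often specified as:

A hexadecimal value - e.g. "#FF0000".
An RGB value - e.g. "rgb(255,0,0)".
A color name - e.g. "red".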
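
The hexadecimal and RGB forms are two spellings of the same three numbers. As a small illustration, written in Python to match the code earlier on this page, the sketch below converts between them. The function names hex_to_rgb and rgb_to_hex are made up for this example.

```python
def hex_to_rgb(value):
    """Convert a '#RRGGBB' string to an (r, g, b) tuple of ints in 0..255."""
    digits = value.lstrip("#")
    if len(digits) != 6:
        raise ValueError("expected a value like '#FF0000'")
    # Each pair of hex digits is one channel: red, green, blue.
    return tuple(int(digits[i:i + 2], 16) for i in (0, 2, 4))


def rgb_to_hex(r, g, b):
    """Convert three 0..255 channel values to a '#RRGGBB' string."""
    return "#{:02X}{:02X}{:02X}".format(r, g, b)


print(hex_to_rgb("#FF0000"))
print(rgb_to_hex(255, 0, 0))
```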

The text color for a page is defined in the body selector:

<html>
    <head>
        <meta charset="utf-8">
        <meta name="viewport" content="width=640, user-scalable=no">
        <title>Project</title>
        <style>
            body {
                color: blue;
            }


            h1 {
                color: #00ff00;
            }


            h2 {
                color: rgb(255, 0, 0);
            }
</style>
    </head>


    <body>
        <h2>hello world</h2>
        <h1>welcome to CaoZhou</h1>
    </body>


</html>

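Note: the W3C CSS validator warns when you set color without also setting background-color, so it is good practice to define both.

II. Properties

1. text-align: text alignment

The text-align property sets the horizontal alignment of text.

Text can be centered, aligned to the left or right, or justified.

When text-align is set to "justify", each line is stretched so that all lines have equal width and the left and right edges are straight (as in magazines and newspapers).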

<!doctype html>
<html lang="en">


    <head>
        <meta charset="UTF-8">
        <title>Document</title>
        <style>
            h1 {
                text-align: center;
            }


            p.date {
                text-align: right;
            }


            p.main {
                text-align: justify;
            }
</style>
    </head>


    <body>


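        <p class="date">March 14, 2015</p>
        <p class="main">Once there was a scholar who had arranged to marry his fiancée on a certain day. When the day came, she married someone else. The blow left the scholar bedridden. A wandering monk passing by took a mirror from his robe and told the scholar to look. He saw a vast sea and a murdered woman lying unclothed on the beach. One man passed, glanced at her, shook his head, and left. A second man passed, took off his coat, covered the body, and left. A third man came, dug a grave, and carefully buried her. The monk explained: the woman on the beach was your fiancée in her previous life. You were the second passer-by, who gave her a coat. She loved you in this life only to repay that kindness. But the one she must repay with a whole lifetime is the last man, who buried her, and he is her husband now. The scholar understood, and recovered.


        </p>
        <p><b>Note:</b> Resize the browser window to see how "justify" works.</p>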
    </body>


</html>

2. text-decoration: text decoration

The text-decoration property sets or removes decorations on text.

From a design point of view, text-decoration is mainly used to remove the underline from links:

<!doctype html>
<html lang="en">


    <head>
        <meta charset="UTF-8">
        <title>Document</title>
        <style>
            .none {}


            .del {
                text-decoration: none;
            }
</style>
    </head>


    <body>
        <p>Original appearance</p>
        <a href="#" class="none">wwwwwwwwwwwwwwwwww</a>
        <p>Underline removed</p>
        <a href="#" class="del">wwwwwwwwwwwwwwwwwwwww</a>
    </body>


</html>

Text can also be decorated like this:

<html>
    <head>
        <meta charset="utf-8">
        <meta name="viewport" content="width=640, user-scalable=no">
        <title>Project</title>
        <style>
            h1 {
                text-decoration: overline;
            }


            h2 {
                text-decoration: line-through;
            }


            h3 {
                text-decoration: underline;
            }
</style>
    </head>


    <body>
        <h1>This is heading 1</h1>
        <h2>This is heading 2</h2>
        <h3>This is heading 3</h3>
    </body>


</html>

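Note: underlining text that is not a link is not recommended, because it often confuses users.

3. text-transform: text case conversion

The text-transform property specifies the uppercase and lowercase letters in a piece of text.

uppercase: converts all letters to uppercase.
lowercase: converts all letters to lowercase.
capitalize: capitalizes the first letter of each word.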

<!DOCTYPE html>
<html>


    <head>
        <meta charset="utf-8">
        <meta name="viewport" content="width=640, user-scalable=no">
        <title>Project</title>
        <style>
            p.uppercase {
                text-transform: uppercase;
            }


            p.lowercase {
                text-transform: lowercase;
            }


            p.capitalize {
                text-transform: capitalize;
            }
</style>
    </head>


    <body>
        <p class="uppercase">This is some text.</p>
        <p class="lowercase">This is some text.</p>
        <p class="capitalize">This is some text.</p>
    </body>


</html>
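
To make the expected rendering of the three values concrete, here is a rough Python approximation of what each one does to the sample sentence. It is only a sketch: the function name text_transform is an assumption, and browsers apply their own word-boundary and language rules that this does not reproduce.

```python
def text_transform(text, mode):
    """Approximate the effect of CSS text-transform on plain ASCII text."""
    if mode == "uppercase":
        return text.upper()
    if mode == "lowercase":
        return text.lower()
    if mode == "capitalize":
        # Upper-case only the first letter of each space-separated word;
        # the remaining letters are left unchanged.
        return " ".join(word[:1].upper() + word[1:] for word in text.split(" "))
    raise ValueError("unsupported mode: %r" % mode)


for mode in ("uppercase", "lowercase", "capitalize"):
    print(mode, "->", text_transform("This is some text.", mode))
```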

4. text-indent: text indentation

The text-indent property specifies the indentation of the first line of a block of text.

p {text-indent:50px;}

5. letter-spacing: character spacing

Increases or decreases the space between characters.

<style>
    h1 {
        letter-spacing: 2px;
    }
    h2 {
        letter-spacing: -3px;
    }
</style>

6. line-height: line height

Specifies the space between lines in a paragraph.

<html>
    <head>
        <meta charset="utf-8">
        <meta name="viewport" content="width=640, user-scalable=no">
        <title>Project</title>
        <style>
            p.small {
                line-height: 70%;
            }


            p.big {
                line-height: 200%;
            }
</style>
    </head>


    <body>
        <p>
            This is a paragraph with a standard line-height.<br> This is a paragraph with a standard line-height.<br> The default line height in most browsers is about 110% to 120%.<br>
        </p>


        <p class="small">
            This is a paragraph with a smaller line-height.<br> This is a paragraph with a smaller line-height.<br> This is a paragraph with a smaller line-height.<br> This is a paragraph with a smaller line-height.<br>
        </p>


        <p class="big">
            This is a paragraph with a bigger line-height.<br> This is a paragraph with a bigger line-height.<br> This is a paragraph with a bigger line-height.<br> This is a paragraph with a bigger line-height.<br>
        </p>


    </body>


</html>
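
A percentage line-height is computed relative to the element's own font size. As a quick worked example in Python, assuming a 16px font size (a common browser default, not something stated in the example above), the two classes work out as follows. The function name line_height_px is made up for this sketch.

```python
def line_height_px(font_size_px, percent):
    """Return the computed line height in pixels for a percentage value."""
    return font_size_px * percent / 100


print(line_height_px(16, 70))   # p.small
print(line_height_px(16, 200))  # p.big
```

So under that assumption p.small gets lines 11.2px tall, shorter than the text itself, which is why its lines overlap, while p.big gets 32px.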

7. word-spacing: word spacing

Increases the white space between words in a paragraph.

<html>
    <head>
        <meta charset="utf-8">
        <meta name="viewport" content="width=640, user-scalable=no">
        <title>Project</title>
        <style type="text/css">
            p {
                word-spacing: 30px;
            }
</style>
    </head>


    <body>


        <p>
            This is some text. This is some text.
        </p>


    </body>


</html>

8. vertical-align: vertical alignment of an element

Sets the vertical alignment of an image relative to the surrounding text.

<html>
    <head>
        <meta charset="utf-8">
        <meta name="viewport" content="width=640, user-scalable=no">
        <title>Project</title>
        <style>
            img{
                width: 200px;
                height: 100px;
            }
            img.top {
                vertical-align: text-top;


            }


            img.bottom {
                vertical-align: text-bottom;


            }
</style>
    </head>


    <body>
        <p>An <img src="img/logo.png"  /> image with a default alignment.</p>
        <p>An <img class="top" src="img/logo.png" /> image with a text-top alignment.</p>
        <p>An <img class="bottom" src="img/logo.png" /> image with a text-bottom alignment.</p>
    </body>


</html>

9. text-shadow: text shadow

Adds a shadow to text.

<html>
    <head>
        <meta charset="utf-8">
        <meta name="viewport" content="width=640, user-scalable=no">
        <title>Project</title>
        <style>
         h1{
            text-shadow: 2px 2px #FF0000;
     }
</style>
    </head>


    <body>
    <h1>Text-shadow effect</h1>
    </body>


</html>

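III. Summary

This article covered how to work with CSS text styles in practice, walking through the properties that change how text is presented. The examples are complete HTML pages, which keeps the code structure clear and lets you open each one in a browser to see the effect for yourself.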
