Front-End Talk 53: Component Basics

Share interests, spread joy, broaden horizons, and leave something good behind!

Dear reader, this is LearningYard Academy.

Today we bring you Front-End Talk 53: Component Basics. Welcome!

Components are one of the most powerful features of Vue.js. A component can extend HTML elements and encapsulate reusable code. The component system lets us build large applications out of small, independent, reusable components, and the interface of almost any type of application can be abstracted into a tree of components.

Writing pages the traditional way: in traditional front-end development, each HTML document is paired with one or more CSS and JS files. Several pages often share the same parts, such as the navigation bar or the footer, so the same HTML structure, CSS styles, and JS files end up being reused across pages. If any one of those shared parts is changed, every page that reuses it changes with it, which makes dependencies tangled and maintenance difficult. In addition, every page is its own HTML document: each new page adds another HTML document plus one or more CSS and JS files, so the number of files grows quickly. Because most pages are written as independent units, code reuse stays low.

Writing pages with components: put simply, this means splitting a complete page into individual components. For example, a page may contain a header navigation bar, the main content, and a footer. We can make the header navigation one component and split the remaining parts into components in the same way. The header navigation component contains the HTML structure, CSS styles, and JS code needed to implement the navigation. Each component is responsible only for its own structure, style, and interaction and does not interfere with the others; together, these components make up the complete page. Once a page is split into components we can write component-based code. The most obvious benefit is reuse: a part shared by several pages is written once as a component and imported wherever it is needed.

Components and modularization. Component: a component is a reusable Vue instance with a name. We can use it as a custom element inside a root Vue instance created with new Vue. Module: code that belongs to the same feature or business area is isolated (packaged) into an independent module that can run on its own. Modules can be divided by page, by feature, or at some other granularity, and they sit in the business framework layer. Modules call each other through interfaces in order to reduce coupling: where the main application used to be coupled directly to a module, the main application is now coupled to an interface and the interface is coupled to the module.

That's all for today. If you have thoughts about today's article, please leave a comment. See you tomorrow, and have a happy day!

Translation: Google Translate

This article is an original work of LearningYard Academy. If it infringes any rights, please contact us for removal.

Text & layout | Li Shiyang

Review | Li Huan

Chapter 13: Unstructured Data Extraction

When crawling data, we need to parse pages and extract data from them.

Generally, what we crawl is the content of a website or an application, and the goal is to extract what is valuable from it. That content falls into two kinds: unstructured data and structured data.

Unstructured data: the data comes first, the structure afterwards.

Structured data: the structure comes first, the data afterwards.

Different kinds of data need to be handled in different ways.

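13.1 Regular expressions

13.1.1 Why learn regular expressions

A crawler really has only four main steps:

1. Define the target (know which scope or website you are going to search)

2. Crawl (download the full content of the target pages)

3. Extract (discard the data that is of no use to us)

4. Process the data (store and use it the way we want)

The earlier examples skipped step 3, the "extract" step. What we download is the entire web page; this data is large and messy, and most of it is of no interest to us, so we need to filter it and match out the parts we need.

For filtering text or matching rules, regular expressions are the most powerful tool, and they are indispensable in Python crawling.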

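13.1.2 What is a regular expression

A regular expression (regex) is typically used to search for and replace text that fits a certain pattern (rule).

A regular expression is a formula for operating on strings: predefined special characters, and combinations of them, form a "rule string" that expresses a filtering logic for strings.

Given a regular expression and another string, we can do the following:

1. Check whether the given string conforms to the filtering logic of the regular expression ("match");

2. Use the regular expression to pull the specific part we want out of the text ("filter").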
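The two goals above can be sketched as follows. This is a minimal illustration, not part of the original tutorial; the sample text and the date pattern are assumptions chosen for the example.

```python
import re

# Hypothetical sample text, used only for illustration.
text = "Order 1024 shipped on 2023-05-17"

date_pattern = re.compile(r"\d{4}-\d{2}-\d{2}")

# Goal 1 ("match"): does the whole string conform to the pattern?
is_date = date_pattern.fullmatch("2023-05-17") is not None
not_date = date_pattern.fullmatch("17 May 2023") is not None

# Goal 2 ("filter"): pull out only the part we care about.
found = date_pattern.search(text)
extracted = found.group() if found else None

print(is_date, not_date, extracted)
```

fullmatch() answers the yes/no question for a whole string, while search() locates the interesting fragment inside a longer text.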

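13.1.3 Regular expression matching rules

1. Character matching rules.

2. Predefined character sets (these can be used inside a character class [...]).

3. Quantifiers (placed after a character or after a group (...)).

4. Boundary matching.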
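The four categories of rules listed above can be illustrated with one short snippet each. This is a sketch with an invented sample string; it covers only a few representative tokens from each category, not the full rule tables.

```python
import re

# Hypothetical sample string for illustration.
sample = "cat bat 2024 rat_9"

# 1. Character matching: a character class [cb] followed by literal 'at'.
char_hits = re.findall(r"[cb]at", sample)

# 2. Predefined sets: \d is a digit (also \w word character, \s whitespace).
digit_runs = re.findall(r"\d+", sample)

# 3. Quantifiers: {4} means exactly four repetitions of the preceding item.
four_digits = re.findall(r"\d{4}", sample)

# 4. Boundaries: ^ anchors at the start of the string, \b is a word boundary.
starts_with_cat = re.search(r"^cat\b", sample) is not None

print(char_hits, digit_runs, four_digits, starts_with_cat)
```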

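13.1.4 Loading the regular expression module in Python 3

In Python, regular expressions are available through the built-in re module.

import re

One point needs special attention: regular expressions use the backslash (\) to escape special characters, and Python string literals use the backslash for escapes too. To keep a string exactly as written (a raw string), add the r prefix.

Example:

import re

# Example 1: ordinary string, \t is interpreted as a tab
str1 = 'nihao\tinghai'
print(str1)

# Example 2: raw string, \t stays as a backslash followed by t
str2 = r'nihao\tinghai'
print(str2)

Output:

nihao	inghai

nihao\tinghai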
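Why the r prefix matters for patterns can be seen by counting characters. The snippet below is an illustrative sketch; the Windows-style path is an assumed example value.

```python
import re

plain = "\t"   # one character: a tab
raw = r"\t"    # two characters: a backslash and the letter t
lengths = (len(plain), len(raw))

# Matching one literal backslash: the regex needs \\ , which is written
# as "\\\\" in an ordinary string but simply r"\\" in a raw string.
path = r"C:\temp"  # hypothetical sample value containing one backslash
without_raw = re.findall("\\\\", path)
with_raw = re.findall(r"\\", path)

print(lengths, without_raw == with_raw)
```

Both findall() calls use the same regular expression; the raw string just avoids doubling every backslash a second time.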

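13.1.5 The compile function

The compile function compiles a regular expression and produces a regular expression (Pattern) object, whose methods such as match() and search() can then be used.

Syntax:

re.compile(pattern[, flags])

Parameters:

pattern: a regular expression in string form

flags: optional; the matching mode, such as ignoring case or multi-line mode. The available values are:

re.I ignore case

re.L make the special character sets \w, \W, \b, \B, \s, \S depend on the current locale

re.M multi-line mode

re.S make '.' match any character, including newlines (by default '.' does not match a newline)

re.U make the special character sets \w, \W, \b, \B, \d, \D, \s, \S depend on the Unicode character properties database

re.X verbose mode: for readability, whitespace in the pattern and comments after '#' are ignored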
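The effect of the most commonly used flags can be checked directly. The following is a small sketch with invented sample strings; flags can be combined with the | operator.

```python
import re

text = "Hello\nworld"  # hypothetical two-line sample

# re.I: case-insensitive matching.
ignore_case = re.findall(r"hello", text, re.I)

# re.M: ^ and $ also match at the start and end of each line.
line_starts = re.findall(r"^\w+", text, re.M)

# re.S: '.' also matches the newline between the two words.
without_s = re.search(r"Hello.world", text)
with_s = re.search(r"Hello.world", text, re.S)

# re.X: whitespace and '#' comments inside the pattern are ignored.
verbose = re.compile(r"""
    \d{4}   # year
    -       # separator
    \d{2}   # month
""", re.X)
verbose_ok = verbose.fullmatch("2024-05") is not None

# Flags are combined with |.
combined = re.findall(r"^world$", "WORLD\nworld", re.I | re.M)

print(ignore_case, line_starts, without_s, with_s is not None, verbose_ok, combined)
```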

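Example:

import re

pattern = re.compile(r'\d+')  # matches one or more digits
m = pattern.match('one12twothree34four')  # tries to match at the start of the string: no match
print(m)
m = pattern.match('one12twothree34four', 2, 10)  # starts matching at 'e' (index 2): no match
print(m)
m = pattern.match('one12twothree34four', 3, 10)  # starts matching at '1' (index 3): match
print(m)

Output:

None

None

<_sre.SRE_Match object; span=(3, 5), match='12'>

(On Python 3.7 and later the last line is printed as <re.Match object; span=(3, 5), match='12'>.)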

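13.1.6 Regular expression objects

re.compile() returns a Pattern object (called RegexObject in older documentation).

A successful match returns a Match object (MatchObject in older documentation), which provides:

group() returns the string matched by the regular expression.

start() returns the position where the match starts.

end() returns the position where the match ends.

span() returns a tuple containing the (start, end) positions of the match.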

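13.1.7 Two ways to use the re module in Python 3

First way: use the compile function

1. Use compile() to compile the regular expression string into a Pattern object.

2. Use the methods of the Pattern object to search the text and obtain the result, a Match object.

3. Use the attributes and methods of the Match object to get the information you need and carry on from there.

The compile function compiles a regular expression into a Pattern object. Its typical usage is:

import re

# Compile the regular expression into a Pattern object.
pattern = re.compile(r'\d+')

With the regular expression compiled into a Pattern object, we can use the methods of pattern to search text.

The most commonly used methods of a Pattern object are:

match: search from the start position, one match

search: search from any position, one match

findall: all matches, returned as a list

finditer: all matches, returned as an iterator

split: split the string, returns a list

sub: replace

Second way: call the module-level functions re.search() / re.findall() directly.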

Example:

import re

old_url = 'http://www.jikexueyuan.com/course/android/?pageNum=2'
total_page = 20

html = """
<html>
<head>
</head>
<body>
<div class='topic'> <a href="http://jikexueyuan.com/welcone.html">Welcome to Tinghai's Python3 API Automation Testing</a>
<ul>
<li><a href="http://jikexueyuan.com/1.html">This is item one</a></li>
<li><a href="http://jikexueyuan.com/2.html">This is item two</a></li>
<li><a href="http://jikexueyuan.com/3.html">This is item three</a></li>
</ul>
</div>
</body>
</html>
"""

# Task 1: extract the page title
# (the sample string above contains no <title> element, so search() would return None here)
# title = re.search('<title>(.*?)</title>', html, re.S).group(1)
# print(title)

# Task 2: extract the links
# links = re.findall('href="(.*?)">', html)
# print(links)

# Task 3: extract part of the text content
# u_text = re.findall('<ul>(.*?)</ul>', html, re.S)[0]
# texts = re.findall('">(.*?)</a>', u_text, re.S)
# for every_text in texts:
#     print(every_text)

# Task 4: use sub to generate the URLs of the following pages.
# Note: the fourth positional argument of re.sub is count, not flags,
# so no flag is passed positionally here.
for i in range(2, total_page + 1):
    new_link = re.sub(r'pageNum=\d+', 'pageNum=%d' % i, old_url)
    print(new_link)
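The two usage styles described above give the same results; the following sketch puts them side by side. The query string is an invented sample value. It also shows that flags for re.sub should be passed by keyword, because the fourth positional parameter of re.sub is count.

```python
import re

text = "page=3&size=20"  # hypothetical sample value

# Style 1: compile once, then call methods on the Pattern object.
number = re.compile(r"\d+")
compiled_result = number.findall(text)

# Style 2: call the module-level functions directly.
direct_result = re.findall(r"\d+", text)
size = re.search(r"size=(\d+)", text).group(1)

# Flags for re.sub go in the flags= keyword, not in fourth position.
replaced = re.sub(r"PAGE=\d+", "page=9", text, flags=re.I)

print(compiled_result, direct_result, size, replaced)
```

Compiling is mainly worthwhile when the same pattern is applied many times; for a one-off search the module-level functions are shorter.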

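13.1.8 The match method of the re module

The match method looks for a match at the beginning of the string (or at a start position you specify). It is a single match: it returns as soon as one match is found rather than looking for all matches. Its general form is:

match(string[, begin[, end]])

Here string is the string to be matched, and begin and end are optional parameters that give the start and end positions within the string. When begin and end are given, match works within that range; when they are omitted, match tries to match at the very beginning of the string.

On success a Match object is returned; if there is no match, None is returned.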

Combined example:

import re

# Example 1
str1 = 'ting123hai456'
pattern = re.compile(r'\d+')  # matches one or more digits
m1 = pattern.match(str1)  # tries to match at the start of the string: no match
print(m1)

# Example 2
str2 = 'ting123hai456'
pattern = re.compile(r'\d+')  # matches one or more digits
m2 = pattern.match(str2, 3, 8)  # starts matching at 'g' (index 3): no match
print(m2)

# Example 3
str3 = 'ting123hai456'
pattern = re.compile(r'\d+')  # matches one or more digits
m3 = pattern.match(str3, 4, 8)  # starts matching at '1' (index 4): match
print(m3)  # a Match object
print(m3.group(0))
print(m3.start(0))
print(m3.end(0))
print(m3.span(0))

Output:

None

None

<_sre.SRE_Match object; span=(4, 7), match='123'>

123

4

7

(4, 7)

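As shown above, a successful match returns a Match object, where:

group([group1, ...]): returns the string matched by one or more groups; to get the whole matched substring, use group() or group(0);

start([group]): returns the start position of the substring matched by the group within the whole string (the index of its first character); the argument defaults to 0;

end([group]): returns the end position of the substring matched by the group within the whole string (the index of its last character + 1); the argument defaults to 0;

span([group]): returns (start(group), end(group)).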

re.I and re.S

1. re.I makes matching case-insensitive.

2. re.S makes '.' match every character, including newlines, so a pattern can span the whole text rather than a single line.

Example 1: re.I makes matching case-insensitive.

import re

pattern = re.compile(r'([a-z]+) ([a-z]+)', re.I)  # re.I: ignore case
m = pattern.match('Welcome To Reptiles')
print(m)           # match succeeds and returns a Match object
print(m.group(0))  # the whole matched substring
print(m.span(0))   # index span of the whole matched substring
print(m.group(1))  # substring matched by the first group
print(m.span(1))   # index span of the first group
print(m.group(2))  # substring matched by the second group
print(m.span(2))   # index span of the second group
print(m.groups())  # equivalent to (m.group(1), m.group(2), ...)
print(m.group(3))  # the pattern defines only two groups, so there is no third group

Output:

<_sre.SRE_Match object; span=(0, 10), match='Welcome To'>

Welcome To

(0, 10)

Welcome

(0, 7)

To

(8, 10)

('Welcome', 'To')

IndexError: no such group

A concrete re.S example is left to the discussion of the findall() method.

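13.1.9 The search method of the re module

The search method looks for a match anywhere in the string. It is also a single match: it returns as soon as one match is found rather than looking for all matches. Its general form is:

search(string[, begin[, end]])

Here string is the string to be matched, and begin and end are optional parameters that give the start and end positions within the string. When begin and end are given, search works within that range; when they are omitted, search scans the whole string and returns the first match it finds.

On success a Match object is returned; if there is no match, None is returned.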

Combined example 1:

import re

# Example 1
str1 = 'ting123hai456'
pattern = re.compile(r'\d+')
m1 = pattern.search(str1)  # searches anywhere in the string; match() would find nothing here
print(m1)
print(m1.group())
print(m1.span())

# Example 2
str2 = 'ting123hai456'
pattern = re.compile(r'\d+')
m2 = pattern.search(str2, 4, 8)  # restrict the search to a range of the string
print(m2)
print(m2.group())
print(m2.span())

Output:

<_sre.SRE_Match object; span=(4, 7), match='123'>

123

(4, 7)

<_sre.SRE_Match object; span=(4, 7), match='123'>

123

(4, 7)

Combined example 2:

import re

# Example 1
str1 = 'ting123hai456'
pattern = re.compile(r'\d+')
m1 = pattern.search(str1)  # searches anywhere in the string; match() would find nothing here
print(m1)
print(m1.group())
print(m1.span())

# Example 2
str2 = 'ting123hai456'
pattern = re.compile(r'\d+')
m2 = pattern.search(str2, 7, 13)  # restrict the search to indexes 7 to 13
print(m2)
print(m2.group())
print(m2.span())

Output:

<_sre.SRE_Match object; span=(4, 7), match='123'>

123

(4, 7)

<_sre.SRE_Match object; span=(10, 13), match='456'>

456

(10, 13)

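13.1.10 The findall method of the re module

The match and search methods above are single-match methods: they return as soon as one match is found. Most of the time, however, we need to search the whole string and get every match.

The findall method has the following form:

findall(string[, begin[, end]])

Here string is the string to be matched, and begin and end are optional parameters that give the start and end positions within the string. When begin and end are given, findall searches within that range; when they are omitted, findall searches the whole string.

findall returns all matching substrings as a list; if nothing matches, it returns an empty list.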

Combined example:

import re

# Example 1
str1 = 'hello123hell world456hel'
pattern = re.compile('hel')  # find 'hel'
m1 = pattern.findall(str1)
print(m1)

# Example 2
str2 = 'hello123hell world456hel'
pattern = re.compile('hel')  # find 'hel'
m2 = pattern.findall(str2, 7, 14)
print(m2)

# Example 3
str3 = 'hello123hell world456hel'
pattern = re.compile('hel')  # find 'hel'
m3 = pattern.findall(str3, 7, 25)
print(m3)

Output:

['hel', 'hel', 'hel']

['hel']

['hel', 'hel']
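Two aspects of findall() that matter in crawling are how it interacts with re.S and what it returns when the pattern contains capture groups. The snippet below is a sketch on an invented HTML fragment.

```python
import re

# Hypothetical multi-line HTML fragment for illustration.
html = "<ul>\n<li id='a'>first</li>\n<li id='b'>second</li>\n</ul>"

# Without re.S, '.' stops at newlines, so the multi-line block is missed.
without_s = re.findall(r"<ul>(.*?)</ul>", html)
# With re.S, '.' also matches newlines and the block is captured.
with_s = re.findall(r"<ul>(.*?)</ul>", html, re.S)

# One capture group: findall returns the group text, not the whole match.
texts = re.findall(r"<li id='\w'>(.*?)</li>", html)
# Two capture groups: findall returns a list of tuples.
pairs = re.findall(r"<li id='(\w)'>(.*?)</li>", html)

print(without_s, len(with_s), texts, pairs)
```

This is why the crawler-style patterns in this chapter wrap the wanted text in (.*?): the surrounding tags are matched but only the group content is returned.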

13.1.11 The finditer method of the re module

finditer behaves much like findall: it also searches the whole string and finds every match. The difference is that it returns an iterator that yields each match result (a Match object) in order.

Example:

import re

pattern = re.compile(r'\d+')
m1 = pattern.finditer('hello 123456 789')
m2 = pattern.finditer('one1two2three3four4', 0, 10)
print(type(m1))
print(type(m2))

print('----- m1 ------')
for a1 in m1:  # a1 is a Match object
    print('matching string: {}, position: {}'.format(a1.group(), a1.span()))

print('----- m2 ------')
for a2 in m2:
    print('matching string: {}, position: {}'.format(a2.group(), a2.span()))

Output:

<class 'callable_iterator'>

<class 'callable_iterator'>

----- m1 ------

matching string: 123456, position: (6, 12)

matching string: 789, position: (13, 16)

----- m2 ------

matching string: 1, position: (3, 4)

matching string: 2, position: (7, 8)

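13.1.12 The split method

The split method splits the string at every substring that matches the pattern and returns the pieces as a list. Its form is:

split(string[, maxsplit])

Here maxsplit gives the maximum number of splits; if it is omitted, the string is split at every match.

Example:

import re

p = re.compile(r'[\s\,\;]+')
print(p.split('a,b;; c d'))

Output:

['a', 'b', 'c', 'd']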
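The maxsplit parameter described above, and the effect of a capture group in the separator pattern, can be sketched as follows (sample strings are illustrative).

```python
import re

sep = re.compile(r"[\s,;]+")

# No maxsplit: split at every separator run.
all_parts = sep.split("a,b;; c d")

# maxsplit=2: only the first two separator runs are used; the rest stays intact.
two_splits = sep.split("a,b;; c d", 2)

# A capture group in the pattern keeps the separators in the result.
kept = re.split(r"([,;])", "a,b;c")

print(all_parts, two_splits, kept)
```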

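13.1.13 The sub method

The sub method performs replacement. Its form is:

sub(repl, string[, count])

Here repl can be a string or a function:

If repl is a string, every matched substring is replaced with repl and the resulting string is returned. repl can also refer to groups using \id or \g<id> (or \g<name> for named groups), but the form \0 cannot be used;

If repl is a function, it should accept exactly one argument (a Match object) and return the replacement string (group references are not expanded in the returned string).

count gives the maximum number of replacements; if it is omitted, all matches are replaced.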

Example 1:

import re

p = re.compile('123(.*?)123')
s = '123asdfxxIxxxxLovexxded123'
f = p.sub('123456789', s)
print(f)

Output:

123456789

Example 2:

import re

p = re.compile(r'(\w+) (\w+)')  # \w matches word characters: letters, digits and underscore
s = 'hello 123, hello 456'
print(p.sub(r'hello world', s))  # replace 'hello 123' and 'hello 456' with 'hello world'
print(p.sub(r'\2 \1', s))        # reference groups: swap the two words

def func(m):
    return 'hi' + ' ' + m.group(2)

print(p.sub(func, s))
print(p.sub(func, s, 1))  # replace at most once

Output:

hello world, hello world

123 hello, 456 hello

hi 123, hi 456

hi 123, hello 456
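The different ways of referring to groups in a replacement string can be compared on the same input. In this sketch the group names first and second are illustrative choices, not names from the original example.

```python
import re

# Same pattern as above, with (hypothetical) names added to the groups.
p = re.compile(r"(?P<first>\w+) (?P<second>\w+)")
s = "hello 123, hello 456"

by_number = p.sub(r"\2 \1", s)                  # numbered back-references
by_g = p.sub(r"\g<2> \g<1>", s)                 # \g<id> form, unambiguous before digits
by_name = p.sub(r"\g<second> \g<first>", s)     # named groups
whole = p.sub(r"[\g<0>]", s)                    # \g<0> is the entire match

print(by_number, by_g, by_name, whole, sep="\n")
```

The \g<id> form is useful when the reference is followed by a literal digit, where \1 followed by 0 would be read as group 10.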

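13.1.14 Greedy and non-greedy modes

Regular expression matching has two modes:

Greedy mode: provided the whole expression can still match, match as much as possible (for example * );

Non-greedy mode: provided the whole expression can still match, match as little as possible (add ? after the quantifier, for example *? );

In Python, quantifiers are greedy by default.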

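Combined example 1:

import re

# Example 1: greedy mode
s = 'abbbc'
p = re.compile('ab*')
f1 = p.findall(s)
print(f1)

# Example 2: non-greedy mode
s = 'abbbc'
p = re.compile('ab*?')
f2 = p.findall(s)
print(f2)

Output:

['abbb']

['a']

Explanation of the output:

With the greedy quantifier, the regular expression ab* matches abbb.

The * matches as many b characters as possible, so every b after the a is included.

With the non-greedy quantifier, the regular expression ab*? matches a.

Although * is present, the ? makes it match as few b characters as possible, so no b is included.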

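Combined example 2:

import re

html = "aa<div>test1</div>bb<div>test2</div>cc"

# Example 1: greedy mode
p = re.compile('<div>.*</div>')
f1 = p.findall(html)
print(f1)

# Example 2: non-greedy mode
p = re.compile('<div>.*?</div>')
f2 = p.findall(html)
print(f2)

Output:

['<div>test1</div>bb<div>test2</div>']

['<div>test1</div>', '<div>test2</div>']

Explanation of the output:

Regular expression with a greedy quantifier: <div>.*</div>

Match result: <div>test1</div>bb<div>test2</div>

This pattern is greedy. Reaching the first "</div>" is already enough for the whole expression to match, but because the quantifier is greedy the engine keeps trying to the right to see whether a longer substring can also match. After the second "</div>" there is nothing further to the right that can match, so matching ends with the result "<div>test1</div>bb<div>test2</div>".

Regular expression with a non-greedy quantifier: <div>.*?</div>

Result of each match: <div>test1</div>, then <div>test2</div>

The second pattern is non-greedy. As soon as the first "</div>" makes the whole expression match, the match ends without trying further to the right, giving "<div>test1</div>". Because findall collects every match, it then continues after that point and finds "<div>test2</div>" in the same way.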
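To see the difference per match rather than across the whole list, search() can be used on the same string, and the ? suffix works on the other quantifiers as well. This is an illustrative sketch.

```python
import re

html = "aa<div>test1</div>bb<div>test2</div>cc"

# search() returns only the first match, which isolates greedy vs lazy behavior.
greedy_first = re.search(r"<div>.*</div>", html).group()
lazy_first = re.search(r"<div>.*?</div>", html).group()

# In practice a capture group plus a lazy quantifier extracts each inner text.
inner_texts = re.findall(r"<div>(.*?)</div>", html)

# The ? suffix makes +, ? and {m,n} lazy too.
plus_lazy = re.match(r"a+?", "aaa").group()
optional_lazy = re.match(r"ab??", "ab").group()
range_lazy = re.match(r"a{2,3}?", "aaa").group()

print(greedy_first, lazy_first, inner_texts, plus_lazy, optional_lazy, range_lazy)
```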

13.1.15 A crawler case study using regular expressions

Now that we know how to extract data with regular expressions, we can filter the full source code of the pages we crawl. Here is a case study.

Case 1: crawling the Jikexueyuan course list

Code:

import re, requests

class spider(object):
    def __init__(self):
        print("Start crawling")

    def getsource(self, source):
        # download one page and return its HTML text
        html = requests.get(source)
        return html.text

    def changepage(self, url, total_page):
        # build the list of page URLs from the current page up to total_page
        now_page = int(re.search(r'pageNum=(\d+)', url, re.S).group(1))
        page_group = []
        for i in range(now_page, total_page + 1):
            # the fourth positional argument of re.sub is count, so no flag is passed here
            link = re.sub(r'pageNum=(\d+)', 'pageNum=%s' % i, url)
            page_group.append(link)
        return page_group

    def geteveryclass(self, html):
        # each course sits in its own <li id="..."> block
        everyclass = re.findall(r'<li id="(.*?)</li>', html, re.S)
        return everyclass

    def getinfo(self, eachclass):
        info = {}  # an empty dict
        info['title'] = re.search(r'title="(.*?)" alt="', eachclass, re.S).group(1)
        info['content'] = re.findall(r'display: none;">[\s]*([\s\S]*?)[\s]*</p>', eachclass)[0]
        classlevel = re.findall(r'<em>(.*?)</em>', eachclass, re.S)
        info['classtime'] = classlevel[0]
        info['classlevel'] = classlevel[1]
        info['learnnum'] = re.search(r'"learn-number">(.*?)</em>', eachclass, re.S).group(1)
        return info

    def saveinfo(self, classinfo):
        # append all records to info.txt; the with block closes the file
        with open('info.txt', 'a', encoding='utf-8') as f:
            for each in classinfo:
                f.write('title:' + each['title'] + '\n')
                f.write('content:' + each['content'] + '\n')
                f.write('classtime:' + each['classtime'] + '\n')
                f.write('classlevel:' + each['classlevel'] + '\n')
                f.write('learnnum:' + each['learnnum'] + '\n')

if __name__ == '__main__':
    classinfo = []  # an empty list
    url = 'http://www.jikexueyuan.com/course/?pageNum=1'  # initial URL
    jikespider = spider()  # create an instance of the spider class
    all_links = jikespider.changepage(url, 20)  # URLs of pages 1 to 20
    for link in all_links:
        print("Processing page: " + link)
        html = jikespider.getsource(link)  # HTML text of the page
        everyclass = jikespider.geteveryclass(html)  # list with the HTML of each course block
        # print(everyclass)
        for each in everyclass:
            # print(each)
            info = jikespider.getinfo(each)  # title, content, classtime, classlevel, learnnum of one course
            classinfo.append(info)
    print(classinfo)
    jikespider.saveinfo(classinfo)
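The page-generation step of the crawler can be tried without any network access. The sketch below reimplements only that step as a standalone function; the function name page_urls and the example.com URL are hypothetical, chosen for illustration.

```python
import re

def page_urls(url, total_pages):
    """Return the URLs from the page named in `url` up to `total_pages`."""
    current = int(re.search(r"pageNum=(\d+)", url).group(1))
    return [
        re.sub(r"pageNum=\d+", f"pageNum={n}", url)
        for n in range(current, total_pages + 1)
    ]

# Hypothetical starting URL; no request is actually sent.
urls = page_urls("http://example.com/course/?pageNum=2", 4)
for u in urls:
    print(u)
```

Keeping this step free of downloads makes it easy to check the generated URLs before running the full crawler.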

13.2 Introduction to XPath

Regular expressions are powerful, but their syntax is relatively complex and hard to master. There is another approach: XPath. We can first convert the HTML file into an XML document and then use XPath to find HTML nodes or elements.

13.2.1 What is XML

XML stands for EXtensible Markup Language.

XML is a markup language, much like HTML.

XML is designed to transport data, not to display it.

XML tags are not predefined; we define them ourselves.

XML is designed to be self-descriptive.

XML is a W3C recommendation.

13.2.2 Differences between XML and HTML

XML document example:

<?xml version="1.0" encoding="utf-8"?>
<bookstore>
<book>
<title lang="en">Everyday Italian</title>
<author>Giada De Laurentiis</author>
</book>
<book>
<title lang="en">Harry Potter</title>
<author>J K. Rowling</author>
</book>
<book>
<title lang="en">XQuery Kick Start</title>
<author>James McGovern</author>
<author>Per Bothner</author>
<author>Kurt Cagle</author>
<author>James Linn</author>
<author>Vaidyanathan Nagarajan</author>
</book>
<book>
<title lang="en">Learning XML</title>
</book>
</bookstore>

HTML DOM model example:

The HTML DOM defines a standard way to access and manipulate HTML documents, and represents an HTML document as a tree structure.

13.2.3 Node relationships in XML

1. Parent

Every element and attribute has one parent.

In the simple XML example below, the book element is the parent of the title, author, year, and price elements:

<?xml version="1.0" encoding="utf-8"?>
<book>
<title>Harry Potter</title>
<author>J K. Rowling</author>
<year>2005</year>
<price>29.99</price>
</book>

2. Children

An element node may have zero, one, or more children.

In the example below, the title, author, year, and price elements are all children of the book element:

<?xml version="1.0" encoding="utf-8"?>
<book>
<title>Harry Potter</title>
<author>J K. Rowling</author>
<year>2005</year>
<price>29.99</price>
</book>

3. Siblings

Nodes that have the same parent.

In the example below, the title, author, year, and price elements are all siblings:

<?xml version="1.0" encoding="utf-8"?>
<book>
<title>Harry Potter</title>
<author>J K. Rowling</author>
<year>2005</year>
<price>29.99</price>
</book>

4. Ancestors

A node's parent, the parent's parent, and so on.

In the example below, the ancestors of the title element are the book element and the bookstore element:

<?xml version="1.0" encoding="utf-8"?>
<bookstore>
<book>
<title>Harry Potter</title>
<author>J K. Rowling</author>
<year>2005</year>
<price>29.99</price>
</book>
</bookstore>

5. Descendants

A node's children, the children's children, and so on.

In the example below, the descendants of bookstore are the book, title, author, year, and price elements:

<?xml version="1.0" encoding="utf-8"?>
<bookstore>
<book>
<title>Harry Potter</title>
<author>J K. Rowling</author>
<year>2005</year>
<price>29.99</price>
</book>
</bookstore>
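The five relationships above can be inspected programmatically. The sketch below uses Python's standard-library xml.etree.ElementTree, which is a swapped-in tool chosen for this illustration and not one the chapter has introduced. ElementTree elements keep no reference to their parent, so a child-to-parent map is built first.

```python
import xml.etree.ElementTree as ET

doc = """<bookstore>
  <book>
    <title>Harry Potter</title>
    <author>J K. Rowling</author>
  </book>
</bookstore>"""

root = ET.fromstring(doc)          # the bookstore element
book = root.find("book")
title = book.find("title")

# Children: iterating an element yields its direct children.
children = [child.tag for child in book]

# Descendants: iter() walks the whole subtree (the first item is root itself).
descendants = [el.tag for el in root.iter()][1:]

# Parent and ancestors: build a child -> parent map, then walk upward.
parent_of = {child: parent for parent in root.iter() for child in parent}
ancestors = []
node = title
while node in parent_of:
    node = parent_of[node]
    ancestors.append(node.tag)

# Siblings: the other children of the same parent.
siblings = [el.tag for el in parent_of[title] if el is not title]

print(children, descendants, ancestors, siblings)
```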

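13.2.4 What is XPath

XPath (XML Path Language) is a language for finding information in an XML document; it can be used to traverse the elements and attributes of an XML document.

W3School documentation: http://www.w3school.com.cn/xpath/index.asp

13.2.5 XPath development tools

1. Open-source XPath expression editor: XMLQuire (works with XML files)

2. Chrome extension: XPath Helper

3. Firefox extension: XPath Checker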

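13.2.6 Selecting nodes

XPath uses path expressions to select nodes in an XML document. Nodes are selected by following a path, or steps.

The most useful path expressions are:

nodename: selects all child nodes with this name.

/: selects from the root node.

//: selects nodes anywhere in the document below the current node, regardless of their position.

.: selects the current node.

..: selects the parent of the current node.

@: selects attributes.

Examples

Some path expressions and their results:

bookstore: selects all child nodes of the bookstore element.

/bookstore: selects the root element bookstore.

bookstore/book: selects all book elements that are children of bookstore.

//book: selects all book elements, wherever they are in the document.

bookstore//book: selects all book elements that are descendants of bookstore, wherever they are below it.

//@lang: selects all attributes named lang.

Predicates

A predicate is used to find a specific node or a node that contains a specific value.

Predicates are written inside square brackets.

Examples

Some path expressions with predicates, and their results: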

Path expression: result

/bookstore/book[1]: selects the first book element that is a child of bookstore.

/bookstore/book[last()]: selects the last book element that is a child of bookstore.

/bookstore/book[last()-1]: selects the second-to-last book element that is a child of bookstore.

/bookstore/book[position()<3]: selects the first two book elements that are children of bookstore.

//title[@lang]: selects all title elements that have an attribute named lang.

//title[@lang='eng']: selects all title elements that have a lang attribute with the value eng.

/bookstore/book[price>35.00]: selects all book elements of bookstore whose price element has a value greater than 35.00.

/bookstore/book[price>35.00]/title: selects all title elements of the book elements of bookstore whose price element has a value greater than 35.00.
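Several of the predicate expressions above can be tried with Python's standard-library xml.etree.ElementTree. This is a sketch under stated assumptions: ElementTree is a swapped-in tool that supports only a subset of XPath (paths are written relative to the root element, and comparisons such as price>35.00 are not supported, so that filter is done in Python here), and the price values in the sample document are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Sample document; the price values are illustrative assumptions.
doc = """<bookstore>
  <book><title lang="en">Everyday Italian</title><price>30.00</price></book>
  <book><title lang="en">Harry Potter</title><price>29.99</price></book>
  <book><title lang="en">XQuery Kick Start</title><price>49.99</price></book>
  <book><title>Learning XML</title><price>39.95</price></book>
</bookstore>"""

root = ET.fromstring(doc)  # root is the bookstore element

def titles(path):
    """Titles of the book elements selected by an ElementTree path."""
    return [book.findtext("title") for book in root.findall(path)]

first = titles("./book[1]")               # like /bookstore/book[1]
last = titles("./book[last()]")           # like /bookstore/book[last()]
second_last = titles("./book[last()-1]")  # like /bookstore/book[last()-1]

with_lang = [t.text for t in root.findall(".//title[@lang]")]
lang_en = [t.text for t in root.findall(".//title[@lang='en']")]

# price>35.00 is outside ElementTree's XPath subset, so filter in Python.
expensive = [
    book.findtext("title")
    for book in root.findall("./book")
    if float(book.findtext("price")) > 35.00
]

print(first, last, second_last, with_lang, lang_en, expensive, sep="\n")
```

A full XPath 1.0 engine would accept the expressions exactly as listed in the table; the standard-library subset is used here only so the example runs without extra packages.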

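[Selecting unknown nodes]

XPath wildcards can be used to select unknown XML elements: * matches any element node, @* matches any attribute node, and node() matches a node of any kind.

Examples

/bookstore/* selects all child elements of bookstore, //* selects all elements in the document, and //title[@*] selects all title elements that have at least one attribute.

[Selecting several paths]

By using the "|" operator in a path expression you can select several paths.

Examples

//book/title | //book/price selects all title and price elements of all book elements, and //title | //price selects all title and price elements in the document.

13.2.7 XPath operators

The operators that can be used in XPath expressions are: | (union of two node sets), +, -, *, div, mod, =, !=, <, <=, >, >=, or, and.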

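Introduction

JavaScript is the cornerstone of web front-end development. Its power and flexibility show not only in dynamic page interaction but also in its ability to process data. Array traversal is one of the most common operations in JavaScript, and it plays a central role when solving algorithm problems. This article looks at the main ways to traverse arrays in JavaScript and uses concrete algorithm examples to show how to solve problems efficiently.

Technical overview

Array traversal methods

JavaScript offers several ways to traverse an array, each with its own characteristics and use cases:

for loop: the most traditional way to traverse; works in every situation.
forEach(): an array method introduced in ES5 that simplifies traversal syntax.
map(): creates a new array by applying a mapping to each element of the original array.
filter(): filters an array and returns a new array of the elements that satisfy a condition.
reduce(): accumulates over the array elements; commonly used for summing, merging, and similar tasks.
some() and every(): check whether at least one element satisfies a condition, or whether all elements do.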

Code example

const numbers = [1, 2, 3, 4, 5];

// Traverse with a for loop
for (let i = 0; i < numbers.length; i++) {
    console.log(numbers[i]);
}

// Traverse with forEach
numbers.forEach(number => console.log(number));

// Create a new array with map
const doubled = numbers.map(number => number * 2);
console.log(doubled); // Output: [2, 4, 6, 8, 10]

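Technical details

How it works

Array traversal methods all work by iterating over each element of the array and running some logic on it. Different methods provide different capabilities: map transforms, filter selects, and reduce aggregates.

Difficulties

Performance: although modern JavaScript engines are heavily optimized, the choice of traversal method can still affect performance when processing large data sets.
Side-effect management: avoid unnecessary modification of the original array while traversing, especially inside map and filter callbacks.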

Practical application

Scenario

Suppose an algorithm problem asks us to find all even numbers in an array and return the sum of their squares.

Code example

function sumOfSquaresEvenNumbers(numbers) {
    return numbers
        .filter(number => number % 2 === 0)     // keep even numbers
        .map(number => number * number)         // square them
        .reduce((acc, curr) => acc + curr, 0);  // sum them
}

const result = sumOfSquaresEvenNumbers([1, 2, 3, 4, 5, 6]);
console.log(result); // Output: 56

Optimization and improvement

Potential problems

Performance bottleneck: on large data sets, iterating several times (once per chained method, each producing an intermediate array) can reduce performance.
Code redundancy: overusing higher-order functions can make code harder to understand.

Code example

function optimizedSumOfSquaresEvenNumbers(numbers) {
    let sum = 0;
    for (let number of numbers) {
        if (number % 2 === 0) {
            sum += number * number;
        }
    }
    return sum;
}

const optimizedResult = optimizedSumOfSquaresEvenNumbers([1, 2, 3, 4, 5, 6]);
console.log(optimizedResult); // Output: 56
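For readers following along in Python, the language used in the crawler chapter above, the same chained and single-pass versions can be sketched as follows. This is an illustrative translation rather than part of the original article, and the function names are hypothetical. One difference worth noting: Python's filter() and map() are lazy iterators, so the chained version does not build intermediate lists the way the JavaScript array methods do.

```python
from functools import reduce

def sum_of_squares_chained(numbers):
    """filter -> map -> reduce, mirroring the chained JavaScript version."""
    evens = filter(lambda n: n % 2 == 0, numbers)
    squares = map(lambda n: n * n, evens)
    return reduce(lambda acc, cur: acc + cur, squares, 0)

def sum_of_squares_single_pass(numbers):
    """One pass with a generator expression, mirroring the for...of version."""
    return sum(n * n for n in numbers if n % 2 == 0)

data = [1, 2, 3, 4, 5, 6]
chained = sum_of_squares_chained(data)
single = sum_of_squares_single_pass(data)

# any() and all() play the roles of some() and every().
has_even = any(n % 2 == 0 for n in data)
all_positive = all(n > 0 for n in data)

print(chained, single, has_even, all_positive)
```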

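FAQ

Q: How can I avoid modifying the original array while traversing it?
A: Use methods such as map or filter; they return a new array and do not modify the original one.

Summary and outlook

Array traversal is both a foundation of JavaScript programming and a practical tool for solving complex algorithm problems. In this article we covered several array traversal methods and how to choose a suitable traversal strategy for a given problem in order to improve efficiency and readability. As JavaScript continues to evolve, new array methods and iterator patterns will further extend the toolbox and give developers more efficient and flexible options. Being fluent in array traversal means having more choices and more confidence when solving algorithm problems, and it is a key step for front-end developers moving to the next level.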
