C# - Extracting matched strings and capture groups with regular expressions

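String extraction

1 Call Regex.Match(input, pattern) to extract the first substring that matches the pattern. It returns a Match object, and its Value property holds the matched text (check Success first: Value is an empty string when nothing matched).

Note: when extracting, you normally leave out ^ and $. Extraction looks for a part of the source string that matches the pattern, whereas a pattern anchored with ^ and $ must match the entire string (keep in mind that quantifiers are greedy by default, so each match is as long as possible). Add ^ and $ only when the whole source string is supposed to match the pattern.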

Figure: extracting the first matching string
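
The effect of ^ and $ is easy to check with a small experiment. The sketch below uses Python's re module (the language of the scraping examples later in this article) rather than C#, and the sample sentence is made up for illustration; re.search plays the role that Regex.Match plays in .NET.

```python
import re

text = "Order A-1024 shipped on 2023-07-31 to Nanjing"  # made-up sample

# Unanchored: search() returns the first substring that matches.
first = re.search(r"\d+", text)
print(first.group() if first else None)          # 1024

# Anchored with ^ and $: the whole string must match, so this fails.
anchored = re.search(r"^\d+$", text)
print(anchored)                                  # None

# The anchored form succeeds only when the entire input fits the pattern.
print(re.search(r"^\d+$", "1024") is not None)   # True
```

The unanchored search is what extraction needs; the anchored form is a validation check.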

2 Call Regex.Matches(input, pattern) to get every substring that matches the pattern. It returns a MatchCollection.

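// Requires: using System; using System.Text.RegularExpressions;
string str = "中國中央氣象臺2023年7月31日發布稱\"卡努\"加強為8級的強臺風級";
// Extract every year/month/day fragment: digits followed by 年, 月 or 日
string reg = @"\d+(年|月|日)";
MatchCollection mts = Regex.Matches(str, reg);
foreach (Match mt in mts)
{
    Console.WriteLine(mt.Value); // prints 2023年, 7月 and 31日, one per line
}
Console.ReadKey();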

Figure: extracting all matching strings

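Extracting groups

Wrap parts of the pattern in () to capture them as groups; this is handy for classifying and counting matches.

Use match.Groups[] to read the groups. Note: index 0 holds the entire match, so the captured groups start at index 1.

Groups are available on the results of both Regex.Match() and Regex.Matches().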

// Requires: using System; using System.IO; using System.Text.RegularExpressions;
// Read 1.html (which contains e-mail addresses) into memory
string str = File.ReadAllText(@"d:\1.html");
//               login name          provider      suffix (.com/.cn ...)
string email = @"([a-zA-Z0-9_\-\.]+)@([a-zA-Z0-9]+)(\.[a-zA-Z0-9]+)+";
MatchCollection mts = Regex.Matches(str, email);
#region Reading the groups
//foreach (Match mt in mts)
//{
//    // Check whether the current match succeeded
//    // (every Match in a MatchCollection is successful, so this is optional)
//    if (mt.Success)
//    {
//        // mt.Groups holds all groups; there is one per () in the pattern,
//        // and the captured groups start at mt.Groups[1]
//        // The entire match
//        Console.WriteLine(mt.Groups[0]);//zs316@163.com
//        // Login name
//        Console.WriteLine(mt.Groups[1]);//zs316
//        // Mail provider
//        Console.WriteLine(mt.Groups[2]);//163
//        // Suffix (the last repetition when group 3 matches more than once)
//        Console.WriteLine(mt.Groups[3]);//.com
//    }
//}
#endregion
//============================================
// Counters for each mail provider
int count_163 = 0, count_qq = 0, count_sina = 0, others = 0;
// Loop over every match found by the pattern
foreach (Match mt in mts)
{
    // Group 2 identifies the provider of the current address
    switch (mt.Groups[2].Value)
    {
        case "163":
            count_163++;
            break;
        case "qq":
            count_qq++;
            break;
        case "sina":
            count_sina++;
            break;
        default:
            others++;
            break;
    }
}
Console.WriteLine("163 addresses: {0}", count_163);
Console.WriteLine("qq addresses: {0}", count_qq);
Console.WriteLine("sina addresses: {0}", count_sina);
Console.WriteLine("Other addresses: {0}", others);
Console.ReadKey();

Figure: reading group information
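
For comparison, the same group-based tally can be sketched in Python. This is an illustration only: the sample text and addresses are invented, and collections.Counter stands in for the four hand-written counters of the C# version.

```python
import re
from collections import Counter

# Invented sample input; the article reads d:\1.html instead.
html = "<p>zs316@163.com, li_si@qq.com; wang.wu@sina.com.cn and bob@example.org, amy@qq.com</p>"

# Same pattern as the C# example: group 1 = login name, group 2 = provider.
email = re.compile(r"([a-zA-Z0-9_\-.]+)@([a-zA-Z0-9]+)(\.[a-zA-Z0-9]+)+")

providers = Counter()
for m in email.finditer(html):
    provider = m.group(2)
    # Anything other than the three named providers is counted as "others".
    key = provider if provider in ("163", "qq", "sina") else "others"
    providers[key] += 1

print(dict(providers))
```

As in .NET, a repeated group such as group 3 keeps only its last repetition, so for wang.wu@sina.com.cn it holds ".cn".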

Three ways to extract information from web pages

XPath syntax with the lxml module
BeautifulSoup4
Regular expressions

Douban Movies (豆瓣電影)

#coding=utf-8
import requests
from lxml import etree

url = 'https://movie.douban.com/cinema/nowplaying/nanjing/'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/83.0.4090.0 Safari/537.36 Edg/83.0.467.0 ',
    'Referer': 'https://movie.douban.com/'
}
response = requests.get(url, headers=headers)
html = response.text

htmlElement = etree.HTML(html)
ul = htmlElement.xpath("//ul[@class='lists']")[0]

movies = []
lis = ul.xpath('./li')
for li in lis:
    title = li.xpath('@data-title')[0]
    director = li.xpath('@data-director')[0]
    actors = li.xpath('@data-actors')[0]
    thumbnail = li.xpath('.//img/@src')[0]

    movie = {
        'title': title,
        'director': director,
        'actors': actors,
        'thumbnail': thumbnail
    }
    movies.append(movie)

print(movies)

output:[{'title': '奇妙王國之魔法奇緣', 'director': '陳設', 'actors': '盧瑤 / 張洋 / 陳新玥', 'thumbnail': 'https://img3.doubanio.com/view/photo/s_ratio_poster/public/p2577837112.jpg'}, {'title': '大紅包', 'director': '李克龍', 'actors': '包貝爾 / 李成敏 / 賈冰', 'thumbnail': 'https://img3.doubanio.com/view/photo/s_ratio_poster/public/p2581346773.jpg'}, {'title': '金禪降魔', 'director': '彭發 王凱程中豪', 'actors': '釋小龍 / 胡軍 / 姚星彤', 'thumbnail': 'https://img9.doubanio.com/view/photo/s_ratio_poster/public/p2564190636.jpg'}, {'title': '82號古宅', 'director': '袁杰', 'actors': '葛天 / 扈天翼 / 黃心娣', 'thumbnail': 'https://img3.doubanio.com/view/photo/s_ratio_poster/public/p2586838530.jpg'}, {'title': '親親噠', 'director': '馬雍', 'actors': '馬良博一 / 盧小路 / 尹恒', 'thumbnail': 'https://img1.doubanio.com/view/photo/s_ratio_poster/public/p2579189777.jpg'}, {'title': '六月的秘密', 'director': '王暘', 'actors': '郭富城 / 苗苗 / 吳建飛', 'thumbnail': 'https://img1.doubanio.com/view/photo/s_ratio_poster/public/p2522497098.jpg'}, {'title': '秘密訪客', 'director': '陳正道', 'actors': '郭富城 / 段奕宏 / 張子楓', 'thumbnail': 'https://img1.doubanio.com/view/photo/s_ratio_poster/public/p2579398648.jpg'}, {'title': '無名狂', 'director': '李云波', 'actors': '張曉晨 / 隋詠良 / 上白', 'thumbnail': 'https://img3.doubanio.com/view/photo/s_ratio_poster/public/p2574800433.jpg'}, {'title': '刺殺小說家', 'director': '路陽', 'actors': '雷佳音 / 楊冪 / 董子健', 'thumbnail': 'https://img9.doubanio.com/view/photo/s_ratio_poster/public/p2580314674.jpg'}]
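
The scraper above indexes every XPath result with [0], which raises IndexError as soon as an attribute or image is missing. A minimal sketch of a more defensive variant follows. To stay self-contained it runs on an invented, well-formed fragment and uses the limited XPath subset of the standard library's xml.etree.ElementTree instead of lxml; the fragment and its values are assumptions, and only the data-* attribute names come from the code above.

```python
import xml.etree.ElementTree as ET

# Invented fragment that mimics the structure the scraper expects.
fragment = """
<div>
  <ul class="lists">
    <li data-title="Movie A" data-director="Director A" data-actors="X / Y">
      <img src="https://example.com/a.jpg" />
    </li>
    <li data-title="Movie B" data-actors="Z">
      <span>no poster</span>
    </li>
  </ul>
</div>
"""

root = ET.fromstring(fragment)
ul = root.find(".//ul[@class='lists']")

movies = []
for li in ul.findall("./li"):
    img = li.find(".//img")
    movies.append({
        # .get() returns None instead of raising when an attribute is absent.
        "title": li.get("data-title"),
        "director": li.get("data-director"),
        "actors": li.get("data-actors"),
        "thumbnail": img.get("src") if img is not None else None,
    })

print(movies)
```

Real pages are rarely well-formed XML, which is why the article's code parses them with lxml's etree.HTML; the defensive lookups carry over unchanged in spirit.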

Lagou job listings

#coding=utf-8
import requests
from lxml import etree

url = 'https://www.lagou.com/zhaopin/PHP/?labelWords=label'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) '
                  'Chrome/83.0.4090.0 Safari/537.36 Edg/83.0.467.0 ',
    'Referer': 'https://www.lagou.com/'
}
response = requests.get(url, headers=headers)
html = response.text

htmlElement = etree.HTML(html)
ul = htmlElement.xpath("//ul[@class='item_con_list']")[1]

jobs = []
lis = ul.xpath('./li')
for li in lis:
    position = li.xpath('@data-positionname')[0]
    company = li.xpath('@data-company')[0]
    salary = li.xpath('@data-salary')[0]
    website = li.xpath(".//a/@href")[0]
    education = li.xpath(".//div[@class='li_b_l']//text()")[3]

    job = {
        'position': position,
        'company': company,
        'salary': salary,
        'website': website,
        'education': education
    }
    jobs.append(job)

print(jobs)

output:[{'position': '高級PHP開發工程師', 'company': '廣州微用科技', 'salary': '10k-18k', 'website': 'https://www.lagou.com/jobs/6813986.html?show=0d9c9879671444a5926680abc20322be', 'education': '經驗3-5年 / 大專\n '}, {'position': '高級PHP開發工程師', 'company': '互愛（北京）科技股份有限公司', 'salary': '25k-40k', 'website': 'https://www.lagou.com/jobs/6585859.html?show=0d9c9879671444a5926680abc20322be', 'education': '經驗3-5年 / 不限\n '}, {'position': 'php開發工程師', 'company': '金山云', 'salary': '20k-40k', 'website': 'https://www.lagou.com/jobs/6871122.html?show=0d9c9879671444a5926680abc20322be', 'education': '經驗5-10年 / 不限\n '}, {'position': 'php開發工程師', 'company': '央視頻融媒體', 'salary': '12k-20k', 'website': 'https://www.lagou.com/jobs/6873952.html?show=0d9c9879671444a5926680abc20322be', 'education': '經驗5-10年 / 本科\n '}, {'position': 'php開發工程師', 'company': '順豐同城科技', 'salary': '20k-40k', 'website': 'https://www.lagou.com/jobs/5788250.html?show=0d9c9879671444a5926680abc20322be', 'education': '經驗1-3年 / 本科\n '}, {'position': 'php開發工程師', 'company': '紅布林', 'salary': '15k-30k', 'website': 'https://www.lagou.com/jobs/6963665.html?show=0d9c9879671444a5926680abc20322be', 'education': '經驗3-5年 / 本科\n '}, {'position': 'php開發工程師', 'company': '吉城美家', 'salary': '7k-14k', 'website': 'https://www.lagou.com/jobs/5553859.html?show=0d9c9879671444a5926680abc20322be', 'education': '經驗3-5年 / 本科\n '}, {'position': 'php開發工程師', 'company': '有咖互動', 'salary': '10k-15k', 'website': 'https://www.lagou.com/jobs/6959578.html?show=0d9c9879671444a5926680abc20322be', 'education': '經驗1-3年 / 本科\n '}, {'position': 'php開發工程師', 'company': '微拍堂', 'salary': '20k-30k', 'website': 'https://www.lagou.com/jobs/6972670.html?show=0d9c9879671444a5926680abc20322be', 'education': '經驗3-5年 / 本科\n '}, {'position': 'PHP開發工程師', 'company': '廣州市搜游網絡科技', 'salary': '20k-40k', 'website': 'https://www.lagou.com/jobs/6959547.html?show=0d9c9879671444a5926680abc20322be', 'education': '經驗3-5年 / 大專\n '}, {'position': 'PHP高級開發工程師', 'company': '廣州市搜游網絡科技', 'salary': '25k-45k', 'website': 'https://www.lagou.com/jobs/6943052.html?show=0d9c9879671444a5926680abc20322be', 'education': '經驗3-5年 / 大專\n '}, {'position': 'PHP高級開發工程師', 'company': '明源云', 'salary': '15k-25k', 'website': 'https://www.lagou.com/jobs/4619045.html?show=0d9c9879671444a5926680abc20322be', 'education': '經驗3-5年 / 本科\n '}, {'position': 'PHP開發工程師', 'company': '微通', 'salary': '8k-16k', 'website': 'https://www.lagou.com/jobs/501604.html?show=0d9c9879671444a5926680abc20322be', 'education': '經驗不限 / 不限\n '}, {'position': 'php開發工程師', 'company': '掌閱', 'salary': '15k-25k', 'website': 'https://www.lagou.com/jobs/6237409.html?show=0d9c9879671444a5926680abc20322be', 'education': '經驗3-5年 / 本科\n '}, {'position': 'php開發工程師', 'company': '經傳多贏', 'salary': '10k-18k', 'website': 'https://www.lagou.com/jobs/6820316.html?show=0d9c9879671444a5926680abc20322be', 'education': '經驗1-3年 / 本科\n '}]

China Weather (中國天氣網)

# coding=utf-8
import requests
from bs4 import BeautifulSoup
from pyecharts.charts import Bar
from pyecharts import options as opts
from pyecharts.globals import ThemeType

Temp = []
def parsePages(url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) '
                      'Chrome/83.0.4090.0 Safari/537.36 Edg/83.0.467.0 '
    }
    response = requests.get(url, headers=headers)
    text = response.content.decode('utf-8')
    # html5lib is slower than lxml but more tolerant of malformed markup
    # soup = BeautifulSoup(text, 'lxml')
    soup = BeautifulSoup(text, 'html5lib')
    conMidtab = soup.find('div', attrs={'class': 'conMidtab'})
    tables = conMidtab.find_all('table')
    for table in tables:
        trs = table.find_all('tr')[2:]
        for index, tr in enumerate(trs):
            infors = list(tr.stripped_strings)
            city = infors[1] if index == 0 else infors[0]
            minTemp = infors[-2]
            Temp.append({'city': city, 'minTemp': int(minTemp)})

def main():
    urls = ['hb', 'db', 'hd', 'hz', 'hn', 'xn', 'xb', 'gat']
    for url in urls:
        url = f'http://www.weather.com.cn/textFC/{url}.shtml'
        parsePages(url)

    Temp.sort(key=lambda data:data['minTemp'])
    data = Temp[:10]
    cities = list(map(lambda x: x['city'], data))
    minTemp = list(map(lambda x:x['minTemp'], data))
    chart = Bar(init_opts=opts.InitOpts(theme=ThemeType.LIGHT, page_title='Lowest temperatures in China'))
    chart.add_xaxis(cities)
    chart.add_yaxis('Min temperature', minTemp)
    chart.render()

if __name__ == '__main__':
    main()
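
The row handling in parsePages is the least obvious part: judging by the indexing, the first data row of each table carries an extra leading cell (presumably the province name), so the city sits at index 1 there and at index 0 in every other row. The sketch below replays that rule on invented rows (the strings are placeholders, not real forecast data) and picks the coldest cities with heapq.nsmallest as an alternative to sorting the whole list.

```python
import heapq

# Invented rows shaped like list(tr.stripped_strings); the first row of a
# table starts with an extra cell before the city name.
rows = [
    ["ProvinceA", "CityA", "Sunny", "NW", "28", "Cloudy", "N", "-3", "Details"],
    ["CityB", "Sunny", "NW", "25", "Cloudy", "N", "-11", "Details"],
    ["CityC", "Rain", "S", "30", "Rain", "S", "4", "Details"],
]

def parse_rows(rows):
    """Return [{'city': ..., 'minTemp': ...}] using the same indexing rule."""
    result = []
    for index, cells in enumerate(rows):
        city = cells[1] if index == 0 else cells[0]
        # The minimum temperature is the second-to-last cell in every row.
        result.append({"city": city, "minTemp": int(cells[-2])})
    return result

temps = parse_rows(rows)
coldest = heapq.nsmallest(2, temps, key=lambda d: d["minTemp"])
print(coldest)
```

For a top-10 list over a few hundred cities the difference from sort() is negligible; nsmallest mainly states the intent more directly.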

Gushiwen (古詩文網)

# coding=utf-8
import requests
import re

def parsePages(url):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) '
                      'Chrome/83.0.4090.0 Safari/537.36 Edg/83.0.467.0 '
    }
    response = requests.get(url, headers=headers)
    text = response.text
    rTitles = re.compile(r'<div class="cont">.*?<b>(.*?)</b>', re.S)
    rDynasties = re.compile(r'<div class="cont">.*?<p class="source"><a.*?>(.*?)</a>', re.S)
    rAuthors = re.compile(r'<div class="cont">.*?<p class="source"><a.*?>.*?<a.*?>(.*?)</a>', re.S)
    rContsons = re.compile(r'<div class="cont">.*?<div class="contson".*?>(.*?)</div>', re.S)
    titles = re.findall(rTitles, text)
    dynasties = re.findall(rDynasties, text)
    authors = re.findall(rAuthors, text)
    contsons = re.findall(rContsons, text)
    contents = []
    for contson in contsons:
        x = re.sub(r'<.*?>', '', contson)
        contents.append(x)

    poems = []
    for title, dynasty, author, content in zip(titles, dynasties, authors, contents):
        poems.append({
            'title': title,
            'dynasty': dynasty,
            'author': author,
            'content': content
        })

    print(poems)

def main():
    url = 'https://www.gushiwen.org/default_1.aspx'
    parsePages(url)

if __name__ == '__main__':
    main()

output:[{'title': '滿江紅·夜雨涼甚忽動從戎之興', 'dynasty': '宋代', 'author': '劉克莊', 'content': '\n金甲雕戈，記當日、轅門初立。磨盾鼻、一揮千紙，龍蛇猶濕。鐵馬曉嘶營壁冷，樓船夜渡風濤急。有誰憐、猿臂故將軍，無功級。平戎策，從軍什。零落盡，慵收拾。把茶經香傳，時時溫習。生怕客談榆塞事，且教兒誦花間集。嘆臣之壯也不如人，今何及。\n'}, {'title': '阮郎歸·紹興乙卯大雪行鄱陽道中', 'dynasty': '宋代', 'author': '向子諲', 'content': '\n江南江北雪漫漫，遙知易水寒。同云深處望三關，斷腸山又山。天可老，海能翻，消除此恨難。頻聞遣使問平安，幾時鸞輅還？\n'}, {'title': '碧瓦', 'dynasty': '宋代', 'author': '范成大', 'content': '\n碧瓦樓頭繡幙遮，赤欄橋外綠溪斜。無風楊柳漫天絮，不雨棠梨滿地花。\n'}, {'title': '減字浣溪沙·秋水斜陽演漾金', 'dynasty': '宋代', 'author': '賀鑄', 'content': '\n秋水斜陽演漾金，遠山隱隱隔平林。幾家村落幾聲砧。記得西樓凝醉眼，昔年風物似如今。只無人與共登臨。\n'}, {'title': '次韻公實雷雨', 'dynasty': '宋代', 'author': '洪炎', 'content': '\n驚雷勢欲拔三山，急雨聲如倒百川。但作奇寒侵客夢，若為一震靜胡煙。田園荊棘漫流水，河洛腥膻今幾年。擬叩九關箋帝所，人非大手筆非椽。\n'}, {'title': '浣溪沙·雨歇梧桐淚乍收', 'dynasty': '清代', 'author': '納蘭性德', 'content': '\n雨歇梧桐淚乍收，遣懷翻自憶從頭。摘花銷恨舊風流。簾影碧桃人已去，屧痕蒼蘚徑空留。兩眉何處月如鉤？\n'}, {'title': '交趾懷古', 'dynasty': '清代', 'author': '曹雪芹', 'content': '\n銅鑄金鏞振紀綱，聲傳海外播戎羌。馬援自是功勞大，鐵笛無煩說子房。\n'}, {'title': '野老', 'dynasty': '唐代', 'author': '杜甫', 'content': '\n野老籬前江岸回，柴門不正逐江開。漁人網集澄潭下，賈客船隨返照來。長路關心悲劍閣，片云何意傍琴臺。王師未報收東郡，城闕秋生畫角哀。\n'}, {'title': '碧城三首', 'dynasty': '唐代', 'author': '李商隱', 'content': '\n碧城十二曲闌干，犀辟塵埃玉辟寒。閬苑有書多附鶴，女床無樹不棲鸞。星沉海底當窗見，雨過河源隔座看。若是曉珠明又定，一生長對水晶盤。\n對影聞聲已可憐，玉池荷葉正田田。不逢蕭史休回首，莫見洪崖又拍肩。紫鳳放嬌銜楚佩，赤鱗狂舞撥湘弦。鄂君悵望舟中夜，繡被焚香獨自眠。\n七夕來時先有期，洞房簾箔至今垂。玉輪顧兔初生魄，鐵網珊瑚未有枝。檢與神方教駐景，收將鳳紙寫相思。武皇內傳分明在，莫道人間總不知。\n'}, {'title': '水龍吟·聽兮清佩瓊瑤些', 'dynasty': '宋代', 'author': '辛棄疾', 'content': '\n用“些語”再題瓢泉，歌以飲客，聲韻甚諧，客皆為之釂。\n聽兮清佩瓊瑤些。明兮鏡秋毫些。君無去此，流昏漲膩，生蓬蒿些。虎豹甘人，渴而飲汝，寧猿猱些。大而流江海，覆舟如芥，君無助、狂濤些。路險兮山高些。塊予獨處無聊些。冬槽春盎，歸來為我，制松醪些。其外芳芬，團龍片鳳，煮云膏些。古人兮既往，嗟予之樂，樂簞瓢些。\n'}]
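
Two details make the patterns in parsePages work: re.S lets . match newlines, so a pattern can span several lines of HTML, and the non-greedy .*? stops at the first closing tag instead of the last. A small sketch on an invented fragment (the markup only imitates the class names used above):

```python
import re

# Invented markup imitating the structure the patterns above rely on.
html = """
<div class="cont">
  <p><a href="/p/1"><b>Poem One</b></a></p>
  <div class="contson" id="c1">line 1<br />line 2</div>
</div>
<div class="cont">
  <p><a href="/p/2"><b>Poem Two</b></a></p>
  <div class="contson" id="c2">line 3</div>
</div>
"""

title_re = re.compile(r'<div class="cont">.*?<b>(.*?)</b>', re.S)
body_re = re.compile(r'<div class="contson".*?>(.*?)</div>', re.S)

titles = title_re.findall(html)
# Strip leftover tags such as <br /> from each body.
bodies = [re.sub(r"<.*?>", "", b) for b in body_re.findall(html)]
print(titles, bodies)

# Without re.S, "." cannot cross the line break between <div> and <b>.
no_flag = re.findall(r'<div class="cont">.*?<b>(.*?)</b>', html)
print(no_flag)  # []
```

Because each field is collected by a separate findall and then zipped, a poem that lacks one field would shift every later value; matching one block per poem and extracting the fields from it avoids that.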

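04_HttpRunner General_02_Data Extraction_02_Regex Method

In HttpRunner, text can be extracted with regular expressions:

Extraction is performed on the response body
The response body must be JSON or HTML text
Format: <left boundary>(regular expression for the content to extract)<right boundary>
A commonly used extraction regex: (.*) means the boundaries may enclose "any character, any number of times"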

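\httprunner-2.5.5\httprunner\response.py:

The docstring of the function in the source code states that regex extraction can be applied to a JSON or HTML-text response body.

Figure: description of regex extraction in the HttpRunner source code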

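Extraction rule (similar to how LR, i.e. LoadRunner, handles it):

Locate the content to extract in the body text
Determine its left and right boundaries
Replace the content to extract with a regular expression and wrap it in parentheses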
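
The three steps can be sketched in Python as follows. This is not HttpRunner's own code: extract_between is a hypothetical helper and the response text is invented. The literal boundaries are passed through re.escape so that only the part in parentheses acts as a regular expression.

```python
import re

def extract_between(text, left, right, inner=r"(.*)"):
    """Return the text captured between two literal boundaries, or None."""
    pattern = re.escape(left) + inner + re.escape(right)
    match = re.search(pattern, text)
    return match.group(1) if match else None

body = '{"code": 0, "token": "abc123", "user": "tester"}'  # invented response

token = extract_between(body, '"token": "', '", "user"')
print(token)                                        # abc123
print(extract_between(body, '"missing": "', '"'))   # None
```

In an HttpRunner extract entry the boundaries and the group are written as one pattern string, so any regex metacharacters inside the boundaries have to be escaped by hand.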

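Example

Goal: extract the markup of a particular image tag from the HTML text of the response
Characteristics:

The image sits inside a pair of <td> tags, which can serve as the left and right boundaries (they must identify the spot uniquely)
The left <td> tag has a width attribute whose value is exactly 3 digits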

- config:
    name: Test case - testing the invoicing (purchase/sales/inventory) system

- test:
    name: Step - open the login page

    request:
      url: http://localhost/myweb/jxc/index.asp
      method: GET

    extract:
      # Extract the img tag from the following response HTML
      # <td width="322"><img src="images/login_02.gif" width="323" height="340"></td>
      - x: <td width="\d{3}">(.*)</td>

    validate:
      # Verify the extraction (the extracted string must equal the expected markup).
      - eq: [$x, <img src="images/login_02.gif" width="323" height="340">]

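Notes on the extraction in the example above:

The front part (left boundary) is <td width="\d{3}">, where \d stands for a digit and {3} for exactly 3 occurrences, so 3 digits must appear here
The back part (right boundary) is </td>
The content in the parentheses in between is the string to extract, where "." stands for any character and "*" for any number of occurrences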
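
The same pattern can be tried outside HttpRunner with Python's re module. This is a sketch that assumes the extractor behaves like a plain re.search returning the first group; the second HTML line is invented to show a caveat: because (.*) is greedy, a line holding two <td> cells makes the match run to the last </td>, which (.*?) avoids.

```python
import re

pattern = r'<td width="\d{3}">(.*)</td>'

html = '<td width="322"><img src="images/login_02.gif" width="323" height="340"></td>'
x = re.search(pattern, html).group(1)
print(x)  # <img src="images/login_02.gif" width="323" height="340">

# Invented line with two cells: the greedy group swallows the first </td>.
two_cells = '<td width="100">A</td><td width="200">B</td>'
greedy = re.search(pattern, two_cells).group(1)
lazy = re.search(r'<td width="\d{3}">(.*?)</td>', two_cells).group(1)
print(greedy)  # A</td><td width="200">B
print(lazy)    # A
```

This is why the boundaries have to be unique: if another cell with a three-digit width appears earlier in the page, the first match is the one that gets extracted.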

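Figure: detailed breakdown of the regex extraction rule

Run the test case and check the log in the test report; it shows that the extraction succeeded.

Figure: extraction result shown in the test report (success)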
