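Parsing HTML Tags with Jsoup (Java Back-End Parsing)

Sometimes the content an editor sends to the back end contains HTML tags, sometimes you need to build a complete HTML document, and sometimes you only need the text inside it (text()). Jsoup makes all of this convenient on the Java back end, and its selectors fetch elements quickly, much like jQuery. Once you have the Document object, working with it is as convenient as manipulating the DOM in JavaScript: element selection resembles the JS DOM API, and a jQuery-like syntax is available as well.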

1. The most basic case: parsing an HTML string

@Test
public void testHtmlToString2() {
    String html = "<p>This is a paragraph<img src=\"test.img\"/>content;</p>";
    Document doc = Jsoup.parse(html);
    System.out.println(doc); // prints the full HTML document, tags included
    System.out.println("---------------------\n" + doc.text()); // prints the text content only
    Elements elements = doc.getElementsByTag("p");
    System.out.println("---------------------\n" + elements.get(0).html()); // inner HTML of the first <p>
}

Output:

<html>
 <head></head>
 <body>
  <p>This is a paragraph<img src="test.img">content;</p>
 </body>
</html>
---------------------
This is a paragraphcontent;
---------------------
This is a paragraph<img src="test.img">content;

2. Parsing a complete HTML string

// Parse an HTML string
@Test
public void testHtmlToString() {
    String html = "<html><head><title>First parse</title></head>"
            + "<body><p style='center'>Parsed HTML into a doc.</p></body></html>";
    Document doc = Jsoup.parse(html);
    System.out.println(doc); // prints the full HTML document, tags included
    System.out.println("---------------------\n" + doc.text()); // prints the text content only
}

Output:

<html>
 <head>
  <title>First parse</title>
 </head>
 <body>
  <p style="center">Parsed HTML into a doc.</p>
 </body>
</html>
---------------------
First parse Parsed HTML into a doc.

3. Parsing a body fragment

@Test
public void test2() {
    // The <div> is left unclosed on purpose; the parser closes it
    String html = "<div><p>Lorem ipsum.</p>";
    Document doc = Jsoup.parseBodyFragment(html);
    System.out.println(doc);
    System.out.println(doc.text());
}

Output:

<html>
 <head></head>
 <body>
  <div>
   <p>Lorem ipsum.</p>
  </div>
 </body>
</html>
Lorem ipsum.

4. Parsing a URL and selecting elements (equivalent to viewing the page source)

@Test
public void test4() throws IOException {
    Document doc = Jsoup.connect("http://qiaoliqiang.cn:8080/").get();
    String title = doc.title(); // get the page title
    System.out.println(title);
    // print the whole document
    System.out.println("---------------------\n" + doc.toString() + "---------------------\n");
    Elements links = doc.getElementsByTag("a");
    for (Element ele : links) {
        System.out.println(ele.toString());
    }
}

Output (truncated):

Apache Tomcat/7.0.72
---------------------
<!doctype html>
<head>
<title>Apache Tomcat/7.0.72</title>
..........

5. Parsing HTML with selectors and extracting an input's value (reading element attributes)

/**
 * Flattens the markup of a section heading into plain text. Example input:
 *
 * <span class="bigNum">II</span>,
 * <span><input class="el_modifiedTitle" value="Multiple choice" type="text"> </span>
 * <span>(each question <input class="el_modifiedGrade" value="2" type="text"> </span>
 * <span> points; total </span><span class="numTotal">4 points/</span>
 * <span class="numQues">2 questions)</span>
 *
 * @param html the heading markup, structured as in the example above
 * @return the heading as plain text
 */
public static String removeBigQues(String html) {
    StringBuilder sb = new StringBuilder();
    Document doc = Jsoup.parse(html);
    System.out.println(doc);
    System.out.println(doc.text());
    Elements spans = doc.select("span");
    sb.append(doc.select(".bigNum").get(0).text()).append(". ");
    // An input's value is an attribute, so read it with attr("value") rather than text()
    sb.append(doc.select(".el_modifiedTitle").get(0).attr("value"));
    sb.append(spans.get(2).text()).append(doc.select(".el_modifiedGrade").get(0).attr("value"));
    sb.append(spans.get(3).text());
    sb.append(spans.get(4).text());
    sb.append(spans.get(5).text());
    System.out.println(sb.toString());
    return sb.toString();
}

Addendum: today I found that Jsoup has no method for parsing an element's style, so I had to write it by hand.

First get the style attribute, then process its value, for example:

String style = "position: absolute; width: 500px; height: 552px;";
String extract = "width";
// Note: contains() would also match "max-width", and the "+ 2" below assumes ": " (colon plus one space)
if (style.contains(extract)) {
    style = style.substring(style.indexOf(extract)); // "width: 500px; height: 552px;"
    System.out.println(style);
    style = style.substring(0, style.indexOf(";")); // "width: 500px"
    System.out.println(style);
    String attr = style.substring(style.indexOf(":") + 2); // "500px"
    System.out.println(attr.substring(0, attr.indexOf("px"))); // prints 500
}
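The substring approach works for this exact input. A slightly more general alternative, sketched below, splits the whole declaration list into a map so that any property can be looked up by its exact name. The class name `StyleParser` and its `parse` method are hypothetical, not part of Jsoup, and the sketch assumes that no value contains a semicolon (so quoted strings or `url(...)` values with `;` are not handled).

```java
import java.util.LinkedHashMap;
import java.util.Locale;
import java.util.Map;

public class StyleParser {

    /** Splits an inline style value such as "width: 500px; height: 552px;" into property/value pairs. */
    public static Map<String, String> parse(String style) {
        Map<String, String> declarations = new LinkedHashMap<>();
        if (style == null) {
            return declarations;
        }
        for (String declaration : style.split(";")) {
            int colon = declaration.indexOf(':');
            if (colon < 0) {
                continue; // skip empty or malformed chunks
            }
            String property = declaration.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = declaration.substring(colon + 1).trim();
            if (!property.isEmpty()) {
                declarations.put(property, value);
            }
        }
        return declarations;
    }

    public static void main(String[] args) {
        Map<String, String> css = parse("position: absolute; width: 500px; height: 552px;");
        String width = css.get("width");             // "500px"
        System.out.println(width.replace("px", "")); // prints 500
    }
}
```

With Jsoup, the input string would typically come from `element.attr("style")`.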

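Addendum: the difference between an element's html() and outerHtml()

html() returns the element's inner content, including child elements and their tags, but not the element's own tag.

outerHtml() returns the element itself together with its content.

In jQuery, html() likewise returns only the inner content. To get the markup including the element itself, use $("#chartdiv").prop("outerHTML");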
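The difference is easiest to see side by side. The following is a small sketch: the class name `HtmlVsOuterHtml` and the sample markup are made up for illustration, and pretty-printing is switched off so the two strings contain no added line breaks or indentation.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class HtmlVsOuterHtml {

    /** Returns { inner HTML, outer HTML } of the element with the given id. */
    public static String[] innerAndOuter(String html, String id) {
        Document doc = Jsoup.parse(html);
        // Disable pretty-printing so the serialized markup is not re-indented.
        doc.outputSettings().prettyPrint(false);
        Element element = doc.getElementById(id);
        return new String[] { element.html(), element.outerHtml() };
    }

    public static void main(String[] args) {
        String[] result = innerAndOuter("<div id=\"chartdiv\"><span>chart</span></div>", "chartdiv");
        System.out.println(result[0]); // <span>chart</span>
        System.out.println(result[1]); // <div id="chartdiv"><span>chart</span></div>
    }
}
```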

·Send an HTTP request: first, send an HTTP request to the target page to fetch its HTML. You can do this with Java's built-in java.net.HttpURLConnection or with a higher-level library such as Apache HttpClient or OkHttp.

·Read the response: once the request has been sent and the response has arrived, read the response body, which is usually an HTML string.

·Parse the HTML: then parse the HTML string to extract the information you need. Regular expressions can do this, but a dedicated HTML parsing library such as Jsoup is usually the better choice. Jsoup offers a convenient way to parse HTML documents and to extract and manipulate data with a CSS- or jQuery-like selector syntax.

·If you need to handle more complex pages or more advanced scraping and parsing tasks, you may also want a browser automation tool such as Selenium to simulate real browser behavior. Note, however, that frequent or large-scale scraping may violate a site's terms of use and may even raise legal issues.
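The first two steps above can be sketched with nothing but the JDK. This is a minimal illustration, not a production client: the class name `PageFetcher`, the 3-second timeouts, and the UTF-8 assumption are all choices made for the example, and there is no handling of redirects or error status codes.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URI;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class PageFetcher {

    /** Step 1: prepare a GET request. Nothing is sent until the response is read. */
    public static HttpURLConnection open(String url) throws IOException {
        HttpURLConnection connection = (HttpURLConnection) URI.create(url).toURL().openConnection();
        connection.setRequestMethod("GET");
        connection.setConnectTimeout(3000); // milliseconds; illustrative value
        connection.setReadTimeout(3000);
        return connection;
    }

    /** Step 2: read the whole response body into a string. */
    public static String readBody(InputStream in, Charset charset) throws IOException {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        int read;
        while ((read = in.read(chunk)) != -1) {
            buffer.write(chunk, 0, read);
        }
        return new String(buffer.toByteArray(), charset);
    }

    /** Steps 1 and 2 combined; the page is assumed to be UTF-8 encoded. */
    public static String fetch(String url) throws IOException {
        HttpURLConnection connection = open(url);
        try (InputStream in = connection.getInputStream()) {
            return readBody(in, StandardCharsets.UTF_8);
        } finally {
            connection.disconnect();
        }
    }
}
```

For the third step, the returned string can be handed to `Jsoup.parse(html, url)`. Passing the page URL as the base URI lets relative links be resolved later.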

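Jsoup: an HTML Parsing Framework for Java

The scenario was this: I originally planned to apply for a third-party payment API to check whether a user's payment had succeeded. It turned out that the payment result can also be read from the order-number query page without applying for any API, so the result can be determined by parsing that page's HTML directly. That brings us to the topic of this article: parsing HTML with Jsoup.

Of course, that is not the only use case for jsoup. Another one is web crawling!

As an aside: in the scenario above, using the payment API is really the most reliable approach. But applying for one takes a long time and is sometimes not free, and some payment APIs support only one language, which may not be the language of your project (for example, the project is written in Java but the provider's API supports only PHP), and that adds system complexity. When transaction volume is high and accuracy is required, use the payment API; otherwise, parsing the HTML is a workable shortcut. Its drawback is that if the page's HTML structure changes, the parsing code has to be rewritten. Each approach has trade-offs, so choose according to the scenario.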

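Jsoup overview and usage

Official site: https://jsoup.org/

jsoup provides a convenient API that uses HTML5 DOM methods and CSS selectors to parse HTML. It implements the WHATWG HTML5 specification and parses HTML the same way a browser does.

Scrape and parse HTML from a file, URL, or string
Find and extract data using DOM traversal or CSS selectors
Manipulate HTML elements, attributes, and text
Clean user-submitted content against a safe whitelist to prevent XSS attacks
Output tidy HTML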

Adding the dependency

<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.11.2</version>
</dependency>

Parsing HTML

Parsing from a string

String html = "<html><head><title>First parse</title></head>"
  + "<body><p>Parsed HTML into a doc.</p></body></html>";
Document doc = Jsoup.parse(html);

Parsing from a URL

// Short version
Document doc = Jsoup.connect("http://example.com/").get();
String title = doc.title();

// Full version
doc = Jsoup.connect("http://example.com")
  .data("query", "Java")
  .userAgent("Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.198 Safari/537.36")
  .cookie("auth", "token")
  .timeout(3000)
  .post();

Parsing from a file

File input = new File("/tmp/input.html");
Document doc = Jsoup.parse(input, "UTF-8", "http://example.com/");

Extracting HTML

Traversing HTML

File input = new File("/tmp/input.html");
Document doc = Jsoup.parse(input, "UTF-8", "http://example.com/");

Element content = doc.getElementById("content");
Elements links = content.getElementsByTag("a");
for (Element link : links) {
  String linkHref = link.attr("href");
  String linkText = link.text();
}

Extracting data

String html = "<p>An <a href='http://example.com/'><b>example</b></a> link.</p>";
Document doc = Jsoup.parse(html);
Element link = doc.select("a").first();

String text = doc.body().text(); // "An example link."
String linkHref = link.attr("href"); // "http://example.com/"
String linkText = link.text(); // "example"

String linkOuterH = link.outerHtml();
    // "<a href="http://example.com/"><b>example</b></a>"
String linkInnerH = link.html(); // "<b>example</b>"

Converting relative paths to absolute paths: some a tags use relative paths, and the code below converts them to absolute URLs.

Document doc = Jsoup.connect("http://jsoup.org").get();

Element link = doc.select("a").first();
String relHref = link.attr("href"); // == "/"
String absHref = link.attr("abs:href"); // "http://jsoup.org/"
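The `abs:` prefix only works when the document knows its base URI. `Jsoup.connect(...).get()` sets it automatically; when parsing a string, it has to be passed in. The sketch below shows this without a network call. The class name `AbsoluteLinks` and the sample markup are made up for illustration; `absUrl("href")` is the method form of `attr("abs:href")`.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class AbsoluteLinks {

    /** Parses html against baseUri and returns the absolute URL of the first link, or "" if none can be built. */
    public static String firstAbsoluteHref(String html, String baseUri) {
        Document doc = Jsoup.parse(html, baseUri);
        Element link = doc.select("a[href]").first();
        if (link == null) {
            return "";
        }
        return link.absUrl("href"); // same result as link.attr("abs:href")
    }

    public static void main(String[] args) {
        String html = "<a href=\"../about.html\">About</a>";
        System.out.println(firstAbsoluteHref(html, "http://example.com/docs/")); // http://example.com/about.html
        System.out.println(firstAbsoluteHref(html, "").isEmpty());               // true: no base URI to resolve against
    }
}
```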

CSS selectors

Jsoup supports CSS selectors through the Element.select(String selector) method.

File input = new File("/tmp/input.html");
Document doc = Jsoup.parse(input, "UTF-8", "http://example.com/");

Elements links = doc.select("a[href]"); // a with href
Elements pngs = doc.select("img[src$=.png]");
  // img with src ending .png

Element masthead = doc.select("div.masthead").first();
  // div with class=masthead

Elements resultLinks = doc.select("h3.r > a"); // direct a after h3

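How do you quickly locate the content of an element on a page? Open Chrome, press F12 to open the developer tools, find the DOM node you want, right-click it, choose Copy, then Copy selector. This generates a CSS selector such as body > div > div.content > div.col2 > div > h3:nth-child(10)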
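A selector of that shape can be passed to `select` unchanged. The sketch below uses a made-up page fragment and a hypothetical helper class `CopiedSelector`. Keep in mind that the browser generates the selector from the DOM it has rendered, which can differ from the raw HTML Jsoup receives when the page is modified by JavaScript.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class CopiedSelector {

    /** Returns the text of the first element matching cssQuery, or null when nothing matches. */
    public static String textOf(String html, String cssQuery) {
        Document doc = Jsoup.parse(html);
        Element match = doc.select(cssQuery).first();
        return match == null ? null : match.text();
    }

    public static void main(String[] args) {
        // Hypothetical page fragment standing in for a real site.
        String html = "<div class=\"content\"><h3>News</h3><p>intro</p><h3>Downloads</h3></div>";
        // Same shape as a selector copied from the developer tools: child combinators plus :nth-child.
        System.out.println(textOf(html, "body > div.content > h3:nth-child(3)")); // Downloads
        System.out.println(textOf(html, "body > div.content > h3:nth-child(2)")); // null: the 2nd child is a <p>
    }
}
```

Note that `:nth-child(n)` counts all child elements, not only those with the given tag, which is why the second query matches nothing.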

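Unfortunately, Jsoup (at least in the 1.11.2 release used here; newer releases added a selectXpath method) does not support XPath selectors. This gap was noticed long ago, which led to JsoupXpath.

JsoupXpath is a pure-Java parser for extracting data from HTML with XPath. It reimplements the W3C XPath 1.0 standard syntax from scratch for HTML parsing. Its XPath lexer and parser are built on Antlr4 and its HTML DOM tree is produced by Jsoup, hence the name JsoupXpath. Its author wanted the power and convenience of XPath in Java but could not find a good enough XPath parser, and so wrote JsoupXpath. Its implementation logic is clear, it is easy to extend, and it supports the full W3C XPath 1.0 standard syntax. W3C specification: http://www.w3.org/TR/1999/REC-xpath-19991116 ; the JsoupXpath grammar is described in the file Xpath.g4.

Project page: https://github.com/zhegexiaohuozi/JsoupXpath

If you are interested, have a look at the test cases, which cover a large number of usage scenarios: https://github.com/zhegexiaohuozi/JsoupXpath/blob/master/src/test/java/org/seimicrawler/xpath/JXDocumentTest.java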

Manipulating HTML

jsoup can insert, remove, and extract HTML. The example code speaks for itself.

Setting attributes

// Set attributes
doc.select("div.comments a").attr("rel", "nofollow");
doc.select("div.masthead").attr("title", "jsoup").addClass("round-box");

Inserting HTML

// Insert HTML
Element div = doc.select("div").first(); // <div></div>
div.html("<p>lorem ipsum</p>"); // <div><p>lorem ipsum</p></div>
div.prepend("<p>First</p>");
div.append("<p>Last</p>");
// now: <div><p>First</p><p>lorem ipsum</p><p>Last</p></div>

Element span = doc.select("span").first(); // <span>One</span>
span.wrap("<li><a href='http://example.com/'></a></li>");
// now: <li><a href="http://example.com/"><span>One</span></a></li>

Setting text

// Set text; special characters are escaped when the element is serialized
Element div = doc.select("div").first(); // <div></div>
div.text("five > four"); // <div>five &gt; four</div>
div.prepend("First ");
div.append(" Last");
// now: <div>First five &gt; four Last</div>

Preventing XSS (cross-site scripting) attacks

// Whitelist is the class name in jsoup 1.11.2, the version used here; newer releases call it Safelist
String unsafe = 
  "<p><a href='http://example.com/' onclick='stealCookies()'>Link</a></p>";
String safe = Jsoup.clean(unsafe, Whitelist.basic());
// now: <p><a href="http://example.com/" rel="nofollow">Link</a></p>

References

https://jsoup.org/
http://www.wanghaomiao.cn/archives/25/
https://github.com/zhegexiaohuozi/JsoupXpath
