          Handy Python scripts: PDF to Word, doc to docx, Word to HTML, and other format conversions


          Preface

          Many people who have looked into converting PDF files to Word documents have found that the available tools are paid and often expensive. If you know Python, these conversions stop being a problem.

          PDF to Word

          Word to PDF

          doc to docx

          docx to HTML

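          1. What is Tika?

          Tika is a content analysis toolkit. It ships with a comprehensive set of parser classes that handle nearly all common file formats, extracting each file's metadata and content and returning them in a structured form. It can serve as a general-purpose parsing tool and is particularly useful for the data crawling and processing steps of a search engine. Tika began as a subproject of Apache Lucene and is now a top-level Apache project; in Lucene applications it is a convenient way to extract the content of large document collections for indexing. The Apache Tika toolkit automatically detects the type of a document (Word, PPT, XML, CSV, and so on) and extracts its metadata and text. Tika integrates existing document parsing libraries behind a unified interface, which simplifies parsing across document types and makes it useful for search engine indexing, content analysis, and format conversion.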

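          Tika architecture

          Application programmers can easily integrate Tika into their applications. Tika also provides a command-line interface and a graphical user interface, which make it user-friendly. Tika's architecture is built from four important modules:

          • Language detection mechanism
          • MIME detection mechanism
          • Parser interface
          • Tika facade class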

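          Language detection mechanism

          Whenever a text document is passed to Tika, it detects the language the document is written in. Tika accepts documents without a language annotation and adds the detected language to the document's metadata. To support this, Tika has a class called LanguageIdentifier in the package org.apache.tika.language, along with a language identification repository containing the algorithms for detecting the language of a given text. Internally, Tika uses an N-gram algorithm for language detection.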
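The N-gram idea mentioned above can be illustrated with a toy example. The sketch below is not Tika's implementation: the class name NGramLanguageSketch, the two sample sentences, and the scoring rule (number of distinct character trigrams shared with a profile) are assumptions made purely for illustration, and only the JDK is used.

```java
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Toy character-trigram language guesser (illustrative only, not Tika's algorithm).
class NGramLanguageSketch {
    // language code -> set of trigrams seen in that language's sample text
    private final Map<String, Set<String>> profiles = new LinkedHashMap<>();

    // Splits lower-cased text into overlapping 3-character windows.
    static Set<String> trigrams(String text) {
        String s = text.toLowerCase();
        Set<String> grams = new HashSet<>();
        for (int i = 0; i + 3 <= s.length(); i++) {
            grams.add(s.substring(i, i + 3));
        }
        return grams;
    }

    // Registers a language profile built from a sample text.
    void train(String language, String sample) {
        profiles.put(language, trigrams(sample));
    }

    // Returns the language whose profile shares the most trigrams with the text,
    // or null when no profile has been trained. Ties go to the first trained language.
    String detect(String text) {
        Set<String> grams = trigrams(text);
        String best = null;
        int bestScore = -1;
        for (Map.Entry<String, Set<String>> e : profiles.entrySet()) {
            Set<String> shared = new HashSet<>(grams);
            shared.retainAll(e.getValue());
            if (shared.size() > bestScore) {
                bestScore = shared.size();
                best = e.getKey();
            }
        }
        return best;
    }

    public static void main(String[] args) {
        NGramLanguageSketch detector = new NGramLanguageSketch();
        detector.train("en", "the quick brown fox jumps over the lazy dog and the cat");
        detector.train("de", "der schnelle braune fuchs springt ueber den faulen hund und die katze");
        System.out.println(detector.detect("the dog and the fox"));
    }
}
```

A real detector would use much larger training corpora and frequency-ranked profiles; the sketch only shows why short character sequences are enough to separate languages.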

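          MIME detection mechanism

          Tika can detect a document's type according to the MIME standards. Default MIME type detection is done by org.apache.tika.mime.MimeTypes, and most content type detection goes through the org.apache.tika.detect.Detector interface. Internally, Tika combines several techniques, such as file name globs, content-type hints, magic bytes, and character encodings.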
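Of the techniques listed above, magic bytes are the easiest to show in isolation. The following is a minimal JDK-only sketch, not Tika's detector: the class MagicByteSketch is hypothetical, it knows only three signatures, and the label returned for OLE2 containers is just a name chosen for this example.

```java
import java.nio.charset.StandardCharsets;

// Guesses a MIME type from the first bytes of a file (illustrative sketch only).
class MagicByteSketch {
    private static final byte[] PDF = "%PDF-".getBytes(StandardCharsets.US_ASCII);
    // ZIP local file header; docx, xlsx and pptx files are ZIP containers
    private static final byte[] ZIP = {0x50, 0x4B, 0x03, 0x04};
    // OLE2 compound file header used by legacy .doc, .xls and .ppt files
    private static final byte[] OLE2 = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // True when data begins with the given signature.
    static boolean startsWith(byte[] data, byte[] magic) {
        if (data.length < magic.length) {
            return false;
        }
        for (int i = 0; i < magic.length; i++) {
            if (data[i] != magic[i]) {
                return false;
            }
        }
        return true;
    }

    // Returns a best-effort MIME type, falling back to application/octet-stream.
    static String detect(byte[] head) {
        if (startsWith(head, PDF)) {
            return "application/pdf";
        }
        if (startsWith(head, ZIP)) {
            return "application/zip";
        }
        if (startsWith(head, OLE2)) {
            return "application/x-ole-storage"; // label assumed for this sketch
        }
        return "application/octet-stream";
    }
}
```

Magic bytes alone cannot tell a .docx from any other ZIP archive, which is one reason a detector also looks at file names and content-type hints.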

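          Parser interface

          The Parser interface in org.apache.tika.parser is the key interface for parsing documents in Tika. It extracts the text and metadata from a document and presents them in a uniform way, both to callers and to developers who want to write parser plugins. Tika supports a large number of document formats through concrete parser classes, one per document type. These format-specific classes either implement the parsing logic directly or delegate to external parser libraries.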

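          Tika facade class

          Using the Tika facade class is the simplest and most direct way to call Tika from Java; it follows the facade design pattern. The class is named Tika and lives in the org.apache.tika package of the Tika API. It covers the basic use cases and hides the underlying complexity of the Tika library, such as the MIME detection mechanism, the parser interface, and the language detection mechanism, behind a simple interface.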

          2. Code project

          Goal

          Convert a Word document to HTML

          pom.xml

          <?xml version="1.0" encoding="UTF-8"?>
          <project xmlns="http://maven.apache.org/POM/4.0.0"
                   xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
                   xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
              <parent>
                  <artifactId>springboot-demo</artifactId>
                  <groupId>com.et</groupId>
                  <version>1.0-SNAPSHOT</version>
              </parent>
              <modelVersion>4.0.0</modelVersion>
          
              <artifactId>tika</artifactId>
          
              <properties>
                  <maven.compiler.source>8</maven.compiler.source>
                  <maven.compiler.target>8</maven.compiler.target>
              </properties>
              <dependencies>
          
                  <dependency>
                      <groupId>org.springframework.boot</groupId>
                      <artifactId>spring-boot-starter-web</artifactId>
                  </dependency>
          
                  <dependency>
                      <groupId>org.springframework.boot</groupId>
                      <artifactId>spring-boot-autoconfigure</artifactId>
                  </dependency>
                  <dependency>
                      <groupId>org.springframework.boot</groupId>
                      <artifactId>spring-boot-starter-test</artifactId>
                      <scope>test</scope>
                  </dependency>
                  <dependency>
                      <groupId>org.apache.tika</groupId>
                      <artifactId>tika-parsers</artifactId>
                      <version>1.17</version>
                  </dependency>
                  <dependency>
                      <groupId>org.projectlombok</groupId>
                      <artifactId>lombok</artifactId>
                  </dependency>
          
              </dependencies>
          </project>

          controller

          package com.et.tika.controller;
          
          import com.et.tika.convertor.WordToHtmlConverter;
          import com.et.tika.dto.ConvertedDocumentDTO;
          import lombok.extern.slf4j.Slf4j;
          import org.springframework.beans.factory.annotation.Autowired;
          import org.springframework.web.bind.annotation.RequestMapping;
          import org.springframework.web.bind.annotation.RequestMethod;
          import org.springframework.web.bind.annotation.RequestParam;
          import org.springframework.web.bind.annotation.RestController;
          import org.springframework.web.multipart.MultipartFile;
          
          import java.util.HashMap;
          import java.util.Map;
          
          @RestController
          @Slf4j
          public class HelloWorldController {
              @RequestMapping("/hello")
              public Map<String, Object> showHelloWorld(){
                  Map<String, Object> map = new HashMap<>();
                  map.put("msg", "HelloWorld");
                  return map;
              }
              @Autowired
              WordToHtmlConverter converter;
          
          
          
              /**
               * Transforms the Word document into HTML document and returns the transformed document.
               *
               * @return  The content of the uploaded document as HTML.
               */
              @RequestMapping(value = "/api/word-to-html", method = RequestMethod.POST)
              public ConvertedDocumentDTO convertWordDocumentIntoHtmlDocument(@RequestParam(value = "file", required = true) MultipartFile wordDocument) {
                  log.info("Converting word document into HTML document");
          
                  ConvertedDocumentDTO htmlDocument = converter.convertWordDocumentIntoHtml(wordDocument);
          
                  log.info("Converted word document into HTML document.");
                  log.trace("The created HTML markup looks as follows: {}", htmlDocument);
          
                  return htmlDocument;
              }
          }

          WordToHtmlConverter

          package com.et.tika.convertor;
          
          
          import com.et.tika.dto.ConvertedDocumentDTO;
          import com.et.tika.exception.DocumentConversionException;
          import lombok.extern.slf4j.Slf4j;
          import org.apache.tika.exception.TikaException;
          import org.apache.tika.metadata.Metadata;
          import org.apache.tika.parser.ParseContext;
          import org.apache.tika.parser.Parser;
          import org.apache.tika.parser.microsoft.ooxml.OOXMLParser;
          import org.slf4j.Logger;
          import org.slf4j.LoggerFactory;
          import org.springframework.stereotype.Component;
          import org.springframework.web.multipart.MultipartFile;
          import org.xml.sax.SAXException;
          
          import javax.xml.transform.OutputKeys;
          import javax.xml.transform.TransformerException;
          import javax.xml.transform.sax.SAXTransformerFactory;
          import javax.xml.transform.sax.TransformerHandler;
          import javax.xml.transform.stream.StreamResult;
          import java.io.IOException;
          import java.io.InputStream;
          import java.io.StringWriter;
          
          /**
           *
           */
          @Component
          @Slf4j
          public class WordToHtmlConverter {
          
          
              /**
               * Converts a .docx document into HTML markup. This code
               * is based on <a href="http://stackoverflow.com/a/9053258/313554">this StackOverflow</a> answer.
               *
               * @param wordDocument  The uploaded .docx document to convert.
               * @return The original filename together with the generated HTML markup.
               */
              public ConvertedDocumentDTO convertWordDocumentIntoHtml(MultipartFile wordDocument) {
                  log.info("Converting word document: {} into HTML", wordDocument.getOriginalFilename());
                  // try-with-resources closes the upload stream even if parsing fails
                  try (InputStream input = wordDocument.getInputStream()) {
                      Parser parser = new OOXMLParser();
          
                      // Serialize the SAX events emitted by the parser as indented HTML
                      StringWriter sw = new StringWriter();
                      SAXTransformerFactory factory = (SAXTransformerFactory)
                              SAXTransformerFactory.newInstance();
                      TransformerHandler handler = factory.newTransformerHandler();
                      handler.getTransformer().setOutputProperty(OutputKeys.ENCODING, "utf-8");
                      handler.getTransformer().setOutputProperty(OutputKeys.METHOD, "html");
                      handler.getTransformer().setOutputProperty(OutputKeys.INDENT, "yes");
                      handler.setResult(new StreamResult(sw));
          
                      Metadata metadata = new Metadata();
                      metadata.add(Metadata.CONTENT_TYPE, "text/html;charset=utf-8");
                      parser.parse(input, handler, metadata, new ParseContext());
                      return new ConvertedDocumentDTO(wordDocument.getOriginalFilename(), sw.toString());
                  }
                  catch (IOException | SAXException | TransformerException | TikaException ex) {
                      log.error("Conversion failed because an exception was thrown", ex);
                      throw new DocumentConversionException(ex.getMessage(), ex);
                  }
              }
          }

          dto

          package com.et.tika.dto;
          
          import org.apache.commons.lang.builder.ToStringBuilder;
          
          /**
           *
           */
          public  class ConvertedDocumentDTO {
          
              private final String contentAsHtml;
              private final String filename;
          
              public ConvertedDocumentDTO(String filename, String contentAsHtml) {
                  this.contentAsHtml = contentAsHtml;
                  this.filename = filename;
              }
          
              public String getContentAsHtml() {
                  return contentAsHtml;
              }
          
              public String getFilename() {
                  return filename;
              }
          
              @Override
              public String toString() {
                  return new ToStringBuilder(this)
                          .append("filename", this.filename)
                          .append("contentAsHtml", this.contentAsHtml)
                          .toString();
              }
          }

          Custom exception

          package com.et.tika.exception;
          
          /**
           *
           */
          public final class DocumentConversionException extends RuntimeException {
          
              public DocumentConversionException(String message, Exception ex) {
                  super(message, ex);
              }
          }

          The above is only the key code; the complete code is in the repository below.

          Code repository

          • https://github.com/Harries/springboot-demo

          3. Testing

          Start the Spring Boot application

          Test the Word-to-HTML conversion

          4. References

          • https://tika.apache.org/
          • http://www.liuhaihua.cn/archives/710679.html

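          I. Preface

          Besides converting documents to images, as described in the previous article "Online Document Preview, New Edition (1): Implementing Online Preview by Converting Files to Images", online preview can also be implemented by converting documents to PDF and rendering them in the front end with plugins such as pdf.js or pdfobject.js, or, as covered in this article, by converting documents to HTML.

          The code below uses aspose, poi + pdfbox, and spire in turn to convert txt, Word, PDF, PPT, and other files to HTML.

          1. aspose

          Aspose is a vendor of components for .NET, Java, SharePoint, JasperReports, and SSRS. Thousands of organizations in dozens of countries use Aspose components to create, edit, convert, or render Office, OpenOffice, PDF, image, ZIP, CAD, XPS, EPS, PSD, and other file formats. Note that Aspose is a commercial product: without a license, exported files are covered with watermarks (respect copyright and stay away from cracked versions).

          Add the following dependencies to the project's pom file: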

                  <dependency>
                      <groupId>com.aspose</groupId>
                      <artifactId>aspose-words</artifactId>
                      <version>23.1</version>
                  </dependency>
                  <dependency>
                      <groupId>com.aspose</groupId>
                      <artifactId>aspose-pdf</artifactId>
                      <version>23.1</version>
                  </dependency>
                  <dependency>
                      <groupId>com.aspose</groupId>
                      <artifactId>aspose-cells</artifactId>
                      <version>23.1</version>
                  </dependency>
                  <dependency>
                      <groupId>com.aspose</groupId>
                      <artifactId>aspose-slides</artifactId>
                      <version>23.1</version>
                  </dependency>
          

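          2. poi + pdfbox

          Aspose and spire are convenient, but both are commercial components, so an approach based on open-source libraries is provided as well.

          POI is a free, open-source, cross-platform Java API from the Apache Software Foundation. Apache POI lets Java programs read and write files in Microsoft Office formats.

          Apache PDFBox is an open-source Java library for working with PDF documents. With it you can write Java programs that create, convert, and manipulate PDF documents.

          Add the following dependencies to the project's pom file: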

          		<dependency>
                      <groupId>org.apache.pdfbox</groupId>
                      <artifactId>pdfbox</artifactId>
                      <version>2.0.4</version>
                  </dependency>
          		<dependency>
                      <groupId>org.apache.poi</groupId>
                      <artifactId>poi</artifactId>
                      <version>5.2.0</version>
                  </dependency>
                  <dependency>
                      <groupId>org.apache.poi</groupId>
                      <artifactId>poi-ooxml</artifactId>
                      <version>5.2.0</version>
                  </dependency>
                  <dependency>
                      <groupId>org.apache.poi</groupId>
                      <artifactId>poi-scratchpad</artifactId>
                      <version>5.2.0</version>
                  </dependency>
                  <dependency>
                      <groupId>org.apache.poi</groupId>
                      <artifactId>poi-excelant</artifactId>
                      <version>5.2.0</version>
                  </dependency>
          

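          3. spire

          spire is a professional Office programming component that covers reading, writing, editing, and viewing Word, Excel, PPT, and PDF files. A free edition is available, but it only exports the first 3 pages or the first 500 rows, and reaching either limit triggers the restriction. Going beyond these limits requires the paid edition (respect copyright and stay away from cracked versions). The free edition is used for the demonstration here.

          Before adding the spire dependency, add its Maven repository: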

          		<repository>
                      <id>com.e-iceblue</id>
                      <name>e-iceblue</name>
                      <url>https://repo.e-iceblue.cn/repository/maven-public/</url>
                  </repository>
          

          Then add the following dependency to the project's pom file:

          Free edition:

          		<dependency>
                      <groupId>e-iceblue</groupId>
                      <artifactId>spire.office.free</artifactId>
                      <version>5.3.1</version>
                  </dependency>
          

          Paid edition:

          		<dependency>
                      <groupId>e-iceblue</groupId>
                      <artifactId>spire.office</artifactId>
                      <version>5.3.1</version>
                  </dependency>
          

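          II. Converting files to HTML strings

          1. Converting a Word file to an HTML string

          1.1 Using aspose

          public static String wordToHtmlStr(String wordPath) {
                  // Document.toString() does not return HTML, so save as HTML into memory instead
                  try (ByteArrayOutputStream out = new ByteArrayOutputStream()) {
                      Document doc = new Document(wordPath); // wordPath is the Word document to convert
                      HtmlSaveOptions options = new HtmlSaveOptions();
                      options.setExportImagesAsBase64(true); // embed images in the markup
                      doc.save(out, options);
                      // Assumes the default UTF-8 encoding of HtmlSaveOptions
                      return out.toString("UTF-8");
                  } catch (Exception e) {
                      e.printStackTrace();
                  }
                  return null;
              }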
          

          Result:

          1.2 Using poi

          public String wordToHtmlStr(String wordPath) throws TransformerException, IOException, ParserConfigurationException {
                  String htmlStr = null;
                  String ext = wordPath.substring(wordPath.lastIndexOf("."));
                  if (ext.equals(".docx")) {
                      htmlStr = word2007ToHtmlStr(wordPath);
                  } else if (ext.equals(".doc")){
                      htmlStr = word2003ToHtmlStr(wordPath);
                  } else {
                      throw new RuntimeException("Unsupported file format: " + ext);
                  }
                  return htmlStr;
              }
          
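              public String word2007ToHtmlStr(String wordPath) throws IOException {
                  // Convert into an in-memory output stream
                  try(ByteArrayOutputStream out = new ByteArrayOutputStream()){
                      word2007ToHtmlOutputStream(wordPath, out);
                      return out.toString();
                  }
              }
          
              // XHTMLConverter, XHTMLOptions and Base64EmbedImgManager are not part of POI itself;
              // they appear to come from the xdocreport XWPF-to-XHTML converter, which is not in the pom above.
              private void word2007ToHtmlOutputStream(String wordPath,OutputStream out) throws IOException {
                  // A negative ratio disables POI's zip-bomb check; only do this for trusted input
                  ZipSecureFile.setMinInflateRatio(-1.0d);
                  // try-with-resources closes the input file and the document
                  try (InputStream in = Files.newInputStream(Paths.get(wordPath));
                       XWPFDocument document = new XWPFDocument(in)) {
                      XHTMLOptions options = XHTMLOptions.create().setIgnoreStylesIfUnused(false).setImageManager(new Base64EmbedImgManager());
                      XHTMLConverter.getInstance().convert(document, out, options);
                  }
              }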
          
          
              private String word2003ToHtmlStr(String wordPath) throws TransformerException, IOException, ParserConfigurationException {
                  org.w3c.dom.Document htmlDocument = word2003ToHtmlDocument(wordPath);
                  // Transform document to string
                  StringWriter writer = new StringWriter();
                  TransformerFactory tf = TransformerFactory.newInstance();
                  Transformer transformer = tf.newTransformer();
                  transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "no");
                  transformer.setOutputProperty(OutputKeys.METHOD, "html");
                  transformer.setOutputProperty(OutputKeys.INDENT, "yes");
                  transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
                  transformer.transform(new DOMSource(htmlDocument), new StreamResult(writer));
                  return writer.toString();
              }
          
          private org.w3c.dom.Document word2003ToHtmlDocument(String wordPath) throws IOException, ParserConfigurationException {
                  // try-with-resources closes the input file and the document once the DOM is built
                  try (InputStream input = Files.newInputStream(Paths.get(wordPath));
                       HWPFDocument wordDocument = new HWPFDocument(input)) {
                      WordToHtmlConverter wordToHtmlConverter = new WordToHtmlConverter(
                              DocumentBuilderFactory.newInstance().newDocumentBuilder()
                                      .newDocument());
                      wordToHtmlConverter.setPicturesManager((content, pictureType, suggestedName, widthInches, heightInches) -> {
                          if (PictureType.UNKNOWN.equals(pictureType)) {
                              return null;
                          }
                          // ImgUtil is a third-party image helper (it looks like Hutool's), not part of POI
                          BufferedImage bufferedImage = ImgUtil.toImage(content);
                          String base64Img = ImgUtil.toBase64(bufferedImage, pictureType.getExtension());
                          // Embed each picture as a base64 data URI so the HTML stays a single self-contained page
                          return "data:;base64," + base64Img;
                      });
                      // Parse the Word document
                      wordToHtmlConverter.processDocument(wordDocument);
                      return wordToHtmlConverter.getDocument();
                  }
              }
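
The dispatch in wordToHtmlStr relies on wordPath.substring(wordPath.lastIndexOf(".")), which throws when the path has no dot and treats ".DOCX" differently from ".docx". A minimal sketch of a more forgiving extension lookup, assuming a hypothetical helper class ExtensionSketch (JDK only):

```java
import java.util.Locale;

// Extracts a normalized file extension for format dispatch (illustrative sketch).
class ExtensionSketch {
    // Returns the lower-cased extension including the dot, or "" when the
    // file name part of the path contains no dot.
    static String extensionOf(String path) {
        int separator = Math.max(path.lastIndexOf('/'), path.lastIndexOf('\\'));
        int dot = path.lastIndexOf('.');
        if (dot <= separator) {
            // No dot at all, or the last dot belongs to a directory name
            return "";
        }
        return path.substring(dot).toLowerCase(Locale.ROOT);
    }
}
```

With such a helper the dispatch could compare extensionOf(wordPath) against ".docx" and ".doc" and fall through to the "unsupported format" branch for everything else, including paths without an extension.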

          1.3 Using spire

           public String wordToHtmlStr(String wordPath) throws IOException {
                  try(ByteArrayOutputStream outputStream = new ByteArrayOutputStream()) {
                      Document document = new Document();
                      document.loadFromFile(wordPath);
                      document.saveToFile(outputStream, FileFormat.Html);
                      return outputStream.toString();
                  }
              }

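          2. Converting a PDF file to an HTML string

          2.1 Using aspose

          // Note: despite the heading, this snippet uses PDFBox's PDDocument together with PDFDomTree
          // (which appears to come from the pdf2dom library), the same as section 2.2; it calls no Aspose API.
          public static String pdfToHtmlStr(String pdfPath) throws IOException, ParserConfigurationException {
                  // try-with-resources closes the document and the writer even if conversion fails
                  try (PDDocument document = PDDocument.load(new File(pdfPath));
                       Writer writer = new StringWriter()) {
                      new PDFDomTree().writeText(document, writer);
                      return writer.toString();
                  }
              }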

          Result:

          2.2 Using poi + pdfbox

          public String pdfToHtmlStr(String pdfPath) throws IOException, ParserConfigurationException {
                  // try-with-resources closes the document and the writer even if conversion fails
                  try (PDDocument document = PDDocument.load(new File(pdfPath));
                       Writer writer = new StringWriter()) {
                      new PDFDomTree().writeText(document, writer);
                      return writer.toString();
                  }
              }

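          2.3 Using spire

          public String pdfToHtmlStr(String pdfPath) throws IOException, ParserConfigurationException {
                  try(ByteArrayOutputStream outputStream = new ByteArrayOutputStream()) {
                      PdfDocument pdf = new PdfDocument();
                      pdf.loadFromFile(pdfPath);
                      // Save as HTML into the stream before reading it back; without this step the result is empty.
                      // Assumption: saveToStream with com.spire.pdf.FileFormat.HTML, mirroring the Excel sample in 3.3.
                      pdf.saveToStream(outputStream, com.spire.pdf.FileFormat.HTML);
                      return outputStream.toString();
                  }
              }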
          

          3. Converting an Excel file to an HTML string

          3.1 Using aspose

          // Note: despite the heading, this snippet uses Apache POI classes (XSSFWorkbook, DataFormatter), not an Aspose API.
          public static String excelToHtmlStr(String excelPath) throws Exception {
                  // try-with-resources closes the file stream and the workbook
                  try (FileInputStream fileInputStream = new FileInputStream(excelPath);
                       Workbook workbook = new XSSFWorkbook(fileInputStream)) {
                      DataFormatter dataFormatter = new DataFormatter();
                      FormulaEvaluator formulaEvaluator = workbook.getCreationHelper().createFormulaEvaluator();
                      Sheet sheet = workbook.getSheetAt(0); // only the first sheet is converted
                      StringBuilder htmlStringBuilder = new StringBuilder();
                      htmlStringBuilder.append("<html><head><title>Excel to HTML using Java and POI library</title>");
                      htmlStringBuilder.append("<style>table, th, td { border: 1px solid black; }</style>");
                      htmlStringBuilder.append("</head><body><table>");
                      for (Row row : sheet) {
                          htmlStringBuilder.append("<tr>");
                          for (Cell cell : row) {
                              // Evaluate formulas so the formatted value reflects the computed result
                              if (cell.getCellType() == CellType.FORMULA) {
                                  formulaEvaluator.evaluateFormulaCell(cell);
                              }
                              String cellValue = dataFormatter.formatCellValue(cell, formulaEvaluator);
                              htmlStringBuilder.append("<td>").append(cellValue).append("</td>");
                          }
                          htmlStringBuilder.append("</tr>");
                      }
                      htmlStringBuilder.append("</table></body></html>");
                      return htmlStringBuilder.toString();
                  }
              }
          

          The returned HTML string:

          <html><head><title>Excel to HTML using Java and POI library</title><style>table, th, td { border: 1px solid black; }</style></head><body><table><tr><td>序號</td><td>姓名</td><td>性別</td><td>聯系方式</td><td>地址</td></tr><tr><td>1</td><td>張曉玲</td><td>女</td><td>11111111111</td><td>上海市浦東新區xx路xx弄xx號</td></tr><tr><td>2</td><td>王小二</td><td>男</td><td>1222222</td><td>上海市浦東新區xx路xx弄xx號</td></tr><tr><td>1</td><td>張曉玲</td><td>女</td><td>11111111111</td><td>上海市浦東新區xx路xx弄xx號</td></tr><tr><td>2</td><td>王小二</td><td>男</td><td>1222222</td><td>上海市浦東新區xx路xx弄xx號</td></tr><tr><td>1</td><td>張曉玲</td><td>女</td><td>11111111111</td><td>上海市浦東新區xx路xx弄xx號</td></tr><tr><td>2</td><td>王小二</td><td>男</td><td>1222222</td><td>上海市浦東新區xx路xx弄xx號</td></tr><tr><td>1</td><td>張曉玲</td><td>女</td><td>11111111111</td><td>上海市浦東新區xx路xx弄xx號</td></tr><tr><td>2</td><td>王小二</td><td>男</td><td>1222222</td><td>上海市浦東新區xx路xx弄xx號</td></tr><tr><td>1</td><td>張曉玲</td><td>女</td><td>11111111111</td><td>上海市浦東新區xx路xx弄xx號</td></tr><tr><td>2</td><td>王小二</td><td>男</td><td>1222222</td><td>上海市浦東新區xx路xx弄xx號</td></tr><tr><td>1</td><td>張曉玲</td><td>女</td><td>11111111111</td><td>上海市浦東新區xx路xx弄xx號</td></tr><tr><td>2</td><td>王小二</td><td>男</td><td>1222222</td><td>上海市浦東新區xx路xx弄xx號</td></tr><tr><td>1</td><td>張曉玲</td><td>女</td><td>11111111111</td><td>上海市浦東新區xx路xx弄xx號</td></tr><tr><td>2</td><td>王小二</td><td>男</td><td>1222222</td><td>上海市浦東新區xx路xx弄xx號</td></tr></table></body></html>
          

          3.2 Using poi + pdfbox

          public String excelToHtmlStr(String excelPath) throws Exception {
                  // WorkbookFactory opens the file itself, so no separate FileInputStream is needed
                  try (Workbook workbook = WorkbookFactory.create(new File(excelPath))){
                      DataFormatter dataFormatter = new DataFormatter();
                      FormulaEvaluator formulaEvaluator = workbook.getCreationHelper().createFormulaEvaluator();
                      org.apache.poi.ss.usermodel.Sheet sheet = workbook.getSheetAt(0);
                      StringBuilder htmlStringBuilder = new StringBuilder();
                      htmlStringBuilder.append("<html><head><title>Excel to HTML using Java and POI library</title>");
                      htmlStringBuilder.append("<style>table, th, td { border: 1px solid black; }</style>");
                      htmlStringBuilder.append("</head><body><table>");
                      for (Row row : sheet) {
                          htmlStringBuilder.append("<tr>");
                          for (Cell cell : row) {
                              CellType cellType = cell.getCellType();
                              if (cellType == CellType.FORMULA) {
                                  formulaEvaluator.evaluateFormulaCell(cell);
                                  cellType = cell.getCachedFormulaResultType();
                              }
                              String cellValue = dataFormatter.formatCellValue(cell, formulaEvaluator);
                              htmlStringBuilder.append("<td>").append(cellValue).append("</td>");
                          }
                          htmlStringBuilder.append("</tr>");
                      }
                      htmlStringBuilder.append("</table></body></html>");
                      return htmlStringBuilder.toString();
                  }
              }
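
Both Excel samples append cell values to the markup unchanged, so a cell containing < or & would produce broken HTML. A minimal sketch of escaping the cell text before wrapping it in a td element, assuming a hypothetical helper class HtmlEscapeSketch (JDK only, not part of POI):

```java
// Escapes text so it can be placed safely inside HTML markup (illustrative sketch).
class HtmlEscapeSketch {
    // Replaces the five characters that are significant in HTML text and attributes.
    static String escapeHtml(String value) {
        StringBuilder sb = new StringBuilder(value.length());
        for (int i = 0; i < value.length(); i++) {
            char c = value.charAt(i);
            switch (c) {
                case '&': sb.append("&amp;"); break;
                case '<': sb.append("&lt;"); break;
                case '>': sb.append("&gt;"); break;
                case '"': sb.append("&quot;"); break;
                case '\'': sb.append("&#39;"); break;
                default: sb.append(c);
            }
        }
        return sb.toString();
    }

    // Builds one table cell from a raw (unescaped) cell value.
    static String cell(String value) {
        return "<td>" + escapeHtml(value) + "</td>";
    }
}
```

In the loops above, the append of "<td>", cellValue and "</td>" could then be replaced by a single append of HtmlEscapeSketch.cell(cellValue).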
          

          3.3 Using spire

          public String excelToHtmlStr(String excelPath) throws Exception {
                  try(ByteArrayOutputStream outputStream = new ByteArrayOutputStream()) {
                      Workbook workbook = new Workbook();
                      workbook.loadFromFile(excelPath);
                      workbook.saveToStream(outputStream, com.spire.xls.FileFormat.HTML);
                      return outputStream.toString();
                  }
              }
          

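          III. Converting files to HTML and generating an HTML file

          Sometimes returning an HTML string is not enough and an actual HTML file is needed. A low-effort approach is to write the string to the target path with the FileUtils utility class from the org.apache.commons.io package.

          Example of using FileUtils to write an HTML string to a file:

          First add the dependency to the pom: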

          		<dependency>
                      <groupId>commons-io</groupId>
                      <artifactId>commons-io</artifactId>
                      <version>2.8.0</version>
                  </dependency>
          

          Code:

          String htmlStr = FileConvertUtil.wordToHtmlStr("D:\\書籍\\電子書\\小說\\歷史小說\\最后的可汗.doc");
          FileUtils.write(new File("D:\\test\\doc.html"), htmlStr, "utf-8");
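
If adding commons-io is not desirable, the same result can be had with java.nio.file.Files from the JDK. The following is a minimal sketch that stays compatible with the Java 8 target used in the pom above; the class name HtmlFileWriterSketch and its method are illustrative assumptions.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Writes an HTML string to a file using only the JDK (illustrative sketch).
class HtmlFileWriterSketch {
    // Creates missing parent directories, then writes the markup as UTF-8.
    // An existing file at the target path is overwritten.
    static Path writeHtml(Path target, String html) throws IOException {
        Path parent = target.toAbsolutePath().getParent();
        if (parent != null) {
            Files.createDirectories(parent);
        }
        return Files.write(target, html.getBytes(StandardCharsets.UTF_8));
    }
}
```

Usage would look like HtmlFileWriterSketch.writeHtml(Paths.get("D:\\test\\doc.html"), htmlStr), with the encoding fixed to UTF-8 rather than passed as a string.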
          

          Alternatively, the code above can be adjusted to write the HTML file directly:

          1. Converting a Word file to an HTML file

          The original Word file:

          1.1 Using aspose

          public static void wordToHtml(String wordPath, String htmlPath) {
                  File sourceFile = new File(wordPath);
                  // Target file: same base name as the source, with an .html extension, inside htmlPath
                  String path = htmlPath + File.separator + sourceFile.getName().substring(0, sourceFile.getName().lastIndexOf(".")) + ".html";
                  File file = new File(path);
                  // try-with-resources closes the output file even if the conversion fails
                  try (FileOutputStream os = new FileOutputStream(file)) {
                      Document doc = new Document(wordPath); // wordPath is the Word document to convert
                      HtmlSaveOptions options = new HtmlSaveOptions();
                      options.setExportImagesAsBase64(true);
                      options.setExportRelativeFontSize(true);
                      doc.save(os, options);
                  } catch (Exception e) {
                      e.printStackTrace();
                  }
              }
          
          

          The converted HTML:

          1.2 Using poi + pdfbox

          public void wordToHtml(String wordPath, String htmlPath) throws TransformerException, IOException, ParserConfigurationException {
                  htmlPath = FileUtil.getNewFileFullPath(wordPath, htmlPath, "html");
                  String ext = wordPath.substring(wordPath.lastIndexOf("."));
                  if (ext.equals(".docx")) {
                      word2007ToHtml(wordPath, htmlPath);
                  } else if (ext.equals(".doc")){
                      word2003ToHtml(wordPath, htmlPath);
                  } else {
                      throw new RuntimeException("Unsupported file format: " + ext);
                  }
              }
          
              public void word2007ToHtml(String wordPath, String htmlPath) throws TransformerException, IOException, ParserConfigurationException {
                  //try(OutputStream out = Files.newOutputStream(Paths.get(path))){
                  try(FileOutputStream out = new FileOutputStream(htmlPath)){
                      word2007ToHtmlOutputStream(wordPath, out);
                  }
              }
          
              private void word2007ToHtmlOutputStream(String wordPath,OutputStream out) throws IOException {
                  // A negative ratio disables POI's zip-bomb check; only do this for trusted input
                  ZipSecureFile.setMinInflateRatio(-1.0d);
                  // try-with-resources closes the input file and the document
                  try (InputStream in = Files.newInputStream(Paths.get(wordPath));
                       XWPFDocument document = new XWPFDocument(in)) {
                      XHTMLOptions options = XHTMLOptions.create().setIgnoreStylesIfUnused(false).setImageManager(new Base64EmbedImgManager());
                      // Write the converted markup to the caller's output stream
                      XHTMLConverter.getInstance().convert(document, out, options);
                  }
              }
          
              public void word2003ToHtml(String wordPath, String htmlPath) throws TransformerException, IOException, ParserConfigurationException {
                  org.w3c.dom.Document htmlDocument = word2003ToHtmlDocument(wordPath);
                  // Serialize the DOM to the target HTML file
          
                  try(OutputStream outStream = Files.newOutputStream(Paths.get(htmlPath))){
                      DOMSource domSource = new DOMSource(htmlDocument);
                      StreamResult streamResult = new StreamResult(outStream);
                      TransformerFactory factory = TransformerFactory.newInstance();
                      Transformer serializer = factory.newTransformer();
                      serializer.setOutputProperty(OutputKeys.ENCODING, "utf-8");
                      serializer.setOutputProperty(OutputKeys.INDENT, "yes");
                      serializer.setOutputProperty(OutputKeys.METHOD, "html");
                      serializer.transform(domSource, streamResult);
                  }
              }
          
              private org.w3c.dom.Document word2003ToHtmlDocument(String wordPath) throws IOException, ParserConfigurationException {
                  // try-with-resources closes the input stream and the document
                  try (InputStream input = Files.newInputStream(Paths.get(wordPath));
                       HWPFDocument wordDocument = new HWPFDocument(input)) {
                      WordToHtmlConverter wordToHtmlConverter = new WordToHtmlConverter(
                              DocumentBuilderFactory.newInstance().newDocumentBuilder()
                                      .newDocument());
                      wordToHtmlConverter.setPicturesManager((content, pictureType, suggestedName, widthInches, heightInches) -> {
                          // Skip pictures whose format cannot be identified
                          if (PictureType.UNKNOWN.equals(pictureType)) {
                              return null;
                          }
                          BufferedImage bufferedImage = ImgUtil.toImage(content);
                          String base64Img = ImgUtil.toBase64(bufferedImage, pictureType.getExtension());
                          // For Word files with pictures, embed each picture as base64 so everything lives in a single page
                          return "data:;base64," + base64Img;
                      });
                      // Parse the Word document
                      wordToHtmlConverter.processDocument(wordDocument);
                      return wordToHtmlConverter.getDocument();
                  }
              }
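
          The picture handler above returns each image as a base64 data URI with the media type left empty ("data:;base64,"). As a minimal sketch of the same idea using only the JDK, the helper below encodes the raw bytes directly and fills in a MIME type. The class and method names are hypothetical and not part of POI or Hutool, and unlike the code above it does not re-encode the picture through a BufferedImage.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.Locale;

/** Hypothetical helper (not part of POI or Hutool): builds a data URI for embedding an image in HTML. */
final class DataUriSketch {

    /** Maps a file extension to a MIME type; unknown extensions fall back to a generic binary type. */
    static String mimeFor(String extension) {
        switch (extension.toLowerCase(Locale.ROOT)) {
            case "png":  return "image/png";
            case "jpg":
            case "jpeg": return "image/jpeg";
            case "gif":  return "image/gif";
            case "bmp":  return "image/bmp";
            default:     return "application/octet-stream";
        }
    }

    /** Encodes raw image bytes as "data:<mime>;base64,<payload>". */
    static String toDataUri(byte[] content, String extension) {
        String payload = Base64.getEncoder().encodeToString(content);
        return "data:" + mimeFor(extension) + ";base64," + payload;
    }

    public static void main(String[] args) {
        // Stand-in bytes for illustration only; a real caller would pass the picture content
        byte[] fakeImage = "hello".getBytes(StandardCharsets.UTF_8);
        System.out.println(toDataUri(fakeImage, "png"));
    }
}
```

          In the picture handler, the return value of such a helper could take the place of the hand-built string, assuming the picture's extension is one the helper knows.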
          

          Result after conversion to HTML:

          1.3 Using Spire

          public void wordToHtml(String wordPath, String htmlPath) {
                  htmlPath = FileUtil.getNewFileFullPath(wordPath, htmlPath, "html");
                  Document document = new Document();
                  document.loadFromFile(wordPath);
                  document.saveToFile(htmlPath, FileFormat.Html);
              }
          

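          Result after conversion to HTML:

          Because the free edition is used, there are limits on page count and word count; choose the paid version if you need full functionality. PS: this time, surprisingly, the first 50 pages of a 76-page document were converted successfully.

          2. Converting PDF files to HTML

          Original image-based PDF:

          Original text-based PDF:

          2.1 Using Aspose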

          public static void pdfToHtml(String pdfPath, String htmlPath) throws IOException, ParserConfigurationException {
                  // PDDocument belongs to PDFBox and PDFDomTree to pdf2dom; this snippet makes no Aspose calls
                  File file = new File(pdfPath);
                  // Output file: <htmlPath>/<PDF base name>.html
                  String path = htmlPath + File.separator + file.getName().substring(0, file.getName().lastIndexOf(".")) + ".html";
                  // try-with-resources closes the document and the writer even if the conversion fails
                  try (PDDocument document = PDDocument.load(file);
                       Writer writer = new PrintWriter(path, "UTF-8")) {
                      new PDFDomTree().writeText(document, writer);
                  }
              }
          

          Result for the image-based PDF:

          Result for the text-based PDF:

          2.2 Using POI + PDFBox

          public void pdfToHtml(String pdfPath, String htmlPath) throws IOException, ParserConfigurationException {
                  String path = FileUtil.getNewFileFullPath(pdfPath, htmlPath, "html");
                  // try-with-resources closes the document and the writer even if the conversion fails
                  try (PDDocument document = PDDocument.load(new File(pdfPath));
                       Writer writer = new PrintWriter(path, "UTF-8")) {
                      new PDFDomTree().writeText(document, writer);
                  }
              }
          

          Result for the image-based PDF:

          Result for the text-based PDF:

          2.3 Using Spire

          public void pdfToHtml(String pdfPath, String htmlPath) throws IOException, ParserConfigurationException {
                  htmlPath = FileUtil.getNewFileFullPath(pdfPath, htmlPath, "html");
                  PdfDocument pdf = new PdfDocument();
                  pdf.loadFromFile(pdfPath);
                  pdf.saveToFile(htmlPath, com.spire.pdf.FileFormat.HTML);
              }
          

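          Result for the image-based PDF:
          Because the free edition is used, only the first three pages are converted correctly. Choose the paid version if you need more than three pages.

          Result for the text-based PDF:

          The conversion failed with an error: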

          java.lang.NullPointerException
          	at com.spire.pdf.PdfPageWidget.spr┢?(Unknown Source)
          	at com.spire.pdf.PdfPageWidget.getSize(Unknown Source)
          	at com.spire.pdf.PdfPageBase.spr???—(Unknown Source)
          	at com.spire.pdf.PdfPageBase.getActualSize(Unknown Source)
          	at com.spire.pdf.PdfPageBase.getSection(Unknown Source)
          	at com.spire.pdf.general.PdfDestination.spr︻┎?—(Unknown Source)
          	at com.spire.pdf.general.PdfDestination.spr┻┑?—(Unknown Source)
          	at com.spire.pdf.general.PdfDestination.getElement(Unknown Source)
          	at com.spire.pdf.primitives.PdfDictionary.setProperty(Unknown Source)
          	at com.spire.pdf.bookmarks.PdfBookmark.setDestination(Unknown Source)
          	at com.spire.pdf.bookmarks.PdfBookmarkWidget.spr┭┘?—(Unknown Source)
          	at com.spire.pdf.bookmarks.PdfBookmarkWidget.getDestination(Unknown Source)
          	at com.spire.pdf.PdfDocumentBase.spr??(Unknown Source)
          	at com.spire.pdf.widget.PdfPageCollection.spr┦?(Unknown Source)
          	at com.spire.pdf.widget.PdfPageCollection.removeAt(Unknown Source)
          	at com.spire.pdf.PdfDocumentBase.spr┞?(Unknown Source)
          	at com.spire.pdf.PdfDocument.loadFromFile(Unknown Source)
          

          3. Converting Excel files to HTML

          Original Excel file:

          3.1 Using Aspose

          public void excelToHtml(String excelPath, String htmlPath) throws Exception {
                  htmlPath = FileUtil.getNewFileFullPath(excelPath, htmlPath, "html");
                  Workbook workbook = new Workbook(excelPath);
                  com.aspose.cells.HtmlSaveOptions options = new com.aspose.cells.HtmlSaveOptions();
                  workbook.save(htmlPath, options);
              }
          

          Result after conversion to HTML:

          3.2 Using POI

          public void excelToHtml(String excelPath, String htmlPath) throws Exception {
                  String path = FileUtil.getNewFileFullPath(excelPath, htmlPath, "html");
                  try(FileOutputStream fileOutputStream = new FileOutputStream(path)){
                      String htmlStr = excelToHtmlStr(excelPath);
                      byte[] bytes = htmlStr.getBytes(java.nio.charset.StandardCharsets.UTF_8); // fixed encoding instead of the platform default
                      fileOutputStream.write(bytes);
                  }
              }
          
          
              public String excelToHtmlStr(String excelPath) throws Exception {
                  // WorkbookFactory opens the file itself, so no separate FileInputStream is needed
                  try (Workbook workbook = WorkbookFactory.create(new File(excelPath))){
                      DataFormatter dataFormatter = new DataFormatter();
                      FormulaEvaluator formulaEvaluator = workbook.getCreationHelper().createFormulaEvaluator();
                      org.apache.poi.ss.usermodel.Sheet sheet = workbook.getSheetAt(0);
                      StringBuilder htmlStringBuilder = new StringBuilder();
                      htmlStringBuilder.append("<html><head><title>Excel to HTML using Java and POI library</title>");
                      htmlStringBuilder.append("<style>table, th, td { border: 1px solid black; }</style>");
                      htmlStringBuilder.append("</head><body><table>");
                      for (Row row : sheet) {
                          htmlStringBuilder.append("<tr>");
                          for (Cell cell : row) {
                              // formatCellValue evaluates formula cells through the evaluator and returns the text as displayed
                              String cellValue = dataFormatter.formatCellValue(cell, formulaEvaluator);
                              // cellValue is written as-is; it is not HTML-escaped here
                              htmlStringBuilder.append("<td>").append(cellValue).append("</td>");
                          }
                          htmlStringBuilder.append("</tr>");
                      }
                      htmlStringBuilder.append("</table></body></html>");
                      return htmlStringBuilder.toString();
                  }
              }
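
          The POI listing above writes each cell's text straight into a <td> element, so a cell containing characters such as < or & can break the generated markup. The following is a minimal sketch of an escaping step using only the JDK; the class and method names are hypothetical and not part of POI.

```java
/** Hypothetical helper (not part of POI): escapes text before it is placed inside an HTML element. */
final class HtmlEscapeSketch {

    /** Replaces the five characters that carry special meaning in HTML text and attribute values. */
    static String escapeHtml(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&':  sb.append("&amp;");  break;
                case '<':  sb.append("&lt;");   break;
                case '>':  sb.append("&gt;");   break;
                case '"':  sb.append("&quot;"); break;
                case '\'': sb.append("&#39;");  break;
                default:   sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String cellValue = "a < b & \"c\"";
        // prints <td>a &lt; b &amp; &quot;c&quot;</td>
        System.out.println("<td>" + escapeHtml(cellValue) + "</td>");
    }
}
```

          In excelToHtmlStr, the cell text could be passed through such a helper before it is appended to the table row.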
          

          Result after conversion to HTML:

          3.3 Using Spire

          public void excelToHtml(String excelPath, String htmlPath) throws Exception {
                  htmlPath = FileUtil.getNewFileFullPath(excelPath, htmlPath, "html");
                  Workbook workbook = new Workbook();
                  workbook.loadFromFile(excelPath);
                  workbook.saveToFile(htmlPath, com.spire.xls.FileFormat.HTML);
              }
          

          Result after conversion to HTML:

          IV. Summary

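          The results above show that the HTML output is not ideal: many style details are not reproduced. This is because such conversions usually aim to produce simple, clean HTML from the semantic information in the document and ignore everything else, so complex styling such as centering, first-line indentation, fonts, text size, and color is dropped along the way. For example, any paragraph with the Heading 1 style is converted to an h1 element rather than reproducing the heading's exact appearance. The resulting HTML therefore often looks different from the original, and for more complex documents the conversion is unlikely to be perfect. For documents that use only simple styles, or when styling is not a concern, this approach is still worth trying.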
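
          To make the Heading 1 example concrete, here is a minimal sketch of semantic mapping: a style name selects an HTML element and every visual property is discarded. The class, the style names, and the fallback to p are assumptions for illustration; this is not how any of the libraries above are implemented.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical illustration: style names map to HTML elements and visual details are dropped. */
final class SemanticMappingSketch {

    // Assumed style names; real documents may use localized or custom names.
    private static final Map<String, String> STYLE_TO_TAG = new LinkedHashMap<>();
    static {
        STYLE_TO_TAG.put("Heading 1", "h1");
        STYLE_TO_TAG.put("Heading 2", "h2");
        STYLE_TO_TAG.put("Heading 3", "h3");
    }

    /** Wraps the text in the element for its style; unknown styles become a plain paragraph. */
    static String toHtml(String styleName, String text) {
        String tag = STYLE_TO_TAG.getOrDefault(styleName, "p");
        // The text is assumed to be HTML-escaped already
        return "<" + tag + ">" + text + "</" + tag + ">";
    }

    public static void main(String[] args) {
        // Font, size and color of the heading are not carried over
        System.out.println(toHtml("Heading 1", "Overview"));
        // A style with no mapping loses its centering and becomes <p>
        System.out.println(toHtml("Centered Body", "Some text"));
    }
}
```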

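          PS: For better display fidelity, the approach from the previous article, "Online Document Preview (1): Implementing Online Preview by Converting txt, Word, and PDF to Images", can be combined with this one: render the document content as images (most likely several), place all of the generated images in a single HTML page, use HTML + CSS to preserve the layout and display the images, and return that HTML. The open-source component kkFileView takes this approach.

          kkFileView renders as follows:

          The figure below shows the HTML returned by kkFileView. From the HTML we can see that kkFileView converts each page of a file (except txt files) into an image, embeds these images in a single HTML page, and returns that page to the user.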
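
          The image-per-page idea described above can be sketched as follows, assuming the page images have already been generated. The class name, method name, CSS, and file names are hypothetical, and this is not kkFileView's actual code.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical sketch: wraps per-page images in one HTML page. Not kkFileView's actual code. */
final class ImagePagesHtmlSketch {

    /** Builds an HTML page with one <img> per page image, stacked vertically and centered. */
    static String buildPage(String title, List<String> imageSources) {
        StringBuilder html = new StringBuilder();
        html.append("<!DOCTYPE html>\n<html><head><meta charset=\"utf-8\"><title>")
            .append(title)
            .append("</title>\n<style>img { display: block; margin: 8px auto; max-width: 100%; }</style>\n")
            .append("</head><body>\n");
        int pageNo = 1;
        for (String src : imageSources) {
            html.append("<img src=\"").append(src)
                .append("\" alt=\"Page ").append(pageNo++).append("\">\n");
        }
        html.append("</body></html>\n");
        return html.toString();
    }

    public static void main(String[] args) {
        // Assumed file names; each entry could also be a base64 data URI
        List<String> pages = Arrays.asList("page-1.png", "page-2.png");
        System.out.print(buildPage("preview", pages));
    }
}
```

          Because every page is an image, the original layout is preserved exactly, at the cost of losing selectable text.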

