Perfect Word-to-HTML conversion (doc and docx, with images encoded as Base64)

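I've recently been working on a project that needs Word-to-HTML conversion: the user uploads a Word document, and it is converted to HTML for online preview and editing. Because I store the previewed and edited document in S3, I chose to convert the images in the Word file directly to Base64 and upload them to S3 as part of the HTML. The advantage is that no extra storage (MongoDB, for example) is needed for the images; the drawback is that the resulting HTML text is somewhat larger than the corresponding Word file. Which approach is better depends on your own situation.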
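
To put a rough number on that size cost: Base64 writes every 3 bytes as 4 characters, so embedded image data grows by about a third. The sketch below is only an illustration; the class name `Base64Overhead` and the 300,000-byte stand-in array are made up for the example and are not part of the project code.

```java
import java.util.Base64;

/** Illustrates how much Base64 inflates binary image data embedded in HTML. */
final class Base64Overhead {

    /** Length of padded Base64 text for n input bytes: 4 characters per 3-byte group. */
    static long encodedLength(long n) {
        return 4 * ((n + 2) / 3);
    }

    public static void main(String[] args) {
        byte[] fakeImage = new byte[300_000]; // hypothetical stand-in for a picture
        String encoded = Base64.getEncoder().encodeToString(fakeImage);
        System.out.println("raw bytes:    " + fakeImage.length);
        System.out.println("base64 chars: " + encoded.length());
        System.out.println("predicted:    " + encodedLength(fakeImage.length));
    }
}
```

Text outside the images is not affected, so how much larger the HTML ends up depends mostly on how image-heavy the document is.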

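There are plenty of articles online about Word-to-HTML conversion, so I won't show the approach that saves the images in the Word file to a folder. I'll only post the code that converts images to Base64, for anyone who needs it. If you spot a problem, please point it out so we can all improve.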

First, the required jars:

dependencies {
    compile group: 'fr.opensagres.xdocreport', name: 'xdocreport', version: '2.0.2'
    // https://mvnrepository.com/artifact/org.apache.poi/poi
    compile group: 'org.apache.poi', name: 'poi', version: '4.1.0'
    // https://mvnrepository.com/artifact/org.apache.poi/poi-scratchpad
    compile group: 'org.apache.poi', name: 'poi-scratchpad', version: '4.1.0'
    // https://mvnrepository.com/artifact/org.apache.poi/poi-ooxml
    compile group: 'org.apache.poi', name: 'poi-ooxml', version: '4.1.0'
    // https://mvnrepository.com/artifact/org.apache.poi/poi-ooxml-schemas
    compile group: 'org.apache.poi', name: 'poi-ooxml-schemas', version: '4.1.0'
    // https://mvnrepository.com/artifact/org.apache.poi/ooxml-schemas
    compile group: 'org.apache.poi', name: 'ooxml-schemas', version: '1.4'
}

I manage this project with Gradle; the same dependencies can be declared in Maven in much the same way.

Because doc and docx are converted in different ways, I'll post the code for each separately.
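
If the file extension of an upload can't be trusted, one way to choose between the two code paths is to look at the first bytes of the file: legacy .doc files are OLE2 compound documents, and .docx files are ZIP archives. This is a minimal sketch under that assumption; `WordFormatSniffer` and its enum are hypothetical names, and the signatures alone cannot tell a .doc from an .xls, or a .docx from any other ZIP file.

```java
/** Guesses the container format of an uploaded file from its leading bytes. */
final class WordFormatSniffer {

    enum Format { OLE2_BINARY, ZIP_OOXML, UNKNOWN }

    // OLE2 compound file signature (used by legacy .doc)
    private static final byte[] OLE2_MAGIC =
            {(byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0, (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1};
    // ZIP local file header signature "PK\3\4" (used by .docx)
    private static final byte[] ZIP_MAGIC = {0x50, 0x4B, 0x03, 0x04};

    /** head should hold at least the first 8 bytes of the file. */
    static Format sniff(byte[] head) {
        if (startsWith(head, OLE2_MAGIC)) return Format.OLE2_BINARY;
        if (startsWith(head, ZIP_MAGIC)) return Format.ZIP_OOXML;
        return Format.UNKNOWN;
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data == null || data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] zipHead = {0x50, 0x4B, 0x03, 0x04, 0x14, 0x00, 0x06, 0x00};
        System.out.println(sniff(zipHead));     // ZIP_OOXML
        System.out.println(sniff(new byte[8])); // UNKNOWN
    }
}
```

An `OLE2_BINARY` result would then be routed to the doc conversion and a `ZIP_OOXML` result to the docx conversion.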

doc to HTML

import org.apache.commons.io.FileUtils;
import org.apache.poi.hwpf.HWPFDocumentCore;
import org.apache.poi.hwpf.converter.WordToHtmlConverter;
import org.apache.poi.hwpf.converter.WordToHtmlUtils;
import org.apache.poi.hwpf.usermodel.Picture;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.*;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import java.io.*;
import java.util.Base64;

public class DocToHtml{
    public static void main(String[] args) throws ParserConfigurationException, TransformerException, IOException {
        DocToHtml docToHtml = new DocToHtml();
        docToHtml.docToHtml();
    }
    public void docToHtml() throws IOException, ParserConfigurationException, TransformerException {
        // The backslash must be escaped inside a Java string literal.
        HWPFDocumentCore wordDocument = WordToHtmlUtils.loadDoc(new FileInputStream("D:\\345.doc"));
        WordToHtmlConverter wordToHtmlConverter = new ImageConverter(
                DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument()
        );
        wordToHtmlConverter.processDocument(wordDocument);
        Document htmlDocument = wordToHtmlConverter.getDocument();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        DOMSource domSource = new DOMSource(htmlDocument);
        StreamResult streamResult = new StreamResult(out);
        TransformerFactory transformerFactory = TransformerFactory.newInstance();
        Transformer serializer = transformerFactory.newTransformer();
        serializer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
        serializer.setOutputProperty(OutputKeys.INDENT, "yes");
        serializer.setOutputProperty(OutputKeys.METHOD, "html");
        serializer.transform(domSource, streamResult);
        out.close();
        // Decode with the charset the serializer wrote, not the platform default.
        String result = new String(out.toByteArray(), "UTF-8");
        FileUtils.writeStringToFile(new File("D:\\", "a.html"), result, "utf-8");
    }
    
    public class ImageConverter extends WordToHtmlConverter{

        public ImageConverter(Document document) {
            super(document);
        }
        @Override
        protected void processImageWithoutPicturesManager(Element currentBlock, boolean inlined, Picture picture){
            Element imgNode = currentBlock.getOwnerDocument().createElement("img");
            // Use the basic encoder: getMimeEncoder() inserts a CRLF every 76 characters.
            String base64 = Base64.getEncoder().encodeToString(picture.getRawContent());
            imgNode.setAttribute("src", "data:" + picture.getMimeType() + ";base64," + base64);
            currentBlock.appendChild(imgNode);
        }
    }
}

The result looks like this:

[Image: output after conversion]
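
The image handler embeds each picture as a `data:` URI in the `src` attribute. The sketch below shows that construction on its own, and also compares the two JDK encoders: the MIME encoder wraps its output with CRLF line breaks, whereas the basic encoder returns a single line. `DataUriDemo` and `toDataUri` are hypothetical names for illustration only.

```java
import java.util.Base64;

/** Builds data: URIs and compares the JDK's basic and MIME Base64 encoders. */
final class DataUriDemo {

    /** Returns a single-line data: URI for the given MIME type and raw bytes. */
    static String toDataUri(String mimeType, byte[] content) {
        return "data:" + mimeType + ";base64," + Base64.getEncoder().encodeToString(content);
    }

    public static void main(String[] args) {
        byte[] bytes = new byte[100]; // long enough for the MIME encoder to wrap
        String mime = Base64.getMimeEncoder().encodeToString(bytes);
        String basic = Base64.getEncoder().encodeToString(bytes);
        System.out.println("MIME encoder has line breaks:  " + mime.contains("\r\n"));
        System.out.println("basic encoder has line breaks: " + basic.contains("\r\n"));
        System.out.println(toDataUri("image/png", new byte[] {1, 2, 3}));
    }
}
```

Keeping the attribute value on one line avoids having line-break characters escaped into the serialized HTML.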

docx to HTML

import fr.opensagres.poi.xwpf.converter.xhtml.Base64EmbedImgManager;
import fr.opensagres.poi.xwpf.converter.xhtml.XHTMLConverter;
import fr.opensagres.poi.xwpf.converter.xhtml.XHTMLOptions;
import org.apache.commons.io.FileUtils;

import org.apache.poi.xwpf.usermodel.XWPFDocument;
import java.io.*;

/**
 * Created by liushiyu
 * Handles docx-to-HTML conversion
 */
public class DocxToHtml {

    // Convert a docx file to an HTML string
    public String docxToHtml(String fileName) throws IOException {
        // try-with-resources closes the document once conversion is done
        try (XWPFDocument docxDocument = new XWPFDocument(new FileInputStream(fileName))) {
            XHTMLOptions options = XHTMLOptions.create();
            // Embed images as Base64
            options.setImageManager(new Base64EmbedImgManager());
            // Convert to HTML
            ByteArrayOutputStream htmlStream = new ByteArrayOutputStream();
            XHTMLConverter.getInstance().convert(docxDocument, htmlStream, options);
            return htmlStream.toString();
        }
    }


    public static void main(String[] args) throws Exception {
        DocxToHtml test = new DocxToHtml();
        // Backslashes must be escaped inside Java string literals.
        FileUtils.writeStringToFile(new File("D:\\", "a2.html"), test.docxToHtml("D:\\567.docx"), "utf-8");
    }
}

The result looks like this:

[Image: HTML conversion result]

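Copyright notice: this article comes from CSDN, with thanks to the original author. It is licensed under CC 4.0 BY-SA; when reposting, please include a link to the original article and this notice.
Original link: https://blog.csdn.net/shiyu18045181748/article/details/99841319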

Posted on 2020-02-25 01:21:21