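I've recently been working on a project that needs Word-to-HTML conversion: a user uploads a Word document, and it is converted to HTML for online preview and editing. Because I store the previewed and edited document in S3, I chose to convert the images in the Word file directly to base64 and embed them in the HTML that gets uploaded to S3. The advantage is that no separate store (for example mongo) is needed for the images. The drawback is that the resulting HTML is somewhat larger than the original Word file, since base64 encoding inflates binary data by roughly one third. Which approach is better depends on your own situation.
There are plenty of articles online about converting Word to HTML, so I won't show the approach that saves the images from the Word file to a folder. I'm only posting the code that converts images to base64, for anyone who needs it. If you spot a problem, please point it out so we can all improve together.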
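To make the size trade-off concrete, here is a minimal sketch of how an image becomes a base64 data URI and how much larger the payload gets. The class `DataUriSketch` and the method `toDataUri` are hypothetical names used only for illustration, and the 3000-byte array stands in for real image bytes.

```java
import java.util.Base64;

public class DataUriSketch {

    // Build a data URI of the form data:<mime>;base64,<payload>
    public static String toDataUri(String mimeType, byte[] content) {
        return "data:" + mimeType + ";base64," + Base64.getEncoder().encodeToString(content);
    }

    public static void main(String[] args) {
        // Placeholder bytes standing in for a real image (assumption for the demo)
        byte[] fakeImage = new byte[3000];
        String uri = toDataUri("image/png", fakeImage);

        // Every 3 input bytes become 4 base64 characters
        String payload = uri.substring(uri.indexOf(',') + 1);
        System.out.println(fakeImage.length + " bytes -> " + payload.length() + " base64 chars");
    }
}
```

Running it prints `3000 bytes -> 4000 base64 chars`, which is the roughly one-third growth to expect for each embedded image.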
First, the required jars:
dependencies {
compile group: 'fr.opensagres.xdocreport', name: 'xdocreport', version: '2.0.2'
// https://mvnrepository.com/artifact/org.apache.poi/poi
compile group: 'org.apache.poi', name: 'poi', version: '4.1.0'
// https://mvnrepository.com/artifact/org.apache.poi/poi-scratchpad
compile group: 'org.apache.poi', name: 'poi-scratchpad', version: '4.1.0'
// https://mvnrepository.com/artifact/org.apache.poi/poi-ooxml
compile group: 'org.apache.poi', name: 'poi-ooxml', version: '4.1.0'
// https://mvnrepository.com/artifact/org.apache.poi/poi-ooxml-schemas
compile group: 'org.apache.poi', name: 'poi-ooxml-schemas', version: '4.1.0'
// https://mvnrepository.com/artifact/org.apache.poi/ooxml-schemas
compile group: 'org.apache.poi', name: 'ooxml-schemas', version: '1.4'
}
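I manage this project with Gradle; the same coordinates work with Maven. The code below also uses FileUtils from Apache Commons IO, so commons-io must be on the classpath as well.
Because .doc and .docx are converted to HTML in different ways, I'm posting the code for each separately.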
import org.apache.commons.io.FileUtils;
import org.apache.poi.hwpf.HWPFDocumentCore;
import org.apache.poi.hwpf.converter.WordToHtmlConverter;
import org.apache.poi.hwpf.converter.WordToHtmlUtils;
import org.apache.poi.hwpf.usermodel.Picture;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.*;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import java.io.*;
import java.util.Base64;
public class DocToHtml {

    public static void main(String[] args) throws ParserConfigurationException, TransformerException, IOException {
        DocToHtml docToHtml = new DocToHtml();
        docToHtml.docToHtml();
    }

    public void docToHtml() throws IOException, ParserConfigurationException, TransformerException {
        // Load the .doc file (backslashes in Windows paths must be escaped)
        HWPFDocumentCore wordDocument;
        try (InputStream in = new FileInputStream("D:\\345.doc")) {
            wordDocument = WordToHtmlUtils.loadDoc(in);
        }
        // Use the custom converter so that images are embedded as base64
        WordToHtmlConverter wordToHtmlConverter = new ImageConverter(
                DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument()
        );
        wordToHtmlConverter.processDocument(wordDocument);
        Document htmlDocument = wordToHtmlConverter.getDocument();

        // Serialize the DOM to UTF-8 encoded HTML
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        DOMSource domSource = new DOMSource(htmlDocument);
        StreamResult streamResult = new StreamResult(out);
        TransformerFactory transformerFactory = TransformerFactory.newInstance();
        Transformer serializer = transformerFactory.newTransformer();
        serializer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
        serializer.setOutputProperty(OutputKeys.INDENT, "yes");
        serializer.setOutputProperty(OutputKeys.METHOD, "html");
        serializer.transform(domSource, streamResult);

        // Decode with the charset the serializer wrote, not the platform default
        String result = out.toString("UTF-8");
        FileUtils.writeStringToFile(new File("D:\\", "a.html"), result, "utf-8");
    }

    // Static nested class: it does not need the enclosing DocToHtml instance
    public static class ImageConverter extends WordToHtmlConverter {

        public ImageConverter(Document document) {
            super(document);
        }

        @Override
        protected void processImageWithoutPicturesManager(Element currentBlock, boolean inlined, Picture picture) {
            Element imgNode = currentBlock.getOwnerDocument().createElement("img");
            // Basic encoder: produces no line breaks inside the data URI
            String base64 = Base64.getEncoder().encodeToString(picture.getRawContent());
            imgNode.setAttribute("src", "data:" + picture.getMimeType() + ";base64," + base64);
            currentBlock.appendChild(imgNode);
        }
    }
}
The result looks like this:
[Figure: output after conversion]
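When building the `src` value for an `img` element as in the converter above, the choice of encoder from `java.util.Base64` matters. The sketch below compares the basic encoder with the MIME encoder; the class name `Base64EncoderComparison` and the 100-byte sample are assumptions made only for the demonstration.

```java
import java.util.Base64;

public class Base64EncoderComparison {

    // Basic encoder: one continuous line of base64 characters
    public static String basic(byte[] data) {
        return Base64.getEncoder().encodeToString(data);
    }

    // MIME encoder: inserts "\r\n" after every 76 output characters
    public static String mime(byte[] data) {
        return Base64.getMimeEncoder().encodeToString(data);
    }

    public static void main(String[] args) {
        byte[] data = new byte[100]; // placeholder bytes, encodes to 136 characters

        System.out.println("basic has line breaks: " + basic(data).contains("\r\n"));
        System.out.println("mime has line breaks:  " + mime(data).contains("\r\n"));
        // Apart from the line breaks, both encoders produce the same characters
        System.out.println("same payload: " + mime(data).replace("\r\n", "").equals(basic(data)));
    }
}
```

Many browsers tolerate the extra line breaks, but a data URI is meant to be a single unbroken value, so the basic encoder is the safer choice for embedded images.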
import fr.opensagres.poi.xwpf.converter.xhtml.Base64EmbedImgManager;
import fr.opensagres.poi.xwpf.converter.xhtml.XHTMLConverter;
import fr.opensagres.poi.xwpf.converter.xhtml.XHTMLOptions;
import org.apache.commons.io.FileUtils;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import java.io.*;
/**
 * Created by liushiyu
 * Converts .docx files to HTML.
 */
public class DocxToHtml {

    // Convert a .docx file to an HTML string
    public String docxToHtml(String fileName) throws IOException {
        // try-with-resources closes both the stream and the document
        try (InputStream in = new FileInputStream(fileName);
             XWPFDocument docxDocument = new XWPFDocument(in)) {
            XHTMLOptions options = XHTMLOptions.create();
            // Embed images as base64
            options.setImageManager(new Base64EmbedImgManager());
            // Convert to HTML
            ByteArrayOutputStream htmlStream = new ByteArrayOutputStream();
            XHTMLConverter.getInstance().convert(docxDocument, htmlStream, options);
            return htmlStream.toString();
        }
    }

    public static void main(String[] args) throws Exception {
        DocxToHtml test = new DocxToHtml();
        // Backslashes in Windows paths must be escaped
        FileUtils.writeStringToFile(new File("D:\\", "a2.html"), test.docxToHtml("D:\\567.docx"), "utf-8");
    }
}
The result looks like this:
[Figure: HTML conversion result]
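Since .doc and .docx go through different converters, an upload handler has to decide which one to call. Here is a minimal sketch of that decision, assuming the file extension can be trusted; `WordFormatSketch`, `WordFormat`, and `detect` are hypothetical names and are not part of POI or xdocreport.

```java
import java.util.Locale;

public class WordFormatSketch {

    public enum WordFormat { DOC, DOCX }

    // Pick the format from the file extension (case-insensitive)
    public static WordFormat detect(String fileName) {
        String lower = fileName.toLowerCase(Locale.ROOT);
        if (lower.endsWith(".docx")) {
            return WordFormat.DOCX;
        }
        if (lower.endsWith(".doc")) {
            return WordFormat.DOC;
        }
        throw new IllegalArgumentException("Not a Word file: " + fileName);
    }

    public static void main(String[] args) {
        System.out.println(detect("D:\\345.doc"));
        System.out.println(detect("D:\\567.docx"));
    }
}
```

A caller could then route `DOC` to the HWPF-based converter and `DOCX` to the XWPF-based one. Checking the file's leading bytes would be more robust than the extension, but that is outside the scope of this sketch.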