Apache Tika
A tool for detecting and extracting document metadata, text content, file encodings, and character sets.
The Apache Tika™ toolkit detects and extracts metadata and structured text content from various documents using existing parser libraries. You can find the latest release on the download page. See the Getting Started guide for instructions on how to start using Tika.
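As a rough illustration of what "detects and extracts" looks like in practice, the sketch below uses the `org.apache.tika.Tika` facade class to guess a file's media type, pull out its plain text, and list its metadata. It assumes the `tika-core` and standard parser artifacts are on the classpath; the class name `TikaQuickLook` and the command-line argument are made up for this example. Treat it as a starting point, not the project's recommended setup.

```java
import org.apache.tika.Tika;
import org.apache.tika.exception.TikaException;
import org.apache.tika.metadata.Metadata;

import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;

// Hypothetical example class; not part of Tika itself.
public class TikaQuickLook {
    public static void main(String[] args) throws IOException, TikaException {
        if (args.length != 1) {
            System.err.println("usage: TikaQuickLook <file>");
            return;
        }
        File file = new File(args[0]);
        Tika tika = new Tika();

        // Detection: guess the media type from the file's name and leading bytes.
        String mediaType = tika.detect(file);
        System.out.println("Media type: " + mediaType);

        // Extraction: parse the document, collecting text and metadata.
        // Text extraction needs the parser modules; with tika-core alone
        // the returned text is expected to be empty.
        Metadata metadata = new Metadata();
        String text;
        try (InputStream in = Files.newInputStream(file.toPath())) {
            text = tika.parseToString(in, metadata);
        }

        System.out.println("Metadata:");
        for (String name : metadata.names()) {
            System.out.println("  " + name + " = " + metadata.get(name));
        }

        System.out.println("Text:");
        System.out.println(text);
    }
}
```

The facade hides parser selection: the same two calls are meant to work whether the input is a PDF, an office document, or plain text, which is the point of delegating to existing parser libraries.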