Handle UTF8 file with BOM - Real's Java How-to

From Wikipedia, the byte order mark (BOM) is a Unicode character used to signal the endianness (byte order) of a text file or stream. Its code point is U+FEFF. BOM use is optional, and, if used, should appear at the start of the text stream. Beyond its specific use as a byte-order indicator, the BOM character may also indicate which of the several Unicode representations the text is encoded in.

The common BOMs are :

Encoding	Representation (hexadecimal)	Representation (decimal)
UTF-8	EF BB BF	239 187 191
UTF-16 (BE)	FE FF	254 255
UTF-16 (LE)	FF FE	255 254
UTF-32 (BE)	00 00 FE FF	0 0 254 255
UTF-32 (LE)	FF FE 00 00	255 254 0 0
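As an illustration of the table above, here is a minimal sketch that maps the leading bytes of a buffer to the encoding its BOM implies. The class name `BomDetector` and the method `detect` are hypothetical names chosen for this example, not part of any library. Note one assumption: the bytes FF FE 00 00 are treated as a UTF-32 (LE) BOM, although they could also be a UTF-16 (LE) BOM followed by a NUL character.

```java
public class BomDetector {

    // Returns the encoding implied by a leading BOM, or null if no BOM is found.
    // UTF-32 (LE) is tested before UTF-16 (LE) because both start with FF FE.
    public static String detect(byte[] head) {
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) return "UTF-8";
        if (startsWith(head, 0x00, 0x00, 0xFE, 0xFF)) return "UTF-32BE";
        if (startsWith(head, 0xFF, 0xFE, 0x00, 0x00)) return "UTF-32LE";
        if (startsWith(head, 0xFE, 0xFF)) return "UTF-16BE";
        if (startsWith(head, 0xFF, 0xFE)) return "UTF-16LE";
        return null;
    }

    // Compares the first bytes of data (as unsigned values) with the given prefix.
    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) return false;
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        byte[] utf8 = { (byte) 0xEF, (byte) 0xBB, (byte) 0xBF, 'h', 'i' };
        System.out.println(detect(utf8)); // prints "UTF-8"
    }
}
```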

UTF-8 files are a special case: adding a BOM to them is not recommended because it can break other tools, Java included. Java assumes that UTF-8 data has no BOM, so if one is present it is not discarded and is treated as data.

To create a UTF-8 file with a BOM, open Windows Notepad, create a simple text file and save it as utf8.txt with the UTF-8 encoding (on recent Windows versions, choose "UTF-8 with BOM").

Now if you examine the file content as binary, you see the BOM at the beginning.
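If no hex viewer is at hand, the first bytes can also be printed from Java. The following is a minimal sketch; the class `BomHexDump` and the method `firstBytesHex` are hypothetical names, and the temporary file written in `main` is only a stand-in for the utf8.txt created above.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

public class BomHexDump {

    // Formats up to the first n bytes of a file as upper-case hex pairs.
    public static String firstBytesHex(Path file, int n) throws IOException {
        byte[] all = Files.readAllBytes(file);
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < Math.min(n, all.length); i++) {
            if (i > 0) sb.append(' ');
            sb.append(String.format("%02X", all[i] & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the Notepad file: U+FEFF followed by "helloworld", encoded as UTF-8.
        Path tmp = Files.createTempFile("utf8", ".txt");
        Files.write(tmp, "\uFEFFhelloworld".getBytes(StandardCharsets.UTF_8));
        System.out.println(firstBytesHex(tmp, 3)); // prints "EF BB BF"
        Files.delete(tmp);
    }
}
```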

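If we read it with Java:

import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

public class ReadUtf8WithBom {

  public static void main(String[] args) {
    // try-with-resources closes the reader even if reading fails
    try (BufferedReader r = new BufferedReader(new InputStreamReader(
            new FileInputStream("c:/temp/utf8.txt"), StandardCharsets.UTF_8))) {
        String s;
        while ((s = r.readLine()) != null) {
            // the first line still starts with U+FEFF: the decoder does not strip the BOM
            System.out.println(s);
        }
    } catch (IOException e) {
        e.printStackTrace();
        System.exit(1);
    }
  }
}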

The output contains a strange character at the beginning because the BOM is not discarded :

?helloworld

This behaviour is documented in the Java bug database. There will be no fix for now because it would break existing tools such as javadoc or XML parsers.

Apache Commons IO provides some tools to handle this situation. Its BOMInputStream class detects the BOM and, if required, can automatically skip it and return the subsequent byte as the first byte in the stream.

Or you can do it manually. The next example converts a UTF-8 file to ANSI (Cp1252). We check the first line for the BOM and, if it is present, simply discard it.
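The idea of skipping the BOM at the stream level can be sketched with the standard library alone. This is not the Commons IO implementation, only an illustration of the same technique for the UTF-8 BOM, built on java.io.PushbackInputStream. The class `SkipUtf8Bom` and the method `skipBom` are hypothetical names.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.PushbackInputStream;
import java.nio.charset.StandardCharsets;

public class SkipUtf8Bom {

    // Returns a stream positioned after a leading EF BB BF, if there is one.
    // Any other leading bytes are pushed back so the caller sees them unchanged.
    public static InputStream skipBom(InputStream in) throws IOException {
        PushbackInputStream pin = new PushbackInputStream(in, 3);
        byte[] head = new byte[3];
        int n = 0;
        while (n < 3) {
            int r = pin.read(head, n, 3 - n);
            if (r < 0) break; // end of stream before three bytes were read
            n += r;
        }
        boolean hasBom = n == 3
                && (head[0] & 0xFF) == 0xEF
                && (head[1] & 0xFF) == 0xBB
                && (head[2] & 0xFF) == 0xBF;
        if (!hasBom && n > 0) {
            pin.unread(head, 0, n);
        }
        return pin;
    }

    public static void main(String[] args) throws IOException {
        // In-memory stand-in for a UTF-8 file that starts with a BOM.
        byte[] data = "\uFEFFhelloworld".getBytes(StandardCharsets.UTF_8);
        try (BufferedReader r = new BufferedReader(new InputStreamReader(
                skipBom(new ByteArrayInputStream(data)), StandardCharsets.UTF_8))) {
            System.out.println(r.readLine()); // prints "helloworld"
        }
    }
}
```

Stripping the BOM before decoding keeps the reading loop free of special cases for the first line.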

import java.io.*;
import java.nio.charset.StandardCharsets;

public class UTF8ToAnsiUtils {

    // U+FEFF is the Unicode character that the UTF-8 byte order mark (EF BB BF) decodes to.
    public static final String UTF8_BOM = "\uFEFF";

    public static void main(String[] args) {
        if (args.length != 2) {
            System.out.println("Usage : java UTF8ToAnsiUtils utf8file ansifile");
            System.exit(1);
        }

        // try-with-resources flushes and closes both streams, even on failure
        try (BufferedReader r = new BufferedReader(new InputStreamReader(
                     new FileInputStream(args[0]), StandardCharsets.UTF_8));
             Writer w = new BufferedWriter(new OutputStreamWriter(
                     new FileOutputStream(args[1]), "Cp1252"))) {
            boolean firstLine = true;
            String s;
            while ((s = r.readLine()) != null) {
                if (firstLine) {
                    // the BOM can only appear at the very start of the file
                    s = removeUTF8BOM(s);
                    firstLine = false;
                }
                w.write(s + System.lineSeparator());
            }
        } catch (IOException e) {
            e.printStackTrace();
            System.exit(1);
        }
    }

    // Drops the leading U+FEFF character if the string starts with it.
    private static String removeUTF8BOM(String s) {
        if (s.startsWith(UTF8_BOM)) {
            s = s.substring(1);
        }
        return s;
    }
}


Posted by IT瘾 on September 12, 2013, 2:36:00 PM
