Extracting Specific Tags from a Web Page with NekoHTML and XPath

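If you need to extract data from an HTML page, NekoHTML is a good tool. Unlike XML, HTML may contain malformed elements, such as a table with no end tag. In those cases NekoHTML acts as a diligent cleaner and repairman: it tidies up the defective markup and produces a DOM tree. Once you have the DOM tree, getting the data you need with XPath is easy :-)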

Here are a few points to watch out for:

1. How do you use NekoHTML?

You must add nekohtml.jar, xercesImpl.jar and xalan.jar to the Java Build Path. The NekoHTML download directory does not contain xercesImpl.jar or xalan.jar, so you need to download them yourself.
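A quick way to confirm that all three jars are really visible is to try loading one well-known class from each. The following is only a sketch: `ClasspathCheck` and `isOnClasspath` are hypothetical names made up for illustration, and the three class names are the usual entry points of NekoHTML, Xerces and Xalan.

```java
/** Hypothetical helper: checks whether the jars needed alongside NekoHTML are visible. */
class ClasspathCheck {

    /** Returns true if the named class can be located by this class's class loader. */
    static boolean isOnClasspath(String className) {
        try {
            // initialize=false: only locate the class, do not run its static initializers
            Class.forName(className, false, ClasspathCheck.class.getClassLoader());
            return true;
        } catch (ClassNotFoundException | LinkageError e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // jar name (for the message only) -> a class that jar is expected to provide
        String[][] required = {
            {"nekohtml.jar",   "org.cyberneko.html.parsers.DOMParser"},
            {"xercesImpl.jar", "org.apache.xerces.parsers.DOMParser"},
            {"xalan.jar",      "org.apache.xpath.XPathAPI"},
        };
        for (String[] entry : required) {
            String status = isOnClasspath(entry[1]) ? "found" : "MISSING";
            System.out.println(entry[0] + ": " + status);
        }
    }
}
```

If any line reports MISSING, the corresponding jar is most likely absent from the Build Path.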

2. How do you get the XPath?

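You could of course download the page source and work out the XPath by "human intelligence", but the process is dizzying and exhausting. Use Firebug instead: it can generate the XPath automatically.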

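Update:

If the XPath generated by Firebug contains a TBODY step, you need to remove it; otherwise the page content cannot be retrieved correctly. For example, if Firebug generates /html/body/table/tbody/tr, change it to /html/body/table/tr.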
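The TBODY clean-up can be automated with a small string helper. This is a minimal sketch, not part of NekoHTML or Firebug: the class `FirebugXPath` and the method `stripTbody` are hypothetical names. It assumes the likely cause is that the browser adds TBODY to its own DOM while the page source has none; if the source really does contain a TBODY element, the step may need to stay.

```java
import java.util.regex.Pattern;

/** Hypothetical helper for cleaning up XPath expressions copied from Firebug. */
class FirebugXPath {

    // Matches a "/tbody" step, optionally with a positional predicate such as [1],
    // but only when it is a whole step (followed by "/" or the end of the string).
    private static final Pattern TBODY_STEP =
            Pattern.compile("/tbody(\\[\\d+\\])?(?=/|$)", Pattern.CASE_INSENSITIVE);

    /** Removes every TBODY step from the given location path. */
    static String stripTbody(String xpath) {
        return TBODY_STEP.matcher(xpath).replaceAll("");
    }

    public static void main(String[] args) {
        // The example path from the text above
        System.out.println(stripTbody("/html/body/table/tbody/tr"));
    }
}
```

The lookahead keeps the pattern from touching a step that merely starts with "tbody", so only complete TBODY steps are dropped.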

3. How do you combine NekoHTML and XPath correctly?

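Tag names in the XPath must be uppercase, because NekoHTML reports element names in uppercase by default and XPath matching is case-sensitive. For example:

String divXpath = "//DIV"; // correct

String divXpath = "//div"; // wrong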
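Since Firebug emits lowercase tag names, the conversion can be scripted rather than done by hand. The sketch below is an assumption-laden illustration: `XPathCase` and `upperCaseTags` are hypothetical names, and the regular expression only handles simple location paths. It leaves attribute names, functions such as text() and axis names alone, but it would also rewrite anything that looks like a step inside a quoted string containing slashes.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper: upper-cases element names in a simple XPath location path. */
class XPathCase {

    // A name directly after "/", as long as it is not followed by more name
    // characters, by "(" (function or node test) or by ":" (axis or prefix).
    private static final Pattern ELEMENT_STEP =
            Pattern.compile("(?<=/)[A-Za-z][A-Za-z0-9]*(?![A-Za-z0-9(:-])");

    static String upperCaseTags(String xpath) {
        Matcher m = ELEMENT_STEP.matcher(xpath);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            // Element names contain no "$" or "\", so they are safe as replacement text
            m.appendReplacement(out, m.group().toUpperCase(Locale.ROOT));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(upperCaseTags("/html/body/div[2]/ul[@class]/li[9]"));
        System.out.println(upperCaseTags("//div/text()"));
    }
}
```

Attribute names are deliberately left untouched, which matches the lowercase @class used in the example query later in this post.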

Here is an example that extracts the ISBN of a book on Dangdang:

Java code

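import java.io.InputStream;
import java.net.URL;

import javax.xml.transform.TransformerException;

import org.apache.xpath.XPathAPI;
import org.cyberneko.html.parsers.DOMParser;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// The class name is only a wrapper added so that the snippet compiles on its own
public class DangdangIsbn {
    public static void main(String[] args) {
        DOMParser parser = new DOMParser();
        try {
            // Set the default encoding of the page
            parser.setProperty("http://cyberneko.org/html/properties/default-encoding", "gb2312");
            /* The Xerces HTML DOM implementation does not support namespaces
               and cannot represent XHTML documents with namespace information.
               Therefore, in order to use the default HTML DOM implementation with NekoHTML's
               DOMParser to parse XHTML documents, you must turn off namespace processing. */
            parser.setFeature("http://xml.org/sax/features/namespaces", false);
            String strURL = "http://product.dangdang.com/product.aspx?product_id=9317290";
            // Hand the parser the raw byte stream. Wrapping it in an InputStreamReader
            // would decode the bytes with the platform charset before NekoHTML sees
            // them, so the default-encoding setting above would have no effect.
            InputStream in = new URL(strURL).openStream();
            try {
                parser.parse(new InputSource(in));
            } finally {
                in.close();
            }
        } catch (Exception e) {
            e.printStackTrace();
            return; // nothing to query if the page could not be fetched or parsed
        }
        Document doc = parser.getDocument();
        // tags should be in upper case
        String productsXpath = "/HTML/BODY/DIV[2]/DIV[4]/DIV[2]/DIV/DIV[3]/UL[@class]/LI[9]";
        try {
            NodeList products = XPathAPI.selectNodeList(doc, productsXpath);
            System.out.println("found: " + products.getLength());
            for (int i = 0; i < products.getLength(); i++) {
                Node node = products.item(i);
                System.out.println(i + ":\n" + node.getTextContent());
            }
        } catch (TransformerException e) {
            e.printStackTrace();
        }
    }
}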
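The query step can also be tried out without a network connection or the Xalan jar. The sketch below swaps in the JDK's built-in javax.xml.xpath API in place of XPathAPI and runs it against a tiny hand-built DOM that merely imitates NekoHTML's uppercase element names. `JdkXPathSketch`, `sampleDocument` and `select` are hypothetical names, and the list items are placeholders rather than real Dangdang data.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;

import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

/** Hypothetical sketch: XPath queries over a DOM whose element names are uppercase. */
class JdkXPathSketch {

    /** Builds a small stand-in for a parsed page: HTML/BODY/UL with two LI children. */
    static Document sampleDocument() {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            Element html = doc.createElement("HTML");
            Element body = doc.createElement("BODY");
            Element ul = doc.createElement("UL");
            ul.setAttribute("class", "info");
            doc.appendChild(html);
            html.appendChild(body);
            body.appendChild(ul);
            for (String text : new String[] {"Author: (placeholder)", "ISBN: (placeholder)"}) {
                Element li = doc.createElement("LI");
                li.setTextContent(text);
                ul.appendChild(li);
            }
            return doc;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    /** Evaluates the expression against the document and returns the matching nodes. */
    static NodeList select(Document doc, String expression) {
        try {
            return (NodeList) XPathFactory.newInstance().newXPath()
                    .evaluate(expression, doc, XPathConstants.NODESET);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Document doc = sampleDocument();
        NodeList hits = select(doc, "/HTML/BODY/UL[@class]/LI[2]");
        for (int i = 0; i < hits.getLength(); i++) {
            System.out.println(i + ": " + hits.item(i).getTextContent());
        }
        // The same tree queried with lowercase names matches nothing
        System.out.println("lowercase matches: " + select(doc, "//li").getLength());
    }
}
```

With a real page, the Document returned by parser.getDocument() would take the place of sampleDocument().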




Posted by IT瘾 on September 11, 2009 at 1:13:38 PM
