A working method for extracting text from an HTML document with htmlparser
Extract strings from a URL.
Text within <SCRIPT></SCRIPT> tags is removed.
Text within <STYLE></STYLE> tags is removed.
The text within <PRE></PRE> tags is not altered.
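To make the three rules above concrete, here is a minimal sketch that approximates them with regular expressions from the Java standard library. It is not how StringBean is implemented (StringBean walks the parsed nodes). The class name TextExtractSketch and the method extractText are hypothetical, and collapsing whitespace outside PRE is an assumption modelled on the setCollapse option used further down. Entities such as &nbsp; are not decoded.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, NOT part of htmlparser: a regex approximation of the
// rules "drop SCRIPT and STYLE text, leave PRE text unaltered".
class TextExtractSketch {

    // Whole <script>...</script> or <style>...</style> elements, any letter case.
    private static final Pattern SCRIPT_OR_STYLE = Pattern.compile(
            "<script\\b[^>]*>.*?</script\\s*>|<style\\b[^>]*>.*?</style\\s*>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // A <pre>...</pre> element; group 1 captures its content.
    private static final Pattern PRE = Pattern.compile(
            "<pre\\b[^>]*>(.*?)</pre\\s*>",
            Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    // Any remaining tag.
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    static String extractText(String html) {
        // Rules 1 and 2: remove script and style elements together with their text.
        String body = SCRIPT_OR_STYLE.matcher(html).replaceAll("");

        StringBuilder out = new StringBuilder();
        Matcher pre = PRE.matcher(body);
        int last = 0;
        while (pre.find()) {
            // Text before this PRE block: strip tags and collapse whitespace.
            out.append(collapse(body.substring(last, pre.start())));
            // Rule 3: inside PRE only the tags are removed, whitespace is kept.
            out.append(TAG.matcher(pre.group(1)).replaceAll(""));
            last = pre.end();
        }
        out.append(collapse(body.substring(last)));
        return out.toString();
    }

    // Assumption: tags become a space, then each whitespace run becomes one space.
    private static String collapse(String fragment) {
        String noTags = TAG.matcher(fragment).replaceAll(" ");
        return noTags.replaceAll("\\s+", " ");
    }

    public static void main(String[] args) {
        String html = "<html><head><style>p { color: red; }</style></head>"
                + "<body><p>Hello   world</p><pre>a  b\n  c</pre></body></html>";
        System.out.println(extractText(html));
    }
}
```

For real pages a parser is the better choice, because regular expressions break on malformed markup; the sketch only shows what each rule does to the text.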
The Strings property, which is the output property, is null until a URL is set. So a typical usage is:
StringBean sb = new StringBean ();
sb.setLinks (false);
sb.setReplaceNonBreakingSpaces (true);
sb.setCollapse (true);
sb.setURL ("http://www.netbeans.org"); // the HTTP is performed here
String s = sb.getStrings ();
You can also use the StringBean as a NodeVisitor on your own parser, in which case you have to refetch your page if you change one of the properties because it resets the Strings property:
StringBean sb = new StringBean ();
Parser parser = new Parser ("http://cbc.ca");
// or: Parser parser = Parser.createParser ("<html>...</html>", "GBK");
parser.visitAllNodesWith (sb);
String s = sb.getStrings ();
sb.setLinks (true);
parser.reset ();
parser.visitAllNodesWith (sb);
String sl = sb.getStrings ();
According to Nick Burch, who contributed the patch, this is handy if you don't want StringBean to wander off and get the content itself, either because you already have it, it's not on a website, etc.