A working method for extracting the text of an HTML document with htmlparser
Extract strings from a URL.
Text within <SCRIPT></SCRIPT> tags is removed.
Text within <STYLE></STYLE> tags is removed.
The text within <PRE></PRE> tags is not altered.
The Strings property, which is the output property, is null until a URL is set. So a typical usage is:
StringBean sb = new StringBean ();
sb.setLinks (false);
sb.setReplaceNonBreakingSpaces (true);
sb.setCollapse (true);
sb.setURL ("http://www.netbeans.org"); // the HTTP request is performed here
String s = sb.getStrings ();
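To make the effect of these options more concrete, here is a minimal JDK-only sketch of the kind of cleanup described above: dropping SCRIPT and STYLE content, replacing non-breaking spaces, and collapsing whitespace. It is not how StringBean is implemented; it is a rough regex-based illustration, and the class TextExtractionSketch and its method extractText are hypothetical names introduced here. It ignores the PRE rule and the links option, and a regex is no substitute for a real HTML parser.

```java
import java.util.regex.Pattern;

// Hypothetical illustration only; not part of htmlparser.
class TextExtractionSketch {
    // Matches a whole <script>...</script> or <style>...</style> element, case-insensitively.
    private static final Pattern SCRIPT_OR_STYLE =
        Pattern.compile("(?is)<(script|style)\\b[^>]*>.*?</\\1\\s*>");
    // Matches any remaining tag.
    private static final Pattern TAG = Pattern.compile("(?s)<[^>]*>");

    public static String extractText(String html, boolean replaceNbsp, boolean collapse) {
        // Drop script and style elements together with their content.
        String text = SCRIPT_OR_STYLE.matcher(html).replaceAll(" ");
        // Replace every remaining tag with a space so adjacent words stay separated.
        text = TAG.matcher(text).replaceAll(" ");
        // Decode the one entity this sketch cares about.
        text = text.replace("&nbsp;", "\u00a0");
        if (replaceNbsp) {
            // Assumed to mirror the intent of setReplaceNonBreakingSpaces (true).
            text = text.replace('\u00a0', ' ');
        }
        if (collapse) {
            // Assumed to mirror the intent of setCollapse (true): runs of whitespace become one space.
            text = text.replaceAll("\\s+", " ").trim();
        }
        return text;
    }

    public static void main(String[] args) {
        String html = "<html><head><style>p { color: red; }</style></head>"
            + "<body><script>var x = 1;</script><p>Hello,&nbsp;  world</p>\n<p>Bye</p></body></html>";
        System.out.println(extractText(html, true, true)); // prints "Hello, world Bye"
    }
}
```

For real documents the StringBean calls shown above remain the better choice, since they rely on an actual parse of the page rather than on pattern matching.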
You can also use the StringBean as a NodeVisitor on your own parser, in which case you have to refetch your page if you change one of the properties, because doing so resets the Strings property:
StringBean sb = new StringBean ();
Parser parser = new Parser ("http://cbc.ca");
// or: Parser parser = Parser.createParser ("<html>...</html>", "GBK");
parser.visitAllNodesWith (sb);
String s = sb.getStrings ();
sb.setLinks (true);
parser.reset ();
parser.visitAllNodesWith (sb);
String sl = sb.getStrings ();
According to Nick Burch, who contributed the patch, this is handy if you don't want StringBean to wander off and get the content itself, either because you already have it, because it's not on a website, etc.
