A working method for extracting text from an HTML document with htmlparser

Extract strings from a URL.

Text within <SCRIPT></SCRIPT> tags is removed.

Text within <STYLE></STYLE> tags is removed.

The text within <PRE></PRE> tags is not altered.
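To make the three rules above concrete, here is a rough, standard-library-only approximation. It is not how StringBean is implemented: the class name HtmlTextSketch, the regex-based approach, and the whitespace collapsing outside PRE are all assumptions made for illustration, and entities, comments and malformed markup are not handled.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, NOT part of htmlparser: a regex approximation of the rules.
final class HtmlTextSketch {
    // (?is) = case-insensitive, and '.' also matches line breaks.
    private static final Pattern SCRIPT_OR_STYLE =
            Pattern.compile("(?is)<(script|style)\\b[^>]*>.*?</\\1\\s*>");
    private static final Pattern PRE =
            Pattern.compile("(?is)<pre\\b[^>]*>(.*?)</pre\\s*>");
    private static final Pattern TAG = Pattern.compile("<[^>]*>");

    static String extract(String html) {
        // Rules 1 and 2: drop SCRIPT and STYLE elements together with their text.
        String body = SCRIPT_OR_STYLE.matcher(html).replaceAll(" ");

        StringBuilder out = new StringBuilder();
        Matcher pre = PRE.matcher(body);
        int last = 0;
        while (pre.find()) {
            // Text outside PRE: strip tags and collapse whitespace (assumed behaviour).
            out.append(collapse(stripTags(body.substring(last, pre.start()))));
            // Rule 3: text inside PRE keeps its whitespace exactly as written.
            out.append(stripTags(pre.group(1)));
            last = pre.end();
        }
        out.append(collapse(stripTags(body.substring(last))));
        return out.toString().trim();
    }

    private static String stripTags(String s) {
        return TAG.matcher(s).replaceAll(" ");
    }

    private static String collapse(String s) {
        return s.replaceAll("\\s+", " ");
    }

    public static void main(String[] args) {
        String html = "<html><head><style>p { color: red; }</style>"
                + "<script>var x = 1;</script></head>"
                + "<body><p>Hello   world</p><pre>a  b</pre></body></html>";
        System.out.println(extract(html)); // prints: Hello world a  b
    }
}
```

The style and script text disappears, the run of spaces inside the paragraph is collapsed, and the two spaces inside the PRE element survive. For real documents, prefer StringBean itself over a regex approach like this one.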

The Strings property, which is the output property, is null until a URL is set. So a typical usage is:

      StringBean sb = new StringBean ();
      sb.setLinks (false);
      sb.setReplaceNonBreakingSpaces (true);
      sb.setCollapse (true);
      sb.setURL ("http://www.netbeans.org"); // the HTTP is performed here
      String s = sb.getStrings ();
  

You can also use the StringBean as a NodeVisitor on your own parser, in which case you have to refetch your page if you change one of the properties, because changing a property resets the Strings property:

 

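      // Requires: import org.htmlparser.Parser;
      //           import org.htmlparser.beans.StringBean;
      // The Parser constructor and visitAllNodesWith can throw ParserException.
      StringBean sb = new StringBean ();
      Parser parser = new Parser ("http://cbc.ca");
      // Alternatively, parse markup you already have in memory:
      // Parser parser = Parser.createParser ("<html>...</html>", "GBK");
      parser.visitAllNodesWith (sb);
      String s = sb.getStrings ();   // text without link URLs
      sb.setLinks (true);            // changing a property resets Strings
      parser.reset ();               // so rewind the parser and visit again
      parser.visitAllNodesWith (sb);
      String sl = sb.getStrings ();  // text with link URLs included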
  

According to Nick Burch, who contributed the patch, this is handy if you don't want StringBean to wander off and get the content itself, either because you already have it or because it's not on a website.
