A working way to extract the text of an HTML document with htmlparser

Extract strings from a URL.

Text within <SCRIPT></SCRIPT> tags is removed.

Text within <STYLE></STYLE> tags is removed.

The text within <PRE></PRE> tags is not altered.

The Strings property, which is the output property, is null until a URL is set. A typical usage is:

      StringBean sb = new StringBean ();
      sb.setLinks (false);
      sb.setReplaceNonBreakingSpaces (true);
      sb.setCollapse (true);
      sb.setURL ("http://www.netbeans.org"); // the HTTP is performed here
      String s = sb.getStrings ();

You can also use the StringBean as a NodeVisitor on your own parser, in which case you have to refetch your page if you change one of the properties, because doing so resets the Strings property:

      StringBean sb = new StringBean ();
      Parser parser = new Parser ("http://cbc.ca");
      // or: Parser parser = Parser.createParser ("<html>...</html>", "GBK");
      parser.visitAllNodesWith (sb);
      String s = sb.getStrings ();
      sb.setLinks (true);
      parser.reset ();
      parser.visitAllNodesWith (sb);
      String sl = sb.getStrings ();

According to Nick Burch, who contributed the patch, this is handy if you don't want StringBean to wander off and get the content itself, either because you already have it or because it's not on a website, etc.
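
The filtering rules listed at the top (SCRIPT and STYLE text removed, PRE text left unaltered, other whitespace collapsed) can be illustrated without the library. The sketch below is not how htmlparser implements StringBean; it is a rough regex-based stand-in that uses only the Java standard library, and the names `TextRulesSketch` and `extract` are hypothetical. It does not decode entities or handle malformed markup, and it assumes each PRE body should sit on its own lines.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration class; not part of htmlparser.
class TextRulesSketch {
    // One alternative per rule, tried in order at each '<':
    //   1. a whole <script>/<style> element (dropped entirely)
    //   2. a whole <pre> element (group 2 captures its body)
    //   3. any other tag (dropped, surrounding text kept)
    private static final Pattern TOKEN = Pattern.compile(
        "<(script|style)\\b[^>]*>.*?</\\1\\s*>"
        + "|<pre\\b[^>]*>(.*?)</pre\\s*>"
        + "|<[^>]*>",
        Pattern.CASE_INSENSITIVE | Pattern.DOTALL);

    static String extract(String html) {
        StringBuilder out = new StringBuilder();
        Matcher m = TOKEN.matcher(html);
        int last = 0;
        while (m.find()) {
            // Text between the previous token and this one gets collapsed.
            appendCollapsed(out, html.substring(last, m.start()));
            String preBody = m.group(2);
            if (preBody != null) {
                // PRE text is copied verbatim, on its own lines.
                if (out.length() > 0 && out.charAt(out.length() - 1) != '\n') {
                    out.append('\n');
                }
                out.append(preBody).append('\n');
            }
            last = m.end();
        }
        appendCollapsed(out, html.substring(last));
        return out.toString();
    }

    // Collapses each whitespace run to one space and avoids doubling
    // whitespace at the seam with what was already written.
    private static void appendCollapsed(StringBuilder out, String text) {
        String collapsed = text.replaceAll("\\s+", " ");
        boolean atBoundary = out.length() == 0
            || Character.isWhitespace(out.charAt(out.length() - 1));
        if (collapsed.startsWith(" ") && atBoundary) {
            collapsed = collapsed.substring(1);
        }
        out.append(collapsed);
    }

    public static void main(String[] args) {
        String html = "<html><head><style>p{color:red}</style></head>"
            + "<body><p>Hello   world</p><pre>a\n  b</pre></body></html>";
        System.out.print(extract(html));
    }
}
```

For real pages, StringBean remains the better choice, since it works on the parsed node tree rather than on raw text.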

Tags: html, java


Posted by IT瘾 on October 2, 2007 at 12:18:53 PM
