Most people writing crawlers pick Python first because it is simple, but Java is what I know best, so I took a close look at HtmlUnit, Java's in-memory browser. It turns out that stock HtmlUnit's performance and multi-threading support are not great; in particular, once you start using proxy IPs, OOM errors are routine. After monitoring the program and reading the source, I traced the problems to the following causes:
 
 
1. The JS executor runs JavaScript on a dedicated thread. When that thread is shut down, com.gargoylesoftware.htmlunit.javascript.background.DefaultJavaScriptExecutor runs this code:
 
    private void killThread() {
        if (eventLoopThread_ == null) {
            return;
        }
        try {
            eventLoopThread_.interrupt();
            eventLoopThread_.join(10_000);
        }
        catch (final InterruptedException e) {
            LOG.warn("InterruptedException while waiting for the eventLoop thread to join ", e);
            // ignore, this doesn't matter, we want to stop it
        }
        if (eventLoopThread_.isAlive()) {
            if (LOG.isWarnEnabled()) {
                LOG.warn("Event loop thread "
                        + eventLoopThread_.getName()
                        + " still alive at "
                        + System.currentTimeMillis());
                LOG.warn("Event loop thread will be stopped");
            }

            // Stop the thread
            eventLoopThread_.stop();
        }
    }
  
The problem in the code above is this pair of calls:

    eventLoopThread_.interrupt();
    eventLoopThread_.join(10_000);

Don't assume interrupt() actually stops the thread: it only sets a flag. If the event loop thread is blocked in non-interruptible I/O, or is busy executing a long JavaScript job, or swallows the InterruptedException, the interrupt does not terminate it, so the calling thread sits in join() for the full 10 seconds before it can move on. Close many WebClient instances (typical when rotating proxy IPs) and those 10-second stalls add up fast.
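The interrupt-then-join hazard is easy to reproduce without HtmlUnit. In this sketch (the class and names are mine, purely for illustration), a worker thread swallows InterruptedException and keeps running, so interrupt() + join(timeout) blocks for the whole timeout and still leaves the thread alive:

```java
public class InterruptDemo {
    public static void main(String[] args) throws Exception {
        Thread worker = new Thread(() -> {
            // Simulates a thread that ignores interrupts (or sits in
            // non-interruptible I/O): the loop never exits.
            while (true) {
                try {
                    Thread.sleep(50);
                } catch (InterruptedException e) {
                    // swallowed: the interrupt status is cleared, loop continues
                }
            }
        });
        worker.setDaemon(true); // daemon so the JVM can still exit
        worker.start();

        long start = System.currentTimeMillis();
        worker.interrupt();
        worker.join(500);       // like eventLoopThread_.join(10_000), shortened
        long waited = System.currentTimeMillis() - start;

        // We waited the full timeout and the thread is STILL alive.
        System.out.println("waited ~" + waited + " ms, alive=" + worker.isAlive());
    }
}
```

This is exactly why killThread() can cost the caller the full 10 seconds per WebClient.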
 
 
2. When WebClient instances are driven from an external thread pool, the DefaultJavaScriptExecutor thread frequently never shuts down. The relevant code:
 
   
    public void run() {
        final boolean trace = LOG.isTraceEnabled();
        // this has to be a multiple of 10ms
        // otherwise the VM has to fight with the OS to get such small periods
        final long sleepInterval = 10;
        while (!shutdown_.get() && !Thread.currentThread().isInterrupted() && webClient_.get() != null) {
            final JavaScriptJobManager jobManager = getJobManagerWithEarliestJob();

            if (jobManager != null) {
                final JavaScriptJob earliestJob = jobManager.getEarliestJob();
                if (earliestJob != null) {
                    final long waitTime = earliestJob.getTargetExecutionTime() - System.currentTimeMillis();

                    // do we have to execute the earliest job
                    if (waitTime < 1) {
                        // execute the earliest job
                        if (trace) {
                            LOG.trace("started executing job at " + System.currentTimeMillis());
                        }
                        jobManager.runSingleJob(earliestJob);
                        if (trace) {
                            LOG.trace("stopped executing job at " + System.currentTimeMillis());
                        }

                        // job is done, have a look for another one
                        continue;
                    }
                }
            }

            // check for cancel
            if (shutdown_.get() || Thread.currentThread().isInterrupted() || webClient_.get() == null) {
                break;
            }

            // nothing to do, let's sleep a bit
            try {
                Thread.sleep(sleepInterval);
            }
            catch (final InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }
  
 
The problematic condition is:

    while (!shutdown_.get() && !Thread.currentThread().isInterrupted() && webClient_.get() != null)

The loop only exits once shutdown_ is set or the WebClient is gone, which in practice requires webClient.close() to actually run. If the external worker thread is cancelled before it reaches its close() call, it's the same mistake as writing out.close() on an OutputStream outside a finally block: the JS executor thread spins forever.
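The defensive pattern is to treat the WebClient like any other resource and put close() in a finally block, so it runs even when the pool cancels the worker. Below is a sketch with a stand-in FakeWebClient (my own class, not HtmlUnit's) so it stays self-contained; with the real library the finally body would be webClient.close():

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

public class CloseInFinallyDemo {
    // Stand-in for WebClient; close() is what lets the JS executor loop exit.
    static class FakeWebClient implements AutoCloseable {
        final AtomicBoolean closed = new AtomicBoolean(false);
        @Override public void close() { closed.set(true); }
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(1);
        FakeWebClient client = new FakeWebClient();
        CountDownLatch started = new CountDownLatch(1);

        Future<?> task = pool.submit(() -> {
            started.countDown();
            try {
                Thread.sleep(10_000);        // pretend the crawl hangs here
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            } finally {
                client.close();              // ALWAYS runs, even when cancelled
            }
        });

        started.await();                     // make sure the task is running
        task.cancel(true);                   // the pool owner kills the worker
        pool.shutdown();
        pool.awaitTermination(5, TimeUnit.SECONDS);
        System.out.println("closed=" + client.closed.get()); // closed=true
    }
}
```

Without the finally, the cancelled worker would skip close() and the executor loop's webClient_.get() != null check would keep it alive indefinitely.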
 
  
3. The single biggest performance problem, though, is that HtmlUnit re-downloads the same resources (js, css, images) every time it fetches the same page. The download code lives in com.gargoylesoftware.htmlunit.HttpWebConnection; the main method is:
  
    /**
     * Reads the content of the stream and saves it in memory or on the file system.
     * @param is the stream to read
     * @param maxInMemory the maximumBytes to store in memory, after which save to a local file
     * @return a wrapper around the downloaded content
     * @throws IOException in case of read issues
     */
    public static DownloadedContent downloadContent(final InputStream is, final int maxInMemory) throws IOException {
        if (is == null) {
            return new DownloadedContent.InMemory(null);
        }

        try (ByteArrayOutputStream bos = new ByteArrayOutputStream()) {
            final byte[] buffer = new byte[1024];
            int nbRead;
            try {
                while ((nbRead = is.read(buffer)) != -1) {
                    bos.write(buffer, 0, nbRead);
                    if (bos.size() > maxInMemory) {
                        // we have exceeded the max for memory, let's write everything to a temporary file
                        final File file = File.createTempFile("htmlunit", ".tmp");
                        file.deleteOnExit();
                        try (OutputStream fos = Files.newOutputStream(file.toPath())) {
                            bos.writeTo(fos); // what we have already read
                            IOUtils.copyLarge(is, fos); // what remains from the server response
                        }
                        return new DownloadedContent.OnFile(file, true);
                    }
                }
            }
            catch (final ConnectionClosedException e) {
                LOG.warn("Connection was closed while reading from stream.", e);
                return new DownloadedContent.InMemory(bos.toByteArray());
            }
            catch (final EOFException e) {
                // this might happen with broken gzip content
                LOG.warn("EOFException while reading from stream.", e);
                return new DownloadedContent.InMemory(bos.toByteArray());
            }

            return new DownloadedContent.InMemory(bos.toByteArray());
        }
    }
  
 
By modifying this code to cache content that would otherwise be downloaded over and over, you can greatly increase crawling throughput, and as a bonus you get a hook for rewriting the page's JavaScript on the fly.
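A minimal way to build such a cache (my sketch; the class and field names are assumptions, not HtmlUnit code) is a ConcurrentHashMap keyed by URL in front of the real download. In HtmlUnit you would typically wire this in by subclassing HttpWebConnection, but the hook point depends on your version, so here the "downloader" is just a function:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

public class ResourceCacheDemo {
    // url -> raw bytes of the resource (js/css/images are the good candidates;
    // don't cache the HTML pages you actually want fresh)
    private final Map<String, byte[]> cache = new ConcurrentHashMap<>();
    private final Function<String, byte[]> downloader;
    final AtomicInteger downloads = new AtomicInteger();

    ResourceCacheDemo(Function<String, byte[]> downloader) {
        this.downloader = downloader;
    }

    byte[] fetch(String url) {
        // computeIfAbsent downloads at most once per URL, even under concurrency
        return cache.computeIfAbsent(url, u -> {
            downloads.incrementAndGet();
            return downloader.apply(u);
        });
    }

    public static void main(String[] args) {
        ResourceCacheDemo c = new ResourceCacheDemo(u -> ("body-of-" + u).getBytes());
        c.fetch("http://example.com/app.js");
        c.fetch("http://example.com/app.js");   // second call is served from cache
        System.out.println("downloads=" + c.downloads.get()); // downloads=1
    }
}
```

Because the cached value is the raw bytes, this is also the natural place to patch a page's JavaScript: substitute your own bytes for a given URL before the browser ever sees them. A real deployment would also want size limits and eviction, which this sketch omits.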
 
 
4. HtmlUnit's timeout setting does not let you configure connectTimeout and socketTimeout independently. Set the timeout to 10000 and HttpClient ends up with connectTimeout=10000 and socketTimeout=10000, which is wrong: the connect timeout should generally be far lower (I'd keep it under 100 ms), while the socket timeout has to cover the whole response. The code is in com.gargoylesoftware.htmlunit.HttpWebConnection, in this method:
  
    private static RequestConfig.Builder createRequestConfigBuilder(final int timeout, final InetAddress localAddress) {
        final RequestConfig.Builder requestBuilder = RequestConfig.custom()
                .setCookieSpec(HACKED_COOKIE_POLICY)
                .setRedirectsEnabled(false)
                .setLocalAddress(localAddress)

                // timeout
                .setConnectTimeout(timeout)
                .setConnectionRequestTimeout(timeout)
                .setSocketTimeout(timeout);
        return requestBuilder;
    }
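To see the distinction that this builder collapses into one value, here is a dependency-free sketch using the JDK's own java.net.HttpURLConnection, which keeps the two timeouts separate (the URL is just an example; no request is actually sent until you connect):

```java
import java.net.HttpURLConnection;
import java.net.URL;

public class SeparateTimeoutsDemo {
    public static void main(String[] args) throws Exception {
        HttpURLConnection conn =
                (HttpURLConnection) new URL("http://example.com/").openConnection();

        // The two values HtmlUnit's single `timeout` parameter merges:
        conn.setConnectTimeout(100);   // fail fast if the TCP connect stalls
        conn.setReadTimeout(10_000);   // but give a slow server time to stream the body

        System.out.println(conn.getConnectTimeout() + "/" + conn.getReadTimeout());
    }
}
```

Patching HtmlUnit itself means changing createRequestConfigBuilder so that setConnectTimeout receives its own, much smaller value instead of the shared timeout.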
 
 
To sum up: with all of the above fixed, HtmlUnit performs no worse than a Python crawler that has to fall back on multiple processes, and it is capable enough for black-hat work as well.