
NekoHTML FAQ


Why are the DOM element names always uppercase?

The HTML DOM specification explicitly states that element and attribute names follow the semantics, including case-sensitivity, specified in the HTML 4 specification. In addition, section 1.2.1 of the HTML 4.01 specification states:

Element names are written in uppercase letters (e.g., BODY). Attribute names are written in lowercase letters (e.g., lang, onsubmit).

The Xerces HTML DOM implementation (used by default in the NekoHTML DOMParser class) follows this convention. Therefore, even if the "http://cyberneko.org/html/properties/names/elems" property is set to "lower", the DOM will still uppercase the element names.

To get around this problem, instantiate a Xerces2 DOMParser object using the NekoHTML parser configuration. By default, the Xerces DOM parser class creates a standard XML DOM tree, not an HTML DOM tree. Therefore, the element and attribute names will follow the settings for the "http://cyberneko.org/html/properties/names/elems" and "http://cyberneko.org/html/properties/names/attrs" properties. Note, however, that the application will then not be able to cast the document nodes to the HTML DOM interfaces in order to access the document's information.

The following sample code shows how to instantiate a DOM parser using the NekoHTML parser configuration:

// import org.apache.xerces.parsers.DOMParser;
// import org.cyberneko.html.HTMLConfiguration;

DOMParser parser = new DOMParser(new HTMLConfiguration());
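
If switching DOM implementations is not an option, application code can instead avoid depending on the case of element names. The following is a minimal sketch of that idea using only the JDK's DOM classes. The ElementNames class and its methods are hypothetical helpers and are not part of NekoHTML or Xerces.

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Node;

/** Hypothetical helper (not part of NekoHTML): case-insensitive element name checks. */
class ElementNames {

    /** Returns true if node is an element whose name matches, ignoring case. */
    static boolean isElement(Node node, String name) {
        return node != null
            && node.getNodeType() == Node.ELEMENT_NODE
            && name.equalsIgnoreCase(node.getNodeName());
    }

    /** Creates an empty DOM document with the JDK's default builder, for the demo below. */
    static Document newDocument() {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Document doc = newDocument();
        // The same check succeeds for upper-case and lower-case element names
        System.out.println(isElement(doc.createElement("BODY"), "body"));
        System.out.println(isElement(doc.createElement("body"), "body"));
    }
}
```

Both calls print true, so code written this way should behave the same whether the tree comes from the HTML DOM implementation or from the standard XML DOM.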

Why do I get a hierarchy request error using DOM?

Using the NekoHTML DOM parser to parse HTML documents with namespace information can cause a hierarchy request error to be thrown. For example:

org.w3c.dom.DOMException: HIERARCHY_REQUEST_ERR: An attempt was made to insert a node where it is not permitted.

The Xerces HTML DOM implementation does not support namespaces and cannot represent XHTML documents with namespace information. Therefore, in order to use the default HTML DOM implementation with NekoHTML's DOMParser to parse XHTML documents, you must turn off namespace processing. For example:

// import org.cyberneko.html.parsers.DOMParser;

DOMParser parser = new DOMParser();
parser.setFeature("http://xml.org/sax/features/namespaces", false); 

If your application requires namespace processing to be turned on and uses the DOM API, another option is to add a custom filter to the parsing pipeline to remove namespace information before the DOMParser constructs the document. For example:

// import org.cyberneko.html.filters.DefaultFilter;
// import org.cyberneko.html.parsers.DOMParser;
// import org.apache.xerces.xni.*;
// import org.apache.xerces.xni.parser.XMLDocumentFilter;

DOMParser parser = new DOMParser();
parser.setProperty("http://cyberneko.org/html/properties/filters", 
  new XMLDocumentFilter[] { new DefaultFilter() {
    public void startElement(QName element, XMLAttributes attrs,
                             Augmentations augs) throws XNIException {
      element.uri = null;
      super.startElement(element, attrs, augs);
    }
    // ...etc...
  } });

How do I add filters before the tag balancer?

The NekoHTML parser has a property that allows you to append custom filter components at the end of the parser pipeline, as detailed in the Pipeline Filters documentation. Filters added this way run after the tag-balancer has done its job. However, the same property can also be used to insert custom components before the tag-balancer.

The secret is to disable the tag-balancing feature and then add another instance of the HTMLTagBalancer component at the end of your custom filter pipeline. The following example shows how to add a custom filter before the tag-balancer in the DOM parser. (This also works on all other types of parsers that use the HTMLConfiguration.)

// import org.cyberneko.html.HTMLTagBalancer;
// import org.cyberneko.html.parsers.DOMParser;
// import org.apache.xerces.xni.parser.XMLDocumentFilter;

DOMParser parser = new DOMParser();
parser.setFeature("http://cyberneko.org/html/features/balance-tags", false);
XMLDocumentFilter[] filters = { new MyFilter(), new HTMLTagBalancer() };
parser.setProperty("http://cyberneko.org/html/properties/filters", filters);

How do I parse HTML document fragments?

Frequently, HTML is used within applications and online forms to allow users to enter rich-text. In these situations, it is useful to be able to parse the entered text as a document fragment. In other words, the entered text represents content within the HTML <body> element — it is not a full HTML document.

Starting with version 0.7.0, NekoHTML includes a feature that allows the application to parse HTML document fragments. Setting the "http://cyberneko.org/html/features/document-fragment" feature to true instructs the tag-balancer to balance only tags found within the HTML <body> element. The surrounding <body> and <html> elements are not inserted.

Note: The document-fragment feature should not be used on the DOMParser class since it relies on balanced elements in order to correctly construct the DOM tree. However, a new parser class has been added to NekoHTML to allow you to parse DOM document fragments. Please refer to the Usage Instructions for more information.

How can I get the location of document information?

Many applications are interested in knowing where elements, attributes, and character data appear within the source document. To aid these applications, NekoHTML has a feature that reports the starting and ending character offsets of each piece of information in the document.

In order to tell NekoHTML to report the character offsets for document information, the augmentations feature needs to be turned on. For example:

 

// import org.cyberneko.html.parsers.SAXParser;

String AUGMENTATIONS = "http://cyberneko.org/html/features/augmentations";

SAXParser parser = new SAXParser();
parser.setFeature(AUGMENTATIONS, true);

Once the feature is enabled, the location information can be obtained by querying the HTMLEventInfo object in the Augmentations parameter passed to all XNI callbacks. This dependency is required because DOM and SAX lack the ability to communicate this detailed information to the application.

The XNI dependence does not restrict applications to only using the Xerces Native Interface, however. The best way to use this information is by extending one of the parsers in the org.cyberneko.html.parsers package and overriding the methods of interest. The following example extends the SAXParser class to retrieve the event information for start elements:

 

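// import org.apache.xerces.xni.Augmentations;
// import org.apache.xerces.xni.QName;
// import org.apache.xerces.xni.XMLAttributes;
// import org.apache.xerces.xni.XNIException;
// import org.cyberneko.html.HTMLEventInfo;
// import org.cyberneko.html.parsers.SAXParser;
// import org.xml.sax.SAXException;

public class MySAXParser extends SAXParser {

    static final String AUGMENTATIONS =
        "http://cyberneko.org/html/features/augmentations";

    // setFeature declares checked SAX exceptions, so the constructor propagates them
    public MySAXParser() throws SAXException {
        setFeature(AUGMENTATIONS, true);
    }

    public void startElement(QName element, XMLAttributes attrs,
                             Augmentations augs) throws XNIException {

        // get location information
        HTMLEventInfo info = (HTMLEventInfo)augs.getItem(AUGMENTATIONS);

        boolean synthesized = info.isSynthesized();
        int beginRow = info.getBeginLineNumber();
        int beginCol = info.getBeginColumnNumber();
        int endRow = info.getEndLineNumber();
        int endCol = info.getEndColumnNumber();

        // perform default processing
        super.startElement(element, attrs, augs);
    }

}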

Note: The NekoHTML parser reports character offsets and is unable to report the byte offsets that map to the resulting characters. The parser takes advantage of the character decoders present in the JVM which do not report byte offsets. And because these decoders buffer blocks of bytes internally for performance reasons, it is not possible to write a custom input stream to perform this mapping between byte and character offsets. If you control the source documents and can restrict them to a single character encoding, then writing a custom reader to perform this mapping is more feasible.
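
When the encoding is known and fixed, one simple alternative to a custom reader is to re-encode the text that precedes a character offset and count the resulting bytes. The following is a minimal sketch of that approach using only the JDK. The ByteOffsets class is a hypothetical helper, not part of NekoHTML. It assumes that the source has no byte order mark, that the decoded text round-trips to the original bytes, and that the offset does not split a surrogate pair.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper (not part of NekoHTML): relates character offsets to byte offsets. */
class ByteOffsets {

    /**
     * Returns the number of bytes that encode the first charOffset characters of text.
     * Assumes the charset writes no byte order mark (true for UTF-8 and ISO-8859-1,
     * not for Java's "UTF-16") and that charOffset does not split a surrogate pair.
     */
    static int byteOffset(String text, int charOffset, Charset charset) {
        return text.substring(0, charOffset).getBytes(charset).length;
    }

    public static void main(String[] args) {
        String text = "caf\u00e9 <b>menu</b>";  // \u00e9 takes two bytes in UTF-8
        int charOffset = text.indexOf("<b>");    // character offset 5
        System.out.println(byteOffset(text, charOffset, StandardCharsets.UTF_8));      // prints 6
        System.out.println(byteOffset(text, charOffset, StandardCharsets.ISO_8859_1)); // prints 5
    }
}
```

Each call re-encodes the whole prefix, so this is only reasonable for small documents or occasional lookups. An application that needs many lookups would be better served by a table built in a single pass.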

Note: Currently, only the start and end row and column information can be queried. In the future, NekoHTML will be able to report character offsets from the beginning of the file. This does not, however, mean that byte offsets will also be supported at a future date.
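
In the meantime, an application that keeps the source text in memory can derive character offsets from the reported row and column numbers. The following is a minimal sketch using only the JDK. The LineColumnIndex class is a hypothetical helper, not part of NekoHTML. It assumes that rows and columns are 1-based and that "\n", "\r" and "\r\n" each end a line; both assumptions should be checked against the values your parser version actually reports.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of NekoHTML): maps line/column pairs to character offsets. */
class LineColumnIndex {

    // lineStarts.get(i) is the character offset at which line i + 1 begins
    private final List<Integer> lineStarts = new ArrayList<>();

    LineColumnIndex(CharSequence text) {
        lineStarts.add(0);
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '\r') {
                // treat "\r\n" as a single line terminator
                if (i + 1 < text.length() && text.charAt(i + 1) == '\n') {
                    i++;
                }
                lineStarts.add(i + 1);
            } else if (c == '\n') {
                lineStarts.add(i + 1);
            }
        }
    }

    /** Returns the 0-based character offset for a 1-based line and column (assumed convention). */
    int toOffset(int line, int column) {
        if (line < 1 || line > lineStarts.size() || column < 1) {
            throw new IllegalArgumentException("position out of range: " + line + ":" + column);
        }
        return lineStarts.get(line - 1) + (column - 1);
    }

    public static void main(String[] args) {
        String html = "<html>\n<body>\n<p>Hi</p>\n</body>\n</html>";
        LineColumnIndex index = new LineColumnIndex(html);
        int offset = index.toOffset(3, 1);                       // start of the third line
        System.out.println(html.substring(offset, offset + 3)); // prints <p>
    }
}
```

The index is built once per document, so each lookup afterwards is a constant-time list access.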

Do I have to use all of Xerces2?

While NekoHTML is a rather small library, many users complain about the size of the Xerces2 library. However, the full Xerces2 library is not required in order to use the NekoHTML parser. Because the CyberNeko HTML parser is written using the Xerces Native Interface (XNI) framework that forms the foundation of the Xerces2 implementation, only that part is required to write applications using NekoHTML.

For convenience, a small Jar file containing only the necessary parts of the framework and utility classes from Xerces2 is distributed with the NekoHTML package. The Jar file, called xercesMinimal.jar, can be found in the lib/ directory of the distribution. Simply add this file to your classpath along with nekohtml.jar.

However, there are a few restrictions if you choose to use the xercesMinimal.jar file instead of the full Xerces2 package. First, you cannot use the DOM and SAX parsers included with NekoHTML because they use the Xerces2 base classes. Second, because you cannot use the convenience parser classes, your application must be written using the XNI framework. However, using the XNI framework is not difficult for programmers familiar with SAX. [Note: future versions of NekoHTML may include custom implementations of the DOM and SAX parsers to avoid this dependence on the Xerces2 library.]

Most users of the CyberNeko HTML parser will not have a problem including the full Xerces2 package because the application is likely to need an XML parser implementation. However, for users who are concerned about Jar file size, the xercesMinimal.jar file may be a useful alternative.

What version of NekoHTML am I using?

Since version 0.9.3, NekoHTML includes a class that can be used to query the product version within application code. The Version class in the org.cyberneko.html package contains a method, getVersion, that returns the NekoHTML version as a string. For example:

// import org.cyberneko.html.Version;

System.err.println(Version.getVersion());

The Version class also includes a main method that prints the version information to standard output.

The version and product information can also be queried using the Java package API. For example:

Class cls = Class.forName("org.cyberneko.html.HTMLConfiguration");
Package pkg = cls.getPackage();

String name = pkg.getName();

String specTitle   = pkg.getSpecificationTitle();
String specVendor  = pkg.getSpecificationVendor();
String specVersion = pkg.getSpecificationVersion();

String implTitle   = pkg.getImplementationTitle();
String implVendor  = pkg.getImplementationVendor();
String implVersion = pkg.getImplementationVersion();
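
Note that the Package getters return null when the Jar manifest does not carry the corresponding entries, or when the classes are loaded from a directory rather than a Jar. The following is a minimal sketch of a null-safe wrapper around the calls above, using only the JDK. The PackageInfo class and its methods are hypothetical helpers and are not part of NekoHTML.

```java
/** Hypothetical helper (not part of NekoHTML): null-safe summary of package version info. */
class PackageInfo {

    /** Replaces a missing manifest value with a readable placeholder. */
    static String orUnknown(String value) {
        return value != null ? value : "(unknown)";
    }

    /** Returns the package name plus specification and implementation versions, if present. */
    static String describe(Class<?> cls) {
        Package pkg = cls.getPackage();
        if (pkg == null) {
            return "(no package information)";
        }
        return pkg.getName()
            + " spec=" + orUnknown(pkg.getSpecificationVersion())
            + " impl=" + orUnknown(pkg.getImplementationVersion());
    }

    public static void main(String[] args) throws ClassNotFoundException {
        // Pass org.cyberneko.html.HTMLConfiguration as the argument when NekoHTML
        // is on the classpath; a JDK class is used by default so the sketch runs anywhere.
        String className = args.length > 0 ? args[0] : "java.lang.String";
        System.out.println(describe(Class.forName(className)));
    }
}
```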