Parser parser = new Parser ("http://yadda");
parser.parse (new HasAttributeFilter ("id"));
These filters can be combined to yield powerful extraction capabilities.
For example, to get a list of links whose content is an image, you could use:
NodeList list = new NodeList ();
NodeFilter filter =
    new AndFilter (
        new TagNameFilter ("A"),
        new HasChildFilter (
            new TagNameFilter ("IMG")));
for (NodeIterator e = parser.elements (); e.hasMoreNodes (); )
    e.nextNode ().collectInto (list, filter);
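To make the composition above easier to follow, here is a minimal, library-free sketch of the same idea. It is not the HTML Parser implementation: `FilterSketch`, its toy `Node` class, and the helpers `tagName`, `hasChild`, `collectInto`, and `imageLinks` are hypothetical stand-ins built on `java.util.function.Predicate`. The sketch also assumes that the child test looks at direct children only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

class FilterSketch {
    /** Toy stand-in for a parsed tag: a name plus its child nodes. */
    static final class Node {
        final String name;
        final List<Node> children = new ArrayList<>();

        Node(String name, Node... kids) {
            this.name = name;
            for (Node kid : kids) {
                children.add(kid);
            }
        }
    }

    /** Accepts nodes whose name matches, ignoring case. */
    static Predicate<Node> tagName(String name) {
        return node -> node.name.equalsIgnoreCase(name);
    }

    /** Accepts nodes with at least one direct child accepted by childFilter. */
    static Predicate<Node> hasChild(Predicate<Node> childFilter) {
        return node -> node.children.stream().anyMatch(childFilter);
    }

    /** Walks the subtree rooted at node and appends every accepted node to out. */
    static void collectInto(Node node, List<Node> out, Predicate<Node> filter) {
        if (filter.test(node)) {
            out.add(node);
        }
        for (Node child : node.children) {
            collectInto(child, out, filter);
        }
    }

    /** Collects links (A) that directly contain an image (IMG). */
    static List<Node> imageLinks(Node root) {
        Predicate<Node> filter = tagName("A").and(hasChild(tagName("IMG")));
        List<Node> out = new ArrayList<>();
        collectInto(root, out, filter);
        return out;
    }

    public static void main(String[] args) {
        Node root = new Node("BODY",
            new Node("A", new Node("IMG")),   // link wrapping an image: accepted
            new Node("A", new Node("SPAN")),  // link without an image: rejected
            new Node("IMG"));                 // bare image, not a link: rejected
        System.out.println(imageLinks(root).size()); // prints 1
    }
}
```

The point of the sketch is that each filter is a small yes/no test on a single node, so combining filters amounts to combining those tests, while the traversal stays separate from the matching logic.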
| Class | Description |
|---|---|
| AndFilter | Accepts nodes matching all of its predicate filters (AND operation). |
| CssSelectorNodeFilter | Accepts nodes based on whether they match a CSS2 selector. |
| HasAttributeFilter | Accepts all tags that have a certain attribute and, optionally, a certain value. |
| HasChildFilter | Accepts all tags that have a child acceptable to the filter. |
| HasParentFilter | Accepts all tags that have a parent acceptable to another filter. |
| HasSiblingFilter | Accepts all tags that have a sibling acceptable to another filter. |
| IsEqualFilter | Accepts only one specific node. |
| LinkRegexFilter | Accepts tags of class LinkTag that contain a link matching a given regex pattern. |
| LinkStringFilter | Accepts tags of class LinkTag that contain a link matching a given pattern string. |
| NodeClassFilter | Accepts all tags of a given class. |
| NotFilter | Accepts all nodes not acceptable to its predicate filter. |
| OrFilter | Accepts nodes matching any of its predicate filters (OR operation). |
| RegexFilter | Accepts all string nodes matching a regular expression. |
| StringFilter | Accepts all string nodes containing the given string. |
| TagNameFilter | Accepts all tags matching the tag name. |
| XorFilter | Accepts nodes matching an odd number of its predicate filters (XOR operation). |
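
The logical filters in the table differ only in how they combine the results of their predicate filters. The following sketch illustrates those semantics, in particular the "odd number" rule for XOR. It uses plain `java.util.function.Predicate` rather than the library's classes; `CombinatorSketch`, `hasAttr`, `and`, `or`, `xor`, and `not` are hypothetical names, and a tag is reduced here to the set of its attribute names.

```java
import java.util.List;
import java.util.Set;
import java.util.function.Predicate;

class CombinatorSketch {
    /** Accepts an attribute set that contains the given attribute name. */
    static Predicate<Set<String>> hasAttr(String name) {
        return attrs -> attrs.contains(name);
    }

    /** AND: every predicate must accept the input. */
    static <T> Predicate<T> and(List<Predicate<T>> predicates) {
        return x -> predicates.stream().allMatch(p -> p.test(x));
    }

    /** OR: at least one predicate must accept the input. */
    static <T> Predicate<T> or(List<Predicate<T>> predicates) {
        return x -> predicates.stream().anyMatch(p -> p.test(x));
    }

    /** XOR: an odd number of predicates must accept the input. */
    static <T> Predicate<T> xor(List<Predicate<T>> predicates) {
        return x -> predicates.stream().filter(p -> p.test(x)).count() % 2 == 1;
    }

    /** NOT: accepts exactly what the predicate rejects. */
    static <T> Predicate<T> not(Predicate<T> predicate) {
        return predicate.negate();
    }

    public static void main(String[] args) {
        List<Predicate<Set<String>>> filters =
            List.of(hasAttr("id"), hasAttr("href"), hasAttr("class"));

        Predicate<Set<String>> odd = xor(filters);
        System.out.println(odd.test(Set.of("id")));                  // one match: true
        System.out.println(odd.test(Set.of("id", "href")));          // two matches: false
        System.out.println(odd.test(Set.of("id", "href", "class"))); // three matches: true

        System.out.println(and(filters).test(Set.of("id", "href")));  // false
        System.out.println(or(filters).test(Set.of("id", "href")));   // true
        System.out.println(not(hasAttr("id")).test(Set.of("href")));  // true
    }
}
```

With more than two predicates, XOR is not "exactly one": an input accepted by all three predicates above still passes, because three is odd.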
HTML Parser is an open source library released under LGPL.