HTMLDocument


Objective-C wrapper for the HTML parser of libxml2


This HTML parser gives Objective-C access to libxml2 on Mac OS X (Leopard and later) and iOS. An optional category provides XPath support.

libxml2 is very fast. To keep overhead low, all recursive tasks are implemented as C functions.

The naming is similar to NSXMLDocument (which is not available on iOS).

Unlike NSXMLDocument, which inherits from NSXMLNode, HTMLDocument does not inherit from HTMLNode. There is no HTMLElement class, and you cannot create new documents or modify nodes.


All methods that take no parameters and return a value or object are declared as read-only properties, so they can be used with dot syntax.


Objective-C classes:


HTMLDocument

HTMLNode


Optional category of HTMLNode for XPath support:


HTMLNode+XPath



How to use:


• Add the class files and the (optional) category files to your project

• Add libxml2.dylib to frameworks (Link Binary With Libraries)

• Add $SDKROOT/usr/include/libxml2 to target -> Build Settings -> Header Search Paths

• Add -lxml2 to target -> Build Settings -> Other Linker Flags

• Import the HTMLDocument.h and (if needed) HTMLNode+XPath.h header files



HTMLDocument:


Create an HTMLDocument with one of these init methods


- (id)initWithData:(NSData *)data encoding:(NSStringEncoding )encoding error:(NSError **)error; // designated initializer

- (id)initWithContentsOfURL:(NSURL *)url encoding:(NSStringEncoding )encoding error:(NSError **)error;

- (id)initWithHTMLString:(NSString *)string encoding:(NSStringEncoding )encoding error:(NSError **)error;


The corresponding initializer methods without the encoding parameter assume UTF-8 encoding.

For each initializer method there is also a convenience class method


+ (HTMLDocument *)documentWith…


Get the root node (actually the <html> node) or the <body> node of the document with


@property (readonly) HTMLNode *rootNode

@property (readonly) HTMLNode *body
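
As a minimal sketch of the initializers and the two node properties above: the helper name BodyTextFromHTMLString is hypothetical and not part of the library. The sketch assumes ARC, assumes that the initializer returns nil on failure, and assumes that nodes should only be used while their document is alive, so it reads the value it needs before returning.

```objective-c
#import <Foundation/Foundation.h>
#import "HTMLDocument.h"

// Hypothetical helper (not part of the library): parses an HTML string and
// returns the trimmed text content of its <body>, or nil if parsing fails.
// Assumes ARC and that the initializer returns nil on failure.
static NSString *BodyTextFromHTMLString(NSString *html)
{
    NSError *error = nil;
    HTMLDocument *document = [[HTMLDocument alloc] initWithHTMLString:html
                                                             encoding:NSUTF8StringEncoding
                                                                error:&error];
    if (document == nil) {
        NSLog(@"Could not parse HTML: %@", error);
        return nil;
    }

    HTMLNode *root = document.rootNode; // the <html> node
    HTMLNode *body = document.body;     // the <body> node
    NSLog(@"Root tag: %@", root.tagName);

    // Read the value while the document is still alive.
    return body.textContent;
}
```

The other two initializers take the same encoding and error parameters, so the same pattern should apply to NSData and NSURL input.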



HTMLNode:


In HTMLNode, search for node(s) only within the first level of children of the current node using the methods with the prefix


- (HTMLNode *)child…

- (NSArray *)children…


or perform a deep search within all descendants of the current node


- (HTMLNode *)descendant…

- (NSArray *)descendants…


The corresponding methods to search with XPath within all descendants (provided by the optional HTMLNode+XPath category) are


- (HTMLNode *)node…

- (NSArray *)nodes…


Generic methods to search with a custom XPath query are


- (HTMLNode *)nodeForXPath:(NSString *)query error:(NSError **)error;

- (NSArray *)nodesForXPath:(NSString *)query error:(NSError **)error;
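
The generic XPath methods can be used roughly as follows. This is a sketch under assumptions: the helper name LinkTargetsInHTMLString is hypothetical, ARC is assumed, and it is not confirmed here whether nodesForXPath:error: returns nil or an empty array when nothing matches, so the code tolerates both.

```objective-c
#import <Foundation/Foundation.h>
#import "HTMLDocument.h"
#import "HTMLNode+XPath.h"

// Hypothetical helper (not part of the library): returns the href values of
// all <a> elements that have an href attribute, in document order.
static NSArray *LinkTargetsInHTMLString(NSString *html)
{
    NSError *error = nil;
    HTMLDocument *document = [[HTMLDocument alloc] initWithHTMLString:html
                                                             encoding:NSUTF8StringEncoding
                                                                error:&error];
    if (document == nil) {
        return nil;
    }

    // Relative XPath: every <a> with an href below the root node.
    NSArray *anchors = [document.rootNode nodesForXPath:@".//a[@href]" error:&error];

    NSMutableArray *targets = [NSMutableArray array];
    for (HTMLNode *anchor in anchors) { // iterating nil is a no-op
        NSString *href = anchor.hrefValue;
        if (href != nil) {
            [targets addObject:href];
        }
    }
    return targets;
}
```

Collecting plain strings before returning avoids handing out nodes whose document may already be gone.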



Methods to navigate within the nodes


@property (readonly) HTMLNode *parent; // returns the parent node

@property (readonly) HTMLNode *nextSibling; // returns the next sibling

@property (readonly) HTMLNode *previousSibling; // returns the previous sibling

@property (readonly) HTMLNode *firstChild; // returns the first child

@property (readonly) HTMLNode *lastChild; // returns the last child

@property (readonly) NSArray *children; // returns the first level of children

@property (readonly) NSUInteger childCount; // returns the number of children

- (HTMLNode *)childAtIndex:(NSUInteger)index; // returns the child at given index
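
A sketch of walking the first level of children with the navigation properties above. The helper name ChildTagNamesOfBody is hypothetical, ARC is assumed, and the loop assumes that nextSibling returns nil after the last sibling. Because it is not stated whether text nodes are included in the sibling chain, the sketch filters with isElementNode.

```objective-c
#import <Foundation/Foundation.h>
#import "HTMLDocument.h"

// Hypothetical helper (not part of the library): returns the tag names of
// the element nodes directly below <body>, in document order.
static NSArray *ChildTagNamesOfBody(NSString *html)
{
    NSError *error = nil;
    HTMLDocument *document = [[HTMLDocument alloc] initWithHTMLString:html
                                                             encoding:NSUTF8StringEncoding
                                                                error:&error];
    if (document == nil) {
        return nil;
    }

    NSMutableArray *names = [NSMutableArray array];
    // Assumption: nextSibling is nil once the last sibling has been reached.
    for (HTMLNode *node = document.body.firstChild; node != nil; node = node.nextSibling) {
        if (node.isElementNode && node.tagName != nil) {
            [names addObject:node.tagName];
        }
    }
    return names;
}
```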



Methods to get tag and attribute names and values.


- (NSString *)attributeForName:(NSString *)attributeName; // returns the attribute value matching the name

@property (readonly) NSDictionary *attributes; // returns all attributes and values as dictionary

@property (readonly) NSString *tagName; // returns the tag name

@property (readonly) NSString *className; // returns the value for the class attribute

@property (readonly) NSString *hrefValue; // returns the value for the href attribute

@property (readonly) NSString *srcValue; // returns the value for the src attribute
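
The attribute accessors might be combined as in this sketch. The helper name FirstBodyChildAttributes is hypothetical, ARC is assumed, and the returned dictionary is assumed to map attribute names to their string values, as the comment on the attributes property suggests.

```objective-c
#import <Foundation/Foundation.h>
#import "HTMLDocument.h"

// Hypothetical helper (not part of the library): returns the attributes of
// the first child of <body> as a dictionary, or nil if there is no such node.
static NSDictionary *FirstBodyChildAttributes(NSString *html)
{
    NSError *error = nil;
    HTMLDocument *document = [[HTMLDocument alloc] initWithHTMLString:html
                                                             encoding:NSUTF8StringEncoding
                                                                error:&error];
    if (document == nil) {
        return nil;
    }

    HTMLNode *node = document.body.firstChild;
    if (node == nil) {
        return nil;
    }

    // Single attribute by name, plus the href shortcut property.
    NSString *title = [node attributeForName:@"title"];
    NSLog(@"<%@> title=%@ href=%@", node.tagName, title, node.hrefValue);

    // All attributes at once.
    return node.attributes;
}
```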


Methods to get node values.


@property (readonly) NSInteger integerValue; // returns the integer value

@property (readonly) double doubleValue; // returns the double value

- (double )doubleValueOfString:(NSString *)string forLocaleIdentifier:(NSString *)identifier; // returns the double value of a string for a given locale identifier e.g. en_US or fr_CH

- (double )doubleValueForLocaleIdentifier:(NSString *)identifier; // returns the double value of the string value for a given locale identifier

- (double )contentDoubleValueForLocaleIdentifier:(NSString *)identifier; // returns the double value of the text content for a given locale identifier

- (NSDate *)dateValueFromString:(NSString *)string format:(NSString *)dateFormat timeZone:(NSTimeZone *)timeZone; // returns the date value of a string for a given date format and time zone

- (NSDate *)dateValueForFormat:(NSString *)dateFormat timeZone:(NSTimeZone *)timeZone; // returns the date value of the string value for a given date format and time zone

- (NSDate *)contentDateValueForFormat:(NSString *)dateFormat timeZone:(NSTimeZone *)timeZone; // returns the date value of the text content for a given date format and time zone

- (NSDate *)dateValueForFormat:(NSString *)dateFormat; // returns the date value of the string value for a given date format and system time zone

- (NSDate *)contentDateValueForFormat:(NSString *)dateFormat; // returns the date value of the text content for a given date format and system time zone

@property (readonly) NSString *rawStringValue; // returns the raw string value

@property (readonly) NSString *stringValue; // returns the string value with leading and trailing whitespace and newline characters trimmed

@property (readonly) NSString *stringValueCollapsingWhitespace; // returns the string value with leading and trailing whitespace and newline characters trimmed, and with every run of whitespace and newline characters within the string collapsed into a single space

@property (readonly) NSString *HTMLString; // returns the raw html text dump

@property (readonly) NSArray *textContentOfChildren; // returns an array of the text content of all children; each array item has leading and trailing whitespace and newline characters trimmed

@property (readonly) xmlElementType elementType; // returns the element type

@property (readonly) BOOL isAttributeNode; // Boolean check for specific xml node types

@property (readonly) BOOL isDocumentNode;

@property (readonly) BOOL isElementNode;

@property (readonly) BOOL isTextNode;

@property (readonly) NSString *rawTextContent; // returns the raw text content of descendant-or-self

@property (readonly) NSString *textContent; // returns the text content of descendant-or-self with leading and trailing whitespace and newline characters trimmed

@property (readonly) NSString *textContentCollapsingWhitespace; // returns the text content of descendant-or-self with leading and trailing whitespace and newline characters trimmed, and with every run of whitespace and newline characters within the string collapsed into a single space

@property (readonly) NSArray *textContentOfDescendants; // returns an array of all text content of descendant-or-self; each array item has leading and trailing whitespace and newline characters trimmed

@property (readonly) NSString *HTMLContent; // returns the raw html text dump of descendant-or-self
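
To illustrate the difference between the text content variants above, a minimal sketch: the helper name CollapsedTextOfFirstBodyChild is hypothetical, ARC is assumed, and the behavior relied on is only what the property comments describe.

```objective-c
#import <Foundation/Foundation.h>
#import "HTMLDocument.h"

// Hypothetical helper (not part of the library): returns the text content of
// the first child of <body>, trimmed and with whitespace runs collapsed.
static NSString *CollapsedTextOfFirstBodyChild(NSString *html)
{
    NSError *error = nil;
    HTMLDocument *document = [[HTMLDocument alloc] initWithHTMLString:html
                                                             encoding:NSUTF8StringEncoding
                                                                error:&error];
    if (document == nil) {
        return nil;
    }

    HTMLNode *node = document.body.firstChild;

    // The three variants differ only in how whitespace is treated.
    NSLog(@"raw:       '%@'", node.rawTextContent);
    NSLog(@"trimmed:   '%@'", node.textContent);
    NSLog(@"collapsed: '%@'", node.textContentCollapsingWhitespace);

    return node.textContentCollapsingWhitespace;
}
```

The collapsed form is usually the most convenient one for scraped text, since HTML source often contains line breaks and indentation inside elements.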



The class files are available on GitHub



© 2011 Stefan Klieme