Parsing HTML easily in Objective-C with ObjectiveGumbo

Previously, if you wanted to parse HTML on iOS/OSX using Objective-C, you had to use an XML parser. I have now written an Objective-C wrapper around Gumbo, Google’s new C HTML5 parsing library, so that you can parse HTML with minimum hassle.

To get started, you will first need to download the Gumbo source code from its GitHub repo and follow its getting started instructions.

Once you have a working local copy of Gumbo (you can validate this by running one of the samples), download the source code for ObjectiveGumbo from GitHub. To add ObjectiveGumbo to your Xcode project, do the following:

  1. Add the ObjectiveGumbo folder from the repository – this contains the source code for the framework, which is basically just a few classes
  2. Add the src folder from the Gumbo repo (only the .c and .h files)
  3. Ensure that Xcode is set to compile all .c and .m files (Xcode 5 doesn’t do this by default when adding files to a project)
  4. Validate that the project builds correctly
  5. Import “ObjectiveGumbo.h” when you want to use it in your app

Using ObjectiveGumbo is pretty simple. You can either parse a whole document, which gives you access to the DOCTYPE information, or just the root element (i.e. the body). Either can be loaded from an instance of NSData, NSURL or NSString.
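As a rough sketch of the NSString route, something like the following should work. Note the assumptions: the method name parseDocumentWithString: is inferred by analogy with parseDocumentWithUrl: and should be checked against ObjectiveGumbo.h, and the HTML string is made up purely for illustration.

```objc
#import <Foundation/Foundation.h>
#import "ObjectiveGumbo.h"

int main(void)
{
    @autoreleasepool
    {
        // Hypothetical input; any HTML held in an NSString would do.
        NSString * html = @"<html><body><p class=\"title\"><a href=\"/story\">Story</a></p></body></html>";

        // Assumption: parseDocumentWithString: mirrors parseDocumentWithUrl:.
        OGNode * document = [ObjectiveGumbo parseDocumentWithString:html];

        // Find every element with class "title" and log the href of its first child.
        for (OGElement * element in [document elementsWithClass:@"title"])
        {
            if (element.children.count > 0)
            {
                OGElement * link = element.children[0];
                NSLog(@"%@", link.attributes[@"href"]);
            }
        }
    }
    return 0;
}
```

The NSData route would presumably follow the same pattern with a different class method; the README is the place to confirm the exact names.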

Simple example fetching the current top stories on Hacker News:

// Parse the whole page, then collect every element with the class "title"
OGNode * data = [ObjectiveGumbo parseDocumentWithUrl:[NSURL URLWithString:@""]];
NSArray * tableRows = [data elementsWithClass:@"title"];
for (OGElement * tableRow in tableRows)
{
    if (tableRow.children.count > 1)
    {
        // The first child of each row is the link to the story
        OGElement * link = tableRow.children[0];
        NSLog(@"%@", link.attributes[@"href"]);
    }
}

More detail is available in the README, and the GitHub repo also contains a more developed Hacker News example (iOS) and a simple Markdown to HTML parser (OSX, not complete).