Parsing HTML easily in Objective-C with ObjectiveGumbo

Previously, if you wanted to parse HTML on iOS/OSX using Objective-C, you had to use an XML parser. I have now written an Objective-C wrapper around Gumbo, Google’s new C HTML5 parsing library, so that you can parse HTML with minimum hassle.

To get started, you will first need to download the Gumbo source code from its GitHub repo and follow the getting started instructions there.

Once you have a working local copy of Gumbo (you can validate this by running one of the samples), download the source code for ObjectiveGumbo from GitHub. To add ObjectiveGumbo to your Xcode project, do the following:

  1. Add the ObjectiveGumbo folder from the repository – this contains the source code for the framework, which is basically just a few classes
  2. Add the src folder from the Gumbo repo (only the .c and .h files)
  3. Ensure that Xcode is set to compile all .c and .m files (Xcode 5 doesn’t do this by default when adding files to a project)
  4. Validate that the project builds correctly
  5. Import “ObjectiveGumbo.h” when you want to use it in your app

Using ObjectiveGumbo is pretty simple. You can either parse a whole document, which gives you access to the DOCTYPE information, or just the root element (i.e. the body). Either can be loaded from an instance of NSData, NSURL or NSString.

Simple example fetching the current top stories on Hacker News:

// The URL is blank in the original post; supply the page to parse.
OGNode * data = [ObjectiveGumbo parseDocumentWithUrl:[NSURL URLWithString:@""]];
NSArray * tableRows = [data elementsWithClass:@"title"];
for (OGElement * tableRow in tableRows) {
    if (tableRow.children.count > 1) {
        // Log the href of the row's first child element.
        OGElement * link = tableRow.children[0];
        NSLog(@"%@", link.attributes[@"href"]);
    }
}

More detail is explained in the README and there is also a more developed Hacker News example (iOS) and simple Markdown to HTML parser (OSX – not complete) in the GitHub repo.


You’re not using ARC

ARC (Automatic Reference Counting) is a great addition to Objective-C, allowing for the automatic release of objects once they are no longer being used. I came to Objective-C from C# earlier this year, so it seemed completely natural to me for it to be there; memory management had always been very good in .NET, and I expected the compiler to worry about it for me. As a result, when I learned Objective-C I got completely used to using ARC.

But am I using ARC? I wasn’t writing any extra code that I wouldn’t have written normally. I’m not thinking any differently. ARC has effectively become so embedded into Objective-C that I would argue that now it seems like you’re using ‘extras’ when writing code with old school memory management.

When browsing for arguments against using ARC, I found that the general consensus was that ARC does everything you could possibly need it to do (‘Use it. Do it today’). The only real reasons not to use it are that you are perfectly capable of writing good ‘old school’ code or that you need to support two- or three-year-old versions of iOS/OSX. Even then, the suggestion was that you are only going to write better code if you use ARC.

I struggled to find any other strong arguments for not using ARC, because of the simplicity that it brings to Objective-C. One suggestion was that it would make older code harder to maintain, but at the end of the day it doesn’t – I have written code that integrates absolutely fine into my existing code because I can just set Xcode not to use ARC on specific files.
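
As a rough sketch of what opting a single file out of ARC looks like: in Xcode you can add the Clang flag -fno-objc-arc to that file under Build Phases > Compile Sources, and the file can then use manual retain/release as before. The file name and function below are hypothetical, made up purely for illustration.

```objc
// LegacyGreeter.m (hypothetical file), compiled with -fno-objc-arc
#import <Foundation/Foundation.h>

// Fail the build early if someone compiles this file with ARC enabled,
// because the explicit autorelease call below is not allowed under ARC.
#if __has_feature(objc_arc)
#error This file uses manual retain/release; compile it with -fno-objc-arc.
#endif

// Builds a greeting using the explicit ownership calls that ARC
// would otherwise insert for you.
static NSString *LegacyGreeting(NSString *name)
{
    // alloc/init returns an object we own (retain count +1).
    NSMutableString *greeting = [[NSMutableString alloc] initWithString:@"Hello, "];
    [greeting appendString:name];
    // Give up ownership, but keep the object alive until the
    // enclosing autorelease pool drains so the caller can use it.
    return [greeting autorelease];
}
```

Code compiled with ARC can call LegacyGreeting like any other function; the two kinds of file link together in the same target.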

Ultimately, I think that because ARC is such a standard thing to ‘use’ now, we shouldn’t think of ourselves as ‘using’ it any more. It should be accepted as a component of Objective-C rather than thought of as an ‘extra’ feature (even though, technically, it is an addition Apple made to Clang).

N.B. ARC is both great and terrible for learning. It is great because it makes Objective-C a hell of a lot easier to pick up, and for a beginner it makes code in general much simpler to understand. It is awful because it doesn’t teach memory management or garbage collection at all. Developers should understand where their objects are at any point in the event loop, and that is a lot easier to follow when all of the retain and release calls are written out in your code. In theory, though, ARC should mean that you don’t have to understand it anymore.

Play Time

This morning I was scrolling through iTunes, having a look at my top played songs, and it occurred to me that iTunes doesn’t actually offer a way of seeing how much time you’ve dedicated to listening to a song or to your entire library. My first approach to solving this problem was to make a spreadsheet and type in the length and play count of each song. After a couple of songs I got bored, so I came up with a simpler solution.

iTunes stores a list of all your music in an XML file in the iTunes folder, so I put together a simple app in Xcode that reads this file (or lets the user select one if the iTunes file can’t be found) and does all the math for you. Play Time should work fine on 64-bit Lion.
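
For the curious, the math boils down to something like the sketch below. This is not the app’s actual source. It assumes the library file is a property list with a top-level "Tracks" dictionary whose entries carry "Play Count" and "Total Time" (in milliseconds), and that the file lives at ~/Music/iTunes/iTunes Music Library.xml; the function names are made up for illustration.

```objc
#import <Foundation/Foundation.h>

// Sums (play count x track length) over every track in a parsed library
// dictionary and returns the total listening time in seconds.
// Assumes "Total Time" is stored in milliseconds.
static NSTimeInterval PTTotalPlaySeconds(NSDictionary *library)
{
    NSDictionary *tracks = library[@"Tracks"];
    double totalMilliseconds = 0;
    for (NSString *trackID in tracks) {
        NSDictionary *track = tracks[trackID];
        // A track with no "Play Count" key yields nil here, and
        // messaging nil returns 0, so it adds nothing to the total.
        double plays = [track[@"Play Count"] doubleValue];
        double lengthMs = [track[@"Total Time"] doubleValue];
        totalMilliseconds += plays * lengthMs;
    }
    return totalMilliseconds / 1000.0;
}

// Loads the library from the assumed default location and logs the total.
// Returns NO if the file could not be read as a property list.
static BOOL PTLogTotalPlayTime(void)
{
    NSString *path = [@"~/Music/iTunes/iTunes Music Library.xml"
                      stringByExpandingTildeInPath];
    NSDictionary *library = [NSDictionary dictionaryWithContentsOfFile:path];
    if (library == nil) {
        NSLog(@"Could not read a library at %@", path);
        return NO;
    }
    NSLog(@"Total play time: %.1f hours", PTTotalPlaySeconds(library) / 3600.0);
    return YES;
}
```

Keeping the sum separate from the file loading means the same function works whether the file came from the default location or was picked by the user.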

Play Time for Lion (DMG)