Last weekend I wrote a C# program that analyzes text reasonably efficiently and outputs the word frequencies to a CSV file (admittedly tab-separated at the moment). I then released all of the source on GitHub, but I hadn’t done any proper analysis with it beyond the books I was using from Project Gutenberg.
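For anyone who wants the gist without reading the C# source, here is a minimal Python sketch of the same idea: count the words and write them out tab-separated. It is not a port of the actual program; the tokenising rule (lowercase runs of letters, digits and apostrophes) and the names `word_frequencies` and `write_tsv` are assumptions made for illustration.

```python
import csv
import re
from collections import Counter

# Assumption: a "word" is a run of letters, digits or apostrophes, case-insensitive.
WORD_RE = re.compile(r"[a-z0-9']+")


def word_frequencies(text):
    """Count how often each word occurs in the text."""
    return Counter(WORD_RE.findall(text.lower()))


def write_tsv(counts, out):
    """Write word/count rows, most frequent first, to an open text file."""
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["word", "count"])
    for word, count in counts.most_common():
        writer.writerow([word, count])
```

Usage would look something like `write_tsv(word_frequencies(open("book.txt", encoding="utf-8").read()), open("out.tsv", "w", encoding="utf-8", newline=""))`, where both file names are placeholders.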
I was having a look at Reddit’s API today and stumbled across the option to stick .json on the end of most Reddit URLs to get the API data. I then wrote a relatively simple Python script that polls this page once a minute (in theory the page gets updated every thirty seconds, but in practice it seemed to update once a minute). The script keeps a dictionary keyed by each comment’s ID, with the comment’s content as the value, so a comment seen in more than one poll is only stored once. I also stripped out all HTML to ensure the text analysis would work properly. Each minute the script also saves the dictionary to a JSON file and the text to a text file.
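The polling loop described above could be sketched roughly as follows. This is a reconstruction under stated assumptions, not the script I actually ran: the URL is passed in rather than hard-coded, the listing is assumed to have the shape `data -> children -> data` with `id`, `body_html` and `body` fields, and every function name, file name and the User-Agent string are made up for the example.

```python
import html
import json
import time
import urllib.request
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(fragment):
    """Return the plain text of an HTML fragment, with whitespace collapsed."""
    parser = _TextExtractor()
    # Assumption: the API returns entity-escaped HTML, so unescape it first.
    parser.feed(html.unescape(fragment))
    parser.close()
    return " ".join("".join(parser.parts).split())


def merge_comments(store, listing):
    """Add unseen comments to store (ID -> text); return how many were new."""
    added = 0
    for child in listing.get("data", {}).get("children", []):
        data = child.get("data", {})
        comment_id = data.get("id")
        if comment_id is None or comment_id in store:
            continue  # skip malformed entries and comments already collected
        store[comment_id] = strip_html(data.get("body_html") or data.get("body", ""))
        added += 1
    return added


def fetch_listing(url):
    """Download and parse one .json page."""
    request = urllib.request.Request(url, headers={"User-Agent": "comment-collector-sketch"})
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


def save(store, json_path, text_path):
    """Write the dictionary as JSON and the comment text as plain text."""
    with open(json_path, "w", encoding="utf-8") as fh:
        json.dump(store, fh)
    with open(text_path, "w", encoding="utf-8") as fh:
        fh.write("\n".join(store.values()))


def poll(url, json_path="comments.json", text_path="comments.txt", interval=60):
    """Fetch, merge and save once per interval until interrupted."""
    store = {}
    while True:
        try:
            merge_comments(store, fetch_listing(url))
            save(store, json_path, text_path)
        except (OSError, ValueError) as exc:  # network failure or bad JSON
            print("poll failed:", exc)
        time.sleep(interval)
```

Keying the dictionary on the comment ID is what makes the overlap between polls harmless: fetching the same comment twice simply leaves the existing entry alone.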
I then ran Word Frequency over it to build up this CSV file (to view the full thing, make sure you press download; I then recommend importing it into Excel so you can sort it). I collected a total of 10,091 comments, which amounted to 289,713 words (an average of 28.71 words per comment), and found that the most common word was, unsurprisingly, the. Pronouns, connectives and simple verbs (is, see, etc.) make up the top 100; nouns, however, tend to creep in further down. Incredibly, 9gag is only used seven times across all of the comments.
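As a quick sanity check on those figures, the arithmetic can be redone in a couple of lines of Python. The helper names are invented for the example; the numbers are the ones quoted above.

```python
def average_words_per_comment(total_words, total_comments):
    """Mean number of words per comment."""
    return total_words / total_comments


def per_100k_words(occurrences, total_words):
    """How often a word appears per 100,000 words of text."""
    return occurrences / total_words * 100_000


print(round(average_words_per_comment(289_713, 10_091), 2))  # 28.71
print(round(per_100k_words(7, 289_713), 2))  # 9gag: roughly 2.4 per 100,000 words
```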