December 2, 2009

urban dictionary's greatest hits

I was curious about which words were the most popular on urbandictionary.com, so I scraped the popular section for each letter of the alphabet and came up with the following table, where the ranking is based on net votes (upvotes minus downvotes). I didn't want my blog associated with any of the words below, and I didn't want to give backlinks to Urban Dictionary, so here's an image of my results.

Here is the code I used to collect the data. It follows every link in the popular section for each letter and writes one dump of all entries per letter. The dumps can later be loaded to study the data.

require 'rubygems'
require 'hpricot'
require 'open-uri'
require 'entry'
require 'net/http'
require 'json'

def retrieve_votes(doc)
  # in-browser, votes are retrieved through an ajax call after the web page is loaded
  tools = doc.at("td.tools")
  # fall back to zero votes when the page has no vote cell, so sorting never sees nil
  return [0, 0] if tools.nil?

  # the numeric part of the cell's id identifies the definition
  uncacheable_id = tools[:id].scan(/\d+/)[0]

  json_response = Net::HTTP.post_form(URI.parse('http://www.urbandictionary.com/uncacheable.php'), {'ids' => uncacheable_id})
  thumbs = JSON.parse(json_response.body)['thumbs'][0]
  [thumbs['thumbs_up'], thumbs['thumbs_down']]
end

def retrieve_links letter
  anchors = []

  doc = Hpricot(open(letter))
  (doc/"table#columnist//tr").each do |row|
    (row/"td").each do |cell|
      (cell/"ul"/"li").each do |li|
        (li/"a").each do |anchor|
          anchors << anchor.get_attribute(:href)
        end
      end
    end
  end

  anchors
end

def build_entry(doc)
  up, down = retrieve_votes(doc)

  # take the first word and definition on the page, with fallbacks when missing
  word_cell = doc.at("td.word")
  definition_div = doc.at("div.definition")
  word = word_cell ? word_cell.to_plain_text : "no words found"
  definition = definition_div ? definition_div.to_plain_text : "no definitions found"

  Entry.new(word, definition, up, down)
end

"ABCDEFGHIJKLMNOPQRSTUVWXYZ".each_char do |letter|
  links = retrieve_links("http://www.urbandictionary.com/popular.php?character=#{letter}")
  entries = []
  links.each do |link|
    sleep 5
    puts "fetching #{link}"
    doc = Hpricot(open("http://www.urbandictionary.com" + link))
    entries << build_entry(doc)
  end

  # Marshal output is binary, so open the dump in binary mode
  File.open("letter#{letter}.dump", 'wb') { |file|
    file.write(Marshal.dump(entries))
  }
end
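
Since the dumps are plain Marshal files, reading one back is just the reverse of the write above. Here is a minimal sketch of that round trip. It assumes a stand-in class called SampleEntry and a helper called dump_and_reload (both hypothetical, not part of the scraper), and it writes to a temporary directory rather than to the real dump files. The class has to be defined before Marshal.load is called, otherwise the load fails.

```ruby
require 'tmpdir'

# Hypothetical stand-in for the Entry class, just for this sketch.
class SampleEntry
  attr_reader :word, :up, :down

  def initialize(word, up, down)
    @word = word
    @up = up
    @down = down
  end
end

# Writes the entries to a dump file in a temporary directory,
# reads them back, and returns the reloaded array.
def dump_and_reload(entries)
  Dir.mktmpdir do |dir|
    path = File.join(dir, "letterA.dump")

    # Marshal produces binary data, so use binary mode both ways.
    File.open(path, 'wb') { |file| file.write(Marshal.dump(entries)) }
    File.open(path, 'rb') { |file| Marshal.load(file) }
  end
end

# Made-up vote counts, for illustration only.
reloaded = dump_and_reload([SampleEntry.new("alpha", 120, 30), SampleEntry.new("beta", 45, 5)])
reloaded.each { |entry| puts "#{entry.word}: #{entry.up} up, #{entry.down} down" }
```

The reloaded objects are copies, not the originals, but they carry the same word and vote counts.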

The Entry class is defined like so.

class Entry
  attr_accessor :word, :definition, :up, :down
  def initialize word, definition, up, down
    @word = word
    @definition = definition
    @up = up
    @down = down
  end

  # orders entries by net votes (up minus down), highest first
  def <=>(other)
    (other.up - other.down) <=> (up - down)
  end

  def to_s
    "#{word} (#{up} up, #{down} down): #{definition}"
  end
end
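
To see what the comparison method does to an ordinary sort, here is a small sketch. It assumes a cut-down class called ScoredEntry (hypothetical, holding only a word and its vote counts) and made-up numbers. The comparison is reversed in the same way as in Entry, so the entry with the highest net score comes first.

```ruby
# Hypothetical cut-down version of Entry for this sketch.
class ScoredEntry
  include Comparable
  attr_reader :word, :up, :down

  def initialize(word, up, down)
    @word = word
    @up = up
    @down = down
  end

  # Net score: upvotes minus downvotes.
  def score
    up - down
  end

  # Reversed comparison so that a plain sort puts the highest score first.
  def <=>(other)
    other.score <=> score
  end
end

# Made-up vote counts, for illustration only.
samples = [
  ScoredEntry.new("first", 500, 450),  # score 50
  ScoredEntry.new("second", 300, 20),  # score 280
  ScoredEntry.new("third", 90, 10)     # score 80
]

samples.sort.each { |entry| puts "#{entry.word}: #{entry.score}" }
```

Note that "first" has the most upvotes but ends up last, because its downvotes eat almost all of them.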

To get some kind of greatest hits, I used the following script. I filtered out entries with more than 1000 downvotes to get rid of the most childish ones.

require 'cgi'
require 'entry'

entries = []
Dir['letter*.dump'].sort.each do |filename|
  # read in binary mode and close the file when done
  entries.concat(File.open(filename, 'rb') { |file| Marshal.load(file) })
end

entries.sort!
entries.reject! {|entry| entry.down > 1000}
html="<table>"
# ranks are 1-based while array indices are 0-based, so the top entry is entries[0]
entries.first(50).each_with_index do |entry, index|
  term = CGI.escape(entry.word)
  html += "<tr><td><b>#{index + 1}</b></td><td style='padding: 0 2em 0 2em;'>#{entry.word}</td><td><a href=\"http://www.urbandictionary.com/define.php?term=#{term}\">read more</a></td></tr>"
end
html+="</table>"

puts html
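
The filtering and ranking steps can also be pulled into a small function, which makes them easy to try on sample data. This is only a sketch under assumptions: top_rows, the RankedWord struct, and the default limits are names and values made up for illustration, and the vote counts are invented.

```ruby
require 'cgi'

# Hypothetical record type for this sketch; the real data uses Entry objects.
RankedWord = Struct.new(:word, :up, :down)

# Returns [rank, word, url] triples for the best entries by net votes,
# skipping anything with more than max_down downvotes.
def top_rows(entries, limit = 50, max_down = 1000)
  kept = entries.reject { |entry| entry.down > max_down }
  ranked = kept.sort_by { |entry| -(entry.up - entry.down) }
  ranked.first(limit).each_with_index.map do |entry, index|
    # escape the word so spaces and punctuation survive in the query string
    url = "http://www.urbandictionary.com/define.php?term=#{CGI.escape(entry.word)}"
    [index + 1, entry.word, url]
  end
end

# Made-up vote counts, for illustration only.
sample = [
  RankedWord.new("two words", 900, 100),
  RankedWord.new("noisy", 5000, 4000),
  RankedWord.new("quiet", 300, 10)
]

top_rows(sample, 2).each { |rank, word, url| puts "#{rank}. #{word} -> #{url}" }
```

Here "noisy" is dropped despite having the highest net score, because it exceeds the downvote limit. The ranks start at 1 for the first surviving entry.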