I was curious about which words are the most popular on urbandictionary.com, so I scraped the popular section for each letter of the alphabet and came up with the following table, where the ranking is based on net votes (upvotes minus downvotes). I didn't want my blog associated with any of the words below, and I didn't want to give backlinks to Urban Dictionary, so here's an image of my results.
Here is the code I used to collect the data. It follows every link in the popular section for each letter and writes a dump of all entries for that letter to a file. The dumps can be loaded later to study the data.
    require 'rubygems'
    require 'hpricot'
    require 'open-uri'
    require 'entry' # entry.rb, shown below; on Ruby 1.9.2+ use require_relative 'entry'
    require 'net/http'
    require 'json'

    # In the browser, votes are fetched through an AJAX call after the page
    # has loaded, so they have to be requested separately here.
    def retrieve_votes(doc)
      (doc/"td.tools").each do |tools|
        id = tools[:id]
        uncacheable_id = id.scan(/\d+/)[0]
        json_response = Net::HTTP.post_form(URI.parse('http://www.urbandictionary.com/uncacheable.php'), {'ids' => uncacheable_id})
        thumbs = JSON.parse(json_response.body)['thumbs'][0]
        # Only the first definition on the page is used.
        return [thumbs['thumbs_up'], thumbs['thumbs_down']]
      end
      [0, 0] # no vote widget found on the page
    end

    # Collects the href of every link in the popular-words table at the given URL.
    def retrieve_links(url)
      anchors = []
      doc = Hpricot(open(url))
      (doc/"table#columnist//tr").each do |row|
        (row/"td").each do |cell|
          (cell/"ul"/"li").each do |li|
            (li/"a").each do |anchor|
              anchors << anchor.get_attribute(:href)
            end
          end
        end
      end
      anchors
    end

    # Builds an Entry from the first word and first definition on the page.
    def build_entry(doc)
      word = "no words found"
      definition = "no definitions found"
      up, down = retrieve_votes(doc)
      (doc/"td.word").each do |wrd|
        word = wrd.to_plain_text
        break
      end
      (doc/"div.definition").each do |defined|
        definition = defined.to_plain_text
        break
      end
      Entry.new(word, definition, up, down)
    end

    "ABCDEFGHIJKLMNOPQRSTUVWXYZ".each_char do |letter|
      links = retrieve_links("http://www.urbandictionary.com/popular.php?character=#{letter}")
      entries = []
      links.each do |link|
        sleep 5 # be polite to the server
        puts "fetching #{link}"
        doc = Hpricot(open("http://www.urbandictionary.com" + link))
        entries << build_entry(doc)
      end
      # Marshal output is binary, so write the file in binary mode.
      File.open("letter#{letter}.dump", 'wb') { |file| file.write(Marshal.dump(entries)) }
    end
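The vote lookup is the least obvious part of the scraper, so here is a minimal offline sketch of just the parsing step. The payload shape is inferred from the keys that `retrieve_votes` reads; the helper name `parse_votes`, the sample body, and its numbers are made up for illustration and are not taken from the real site.

```ruby
require 'json'

# Returns [up, down] from a vote response body, or [0, 0] when the
# payload carries no vote record. (Hypothetical helper.)
def parse_votes(body)
  thumbs = (JSON.parse(body)['thumbs'] || []).first
  return [0, 0] unless thumbs
  [thumbs['thumbs_up'].to_i, thumbs['thumbs_down'].to_i]
end

# Made-up payload with the assumed shape:
# {"thumbs": [{"thumbs_up": ..., "thumbs_down": ...}]}
sample_body = '{"thumbs":[{"thumbs_up":1200,"thumbs_down":300}]}'

up, down = parse_votes(sample_body)
puts "#{up} up, #{down} down"
```

Keeping the parsing separate from the HTTP request makes it possible to check the fallback for missing votes without hitting the server.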
The Entry class is defined like so.
    class Entry
      attr_accessor :word, :definition, :up, :down

      def initialize(word, definition, up, down)
        @word = word
        @definition = definition
        @up = up
        @down = down
      end

      # Orders entries by net score (upvotes minus downvotes), highest first.
      def <=>(other)
        (other.up - other.down) <=> (up - down)
      end

      def to_s
        "#{word} (#{up} up, #{down} down): #{definition}"
      end
    end
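To make the ordering concrete, here is a minimal sketch of the comparison that `Entry#<=>` performs, using plain hashes instead of the class itself. The helper names `net_score` and `rank`, and all sample words and counts, are made up for illustration.

```ruby
# Net score is upvotes minus downvotes.
def net_score(entry)
  entry[:up] - entry[:down]
end

# Sorts entries by net score, highest first (the same order Entry#<=> gives).
def rank(entries)
  entries.sort { |a, b| net_score(b) <=> net_score(a) }
end

# Placeholder data, not real scrape results.
sample = [
  { :word => 'alpha', :up => 900, :down => 850 },
  { :word => 'beta',  :up => 400, :down => 10 },
  { :word => 'gamma', :up => 700, :down => 500 }
]

rank(sample).each do |entry|
  puts "#{entry[:word]}: #{net_score(entry)}"
end
```

Note that a word with many upvotes can still rank low when it also attracts many downvotes, which is what separates this ranking from a plain upvote count.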
To get some kind of greatest hits, I used the following script. I had to filter out entries with more than 1,000 downvotes to get rid of the most childish entries.
    require 'cgi'
    require 'entry'

    entries = []
    Dir['letter*.dump'].sort.each do |filename|
      # Read in binary mode and close the file when done.
      entries.concat(Marshal.load(File.open(filename, 'rb') { |file| file.read }))
    end

    entries.sort!
    entries.reject! { |entry| entry.down > 1000 }

    html = "<table>"
    # Ranks start at 1 while array indices start at 0, so the top entry is entries[0].
    entries.first(50).each_with_index do |entry, index|
      html += "<tr><td><b>#{index + 1}</b></td><td style='padding: 0 2em 0 2em;'>#{CGI.escapeHTML(entry.word)}</td><td><a href=\"http://www.urbandictionary.com/define.php?term=#{CGI.escape(entry.word)}\">read more</a></td></tr>"
    end
    html += "</table>"
    puts html
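The whole pipeline depends on the dump files surviving a round trip through `Marshal`. Here is a minimal sketch of that round trip, using a temporary directory and a placeholder record; the helper names `dump_records` and `load_records` are made up for illustration. `Marshal.load` should only be run on files you wrote yourself, since loading untrusted data is unsafe.

```ruby
require 'tmpdir'

# Serializes records to path. Marshal output is binary, hence 'wb'.
def dump_records(path, records)
  File.open(path, 'wb') { |file| file.write(Marshal.dump(records)) }
end

# Reads the file back in binary mode and rebuilds the original objects.
def load_records(path)
  Marshal.load(File.binread(path))
end

Dir.mktmpdir do |dir|
  path = File.join(dir, 'letterA.dump')
  records = [{ :word => 'alpha', :up => 10, :down => 2 }] # placeholder data
  dump_records(path, records)
  puts load_records(path) == records
end
```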
