I was thinking idle thoughts about the community at news.yc and decided to pull out a few statistics. I found the following page with some stats on it, but I wanted more. The following was deduced from a sample of about 130 users, including the 50 leaders. This is a small sample, and I can only access 180 posts per user, so it is not perfect, but it is still interesting. Note: for a few of these statistics, I only considered users with more than 8 posts.
- Total number of posts: 8317
- Total number of points on those posts: 42458
- Minimum points per post ratio: 0.72
- Maximum points per post ratio: 20.2
- Mean points per post ratio: 5.75
- Std deviation on the ratio: 3.22
- Overall points per post ratio: 5.10
- Mean number of posts: 62.00
- Std deviation on the number of posts: 68.36
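The mean ratio (5.75) and the overall ratio (5.10) differ because they are not the same quantity: the first averages each user's own ratio, the second divides all points by all posts, so prolific users weigh more. A minimal sketch of the difference, using made-up rows (the names and numbers below are hypothetical, not taken from the sample):

```ruby
# Hypothetical rows: [user, total_points, total_posts]
rows = [
  ["alice", 100, 5],   # ratio 20.0
  ["bob",   180, 90],  # ratio 2.0
  ["carol", 60,  20]   # ratio 3.0
]

ratios = rows.map { |_, points, posts| points.to_f / posts }

# Mean of the per-user ratios: every user counts equally.
mean_ratio = ratios.sum / ratios.length

# Overall ratio: all points over all posts, so heavy posters dominate.
overall_ratio = rows.sum { |r| r[1] }.to_f / rows.sum { |r| r[2] }

puts format("Mean ratio: %.2f", mean_ratio)
puts format("Overall ratio: %.2f", overall_ratio)
```

In this toy data the mean ratio is pulled up by one user with few, well-rated posts, while the overall ratio stays close to the ratio of the user who posts the most.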
When a user has a high points per post ratio, they are probably representative of the community. All kinds of guidelines could be derived from this: if a user's points per post ratio is far below the average, they are probably posting too much, or are not a strong link in the community as a submitter. If the standard deviation of the ratio is low, then the community is rather cohesive: lots of people interact without leading or counting on others. If the standard deviation is high, then the users are possibly separated into two groups, one reading the posts of the other. These kinds of statistics might be useful as an early index of community 'dilution'.
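One way such a guideline could be sketched is to flag users whose ratio falls well below the mean. The cut-off used here, one standard deviation below the mean, is an assumption chosen for illustration, and the user names and ratios are hypothetical; only the mean and standard deviation come from the statistics above.

```ruby
MEAN_RATIO = 5.75 # mean points per post ratio reported above
STD_RATIO  = 3.22 # standard deviation of the ratio reported above

# Hypothetical users and ratios, for illustration only.
ratios = { "user_a" => 12.4, "user_b" => 5.1, "user_c" => 1.9 }

# Assumed cut-off: one standard deviation below the mean.
threshold = MEAN_RATIO - STD_RATIO

low = ratios.select { |_, ratio| ratio < threshold }.keys
puts "Below #{format('%.2f', threshold)}: #{low.join(', ')}"
```

A stricter or looser cut-off would flag a different set of users, so the threshold should be treated as a tuning knob rather than a rule.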
Here's a graph showing the points per post against the number of posts.

We can see that it's very hard to maintain a good points per post ratio once the number of posts reaches about 40. Also, if a newcomer wants to know what kind of users are on news YC, we should point them to paul, palish, sharpshoot, or pg.
Here's the same graph but only showing the leaders.

Zoom on the users with 180 posts:

Zoom on the bottom left cluster for all users:

Read the comments on this at news YC.
Here is the Ruby code that I used. You can tweak it to find out your own points per post ratio.
go.rb
#!/usr/bin/env ruby
require 'open-uri'
HOSTNAME = "http://news.ycombinator.com/"
LOG = "... "
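# Walk a user's submission pages, following the "More" link from page to
# page, and accumulate the points and the number of posts.
def compute(user, url, total_points, total_posts)
  more_url = nil
  # URI.open is required on recent Rubies; Kernel#open no longer fetches URLs.
  URI.open(HOSTNAME + url) {|doc|
    # String#each was removed in Ruby 1.9; each_line iterates line by line.
    doc.read.each_line {|line|
      more_url = find_more_url line if more_url.nil?
      line.scan(/(\d+) points?<\/span>/) do |match|
        total_points += (match[0].to_i - 1)
        total_posts += 1
      end
    }
  }
  begin
    sleep(5) # pause between page fetches
    puts LOG + "page done"
    $stdout.flush
    total_points, total_posts = compute(user, more_url, total_points, total_posts) unless more_url.nil?
  rescue
    puts "caught #{$!}"
    puts "stopped after #{total_posts} posts for user #{user}"
  end
  return [total_points, total_posts]
end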
# Print one result line per user: name, points, posts, ratio.
def go users
  users.each do |user|
    puts LOG + "processing #{user}"
    $stdout.flush
    begin
      total_points, total_posts = compute(user, "submitted?id=#{user}", 0, 0)
      ratio = total_points.to_f / total_posts
      puts "#{user} #{total_points} #{total_posts} #{sprintf("%.2f", ratio)}"
    rescue
      puts "caught #{$!}"
      puts "user #{user} not processed"
    end
  end
end
# Return the href of the "More" pagination link on this line, or nil.
def find_more_url line
  if line =~ /<a href="([^"]*?)" rel="nofollow">More<\/a>/
    $1
  else
    nil
  end
end
# Read a list of user names, one per line, from a file.
def file2list file
  users = []
  File.open(file) {|f|
    f.each {|line|
      users.push line.chomp unless line.nil?
    }
  }
  users
end
# Collect user names from a listing page, following "More" links.
def extract_names users, url=""
  more_url = nil
  # URI.open is required on recent Rubies; Kernel#open no longer fetches URLs.
  URI.open(HOSTNAME + url) {|doc|
    puts 'transferring data'
    doc.readlines.each {|line|
      more_url = find_more_url line if more_url.nil?
      line.scan(/<a href="user\?id=([^"]*)">/) do |match|
        users.push match[0] unless users.include? match[0]
      end
    }
  }
  users = extract_names(users, more_url) unless more_url.nil?
  return users
end
go(file2list(ARGV[0]))
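The core of go.rb is the scan over each line for scores, subtracting one from every score (presumably to discount the point each post starts with; that reading is an assumption). A minimal sketch of that step on a made-up line follows. The HTML fragment is hypothetical and the real page markup may differ.

```ruby
# Hypothetical HTML fragment; the real page markup may differ.
line = '<span id="score_1">12 points</span> by someone ' \
       '<span id="score_2">1 point</span> by someone else'

# Same idea as in compute: capture the number before "point" or "points",
# then subtract one from each score.
points = line.scan(/(\d+) points?<\/span>/).map { |match| match[0].to_i - 1 }

p points
puts "posts: #{points.length}, total points: #{points.sum}"
```

Running this kind of check against a saved copy of a page is a cheap way to confirm the pattern still matches before crawling many users.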
extract.rb
# Print the population standard deviation of column `index` around `mean`.
def std_deviation data, mean, index
  sq_errors = 0.0
  data.each {|row|
    sq_errors += (row[index] - mean)**2
  }
  puts "Std deviation: #{Math::sqrt(sq_errors/data.length)}"
end
# Print total, min, max, mean and standard deviation of the post counts.
def num_posts_metrics data
  min = 99999
  max = 0
  total_posts = 0
  data.each {|row|
    total_posts += row[2]
    min = row[2] if row[2] < min
    max = row[2] if row[2] > max
  }
  puts "Total posts: #{total_posts}"
  puts "Min # posts: #{min}"
  puts "Max # posts: #{max}"
  # Float division: dividing two integers would truncate the mean.
  mean_num_posts = total_posts.to_f / data.length
  puts "Mean # posts: #{sprintf("%.2f", mean_num_posts)}"
  std_deviation(data, mean_num_posts, 2)
end
# Print totals plus min, max, mean, overall and standard deviation of the
# points per post ratio.
def ratio_metrics data
  min = 9999
  max = 0
  total_posts = 0
  total_points = 0
  sum_ratio = 0.0
  data.each {|row|
    total_posts += row[2]
    total_points += row[1]
    min = row[3] if row[3] < min
    max = row[3] if row[3] > max
    sum_ratio += row[3]
  }
  puts "Total posts: #{total_posts}"
  puts "Total points: #{total_points}"
  puts "Min ratio: #{min}"
  puts "Max ratio: #{max}"
  mean_ratio = sum_ratio/data.length
  puts "Mean ratio: #{sprintf("%.2f", mean_ratio)}"
  overall_ratio = total_points.to_f / total_posts
  puts "Overall ratio: #{sprintf("%.2f", overall_ratio)}"
  std_deviation(data, mean_ratio, 3)
end
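# Each input line: user, total points, total posts, ratio.
data = []
File.open(ARGV[0]){|file|
  file.each do |line|
    data << line.chomp.split
  end
}
data = data.map {|row|
  [row[0], row[1].to_i, row[2].to_i, row[3].to_f]
}
# Ratio statistics only consider users with more than 8 posts.
ratio_metrics(data.select {|row|
  row[2] > 8
})
puts
num_posts_metrics data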
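extract.rb expects every input line to hold four whitespace-separated fields, which is the shape of the result line go.rb prints. go.rb also prints log and error lines to the same stream, so its output needs filtering first. Here is a minimal sketch of one way to do that; the captured output, the user names, and the `result_line?` helper are all hypothetical.

```ruby
# Hypothetical captured output of go.rb; names and numbers are made up.
output = <<~OUT
  ... processing alice
  ... page done
  alice 100 5 20.00
  ... processing bob
  user bob not processed
OUT

# Assumed rule: a result line has exactly four fields, with integer
# points and posts. Log and error lines fail this check.
def result_line?(fields)
  fields.length == 4 &&
    fields[1].match?(/\A-?\d+\z/) &&
    fields[2].match?(/\A\d+\z/)
end

rows = output.each_line.map(&:split).select { |fields| result_line?(fields) }

# Same conversion as extract.rb: name, points, posts, ratio.
data = rows.map { |r| [r[0], r[1].to_i, r[2].to_i, r[3].to_f] }
p data
```

Writing the filtered lines to a file and passing that file to extract.rb keeps the two scripts decoupled, at the cost of one extra step.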