May 13, 2014

Stack Overflow and its Discontents

LIKE many others, I’ve come to rely on Stack Overflow (SO) as an amazing repository of technical knowledge. How did it become so comprehensive? My guess is it was the ‘many eyes make bugs shallow’ principle: many contributors building something together, much like Wikipedia. SO is nearly always great when you’re searching for an answer, but not so great when you’re trying to contribute and build a profile for yourself among the other members. I’ll explain why, but first let me clarify something as I see it: SO may look like a simple Q&A site, but it’s really a permanent knowledge base with the knowledge organised in the form of questions and answers. That’s important: the Q&A format is just that, a knowledge delivery mechanism. With that in mind, we can examine how it shapes people’s actions on the site.

Lots of people have discussed why SO punishes or rewards contributors the way it does, but one widely-held belief is that there is a subset of users (usually the moderators and other high-reputation users) intent on seeing SO’s core mission carried out: that the site becomes a repository of useful and generally applicable questions and answers. To that end, this subset performs triage on the questions that are asked: they make judgment calls on whether or not the questions are good ones. When you’re a beginner, there are no bad questions. But when you’re building a long-lasting repository, there are bad questions.

Generally speaking, bad questions on SO could be any of:

  • Duplicate of an already-answered question
  • Not phrased as a question
  • Not clear what the asker wants to find out
  • Asker shows no signs of having done any research or made any attempt to solve the problem
  • Question is about a specific corner case which can be easily solved if asker understood a more general principle
  • Code examples are incomplete and can’t be compiled; error messages are not quoted verbatim but only vaguely described
  • And at the other end of the spectrum: many screenfuls of code or log output pasted verbatim, in their entirety, without any attempt to zoom in on the source of the issue.

Any of these questions will annoy the mods, because they’ll invariably get answered (people are incentivised to answer no matter how bad the question), and then those bad questions and their answers will raise the noise level and make it difficult for people trying to find answers to the genuinely good questions. (Good search features, and even Google, can only take you so far.)

So with this in mind, we can start to understand the mods’ behaviour of seemingly penalising users, usually newer ones who haven’t absorbed the culture yet and are treating SO as a source of answers to one-off questions. It’s not: the questions are meant to become part of a knowledge base on the subject and potentially live forever. Under these conditions, it’s very difficult to justify questions with the above bad qualities, especially if we can guide the new users towards improving their question quality (and lots of people are trying to do this).

So, remember, the mods’ goal is to build a generally-useful knowledge base. With this as a given, the questions (subjects of knowledge) that are of low quality will tend to get weeded out, either by being downvoted or closed. The people who’re doing the downvoting and closing don’t have the end goal of punishing the askers; their goal is to weed out the bad questions. That the askers lose rep points is a side effect of the voting and rep system. Which is fair: if my peers judge me as not contributing good material, then I should have less cred. But the primary goal on everyone’s mind is having good material on the site, not punishing me.

Having said all that, I want to address the fact that votes on each question and answer are essentially on an infinite scale going in both directions. So a given question or answer can potentially be upvoted or downvoted many times over, and every one of those votes affects the poster’s rep, even though they all come from a single posting. That’s skewed, because users of the site are more likely to see higher-voted questions and answers than lower-voted ones. That’s simply how the search facilities work by default: understandably and helpfully, users get to see highly-regarded content before they see the dregs of the site. But this means that highly-upvoted content will always get more exposure, and therefore continuously be exposed to more upvotes, while downvoted content will get less exposure and fewer downvotes. This skew disproportionately rewards the experts and inflates their rep.
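
To see how strong this feedback loop can be, here’s a toy simulation (my own caricature, not SO’s actual ranking): two posts of identical quality, where each viewer is shown, and upvotes, a post with probability proportional to its current score.

    import random

    def simulate(rounds=10_000, seed=None):
        """Toy model of the exposure feedback loop: two equally good
        posts, but a viewer sees (and upvotes) each post with
        probability proportional to its current score."""
        rng = random.Random(seed)
        scores = [1, 1]  # both posts start with one upvote
        for _ in range(rounds):
            total = scores[0] + scores[1]
            seen = 0 if rng.random() < scores[0] / total else 1
            scores[seen] += 1  # exposure converts into another upvote
        return scores

    print(simulate())  # identical quality, yet the scores often diverge widely

It’s only a Pólya-urn caricature, but it shows how, once exposure tracks score, early luck compounds into score differences that say little about relative quality.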

Let’s ignore the downvoted content for a second and think about the upvoted content: it is continuously getting upvoted, just for being there. Meanwhile, the person who posted that content could very well have not made a single contribution after that (taking this to the logical extreme). That’s an unfair advantage, and thus a bad indicator of that person’s cred in the community.

It’s clear at this point that the SO rep system is not going to be recalibrated yet again (barring some really drastic decision) to fix this bias, so let’s imagine for a second what a rep system would look like that actually did fix it. My idea is that such a rep system would reward (or punish) a user for a single contribution by a single point only, determined by the sign of the contribution’s net vote count (upvotes minus downvotes). So, if a post had +1 and -1, the reward to the contributor is nothing. If the post has +100 and -12, the reward to the contributor is +1. And if the post has +3 and -5, the reward is -1. If there’s a tie, the next person who comes along has the power to break that tie and thus ‘reward’ or ‘punish’ the contributor. Of course, the situation will rarely be a tie: usually there’s pretty good consensus about whether a contribution is good or not (to verify this, simply log in to Stack Overflow and click on the score of any question or answer on the question’s page–it’ll show you the upvotes and downvotes separately).
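
In code, the rule is just a sign function. A minimal sketch (the function name and shape are mine):

    def post_reward(upvotes, downvotes):
        """Net effect of one post on its author's profile rep: +1 if
        consensus is positive, -1 if negative, 0 if tied (the next
        voter gets to break the tie)."""
        net = upvotes - downvotes
        return (net > 0) - (net < 0)  # sign of the net vote count

    # The examples from above:
    assert post_reward(1, 1) == 0      # +1/-1: no reward
    assert post_reward(100, 12) == 1   # +100/-12: reward is +1
    assert post_reward(3, 5) == -1     # +3/-5: reward is -1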

The sum of the net effects on reputation from each of a contributor’s posts shouldn’t become the final measure of their profile rep, though. A plain sum doesn’t give the community an easy way to tell apart a person with +100/-99 rep (a polarising figure) from someone with +1/0 (a beginner, an almost-unknown): both sum to +1. Instead, users should be able to judge contributions as +1 (helpful), -1 (not helpful), or 0 (undecided). And each net reputation change from a contribution should form part of a triune measure of rep: percentage helpful, percentage not helpful, and percentage undecided.
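
Concretely, rolling each post’s net outcome into the three percentages might look like this (the data model here is my assumption):

    from collections import Counter

    def profile_rep(post_votes):
        """Fold each post's net outcome (+1 helpful, -1 not helpful,
        0 undecided) into the three-part percentage measure.
        post_votes is a list of (upvotes, downvotes) pairs, one per post."""
        outcomes = Counter()
        for up, down in post_votes:
            net = up - down
            outcomes[(net > 0) - (net < 0)] += 1
        total = len(post_votes)
        return {
            'helpful':     100.0 * outcomes[1]  / total,
            'not helpful': 100.0 * outcomes[-1] / total,
            'undecided':   100.0 * outcomes[0]  / total,
        }

    # Three posts: one clearly helpful, one clearly not, one tied:
    print(profile_rep([(10, 2), (0, 5), (3, 3)]))
    # {'helpful': 33.3..., 'not helpful': 33.3..., 'undecided': 33.3...}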

The three parts of the measure are equally important here. The vote on the merit of a contribution is nothing but a survey; and any statistician will tell you that in a survey question you must offer the respondent the choice of a ‘Don’t know’ answer. In this case that’s the ‘undecided’ option. If we don’t offer it, we lose critical information about the response to the contribution: we can’t tell apart people who simply didn’t vote from those who tried to express their uncertainty about the question or answer by not voting.

This way, everyone immediately sees how often a particular user contributes valuable content, as opposed to unhelpful or dubious content. And the primary measure of rep is therefore not something that can grow in an unbounded way: the most anyone can ever achieve is 100% helpfulness. That, too, I think should be quite rare. The best contributors will naturally tend to have higher helpfulness percentages, but it won’t be so much a rat race as a level marker within the community, tempered by their levels of ‘unhelpfulness’ or people’s indecision about their contributions.

So much for user profile rep. I do think the scores on questions and answers should behave in the traditional SO way: every vote should be counted individually, instead of (as with user profile rep) being collapsed into a single net positive/negative/undecided outcome. The reason for the difference is that the votes are the measure of each contribution’s merit; if (for example) you have two similar contributions, their vote scores should be a good way to judge between them. Again, the vote score should be presented as a three-fold percentage measure of helpfulness, unhelpfulness, and undecidedness (with raw vote counts available on, say, mouseover). This keeps it consistent with user profile rep and puts an upper bound on the score shown on each question or answer. That’s a good thing because, most of the time, unbounded scores are simply extra information a site user won’t really process. The site itself can easily search and present content in order of most highly-voted; the reader just needs to judge merit on a fixed scale. Anything extra is just cognitive burden.
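
A sketch of what the displayed score might look like (the format and names are my assumptions):

    def content_score(helpful, not_helpful, undecided):
        """Display a post's score as bounded percentages; the raw
        counts stay available (e.g. on mouseover)."""
        total = helpful + not_helpful + undecided
        if total == 0:
            return 'no votes yet'
        pct = lambda n: round(100.0 * n / total)
        return ('%d%% helpful, %d%% unhelpful, %d%% undecided (%d votes)'
                % (pct(helpful), pct(not_helpful), pct(undecided), total))

    print(content_score(100, 12, 8))
    # 83% helpful, 10% unhelpful, 7% undecided (120 votes)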

So to recap, if we’re to implement an unbiased user profile rep, we need to count the net helpfulness of each contribution once only. But for content scoring, we can count each vote individually. And to present everything in a uniform way and with low cognitive burden, we should show all scores as a three-fold measure of percentage helpfulness, unhelpfulness, and undecidedness.

Once we have this system in place, we can start doing interesting things, like automatically closing questions which sink below a threshold of, say, 10% helpfulness (they’ll start out with 100% helpfulness because the original asker’s vote is automatically counted as helpful–otherwise they wouldn’t have asked). And we can do away with reputation loss from downvoting, since a downvote will usually have no effect on the recipient’s profile rep, and only one unit of negative effect on the contribution being downvoted.
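
The closing rule is simple to state precisely. A sketch, with the threshold and the asker’s implicit vote handled as described above:

    def should_close(helpful, not_helpful, undecided, threshold=10.0):
        """Auto-close when helpfulness sinks below the threshold. The
        asker's own vote counts as helpful, so a fresh question starts
        at 100% and can't be closed before anyone else votes."""
        total = 1 + helpful + not_helpful + undecided  # +1 for the asker
        return 100.0 * (1 + helpful) / total < threshold

    assert not should_close(0, 0, 0)   # fresh question: 100% helpful
    assert should_close(0, 20, 0)      # 1 of 21 votes helpful (~4.8%)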

Achieving an unbiased measure of rep is tricky. But I think we can do better than SO’s current system by rebalancing the ‘rewards’ and ‘punishments’, and bounding the primary measures between 0 and 100% helpfulness so that we don’t end up with another Jon Skeet on our hands (I kid–Jon Skeet is great :-)