Ideally, quantitative measurements and data visualization help us make sense of and see patterns in massive amounts of data that would take us thousands of hours to go through piece by piece. Like a telescope or microscope, these tools can be more than just ways to test hypotheses; they can be used to see part of the world in a new way.
There are many such tools that researchers use to make sense of the billions of online utterances that make up message boards, chats, comment sections, and Reddit. Increasingly sophisticated natural language processing techniques, such as Latent Dirichlet Allocation (LDA) and Word2Vec, are automating this process, making it easier to scale up analyses of meaning in text and allowing us to see distilled clusters of the topics, themes, or tones used by certain groups at certain times.
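For instance, here is a minimal sketch (not the pipeline behind this post) of how LDA can surface topics from a pile of comments, using scikit-learn and a placeholder list standing in for a real corpus:

```python
# A minimal topic-modeling sketch with LDA. `comments` stands in for a real
# corpus of comment strings collected from a community.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

comments = [
    "the mods removed my post again",
    "great post, thanks for sharing",
    "this thread is full of bad takes",
]  # placeholder data

# Turn comments into a word-count matrix, dropping very common English words.
vectorizer = CountVectorizer(stop_words="english")
counts = vectorizer.fit_transform(comments)

# Fit a small topic model; the number of topics is a judgment call.
lda = LatentDirichletAllocation(n_components=3, random_state=0)
lda.fit(counts)

# Show the top words for each topic.
words = vectorizer.get_feature_names_out()
for k, topic in enumerate(lda.components_):
    top = [words[i] for i in topic.argsort()[::-1][:5]]
    print(f"topic {k}: {', '.join(top)}")
```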
Still, it can be useful, particularly in the earlier stages of studying online communities, to see the comments as they are and as users encounter them. One approach would be simply to visit the online community in question and start reading comments or posts, but this approach has limitations. We typically see only what's on the surface of the community's discourse - the most popular or most recent posts and comments, or those that have been tailored to our interests. If we want to understand the dynamics of entire communities, we need to get underneath that surface.
TAKING UPVOTES INTO ACCOUNT
One increasingly common feature of online discourse is some variation on the "upvote" - a metric used to sort content based on social approval. The more upvotes a comment or post gets, the more visible it becomes to other users. Many lament this feature's effects on the quality of online discourse and the wellbeing of users, which implies that it matters how many upvotes a comment or post receives - in terms of who is being heard and how they feel. And yet many large-scale analyses of online discourse ignore this information, treating a comment that received 0 upvotes the same as one that received 1,000. To better understand the place of a comment within an online community's discourse, we need to take this information into account.
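As a toy illustration of why this matters (a sketch with hypothetical column names, not the analysis behind this post), weighting even something as simple as word counts by score changes what the "typical" comment appears to say:

```python
# A toy sketch of taking vote scores into account: unweighted vs. score-weighted
# word counts over a hypothetical DataFrame with 'body' and 'score' columns.
from collections import Counter
import pandas as pd

df = pd.DataFrame({
    "body": ["great point", "this is wrong", "great thread"],
    "score": [1000, 0, 12],
})

unweighted, weighted = Counter(), Counter()
for body, score in zip(df["body"], df["score"]):
    for word in body.lower().split():
        unweighted[word] += 1            # every comment counts equally
        weighted[word] += max(score, 0)  # highly upvoted comments count more

print("unweighted:", unweighted.most_common(3))
print("score-weighted:", weighted.most_common(3))
```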
TYPES OF USERS
Aside from getting a sense of what is being said in an online community, it's also useful to have some idea of who is speaking. In anonymous or pseudonymous communities, we have limited information about users' individual or collective identities. One thing we can know is how experienced a participant is - are they an infrequent contributor, a dedicated regular, or somewhere in between? Making such distinctions helps us understand influence, diversity, and power as they play out in online communities. Such participatory communities are often advertised as (or hoped to be) free and open forums. While those concerned with whether online communities truly allow free and inclusive expression tend to focus on censorship by moderators and owners, a subtler and - I would argue - more powerful influence operates regardless of moderation policies or ownership: voting. Do a handful of dedicated, experienced users create the majority of a community's highly visible discourse? Do less experienced users have any hope of reaching a larger audience? And do these two groups - the dedicated contributor and the novice - tend to say different things, or express themselves in different ways?
VISUALIZING VOTE COUNT AND COMMENTER EXPERIENCE LEVELS
By graphing comments based on vote score (the number of upvotes minus the number of downvotes) and on how many times the commenter commented in a given subreddit during a given time period, we can quickly scan a large number of comments and begin to answer these questions. We can easily see comments that would otherwise have remained hidden "under the surface" - comments that failed to resonate with voting users - and get a sense of why they didn't resonate. Was it because they expressed an unpopular opinion, because of the way they expressed it (i.e., tone), or because they were being hostile toward another user? We can also see whether the comments that get high scores are really all that different from ones that get low scores, helping us determine whether comments become highly visible for some inherent quality (their wit, their brevity, or their ability to articulate inchoate righteous anger) or through some idiosyncratic luck of the draw (a few early voters happening to see the comment at the right time, setting into motion a snowball effect).
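The underlying recipe is simple. Here is a rough sketch (not the exact pipeline behind the embedded graphs), assuming a pandas DataFrame of one subreddit's comments for one time window, with hypothetical 'author', 'score', and 'body' columns:

```python
# A rough sketch of plotting comments by vote score and commenter experience.
# Assumes a DataFrame of one subreddit's comments for one time window, with
# hypothetical columns: 'author', 'score' (ups minus downs), 'body' (text).
import pandas as pd
import plotly.express as px

df = pd.DataFrame({
    "author": ["a", "a", "b", "c", "c", "c"],
    "score": [5, -2, 120, 1, 3, -7],
    "body": ["..."] * 6,  # placeholder comment text
})

# Commenter experience: how many comments each author left in this window.
df["n_comments"] = df.groupby("author")["author"].transform("size")

# Score on the x-axis (upvoted comments to the right, downvoted to the left),
# experience on the y-axis; hovering a dot shows the comment text.
fig = px.scatter(
    df,
    x="score",
    y="n_comments",
    hover_data=["body"],
    opacity=0.5,
    labels={"score": "comment score (ups - downs)",
            "n_comments": "commenter's comment count in window"},
)
fig.show()
```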
Below are graphs for several subreddits over several windows of time. By hovering over the dots, you can see which comments the community rewards with upvotes (on the right side of the graph) and which ones it punishes with downvotes (on the left side). If you draw a box around some of the dots, you can 'zoom in' on them (by selecting 'Keep Only') or exclude them from the graph (by selecting 'Exclude'), which is particularly useful when outliers stretch the graph's axes and make it hard to differentiate among clusters of comments.
Right away, we can see that most of the comments in this subreddit during this time period come from less experienced commenters (under 50 comments), and that comments are more likely to receive a positive score than a negative one. Both of these patterns seem to hold across subreddits and across time. We can also see that, collectively, the numerous less experienced commenters generated more high-scoring comments than the rarer, more experienced commenters did. That suggests that the discourse is not being controlled by a handful of power users.
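That impression can be checked directly. Continuing the sketch above, and picking arbitrary cutoffs (a "high-scoring" comment as one with a score of 100 or more, and the same under-50-comments experience cutoff):

```python
# Continuing the sketch above: how many high-scoring comments (score >= 100,
# an arbitrary threshold) come from less vs. more experienced commenters.
high = df[df["score"] >= 100]
counts = high.groupby(high["n_comments"] < 50).size()
print(counts.rename(index={True: "under 50 comments", False: "50+ comments"}))
```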
There are more dots in the lower left than in the upper left, which makes sense: less experienced commenters may wander into the community, post a comment that doesn't resonate, and, sensing that their opinion or tone isn't appreciated, never bother to contribute again. Less obvious is why an experienced commenter's comments sometimes end up with negative scores. Hovering over some of those dots gives us a sense of what happened.
Along with questions about how subreddits differ in these social dynamics of discourse, we might also be interested in how those dynamics change as communities grow and/or age. When hundreds of thousands of new users start voting on comments, how does it change the discourse? Do communities become more egalitarian, open-minded, and kinder as they age, or does the reverse happen?
Our ability to understand the context in which a comment occurred is limited - some comments appear to be responses to other comments or to a post, and without that context, it's hard to know exactly what the commenter meant or what the voters were responding to. There are also some comments that appear blank when hovered over. Since I excised deleted comments before visualization (see note below), it's hard to know what's going on in these cases. Despite these limitations, I hope these graphs are an early version of a telescope for looking at large-scale online social dynamics, one that will help us answer these questions.
Below are several interactive graphs that allow you to select from two stages of these subreddits' development. Why these particular subreddits? Well, they were chosen by members of our research team, who were encouraged to select subreddits that piqued their interest! And it just so happened that this created some nice variance in size and modes of expression for us to observe.
We look forward to aiming our "telescope" toward other parts of the online universe in the near future!
Note: I eliminated comments that were deleted by the commenter or removed by moderators. Typically, this amounted to fewer than 5% of the comments.
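(For reference, that filtering step looks something like the sketch below, assuming the usual convention in Reddit data dumps that such comments have a body of "[deleted]" or "[removed]".)

```python
# A sketch of the filtering step, assuming the usual Reddit-dump convention
# that deleted/removed comments have a body of "[deleted]" or "[removed]".
import pandas as pd

df = pd.DataFrame({
    "body": ["a real comment", "[deleted]", "[removed]", "another real comment"],
    "score": [3, 1, -2, 15],
})

before = len(df)
df = df[~df["body"].isin(["[deleted]", "[removed]"])]
dropped = before - len(df)
print(f"dropped {dropped} of {before} comments ({dropped / before:.0%})")
```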
Thanks to Jason Baumgartner for making Reddit data available through the PushShift database, to Jack Horvath who designed the tool we used to obtain this data (RedDCAT), and to the rest of the ARRG team at the University of Alabama.