🎃 Hacktoberfest

To commemorate the tenth anniversary of Hacktoberfest and acknowledge the contributors, repos, and organisations that have contributed to the success of this event, we recently launched a set of Global Hacktoberfest Leaderboards 🌐.

Creating leaderboards for Hacktoberfest posed a few challenges. With such a large number of participants, it was crucial to ensure that the leaderboards accurately represented the pool of participants. At the same time, it was important to guarantee fairness by considering only valid contributions in the rankings. To achieve this, we had to design a solution that meets both requirements.

1️⃣ Excluding “bad repos”

In recent years, it has become increasingly common to see "bad repos" that encourage contributions solely for the purpose of inflating PR counts or vanity metrics. While there is no formal definition of what constitutes a "bad repo", the exclusion rule mentioned on the official webpage (https://hacktoberfest.com/participation/) serves as a useful guideline.

To filter out the "bad repos", @Alex B trained a cute machine learning model to automatically identify offending repositories based on the guideline above. In general, we intentionally exclude contributions where the author of the pull request is the primary beneficiary of the contribution. This includes, but is not limited to, trivial or learning-based repos (e.g. implementing a random number guessing game), repos with uninformative or nonexistent readmes, arbitrary curation of code, collections of text (e.g. short stories), algorithms and data structures, and competitive coding.

It is important to note that only DigitalOcean is in a position to determine whether a repository is officially considered bad or not. If you believe that your repository has been unfairly excluded, please contact us at help@quine.sh.

2️⃣ Approximating fairness in the rankings

Once the “bad repos” are excluded, it is important to ensure that contribution activity is scored fairly. Only DigitalOcean has a record of Hacktoberfest signups, so we do not know in advance who is a participant. This raises a few questions. Should we consider only peer-reviewed contributions? Should we exclude activity from full-time contributors? If so, how do we define a full-time contributor?

The answers to these questions may vary depending on one's perspective. Our opinionated approach to the leaderboards is to include all contributors (regardless of their full-time status) and exclude self-merged pull requests.

We understand that fairness is subjective, so we have decided to shift the responsibility of choice to the end user 😇. To do this, we provide a set of filters that let users adjust the ranking rules to what they consider to be the fairest view. Specifically, we offer three filters:

  • Include self-merged PRs: When this is switched on, self-merged PRs are included. We exclude self-merged PRs by default because they indicate a contribution process that wasn't peer reviewed.

  • Hacktoberfest-labelled PRs only: DigitalOcean allows users to label certain PRs as hacktoberfest-accepted. When this filter is selected, only contributions associated with these labelled PRs are considered.

  • Contributors: You can also choose to rank contributors based on three categories: all contributors, new contributors, or casual contributors (those with fewer than 10 PRs before October on a per-repo basis). Segmenting contributors in this way helps us estimate whether the observed activity is directly attributable to participating in Hacktoberfest.
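To make the interaction between the three filters concrete, here is a minimal Python sketch of how they could be applied before counting PRs per contributor. It is an illustration only, not our production code: the `PullRequest` record, its field names, and the `rank_contributors` function are hypothetical, and treating a "new" contributor as someone with zero PRs in the repo before October is an assumption made for the example.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass(frozen=True)
class PullRequest:
    """Hypothetical record for one merged PR (field names are illustrative)."""
    author: str
    repo: str
    merged_by: str
    labels: frozenset = field(default_factory=frozenset)
    prior_prs_in_repo: int = 0  # author's PRs in this repo before October


def rank_contributors(prs, include_self_merged=False,
                      labelled_only=False, segment="all"):
    """Count PRs per author after applying the three leaderboard filters."""
    counts = Counter()
    for pr in prs:
        # Filter 1: drop self-merged PRs unless the user opts in.
        if not include_self_merged and pr.merged_by == pr.author:
            continue
        # Filter 2: optionally keep only hacktoberfest-accepted PRs.
        if labelled_only and "hacktoberfest-accepted" not in pr.labels:
            continue
        # Filter 3: contributor segment, evaluated per repo.
        # Assumption: "new" means no PRs in the repo before October.
        if segment == "new" and pr.prior_prs_in_repo > 0:
            continue
        # "casual" means fewer than 10 PRs in the repo before October.
        if segment == "casual" and pr.prior_prs_in_repo >= 10:
            continue
        counts[pr.author] += 1
    # Highest PR count first.
    return counts.most_common()
```

With the default arguments this reproduces the opinionated view described above: all contributors are included and self-merged PRs are excluded.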

3️⃣ Bots

We do not expect bots to participate in Hacktoberfest, yet they are among the busiest contributors on GitHub 🙃. Their activity can inflate metrics for repositories that use bots and can pollute the leaderboards. Unfortunately, many machine accounts on GitHub (such as web-flow, renovate-bot, gitter-badger, microsoftopensource) are not labeled as Bot-type. As a result, we have to identify and remove them to the best of our ability. We regularly update our bot block-list and use statistics and regex searches to exclude machine activity.
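As a rough illustration of combining a block-list, a regex search, and a simple statistic, consider the Python sketch below. The four block-listed logins come from the examples above; everything else is an assumption for the sake of the example, including the `is_probable_bot` name, the login pattern, and the `max_prs_per_day` threshold, none of which reflect our actual rules.

```python
import re

# Logins mentioned above that are not flagged as Bot-type by GitHub.
BOT_BLOCKLIST = {"web-flow", "renovate-bot", "gitter-badger", "microsoftopensource"}

# Hypothetical pattern: logins ending in "[bot]", "-bot" or "_bot".
BOT_LOGIN_PATTERN = re.compile(r"(\[bot\]|[-_]bot)$", re.IGNORECASE)


def is_probable_bot(login, prs_per_day=0.0, max_prs_per_day=50.0):
    """Return True if a login looks like a machine account.

    The activity threshold is an illustrative stand-in for a
    statistical check, not a tuned value.
    """
    name = login.lower()
    if name in BOT_BLOCKLIST:
        return True
    if BOT_LOGIN_PATTERN.search(name):
        return True
    # Implausibly high sustained activity suggests automation.
    return prs_per_day > max_prs_per_day


logins = ["octocat", "web-flow", "dependabot[bot]", "abbot"]
humans = [login for login in logins if not is_probable_bot(login)]
```

A suffix-anchored pattern keeps ordinary names such as "abbot" from being flagged, at the cost of missing bots with unconventional logins, which is why a maintained block-list is still needed.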

What do you think about these leaderboards? 👀 We welcome any feedback on our approach and methodology. If you have any thoughts, questions, or suggestions, please write to us at help@quine.sh.
