Reading between the lies: using leak sites to analyse ransomware trends

In September 2024, a group called Valencia Ransomware announced their presence to the world by posting data from five alleged victims, including an Indian paper producer, a Malaysian pharmaceutical company, and a Californian municipality, on their leak site’s ‘wall of shame.’

Since 2020, operating a leak or shame site has practically become a matter of course for the ransomware business model. These sites play an essential role in ransomware groups’ double-extortion schemes. The groups not only encrypt victims’ systems – often rendering them inoperable – but also siphon off some of their data, which they can later threaten to release on leak sites. The goal is simple: the threat of releasing data often significantly increases the cost of non-compliance with ransom demands and might pressure the companies to pay even if they are able to restore access to the encrypted data independently. Such threats can be especially effective if the stolen data includes sensitive information, such as customer purchase information, employee addresses and payroll details, or patient health records.

Data from leak sites is frequently used by journalists, practitioners, scholars, and think tank experts to shed light on various aspects of ransomware incidents, such as current levels of activity, their timing relative to geopolitical events, their geographic spread, and the types of companies targeted.

In Ransom War: How Cyber Crime Became a Threat to National Security, one of the authors of this post, Max Smeets, used data from the leak site Conti News to assess trends in targeting. Data from Conti News suggested that the group, Conti, had begun to explore targets in markets beyond the Western world. In 2020, Conti only very sporadically released leaked data from victims in non-Western regions. For example, in early December 2020, they put up for sale data from a relatively small Indian company called Ixsight Technologies. They also offered data from a smaller information technology firm in the United Arab Emirates, CORE Information Technology Consultants. But these were exceptions to the rule. In 2021, a shift seems to have occurred. Conti began expanding its operations into other markets, with an emphasis on actively targeting organisations in Latin America. Data from various Latin American companies began to appear on their leak site.

This type of data can be tempting to use due to its tantalising accessibility and breadth – it can even be obtained in bulk from scraping websites like ecrime.ch. However, manipulation by ransomware groups, selection biases, and inaccuracies necessitate a cautious approach – one that too often is not taken.

Selection bias

A key limitation in using leak site data is its inherent selection bias. These sites only showcase victims who do not meet ransom demands, meaning we see a skewed picture of ransomware activity. A highly effective ransomware group that secures a high percentage of ransom payments may appear less active than a less successful group that posts more victims online. This hinders cross-group comparisons, complicating the analysis of groups like BlackCat, the culprit behind the UnitedHealth incident; Akira, which has received at least $42 million in ransom payments from hundreds of victims; and their competitors.

This bias also impedes within-group comparisons, making it difficult to discern activity trends of ransomware groups over time. A decline in the volume of leaked data does not necessarily indicate a reduction in ransomware operations; it could also imply that more victims are yielding to ransom demands as the ransomware group becomes better at stealing critical data.
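
The arithmetic behind both problems can be sketched in a few lines of Python. This is a deliberately simplified model, not a description of any real group: it assumes every non-paying victim is posted and no paying victim is, and all attack counts and payment rates below are hypothetical figures chosen for illustration.

```python
def expected_leak_posts(attacks: int, payment_rate: float) -> int:
    """Expected leak-site posts under the simplifying assumption that
    only (and all) non-paying victims are listed."""
    if not 0.0 <= payment_rate <= 1.0:
        raise ValueError("payment_rate must be between 0 and 1")
    return round(attacks * (1.0 - payment_rate))


# Cross-group comparison (hypothetical figures):
# Group A attacks twice as many victims but collects far more often.
group_a_posts = expected_leak_posts(attacks=200, payment_rate=0.70)
group_b_posts = expected_leak_posts(attacks=100, payment_rate=0.20)
print(group_a_posts, group_b_posts)  # A shows fewer posts than B

# Within-group comparison over time (hypothetical figures):
# attacks rise by half, yet visible posts fall because more victims pay.
year_1_posts = expected_leak_posts(attacks=100, payment_rate=0.20)
year_2_posts = expected_leak_posts(attacks=150, payment_rate=0.60)
print(year_1_posts, year_2_posts)
```

In this sketch the more active group looks smaller on its leak site, and the same group appears to shrink in the very year it expands. Because the payment rate is unobserved, post counts alone cannot distinguish between these scenarios.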

Similarly, certain sectors, such as government institutions, may be overrepresented in the data simply because they are less likely to pay due to regulatory restrictions, further distorting trend analyses.

Overestimating and underestimating

In a 2022 article, cyber threat analyst Will Thomas illustrated how relying only on leak site data can lead to a stark underestimation of actual ransomware cases. The example of REvil, the ransomware gang behind the now infamous Kaseya supply-chain attack, is indicative. As Thomas points out, the group’s leak site showed only 288 victims, yet, when seven of its members were arrested by Europol in 2021, they alone were accused of around 7,000 cases of ransomware use.
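
To put the two REvil figures side by side, a rough calculation is enough. The sketch below uses only the numbers quoted above; note that the two counts were not necessarily measured over the same period or on the same basis, so the ratio is an order-of-magnitude illustration rather than a precise estimate. The function name is a hypothetical helper.

```python
def visible_share(listed: int, attributed: int) -> float:
    """Fraction of attributed cases that appeared on the leak site."""
    return listed / attributed


listed_on_leak_site = 288   # victims shown on REvil's leak site
cases_attributed = 7_000    # approximate cases attributed to the arrested members

share = visible_share(listed_on_leak_site, cases_attributed)
undercount_factor = cases_attributed / listed_on_leak_site

print(f"Share visible on leak site: {share:.1%}")
print(f"Undercount factor: roughly {undercount_factor:.0f}x")
```

On these figures, the leak site reflected only around 4% of the cases attributed to the group.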

Overestimation is also a risk. Ransomware groups have incentives to inflate their presence on leak sites. They may fake listings, post old data, or post other groups’ victims. These tactics can create a false sense of scale, making it look like they have targeted more victims than they have. These inflated numbers serve several purposes. A larger list of victims helps build a stronger reputation within the criminal ecosystem, making the group appear more successful and attracting more affiliates who want to profit from their perceived dominance. An inflated reputation also gives ransomware groups leverage in negotiations. Victims are more likely to pay if they believe the group is powerful and capable of causing widespread damage.
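
One modest check an analyst can run before counting posts is to flag victims that appear under more than one group's name. The following is a minimal sketch under stated assumptions: the input is a list of (group, victim) pairs already scraped from leak sites, the group and victim names are invented placeholders, and the name normalisation is intentionally crude.

```python
import re
from collections import defaultdict

# Corporate suffixes to ignore when comparing names (illustrative, not exhaustive).
SUFFIXES = {"inc", "ltd", "llc", "gmbh", "corp"}


def normalise(name: str) -> str:
    """Lower-case a victim name, drop punctuation and common suffixes."""
    cleaned = re.sub(r"[^a-z0-9 ]", " ", name.lower())
    return " ".join(t for t in cleaned.split() if t not in SUFFIXES)


def cross_listed(posts):
    """Return victims listed by more than one group, with those groups."""
    groups_by_victim = defaultdict(set)
    for group, victim in posts:
        groups_by_victim[normalise(victim)].add(group)
    return {v: sorted(g) for v, g in groups_by_victim.items() if len(g) > 1}


# Hypothetical leak-site posts.
posts = [
    ("GroupA", "Example Widgets Ltd."),
    ("GroupB", "example widgets ltd"),
    ("GroupA", "Sample Clinic Inc"),
]

flagged = cross_listed(posts)
print(flagged)
```

A match is a prompt for manual review, not proof of manipulation: the same organisation may genuinely have been attacked twice. Fake listings and recycled old data would not be caught by this check at all.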

Imagine you are a cybercrime officer relying on leak site data to assess whether you are effectively combating the ransomware threat. Are you looking for an increase in the number of victims listed, which could suggest that fewer victims are paying ransoms, perhaps due to better defences? Or would you rather see fewer entries, possibly indicating a reduction in overall ransomware activity? Unless you are strictly focused on minimising data leaks, the number of leak-site posts should not be trusted as an indicator of success in combating ransomware.

The more we use it, the less useful leak data becomes

The incentives for these groups to manipulate data increase the more the research community, journalists, and industry practitioners rely on this data to analyse their behaviour and describe trends.

Cybersecurity journalist Valéry Riess-Marchive has pointed out that the uncritical use of this data not only skews research results but may even serve ransomware groups’ interests by amplifying their perceived power.

Ransomware groups are observant. They do not operate in a vacuum but constantly engage with the public. They are aware of the influence their leak sites can exert.

Need for a careful approach

Despite these issues, leak site data can still offer useful insights, for example as an early indicator of new or resurgent ransomware groups. However, we must proceed with caution. We should not give in to the temptation of using this data for our analyses simply because it is easily available. It is important to dissect the data critically, recognising both its value and its limitations. If we choose to include leak site data in our analyses, we must clearly acknowledge its potential shortcomings and ensure these are discussed transparently. Ultimately, we are relying on a source maintained by criminal actors who make their living through deception.