Blockchain Forensics Requires Offchain Detective Work

Blockchain analytics is one of the few blockchain businesses with a genuinely solid income stream. Its customers are financial services firms, government monitoring agencies, and law enforcement.

Though it looks like straightforward data pipeline technology, a blockchain forensics service is only as good as its offchain detectives.

Why Forensics: Anti-Money Laundering Regulations

All businesses that convert or transfer money must comply with Anti-Money Laundering (AML) rules in order to be granted a licence to operate.

AML first requires you to know the identity of your client (Know Your Customer, KYC). Good client onboarding requests identity documentation, especially if the amounts transferred exceed several thousand dollars.

Additionally, you must monitor client deposits and transfers for suspicious activity. The source and destination of monies must be scrutinised for links with illegal monies.

To comply with AML rules, you will need:

  1. Coin Scoring: the risk of illicit activity in a client’s transfers. The score measures the money’s links to fraudulent origins.
  2. Forensic Investigation: the ability to trace money backwards on the ledger’s transfer graph.
  3. De-anonymisation of transactions and addresses: forensics needs real identities to ascertain fraud or illegal activity.

The analytics service’s findings may be used as evidence in legal proceedings, so the service must be able to substantiate its conclusions.


Blockchain forensics needs two types of data: (1) the blockchain data and (2) off-chain identity intelligence.

The crypto blockchain is freely available and provides a ledger of all transactions by all actors in the ecosystem.

In contrast, off-chain intelligence on the owners behind crypto addresses is more difficult to come by. It is generally private and unstructured. Collecting off-chain identity data is more like detective work than data science.

Graph Database

Clustering Addresses

Blockchains identify owners by addresses derived from cryptographic public keys, not by name, so crypto money is nominally anonymous. Hundreds of millions of these addresses define the origins and destinations of transactions on the ledger.

Graph network of addresses (blue) and transactions (red) in a typical bitcoin cluster

Navigating such a linked network of transactions between addresses is complex. The first step in mapping crypto transactions is to cluster addresses into wallets owned by the same entity.

Two principal heuristic rules are used to cluster addresses: (1) shared spending authority, also known as the multi-input heuristic, and (2) the change address rule.

First, the multi-input heuristic: when a transaction spends inputs from two or more addresses, every input must be cryptographically signed, so the addresses are assumed to be controlled by the same owner. They are linked and belong to the same wallet.

Second, the change address rule: when the outputs of a transaction include a logically clear change address, that address belongs to the transaction author and is linked to the same wallet as the input addresses.
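The two heuristics can be sketched with a union-find structure. This is a minimal illustration and not the Neo4j-based approach of [1]: the transaction layout (an "inputs" list and an optional "change" field) is a hypothetical format, and the change address is assumed to have been identified by some earlier step.

```python
from collections import defaultdict


class UnionFind:
    """Disjoint-set structure that merges addresses into wallets."""

    def __init__(self):
        self.parent = {}

    def find(self, x):
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            # Path halving keeps the trees shallow.
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def union(self, a, b):
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            self.parent[root_b] = root_a


def cluster_addresses(transactions):
    """Group addresses into wallets using both heuristics.

    Each transaction is a dict with (hypothetical) keys:
      "inputs": list of input addresses
      "change": the change address if one was identified, else None
    """
    uf = UnionFind()
    for tx in transactions:
        inputs = tx["inputs"]
        for addr in inputs:
            uf.find(addr)  # register single-input addresses too
        # Heuristic 1: all inputs of one transaction share an owner.
        for other in inputs[1:]:
            uf.union(inputs[0], other)
        # Heuristic 2: a clear change address belongs to the sender.
        change = tx.get("change")
        if change is not None and inputs:
            uf.union(inputs[0], change)

    clusters = defaultdict(set)
    for addr in list(uf.parent):
        clusters[uf.find(addr)].add(addr)
    return list(clusters.values())


txs = [
    {"inputs": ["A", "B"], "change": "C"},
    {"inputs": ["B", "D"], "change": None},
    {"inputs": ["E"], "change": None},
]
wallets = cluster_addresses(txs)
```

In this example A, B, C and D end up in one wallet because B appears as an input in both of the first two transactions, while E stays on its own.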

Cluster analysis using the Neo4j graph algorithms [1] results in graphs like those shown.

Figure 1: Graph of an address cluster showing addresses (blue) that are linked to transactions (red)

Taint Rank

A coin’s taint rank measures how closely it is linked to an illicit source of funds. Under AML procedures, a customer transferring highly tainted crypto will have their account locked and their funds investigated.

Taint scores of crypto addresses are computed with graph algorithms. Graph algorithms propagate taint labels from input addresses to output addresses in a transaction, and cascade a weighted taint tag through the transaction graph. Any address can be assigned a risk value of how closely it is linked to illegal addresses[6].
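As an illustration of weighted taint propagation, the sketch below uses a proportional rule: each transaction passes the value-weighted average taint of its inputs on to its outputs. This is one possible policy chosen for simplicity, not necessarily the scheme evaluated in [6]. The transaction layout and the function name are assumptions, and transactions are assumed to arrive in chronological order.

```python
from collections import defaultdict


def propagate_taint(transactions, illicit):
    """Return a taint score in [0, 1] for every address seen.

    transactions: chronologically ordered list of dicts with keys
      "inputs":  list of (address, amount) pairs
      "outputs": list of (address, amount) pairs
    illicit: set of addresses known to be illicit (score 1.0)
    """
    tainted_value = defaultdict(float)  # tainted amount received
    total_value = defaultdict(float)    # total amount received

    def score(addr):
        if addr in illicit:
            return 1.0
        total = total_value.get(addr, 0.0)
        if total == 0.0:
            return 0.0  # no known history: treated as clean
        return tainted_value[addr] / total

    for tx in transactions:
        in_total = sum(amount for _, amount in tx["inputs"])
        if in_total == 0:
            continue
        # Value-weighted average taint of everything being spent.
        tx_taint = sum(score(a) * amt for a, amt in tx["inputs"]) / in_total
        for addr, amount in tx["outputs"]:
            tainted_value[addr] += tx_taint * amount
            total_value[addr] += amount

    seen = set(total_value) | set(illicit)
    for tx in transactions:
        seen.update(addr for addr, _ in tx["inputs"])
    return {addr: score(addr) for addr in seen}


txs = [
    {"inputs": [("hack", 10)], "outputs": [("mixer", 10)]},
    {"inputs": [("mixer", 10), ("clean", 30)], "outputs": [("shop", 40)]},
]
scores = propagate_taint(txs, illicit={"hack"})
```

Here "mixer" receives only stolen funds and scores 1.0, while "shop" receives one part tainted to three parts clean and scores 0.25. A compliance team would then pick a threshold above which an account is flagged.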

Taint scoring depends on a complete set of addresses known to be involved in illegal activity. No central repository of such addresses exists as yet.

Machine Learning

Most blockchain transactions are semi-automatic, generated by custodial wallet and trading system software. Machine learning can identify many of these transaction patterns and assign tags to the corresponding clusters and wallets [7].

The types of actors are:

  • Exchanges
  • Gambling
  • Marketplaces
  • Mining Pools
  • Mixers
  • Services
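To make the idea concrete, the sketch below summarises each cluster's transaction history into a small feature vector and assigns the label of the nearest class centroid. The nearest-centroid classifier is a deliberately simple stand-in, not the method of [7], and the feature set and field names ("n_inputs", "n_outputs", "value") are hypothetical.

```python
import math
from statistics import mean


def summarize(history):
    """Turn a cluster's transaction history into a feature vector.

    history: non-empty list of dicts with keys "n_inputs",
    "n_outputs" and "value" (all hypothetical field names).
    """
    return (
        math.log1p(len(history)),                        # activity level
        mean(tx["n_inputs"] for tx in history),          # typical fan-in
        mean(tx["n_outputs"] for tx in history),         # typical fan-out
        math.log1p(mean(tx["value"] for tx in history))  # typical size
    )


def train_centroids(labelled):
    """labelled: list of (label, history) pairs taken from ground truth."""
    grouped = {}
    for label, history in labelled:
        grouped.setdefault(label, []).append(summarize(history))
    return {
        label: tuple(mean(column) for column in zip(*vectors))
        for label, vectors in grouped.items()
    }


def classify(history, centroids):
    """Return the label whose centroid is closest to this history."""
    vector = summarize(history)
    return min(centroids, key=lambda label: math.dist(vector, centroids[label]))


exchange = [{"n_inputs": 20, "n_outputs": 2, "value": 50.0}] * 30
mixer = [{"n_inputs": 5, "n_outputs": 5, "value": 1.0}] * 10
centroids = train_centroids([("Exchanges", exchange), ("Mixers", mixer)])

unknown = [{"n_inputs": 18, "n_outputs": 2, "value": 40.0}] * 25
label = classify(unknown, centroids)
```

With these toy histories the unknown cluster is labelled "Exchanges". A production system would use many more features, far more labelled examples, and a stronger model.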

All illicit actors try to launder the proceeds of their crimes. Obfuscating transactions so as to obscure their trail and hinder forensic tracing is an increasingly sophisticated game of cat and mouse.

Obfuscation is scripted and automated to generate complexity and misdirection. Machine learning is particularly adept at reverse engineering these patterns and clarifying the ultimate source and destination of the money.

A critical tool for tracing money laundering is tagging all the on-ramps and off-ramps to fiat money, where criminals can make off with their gains.

Linking Crypto Addresses to Real People

Linking blockchain addresses to real world identities requires off-chain intelligence.

Many organizations publish their crypto addresses in forums, on Wikipedia, and on their own websites. These addresses are collected with website scrapers. Google can locate crypto addresses across the open web, but does not allow automated scraping.
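The core of such a scraper is pulling candidate addresses out of page text and discarding strings that merely look like addresses. The sketch below handles only legacy Base58Check Bitcoin addresses (those starting with 1 or 3). The function names are hypothetical, and fetching the pages is left out.

```python
import hashlib
import re

BASE58 = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"
# Candidate pattern for legacy addresses: 26 to 35 Base58 characters.
CANDIDATE = re.compile(r"\b[13][1-9A-HJ-NP-Za-km-z]{25,34}\b")


def is_valid_base58check(address):
    """Check the 4-byte double-SHA256 checksum of a legacy address."""
    number = 0
    for char in address:
        index = BASE58.find(char)
        if index < 0:
            return False
        number = number * 58 + index
    try:
        raw = number.to_bytes(25, "big")  # version + 20-byte hash + checksum
    except OverflowError:
        return False
    # Each leading '1' character encodes one leading zero byte.
    leading_ones = len(address) - len(address.lstrip("1"))
    leading_zeros = len(raw) - len(raw.lstrip(b"\x00"))
    if leading_ones != leading_zeros:
        return False
    payload, checksum = raw[:-4], raw[-4:]
    digest = hashlib.sha256(hashlib.sha256(payload).digest()).digest()
    return digest[:4] == checksum


def extract_addresses(text):
    """Return the distinct, checksum-valid addresses found in text."""
    found = []
    for candidate in CANDIDATE.findall(text):
        if candidate not in found and is_valid_base58check(candidate):
            found.append(candidate)
    return found


page = (
    "Donations welcome: 1A1zP1eP5QGefi2DMPTfTL5SLmv7DivfNa. "
    "Typo version: 1A1zP1eP5QGefi2DMPTfTL5SLmv7DivfNb"
)
addresses = extract_addresses(page)
```

The checksum test matters because scraped pages are full of random Base58-looking strings. Each address that survives should be stored together with the URL and date where it was found, so that the resulting tag has provenance.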

Noting addresses while interacting with marketplaces and services [4] is a rite of passage for any analytics team.

Law enforcement holds intelligence on illicit activity, but no standards for recording or sharing it exist as yet.

Finally, exchanges all possess KYC customer information linking addresses to accounts. These records can be subpoenaed to provide the links between addresses and identities.

The end result is a database termed ground truth. The quality of a forensic service is closely tied to the breadth and depth of the ground truth it has collected.

Public blacklists and reliable ground truths do not exist. The seminal reference often used by academia is a best effort and provides neither provenance nor guarantees.

Collecting ground truth is strategic activity, and offchain detective work is ultimately what sets a service apart.


Different forensics suppliers, in the course of their detective work, arrive at different sets of ground truths. Clients such as tax authorities therefore inevitably contract several blockchain forensics services in order to consult alternative databases, in effect triangulating between services.

An “Open Standard” for forensics ground truth, and a public database of fraudulent addresses held by public organisations, would benefit the field as a whole.


[1] The Neo4j graph algorithms.

[2] “The Unreasonable Effectiveness of Address Clustering”, M. Harrigan, C. Fretter, September 2018.

[3] “Evolution of the Bitcoin Address Graph”, E. Filtz, A. Polleres, R. Karl, B. Haslhofer, Data Science Conference, Vienna, 2017

[4] “A Fistful of Bitcoins: Characterizing Payments among Men with No Names”, S. Meiklejohn, G. Jordan, K. Levchenko, D. McCoy, G. Voelker, S. Savage, Proc. Internet Measurement Conf., 2013

[5] “Automatic Bitcoin Address Clustering”, D. Ermilov, M. Panov, Y. Yanovich, IEEE Conference on Machine Learning and Applications, 2017

[6] “Effective Cryptocurrency Regulation Through Blacklisting”, M. Möser, A. Narayanan, Princeton, 2019.

[7] “An Evaluation of Bitcoin Address Classification based on Transaction History Summarization”, Y.J. Lin, P.W. Wu, C.H. Hsu, I.P. Tu, S.W. Liao, IEEE, 2019