The Panama Papers is a very large data set that was leaked from a Panamanian law firm and corporate service provider named “Mossack Fonseca”. The documents were allegedly leaked by an anonymous whistleblower, known to date only as “John Doe”, to a journalist at the German newspaper “Süddeutsche Zeitung”.
The stated motive for releasing the papers was income inequality and the whistleblower's realization that many of the firm's practices were immoral. The documents revealed various illegal activities, ranging from tax evasion to drug money trails.
The papers have played, and are still playing, key roles in investigations involving fraud, tax evasion and other criminal activities. They also involved many wealthy people as well as public figures and entities from 200 nations.
One of the biggest tasks was converting the leaked data into a searchable database.
When the data was originally leaked, it was effectively no more than a collection of ones and zeroes. The database had to be reverse engineered to make any sense of the data at all. With terabytes of data involved, this was no easy task in itself, let alone for the small team behind the International Consortium of Investigative Journalists (ICIJ).
Once the raw data made sense, it had to be made searchable. A simple table-based database such as MySQL would have run into performance issues. The solution was a graph-based database that exposed the links between the people, organizations and documents involved.
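To illustrate why a graph fits this kind of data, here is a minimal sketch in plain Python. It is not the ICIJ's actual database: the class name `LeakGraph`, the node labels and the sample entities are all made up for illustration. It stores people, companies and documents as nodes and finds the shortest chain of links between two of them.

```python
from collections import defaultdict, deque


class LeakGraph:
    """Toy undirected graph of people, companies and documents (illustrative only)."""

    def __init__(self):
        self.neighbours = defaultdict(set)

    def link(self, a, b):
        # Links are stored in both directions so they can be followed either way.
        self.neighbours[a].add(b)
        self.neighbours[b].add(a)

    def path(self, start, goal):
        """Breadth-first search: return the shortest chain of links, or None."""
        queue = deque([[start]])
        visited = {start}
        while queue:
            route = queue.popleft()
            if route[-1] == goal:
                return route
            for nxt in sorted(self.neighbours.get(route[-1], ())):
                if nxt not in visited:
                    visited.add(nxt)
                    queue.append(route + [nxt])
        return None


# Hypothetical sample data, not taken from the leak.
graph = LeakGraph()
graph.link("person:John Doe", "company:Acme Holdings")
graph.link("person:Jane Roe", "company:Acme Holdings")
graph.link("company:Acme Holdings", "document:incorporation.pdf")

print(graph.path("person:John Doe", "person:Jane Roe"))
```

In a table-based database the same question ("how are these two people connected?") would require a chain of joins whose length is not known in advance, which is where a graph model tends to be the more natural fit.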
What is very interesting, however, is that The Panama Papers do not appear to have been exposed to advanced data science techniques. This is surprising, since the leak totals about 2.6 terabytes and contains about 11.5 million documents, a volume that is nearly impossible to investigate by hand.
Furthermore, many of the documents in The Panama Papers are unstructured. By unstructured documents we mean documents such as e-mails, PDFs and other written text that do not adhere to a single standard format. An example of structured data would be a database. In a database, every field has pre-defined restrictions, for example whether it holds numerical data, text or an image, and what the field should contain (e.g. a first name, last name, e-mail address and so on).
Unfortunately, neither humans nor computers are very good at working with these unstructured documents, especially when there are so many of them! Luckily, the field of Data Science is well equipped to find relations in both structured and unstructured data. Often Data Science can even go a step further and expose relations that humans cannot see initially. We see the major role of Data Science in bringing structure to the massive number of files and providing an overview through good data visualization.
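The difference can be made concrete with a small sketch using Python's built-in `sqlite3` module. The table and column names are assumptions chosen for illustration, and the e-mail check is deliberately crude. The point is only that a structured store rejects values that break its rules, whereas a free-text letter has no such rules to enforce.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Structured data: every field has a declared type and restrictions.
conn.execute(
    """
    CREATE TABLE person (
        first_name TEXT NOT NULL,
        last_name  TEXT NOT NULL,
        email      TEXT NOT NULL CHECK (email LIKE '%_@_%')
    )
    """
)
conn.execute(
    "INSERT INTO person VALUES (?, ?, ?)",
    ("John", "Doe", "john.doe@example.com"),
)

# A value that violates the restriction is refused by the database.
try:
    conn.execute(
        "INSERT INTO person VALUES (?, ?, ?)",
        ("Jane", "Roe", "not an e-mail"),
    )
    rejected = False
except sqlite3.IntegrityError:
    rejected = True

row_count = conn.execute("SELECT COUNT(*) FROM person").fetchone()[0]

# Unstructured data: the same facts buried in free text, with no fixed fields.
letter = "Dear Sir, please note that Mr Doe (john.doe@example.com) will sign."
```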
Currently, when a data journalist searches for a person, they can set a proximity that the ICIJ has defined for the search. This proximity allows some tolerance on the search term. For example, when searching for “John Doe” with a proximity of 2, the search will match results such as “Doe, John”, “John Middlename Doe” and so on.
Whether this also finds instances such as “J. Doe” is unclear from the information the ICIJ provides.
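As a rough illustration of what such a proximity could mean, the sketch below treats it as the number of extra words allowed between the search terms, in any order. This is our own simplification, not the ICIJ's implementation; the function names `tokenize` and `proximity_match` are hypothetical.

```python
import re
from itertools import product


def tokenize(text):
    """Lower-case the text and split it into words, dropping punctuation."""
    return re.findall(r"\w+", text.lower())


def proximity_match(query, text, proximity):
    """True if all query words occur in the text, in any order, with at most
    `proximity` other words inside the span that covers them."""
    query_tokens = tokenize(query)
    text_tokens = tokenize(text)

    # For each query word, collect every position where it occurs in the text.
    positions = []
    for word in query_tokens:
        hits = [i for i, token in enumerate(text_tokens) if token == word]
        if not hits:
            return False
        positions.append(hits)

    # Try every combination of positions and look for a tight enough span.
    for combo in product(*positions):
        if len(set(combo)) < len(combo):
            continue  # the same text word cannot serve two query words
        extra_words = (max(combo) - min(combo) + 1) - len(combo)
        if extra_words <= proximity:
            return True
    return False


print(proximity_match("John Doe", "Doe, John", 2))               # True
print(proximity_match("John Doe", "John Middlename Doe", 2))     # True
print(proximity_match("John Doe", "John met with Ms Jane Doe", 2))  # False
print(proximity_match("John Doe", "J. Doe", 2))                  # False
```

Under this simplified definition an abbreviated “J. Doe” is not found, because the word “John” never appears. Catching abbreviations would need an extra rule on top of the proximity.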
A problem that arises while searching through the leak is that names are not unique. Therefore, a match on e.g. “John Doe” does not mean that all results involve the same person.
Named entity disambiguation can already filter out these faulty matches to some degree. By looking at the context of a name, e.g. the address or the companies associated with it, the search results can be separated into distinct people.
Another crucial contribution data science can make is bringing structure to unstructured data. Entities mentioned in letters can be detected in order to link these unstructured files to the people involved.
Alternatively, similar documents can be grouped together. We discussed an approach for this in our previous blog post: Organizing your documents the AI way.
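A minimal sketch of this idea, assuming each search hit comes with a name, an address and a list of associated companies (field names and sample records are hypothetical). Hits with the same name are merged only when they also share an address or a company; everything else is kept apart as a possibly different person.

```python
from collections import defaultdict


def disambiguate(records):
    """Group record indices that likely refer to the same person.

    Two records are merged when they carry the same name AND share either
    an address or at least one company. This is a simple rule for
    illustration, not a full disambiguation model.
    """
    parent = list(range(len(records)))

    def find(i):
        # Union-find lookup with path compression.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    first_seen = {}
    for idx, rec in enumerate(records):
        name = rec["name"].lower()
        keys = []
        if rec.get("address"):
            keys.append(("address", name, rec["address"].lower()))
        for company in rec.get("companies", []):
            keys.append(("company", name, company.lower()))
        for key in keys:
            if key in first_seen:
                parent[find(idx)] = find(first_seen[key])
            else:
                first_seen[key] = idx

    groups = defaultdict(list)
    for idx in range(len(records)):
        groups[find(idx)].append(idx)
    return sorted(groups.values())


# Hypothetical search hits for the same name.
hits = [
    {"name": "John Doe", "address": "1 Main St, Panama City", "companies": ["Acme Holdings"]},
    {"name": "John Doe", "address": "9 Other Rd, London", "companies": ["Acme Holdings"]},
    {"name": "John Doe", "address": "5 Elm St, Berlin", "companies": ["Blue Sky Ltd"]},
]
print(disambiguate(hits))  # [[0, 1], [2]]
```

Here the first two hits are treated as one person because they share a company, while the third is kept separate. A rule this simple will make mistakes in both directions, so its output is a starting point for a journalist rather than a verdict.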
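One simple way to make such links, sketched below, is a dictionary lookup: names already known from the structured part of the data are searched for in the free text of a letter. This is a deliberately basic stand-in for statistical named entity recognition; the function name `link_entities` and the sample letter are made up for illustration.

```python
import re


def normalize(text):
    """Lower-case and strip punctuation so that 'Doe, John' style noise is ignored."""
    return " ".join(re.findall(r"\w+", text.lower()))


def link_entities(text, known_entities):
    """Return the known entity names that literally occur in the text."""
    haystack = normalize(text)
    found = []
    for entity in known_entities:
        needle = normalize(entity)
        # Word boundaries prevent 'Ann' from matching inside 'Hannah'.
        if needle and re.search(r"\b" + re.escape(needle) + r"\b", haystack):
            found.append(entity)
    return found


letter = (
    "Dear Sirs, following our call, John Doe will sign "
    "on behalf of Acme Holdings Ltd. Kind regards."
)
known = ["John Doe", "Acme Holdings Ltd", "Jane Roe"]
print(link_entities(letter, known))  # ['John Doe', 'Acme Holdings Ltd']
```

A lookup like this only finds names that are already known and spelled the same way, so it complements rather than replaces a trained entity recognizer.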
It is also likely that there are duplicate documents in the unstructured data. There are several techniques to identify these duplicate documents, even if they are in different file formats or the information in them is structured differently.
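One of these techniques, sketched here under the assumption that the text has already been extracted from each file, compares documents by their overlapping word sequences (“shingles”) and scores the overlap with the Jaccard similarity. The function names and the shingle size of three words are illustrative choices.

```python
import re


def shingles(text, size=3):
    """Return the set of overlapping word sequences of the given size."""
    words = re.findall(r"\w+", text.lower())
    if len(words) < size:
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + size]) for i in range(len(words) - size + 1)}


def similarity(text_a, text_b):
    """Jaccard similarity of the two shingle sets: 1.0 means identical wording."""
    a, b = shingles(text_a), shingles(text_b)
    if not a and not b:
        return 1.0  # two empty documents are treated as identical
    return len(a & b) / len(a | b)


# Same wording, different layout and capitalization (e.g. an e-mail and its PDF export).
original = "Please transfer the funds to the account in Panama."
reformatted = "PLEASE TRANSFER the funds\n  to the account, in Panama"
unrelated = "The annual meeting is postponed until further notice."

print(similarity(original, reformatted))  # 1.0
print(similarity(original, unrelated))    # 0.0
```

Pairs scoring above a chosen threshold can then be flagged as likely duplicates. Comparing every pair does not scale to millions of documents, so a real pipeline would need an additional indexing step on top of this.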
The final judgement, however, should remain with the investigative journalists. While Data Science can bring (more) structure to the data and perhaps make patterns clearer, it is unlikely that the structuring will be flawless.
Named entity disambiguation can give false results both when marking people as distinct and when marking two people or companies as the same entity.
Furthermore, the ICIJ also stresses that if someone’s name occurs in The Panama Papers, it does not mean that he or she is conducting unlawful business. After all, there are many legitimate reasons to have an offshore account.
We now know what The Panama Papers involve and some things that we can try to do better. But what is the exact size and impact of The Panama Papers? We could say that the leak included 11.5 million files or that 1.2 billion dollars in taxes has been recovered, but those numbers by themselves do not say much.
Let’s put those 1.2 billion dollars in perspective:
It’s enough to buy the most expensive private island in the world almost four times!
Or perhaps we should use the recovered taxes to build 6 copies of the world’s fastest computer. (Think of all the data wrangling and further tax recovery that could be done.)
Alternatively, you can send 14 people to the moon (and back)!
The most tax has been recovered for the UK, followed by Germany, Spain and France. While the number is already very impressive, there is no doubt that there is more tax money hidden in the leak.
What is interesting is that The Panama Papers leak is not the largest data leak when looking at the number of files involved. With 11.5 million documents it is one of the biggest leaks, but it is surpassed by The Paradise Papers with 13.4 million documents.
What is surprising, however, is that The Panama Papers appear to involve a lot more entities than The Paradise Papers.
The Panama Papers were leaked 3 years ago and the impact of the leak continues to grow. Even today many investigations are still ongoing and the amount of money that has been recovered keeps growing.
All the leaks are made publicly accessible by the ICIJ in the form of Datashare, a search tool. It is important to note that the ICIJ's work is fully funded by donations.