Scraping the List of Data Breaches from Wikipedia

Keli
5 min read · Jan 24, 2021

A few weeks ago, I received two have-I-been-pwned notifications in a single week. That means my personal data (my email and password) was leaked through two separate data breach incidents. In total, I have now been “pwned” eight times through various breaches. I got curious about how many data breaches have actually happened. A list of data breaches going back to 2004 has been put together through collective effort on Wikipedia. As I scrolled down and down the exhaustive list, I wondered what insights could possibly be drawn from this table. This simple table could address a few questions: What is the most common method used to breach data? Which industries suffer the most data breaches? Are there more or fewer breach incidents as the years go by? I used Scrapy to scrape the list for a quick exploratory analysis. You can find the code in the last section.

Partial list of data breaches. Source: Wikipedia

What are the most common ways data breaches occur?

Not surprisingly, hacking is by far the most frequently used method of breaching data. Poor security is the second most common cause. In those cases, data is exposed not necessarily by a malicious attack but by, for example, improper handling or transmission. Lost or stolen media (portable disk drives, laptops, etc.) is third on the list. Fourth is the accidental publication of data, usually by internal employees. Note that this has happened more often than deliberate internal theft of data, which is fifth on the list.

Which industries are the most affected?

The web industry is the most affected. “Web” as an industry is broad and vague; looking at the list, the companies categorized as web could be further divided into more specific industries. What they have in common is that they all offer their services or products through the Internet. Healthcare is the second most affected industry, followed by finance, government, and retail.

Are there more and more data breaches each year?

2011 was the year with the most recorded data breaches. The count fell to its lowest point in 2017, bounced back sharply in 2018, kept climbing through 2019, and then dropped in 2020. Although I have received two data breach notifications this year, no incident has been recorded on Wikipedia yet.

Method: Web Scraping by Scrapy

Disclaimer:

  1. Wikipedia allows web scraping as long as the bot does not crawl too fast. See here.
  2. Wikipedia actually offers their entire database for direct download. See here.

Scrapy and BeautifulSoup are two of the most popular web scraping tools today. Compared to BeautifulSoup, Scrapy is a full-fledged framework that supports larger-scale scraping projects. BeautifulSoup alone would be enough for scraping a single table from a single web page; I am using Scrapy because I want to become more familiar with the tool.
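
For comparison, here is a minimal sketch of that single-table route using requests, BeautifulSoup, and pandas. The URL is the Wikipedia list itself; the User-Agent string and the parsing details are illustrative assumptions, not the code used in this post.

from io import StringIO
import requests
from bs4 import BeautifulSoup
import pandas as pd

# Fetch the article, identifying the request with a custom User-Agent (name is arbitrary)
url = "https://en.wikipedia.org/wiki/List_of_data_breaches"
html = requests.get(url, headers={"User-Agent": "data-breach-eda"}).text

# Grab the first wikitable on the page and let pandas parse it into a data frame
table = BeautifulSoup(html, "html.parser").find("table", class_="wikitable")
df = pd.read_html(StringIO(str(table)))[0]

Either route ends with the same table in a data frame; the rest of this post sticks with the Scrapy spider.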

Data collection: Create a Scrapy Spider

Following the Scrapy tutorial, I created a simple spider that scrapes the table from the Wikipedia list of data breaches. The scraped data is then stored as a JSON Lines (.jl) file. The code to create the spider is the following.
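
The original spider was embedded as a gist; the sketch below is reconstructed from how the data is used later in the post. The spider name, the CSS selectors, and the field names (the Wikipedia table columns, including the Method, Type, and Year columns used in the analysis) are assumptions.

import scrapy

class BreachesSpider(scrapy.Spider):
    name = "breaches"
    start_urls = ["https://en.wikipedia.org/wiki/List_of_data_breaches"]
    # A download delay keeps the bot well within Wikipedia's crawling etiquette
    custom_settings = {"DOWNLOAD_DELAY": 1}

    def parse(self, response):
        # Each data row of the wikitable becomes one scraped item
        for row in response.css("table.wikitable tr"):
            cells = row.css("td")
            if len(cells) < 5:  # skip header rows, which contain only <th> cells
                continue
            yield {
                "Entity": " ".join(cells[0].css("::text").getall()).strip(),
                "Year": " ".join(cells[1].css("::text").getall()).strip(),
                "Records": " ".join(cells[2].css("::text").getall()).strip(),
                "Type": " ".join(cells[3].css("::text").getall()).strip(),
                "Method": " ".join(cells[4].css("::text").getall()).strip(),
            }

Running scrapy runspider breaches_spider.py -o breaches.jl (the file names are assumptions) writes one JSON object per table row, which is the file loaded below.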

After the scraped data is stored, it is loaded into a Python notebook as a data frame for further exploration. I used read_json() from pandas to do the job. lines is set to True because each record in the JSON Lines file sits on its own line.

import pandas as pd
df = pd.read_json("breaches.jl", lines=True)

Data Cleaning: Text Preprocessing

In the data frame, some of the values contain extra ‘\n’ characters that need to be removed. The first row is also, for some reason, just a list of None values, so we drop it as well.

df = df.replace('\n','', regex=True)
df = df[1:]

The data after cleaning will look like this.

Data Exploration: Plot the data

A function is created to plot the bar charts. Its arguments are the column, the colour of the bars, and the title of the plot. The function is then called to show the incidents per data breach method and per industry respectively.

from matplotlib import pyplot as plt

def plot_counts_bar(col, colour, title):
    # Plot the top 5 value counts of a column as a horizontal bar chart
    fig, axs = plt.subplots(1, 1, figsize=(10, 6))
    ax = col.value_counts().sort_values(ascending=False)[0:5].plot(kind='barh', color=colour)
    # Annotate each bar with its count
    for i in ax.patches:
        ax.text(i.get_width() + .1, i.get_y() + .1,
                str(round(i.get_width(), 0)), fontsize=10, color=colour)
    plt.title(title)
    plt.xlabel('Counts')

# Incidents per method
plot_counts_bar(df['Method'], "#F35B04", 'Top 5 Data Breach Methods')

# Incidents per industry
plot_counts_bar(df['Type'], "#1653f3", 'Top 5 Affected Industries by Data Breach')

Finally, a new data frame df_year is created with two columns: the year and the count of incidents in that year. It is used to draw a line chart showing the trend in incident frequency from 2004 to 2020.

# New data frame
df_year = df["Year"].value_counts().rename_axis('year').reset_index(name='counts')
df_year = df_year.sort_values(by='year')
# Line chart
fig, axs = plt.subplots(1, 1, figsize=(10, 6))
plt.plot(df_year["year"], df_year["counts"], marker='o', color = "#3DDC97")
plt.title('Data Breach Incident Counts Per Year')
plt.xlabel('Year')
plt.ylabel('Counts')
plt.grid()
plt.show()


Keli

Writing the good, the bad, and the ugly about data. | kelichiu.com