python tip

Determining the Percentage of Missing Data

For illustration, I'm going to use a synthetic dataset with the contact information of 500 fictitious subjects from the US. Let's imagine that this is our client base. Here's what the dataset looks like:

clients.head()

As you can see, it includes information on each person's first name, last name, company name, address, city, county, state, zip code, phone numbers, email, and web address. Our first task is to check for missing data. You can use clients.info() to get an overview of the number of complete entries in each of the columns. However, if you want a clearer picture, here's how you can get the percentage of missing entries for each of the features in descending order:

# Getting percentange of missing data for each column
(clients.isnull().sum()/clients.isnull().count()).sort_values(ascending=False)

As you may recall, isnull() returns an array of True and False values that indicate whether a given entry is present or missing, respectively. In addition, True is considered as 1 and False is considered as 0 when we pass this boolean object to mathematical operations. Thus, clients.isnull().sum() gives us the number of missing values in each of the columns (the number of True values), while clients.isnull().count() is the total number of values in each column. After we divide the first value by the second and sort our results in descending order, we get the percentage of missing data entries for each column, starting with the column that has the most missing values. In our example, we see that we miss the second phone number for 51.6% of our clients.