Cleaning Data: Missing data, Sample 2

Introduction

We will be working with the unicorn companies dataset, once again. Our tasks would be as follows.

The investor has asked us to provide them with the following data:

  • Companies in the hardware industry based in Beijing, San Francisco, and London.
  • Companies in the artificial intelligence industry based in London.
  • A list of the top 20 countries sorted by sum of company valuations in each country, excluding United States, China, India, and United Kingdom.
  • A global valuation map of all countries except United States, China, India, and United Kingdom.

1. Imports

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import plotly.express as px
import seaborn as sns

# Read the data into a dataframe
df_companies = pd.read_csv('Unicorn_Companies.csv')

2. Data exploration

df_companies.head()
CompanyValuationDate JoinedIndustryCityCountry/RegionContinentYear FoundedFundingSelect Investors
0Bytedance$180B4/7/17Artificial intelligenceBeijingChinaAsia2012$8BSequoia Capital China, SIG Asia Investments, S…
1SpaceX$100B12/1/12OtherHawthorneUnited StatesNorth America2002$7BFounders Fund, Draper Fisher Jurvetson, Rothen…
2SHEIN$100B7/3/18E-commerce & direct-to-consumerShenzhenChinaAsia2008$2BTiger Global Management, Sequoia Capital China…
3Stripe$95B1/23/14FintechSan FranciscoUnited StatesNorth America2010$2BKhosla Ventures, LowercaseCapital, capitalG
4Klarna$46B12/12/11FintechStockholmSwedenEurope2005$4BInstitutional Venture Partners, Sequoia Capita…
df_companies.shape
(1074, 10)
df_companies.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1074 entries, 0 to 1073
Data columns (total 10 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Company 1074 non-null object
1 Valuation 1074 non-null object
2 Date Joined 1074 non-null object
3 Industry 1074 non-null object
4 City 1058 non-null object
5 Country/Region 1074 non-null object
6 Continent 1074 non-null object
7 Year Founded 1074 non-null int64
8 Funding 1074 non-null object
9 Select Investors 1073 non-null object
dtypes: int64(1), object(9)
memory usage: 84.0+ KB

Columns City and Select Investors contain fewer non-null values than the total number of rows in the dataset, which indicates that they are missing values.

df_companies.describe()
Year Founded
count1074.000000
mean2012.895717
std5.698573
min1919.000000
25%2011.000000
50%2014.000000
75%2016.000000
max2021.000000

The oldest company in the list was founded in 1919. This is the minimum value in Year Funded.

Data Preprocessing

Let’s create a column for just the year each company became a unicorn company.

# Create a new column `Year Joined`
df_companies['Year_Joined'] = pd.to_datetime(df_companies['Date Joined']).dt.year

For each country, we want to calculate the sum of all valuations of companies from that country. However, we’ll need to first prepare the data. The Valuation column needs to be converted to a numeric datatype.

We’ll achieve this by defining a function and using the string strip() method. This method is applied to a string. Its argument is a string that contains all the characters that you want to remove from the beginning and end of a given string—in any order.

The specified characters will be removed until a valid character is encountered. This process is applied moving forward from the beginning of the string and also moving in reverse from the end of the string, thus removing unwanted beginning and trailing characters.

Example:

[IN]:  my_string = '#....... Section 3.2.1 Issue #32 .......'
my_string = my_string.strip('.#! ')
print(my_string)

[OUT]: 'Section 3.2.1 Issue #32'

Note that we must reassign the result back to a variable or else the change will not be permanent.

# Define the `str_to_num()` function
def str_to_num(x):
   x = x.strip('$B')
   x = int(x)
​
   return x

Now we can create a new column that represents the Valuation column as an integer value. To do this, we’ll use the series method apply() to apply the str_to_num() function to the Valuation column.

apply() is a method that can be used on a DataFrame or Series object. In this case, we’re using it on the Valuation series. The method accepts a function as an argument and applies that function to each value in the series.

Example:

[IN]: def square(x):
return x ** 2

my_series = pd.Series([0, 1, 2, 3])
my_series

[OUT]: 0 0
1 1
2 2
3 3
dtype: int64

[IN]: my_series = my_series.apply(square)
my_series

[OUT]: 0 0
1 1
2 4
3 9
dtype: int64

Notice that the function passed as an argument to the apply() method does not have parentheses. It’s just the function name.

# Apply the `str_to_num()` function to the `Valuation` column
# and assign the result back to a new column called `valuation_num`
df_companies['valuation_num'] = df_companies['Valuation'].apply(str_to_num)
df_companies[['Valuation', 'valuation_num']].head()
Valuationvaluation_num
0$180B180
1$100B100
2$100B100
3$95B95
4$46B46
Find missing values

The unicorn companies dataset is fairly clean, with few missing values.

# Find the number of missing values in each column in this dataset.
df_companies.isna().sum()
Company              0
Valuation 0
Date Joined 0
Industry 0
City 16
Country/Region 0
Continent 0
Year Founded 0
Funding 0
Select Investors 1
Year_Joined 0
valuation_num 0
dtype: int64

There is a single missing value in the Select Investors column and 16 missing cities. There are no missing values in other columns.

Review rows with missing values

Before dealing with missing values, it’s important to understand the nature of the missing value that is being filled. That’s why we’ll display all rows with missing values from df_companies in 3 steps.

# Step 1: Apply the `isna()` method to the `df_companies` dataframe 
# and assign back to `mask`
mask = df_companies.isna()
mask.tail()
CompanyValuationDate JoinedIndustryCityCountry/RegionContinentYear FoundedFundingSelect InvestorsYear_Joinedvaluation_num
1069FalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalse
1070FalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalse
1071FalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalse
1072FalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalse
1073FalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalseFalse

Now we’ll need to go from this dataframe of Boolean values to a dataframe of just the rows of df_companies that contain at least one NaN value. This means that we need a way to find the indices of the rows of the Boolean dataframe that contain at least one True value, then extract those indices from df_companies.

We can do this using the any() method for DataFrame objects. This method returns a Boolean Series indicating whether any value is True over a specified axis.

Example:

df =   
A B C
0 0 a 10
1 False 0 1
2 NaN NaN NaN

[IN]: df.any(axis=0)

[OUT]: A False
B True
C True
dtype: bool

[IN]: df.any(axis=1)

[OUT]: 0 True
1 True
2 False
dtype: bool  

Note that 0, False, and NaN are considered False and anything else is considered True.

# Step 2: Apply the `any()` method to `mask` 
# and assign the results back to `mask`
​mask = mask.any(axis=1)
mask.head()
0    False
1    False
2    False
3    False
4    False
dtype: bool

Because mask is now a series of Boolean values, we can use it as a Boolean mask. We’ll apply the Boolean mask to the df_companies dataframe to return a filtered dataframe containing just the rows that contain a missing value.

# Step 3: Apply `mask` as a Boolean mask to `df_companies` 
# and assign results to `df_missing_rows`
​df_missing_rows = df_companies[mask]
df_missing_rows
CompanyValuationDate JoinedIndustryCityCountry/RegionContinentYear FoundedFundingSelect InvestorsYear_Joinedvaluation_num
12FTX$32B7/20/21FintechNaNBahamasNorth America2018$2BSequoia Capital, Thoma Bravo, Softbank202132
170HyalRoute$4B5/26/20Mobile & telecommunicationsNaNSingaporeAsia2015$263MKuang-Chi20204
242Moglix$3B5/17/21E-commerce & direct-to-consumerNaNSingaporeAsia2015$471MJungle Ventures, Accel, Venture Highway20213
251Trax$3B7/22/19Artificial intelligenceNaNSingaporeAsia2010$1BHopu Investment Management, Boyu Capital, DC T…20193
325Amber Group$3B6/21/21FintechNaNHong KongAsia2015$328MTiger Global Management, Tiger Brokers, DCM Ve…20213
382Ninja Van$2B9/27/21Supply chain, logistics, & deliveryNaNSingaporeAsia2014$975MB Capital Group, Monk’s Hill Ventures, Dynamic…20212
541Advance Intelligence Group$2B9/23/21Artificial intelligenceNaNSingaporeAsia2016$536MVision Plus Capital, GSR Ventures, ZhenFund20212
629LinkSure Network$1B1/1/15Mobile & telecommunicationsShanghaiChinaAsia2013$52MNaN20151
811Carousell$1B9/15/21E-commerce & direct-to-consumerNaNSingaporeAsia2012$288M500 Global, Rakuten Ventures, Golden Gate Vent…20211
848Matrixport$1B6/1/21FintechNaNSingaporeAsia2019$100MDragonfly Captial, Qiming Venture Partners, DS…20211
880bolttech$1B7/1/21FintechNaNSingaporeAsia2018$210MMundi Ventures, Doqling Capital Partners, Acti…20211
889Carro$1B6/14/21E-commerce & direct-to-consumerNaNSingaporeAsia2015$595MSingTel Innov8, Alpha JWC Ventures, Golden Gat…20211
893Cider$1B9/2/21E-commerce & direct-to-consumerNaNHong KongAsia2020$140MAndreessen Horowitz, DST Global, IDG Capital20211
980NIUM$1B7/13/21FintechNaNSingaporeAsia2014$285MVertex Ventures SE Asia, Global Founders Capit…20211
986ONE$1B12/8/21Internet software & servicesNaNSingaporeAsia2011$515MTemasek, Guggenheim Investments, Qatar Investm…20211
994PatSnap$1B3/16/21Internet software & servicesNaNSingaporeAsia2007$352MSequoia Capital China, Shunwei Capital Partner…20211
1061WeLab$1B11/8/17FintechNaNHong KongAsia2013$871MSequoia Capital China, ING, Alibaba Entreprene…20171

Twelve of the 17 rows with missing values are for companies from Singapore.

At this point, we can ask the business users for insight into the causes of missing values and, if possible, get domain knowledge to intelligently impute these values.

3. Model building

Two ways to address missing values

There are several ways to address missing values, which is critical in EDA. The two primary methods are removing them and imputing other values in their place. Choosing the proper method depends on the business problem and the value the solution will add or take away from the dataset. Here, we will try both.

To compare the effect of different actions, first we’ll store the original number of values in a variable. 

# Store the total number of values in a variable called `count_total`
​count_total = df_companies.size
count_total
12888

Now, we’ll remove all rows containing missing values.

# Drop the rows containing missing values, 
# determine number of remaining values
count_dropna_rows = df_companies.dropna().size
count_dropna_rows
12684

Now, let’s remove all columns containing missing values.

# Drop the columns containing missing values, 
# determine number of remaining values
count_dropna_columns = df_companies.dropna(axis=1).size
count_dropna_columns
10740

Next, print the percentage of values removed by each method and compare them.

# Print the percentage of values removed by dropping rows.
row_percent = ((count_total - count_dropna_rows) / count_total) * 100
print(f'Percentage removed, rows: {row_percent:.3f}')

# Print the percentage of values removed by dropping columns.
col_percent = ((count_total - count_dropna_columns) / count_total) * 100
print(f'Percentage removed, columns: {col_percent:.3f}')
Percentage removed, rows: 1.583
Percentage removed, columns: 16.667
Which method was most effective? 

The percentage removed was significantly higher for columns than it was for rows. Since both approaches result in a dataset with no missing values, the “most effective” method depends on how much data we have and what we want to do with it.

It might be best to use the way that leaves the most data intact—in this case, dropping rows. Or, if we don’t have many samples and don’t want to lose any, but we don’t need all our columns, then dropping columns might be best. With this data, it would probably be best to drop rows in the majority of cases.

Imputation

Now, let’s practice the second method: imputation.

We’ll use the fillna() method to fill each missing value with the next non-NaN value in its column. 

Example:

df =    
A B C
0 5 a NaN
1 10 NaN False
2 NaN c True

[IN]: df.fillna(method='backfill')

[OUT]:
A B C
0 5 a False
1 10 c False
2 NaN c True 

Notice that if there is a NaN value in the last row, it will not backfill because there is no subsequent value in the column to refer to.

# 1. Fill missing values using the 'fillna()' method, back-filling
df_companies_backfill = df_companies.fillna(method='backfill')

# 2. Show the rows that previously had missing values
​df_companies_backfill.iloc[df_missing_rows.index, :]
CompanyValuationDate JoinedIndustryCityCountry/RegionContinentYear FoundedFundingSelect InvestorsYear_Joinedvaluation_num
12FTX$32B7/20/21FintechJacksonvilleBahamasNorth America2018$2BSequoia Capital, Thoma Bravo, Softbank202132
170HyalRoute$4B5/26/20Mobile & telecommunicationsEl SegundoSingaporeAsia2015$263MKuang-Chi20204
242Moglix$3B5/17/21E-commerce & direct-to-consumerSan FranciscoSingaporeAsia2015$471MJungle Ventures, Accel, Venture Highway20213
251Trax$3B7/22/19Artificial intelligenceAmsterdamSingaporeAsia2010$1BHopu Investment Management, Boyu Capital, DC T…20193
325Amber Group$3B6/21/21FintechSan FranciscoHong KongAsia2015$328MTiger Global Management, Tiger Brokers, DCM Ve…20213
382Ninja Van$2B9/27/21Supply chain, logistics, & deliverySan FranciscoSingaporeAsia2014$975MB Capital Group, Monk’s Hill Ventures, Dynamic…20212
541Advance Intelligence Group$2B9/23/21Artificial intelligenceHelsinkiSingaporeAsia2016$536MVision Plus Capital, GSR Ventures, ZhenFund20212
629LinkSure Network$1B1/1/15Mobile & telecommunicationsShanghaiChinaAsia2013$52MSequoia Capital India, The Times Group, GMO Ve…20151
811Carousell$1B9/15/21E-commerce & direct-to-consumerNew YorkSingaporeAsia2012$288M500 Global, Rakuten Ventures, Golden Gate Vent…20211
848Matrixport$1B6/1/21FintechSan FranciscoSingaporeAsia2019$100MDragonfly Captial, Qiming Venture Partners, DS…20211
880bolttech$1B7/1/21FintechEnglewoodSingaporeAsia2018$210MMundi Ventures, Doqling Capital Partners, Acti…20211
889Carro$1B6/14/21E-commerce & direct-to-consumerLincolnSingaporeAsia2015$595MSingTel Innov8, Alpha JWC Ventures, Golden Gat…20211
893Cider$1B9/2/21E-commerce & direct-to-consumerMexico CityHong KongAsia2020$140MAndreessen Horowitz, DST Global, IDG Capital20211
980NIUM$1B7/13/21FintechBengaluruSingaporeAsia2014$285MVertex Ventures SE Asia, Global Founders Capit…20211
986ONE$1B12/8/21Internet software & servicesNew YorkSingaporeAsia2011$515MTemasek, Guggenheim Investments, Qatar Investm…20211
994PatSnap$1B3/16/21Internet software & servicesLondonSingaporeAsia2007$352MSequoia Capital China, Shunwei Capital Partner…20211
1061WeLab$1B11/8/17FintechBeijingHong KongAsia2013$871MSequoia Capital China, ING, Alibaba Entreprene…20171

The values that were used to fill in for the missing values doesn’t make sense. The values seem to be added without consideration of the country those cities are located in.

Another option is to fill the values with a certain value, such as ‘Unknown’. However, doing so doesn’t add any value to the dataset and could make finding the missing values difficult in the future. Reviewing the missing values in this dataset determines that it is fine to leave the values as they are. This also avoids adding bias to the dataset.

4. Results and evaluation

Now that we’ve addressed our missing values, we can provide our investor with their requested data points.

Companies in the Hardware Industry

Remember our investor is interested in identifying unicorn companies in the Hardware industry in the following cities: Beijing, San Francisco, and London. They are also interested in companies in the Artificial intelligence industry in London.

We’ll write a selection statement that extracts the rows that meet these criteria. We’ll use the isin() Series method. This method is applied to a pandas series and, for each value in the series, checks whether it is a member of whatever is passed as its argument.

Example:

[IN]: my_series = pd.Series([0, 1, 2, 3])
my_series

[OUT]: 0 0
1 1
2 2
3 3
dtype: int64

[IN]: my_series.isin([1, 2])

[OUT]: 0 False
1 True
2 True
3 False
dtype: bool
# 1. Create a Boolean mask using conditional logic
cities = ['Beijing', 'San Francisco', 'London']
mask = (
   (df_companies['Industry']=='Hardware') & (df_companies['City'].isin(cities))
) | (
   (df_companies['Industry']=='Artificial intelligence') & (df_companies['City']=='London')
)
​
# 2. Apply the mask to the `df_companies` dataframe 
# and assign the results to `df_invest`
​df_invest = df_companies[mask]
df_invest
CompanyValuationDate JoinedIndustryCityCountry/RegionContinentYear FoundedFundingSelect InvestorsYear_Joinedvaluation_num
36Bitmain$12B7/6/18HardwareBeijingChinaAsia2015$765MCoatue Management, Sequoia Capital China, IDG …201812
43Global Switch$11B12/22/16HardwareLondonUnited KingdomEurope1998$5BAviation Industry Corporation of China, Essenc…201611
147Chipone$5B12/16/21HardwareBeijingChinaAsia2008$1BChina Grand Prosperity Investment, Silk Road H…20215
845Density$1B11/10/21HardwareSan FranciscoUnited StatesNorth America2014$217MFounders Fund, Upfront Ventures, 01 Advisors20211
873BenevolentAI$1B6/2/15Artificial intelligenceLondonUnited KingdomEurope2013$292MWoodford Investment Management20151
923Geek+$1B11/21/18HardwareBeijingChinaAsia2015$439MVolcanics Ventures, Vertex Ventures China, War…20181
1040TERMINUS Technology$1B10/25/18HardwareBeijingChinaAsia2015$623MChina Everbright Limited, IDG Capital, iFLYTEK20181
1046Tractable$1B6/16/21Artificial intelligenceLondonUnited KingdomEurope2014$120MInsight Partners, Ignition Partners, Georgian …20211

As seen above, eight companies meet the stated criteria.

List of countries by sum of valuation

For each country, we will sum the valuations of all companies in that country, then sort the results in descending order by summed valuation. 

# Group the data by`Country/Region`
national_valuations = df_companies.groupby(['Country/Region'])['valuation_num'].sum()
.sort_values(ascending=False).reset_index()
​
# Print the top 15 values of the DataFrame.
​national_valuations.head(15)
Country/Regionvaluation_num
0United States1933
1China696
2India196
3United Kingdom195
4Germany72
5Sweden63
6Australia56
7France55
8Canada49
9South Korea41
10Israel39
11Brazil37
12Bahamas32
13Indonesia28
14Singapore21

The sorted data indicates that the four countries with highest total company valuations are the United States, China, India, and the United Kingdom. However, our investor specified that these countries should not be included in the list because they are outliers.

Filter out top 4 outlying countries
# Remove outlying countries
national_valuations_no_big4 = national_valuations.iloc[4:, :]
​national_valuations_no_big4.head()
Country/Regionvaluation_num
4Germany72
5Sweden63
6Australia56
7France55
8Canada49

Alternative approach

We can also use isin() to create a Boolean mask to filter out specific values of the Country/Region column. In this case, this process is longer and more complicated than simply using the iloc[] statement. However, there will be situations where this is the most direct approach.

# Use `isin()` to create a Boolean mask to accomplish the same task
mask = ~national_valuations['Country/Region'].isin(['United States', 'China', 'India', 'United Kingdom'])
national_valuations_no_big4 = national_valuations[mask]
national_valuations_no_big4.head()
Country/Regionvaluation_num
4Germany72
5Sweden63
6Australia56
7France55
8Canada49

Notice that we wanted to filter out countries 1-4 and keep countries 5-46. Listing 42 countries wouldn’t be practical, so we inverted the statement with ~.

Create barplot for top 20 non-big-4 countries

Now, the data is ready to reveal the top 20 non-big-4 countries with the highest total company valuations. 

# Create a barplot to compare the top 20 non-big-4 countries 
# with highest company valuations
​sns.barplot(data=national_valuations_no_big4.head(20),
           y='Country/Region',
           x='valuation_num')
plt.title('Top 20 non-big-4 countries by total company valuation')
plt.show()
Plot maps

Our investor has also asked for a global valuation map of all countries except United States, China, India, and United Kingdom (a.k.a. “big-four countries”).

As we did before, we’ll create a scatter_geo() plot that depicts the total valuations of each non-big-four country on a world map, where each valuation is shown as a circle on the map, and the size of the circle is proportional to that country’s summed valuation.

# Plot the sum of valuations per country.
data = national_valuations_no_big4
px.scatter_geo(data,
              locations='Country/Region',
              size='valuation_num',
              locationmode='country names',
              color='Country/Region',
              title='Total company valuations by country (non-big-four)')
  • Valuation sum per country is visualized by the size of circles around the map.
  • Europe has a lot of unicorn companies in a concentrated area.

Conclusion

How would we present our findings from this lab to others? 

  • For the industry specific companies in certain locations, we could provide a short list of company names and locations.
  • For the top 20 countries by sum of valuations, we could use the plot we created in this lab or share a list.
  • For the top 20 countries sorted by sum of company valuations in each country, we would exclude United States, China, India, and United Kingdom.
  • For the questions concerning the valuation map, in addition to our visuals, we would provide a short summary of the data points. This is because the investor did not request a further breakdown of this data.

Reference for this work: Bhat, M.A. Unicorn Companies


In

,