Data Validation and Modifying Data

Introduction 

Here we will use input validation and label encoding to prepare a dataset for analysis. These are fundamental techniques used in all types of data analysis, from simple linear regression to complex neural networks. 

We will work again on the Unicorn dataset, playing the role of a data professional at an investment firm that is attempting to invest in private companies with a valuation of at least $1 billion. These companies are often known as “unicorns.”

Our client wants to develop a better understanding of unicorns, with the hope they can be early investors in future highly successful companies. They are particularly interested in the investment strategies of the three top unicorn investors: Sequoia Capital, Tiger Global Management, and Accel.

Imports and Load

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Run this cell so pandas displays all columns
pd.set_option('display.max_columns', None)

companies = pd.read_csv('Modified_Unicorn_Companies.csv')
companies.head()
| | Company | Valuation | Date Joined | Industry | City | Country/Region | Continent | Year Founded | Funding | Select Investors |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Bytedance | 180 | 2017-04-07 | Artificial intelligence | Beijing | China | Asia | 2012 | $8B | Sequoia Capital China, SIG Asia Investments, S… |
| 1 | SpaceX | 100 | 2012-12-01 | Other | Hawthorne | United States | North America | 2002 | $7B | Founders Fund, Draper Fisher Jurvetson, Rothen… |
| 2 | SHEIN | 100 | 2018-07-03 | E-commerce & direct-to-consumer | Shenzhen | China | Asia | 2008 | $2B | Tiger Global Management, Sequoia Capital China… |
| 3 | Stripe | 95 | 2014-01-23 | FinTech | San Francisco | United States | North America | 2010 | $2B | Khosla Ventures, LowercaseCapital, capitalG |
| 4 | Klarna | 46 | 2011-12-12 | Fintech | Stockholm | Sweden | Europe | 2005 | $4B | Institutional Venture Partners, Sequoia Capita… |
companies.dtypes
Company             object
Valuation            int64
Date Joined         object
Industry            object
City                object
Country/Region      object
Continent           object
Year Founded         int64
Funding             object
Select Investors    object
dtype: object

Modify the data

# Apply necessary datatype conversions.
companies['Date Joined'] = pd.to_datetime(companies['Date Joined'])

# Create the column Years To Unicorn.
companies['Years To Unicorn'] = companies['Date Joined'].dt.year - companies['Year Founded']
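As a quick sketch of this step on toy data (the company names and dates below are made up for illustration), converting Date Joined to datetime lets us subtract the founding year directly:

```python
import pandas as pd

# Hypothetical mini-DataFrame mirroring the relevant columns
df = pd.DataFrame({
    'Company': ['Acme', 'Globex'],
    'Date Joined': ['2017-04-07', '2012-12-01'],
    'Year Founded': [2012, 2002],
})

# Convert to datetime, then subtract the founding year from the join year
df['Date Joined'] = pd.to_datetime(df['Date Joined'])
df['Years To Unicorn'] = df['Date Joined'].dt.year - df['Year Founded']
print(df['Years To Unicorn'].tolist())  # → [5, 10]
```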

Input validation 

The data has some issues with bad data, duplicate rows, and inconsistent Industry labels. We will identify and correct each of these issues.

Correcting bad data
companies['Years To Unicorn'].describe()
count    1074.000000
mean        7.013035
std         5.331842
min        -3.000000
25%         4.000000
50%         6.000000
75%         9.000000
max        98.000000
Name: Years To Unicorn, dtype: float64

Let’s isolate all rows where the Years To Unicorn column contains a negative value to examine them.

companies[companies['Years To Unicorn'] < 0]
| | Company | Valuation | Date Joined | Industry | City | Country/Region | Continent | Year Founded | Funding | Select Investors | Years To Unicorn |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 527 | InVision | 2 | 2017-11-01 | Internet software & services | New York | United States | North America | 2020 | $349M | FirstMark Capital, Tiger Global Management, IC… | -3 |

There is a single row that has a negative value in the Years To Unicorn column. Let’s imagine that we did an internet search revealing that InVision was founded in 2011. Now we can replace the value at Year Founded with 2011 for InVision’s row.

# Replace InVision's `Year Founded` value with 2011
companies.loc[companies['Company']=='InVision', 'Year Founded'] = 2011

# Verify the change was made properly
companies[companies['Company']=='InVision']
| | Company | Valuation | Date Joined | Industry | City | Country/Region | Continent | Year Founded | Funding | Select Investors | Years To Unicorn |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 527 | InVision | 2 | 2017-11-01 | Internet software & services | New York | United States | North America | 2011 | $349M | FirstMark Capital, Tiger Global Management, IC… | -3 |

Now we need to recalculate all the values in the Years To Unicorn column so that the negative value for InVision is corrected, and then verify that no negative values remain.

# Recalculate all values in the `Years To Unicorn` column
companies['Years To Unicorn'] = companies['Date Joined'].dt.year - companies['Year Founded']

# Verify that there are no more negative values in the column
companies['Years To Unicorn'].describe()
count    1074.000000
mean        7.021415
std         5.323155
min         0.000000
25%         4.000000
50%         6.000000
75%         9.000000
max        98.000000
Name: Years To Unicorn, dtype: float64
Issues with Industry labels

The client provided us with the following list of expected industry labels for the Industry column. We’ll assume that any label in Industry that does not appear in industry_list is a misspelling.

# List provided by the company of the expected industry labels in the data
industry_list = ['Artificial intelligence', 'Other', 'E-commerce & direct-to-consumer', 'Fintech',
                 'Internet software & services', 'Supply chain, logistics, & delivery',
                 'Consumer & retail', 'Data management & analytics', 'Edtech', 'Health',
                 'Hardware', 'Auto & transportation', 'Travel', 'Cybersecurity',
                 'Mobile & telecommunications']

First, let’s check if there are values in the Industry column that are not in industry_list. If so, what are they?

# Check which values are in `Industry` but not in `industry_list`
set(companies['Industry']) - set(industry_list)
{'Artificial Intelligence', 'Data management and analytics', 'FinTech'}

‘Artificial Intelligence’, ‘Data management and analytics’, and ‘FinTech’ are misspellings that are currently in the Industry column.

Now, we’ll correct the bad entries in the Industry column by replacing them with an approved string from industry_list. We will use the replace() Series method on the Industry series.

When we pass a dictionary to the method, it replaces each value in the series that matches one of the dictionary’s keys with that key’s corresponding value. Any value not listed as a key in the dictionary is left unchanged.

Example:

[IN]:
column_a = pd.Series(['A', 'B', 'C', 'D'])
column_a

[OUT]:
0    A
1    B
2    C
3    D
dtype: object

[IN]:
replacement_dict = {'A': 'z', 'B': 'y', 'C': 'x'}
column_a = column_a.replace(replacement_dict)
column_a

[OUT]:
0    z
1    y
2    x
3    D
dtype: object

# 1. Create `replacement_dict`
replacement_dict = {'Artificial Intelligence': 'Artificial intelligence',
                   'Data management and analytics': 'Data management & analytics',
                   'FinTech': 'Fintech'
                   }

# 2. Replace the incorrect values in the `Industry` column
companies['Industry'] = companies['Industry'].replace(replacement_dict)

# 3. Verify that there are no longer any elements in `Industry` that are not in `industry_list`
set(companies['Industry']) - set(industry_list)
set()
Handling duplicate rows

The business mentioned that no company should appear in the data more than once.

# Isolate rows of all companies that have duplicates
companies[companies.duplicated(subset=['Company'], keep=False)]
| | Company | Valuation | Date Joined | Industry | City | Country/Region | Continent | Year Founded | Funding | Select Investors | Years To Unicorn |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 385 | BrewDog | 2 | 2017-04-10 | Consumer & retail | Aberdeen | United Kingdom | Europe | 2007 | $233M | TSG Consumer Partners, Crowdcube | 10 |
| 386 | BrewDog | 2 | 2017-04-10 | Consumer & retail | Aberdeen | United Kingdom | Europe | 2007 | $233M | TSG Consumer Partners | 10 |
| 510 | ZocDoc | 2 | 2015-08-20 | Health | New York | United States | North America | 2007 | $374M | Founders Fund, Khosla Ventures, Goldman Sachs | 8 |
| 511 | ZocDoc | 2 | 2015-08-20 | Health | NaN | United States | North America | 2007 | $374M | Founders Fund | 8 |
| 1031 | SoundHound | 1 | 2018-05-03 | Artificial intelligence | Santa Clara | United States | North America | 2005 | $215M | Tencent Holdings, Walden Venture Capital, Glob… | 13 |
| 1032 | SoundHound | 1 | 2018-05-03 | Other | Santa Clara | United States | North America | 2005 | $215M | Tencent Holdings | 13 |

The duplicated rows are not legitimate entries: rather than different companies that happen to share a name, they are the same company represented twice with minor variations.

Let’s keep the first occurrence of each duplicate company and drop the subsequent rows that are copies.

# Drop rows of duplicate companies after their first occurrence
companies = companies.drop_duplicates(subset=['Company'], keep='first')
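A minimal sketch of how keep='first' behaves, using a made-up two-company frame: the first occurrence of each duplicated Company survives, and a follow-up duplicated() check confirms nothing is left:

```python
import pandas as pd

# Hypothetical frame with one duplicated company
df = pd.DataFrame({
    'Company': ['BrewDog', 'BrewDog', 'ZocDoc'],
    'Valuation': [2, 2, 2],
})

# keep='first' retains the first occurrence of each duplicated Company
deduped = df.drop_duplicates(subset=['Company'], keep='first')
print(len(deduped))                                  # → 2
print(deduped.duplicated(subset=['Company']).sum())  # → 0
```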

Convert numerical data to categorical data

Sometimes, we’ll want to simplify a numeric column by converting it to a categorical column. A common approach is to divide the values into a defined number of bins—either intervals of equal width (pd.cut) or quantile groups containing equal numbers of rows (pd.qcut)—and assign each bin a label.

Create a High Valuation column

The data in the Valuation column represents how much money (in billions, USD) each company is valued at. We’ll use the Valuation column to create a new column called High Valuation: for each company, the value in this column should be low if the company is in the bottom 50% of company valuations and high if it is in the top 50%.

# Create new `High Valuation` column
# Use qcut to divide Valuation into 'high' and 'low' Valuation groups
companies['High Valuation'] = pd.qcut(companies['Valuation'], 2, labels = ['low', 'high'])
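Note that pd.qcut forms groups containing equal numbers of rows (quantile groups), which is what we want here, rather than the equal-width intervals of pd.cut. A small sketch on made-up, skewed values shows the difference:

```python
import pandas as pd

vals = pd.Series([1, 2, 3, 4, 100])  # hypothetical skewed valuations

# pd.cut: equal-width bins over the value range; the outlier dominates
by_width = pd.cut(vals, 2, labels=['low', 'high'])

# pd.qcut: quantile bins; the rows are split at the median
by_count = pd.qcut(vals, 2, labels=['low', 'high'])

print(by_width.tolist())  # → ['low', 'low', 'low', 'low', 'high']
print(by_count.tolist())  # → ['low', 'low', 'low', 'high', 'high']
```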

Convert categorical data to numerical data

Three common methods for changing categorical data to numerical are:

  1. Label encoding: order matters (ordinal numeric labels)
  2. Label encoding: order doesn’t matter (nominal numeric labels)
  3. Dummy encoding: order doesn’t matter (creation of binary columns for each possible category contained in the variable)

The decision on which method to use depends on the context and must be made on a case-by-case basis. However, a distinction is typically made between categorical variables whose possible categories carry equal weight and variables whose categories have a hierarchical structure of importance.

For example, a variable called subject might have possible values of history, mathematics, literature. In this case, each subject might be nominal—given the same level of importance. However, we might have another variable called class, whose possible values are freshman, sophomore, junior, senior. In this case, the class variable is ordinal—its values have an ordered, hierarchical structure of importance.

Machine learning models typically need all data to be numeric, and they generally use ordinal label encoding (method 1) and dummy encoding (method 3).
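A brief sketch of the subject/class example above (values invented for illustration): an ordered pd.Categorical yields ordinal codes that respect the hierarchy, while pd.get_dummies treats nominal values symmetrically:

```python
import pandas as pd

# Ordinal: `class` has an inherent order, so encode it with explicit ordering
class_col = pd.Series(['junior', 'freshman', 'senior', 'sophomore'])
order = ['freshman', 'sophomore', 'junior', 'senior']
class_cat = pd.Categorical(class_col, categories=order, ordered=True)
print(list(class_cat.codes))  # → [2, 0, 3, 1]

# Nominal: `subject` has no order; dummy encoding gives each value its own column
subject_col = pd.Series(['history', 'mathematics', 'literature'])
print(pd.get_dummies(subject_col).columns.tolist())
# → ['history', 'literature', 'mathematics']
```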

I discussed a little more about these issues in the post titled ‘Transforming Data: Categorical Data and Label Encoding’.

Convert Continent to numeric

We will suppose that the investment group has specified that they want to give more weight to continents with fewer unicorn companies because they believe this could indicate unrealized market potential.

This would make Continent an ordinal variable, since more importance is placed on continents with fewer unicorn companies. There is a hierarchy of importance.

# Rank the continents by number of unicorn companies
companies['Continent'].value_counts()
North America    586
Asia             310
Europe           143
South America     21
Oceania            8
Africa             3
Name: Continent, dtype: int64
# Create numeric `Continent Number` column
continent_dict = {'North America': 1,
                  'Asia': 2,
                  'Europe': 3,
                  'South America': 4,
                  'Oceania': 5,
                  'Africa': 6
                 }
companies['Continent Number'] = companies['Continent'].replace(continent_dict)
companies.head()
| | Company | Valuation | Date Joined | Industry | City | Country/Region | Continent | Year Founded | Funding | Select Investors | Years To Unicorn | High Valuation | Continent Number |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Bytedance | 180 | 2017-04-07 | Artificial intelligence | Beijing | China | Asia | 2012 | $8B | Sequoia Capital China, SIG Asia Investments, S… | 5 | high | 2 |
| 1 | SpaceX | 100 | 2012-12-01 | Other | Hawthorne | United States | North America | 2002 | $7B | Founders Fund, Draper Fisher Jurvetson, Rothen… | 10 | high | 1 |
| 2 | SHEIN | 100 | 2018-07-03 | E-commerce & direct-to-consumer | Shenzhen | China | Asia | 2008 | $2B | Tiger Global Management, Sequoia Capital China… | 10 | high | 2 |
| 3 | Stripe | 95 | 2014-01-23 | Fintech | San Francisco | United States | North America | 2010 | $2B | Khosla Ventures, LowercaseCapital, capitalG | 4 | high | 1 |
| 4 | Klarna | 46 | 2011-12-12 | Fintech | Stockholm | Sweden | Europe | 2005 | $4B | Institutional Venture Partners, Sequoia Capita… | 6 | high | 3 |
Convert Country/Region to numeric

Now, suppose that within a given continent, each company’s Country/Region is given equal importance. For analytical purposes, we want to convert the values in this column to numeric without creating a large number of dummy columns. 

# Create numeric categories for Country/Region
companies['Country/Region Numeric'] = companies['Country/Region'].astype('category').cat.codes

Convert Industry to numeric

Finally, let’s create dummy variables for the values in the Industry column.

# Create dummy variables with Industry values
industry_encoded = pd.get_dummies(companies['Industry'])

# Combine `companies` DataFrame with new dummy Industry columns
companies = pd.concat([companies, industry_encoded], axis=1)

companies.head()
[Output: the first five rows now end with 15 new binary columns, one per industry label; for example, Bytedance has a 1 under Artificial intelligence and a 0 in every other industry column. The full table is too wide to display here.]

Label encoding is useful in machine learning models, because many types of machine learning require all variables to be of a numeric data type. However, label encoding may make it more difficult to directly interpret what a column value represents. Further, it may introduce unintended relationships between the categorical data in a dataset.
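A small sketch of that last caveat, using made-up country names: .cat.codes assigns integers alphabetically, so the numbers imply an ordering and arithmetic that mean nothing in the real world:

```python
import pandas as pd

countries = pd.Series(['Sweden', 'China', 'United States', 'China'])

# Codes are assigned by alphabetical category order: China=0, Sweden=1, United States=2
codes = countries.astype('category').cat.codes
print(codes.tolist())  # → [1, 0, 2, 0]

# The codes support arithmetic that has no real-world meaning:
# 'China' (0) and 'United States' (2) average to 'Sweden' (1)
print((codes[1] + codes[2]) / 2)  # → 1.0
```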

Conclusion

  • Input validation is essential for ensuring data is high quality and error-free.
  • In practice, input validation requires trial and error to identify issues and determine the best way to fix them.
  • There are benefits and disadvantages to both label encoding and dummy/one-hot encoding.
  • The decision to use label encoding versus dummy/one-hot encoding needs to be made on a case-by-case basis.

Reference for this work: Bhat, M.A. Unicorn Companies

