Data Validation and Modifying Data

Introduction 

Here we will use input validation and label encoding to prepare a dataset for analysis. These are fundamental techniques used in all types of data analysis, from simple linear regression to complex neural networks. 

We will work again on the Unicorn dataset, playing the role of a data professional at an investment firm that is attempting to invest in private companies with a valuation of at least $1 billion. These companies are often known as “unicorns.”

Our client wants to develop a better understanding of unicorns, with the hope they can be early investors in future highly successful companies. They are particularly interested in the investment strategies of the three top unicorn investors: Sequoia Capital, Tiger Global Management, and Accel.

Imports and Load

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Run this cell so pandas displays all columns
pd.set_option('display.max_columns', None)

companies = pd.read_csv('Modified_Unicorn_Companies.csv')
companies.head()
| | Company | Valuation | Date Joined | Industry | City | Country/Region | Continent | Year Founded | Funding | Select Investors |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Bytedance | 180 | 2017-04-07 | Artificial intelligence | Beijing | China | Asia | 2012 | $8B | Sequoia Capital China, SIG Asia Investments, S… |
| 1 | SpaceX | 100 | 2012-12-01 | Other | Hawthorne | United States | North America | 2002 | $7B | Founders Fund, Draper Fisher Jurvetson, Rothen… |
| 2 | SHEIN | 100 | 2018-07-03 | E-commerce & direct-to-consumer | Shenzhen | China | Asia | 2008 | $2B | Tiger Global Management, Sequoia Capital China… |
| 3 | Stripe | 95 | 2014-01-23 | FinTech | San Francisco | United States | North America | 2010 | $2B | Khosla Ventures, LowercaseCapital, capitalG |
| 4 | Klarna | 46 | 2011-12-12 | Fintech | Stockholm | Sweden | Europe | 2005 | $4B | Institutional Venture Partners, Sequoia Capita… |
companies.dtypes
Company             object
Valuation            int64
Date Joined         object
Industry            object
City                object
Country/Region      object
Continent           object
Year Founded         int64
Funding             object
Select Investors    object
dtype: object

Modify the data

# Apply necessary datatype conversions.
companies['Date Joined'] = pd.to_datetime(companies['Date Joined'])

# Create the column Years To Unicorn.
companies['Years To Unicorn'] = companies['Date Joined'].dt.year - companies['Year Founded']
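As a quick sketch of this step on toy data (the company names and dates below are made up for illustration), converting Date Joined to datetime lets us subtract the founding year directly:

```python
import pandas as pd

# Hypothetical mini-DataFrame mirroring the relevant columns
df = pd.DataFrame({
    'Company': ['Acme', 'Globex'],
    'Date Joined': ['2017-04-07', '2012-12-01'],
    'Year Founded': [2012, 2002],
})

# Convert to datetime, then subtract the founding year from the join year
df['Date Joined'] = pd.to_datetime(df['Date Joined'])
df['Years To Unicorn'] = df['Date Joined'].dt.year - df['Year Founded']
print(df['Years To Unicorn'].tolist())  # → [5, 10]
```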

Input validation 

The data has some issues with bad data, duplicate rows, and inconsistent Industry labels. We will identify and correct each of these issues.

Correcting bad data
companies['Years To Unicorn'].describe()
count    1074.000000
mean        7.013035
std         5.331842
min        -3.000000
25%         4.000000
50%         6.000000
75%         9.000000
max        98.000000
Name: Years To Unicorn, dtype: float64

Let’s isolate all rows where the Years To Unicorn column contains a negative value to examine them.

companies[companies['Years To Unicorn'] < 0]
| | Company | Valuation | Date Joined | Industry | City | Country/Region | Continent | Year Founded | Funding | Select Investors | Years To Unicorn |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 527 | InVision | 2 | 2017-11-01 | Internet software & services | New York | United States | North America | 2020 | $349M | FirstMark Capital, Tiger Global Management, IC… | -3 |

There is a single row that has a negative value in the Years To Unicorn column. Let’s imagine that we did an internet search revealing that InVision was founded in 2011. Now we can replace the value at Year Founded with 2011 for InVision’s row.

# Replace InVision's `Year Founded` value with 2011
companies.loc[companies['Company']=='InVision', 'Year Founded'] = 2011

# Verify the change was made properly
companies[companies['Company']=='InVision']
| | Company | Valuation | Date Joined | Industry | City | Country/Region | Continent | Year Founded | Funding | Select Investors | Years To Unicorn |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 527 | InVision | 2 | 2017-11-01 | Internet software & services | New York | United States | North America | 2011 | $349M | FirstMark Capital, Tiger Global Management, IC… | -3 |

Now we need to recalculate all the values in the Years To Unicorn column so that the negative value for InVision is corrected, and then verify that no negative values remain.

# Recalculate all values in the `Years To Unicorn` column
companies['Years To Unicorn'] = companies['Date Joined'].dt.year - companies['Year Founded']

# Verify that there are no more negative values in the column
companies['Years To Unicorn'].describe()
count    1074.000000
mean        7.021415
std         5.323155
min         0.000000
25%         4.000000
50%         6.000000
75%         9.000000
max        98.000000
Name: Years To Unicorn, dtype: float64
Issues with Industry labels

The client provided us with the following list of expected industry labels for the Industry column. We’ll assume that any label in Industry that does not appear in industry_list is a misspelling.

# List provided by the company of the expected industry labels in the data
industry_list = ['Artificial intelligence', 'Other', 'E-commerce & direct-to-consumer', 'Fintech',
                 'Internet software & services', 'Supply chain, logistics, & delivery',
                 'Consumer & retail', 'Data management & analytics', 'Edtech', 'Health',
                 'Hardware', 'Auto & transportation', 'Travel', 'Cybersecurity',
                 'Mobile & telecommunications']

First, let’s check if there are values in the Industry column that are not in industry_list. If so, what are they?

# Check which values are in `Industry` but not in `industry_list`
set(companies['Industry']) - set(industry_list)
{'Artificial Intelligence', 'Data management and analytics', 'FinTech'}

‘Artificial Intelligence’, ‘Data management and analytics’, and ‘FinTech’ are misspellings that are currently in the Industry column.

Now, we’ll correct the bad entries in the Industry column by replacing them with an approved string from industry_list. We will use the replace() Series method on the Industry series.

When we pass a dictionary to the method, it replaces each value in the series that matches one of the dictionary’s keys with that key’s corresponding value. Any value not listed as a key in the dictionary is left unchanged.

Example:

[IN]:
column_a = pd.Series(['A', 'B', 'C', 'D'])
column_a

[OUT]:
0    A
1    B
2    C
3    D
dtype: object

[IN]:
replacement_dict = {'A': 'z', 'B': 'y', 'C': 'x'}
column_a = column_a.replace(replacement_dict)
column_a

[OUT]:
0    z
1    y
2    x
3    D
dtype: object

# 1. Create `replacement_dict`
replacement_dict = {'Artificial Intelligence': 'Artificial intelligence',
                   'Data management and analytics': 'Data management & analytics',
                   'FinTech': 'Fintech'
                   }

# 2. Replace the incorrect values in the `Industry` column
companies['Industry'] = companies['Industry'].replace(replacement_dict)

# 3. Verify that there are no longer any elements in `Industry` that are not in `industry_list`
set(companies['Industry']) - set(industry_list)
set()
Handling duplicate rows

The business mentioned that no company should appear in the data more than once.

# Isolate rows of all companies that have duplicates
companies[companies.duplicated(subset=['Company'], keep=False)]
| | Company | Valuation | Date Joined | Industry | City | Country/Region | Continent | Year Founded | Funding | Select Investors | Years To Unicorn |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 385 | BrewDog | 2 | 2017-04-10 | Consumer & retail | Aberdeen | United Kingdom | Europe | 2007 | $233M | TSG Consumer Partners, Crowdcube | 10 |
| 386 | BrewDog | 2 | 2017-04-10 | Consumer & retail | Aberdeen | United Kingdom | Europe | 2007 | $233M | TSG Consumer Partners | 10 |
| 510 | ZocDoc | 2 | 2015-08-20 | Health | New York | United States | North America | 2007 | $374M | Founders Fund, Khosla Ventures, Goldman Sachs | 8 |
| 511 | ZocDoc | 2 | 2015-08-20 | Health | NaN | United States | North America | 2007 | $374M | Founders Fund | 8 |
| 1031 | SoundHound | 1 | 2018-05-03 | Artificial intelligence | Santa Clara | United States | North America | 2005 | $215M | Tencent Holdings, Walden Venture Capital, Glob… | 13 |
| 1032 | SoundHound | 1 | 2018-05-03 | Other | Santa Clara | United States | North America | 2005 | $215M | Tencent Holdings | 13 |

The duplicated rows are not legitimate entries: rather than different companies that happen to share a name, they are the same company represented twice with minor variations.

Let’s keep the first occurrence of each duplicate company and drop the subsequent rows that are copies.

# Drop rows of duplicate companies after their first occurrence
companies = companies.drop_duplicates(subset=['Company'], keep='first')
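A minimal sketch of how keep='first' behaves, using a made-up two-company frame: the first occurrence of each duplicated Company survives, and a follow-up duplicated() check confirms nothing is left:

```python
import pandas as pd

# Hypothetical frame with one duplicated company
df = pd.DataFrame({
    'Company': ['BrewDog', 'BrewDog', 'ZocDoc'],
    'Valuation': [2, 2, 2],
})

# keep='first' retains the first occurrence of each duplicated Company
deduped = df.drop_duplicates(subset=['Company'], keep='first')
print(len(deduped))                                  # → 2
print(deduped.duplicated(subset=['Company']).sum())  # → 0
```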

Convert numerical data to categorical data

Sometimes, we’ll want to simplify a numeric column by converting it to a categorical column. A common approach is to divide the values into a defined number of bins—either intervals of equal width (pd.cut) or quantile groups containing equal numbers of rows (pd.qcut)—and assign each bin a label.

Create a High Valuation column

The data in the Valuation column represents how much money (in billions, USD) each company is valued at. We’ll use the Valuation column to create a new column called High Valuation: for each company, the value in this column should be low if the company is in the bottom 50% of company valuations and high if it is in the top 50%.

# Create new `High Valuation` column
# Use qcut to divide Valuation into 'high' and 'low' Valuation groups
companies['High Valuation'] = pd.qcut(companies['Valuation'], 2, labels = ['low', 'high'])
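Note that pd.qcut forms groups containing equal numbers of rows (quantile groups), which is what we want here, rather than the equal-width intervals of pd.cut. A small sketch on made-up, skewed values shows the difference:

```python
import pandas as pd

vals = pd.Series([1, 2, 3, 4, 100])  # hypothetical skewed valuations

# pd.cut: equal-width bins over the value range; the outlier dominates
by_width = pd.cut(vals, 2, labels=['low', 'high'])

# pd.qcut: quantile bins; the rows are split at the median
by_count = pd.qcut(vals, 2, labels=['low', 'high'])

print(by_width.tolist())  # → ['low', 'low', 'low', 'low', 'high']
print(by_count.tolist())  # → ['low', 'low', 'low', 'high', 'high']
```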

Convert categorical data to numerical data

Three common methods for changing categorical data to numerical are:

  1. Label encoding: order matters (ordinal numeric labels)
  2. Label encoding: order doesn’t matter (nominal numeric labels)
  3. Dummy encoding: order doesn’t matter (creation of binary columns for each possible category contained in the variable)

The decision on which method to use depends on the context and must be made on a case-by-case basis. However, a distinction is typically made between categorical variables whose possible categories carry equal weight and variables whose categories have a hierarchical structure of importance.

For example, a variable called subject might have possible values of history, mathematics, literature. In this case, each subject might be nominal—given the same level of importance. However, we might have another variable called class, whose possible values are freshman, sophomore, junior, senior. In this case, the class variable is ordinal—its values have an ordered, hierarchical structure of importance.

Machine learning models typically need all data to be numeric, and they generally use ordinal label encoding (method 1) and dummy encoding (method 3).
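A brief sketch of the subject/class example above (values invented for illustration): an ordered pd.Categorical yields ordinal codes that respect the hierarchy, while pd.get_dummies treats nominal values symmetrically:

```python
import pandas as pd

# Ordinal: `class` has an inherent order, so encode it with explicit ordering
class_col = pd.Series(['junior', 'freshman', 'senior', 'sophomore'])
order = ['freshman', 'sophomore', 'junior', 'senior']
class_cat = pd.Categorical(class_col, categories=order, ordered=True)
print(list(class_cat.codes))  # → [2, 0, 3, 1]

# Nominal: `subject` has no order; dummy encoding gives each value its own column
subject_col = pd.Series(['history', 'mathematics', 'literature'])
print(pd.get_dummies(subject_col).columns.tolist())
# → ['history', 'literature', 'mathematics']
```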

I discussed a little more about these issues in the post titled ‘Transforming Data: Categorical Data and Label Encoding’.

Convert Continent to numeric

We will suppose that the investment group has specified that they want to give more weight to continents with fewer unicorn companies because they believe this could indicate unrealized market potential.

This would make Continent an ordinal variable, since more importance is placed on continents with fewer unicorn companies. There is a hierarchy of importance.

# Rank the continents by number of unicorn companies
companies['Continent'].value_counts()
North America    586
Asia             310
Europe           143
South America     21
Oceania            8
Africa             3
Name: Continent, dtype: int64
# Create numeric `Continent Number` column
continent_dict = {'North America': 1,
                  'Asia': 2,
                  'Europe': 3,
                  'South America': 4,
                  'Oceania': 5,
                  'Africa': 6
                 }
companies['Continent Number'] = companies['Continent'].replace(continent_dict)
companies.head()
| | Company | Valuation | Date Joined | Industry | City | Country/Region | Continent | Year Founded | Funding | Select Investors | Years To Unicorn | High Valuation | Continent Number |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Bytedance | 180 | 2017-04-07 | Artificial intelligence | Beijing | China | Asia | 2012 | $8B | Sequoia Capital China, SIG Asia Investments, S… | 5 | high | 2 |
| 1 | SpaceX | 100 | 2012-12-01 | Other | Hawthorne | United States | North America | 2002 | $7B | Founders Fund, Draper Fisher Jurvetson, Rothen… | 10 | high | 1 |
| 2 | SHEIN | 100 | 2018-07-03 | E-commerce & direct-to-consumer | Shenzhen | China | Asia | 2008 | $2B | Tiger Global Management, Sequoia Capital China… | 10 | high | 2 |
| 3 | Stripe | 95 | 2014-01-23 | Fintech | San Francisco | United States | North America | 2010 | $2B | Khosla Ventures, LowercaseCapital, capitalG | 4 | high | 1 |
| 4 | Klarna | 46 | 2011-12-12 | Fintech | Stockholm | Sweden | Europe | 2005 | $4B | Institutional Venture Partners, Sequoia Capita… | 6 | high | 3 |
Convert Country/Region to numeric

Now, suppose that within a given continent, each company’s Country/Region is given equal importance. For analytical purposes, we want to convert the values in this column to numeric without creating a large number of dummy columns. 

# Create numeric categories for Country/Region
companies['Country/Region Numeric'] = companies['Country/Region'].astype('category').cat.codes

Convert Industry to numeric

Finally, let’s create dummy variables for the values in the Industry column.

# Create dummy variables with Industry values
industry_encoded = pd.get_dummies(companies['Industry'])

# Combine `companies` DataFrame with new dummy Industry columns
companies = pd.concat([companies, industry_encoded], axis=1)

companies.head()
[Output: the first five rows now end with 15 new binary columns, one per industry label; for example, Bytedance has a 1 under Artificial intelligence and a 0 in every other industry column. The full table is too wide to display here.]

Label encoding is useful in machine learning models, because many types of machine learning require all variables to be of a numeric data type. However, label encoding may make it more difficult to directly interpret what a column value represents. Further, it may introduce unintended relationships between the categorical data in a dataset.
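A small sketch of that last caveat, using made-up country names: .cat.codes assigns integers alphabetically, so the numbers imply an ordering and arithmetic that mean nothing in the real world:

```python
import pandas as pd

countries = pd.Series(['Sweden', 'China', 'United States', 'China'])

# Codes are assigned by alphabetical category order: China=0, Sweden=1, United States=2
codes = countries.astype('category').cat.codes
print(codes.tolist())  # → [1, 0, 2, 0]

# The codes support arithmetic that has no real-world meaning:
# 'China' (0) and 'United States' (2) average to 'Sweden' (1)
print((codes[1] + codes[2]) / 2)  # → 1.0
```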

Conclusion

  • Input validation is essential for ensuring data is high quality and error-free.
  • In practice, input validation requires trial and error to identify issues and determine the best way to fix them.
  • There are benefits and disadvantages to both label encoding and dummy/one-hot encoding.
  • The decision to use label encoding versus dummy/one-hot encoding needs to be made on a case-by-case basis.

Reference for this work: Bhat, M.A. Unicorn Companies

