Basic EDA: Sample 2 

Sample 2: Unicorn Companies

Following my notes on Python’s basic tools for data work, here I’ll do some basic EDA projects. Disclaimer: The one below is based on Google’s Advanced Data Analytics program. My only intention in repeating their structure is to practice what I’ve learned and keep these notes for future reference. Their content can be reached via Coursera. A free version is also available without claiming the certificate.

The reference for the below work: Bhat, M.A. (2022, March). Unicorn Companies.

0. Introduction

We’ll imagine that we are a member of an analytics team that provides insights to an investing firm. To help them decide which companies to invest in next, the firm wants insights into unicorn companies, companies that are valued at over one billion dollars. 

The data we will use for this task provides information on over 1,000 unicorn companies, including their industry, country, year founded, and select investors. We will use this information to gain insights into how and when companies reach this prestigious milestone and to make recommendations for next steps to the investing firm.

1. Imports

# Import libraries and packages
import pandas as pd
import matplotlib.pyplot as plt
import datetime as dt

# Load data from the csv file into a DataFrame and save in a variable
companies = pd.read_csv("Unicorn_Companies.csv")

2. Data Exploration

# Display the first 10 rows of the data
companies.head(10)
index | Company | Valuation | Date Joined | Industry | City | Country/Region | Continent | Year Founded | Funding | Select Investors
0 | Bytedance | $180B | 4/7/17 | Artificial intelligence | Beijing | China | Asia | 2012 | $8B | Sequoia Capital China, SIG Asia Investments, S…
1 | SpaceX | $100B | 12/1/12 | Other | Hawthorne | United States | North America | 2002 | $7B | Founders Fund, Draper Fisher Jurvetson, Rothen…
2 | SHEIN | $100B | 7/3/18 | E-commerce & direct-to-consumer | Shenzhen | China | Asia | 2008 | $2B | Tiger Global Management, Sequoia Capital China…
3 | Stripe | $95B | 1/23/14 | Fintech | San Francisco | United States | North America | 2010 | $2B | Khosla Ventures, LowercaseCapital, capitalG
4 | Klarna | $46B | 12/12/11 | Fintech | Stockholm | Sweden | Europe | 2005 | $4B | Institutional Venture Partners, Sequoia Capita…
5 | Canva | $40B | 1/8/18 | Internet software & services | Surry Hills | Australia | Oceania | 2012 | $572M | Sequoia Capital China, Blackbird Ventures, Mat…
6 | Checkout.com | $40B | 5/2/19 | Fintech | London | United Kingdom | Europe | 2012 | $2B | Tiger Global Management, Insight Partners, DST…
7 | Instacart | $39B | 12/30/14 | Supply chain, logistics, & delivery | San Francisco | United States | North America | 2012 | $3B | Khosla Ventures, Kleiner Perkins Caufield & By…
8 | JUUL Labs | $38B | 12/20/17 | Consumer & retail | San Francisco | United States | North America | 2015 | $14B | Tiger Global Management
9 | Databricks | $38B | 2/5/19 | Data management & analytics | San Francisco | United States | North America | 2013 | $3B | Andreessen Horowitz, New Enterprise Associates…

# How large the dataset is
companies.size
10740
# Shape of the dataset
companies.shape
(1074, 10)
# Get basic information
companies.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1074 entries, 0 to 1073
Data columns (total 10 columns):
 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   Company           1074 non-null   object
 1   Valuation         1074 non-null   object
 2   Date Joined       1074 non-null   object
 3   Industry          1074 non-null   object
 4   City              1058 non-null   object
 5   Country/Region    1074 non-null   object
 6   Continent         1074 non-null   object
 7   Year Founded      1074 non-null   int64
 8   Funding           1074 non-null   object
 9   Select Investors  1073 non-null   object
dtypes: int64(1), object(9)
memory usage: 84.0+ KB

3. Statistical Tests

# Get descriptive statistics
companies.describe()
       Year Founded
count  1074
mean   2012.895717
std    5.698573
min    1919
25%    2011
50%    2014
75%    2016
max    2021

# Use pd.to_datetime() to convert Date Joined column to datetime 
companies["Date Joined"] = pd.to_datetime(companies["Date Joined"])

# Use .dt.year to extract year component from Date Joined column
companies["Year Joined"] = companies["Date Joined"].dt.year

# Confirm the recent changes
companies.head()
index | Company | Valuation | Date Joined | Industry | City | Country/Region | Continent | Year Founded | Funding | Select Investors | Year Joined
0 | Bytedance | $180B | 2017-04-07 | Artificial intelligence | Beijing | China | Asia | 2012 | $8B | Sequoia Capital China, SIG Asia Investments, S… | 2017
1 | SpaceX | $100B | 2012-12-01 | Other | Hawthorne | United States | North America | 2002 | $7B | Founders Fund, Draper Fisher Jurvetson, Rothen… | 2012
2 | SHEIN | $100B | 2018-07-03 | E-commerce & direct-to-consumer | Shenzhen | China | Asia | 2008 | $2B | Tiger Global Management, Sequoia Capital China… | 2018
3 | Stripe | $95B | 2014-01-23 | Fintech | San Francisco | United States | North America | 2010 | $2B | Khosla Ventures, LowercaseCapital, capitalG | 2014
4 | Klarna | $46B | 2011-12-12 | Fintech | Stockholm | Sweden | Europe | 2005 | $4B | Institutional Venture Partners, Sequoia Capita… | 2011
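
The raw Date Joined values look like month/day/two-digit-year strings (for example 4/7/17). As a minimal sketch, assuming that layout holds for every row, the format can be spelled out explicitly so that day and month cannot be swapped silently. The series below is made up for illustration and is not read from the CSV file.

```python
import pandas as pd

# Hypothetical mini-series in the same style as the raw Date Joined column
raw_dates = pd.Series(["4/7/17", "12/1/12", "1/23/14"])

# Spell out the expected layout: month/day/two-digit year
parsed = pd.to_datetime(raw_dates, format="%m/%d/%y")

# Extract the year component, as done for the Year Joined column
years = parsed.dt.year
print(years.tolist())  # [2017, 2012, 2014]
```

With an explicit format, a value that does not match raises an error instead of being guessed at, which can be a useful early warning on messy data.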

4. Results and Evaluation

Take a sample of the data 

It is not necessary to take a sample of the data in order to conduct the visualizations and EDA that follow. But we may encounter scenarios (in the future) where we will need to take a sample of the data due to time and resource limitations. 

We’ll use sample() with the n parameter set to 50 to randomly sample 50 unicorn companies from the data, and we’ll specify the random_state parameter so that the same rows are drawn on every run, which keeps our work reproducible.

# Sample the data
companies_sample = companies.sample(n = 50, random_state = 42)
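
To see what random_state does in practice, here is a small sketch on a made-up DataFrame (the `toy` frame and its values are assumptions for illustration, not the unicorn data). Sampling twice with the same seed should return identical rows.

```python
import pandas as pd

# Toy stand-in for the companies DataFrame (values are made up)
toy = pd.DataFrame({
    "Company": [f"company_{i}" for i in range(100)],
    "Year Founded": [2000 + i % 20 for i in range(100)],
})

# Same random_state -> the same rows are drawn both times
sample_a = toy.sample(n=5, random_state=42)
sample_b = toy.sample(n=5, random_state=42)
same_seed_matches = sample_a.equals(sample_b)

# A different seed draws its own set of rows
sample_c = toy.sample(n=5, random_state=7)

print(same_seed_matches)
print(sample_a.index.tolist())
print(sample_c.index.tolist())
```
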
Visualize (the time it took companies to reach unicorn status)

We’ll visualize the longest time it took companies to reach unicorn status for each industry represented in the sample. But first we’ll need to prepare the data.

# Create new `years_till_unicorn` column 
companies_sample["years_till_unicorn"] = companies_sample["Year Joined"] - companies_sample["Year Founded"]

# Group the data by `Industry`. For each industry, 
# get the max value in the `years_till_unicorn` column.
grouped = (companies_sample[["Industry", "years_till_unicorn"]]
           .groupby("Industry")
           .max()
           .sort_values(by="years_till_unicorn")
          )
grouped
Industry                             years_till_unicorn
Consumer & retail                    1
Auto & transportation                2
Artificial intelligence              5
Data management & analytics          8
Mobile & telecommunications          9
Supply chain, logistics, & delivery  12
Internet software & services         13
Other                                15
E-commerce & direct-to-consumer      18
Cybersecurity                        19
Fintech                              21
Health                               21

Now we can create a bar plot.

# Create bar plot with Industry column as the categories of the bars
# and the difference in years between Year Joined column 
# and Year Founded column as the heights of the bars
plt.bar(grouped.index, grouped["years_till_unicorn"])

# Set title
plt.title("Bar plot of maximum years taken by company to become unicorn per industry (from sample)")

# Set x-axis label
plt.xlabel("Industry")

# Set y-axis label
plt.ylabel("Maximum number of years")

# Rotate labels on the x-axis as a way to avoid overlap in the positions of the text  
plt.xticks(rotation=45, horizontalalignment='right')

# Display the plot
plt.show()

This bar plot shows that for this sample of unicorn companies, the largest value for maximum time taken to become a unicorn occurred in the Health and Fintech industries, while the smallest value occurred in the Consumer & retail industry.

My side note: These maximum values may not be representative of their industries. Each one reflects a single company, not the industry as a whole.
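
One way to check how representative a maximum is would be to put it next to the median and the number of companies per industry. This is only a sketch on made-up numbers (the `toy` frame is an assumption, not the actual sample), using groupby with agg.

```python
import pandas as pd

# Made-up values standing in for companies_sample
toy = pd.DataFrame({
    "Industry": ["Fintech", "Fintech", "Fintech", "Health", "Health"],
    "years_till_unicorn": [3, 5, 21, 6, 7],
})

# For each industry: how many companies, the typical value, and the extreme
summary = (toy.groupby("Industry")["years_till_unicorn"]
              .agg(["count", "median", "max"])
              .sort_values(by="max"))
print(summary)
```

In this toy case the Fintech maximum of 21 sits far above its median of 5, which is the kind of gap that suggests a single outlier company is driving the bar.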

Visualize (the maximum unicorn company valuation per industry)

Visualize unicorn companies’ maximum valuation for each industry represented in the sample. But before plotting, we need to create a new column that represents the companies’ valuations as numbers (instead of strings, as they’re currently represented). Then, we’ll use this new column to plot our data.

# Create a column representing company valuation as numeric data
companies_sample['valuation_billions'] = companies_sample['Valuation']

# Remove the '$' from each value
# (regex=False treats '$' as a literal character rather than a regex anchor)
companies_sample['valuation_billions'] = companies_sample['valuation_billions'].str.replace('$', '', regex=False)

# Remove the 'B' from each value
companies_sample['valuation_billions'] = companies_sample['valuation_billions'].str.replace('B', '', regex=False)

# Convert column to type int
companies_sample['valuation_billions'] = companies_sample['valuation_billions'].astype('int')
companies_sample.head()
index | Company | Valuation | Date Joined | Industry | City | Country/Region | Continent | Year Founded | Funding | Select Investors | Year Joined | years_till_unicorn | valuation_billions
542 | Aiven | $2B | 2021-10-18 | Internet software & services | Helsinki | Finland | Europe | 2016 | $210M | Institutional Venture Partners, Atomico, Early… | 2021 | 5 | 2
370 | Jusfoun Big Data | $2B | 2018-07-09 | Data management & analytics | Beijing | China | Asia | 2010 | $137M | Boxin Capital, DT Capital Partners, IDG Capital | 2018 | 8 | 2
307 | Innovaccer | $3B | 2021-02-19 | Health | San Francisco | United States | North America | 2014 | $379M | M12, WestBridge Capital, Lightspeed Venture Pa… | 2021 | 7 | 3
493 | Algolia | $2B | 2021-07-28 | Internet software & services | San Francisco | United States | North America | 2012 | $334M | Accel, Alven Capital, Storm Ventures | 2021 | 9 | 2
350 | SouChe Holdings | $3B | 2017-11-01 | E-commerce & direct-to-consumer | Hangzhou | China | Asia | 2012 | $1B | Morningside Ventures, Warburg Pincus, CreditEa… | 2017 | 5 | 3

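My side note: Before making the above modifications, we should confirm that every value in the Valuation column follows the same format, as in ‘$3B’. A value in another format (for example, one given in millions) would make the conversion to int fail or produce a misleading number.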
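
A format check like this could be sketched with a regular expression before stripping any characters. The values below are made up, and two of them deliberately break the pattern; nothing here claims that the real Valuation column contains such rows.

```python
import pandas as pd

# Made-up values; the last two deliberately break the '$<digits>B' pattern
valuations = pd.Series(["$3B", "$12B", "$180B", "$750M", "3B"])

# True where the whole string is '$', one or more digits, then 'B'
is_expected = valuations.str.fullmatch(r"\$\d+B")

# Rows that would need a closer look before conversion
unexpected = valuations[~is_expected]
print(unexpected.tolist())  # ['$750M', '3B']

# Convert only the rows that passed the check
clean = (valuations[is_expected]
         .str.replace("$", "", regex=False)
         .str.replace("B", "", regex=False)
         .astype(int))
print(clean.tolist())  # [3, 12, 180]
```

If `unexpected` comes back empty on the real column, the strip-and-convert steps above are safe to run as written.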

Prepare data for plotting

grouped = (companies_sample[["Industry", "valuation_billions"]]
           .groupby("Industry")
           .max()
           .sort_values(by="valuation_billions")
          )
grouped
Industry                             valuation_billions
Auto & transportation                1
Consumer & retail                    1
Other                                2
Supply chain, logistics, & delivery  2
Cybersecurity                        3
Health                               3
Data management & analytics          4
E-commerce & direct-to-consumer      4
Internet software & services         5
Mobile & telecommunications          7
Fintech                              10
Artificial intelligence              12

# Create bar plot with Industry column as the categories of the bars
# and new valuation column as the heights of the bars
plt.bar(grouped.index, grouped["valuation_billions"])

# Set title
plt.title("Bar plot of maximum unicorn company valuation per industry (from sample)")

# Set x-axis label
plt.xlabel("Industry")

# Set y-axis label
plt.ylabel("Maximum valuation in billions of dollars")

# Rotate labels on the x-axis as a way to avoid overlap in the positions of the text  
plt.xticks(rotation=45, horizontalalignment='right')

# Display the plot
plt.show()

This bar plot shows that for this sample of unicorn companies, the highest maximum valuation occurred in the Artificial intelligence industry, while the lowest maximum valuation occurred in the Auto & transportation and Consumer & retail industries.

My side note: Again, each of these values comes from a single company. We can’t generalize them to the entire industry.


What could be the next steps to consider?

  • Identify the main industries that the investing firm is interested in investing in.
  • Select a subset of this data that includes only companies in those industries.
  • Analyze that subset more closely. Determine which companies have higher valuation but do not have as many investors currently. They may be good candidates to consider investing in.
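
The steps above could be sketched roughly as follows. Everything in this block is an assumption for illustration: the rows are made up, `target_industries` is a hypothetical choice, and counting comma-separated names in Select Investors is only a rough proxy, since that column lists selected investors rather than all of them.

```python
import pandas as pd

# Made-up rows with the same column names as the dataset
toy = pd.DataFrame({
    "Company": ["A", "B", "C", "D"],
    "Industry": ["Fintech", "Health", "Fintech", "Other"],
    "valuation_billions": [10, 3, 4, 2],
    "Select Investors": ["X Capital", "X Capital, Y Partners, Z Fund",
                         "Y Partners, Z Fund", None],
})

# Hypothetical industries the firm is interested in
target_industries = ["Fintech", "Health"]
subset = toy[toy["Industry"].isin(target_industries)].copy()

# Count the listed investors per company (missing values count as zero)
subset["investor_count"] = (subset["Select Investors"]
                            .fillna("")
                            .str.split(", ")
                            .apply(lambda names: len([n for n in names if n])))

# Fewest investors first; among ties, highest valuation first
candidates = subset.sort_values(by=["investor_count", "valuation_billions"],
                                ascending=[True, False])
print(candidates[["Company", "valuation_billions", "investor_count"]])
```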
