Basic EDA: Sample 5

Sample 5: Unicorn Companies

Following my notes on Python’s basic tools for data works, here I’ll do some basic EDA projects. Disclaimer: The one below is based on Google’s Advanced Data Analysis Program. My only intention, by repeating their structure, is to practice what I’ve learned and keep these notes as future reference. Their content can be reached via Coursera. A free version is also available without claiming the certificate.

The reference for the below work: Bhat, M.A. (2022, March). Unicorn Companies.

0. Introduction

As we did before with sample 2, we’ll imagine that we are a member of an analytics team that provides insights to an investing firm. To help them decide which companies to invest in next, the firm wants insights into unicorn companies, companies that are valued at over one billion dollars. 

We will work with a dataset about unicorn companies, discovering characteristics of the data, structuring the data in ways that will help us draw meaningful insights, and using visualizations to analyze the data. Ultimately, we will draw conclusions about what significant trends or patterns we find in the dataset. 

1. Imports

# Import libraries and packages
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

# Load the dataset provided into a DataFrame.
companies = pd.read_csv("Unicorn_Companies.csv")

2. Data exploration

companies.head()
CompanyValuationDate JoinedIndustryCityCountry/RegionContinentYear FoundedFundingSelect Investors
0Bytedance$180B4/7/17Artificial intelligenceBeijingChinaAsia2012$8BSequoia Capital China, SIG Asia Investments, S…
1SpaceX$100B12/1/12OtherHawthorneUnited StatesNorth America2002$7BFounders Fund, Draper Fisher Jurvetson, Rothen…
2SHEIN$100B7/3/18E-commerce & direct-to-consumerShenzhenChinaAsia2008$2BTiger Global Management, Sequoia Capital China…
3Stripe$95B1/23/14FintechSan FranciscoUnited StatesNorth America2010$2BKhosla Ventures, LowercaseCapital, capitalG
4Klarna$46B12/12/11FintechStockholmSwedenEurope2005$4BInstitutional Venture Partners, Sequoia Capita…
companies.shape
(1074, 10)
# Check for duplicates.
companies.drop_duplicates().shape
(1074, 10)
companies.dtypes
Company             object
Company object
Valuation object
Date Joined object
Industry object
City object
Country/Region object
Continent object
Year Founded int64
Funding object
Select Investors object
dtype: object
Sort the data
# To arrange the data from latest to earliest 'Year Founded'
companies.sort_values(by="Year Founded", ascending=False).head()
CompanyValuationDate JoinedIndustryCityCountry/RegionContinentYear FoundedFundingSelect Investors
782Phantom$1B1/31/22FintechSan FranciscoUnited StatesNorth America2021$118MParadigm, Andreessen Horowitz, Jump Capital
714Yidian Zixun$1B10/17/17Mobile & telecommunicationsBeijingChinaAsia2021$151MPhoenix New Media, Tianjin Haihe Industry Fund
822GlobalBees$1B12/28/21E-commerce & direct-to-consumerNew DelhiIndiaAsia2021$185MChiratae Ventures, SoftBank Group, Trifecta Ca…
554ClickHouse$2B10/28/21Data management & analyticsPortola ValleyUnited StatesNorth America2021$300MLightspeed Venture Partners, Almaz Capital Par…
952LayerZero Labs$1B3/30/22Internet software & servicesNew YorkUnited StatesNorth America2021$143MAndreessen Horowitz, FTX Ventures, Tiger Globa…
Determine the number of companies founded each year
# Display each unique year that occurs in the dataset
# along with the number of companies that were founded in each unique year.
companies["Year Founded"].value_counts().sort_values(ascending=False)
2015    155
2016 110
2014 109
2012 95
2013 87
2011 82
2017 74
2018 61
2019 45
2010 40
2009 34
2008 27
2020 25
2007 24
2006 15
2005 14
2000 11
2021 11
2001 9
1999 8
2004 8
2003 8
1998 5
2002 4
1994 2
1995 2
1992 1
1993 1
1990 1
1984 1
1996 1
1979 1
1991 1
1919 1
1997 1
Name: Year Founded, dtype: int64

Let’s plot these results on a histogram.

sns.histplot(data=companies, x='Year Founded')
plt.title('Year Founded histogram')

Now let’s examine when the companies joined unicorn status. Our Date Joined column currently has the data type object, so we need to convert it first.

# Convert 'Date Joined' column to datetime.
#companies["Date Joined"] = pd.to_datetime(companies["Date Joined"])

# Display the data types of the columns in `companies`
# to confirm that the update actually took place
companies.dtypes
Company                     object
Valuation object
Date Joined datetime64[ns]
Industry object
City object
Country/Region object
Continent object
Year Founded int64
Funding object
Select Investors object
dtype: object
Create a Month Joined column
# Obtain the names of the months when companies gained unicorn status.
companies["Month Joined"] = companies["Date Joined"].dt.month_name()

# Display the first few rows of `companies`
# to confirm that the new column did get added.
companies.head()
CompanyValuationDate JoinedIndustryCityCountry/RegionContinentYear FoundedFundingSelect InvestorsMonth Joined
0Bytedance$180B2017-04-07Artificial intelligenceBeijingChinaAsia2012$8BSequoia Capital China, SIG Asia Investments, S…April
1SpaceX$100B2012-12-01OtherHawthorneUnited StatesNorth America2002$7BFounders Fund, Draper Fisher Jurvetson, Rothen…December
2SHEIN$100B2018-07-03E-commerce & direct-to-consumerShenzhenChinaAsia2008$2BTiger Global Management, Sequoia Capital China…July
3Stripe$95B2014-01-23FintechSan FranciscoUnited StatesNorth America2010$2BKhosla Ventures, LowercaseCapital, capitalGJanuary
4Klarna$46B2011-12-12FintechStockholmSwedenEurope2005$4BInstitutional Venture Partners, Sequoia Capita…December
Create a Years To Join column 

Let’s determine how many years it took for companies to reach unicorn status.

companies["Years To Join"] = companies["Date Joined"].dt.year - companies["Year Founded"]
companies.head()
CompanyValuationDate JoinedIndustryCityCountry/RegionContinentYear FoundedFundingSelect InvestorsMonth JoinedYears To Join
0Bytedance$180B2017-04-07Artificial intelligenceBeijingChinaAsia2012$8BSequoia Capital China, SIG Asia Investments, S…April5
1SpaceX$100B2012-12-01OtherHawthorneUnited StatesNorth America2002$7BFounders Fund, Draper Fisher Jurvetson, Rothen…December10
2SHEIN$100B2018-07-03E-commerce & direct-to-consumerShenzhenChinaAsia2008$2BTiger Global Management, Sequoia Capital China…July10
3Stripe$95B2014-01-23FintechSan FranciscoUnited StatesNorth America2010$2BKhosla Ventures, LowercaseCapital, capitalGJanuary4
4Klarna$46B2011-12-12FintechStockholmSwedenEurope2005$4BInstitutional Venture Partners, Sequoia Capita…December6
Gain more insight on a specific year

We’ll get more insight on the year 2021, as that is the most recent year for which there is data available. Trends that took place in 2021 might be similar to trends that will take place in 2022.

companies_2021 = companies[companies["Date Joined"].dt.year == 2021]
companies_2021.head()
CompanyValuationDate JoinedIndustryCityCountry/RegionContinentYear FoundedFundingSelect InvestorsMonth JoinedYears To Join
12FTX$32B2021-07-20FintechNaNBahamasNorth America2018$2BSequoia Capital, Thoma Bravo, SoftbankJuly3
16J&T Express$20B2021-04-07Supply chain, logistics, & deliveryJakartaIndonesiaAsia2015$5BHillhouse Capital Management, Boyu Capital, Se…April6
24Blockchain.com$14B2021-02-17FintechLondonUnited KingdomEurope2011$490MLightspeed Venture Partners, Google Ventures, …February10
27OpenSea$13B2021-07-20E-commerce & direct-to-consumerNew YorkUnited StatesNorth America2017$427MAndreessen Horowitz, Thirty Five Ventures, Sou…July4
34Getir$12B2021-03-26E-commerce & direct-to-consumerIstanbulTurkeyEurope2015$2BTiger Global Management, Sequoia Capital, Revo…March6
Observe trends over time

We’ll count the number of companies that joined per week. This allows the observation of trends over the weeks of 2021. 

# Insert a 'Week Joined' column into 'companies_2021'.
companies_2021.insert(3, "Week Joined", companies_2021["Date Joined"].dt.strftime('%Y-W%V'), True)

# Group `companies_2021` by `Week Joined`. 
# Aggregate by counting companies that joined per week of 2021.
companies_by_week_2021 = companies_2021.groupby(by="Week Joined")["Company"].count().reset_index().rename(columns={"Company":"Company Count"})

# Display the first few rows of the new DataFrame to confirm that it was created.
companies_by_week_2021.head()
Week JoinedCompany Count
02021-W0112
12021-W029
22021-W035
32021-W048
42021-W054
Compare trends over time

Let’s observe trends in the average valuation of companies quarterly between 2021 and 2020.

# Filter by the additional year to create a subset 
# that consists of companies that joined in that year.
companies_2020 = companies[companies["Date Joined"].dt.year == 2020]

# Concatenate the new subset with the subset that we defined previously.
companies_2020_2021 = pd.concat([companies_2020, companies_2021.drop(columns="Week Joined")])

# Add `Quarter Joined` column to `companies_2021`.
companies_2020_2021["Quarter Joined"] = companies_2020_2021["Date Joined"].dt.to_period('Q').dt.strftime('%Y-Q%q')

# Convert the `Valuation` column to numeric by removing `$` and `B` 
# and casting each value to data type `float`.
companies_2020_2021["Valuation"] =  companies_2020_2021["Valuation"].str.strip("$B").astype(float)

# Group `companies_2020_2021` by `Quarter Joined`, 
# Aggregate by computing average `Funding` of companies that joined per quarter of each year.
# Save the resulting DataFrame in a new variable.
companies_by_quarter_2020_2021 = companies_2020_2021.groupby(by="Quarter Joined")["Valuation"].mean().reset_index().rename(columns={"Valuation":"Average Valuation"})

# Display the first few rows of the new DataFrame to confirm that it was created.
companies_by_quarter_2020_2021.head()
Quarter JoinedAverage Valuation
02020-Q13.444444
12020-Q23.777778
22020-Q33.896552
32020-Q43.697674
42021-Q12.75

3. Visualization

Let’s create a box plot to visualize the distribution of how long it took companies to become unicorns, with respect to the month they joined.

# Define a list that contains months in chronological order.
month_order = ["January", "February", "March", "April", "May", "June",
               "July", "August", "September", "October", "November", 
               "December"]

sns.boxplot(x=companies['Month Joined'], 
            y=companies['Years To Join'], 
            order=month_order, 
            showfliers=False)

# Set the title of the plot.
plt.title('Distribution of years to become unicorn with respect to month joined')

# Rotate labels on the x-axis as a way to avoid overlap in the positions of the text. 
plt.xticks(rotation=45, horizontalalignment='right')

# Display the plot.
plt.show()

In the preceding box plot, the median value for Years To Join is different for each month. Also, the median Years To Join is lower for the months of September and October. This could indicate that companies that reached unicorn status in early fall took less time to reach a $1 billion valuation. 

This is because the number of companies is relatively close/consistent for each month. If that were not the case, it would be misleading to compare the median values from the box plots between months.

4. Results and Evaluation

Visualize the time it took companies to reach unicorn status

We will create a bar plot to visualize the average number of years it took companies to reach unicorn status with respect to when they were founded.

# Set the size of the plot.
plt.figure(figsize=(10,6))

# Create bar plot 
sns.barplot(x=companies["Year Founded"], y=companies["Years To Join"], ci=False)

# Set title
plt.title("Bar plot of years to join with respect to year founded")

# Set x-axis label
plt.xlabel("Year founded")

# Set y-axis label
plt.ylabel("Years to join unicorn status")

# Rotate the labels on the x-axis as a way to avoid overlap in the positions of the text.  
plt.xticks(rotation=45, horizontalalignment='right')

# Display the plot.
plt.show()

There appears to be a trend wherein companies that were founded later took less time to reach unicorn status, on average. 

However, this is a bias that is common in time data. Because companies founded in later years have been around for less time. Therefore, there is less time to collect data on such companies compared to companies founded in earlier years.

Visualize the number of companies that joined per interval
# Set the size of the plot.
plt.figure(figsize = (20, 5))

# Create a bar plot.
plt.bar(x=companies_by_week_2021['Week Joined'], height=companies_by_week_2021['Company Count'])
plt.plot()

# Set the x-axis label.
plt.xlabel("Week number")

# Set the y-axis label.
plt.ylabel("Number of companies")

# Set the title.
plt.title("Number of companies that became unicorns per week in 2021")

# Rotate the labels on the x-axis as a way to avoid overlap in the positions of the text.  
plt.xticks(rotation = 45, horizontalalignment='right', fontsize=8)

# Display the plot.
plt.show()

Observations from a bar plot of the number of companies that became unicorns per week in 2021: 

  • The number of companies that joined unicorn status fluctuated over the weeks of 2021, with a trend of decline followed by increase occurring periodically. 
  • The highest number of companies reached $1 billion valuation in Week 37 of 2021, which corresponds to the third week of September 2021.
Visualize the average valuation over the quarters 

Now let’s create a grouped bar plot to visualize the average valuation over the quarters, with two bars for each time-interval. This will allow us to compare quarterly values between the two years.

# Using slicing, extract the year component and the time interval that we specified, 
# and save them by adding two new columns into the subset. 
companies_by_quarter_2020_2021['Quarter Number'] = companies_by_quarter_2020_2021['Quarter Joined'].str[-2:]
companies_by_quarter_2020_2021['Year Joined'] = companies_by_quarter_2020_2021['Quarter Joined'].str[:4]
# Set the size of the plot.
plt.figure(figsize = (10, 5))

# Create a grouped bar plot.
sns.barplot(x=companies_by_quarter_2020_2021['Quarter Number'],
            y=companies_by_quarter_2020_2021['Average Valuation'],
            hue=companies_by_quarter_2020_2021['Year Joined'])
plt.plot()

# Set the x-axis label.
plt.xlabel("Quarter number")

# Set the y-axis label.
plt.ylabel("Average valuation (billions of dollars)")

# Set the title.
plt.title("Average valuation of companies that became unicorns per quarter in 2020 vs. 2021")

# Display the plot.
plt.show()

Observations from a grouped bar plot of average valuation of companies that became unicorns per quarter in 2020 vs. 2021: 

  • In each quarter, the average valuation of companies that joined unicorn status was higher in 2020 than in 2021. 
  • In 2020, Q3 was the quarter with the highest average valuation of companies that reached unicorn status, and there was a trend of increase from Q1 to Q2 and from Q2 to Q3. 
  • In 2021, Q1 was the quarter with the highest average valuation of companies that reached unicorn status, and there was a trend of decrease across the quarters.

Some notes about this study and the dataset

Potential bias:
  • If there were bias in terms of which cities and countries were taken into account when collecting the data, then the analysis would be more representative of the cities and countries that are in the dataset than those that are not.
  • If the dataset did not include certain industries, then the analysis would be more representative of the industries that are included and may not reflect trends in those that are excluded from the data.
  • If the dataset had time gaps, (e.g., if companies that joined in certain windows of time were not included in the data), then that may have affected the patterns observed, depending on how salient the gaps were.
  • Another point of bias pertains to the nature of time data: There have been fewer years to collect data on companies that were founded more recently than for companies that were founded longer ago.
Potential next steps with EDA:
  • Analyze the data with respect to industries of unicorn companies at different datetime intervals.
  • Analyze the data with respect to cities or countries where unicorn companies were founded at different datetime intervals.
  • Clean the data as needed.
Possible questions:
  • How many rounds of funding did each company require and when did this funding take place?
  • Have any of these unicorn companies acquired other companies along the way? 
  • If so, which companies acquired other companies, which companies did they acquire, and when did the acquisitions take place?

Conclusion

What findings would we share with others?
  • There are 1074 unicorn companies represented in this dataset.
  • 2015 is the year when the most number of unicorn companies were founded.
  • Many of the unicorn companies that were founded in 2021 were founded in the United States and belong to “Fintech”, “E-commerce & direct-to-consumer”, and “Internet software & services” industries.
  • The box plot created shows that companies that become unicorns in the months of September and October have a smaller median value for how long it took to become unicorns.
  • One of the bar plots created shows that the average valuation of companies that joined in 2020 is highest in the third quarter of the year, whereas the average valuation of companies that joined in 2021 is highest in the first quarter of the year.
What recommendations would we share with stakeholders based on these findings?
  • According to data analysis that was conducted on a dataset of 1074 unicorn companies, companies that joined in the months of September and October tended to take less time to become unicorns.
  • Another finding was that many of the unicorn companies that were founded in 2021 were founded in the United States and belong to “Fintech”, “E-commerce & direct-to-consumer”, and “Internet software & services” industries. So if the stakeholders want to invest in companies founded in 2021, it would be a good idea to consider companies that belong to these industries, as they may be strong candidates for becoming unicorns.
  • It was also discovered that the average valuation of companies that joined in 2021 is highest in the first quarter of the year, and the average valuation of companies that joined in 2020 is the third quarter of the year. When considering companies that newly join in the future, it would be worth closely looking at companies that join in the first and third quarters of the year. (Although this is a weak one, in my opinion. Because this data/interpretation is based on just two years, whereas it takes many more years for companies to grow.)
  • The data can be analyzed further to gather more insights that are specific to the interests of the investing firm and the stakeholders.

In

,