Sample 5: Unicorn Companies
Following my notes on Python’s basic tools for data works, here I’ll do some basic EDA projects. Disclaimer: The one below is based on Google’s Advanced Data Analysis Program. My only intention, by repeating their structure, is to practice what I’ve learned and keep these notes as future reference. Their content can be reached via Coursera. A free version is also available without claiming the certificate.
The reference for the below work: Bhat, M.A. (2022, March). Unicorn Companies.
0. Introduction
As we did before with sample 2, we’ll imagine that we are a member of an analytics team that provides insights to an investing firm. To help them decide which companies to invest in next, the firm wants insights into unicorn companies, companies that are valued at over one billion dollars.
We will work with a dataset about unicorn companies, discovering characteristics of the data, structuring the data in ways that will help us draw meaningful insights, and using visualizations to analyze the data. Ultimately, we will draw conclusions about what significant trends or patterns we find in the dataset.
1. Imports
# Import libraries and packages
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
# Load the dataset provided into a DataFrame.
companies = pd.read_csv("Unicorn_Companies.csv")
2. Data exploration
companies.head()
Company | Valuation | Date Joined | Industry | City | Country/Region | Continent | Year Founded | Funding | Select Investors | |
0 | Bytedance | $180B | 4/7/17 | Artificial intelligence | Beijing | China | Asia | 2012 | $8B | Sequoia Capital China, SIG Asia Investments, S… |
1 | SpaceX | $100B | 12/1/12 | Other | Hawthorne | United States | North America | 2002 | $7B | Founders Fund, Draper Fisher Jurvetson, Rothen… |
2 | SHEIN | $100B | 7/3/18 | E-commerce & direct-to-consumer | Shenzhen | China | Asia | 2008 | $2B | Tiger Global Management, Sequoia Capital China… |
3 | Stripe | $95B | 1/23/14 | Fintech | San Francisco | United States | North America | 2010 | $2B | Khosla Ventures, LowercaseCapital, capitalG |
4 | Klarna | $46B | 12/12/11 | Fintech | Stockholm | Sweden | Europe | 2005 | $4B | Institutional Venture Partners, Sequoia Capita… |
companies.shape
(1074, 10)
# Check for duplicates.
companies.drop_duplicates().shape
(1074, 10)
companies.dtypes
Company object
Company object
Valuation object
Date Joined object
Industry object
City object
Country/Region object
Continent object
Year Founded int64
Funding object
Select Investors object
dtype: object
Sort the data
# To arrange the data from latest to earliest 'Year Founded'
companies.sort_values(by="Year Founded", ascending=False).head()
Company | Valuation | Date Joined | Industry | City | Country/Region | Continent | Year Founded | Funding | Select Investors | |
782 | Phantom | $1B | 1/31/22 | Fintech | San Francisco | United States | North America | 2021 | $118M | Paradigm, Andreessen Horowitz, Jump Capital |
714 | Yidian Zixun | $1B | 10/17/17 | Mobile & telecommunications | Beijing | China | Asia | 2021 | $151M | Phoenix New Media, Tianjin Haihe Industry Fund |
822 | GlobalBees | $1B | 12/28/21 | E-commerce & direct-to-consumer | New Delhi | India | Asia | 2021 | $185M | Chiratae Ventures, SoftBank Group, Trifecta Ca… |
554 | ClickHouse | $2B | 10/28/21 | Data management & analytics | Portola Valley | United States | North America | 2021 | $300M | Lightspeed Venture Partners, Almaz Capital Par… |
952 | LayerZero Labs | $1B | 3/30/22 | Internet software & services | New York | United States | North America | 2021 | $143M | Andreessen Horowitz, FTX Ventures, Tiger Globa… |
Determine the number of companies founded each year
# Display each unique year that occurs in the dataset
# along with the number of companies that were founded in each unique year.
companies["Year Founded"].value_counts().sort_values(ascending=False)
2015 155
2016 110
2014 109
2012 95
2013 87
2011 82
2017 74
2018 61
2019 45
2010 40
2009 34
2008 27
2020 25
2007 24
2006 15
2005 14
2000 11
2021 11
2001 9
1999 8
2004 8
2003 8
1998 5
2002 4
1994 2
1995 2
1992 1
1993 1
1990 1
1984 1
1996 1
1979 1
1991 1
1919 1
1997 1
Name: Year Founded, dtype: int64
Let’s plot these results on a histogram.
sns.histplot(data=companies, x='Year Founded')
plt.title('Year Founded histogram')
Now let’s examine when the companies joined unicorn status. Our Date Joined column currently has the data type object, so we need to convert it first.
# Convert 'Date Joined' column to datetime.
#companies["Date Joined"] = pd.to_datetime(companies["Date Joined"])
# Display the data types of the columns in `companies`
# to confirm that the update actually took place
companies.dtypes
Company object
Valuation object
Date Joined datetime64[ns]
Industry object
City object
Country/Region object
Continent object
Year Founded int64
Funding object
Select Investors object
dtype: object
Create a Month Joined column
# Obtain the names of the months when companies gained unicorn status.
companies["Month Joined"] = companies["Date Joined"].dt.month_name()
# Display the first few rows of `companies`
# to confirm that the new column did get added.
companies.head()
Company | Valuation | Date Joined | Industry | City | Country/Region | Continent | Year Founded | Funding | Select Investors | Month Joined | |
0 | Bytedance | $180B | 2017-04-07 | Artificial intelligence | Beijing | China | Asia | 2012 | $8B | Sequoia Capital China, SIG Asia Investments, S… | April |
1 | SpaceX | $100B | 2012-12-01 | Other | Hawthorne | United States | North America | 2002 | $7B | Founders Fund, Draper Fisher Jurvetson, Rothen… | December |
2 | SHEIN | $100B | 2018-07-03 | E-commerce & direct-to-consumer | Shenzhen | China | Asia | 2008 | $2B | Tiger Global Management, Sequoia Capital China… | July |
3 | Stripe | $95B | 2014-01-23 | Fintech | San Francisco | United States | North America | 2010 | $2B | Khosla Ventures, LowercaseCapital, capitalG | January |
4 | Klarna | $46B | 2011-12-12 | Fintech | Stockholm | Sweden | Europe | 2005 | $4B | Institutional Venture Partners, Sequoia Capita… | December |
Create a Years To Join column
Let’s determine how many years it took for companies to reach unicorn status.
companies["Years To Join"] = companies["Date Joined"].dt.year - companies["Year Founded"]
companies.head()
Company | Valuation | Date Joined | Industry | City | Country/Region | Continent | Year Founded | Funding | Select Investors | Month Joined | Years To Join | |
0 | Bytedance | $180B | 2017-04-07 | Artificial intelligence | Beijing | China | Asia | 2012 | $8B | Sequoia Capital China, SIG Asia Investments, S… | April | 5 |
1 | SpaceX | $100B | 2012-12-01 | Other | Hawthorne | United States | North America | 2002 | $7B | Founders Fund, Draper Fisher Jurvetson, Rothen… | December | 10 |
2 | SHEIN | $100B | 2018-07-03 | E-commerce & direct-to-consumer | Shenzhen | China | Asia | 2008 | $2B | Tiger Global Management, Sequoia Capital China… | July | 10 |
3 | Stripe | $95B | 2014-01-23 | Fintech | San Francisco | United States | North America | 2010 | $2B | Khosla Ventures, LowercaseCapital, capitalG | January | 4 |
4 | Klarna | $46B | 2011-12-12 | Fintech | Stockholm | Sweden | Europe | 2005 | $4B | Institutional Venture Partners, Sequoia Capita… | December | 6 |
Gain more insight on a specific year
We’ll get more insight on the year 2021, as that is the most recent year for which there is data available. Trends that took place in 2021 might be similar to trends that will take place in 2022.
companies_2021 = companies[companies["Date Joined"].dt.year == 2021]
companies_2021.head()
Company | Valuation | Date Joined | Industry | City | Country/Region | Continent | Year Founded | Funding | Select Investors | Month Joined | Years To Join | |
12 | FTX | $32B | 2021-07-20 | Fintech | NaN | Bahamas | North America | 2018 | $2B | Sequoia Capital, Thoma Bravo, Softbank | July | 3 |
16 | J&T Express | $20B | 2021-04-07 | Supply chain, logistics, & delivery | Jakarta | Indonesia | Asia | 2015 | $5B | Hillhouse Capital Management, Boyu Capital, Se… | April | 6 |
24 | Blockchain.com | $14B | 2021-02-17 | Fintech | London | United Kingdom | Europe | 2011 | $490M | Lightspeed Venture Partners, Google Ventures, … | February | 10 |
27 | OpenSea | $13B | 2021-07-20 | E-commerce & direct-to-consumer | New York | United States | North America | 2017 | $427M | Andreessen Horowitz, Thirty Five Ventures, Sou… | July | 4 |
34 | Getir | $12B | 2021-03-26 | E-commerce & direct-to-consumer | Istanbul | Turkey | Europe | 2015 | $2B | Tiger Global Management, Sequoia Capital, Revo… | March | 6 |
Observe trends over time
We’ll count the number of companies that joined per week. This allows the observation of trends over the weeks of 2021.
# Insert a 'Week Joined' column into 'companies_2021'.
companies_2021.insert(3, "Week Joined", companies_2021["Date Joined"].dt.strftime('%Y-W%V'), True)
# Group `companies_2021` by `Week Joined`.
# Aggregate by counting companies that joined per week of 2021.
companies_by_week_2021 = companies_2021.groupby(by="Week Joined")["Company"].count().reset_index().rename(columns={"Company":"Company Count"})
# Display the first few rows of the new DataFrame to confirm that it was created.
companies_by_week_2021.head()
Week Joined | Company Count | |
0 | 2021-W01 | 12 |
1 | 2021-W02 | 9 |
2 | 2021-W03 | 5 |
3 | 2021-W04 | 8 |
4 | 2021-W05 | 4 |
Compare trends over time
Let’s observe trends in the average valuation of companies quarterly between 2021 and 2020.
# Filter by the additional year to create a subset
# that consists of companies that joined in that year.
companies_2020 = companies[companies["Date Joined"].dt.year == 2020]
# Concatenate the new subset with the subset that we defined previously.
companies_2020_2021 = pd.concat([companies_2020, companies_2021.drop(columns="Week Joined")])
# Add `Quarter Joined` column to `companies_2021`.
companies_2020_2021["Quarter Joined"] = companies_2020_2021["Date Joined"].dt.to_period('Q').dt.strftime('%Y-Q%q')
# Convert the `Valuation` column to numeric by removing `$` and `B`
# and casting each value to data type `float`.
companies_2020_2021["Valuation"] = companies_2020_2021["Valuation"].str.strip("$B").astype(float)
# Group `companies_2020_2021` by `Quarter Joined`,
# Aggregate by computing average `Funding` of companies that joined per quarter of each year.
# Save the resulting DataFrame in a new variable.
companies_by_quarter_2020_2021 = companies_2020_2021.groupby(by="Quarter Joined")["Valuation"].mean().reset_index().rename(columns={"Valuation":"Average Valuation"})
# Display the first few rows of the new DataFrame to confirm that it was created.
companies_by_quarter_2020_2021.head()
Quarter Joined | Average Valuation | |
0 | 2020-Q1 | 3.444444 |
1 | 2020-Q2 | 3.777778 |
2 | 2020-Q3 | 3.896552 |
3 | 2020-Q4 | 3.697674 |
4 | 2021-Q1 | 2.75 |
3. Visualization
Let’s create a box plot to visualize the distribution of how long it took companies to become unicorns, with respect to the month they joined.
# Define a list that contains months in chronological order.
month_order = ["January", "February", "March", "April", "May", "June",
"July", "August", "September", "October", "November",
"December"]
sns.boxplot(x=companies['Month Joined'],
y=companies['Years To Join'],
order=month_order,
showfliers=False)
# Set the title of the plot.
plt.title('Distribution of years to become unicorn with respect to month joined')
# Rotate labels on the x-axis as a way to avoid overlap in the positions of the text.
plt.xticks(rotation=45, horizontalalignment='right')
# Display the plot.
plt.show()
In the preceding box plot, the median value for Years To Join is different for each month. Also, the median Years To Join is lower for the months of September and October. This could indicate that companies that reached unicorn status in early fall took less time to reach a $1 billion valuation.
This is because the number of companies is relatively close/consistent for each month. If that were not the case, it would be misleading to compare the median values from the box plots between months.
4. Results and Evaluation
Visualize the time it took companies to reach unicorn status
We will create a bar plot to visualize the average number of years it took companies to reach unicorn status with respect to when they were founded.
# Set the size of the plot.
plt.figure(figsize=(10,6))
# Create bar plot
sns.barplot(x=companies["Year Founded"], y=companies["Years To Join"], ci=False)
# Set title
plt.title("Bar plot of years to join with respect to year founded")
# Set x-axis label
plt.xlabel("Year founded")
# Set y-axis label
plt.ylabel("Years to join unicorn status")
# Rotate the labels on the x-axis as a way to avoid overlap in the positions of the text.
plt.xticks(rotation=45, horizontalalignment='right')
# Display the plot.
plt.show()
There appears to be a trend wherein companies that were founded later took less time to reach unicorn status, on average.
However, this is a bias that is common in time data. Because companies founded in later years have been around for less time. Therefore, there is less time to collect data on such companies compared to companies founded in earlier years.
Visualize the number of companies that joined per interval
# Set the size of the plot.
plt.figure(figsize = (20, 5))
# Create a bar plot.
plt.bar(x=companies_by_week_2021['Week Joined'], height=companies_by_week_2021['Company Count'])
plt.plot()
# Set the x-axis label.
plt.xlabel("Week number")
# Set the y-axis label.
plt.ylabel("Number of companies")
# Set the title.
plt.title("Number of companies that became unicorns per week in 2021")
# Rotate the labels on the x-axis as a way to avoid overlap in the positions of the text.
plt.xticks(rotation = 45, horizontalalignment='right', fontsize=8)
# Display the plot.
plt.show()
Observations from a bar plot of the number of companies that became unicorns per week in 2021:
- The number of companies that joined unicorn status fluctuated over the weeks of 2021, with a trend of decline followed by increase occurring periodically.
- The highest number of companies reached $1 billion valuation in Week 37 of 2021, which corresponds to the third week of September 2021.
Visualize the average valuation over the quarters
Now let’s create a grouped bar plot to visualize the average valuation over the quarters, with two bars for each time-interval. This will allow us to compare quarterly values between the two years.
# Using slicing, extract the year component and the time interval that we specified,
# and save them by adding two new columns into the subset.
companies_by_quarter_2020_2021['Quarter Number'] = companies_by_quarter_2020_2021['Quarter Joined'].str[-2:]
companies_by_quarter_2020_2021['Year Joined'] = companies_by_quarter_2020_2021['Quarter Joined'].str[:4]
# Set the size of the plot.
plt.figure(figsize = (10, 5))
# Create a grouped bar plot.
sns.barplot(x=companies_by_quarter_2020_2021['Quarter Number'],
y=companies_by_quarter_2020_2021['Average Valuation'],
hue=companies_by_quarter_2020_2021['Year Joined'])
plt.plot()
# Set the x-axis label.
plt.xlabel("Quarter number")
# Set the y-axis label.
plt.ylabel("Average valuation (billions of dollars)")
# Set the title.
plt.title("Average valuation of companies that became unicorns per quarter in 2020 vs. 2021")
# Display the plot.
plt.show()
Observations from a grouped bar plot of average valuation of companies that became unicorns per quarter in 2020 vs. 2021:
- In each quarter, the average valuation of companies that joined unicorn status was higher in 2020 than in 2021.
- In 2020, Q3 was the quarter with the highest average valuation of companies that reached unicorn status, and there was a trend of increase from Q1 to Q2 and from Q2 to Q3.
- In 2021, Q1 was the quarter with the highest average valuation of companies that reached unicorn status, and there was a trend of decrease across the quarters.
Some notes about this study and the dataset
Potential bias:
- If there were bias in terms of which cities and countries were taken into account when collecting the data, then the analysis would be more representative of the cities and countries that are in the dataset than those that are not.
- If the dataset did not include certain industries, then the analysis would be more representative of the industries that are included and may not reflect trends in those that are excluded from the data.
- If the dataset had time gaps, (e.g., if companies that joined in certain windows of time were not included in the data), then that may have affected the patterns observed, depending on how salient the gaps were.
- Another point of bias pertains to the nature of time data: There have been fewer years to collect data on companies that were founded more recently than for companies that were founded longer ago.
Potential next steps with EDA:
- Analyze the data with respect to industries of unicorn companies at different datetime intervals.
- Analyze the data with respect to cities or countries where unicorn companies were founded at different datetime intervals.
- Clean the data as needed.
Possible questions:
- How many rounds of funding did each company require and when did this funding take place?
- Have any of these unicorn companies acquired other companies along the way?
- If so, which companies acquired other companies, which companies did they acquire, and when did the acquisitions take place?
Conclusion
What findings would we share with others?
- There are 1074 unicorn companies represented in this dataset.
- 2015 is the year when the most number of unicorn companies were founded.
- Many of the unicorn companies that were founded in 2021 were founded in the United States and belong to “Fintech”, “E-commerce & direct-to-consumer”, and “Internet software & services” industries.
- The box plot created shows that companies that become unicorns in the months of September and October have a smaller median value for how long it took to become unicorns.
- One of the bar plots created shows that the average valuation of companies that joined in 2020 is highest in the third quarter of the year, whereas the average valuation of companies that joined in 2021 is highest in the first quarter of the year.
What recommendations would we share with stakeholders based on these findings?
- According to data analysis that was conducted on a dataset of 1074 unicorn companies, companies that joined in the months of September and October tended to take less time to become unicorns.
- Another finding was that many of the unicorn companies that were founded in 2021 were founded in the United States and belong to “Fintech”, “E-commerce & direct-to-consumer”, and “Internet software & services” industries. So if the stakeholders want to invest in companies founded in 2021, it would be a good idea to consider companies that belong to these industries, as they may be strong candidates for becoming unicorns.
- It was also discovered that the average valuation of companies that joined in 2021 is highest in the first quarter of the year, and the average valuation of companies that joined in 2020 is the third quarter of the year. When considering companies that newly join in the future, it would be worth closely looking at companies that join in the first and third quarters of the year. (Although this is a weak one, in my opinion. Because this data/interpretation is based on just two years, whereas it takes many more years for companies to grow.)
- The data can be analyzed further to gather more insights that are specific to the interests of the investing firm and the stakeholders.