Perform feature engineering in Python

Introduction

In this activity, we are working for a firm that provides insights to the National Basketball Association (NBA). We will help NBA managers and coaches identify which players are most likely to thrive in the high-pressure environment of professional basketball and help the team be successful over time.

To do this, we will analyze a subset of data that contains information about NBA players and their performance records. We will conduct feature engineering to determine which features will most effectively predict whether a player’s NBA career will last at least five years. The insights gained will then be used in the next stage of the project: building the predictive model.

Step 1: Imports

# Import pandas
import pandas as pd
# Save in a variable named `data`
data = pd.read_csv("nba-players.csv", index_col=0)

Step 2: Data exploration

# Display first 10 rows of data
data.head(10)

Out[3]:

              name  gp   min  pts  fgm  fga    fg  3p_made  3pa    3p  ...  fta    ft  oreb  dreb  reb  ast  stl  blk  tov  target_5yrs
0   Brandon Ingram  36  27.4  7.4  2.6  7.6  34.7      0.5  2.1  25.0  ...  2.3  69.9   0.7   3.4  4.1  1.9  0.4  0.4  1.3            0
1  Andrew Harrison  35  26.9  7.2  2.0  6.7  29.6      0.7  2.8  23.5  ...  3.4  76.5   0.5   2.0  2.4  3.7  1.1  0.5  1.6            0
2   JaKarr Sampson  74  15.3  5.2  2.0  4.7  42.2      0.4  1.7  24.4  ...  1.3  67.0   0.5   1.7  2.2  1.0  0.5  0.3  1.0            0
3      Malik Sealy  58  11.6  5.7  2.3  5.5  42.6      0.1  0.5  22.6  ...  1.3  68.9   1.0   0.9  1.9  0.8  0.6  0.1  1.0            1
4      Matt Geiger  48  11.5  4.5  1.6  3.0  52.4      0.0  0.1   0.0  ...  1.9  67.4   1.0   1.5  2.5  0.3  0.3  0.4  0.8            1
5     Tony Bennett  75  11.4  3.7  1.5  3.5  42.3      0.3  1.1  32.5  ...  0.5  73.2   0.2   0.7  0.8  1.8  0.4  0.0  0.7            0
6      Don MacLean  62  10.9  6.6  2.5  5.8  43.5      0.0  0.1  50.0  ...  1.8  81.1   0.5   1.4  2.0  0.6  0.2  0.1  0.7            1
7     Tracy Murray  48  10.3  5.7  2.3  5.4  41.5      0.4  1.5  30.0  ...  0.8  87.5   0.8   0.9  1.7  0.2  0.2  0.1  0.7            1
8     Duane Cooper  65   9.9  2.4  1.0  2.4  39.2      0.1  0.5  23.3  ...  0.5  71.4   0.2   0.6  0.8  2.3  0.3  0.0  1.1            0
9     Dave Johnson  42   8.5  3.7  1.4  3.5  38.3      0.1  0.3  21.4  ...  1.4  67.8   0.4   0.7  1.1  0.3  0.2  0.0  0.7            0

10 rows × 21 columns

# Display number of rows, number of columns
data.shape
(1340, 21)
Data dictionary

The following table provides a description of the data in each column.

name: Name of NBA player
gp: Number of games played
min: Number of minutes played per game
pts: Average number of points per game
fgm: Average number of field goals made per game
fga: Average number of field goal attempts per game
fg: Average percent of field goals made per game
3p_made: Average number of three-point field goals made per game
3pa: Average number of three-point field goal attempts per game
3p: Average percent of three-point field goals made per game
ftm: Average number of free throws made per game
fta: Average number of free throw attempts per game
ft: Average percent of free throws made per game
oreb: Average number of offensive rebounds per game
dreb: Average number of defensive rebounds per game
reb: Average number of rebounds per game
ast: Average number of assists per game
stl: Average number of steals per game
blk: Average number of blocks per game
tov: Average number of turnovers per game
target_5yrs: 1 if career duration >= 5 yrs, 0 otherwise

# Display a summary of the DataFrame
data.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1340 entries, 0 to 1339
Data columns (total 21 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 name 1340 non-null object
1 gp 1340 non-null int64
2 min 1340 non-null float64
3 pts 1340 non-null float64
4 fgm 1340 non-null float64
5 fga 1340 non-null float64
6 fg 1340 non-null float64
7 3p_made 1340 non-null float64
8 3pa 1340 non-null float64
9 3p 1340 non-null float64
10 ftm 1340 non-null float64
11 fta 1340 non-null float64
12 ft 1340 non-null float64
13 oreb 1340 non-null float64
14 dreb 1340 non-null float64
15 reb 1340 non-null float64
16 ast 1340 non-null float64
17 stl 1340 non-null float64
18 blk 1340 non-null float64
19 tov 1340 non-null float64
20 target_5yrs 1340 non-null int64
dtypes: float64(18), int64(2), object(1)
memory usage: 230.3+ KB
Check for missing values
# Display the number of missing values in each column.
# Check whether each value is missing.
# Aggregate the number of missing values per column.
data.isna().sum()
name           0
gp             0
min            0
pts            0
fgm            0
fga            0
fg             0
3p_made        0
3pa            0
3p             0
ftm            0
fta            0
ft             0
oreb           0
dreb           0
reb            0
ast            0
stl            0
blk            0
tov            0
target_5yrs    0
dtype: int64

Step 3: Statistical tests

# Display percentage (%) of values for each class (1, 0) represented in the target column of this dataset
data["target_5yrs"].value_counts(normalize=True)*100
1    62.014925
0    37.985075
Name: target_5yrs, dtype: float64
  • The dataset is not perfectly balanced, but an exact 50-50 split is a rare occurrence in datasets, and a 62-38 split is not too imbalanced. However, if the majority class made up 90% or more of the dataset, then that would be of concern, and it would be prudent to address that issue through techniques like upsampling and downsampling.
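
As a point of reference only (this dataset does not require it), upsampling the minority class is one simple way to address a severe imbalance. The sketch below uses scikit-learn's resample utility; the threshold and approach are illustrative, and in practice resampling would be applied to the training data only.

# Illustrative sketch only: not needed for this 62-38 split.
# Upsample the minority class (0) until it matches the majority class (1).
from sklearn.utils import resample

majority = data[data["target_5yrs"] == 1]
minority = data[data["target_5yrs"] == 0]

# Sample the minority class with replacement until it matches the majority size
minority_upsampled = resample(minority, replace=True, n_samples=len(majority), random_state=42)

balanced_data = pd.concat([majority, minority_upsampled])
balanced_data["target_5yrs"].value_counts(normalize=True) * 100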

Step 4: Results and evaluation

Feature Selection
  • A player’s name is not helpful in determining their career duration. Moreover, it may not be ethical or fair to predict a player’s career duration based on a name.
  • The number of games a player has played in may not be as important in determining their career duration as the number of points they have earned. However, gp and pts could be combined to get the total number of points earned across the games played, and that result could be a helpful feature. That approach can be implemented later, in the feature extraction step.
  • If the number of points earned across games will be extracted as a feature, then that could be combined with the number of minutes played across games (min * gp) to extract another feature. This could be a measure of players’ efficiency and could help in predicting players’ career duration. min on its own may not be useful as a feature for the same reason as gp.
  • There are three different columns that give information about field goals. The percent of field goals a player makes (fg) says more about their performance than the number of field goals they make (fgm) or the number of field goals they attempt (fga). The percent gives more context, as it takes into account both how many field goals a player successfully made and how many field goals they attempted in total. This allows for a more meaningful comparison between players. The same logic applies to the percent of three-point field goals made, as well as the percent of free throws made.
  • There are columns for the number of offensive rebounds (oreb), the number of defensive rebounds (dreb), and the number of rebounds overall (reb). Because the overall number of rebounds should already incorporate both offensive and defensive rebounds, it would make sense to use reb as the feature.
  • The number of assists (ast), steals (stl), blocks (blk), and turnovers (tov) also provide information about how well players are performing in games, and thus, could be helpful in predicting how long players last in the league.

Therefore, at this stage of the feature engineering process, it would be most effective to select the following columns:

gp, min, pts, fg, 3p, ft, reb, ast, stl, blk, tov.

# Select the columns to proceed with and save the DataFrame in new variable `selected_data`
# Include the target column, `target_5yrs`
selected_data = data[["gp", "min", "pts", "fg", "3p", "ft", "reb", "ast", "stl", "blk", "tov", "target_5yrs"]]

# Display the first few rows
selected_data.head()
   gp   min  pts    fg    3p    ft  reb  ast  stl  blk  tov  target_5yrs
0  36  27.4  7.4  34.7  25.0  69.9  4.1  1.9  0.4  0.4  1.3            0
1  35  26.9  7.2  29.6  23.5  76.5  2.4  3.7  1.1  0.5  1.6            0
2  74  15.3  5.2  42.2  24.4  67.0  2.2  1.0  0.5  0.3  1.0            0
3  58  11.6  5.7  42.6  22.6  68.9  1.9  0.8  0.6  0.1  1.0            1
4  48  11.5  4.5  52.4   0.0  67.4  2.5  0.3  0.3  0.4  0.8            1
Feature transformation

An important aspect of feature transformation is feature encoding: if there are categorical columns that we want to use as features, those columns should be transformed into a numerical representation.

In this particular dataset, name is the only categorical column; the rest are numerical. Given that name is not selected as a feature, all of the features selected at this point are already numerical and do not require transformation.
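
For reference, had a categorical column been selected as a feature, one common encoding approach is one-hot encoding with pandas. The position column below is hypothetical and is not part of this dataset; it only illustrates the technique.

# Hypothetical example: `position` is not in this dataset; it only shows how a
# categorical feature could be one-hot encoded into numerical columns.
example = pd.DataFrame({"position": ["guard", "forward", "center", "guard"]})
pd.get_dummies(example, columns=["position"])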

Feature extraction
  • The gp, pts, and min columns lend themselves to feature extraction.
    • gp represents the total number of games a player has played in, and pts represents the average number of points the player has earned per game. It might be helpful to combine these columns to get the total number of points the player has earned across the games and use the result as a new feature, which could be added into a new column named total_points. The total points earned by a player can reflect their performance and shape their career longevity.
    • The min column represents the average number of minutes played per game. total_points could be combined with min and gp to extract a new feature: points earned per minute. This can be considered a measure of player efficiency, which could shape career duration. This feature can be added into a column named efficiency.
# Make a copy of `selected_data` 
extracted_data = selected_data.copy()

# Calculate total points earned by multiplying the number of games played by the average number of points earned per game
extracted_data["total_points"] = extracted_data["gp"] * extracted_data["pts"]

# Calculate efficiency by dividing the total points earned by the total number of minutes played, 
# which yields points per minute 
extracted_data["efficiency"] = extracted_data["total_points"] / (extracted_data["min"] * extracted_data["gp"])

# Display the first few rows of `extracted_data` to confirm that the new columns were added
extracted_data.head()
   gp   min  pts    fg    3p    ft  reb  ast  stl  blk  tov  target_5yrs  total_points  efficiency
0  36  27.4  7.4  34.7  25.0  69.9  4.1  1.9  0.4  0.4  1.3            0         266.4    0.270073
1  35  26.9  7.2  29.6  23.5  76.5  2.4  3.7  1.1  0.5  1.6            0         252.0    0.267658
2  74  15.3  5.2  42.2  24.4  67.0  2.2  1.0  0.5  0.3  1.0            0         384.8    0.339869
3  58  11.6  5.7  42.6  22.6  68.9  1.9  0.8  0.6  0.1  1.0            1         330.6    0.491379
4  48  11.5  4.5  52.4   0.0  67.4  2.5  0.3  0.3  0.4  0.8            1         216.0    0.391304

Now, to prepare for the Naive Bayes model that we will build later, let’s clean the extracted data and ensure it is concise. Naive Bayes assumes that features are independent of each other given the class. To satisfy that assumption, if certain features were aggregated to yield new features, it may be necessary to remove the original features. Therefore, we’ll drop the columns that were used to extract the new features.

Note: There are other types of models that do not involve independence assumptions, so this would not be required in those instances. In fact, keeping the original features may be beneficial.
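
As an optional diagnostic (not a required step), a quick correlation check on extracted_data before the drop shows how strongly total_points and efficiency track the columns they were derived from, which is the overlap the drop is meant to remove.

# Optional check: correlations between the original columns and the features
# extracted from them; high values indicate the originals are now redundant
extracted_data[["gp", "pts", "min", "total_points", "efficiency"]].corr()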

# Remove `gp`, `pts`, and `min` from `extracted_data`
extracted_data = extracted_data.drop(columns=["gp", "pts", "min"])

# Display the first few rows of `extracted_data` to ensure that column drops took place
extracted_data.head()
     fg    3p    ft  reb  ast  stl  blk  tov  target_5yrs  total_points  efficiency
0  34.7  25.0  69.9  4.1  1.9  0.4  0.4  1.3            0         266.4    0.270073
1  29.6  23.5  76.5  2.4  3.7  1.1  0.5  1.6            0         252.0    0.267658
2  42.2  24.4  67.0  2.2  1.0  0.5  0.3  1.0            0         384.8    0.339869
3  42.6  22.6  68.9  1.9  0.8  0.6  0.1  1.0            1         330.6    0.491379
4  52.4   0.0  67.4  2.5  0.3  0.3  0.4  0.8            1         216.0    0.391304

Now we’ll export the extracted data as a new .csv file, so we can use this later.

# Export the extracted data
extracted_data.to_csv("extracted_nba_players_data.csv", index=False)
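
As an optional sanity check, the exported file can be read back to confirm its contents:

# Optional: re-import the exported file to verify it was written as expected
pd.read_csv("extracted_nba_players_data.csv").head()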

Considerations

What summary could we provide to stakeholders?

  • The following attributes about player performance could help predict NBA career duration and should be included in a presentation to stakeholders: field goal percentage, three-point field goal percentage, free throw percentage, rebounds, assists, steals, blocks, turnovers, total points, and efficiency (points per minute).
  • It would be important to explain that these attributes, along with a relevant dataset, will be used in the next stage of the project. At that point, a model will be built to predict a player’s career duration. Insights gained will be shared with stakeholders once the project is complete. Stakeholders would also appreciate being provided with a timeline and key deliverables that they can expect to receive.

Disclaimer: Like most of my posts, this content is intended solely for educational purposes and was created primarily for my personal reference. At times, I may rephrase original texts, and in some cases, I include materials such as graphs, equations, and datasets directly from their original sources. 

I typically reference a variety of sources and update my posts whenever new or related information becomes available. For this particular post, the primary source was the Google Advanced Data Analytics Professional Certificate program.