Data cleaning is an important part of exploratory data analysis (EDA). I’ll write more about it in upcoming posts; here we’ll start with checking for and eliminating duplicates.
Identifying duplicates
print(df)
print()
print(df.duplicated())
     brand    style  rating
0   Wowyow  cistern     4.0
1   Wowyow  cistern     4.0
2  Splaysh      jug     5.5
3  Splaysh    stock     3.3
4  Pipplee    stock     3.0

0    False
1     True
2    False
3    False
4    False
dtype: bool
The duplicated() function returns a boolean Series that marks a row as True only when every value in it exactly matches an earlier row; a match in a single column is not enough. By default the first occurrence is marked False and each later copy is marked True. To check for duplicates in only one column or a set of columns, pass those column names to the “subset” argument of duplicated().
print(df)
print()
print(df.duplicated(subset=['type'], keep='last'))
    color  rating     type
0   olive     9.0    rinds
1   olive     9.0    rinds
2    gray     4.5  pellets
3  salmon    11.0  pellets
4  salmon     7.0  pellets

0     True
1    False
2     True
3     True
4    False
dtype: bool
The example above checks for duplicates in a single column (the subset). With keep='last', the last occurrence of each value is marked False, so it is the one that would be “kept”, and every earlier occurrence is marked True.
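To make the effect of the keep argument easier to see, here is a minimal sketch that compares its three accepted values side by side. The frame is rebuilt from the printed values above; the constructor call and the variable names first, last and every are assumptions for illustration, not code from the original post.

```python
import pandas as pd

# Rebuilt from the printed output above; the constructor call is assumed.
df = pd.DataFrame({
    'color': ['olive', 'olive', 'gray', 'salmon', 'salmon'],
    'rating': [9.0, 9.0, 4.5, 11.0, 7.0],
    'type': ['rinds', 'rinds', 'pellets', 'pellets', 'pellets'],
})

# keep='first' (the default): every occurrence after the first is flagged.
first = df.duplicated(subset=['type'], keep='first')
# keep='last': every occurrence before the last is flagged.
last = df.duplicated(subset=['type'], keep='last')
# keep=False: every row whose 'type' value appears more than once is flagged.
every = df.duplicated(subset=['type'], keep=False)

# Show the three masks next to each other.
print(pd.DataFrame({'first': first, 'last': last, 'all': every}))
print()
# True counts as 1, so sum() gives the number of flagged rows.
print(first.sum(), last.sum(), every.sum())
```

Because every value in the type column occurs more than once in this frame, keep=False flags all five rows, while the other two settings each leave one row per value unflagged.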
Deduplication
Deduplication is the removal of duplicate records from a dataset, so that each distinct record appears only once.
df
     brand    style  rating
0   Wowyow  cistern     4.0
1   Wowyow  cistern     4.0
2  Splaysh      jug     5.5
3  Splaysh    stock     3.3
4  Pipplee    stock     3.0
df.drop_duplicates()
     brand    style  rating
0   Wowyow  cistern     4.0
2  Splaysh      jug     5.5
3  Splaysh    stock     3.3
4  Pipplee    stock     3.0
Again, the drop_duplicates() function as written above only drops rows that exactly match an earlier row in every column. To drop rows based on duplicates in a single column or a few columns, name those columns with the subset keyword argument.
print(df)
# Assign to a new name so that df itself stays unchanged.
deduped = df.drop_duplicates(subset='style')
print()
print(deduped)
     brand    style  rating
0   Wowyow  cistern     4.0
1   Wowyow  cistern     4.0
2  Splaysh      jug     5.5
3  Splaysh    stock     3.3
4  Pipplee    stock     3.0

     brand    style  rating
0   Wowyow  cistern     4.0
2  Splaysh      jug     5.5
3  Splaysh    stock     3.3
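drop_duplicates() accepts the same keep argument as duplicated(). As a small sketch, assuming the frame is rebuilt from the printed values above (the constructor call and the name last_per_style are illustrative, not from the original post), this keeps the last row for each style instead of the first:

```python
import pandas as pd

# Rebuilt from the printed output above; the constructor call is assumed.
df = pd.DataFrame({
    'brand': ['Wowyow', 'Wowyow', 'Splaysh', 'Splaysh', 'Pipplee'],
    'style': ['cistern', 'cistern', 'jug', 'stock', 'stock'],
    'rating': [4.0, 4.0, 5.5, 3.3, 3.0],
})

# Keep the last row for each style instead of the first.
last_per_style = df.drop_duplicates(subset='style', keep='last')
print(last_per_style)
print()
# drop_duplicates() returns a new frame; the original keeps all of its rows.
print(len(df), len(last_per_style))
```

With keep='last' the surviving rows are those at index 1, 2 and 4, so which brand and rating remain for a repeated style depends on this choice.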
And the example below drops every row, except the first occurrence, whose style and rating values both match those of an earlier row:
print(df)
# Assign to a new name so that df itself stays unchanged.
deduped = df.drop_duplicates(subset=['style', 'rating'])
print()
print(deduped)
     brand    style  rating
0   Wowyow  cistern     4.0
1   Wowyow  cistern     4.0
2  Splaysh      jug     5.5
3  Splaysh    stock     3.3
4  Pipplee    stock     3.0

     brand    style  rating
0   Wowyow  cistern     4.0
2  Splaysh      jug     5.5
3  Splaysh    stock     3.3
4  Pipplee    stock     3.0
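The two functions in this post are closely related: dropping duplicates gives the same rows as filtering with the inverted duplicated() mask. The sketch below illustrates that under the assumption that the frame is rebuilt from the printed values above; the names cols, dropped and masked are hypothetical.

```python
import pandas as pd

# Rebuilt from the printed output above; the constructor call is assumed.
df = pd.DataFrame({
    'brand': ['Wowyow', 'Wowyow', 'Splaysh', 'Splaysh', 'Pipplee'],
    'style': ['cistern', 'cistern', 'jug', 'stock', 'stock'],
    'rating': [4.0, 4.0, 5.5, 3.3, 3.0],
})

cols = ['style', 'rating']
dropped = df.drop_duplicates(subset=cols)
# ~ inverts the mask, so only rows NOT flagged as duplicates are kept.
masked = df[~df.duplicated(subset=cols)]

# Both approaches select the same rows.
print(dropped.equals(masked))
print()
# With keep=False, every copy of a repeated (style, rating) pair can be
# inspected before anything is removed.
print(df[df.duplicated(subset=cols, keep=False)])
```

Looking at the flagged rows first is a cheap way to confirm that the chosen subset really identifies records that are safe to drop.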