Cleaning Data: Checking for Duplicates

Data cleaning is an important part of exploratory data analysis (EDA). I’ll write more about it in upcoming posts; here we’ll start with checking for and eliminating duplicates.

Identifying duplicates

print(df)
print()
print(df.duplicated())

    brand    style  rating
0   Wowyow  cistern     4.0
1   Wowyow  cistern     4.0
2  Splaysh      jug     5.5
3  Splaysh    stock     3.3
4  Pipplee    stock     3.0

0    False
1     True
2    False
3    False
4    False
dtype: bool
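
The boolean result above is mostly useful as a mask. Here is a minimal sketch of counting and inspecting the flagged rows, assuming the DataFrame is rebuilt by hand from the printed table (the variable names mask, n_duplicates and duplicate_rows are my own, not part of pandas):

```python
import pandas as pd

# Hand-built copy of the example table printed above (an assumption:
# the original post does not show how df was created)
df = pd.DataFrame({
    "brand": ["Wowyow", "Wowyow", "Splaysh", "Splaysh", "Pipplee"],
    "style": ["cistern", "cistern", "jug", "stock", "stock"],
    "rating": [4.0, 4.0, 5.5, 3.3, 3.0],
})

# duplicated() returns a boolean Series; True marks a repeat of an earlier row
mask = df.duplicated()

# Booleans sum as 0/1, so this counts the flagged rows
n_duplicates = int(mask.sum())

# Boolean indexing keeps only the rows where the mask is True
duplicate_rows = df[mask]

print(n_duplicates)
print(duplicate_rows)
```

Checking the count first is a cheap way to decide whether a dataset needs deduplication at all.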

By default, the duplicated() method returns a boolean Series with one value per row: True where every column in the row exactly matches an earlier row, and False otherwise. A value that merely repeats within a single column does not count. To check for duplicates in only one column or a particular set of columns, pass those column names to the subset parameter of duplicated().

print(df)
print()
print(df.duplicated(subset=['type'], keep='last'))

   color  rating     type
0   olive     9.0    rinds
1   olive     9.0    rinds
2    gray     4.5  pellets
3  salmon    11.0  pellets
4  salmon     7.0  pellets

0     True
1    False
2     True
3     True
4    False
dtype: bool

The example above checks for duplicates in a single column, type, via subset. Because keep='last', the last occurrence of each value is marked False (it is the one that would be kept), and every earlier occurrence is marked True.
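
To see how the keep argument changes which rows get flagged, the sketch below compares its three accepted values ('first', 'last' and False) side by side. The DataFrame is rebuilt by hand from the printed table, and the comparison frame and its column names are only for illustration:

```python
import pandas as pd

# Hand-built copy of the example table printed above (an assumption)
df = pd.DataFrame({
    "color": ["olive", "olive", "gray", "salmon", "salmon"],
    "rating": [9.0, 9.0, 4.5, 11.0, 7.0],
    "type": ["rinds", "rinds", "pellets", "pellets", "pellets"],
})

# keep='first' (the default): the first occurrence is False, later ones True
first = df.duplicated(subset=["type"], keep="first")

# keep='last': the last occurrence is False, earlier ones True
last = df.duplicated(subset=["type"], keep="last")

# keep=False: every member of a duplicate group is True
every = df.duplicated(subset=["type"], keep=False)

# Illustrative side-by-side view; these column names are made up
comparison = pd.DataFrame({
    "type": df["type"],
    "keep_first": first,
    "keep_last": last,
    "keep_False": every,
})
print(comparison)
```

Passing keep=False is handy when you want to review every row involved in a duplicate group before deciding which one to keep.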

Deduplication

Deduplication is the removal of duplicate records from a dataset, so that each group of matching records is represented only once.

df

    brand    style  rating
0   Wowyow  cistern     4.0
1   Wowyow  cistern     4.0
2  Splaysh      jug     5.5
3  Splaysh    stock     3.3
4  Pipplee    stock     3.0

df.drop_duplicates()

    brand    style  rating
0   Wowyow  cistern     4.0
2  Splaysh      jug     5.5
3  Splaysh    stock     3.3
4  Pipplee    stock     3.0
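
Two details are easy to miss here: drop_duplicates() returns a new DataFrame rather than modifying df, and with default arguments it gives the same result as filtering with the inverted duplicated() mask. A small sketch, assuming the DataFrame is rebuilt by hand from the printed table (deduped and via_mask are illustrative names):

```python
import pandas as pd

# Hand-built copy of the example table printed above (an assumption)
df = pd.DataFrame({
    "brand": ["Wowyow", "Wowyow", "Splaysh", "Splaysh", "Pipplee"],
    "style": ["cistern", "cistern", "jug", "stock", "stock"],
    "rating": [4.0, 4.0, 5.5, 3.3, 3.0],
})

# drop_duplicates() returns a new DataFrame; df itself is left unchanged
deduped = df.drop_duplicates()

# Equivalent by hand: keep the rows that duplicated() does NOT flag
via_mask = df[~df.duplicated()]

print(len(df), len(deduped))
print(deduped.equals(via_mask))
```

Because the original is untouched, the result has to be assigned to a name if you want to keep it.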

Again, drop_duplicates() as written above only drops rows that exactly match an earlier row across every column. To drop rows based on duplicates in a single column or a particular set of columns, specify them with the subset keyword argument.

print(df)
# Assign to a new name so the original df stays intact for later examples
deduped = df.drop_duplicates(subset='style')
print()
print(deduped)

    brand    style  rating
0   Wowyow  cistern     4.0
1   Wowyow  cistern     4.0
2  Splaysh      jug     5.5
3  Splaysh    stock     3.3
4  Pipplee    stock     3.0

     brand    style  rating
0   Wowyow  cistern     4.0
2  Splaysh      jug     5.5
3  Splaysh    stock     3.3

The example below drops every row, except the first occurrence, whose combination of style and rating values has already appeared. Rows 3 and 4 share a style but differ in rating, so both are kept:

print(df)
# Assign to a new name so the original df stays intact
deduped = df.drop_duplicates(subset=['style', 'rating'])
print()
print(deduped)

    brand    style  rating
0   Wowyow  cistern     4.0
1   Wowyow  cistern     4.0
2  Splaysh      jug     5.5
3  Splaysh    stock     3.3
4  Pipplee    stock     3.0

     brand    style  rating
0   Wowyow  cistern     4.0
2  Splaysh      jug     5.5
3  Splaysh    stock     3.3
4  Pipplee    stock     3.0
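
drop_duplicates() also accepts the keep argument, and dropped rows leave gaps in the index, as the outputs above show. The sketch below keeps the last occurrence per style and renumbers the result. It assumes pandas 1.0 or later for ignore_index, and a DataFrame rebuilt by hand from the printed table; latest and unique_only are illustrative names:

```python
import pandas as pd

# Hand-built copy of the example table printed above (an assumption)
df = pd.DataFrame({
    "brand": ["Wowyow", "Wowyow", "Splaysh", "Splaysh", "Pipplee"],
    "style": ["cistern", "cistern", "jug", "stock", "stock"],
    "rating": [4.0, 4.0, 5.5, 3.3, 3.0],
})

# Keep the LAST row for each style and renumber the index from 0
# (ignore_index requires pandas >= 1.0)
latest = df.drop_duplicates(subset="style", keep="last", ignore_index=True)

# keep=False drops every row whose style appears more than once
unique_only = df.drop_duplicates(subset="style", keep=False)

print(latest)
print()
print(unique_only)
```

Which occurrence to keep depends on the data: if later rows are more recent records, keep='last' is usually the safer choice.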