Python Libraries: Pandas

While NumPy is capable of many of the same functions and operations as pandas, it’s not always as easy to work with because it requires us to work more abstractly with the data and keep track of what’s being done to it, even if we can’t see it.

Pandas, on the other hand, provides a simple interface that allows us to display our data as rows and columns. This means that we can always follow exactly what’s happening to our data as we manipulate it.

(Typically, when using pandas, we import both NumPy and pandas together. This is just for convenience, given that NumPy is often used in conjunction with pandas.)

The dataframe is a core data structure in pandas. A dataframe is made up of rows and columns, and it can contain data of many different types, including integers, floats, strings, booleans, and more. Pandas’ key functionality is the manipulation and analysis of tabular data.

Dataframes and Series

The core pandas object classes are DataFrame and Series:

DataFrame: A two-dimensional, labeled data structure with rows and columns. We can think of a dataframe like a spreadsheet or a SQL table, where each column and row is represented by a Series.

Creating a DataFrame

import numpy as np
import pandas as pd
data = {'col1': [1, 2], 'col2': [3, 4]}
df = pd.DataFrame(data=data)
df

	col1	col2
0	1	3
1	2	4

# Use pd.DataFrame() function to create a dataframe from a NumPy array.
df2 = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
                   columns=['a', 'b', 'c'], index=['x', 'y', 'z'])
df2

	a	b	c
x	1	2	3
y	4	5	6
z	7	8	9

Because a NumPy array carries no labels of its own, we passed the column names and the row index explicitly in the code above.
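When the columns and index arguments are left out, pandas falls back to default integer labels on both axes. A minimal sketch of that behavior (the name df_default is introduced here for illustration only):

```python
import numpy as np
import pandas as pd

# Same array as above, but without explicit column or row labels.
df_default = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]))

# pandas assigns the default integer labels to both axes.
print(list(df_default.columns))
print(list(df_default.index))
```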

Often, data professionals need to be able to create a dataframe from existing data that’s not written in Python syntax.

# Use pd.read_csv() function to create a dataframe from a .csv file
# from a URL or filepath.
df3 = pd.read_csv('train.csv')
df3.head()

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	0	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th…	female	38.0	1	0	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	0	STON/O2. 3101282	7.9250	NaN	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	0	113803	53.1000	C123	S
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	0	373450	8.0500	NaN	S
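pd.read_csv() also accepts a file-like object, which makes it easy to try out without a file on disk. The sketch below uses a small in-memory string with made-up values in place of train.csv; csv_text and df_small are hypothetical names, not part of the Titanic example.

```python
import io
import pandas as pd

# Hypothetical CSV content standing in for a file on disk.
csv_text = """id,name,score
1,Ana,9.5
2,Ben,
3,Cy,7.0
"""

# io.StringIO wraps the string in a file-like object that read_csv() can parse.
df_small = pd.read_csv(io.StringIO(csv_text))

print(df_small.head())   # first rows (up to five)
print(df_small.dtypes)   # column types inferred from the text
```

The empty score field in the second data row is read as a missing value (NaN), just like the missing Cabin entries in the Titanic output.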

Series: One-dimensional, labeled array. Series objects are most often used to represent individual columns or rows of a dataframe.

Each element in a series has an associated label called an index. The index allows for more efficient and intuitive data manipulation by making it easier to reference specific elements of our data.
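As a small illustration of index labels, the sketch below builds a series by hand and looks up the same element by label and by position. The data and the name ages are made up for this example.

```python
import pandas as pd

# Hypothetical data: ages keyed by name instead of by position.
ages = pd.Series([22.0, 38.0, 26.0], index=['Owen', 'Florence', 'Laina'])

print(ages['Florence'])   # look up by label
print(ages.iloc[1])       # same element, looked up by position
print(list(ages.index))   # the labels themselves
```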

# Print class of first row
print(type(df3.iloc[0]))

# Print class of "Name" column
print(type(df3['Name']))

<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>

Pandas Attributes and Methods

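# From here on, the Titanic dataframe loaded above (df3) is referred to as titanic.
titanic = df3

# The columns attribute returns an Index object containing the dataframe's column labels.
titanic.columns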

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')

# The shape attribute returns the shape of the dataframe (rows, columns).
titanic.shape

(891, 12)

# The info() method returns summary information about the dataframe.
titanic.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
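The Non-Null Count column in the info() output is a quick way to spot missing values: Age has 714 non-null entries out of 891, so 177 values are missing. The same counts can be obtained directly, as in this sketch on a small made-up frame (demo is a hypothetical name):

```python
import numpy as np
import pandas as pd

# Hypothetical frame with two missing ages.
demo = pd.DataFrame({
    'age': [22.0, np.nan, 26.0, np.nan],
    'sex': ['male', 'female', 'female', 'male'],
})

print(demo.shape)          # (rows, columns)
print(demo.count())        # non-null values per column
print(demo.isna().sum())   # missing values per column
```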

Selecting and Indexing

# We can select a column by name using brackets.
titanic['Age']

# Or we can select a column using dot notation, but only when its name
# is a valid Python identifier (no spaces or special characters) and
# does not clash with an existing DataFrame attribute or method.
titanic.Age

0      22.0
1      38.0
2      26.0
3      35.0
4      35.0
       ... 
886    27.0
887    19.0
888     NaN
889    26.0
890    32.0
Name: Age, Length: 891, dtype: float64
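The limits of dot notation are easiest to see on a frame whose column names break the rules. The following sketch uses made-up data; df_names is a hypothetical name.

```python
import pandas as pd

# Hypothetical frame: one column name has a space, one shares its name with a method.
df_names = pd.DataFrame({'first name': ['Ana', 'Ben'], 'count': [3, 5]})

# Brackets work for any column name.
print(df_names['first name'])
print(df_names['count'])

# Dot notation resolves to the DataFrame.count() method, not the column.
print(callable(df_names.count))
```

For this reason, bracket notation is the safer default when column names come from an external file.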

# We can create a DataFrame object of specific columns using 
# a list of column names inside brackets.
titanic[['Name', 'Age']]

	Name	Age
0	Braund, Mr. Owen Harris	22.0
1	Cumings, Mrs. John Bradley (Florence Briggs Th…	38.0
2	Heikkinen, Miss. Laina	26.0
3	Futrelle, Mrs. Jacques Heath (Lily May Peel)	35.0
4	Allen, Mr. William Henry	35.0
…	…	…
886	Montvila, Rev. Juozas	27.0
887	Graham, Miss. Margaret Edith	19.0
888	Johnston, Miss. Catherine Helen “Carrie”	NaN
889	Behr, Mr. Karl Howell	26.0
890	Dooley, Mr. Patrick	32.0

891 rows × 2 columns

Selection Statements with loc and iloc

Selecting Rows

loc[]

loc[] lets us select rows by name. Here’s an example:

df = pd.DataFrame({
   'A': ['alpha', 'apple', 'arsenic', 'angel', 'android'],
   'B': [1, 2, 3, 4, 5],
   'C': ['coconut', 'curse', 'cassava', 'cuckoo', 'clarinet'],
   'D': [6, 7, 8, 9, 10]
   },
   index=['row_0', 'row_1', 'row_2', 'row_3', 'row_4'])
df

             A  B         C   D
row_0    alpha  1   coconut   6
row_1    apple  2     curse   7
row_2  arsenic  3   cassava   8
row_3    angel  4    cuckoo   9
row_4  android  5  clarinet  10

The row index of the dataframe contains the names of the rows. Use loc[] to select rows by name:

print(df.loc['row_1'])

A    apple
B        2
C    curse
D        7
Name: row_1, dtype: object

Inserting just the row index name in selector brackets returns a Series object. Inserting the row index name as a list returns a DataFrame object:

print(df.loc[['row_1']])

           A  B      C  D
row_1  apple  2  curse  7

To select multiple rows by name, we use a list within selector brackets:

print(df.loc[['row_2', 'row_4']])

             A  B         C   D
row_2  arsenic  3   cassava   8
row_4  android  5  clarinet  10

We can even specify a range of rows by named index:

print(df.loc['row_0':'row_3'])

             A  B        C  D
row_0    alpha  1  coconut  6
row_1    apple  2    curse  7
row_2  arsenic  3  cassava  8
row_3    angel  4   cuckoo  9

Because loc[] slices by label rather than by position, the returned range includes the specified end label.

iloc[]

iloc[] lets us select rows by numeric position, similar to how we would access elements of a list or an array. Here’s an example:

print(df)
print()
print(df.iloc[1])

             A  B         C   D
row_0    alpha  1   coconut   6
row_1    apple  2     curse   7
row_2  arsenic  3   cassava   8
row_3    angel  4    cuckoo   9
row_4  android  5  clarinet  10

A    apple
B        2
C    curse
D        7
Name: row_1, dtype: object

Inserting just the row index number in selector brackets returns a Series object. Inserting the row index number as a list returns a DataFrame object:

print(df.iloc[[1]])

           A  B      C  D
row_1  apple  2  curse  7

To select multiple rows by index number, we use a list within selector brackets:

print(df.iloc[[0, 2, 4]])

             A  B         C   D
row_0    alpha  1   coconut   6
row_2  arsenic  3   cassava   8
row_4  android  5  clarinet  10

Specify a range of rows by index number:

print(df.iloc[0:3])

             A  B        C  D
row_0    alpha  1  coconut  6
row_1    apple  2    curse  7
row_2  arsenic  3  cassava  8

Note that, as with Python list slicing, this does not include the row at position 3.
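The difference between the two slicing rules can be checked side by side. This is a minimal sketch on a one-column frame; df_demo, by_label, and by_position are names introduced here for illustration.

```python
import pandas as pd

df_demo = pd.DataFrame(
    {'B': [1, 2, 3, 4, 5]},
    index=['row_0', 'row_1', 'row_2', 'row_3', 'row_4'])

by_label = df_demo.loc['row_0':'row_3']   # end label is included
by_position = df_demo.iloc[0:3]           # end position is excluded

print(len(by_label))
print(len(by_position))
```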

Selecting Columns

To select a column, we can simply put the column’s name in brackets as we did above with the Titanic dataframe. To do the same with loc[] or iloc[], we must specify rows as well. A bare colon selects all rows.

print(df.loc[:, ['B', 'D']])
print(df.iloc[:, [1,3]])

       B   D
row_0  1   6
row_1  2   7
row_2  3   8
row_3  4   9
row_4  5  10

Both statements give the same result, so the output is shown only once.
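The equivalence can be verified with DataFrame.equals(). The sketch below also shows that, as with rows, a single column name returns a Series while a list of names returns a DataFrame. The shortened frame df_cols is a hypothetical stand-in for the example data.

```python
import pandas as pd

df_cols = pd.DataFrame(
    {'A': ['alpha', 'apple'], 'B': [1, 2], 'C': ['coconut', 'curse'], 'D': [6, 7]},
    index=['row_0', 'row_1'])

by_name = df_cols.loc[:, ['B', 'D']]     # columns chosen by label
by_position = df_cols.iloc[:, [1, 3]]    # same columns chosen by position
print(by_name.equals(by_position))

single = df_cols.loc[:, 'B']             # a bare label instead of a list
print(type(single).__name__)
print(type(by_name).__name__)
```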

Selecting Rows and Columns

Both loc[] and iloc[] can be used to select specific rows and columns together.

print(df.loc['row_0':'row_2', ['A','C']])

             A        C
row_0    alpha  coconut
row_1    apple    curse
row_2  arsenic  cassava

print(df.iloc[[2, 4], 0:3])

             A  B         C
row_2  arsenic  3   cassava
row_4  android  5  clarinet

However, when the rows have named indices, we cannot mix numeric and named notation: loc[] expects labels, and numeric positions are not labels in this index.

print(df.loc[0:3, ['D']])

Error on line 1 (pandas raises a TypeError, because the integers 0 and 3 are not row labels in this index)

To view rows [0:3] at column ‘D’ (if we don’t know the index number of column D), we’d have to use selector brackets after an iloc[] statement:

# This is most convenient for VIEWING:
print(df.iloc[0:3][['D']])
print()

# But this is best practice/more stable for assignment/manipulation:
print(df.loc[df.index[0:3], 'D'])

       D
row_0  6
row_1  7
row_2  8

row_0    6
row_1    7
row_2    8
Name: D, dtype: int64
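The reason a single loc[] or iloc[] call is preferred for assignment is that the chained form first builds an intermediate object, and a value assigned to that object is not guaranteed to reach the original dataframe. A minimal sketch of two single-call alternatives follows; df_assign is a hypothetical frame, and Index.get_loc() is used to translate the column name into its position.

```python
import pandas as pd

df_assign = pd.DataFrame(
    {'D': [6, 7, 8, 9, 10]},
    index=['row_0', 'row_1', 'row_2', 'row_3', 'row_4'])

# Single loc[] call: row positions are converted to labels via df.index.
df_assign.loc[df_assign.index[0:3], 'D'] = 0

# Single iloc[] call: the column name is converted to a position via get_loc().
df_assign.iloc[3:5, df_assign.columns.get_loc('D')] = -1

print(df_assign)
```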

However, in many (perhaps most) cases our rows will not have named indices, but rather the default numeric indices. In this case the row labels are themselves integers, so loc[] accepts numbers for the rows alongside names for the columns. For example, here’s the same dataset, but with numeric indices instead of named indices.

df = pd.DataFrame({
   'A': ['alpha', 'apple', 'arsenic', 'angel', 'android'],
   'B': [1, 2, 3, 4, 5],
   'C': ['coconut', 'curse', 'cassava', 'cuckoo', 'clarinet'],
   'D': [6, 7, 8, 9, 10]
   })
df

         A  B         C   D
0    alpha  1   coconut   6
1    apple  2     curse   7
2  arsenic  3   cassava   8
3    angel  4    cuckoo   9
4  android  5  clarinet  10

Notice that the rows are enumerated. Now this code will execute without error. Because loc[] still slices by label, it returns rows 0 through 3, including row 3:

print(df.loc[0:3, ['D']])
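Even with a numeric index, loc[] works with labels and iloc[] with positions, and the two stop matching as soon as rows are filtered or reordered. The following sketch, on a made-up one-column frame (df_int and filtered are hypothetical names), shows the difference:

```python
import pandas as pd

df_int = pd.DataFrame({'D': [6, 7, 8, 9, 10]})   # default index 0..4

print(len(df_int.loc[0:3, ['D']]))    # labels 0 through 3
print(len(df_int.iloc[0:3][['D']]))   # positions 0 through 2

# After filtering, the remaining rows keep their original labels 2, 3, 4.
filtered = df_int[df_int['D'] > 7]

print(filtered.loc[2, 'D'])    # row labeled 2
print(filtered.iloc[2, 0])     # third remaining row, labeled 4
```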