Python Libraries: Pandas

While NumPy can perform many of the same operations as pandas, it's not always as easy to work with: it requires us to reason about the data more abstractly and to keep track of what's being done to it, even when we can't see it.

Pandas, on the other hand, provides a simple interface that allows us to display our data as rows and columns. This means that we can always follow exactly what’s happening to our data as we manipulate it.

(Typically, when using pandas, we import both NumPy and pandas together. This is just for convenience, given that NumPy is often used in conjunction with pandas.)

The data frame is a core data structure in pandas. A data frame is made up of rows and columns, and it can contain data of many different data types including integers, floats, strings, booleans, and more. Pandas’ key functionality is the manipulation and analysis of tabular data.
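As a minimal sketch of a data frame holding several data types at once, the example below builds a small frame from made-up values (the column names and numbers are illustrative assumptions, not part of any dataset used here) and asks pandas which type it inferred for each column.

```python
import pandas as pd

# Hypothetical example data: one column per data type.
mixed = pd.DataFrame({
    'count': [1, 2, 3],               # integers
    'price': [9.99, 14.5, 3.25],      # floats
    'label': ['a', 'b', 'c'],         # strings
    'in_stock': [True, False, True],  # booleans
})

# dtypes reports the data type pandas inferred for each column.
print(mixed.dtypes)
```

Each column keeps a single type of its own, while the frame as a whole can mix them.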

Dataframes and Series

The two core pandas object classes are DataFrame and Series:

DataFrame:  A two-dimensional, labeled data structure with rows and columns. We can think of a dataframe like a spreadsheet or a SQL table, where each column and row is represented by a Series.

Creating a DataFrame

import numpy as np
import pandas as pd
data = {'col1': [1, 2], 'col2': [3, 4]}
df = pd.DataFrame(data=data)
df
   col1  col2
0     1     3
1     2     4
# Use pd.DataFrame() function to create a dataframe from a NumPy array.
df2 = pd.DataFrame(np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]]),
                   columns=['a', 'b', 'c'], index=['x', 'y', 'z'])
df2
   a  b  c
x  1  2  3
y  4  5  6
z  7  8  9

In the code above, we passed the column names and the row index explicitly, since a NumPy array carries no labels of its own.
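For comparison, here is a short sketch of what happens when the columns and index arguments are left out. The variable names are only illustrative; the behavior shown is the pandas default of numbering both axes from zero.

```python
import numpy as np
import pandas as pd

arr = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

# No columns= or index= given: pandas labels both axes 0, 1, 2.
default_df = pd.DataFrame(arr)
print(default_df)

print(list(default_df.columns))  # [0, 1, 2]
print(list(default_df.index))    # [0, 1, 2]
```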

Often, data professionals need to be able to create a dataframe from existing data that’s not written in Python syntax.

# Use pd.read_csv() function to create a dataframe from a .csv file
# from a URL or filepath.
df3 = pd.read_csv('train.csv')
df3.head()
   PassengerId  Survived  Pclass  Name                                             Sex     Age   SibSp  Parch  Ticket            Fare     Cabin  Embarked
0  1            0         3       Braund, Mr. Owen Harris                          male    22.0  1      0      A/5 21171         7.2500   NaN    S
1  2            1         1       Cumings, Mrs. John Bradley (Florence Briggs Th…  female  38.0  1      0      PC 17599          71.2833  C85    C
2  3            1         3       Heikkinen, Miss. Laina                           female  26.0  0      0      STON/O2. 3101282  7.9250   NaN    S
3  4            1         1       Futrelle, Mrs. Jacques Heath (Lily May Peel)     female  35.0  1      0      113803            53.1000  C123   S
4  5            0         3       Allen, Mr. William Henry                         male    35.0  0      0      373450            8.0500   NaN    S
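Because train.csv may not be at hand, the sketch below feeds pd.read_csv() an in-memory string instead of a file. The CSV content and the name people are made up for illustration; the point is only that read_csv() parses the header row into column names and leaves empty fields as missing values.

```python
import io
import pandas as pd

# Hypothetical CSV content, standing in for a file on disk or at a URL.
csv_text = """name,age,city
Ana,34,Lisbon
Ben,,Oslo
Chen,29,Taipei
"""

# read_csv() accepts a path, a URL, or any file-like object.
people = pd.read_csv(io.StringIO(csv_text))
print(people.head())

# The empty age field for Ben is read as a missing value (NaN).
print(people['age'].isna().sum())
```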

Series: One-dimensional, labeled array. Series objects are most often used to represent individual columns or rows of a dataframe.

Each element in a series has an associated label called an index. The index allows for more efficient and intuitive data manipulation by making it easier to reference specific elements of our data.

# Print class of first row
print(type(df3.iloc[0]))

# Print class of "Name" column
print(type(df3['Name']))
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>
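To make the role of the index concrete, here is a small sketch of a Series built by hand. The labels and values are illustrative assumptions; it shows that the same element can be reached by its label or by its position.

```python
import pandas as pd

# Hypothetical values; the index supplies a label for each element.
ages = pd.Series([22.0, 38.0, 26.0],
                 index=['owen', 'florence', 'laina'],
                 name='Age')

print(ages['florence'])   # look up by label
print(ages.iloc[0])       # look up by position
print(list(ages.index))   # the labels themselves
```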

Pandas Attributes and Methods

# From here on, the Titanic dataframe loaded above (df3) is referred to as titanic.
titanic = df3

# The columns attribute returns an Index object containing the dataframe's columns.
titanic.columns
Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')
# The shape attribute returns the shape of the dataframe (rows, columns).
titanic.shape
(891, 12)
# The info() method prints summary information about the dataframe.
titanic.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype
---  ------       --------------  -----
 0   PassengerId  891 non-null    int64
 1   Survived     891 non-null    int64
 2   Pclass       891 non-null    int64
 3   Name         891 non-null    object
 4   Sex          891 non-null    object
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64
 7   Parch        891 non-null    int64
 8   Ticket       891 non-null    object
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object
 11  Embarked     889 non-null    object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
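The same attributes can be tried on a frame small enough to check by eye. This is a sketch with made-up data (the name small and its values are assumptions); count() is used here because it reports the per-column non-null counts that info() prints.

```python
import pandas as pd

# Hypothetical three-row frame with one missing value in column 'b'.
small = pd.DataFrame({'a': [1, 2, 3], 'b': [0.5, None, 1.5]})

print(small.shape)           # (rows, columns)
print(list(small.columns))   # column labels
print(small.count())         # non-null values per column
```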

Selecting and Indexing

# We can select a column by name using brackets.
titanic['Age']

# Or we can select a column using dot notation, but only when its name
# is a valid Python identifier (no spaces or special characters) and
# does not clash with an existing dataframe attribute or method.
titanic.Age
0      22.0
1      38.0
2      26.0
3      35.0
4      35.0
       ...
886    27.0
887    19.0
888     NaN
889    26.0
890    32.0
Name: Age, Length: 891, dtype: float64
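The limits of dot notation are easiest to see on a frame whose column names are chosen to cause trouble. The sketch below uses hypothetical columns: one ordinary name, one with a space, and one that shares its name with the DataFrame method count().

```python
import pandas as pd

# Hypothetical frame: 'Age' is a plain identifier, 'Ticket Fare' contains
# a space, and 'count' collides with the DataFrame.count() method.
df_cols = pd.DataFrame({
    'Age': [22.0, 38.0],
    'Ticket Fare': [7.25, 71.28],
    'count': [1, 2],
})

print(df_cols.Age.equals(df_cols['Age']))  # dot and bracket notation agree
print(df_cols['Ticket Fare'].tolist())     # brackets handle the space
print(callable(df_cols.count))             # dot notation finds the method
print(df_cols['count'].tolist())           # brackets find the column
```

Bracket notation works in every case, which is why it is the safer default in reusable code.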
# We can create a DataFrame object of specific columns using 
# a list of column names inside brackets.
titanic[['Name', 'Age']]
     Name                                             Age
0    Braund, Mr. Owen Harris                          22.0
1    Cumings, Mrs. John Bradley (Florence Briggs Th…  38.0
2    Heikkinen, Miss. Laina                           26.0
3    Futrelle, Mrs. Jacques Heath (Lily May Peel)     35.0
4    Allen, Mr. William Henry                         35.0
...  ...                                              ...
886  Montvila, Rev. Juozas                            27.0
887  Graham, Miss. Margaret Edith                     19.0
888  Johnston, Miss. Catherine Helen “Carrie”         NaN
889  Behr, Mr. Karl Howell                            26.0
890  Dooley, Mr. Patrick                              32.0

891 rows × 2 columns

Selection Statements with loc and iloc

Selecting Rows

loc[] 

loc[] lets us select rows by name. Here’s an example:

df = pd.DataFrame({
   'A': ['alpha', 'apple', 'arsenic', 'angel', 'android'],
   'B': [1, 2, 3, 4, 5],
   'C': ['coconut', 'curse', 'cassava', 'cuckoo', 'clarinet'],
   'D': [6, 7, 8, 9, 10]
   },
   index=['row_0', 'row_1', 'row_2', 'row_3', 'row_4'])
df
             A  B         C   D
row_0    alpha  1   coconut   6
row_1    apple  2     curse   7
row_2  arsenic  3   cassava   8
row_3    angel  4    cuckoo   9
row_4  android  5  clarinet  10

The row index of the dataframe contains the names of the rows. Use loc[] to select rows by name:

print(df.loc['row_1'])
A    apple
B        2
C    curse
D        7
Name: row_1, dtype: object

Inserting just the row index name in selector brackets returns a Series object. Inserting the row index name as a list returns a DataFrame object:

print(df.loc[['row_1']])
           A  B      C  D
row_1  apple  2  curse  7

To select multiple rows by name, we use a list within selector brackets:

print(df.loc[['row_2', 'row_4']])
             A  B         C   D
row_2  arsenic  3   cassava   8
row_4  android  5  clarinet  10

We can even specify a range of rows by named index:

print(df.loc['row_0':'row_3'])
             A  B        C  D
row_0    alpha  1  coconut  6
row_1    apple  2    curse  7
row_2  arsenic  3  cassava  8
row_3    angel  4   cuckoo  9

Because loc[] slices by label rather than by position, the returned range includes the specified end label.

iloc[] 

iloc[] lets us select rows by numeric position, similar to how we would access elements of a list or an array. Here’s an example:

print(df)
print()
print(df.iloc[1])
             A  B         C   D
row_0    alpha  1   coconut   6
row_1    apple  2     curse   7
row_2  arsenic  3   cassava   8
row_3    angel  4    cuckoo   9
row_4  android  5  clarinet  10

A    apple
B        2
C    curse
D        7
Name: row_1, dtype: object

Inserting just the row index number in selector brackets returns a Series object. Inserting the row index number as a list returns a DataFrame object:

print(df.iloc[[1]])
           A  B      C  D
row_1  apple  2  curse  7

To select multiple rows by index number, we use a list within selector brackets:

print(df.iloc[[0, 2, 4]])
             A  B         C   D
row_0    alpha  1   coconut   6
row_2  arsenic  3   cassava   8
row_4  android  5  clarinet  10

Specify a range of rows by index number:

print(df.iloc[0:3])
             A  B        C  D
row_0    alpha  1  coconut  6
row_1    apple  2    curse  7
row_2  arsenic  3  cassava  8

Note that the result does not include the row at position 3: like Python list slicing, iloc[] excludes the end of the range.
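The difference between the two slicing rules can be checked side by side. The sketch below rebuilds only the row labels and one column of the example dataframe (the name frame is an assumption) and compares how many rows each slice returns.

```python
import pandas as pd

frame = pd.DataFrame({'B': [1, 2, 3, 4, 5]},
                     index=['row_0', 'row_1', 'row_2', 'row_3', 'row_4'])

by_label = frame.loc['row_0':'row_3']   # end label is included
by_position = frame.iloc[0:3]           # end position is excluded

print(len(by_label), len(by_position))  # 4 3
print(list(by_label.index))
print(list(by_position.index))
```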

Selecting Columns

To select a column, we can simply put the column’s name in brackets as we did above with the Titanic dataframe. To do the same with loc[] or iloc[], we must specify the rows as well; a bare colon selects all of them.

print(df.loc[:, ['B', 'D']])
print(df.iloc[:, [1,3]])
       B   D
row_0  1   6
row_1  2   7
row_2  3   8
row_3  4   9
row_4  5  10

Both statements give the same result.
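One way to confirm that the two selections really match, rather than comparing printed output by eye, is DataFrame.equals(). The sketch below rebuilds the example dataframe under the assumed name demo, and also uses columns.get_loc() to look up a column's position instead of counting by hand.

```python
import pandas as pd

demo = pd.DataFrame({
    'A': ['alpha', 'apple', 'arsenic', 'angel', 'android'],
    'B': [1, 2, 3, 4, 5],
    'C': ['coconut', 'curse', 'cassava', 'cuckoo', 'clarinet'],
    'D': [6, 7, 8, 9, 10],
}, index=['row_0', 'row_1', 'row_2', 'row_3', 'row_4'])

by_name = demo.loc[:, ['B', 'D']]   # columns chosen by label
by_pos = demo.iloc[:, [1, 3]]       # the same columns chosen by position

print(by_name.equals(by_pos))       # True

# get_loc() translates a column label into its integer position.
print(demo.columns.get_loc('D'))    # 3
```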

Selecting Rows and Columns

Both loc[] and iloc[] can be used to select specific rows and columns together.

print(df.loc['row_0':'row_2', ['A','C']])
             A        C
row_0    alpha  coconut
row_1    apple    curse
row_2  arsenic  cassava
print(df.iloc[[2, 4], 0:3])
             A  B         C
row_2  arsenic  3   cassava
row_4  android  5  clarinet

However, when using rows with named indices, we cannot mix numeric and named notation. 

print(df.loc[0:3, ['D']])
Error on line 1
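To see the failure without stopping a script, the call can be wrapped in try/except. This is a sketch under assumptions: the frame named here is a cut-down copy of the example, and the except clause catches both TypeError and KeyError because the exact exception type and message may vary between pandas versions.

```python
import pandas as pd

named = pd.DataFrame({'D': [6, 7, 8, 9, 10]},
                     index=['row_0', 'row_1', 'row_2', 'row_3', 'row_4'])

try:
    # 0 and 3 are positions, not labels, so loc[] cannot resolve them.
    named.loc[0:3, ['D']]
    failed = False
except (TypeError, KeyError) as err:
    failed = True
    print(type(err).__name__, err)

print(failed)
```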

To view rows [0:3] at column ‘D’ (if we don’t know the index number of column D), we’d have to use selector brackets after an iloc[] statement:

# This is most convenient for VIEWING:
print(df.iloc[0:3][['D']])
print()

# But this is best practice/more stable for assignment/manipulation:
print(df.loc[df.index[0:3], 'D']) 
       D
row_0  6
row_1  7
row_2  8

row_0    6
row_1    7
row_2    8
Name: D, dtype: int64
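The comment about assignment can be illustrated with a short sketch. It assumes a reduced copy of the example dataframe named work, and shows two single-call forms: loc[] with labels taken from df.index, and iloc[] with the column position looked up through columns.get_loc(). Both select rows and column in one step, so the assignment lands on the original dataframe.

```python
import pandas as pd

work = pd.DataFrame({'B': [1, 2, 3, 4, 5], 'D': [6, 7, 8, 9, 10]},
                    index=['row_0', 'row_1', 'row_2', 'row_3', 'row_4'])

# Label-based: convert positions 0..2 to labels, then select column 'D'.
work.loc[work.index[0:3], 'D'] = 0
print(work['D'].tolist())   # [0, 0, 0, 9, 10]

# Position-based: convert the label 'B' to a position, then use iloc[].
work.iloc[0:3, work.columns.get_loc('B')] = -1
print(work['B'].tolist())   # [-1, -1, -1, 4, 5]
```

By contrast, an assignment through two chained selections, such as iloc[0:3] followed by ['D'], may operate on a temporary copy and is best avoided.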

However, in many (perhaps most) cases our rows will not have named indices, but rather the default numeric index. In this case the row numbers are themselves the row labels, so loc[] accepts them together with column names. For example, here’s the same dataset, but with numeric indices instead of named indices.

df = pd.DataFrame({
   'A': ['alpha', 'apple', 'arsenic', 'angel', 'android'],
   'B': [1, 2, 3, 4, 5],
   'C': ['coconut', 'curse', 'cassava', 'cuckoo', 'clarinet'],
   'D': [6, 7, 8, 9, 10]
   },
   )
df
         A  B         C   D
0    alpha  1   coconut   6
1    apple  2     curse   7
2  arsenic  3   cassava   8
3    angel  4    cuckoo   9
4  android  5  clarinet  10

Notice that the rows are enumerated. Now, this code will execute without error:

print(df.loc[0:3, ['D']])
   D
0  6
1  7
2  8
3  9
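Even with a numeric index, loc[] still treats the numbers as labels and iloc[] treats them as positions. The sketch below (the names nums and big are assumptions) shows the two agreeing on a fresh dataframe except for the slice end, and then diverging once filtering leaves labels that no longer match positions.

```python
import pandas as pd

nums = pd.DataFrame({'D': [6, 7, 8, 9, 10]})   # default labels 0..4

# Same slice, different rules: loc includes label 3, iloc stops before position 3.
print(len(nums.loc[0:3]), len(nums.iloc[0:3]))  # 4 3

# Filtering keeps the original labels, so labels and positions drift apart.
big = nums[nums['D'] > 7]
print(list(big.index))      # [2, 3, 4]

print(big.loc[2, 'D'])      # label 2    -> 8
print(big.iloc[2, 0])       # position 2 -> 10
```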

In