Python Libraries: NumPy

NumPy is a Python library for numerical computing. Its power comes from vectorization, which lets an operation be applied to every element of a data object at once, without writing an explicit loop.

Imagine that we want to create a new list that’s the element-wise product of two lists with the same length.

# Lists cannot be multiplied together.
list_a = [1, 2, 3]
list_b = [2, 4, 6]

list_a * list_b

TypeError: can't multiply sequence by non-int of type 'list'

This gives us an error, but we can solve the problem with a for loop.

# To perform element-wise multiplication between two lists, 
# we could use a for loop.
list_c = []
for i in range(len(list_a)):
    list_c.append(list_a[i] * list_b[i])

list_c

[2, 8, 18]

However, NumPy offers a simpler way: performing this operation as a vectorized computation. We simply convert each list to a NumPy array and multiply the two arrays together using the multiplication operator (*).

# NumPy arrays let us perform array operations.
import numpy as np

# Convert lists to arrays.
array_a = np.array(list_a)
array_b = np.array(list_b)

# Perform element-wise multiplication between the arrays.
array_a * array_b

array([ 2,  8, 18])

The results are the same, but the vectorized approach is simpler, easier to read, and faster to execute. A for loop processes one element at a time in Python, whereas a vectorized operation processes the whole array in a single statement that runs in optimized, compiled code.
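To get a feel for the speed difference, here is a rough timing sketch using the standard library's timeit module. The array size, the number of repetitions, and the helper function names are arbitrary choices made for illustration, and the actual timings will vary from machine to machine.

```python
import timeit

import numpy as np

# Illustrative size only; larger inputs make the gap more visible.
n = 100_000
list_a = list(range(n))
list_b = list(range(n))
# int64 is requested explicitly so the products cannot overflow.
array_a = np.array(list_a, dtype=np.int64)
array_b = np.array(list_b, dtype=np.int64)


def multiply_with_loop():
    # Element-wise product computed one pair at a time in Python.
    return [a * b for a, b in zip(list_a, list_b)]


def multiply_vectorized():
    # Element-wise product computed by NumPy in a single expression.
    return array_a * array_b


# Run each version 10 times and report the total elapsed time.
loop_seconds = timeit.timeit(multiply_with_loop, number=10)
vector_seconds = timeit.timeit(multiply_vectorized, number=10)

print(f"loop:       {loop_seconds:.4f} s")
print(f"vectorized: {vector_seconds:.4f} s")
```

On most machines the vectorized version should come out well ahead, but it is worth measuring on your own data rather than relying on a fixed ratio.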

Arrays also take up less memory than lists, which becomes important when working with a lot of data.
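One way to see the memory difference is to compare the approximate size of a list of integers with the size of the equivalent array's data buffer. This is only a sketch: the list estimate adds up the list object and each integer object it points to, and the exact numbers depend on the Python version and platform.

```python
import sys

import numpy as np

# Illustrative data: 1,000 integers stored both ways.
values = list(range(1_000))
arr = np.array(values, dtype=np.int64)

# A list holds references to separate Python int objects,
# so we count the list itself plus every element.
list_bytes = sys.getsizeof(values) + sum(sys.getsizeof(v) for v in values)

# nbytes counts the array's data buffer: 8 bytes per int64 element.
array_bytes = arr.nbytes

print(f"list:  about {list_bytes} bytes")
print(f"array: {array_bytes} bytes of data")
```

The array also carries a small fixed overhead for its metadata, which nbytes does not include, but that overhead does not grow with the number of elements.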

Arrays and Vectors with NumPy

The array is the core data structure of NumPy. The data object itself is known as an “n-dimensional array,” or ndarray for short. A one-dimensional ndarray can be thought of as a vector. An operation applied to an array executes much faster than the same operation applied to a list.

Arrays can be indexed.

import numpy as np
# The np.array() function converts an object to an ndarray
x = np.array([1, 2, 3, 4])
x

array([1, 2, 3, 4])

# Arrays can be indexed, and elements can be reassigned by index.
x[-1] = 5
x

array([1, 2, 3, 5])

However, the size of an array is fixed. To change it, we have to create a new array and reassign the variable. (This has to do with how NumPy stores data, which I’ll talk about in the next post.)

# Trying to access an index that doesn't exist will throw an error.
x[4] = 10

IndexError: index 4 is out of bounds for axis 0 with size 4
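If we do need a longer array, one option is np.append(), which returns a new array rather than growing the original in place. A minimal sketch, with x_longer as an illustrative variable name:

```python
import numpy as np

x = np.array([1, 2, 3, 5])

# np.append() builds and returns a new array; x itself is not modified.
x_longer = np.append(x, 10)

print(x.size)         # 4
print(x_longer.size)  # 5

# To "grow" x, reassign the variable to the new array.
x = x_longer
print(x)
```

Because every call copies the data into a new array, appending inside a loop can get slow. It is usually better to collect values in a list first and convert once.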

All of the elements of an array should be the same type. If they are not, NumPy will convert every element to a single common data type, if possible.

# Arrays cast every element they contain as the same data type.
arr = np.array([1, 2, 'coconut'])
arr

array(['1', '2', 'coconut'], dtype='<U21')
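The same casting happens with purely numeric data. As a small sketch, mixing integers and floats produces a float array, and the astype() method lets us request a conversion explicitly. The variable names below are only for illustration.

```python
import numpy as np

# Integers mixed with a float are all stored as floats.
mixed_numbers = np.array([1, 2, 3.5])
print(mixed_numbers.dtype)  # float64

# astype() returns a converted copy when every element can be converted.
as_floats = np.array(['1', '2', '3']).astype(float)
print(as_floats.dtype)  # float64

# The conversion fails if any element cannot be interpreted as a number.
try:
    np.array(['1', '2', 'coconut']).astype(int)
    conversion_failed = False
except ValueError:
    conversion_failed = True

print(conversion_failed)  # True
```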

The ndarray class and the data types of its contents

# NumPy arrays are a class called `ndarray`.
arr = np.array([1, 2, 3])
print(type(arr))

<class 'numpy.ndarray'>

# The dtype attribute returns the data type of an array's contents.
arr.dtype

dtype('int64')

Array attributes

NumPy arrays have several attributes that enable us to access information about the array. Some of the most commonly used attributes are:

ndarray.shape : returns a tuple of the array’s dimensions.
ndarray.dtype : returns the data type of the array’s contents.
ndarray.size : returns the total number of elements in the array.
ndarray.T : returns the array transposed (rows become columns, columns become rows).

array_2d = np.array([(1, 2, 3), (4, 5, 6)])
print(array_2d)
print()

print(array_2d.shape)
print(array_2d.dtype)
print(array_2d.size)
print(array_2d.T)

[[1 2 3]
 [4 5 6]]

(2, 3)
int64
6
[[1 4]
 [2 5]
 [3 6]]

The dimensions of an array

As the name implies, ndarrays can be multidimensional. For a one-dimensional array, NumPy takes an array-like object of length X, like a list, and creates an ndarray with shape (X,).

One-dimensional array

A one-dimensional array is neither a row nor a column. It is similar to a list.

As seen above, we can use the shape attribute to confirm the shape of an array.

# The shape attribute returns the number of elements in each dimension of an array.
arr = np.array([1, 2, 3])
arr.shape

(3,)
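To make the "neither a row nor a column" point concrete, the sketch below compares the one-dimensional array with an explicit row and an explicit column, each built from nested lists. The names row and column are just for illustration.

```python
import numpy as np

arr = np.array([1, 2, 3])

# A 1D array has a single dimension, so transposing it changes nothing.
print(arr.shape)    # (3,)
print(arr.T.shape)  # (3,)

# An explicit row and an explicit column are both two-dimensional.
row = np.array([[1, 2, 3]])
column = np.array([[1], [2], [3]])

print(row.shape)     # (1, 3)
print(column.shape)  # (3, 1)
```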

We can also use the ndim attribute to confirm the number of dimensions the array has.

# The ndim attribute returns the number of dimensions in an array.
arr.ndim

1

We will often need to confirm the shape and number of dimensions of our array, for example when we’re trying to attach it to another existing array. These attributes are also commonly used to help understand what’s going wrong when our code throws an error.
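Here is a sketch of that kind of debugging, using np.concatenate() as one way to attach arrays. The arrays and variable names are made up for the example, and np.newaxis is just one of several ways to add the missing dimension.

```python
import numpy as np

existing = np.array([[1, 2, 3], [4, 5, 6]])
new_row = np.array([7, 8, 9])

# Attaching a 1D array to a 2D array fails.
try:
    np.concatenate([existing, new_row])
except ValueError as error:
    print("concatenate failed:", error)

# Checking shape and ndim shows the mismatch.
print(existing.shape, existing.ndim)  # (2, 3) 2
print(new_row.shape, new_row.ndim)    # (3,) 1

# Giving new_row a second dimension makes the shapes compatible.
combined = np.concatenate([existing, new_row[np.newaxis, :]])
print(combined.shape)  # (3, 3)
```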

Two-dimensional array

A two-dimensional array can be created from a list of lists, where each internal list is the same length. We can think of these internal lists as individual rows, so the final array is like a table.

# Create a 2D array by passing a list of lists to the np.array() function.
arr_2d = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])
print(arr_2d.shape)
print(arr_2d.ndim)
arr_2d

(4, 2)
2
array([[1, 2],
       [3, 4],
       [5, 6],
       [7, 8]])

Three-dimensional array

If a two-dimensional array is a list of lists, then a three-dimensional array is a list of those: a list of lists of lists. The example below contains two of them.

# Create a 3D array by passing a list of two lists of lists to the np.array() function.
arr_3d = np.array([[[1, 2, 3],
                   [3, 4, 5]],

                  [[5, 6, 7],
                   [7, 8, 9]]])

print(arr_3d.shape)
print(arr_3d.ndim)
arr_3d

(2, 2, 3)
3
array([[[1, 2, 3],
        [3, 4, 5]],

       [[5, 6, 7],
        [7, 8, 9]]])

This array can be thought of as two tables, each with two rows and three columns. Thus, it has three dimensions. The same pattern extends to higher dimensions.
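The "two tables" picture also explains how indexing works in three dimensions: the first index picks a table, the second a row, and the third a column. A short sketch, with illustrative variable names:

```python
import numpy as np

arr_3d = np.array([[[1, 2, 3],
                    [3, 4, 5]],

                   [[5, 6, 7],
                    [7, 8, 9]]])

# First index: select the second table (two rows, three columns).
second_table = arr_3d[1]
print(second_table.shape)  # (2, 3)

# First and second index: select the first row of that table.
first_row = arr_3d[1, 0]
print(first_row)  # [5 6 7]

# All three indices: select a single value.
value = arr_3d[1, 0, 2]
print(value)  # 7
```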

Reshaping arrays

Reshaping data is a common task in data analysis.

Our two-dimensional array had four rows and two columns. Imagine that we want this data to be two rows by four columns. We just pass these values to the reshape() method and reassign the result back to the arr_2d variable.

# The reshape() method changes the shape of an array.
arr_2d = arr_2d.reshape(2, 4)
arr_2d

array([[1, 2, 3, 4],
       [5, 6, 7, 8]])
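Two details of reshape() are worth knowing. The new shape must hold exactly the same number of elements, and we can pass -1 for one dimension to let NumPy work it out from the others. A minimal sketch, with a fresh array and illustrative names:

```python
import numpy as np

arr = np.array([1, 2, 3, 4, 5, 6, 7, 8])

# 8 elements fit exactly into 2 rows of 4.
two_by_four = arr.reshape(2, 4)
print(two_by_four.shape)  # (2, 4)

# -1 asks NumPy to infer that dimension: 8 elements / 4 rows = 2 columns.
inferred = arr.reshape(4, -1)
print(inferred.shape)  # (4, 2)

# A shape that needs a different number of elements raises an error.
try:
    arr.reshape(3, 3)
    reshape_failed = False
except ValueError as error:
    reshape_failed = True
    print("reshape failed:", error)
```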