NumPy arrays

While Python contains a few, general built-in tools, specialized tools are provided by libraries that have to be imported.

One of the most commonly used libraries is called NumPy, which stands for Numerical Python. In general, we should be using this library when we are working with numbers.

We can access the functions provided by NumPy using the import statement and module name (e.g. import numpy). However, it is standard procedure to shorten the imported name to np for better code readability. This is a widely adopted convention that we should follow so that anyone working with our code can easily understand it.

import numpy as np

Special types of lists called ndarrays. We can define an array using the function np.array with some parentheses and square brackets. Let’s define an array containing the populations of the twenty-two largest countries (to the nearest million).

population = np.array([70000000,   67000000,  213000000,   90000000,  274000000, 
                       102000000,  221000000, 1380000000,   84000000,  329000000,
                       84000000,  110000000,  206000000,  165000000,  144000000, 
                       97000000,   67000000,  126000000,  115000000,   83000000, 
                       1411000000,  129000000])

Shape and size

Once our data is in ndarray format, we can use NumPy functions to investigate our data. We can find the total number of elements in the array by running:

np.size(population)
22

And we can find the shape (number of rows, columns) of the array by running:

np.shape(population)
(22,)

Statistics

We can also use NumPy functions to find statistics of our data. For example:

# Mean
np.mean(population)
253045454.54545453
# Median
np.median(population)
120500000.0
# Max
np.max(population)
1411000000
# Min
np.min(population)
67000000
# Standard deviation
np.std(population)
367743423.0637056
# Sum
np.sum(population)
5567000000

Mathematical operations

NumPy has functions for performing mathematical operations. First let’s define some arrays containing the populations of the largest twenty-two countries in 1960 and 2020.

# Populations of 22 largest countries in 1960
pop_1960 = np.array([667000000, 451000000, 181000000,  88000000,  45000000,  72000000,
                     45000000,  48000000, 120000000,  38000000,  93000000,  22000000, 
                     26000000,  27000000,  33000000,  15000000,  27000000,  22000000, 
                     73000000,  27000000,  47000000,  52000000])

# Populations of 22 largest countries in 2020
pop_2020 = np.array([1411000000, 1380000000, 329000000, 274000000, 221000000, 
                     213000000,  206000000,  165000000,  144000000, 129000000, 
                     126000000,  115000000,  110000000,  102000000, 97000000, 
                     90000000,   84000000,   84000000,   83000000,   70000000, 
                     67000000,   67000000])

# Population of world in 1960
world_1960 = 3032000000

# Population of world in 2020
world_2020 = 7762000000

Let’s say we wanted to find the population of each country as a percentage of the total population in 1960.

np.divide(pop_1960, world_1960) * 100
array([21.99868074, 14.87467018,  5.96965699,  2.90237467,  1.48416887,
        2.37467018,  1.48416887,  1.58311346,  3.95778364,  1.25329815,
        3.06728232,  0.72559367,  0.85751979,  0.89050132,  1.0883905 ,
        0.49472296,  0.89050132,  0.72559367,  2.40765172,  0.89050132,
        1.55013193,  1.71503958])

Note

An operation between an array and a single number or between arrays of two different sizes is termed broadcasting.

Now let’s calculate the population change between 1960 and 2020.

np.subtract(pop_2020, pop_1960)
array([744000000, 929000000, 148000000, 186000000, 176000000, 141000000,
       161000000, 117000000,  24000000,  91000000,  33000000,  93000000,
        84000000,  75000000,  64000000,  75000000,  57000000,  62000000,
        10000000,  43000000,  20000000,  15000000])

Note

Since the arrays are the same size, NumPy performs an element-by-element operation (i.e. the first number of array1 is subtracted from the first number of array1, the second number of array1 is subtracted from the second number of array2… etc.).

What about the percentage change in population for each country?

np.divide(np.subtract(pop_2020, pop_1960), pop_1960) * 100
array([111.54422789, 205.98669623,  81.7679558 , 211.36363636,
       391.11111111, 195.83333333, 357.77777778, 243.75      ,
        20.        , 239.47368421,  35.48387097, 422.72727273,
       323.07692308, 277.77777778, 193.93939394, 500.        ,
       211.11111111, 281.81818182,  13.69863014, 159.25925926,
        42.55319149,  28.84615385])

NumPy has many other matematical operations. See here for a full list.

Searching

We can use NumPy to find specific elements of our array. We can find the index of the smallest value in our array using the np.argmin function.

# Define a new variable that contains population growth as a percentage of population
pop_growth = np.divide(np.subtract(pop_2020, pop_1960), pop_1960) * 100

# Find index for smallest growth
np.argmin(pop_growth)
18
# Find index for largest growth
np.argmax(pop_growth)
15

This is useful since the populations in our array correspond to countries in the list below:

country = ['China', 'India', 'United States', 'Indonesia', 'Pakistan', 'Brazil', 'Nigeria', 'Bangladesh',
          'Russian Federation', 'Mexico', 'Japan', 'Ethopia', 'Philippines', 'Egypt', 'Vietnam', 'DR Congo', 
           'Turkey', 'Iran','Germany', 'Thailand', 'France', 'United Kingdom']

So we can use list indexing to find the country with the lowest growth rate between 1960 and 2020:

country[np.argmin(pop_growth)]
'Germany'

Or highest growth rate:

country[np.argmax(pop_growth)]
'DR Congo'

Multi-dimensional arrays

We will often use NumPy in Spatial Data Science to perform operations on multi-dimensional arrays or matrices (i.e. arrays with 2, 3 or more dimensions). Let’s say we have a small grid of points that represent elevation values (i.e. a digital elevation model or DEM). We could represent it as a 2D ndarray like so:

dem = np.array([[10, 9, 4, 4], [11, 8, 5, 4], [12, 7, 6, 4]])
print(dem)
[[10  9  4  4]
 [11  8  5  4]
 [12  7  6  4]]

This time our array has three rows and four columns (i.e. 3 x 4) and 12 elements.

np.shape(dem)
(3, 4)
np.size(dem)
12

So what happens when we perform mathematical operations (e.g. np.mean) on the array? The default option is to compute the mean of all elements in the array.

np.mean(dem)
7.0

But we can also specify which axis we would like to perform the operation on. If wanted to calculate the mean along each column, we would add axis=0 as an argument. See the docs for np.mean.

np.mean(dem, axis=0)
array([11.,  8.,  5.,  4.])

Alternatively, we could calculate the mean along each row by adding axis=1 as an argument.

np.mean(dem, axis=1)
array([6.75, 7.  , 7.25])

Or perform a different operation…

np.sum(dem, axis=1)
array([27, 28, 29])

The figure below shows which axis is which.

cells

Note

In a 3D array, the third (or depth) axis would be axis=2.

Matrix operations

We can perform element-by-element operations or broadcasting on the array.

np.add(dem, 1)
array([[11, 10,  5,  5],
       [12,  9,  6,  5],
       [13,  8,  7,  5]])
np.multiply(dem, 2)
array([[20, 18,  8,  8],
       [22, 16, 10,  8],
       [24, 14, 12,  8]])

Matrix indexing and slicing

We can index and slice NumPy arrays in the same way we slice Python lists - using square brackets. Remember that indexing in Python starts at 0.

dem
array([[10,  9,  4,  4],
       [11,  8,  5,  4],
       [12,  7,  6,  4]])

Since the array is 2D, we need to use two numbers to access elements in the array. The order is array[row, col] So if we wanted the value in the second row and first column we would type:

dem[1,0]
11

We may also want to access the entire row or column. We can do this by using a colon sign which means “all”. To return the second column, we would type:

dem[:,1]
array([9, 8, 7])

Or the third row:

dem[2,:]
array([12,  7,  6])

Recall that leaving either side of the colon blank means start from (or go to) the end of the list. So if we wanted the values in the third row, but only in the second and third columns we could type:

dem[2,1:]
array([7, 6])

Comparison operators

If we wanted to return specific values, we can use comparison operators such as those in the table below:

Comparison operator

Explanation

x == y

True if x is equal to y

x != y

True if x is not equal to y

x > y

True if x is greater than y

x < y

True if x is less than y

x >= y

True if x is greater than or equal to y

x <= y

True if x is less than or equal to y

mask = (dem > 8)
dem[mask]
array([10,  9, 11, 12])

Note

Here we produced boolean mask which has the same shape as the original array but each value is either True or False (depending on whether the value was greater or less than 8). We then returned values of the original array where the mask was True.

Likewise, we can also just get values equal to 10:

mask = (dem == 10)
dem[mask]
array([10])

Or not 10:

mask = (dem != 10)
dem[mask]
array([ 9,  4, 11,  8,  5, 12,  7,  6])

Matrix searching

Sometimes we want to find the location (or index) of specific values. To do this, we can combine our comparison operators with a useful function called np.where.

# Find indices of values greater than 9
np.where(dem > 9)
(array([0, 1, 2]), array([0, 0, 0]))

This could be used find the number of values greater than 9 (similar to what we did with masks before).

dem[np.where(dem > 9)].size
3

We could also use this approach to derive values from other arrays so long as they have the same shape.

array = np.array([[[178, 189, 567], [145, 239, 445], [197, 345, 678]], [[56, 78, 190], [46, 10, 11], [6, 2, 1]], [[45, 118, 203], [72, 119, 34], [87, 9, 5]]])
np.mean(array, axis=2)
array([[311.33333333, 276.33333333, 406.66666667],
       [108.        ,  22.33333333,   3.        ],
       [122.        ,  75.        ,  33.66666667]])

Acknowledgments

This demo was inspired by NumPy docs.