NumPy arrays
Contents
NumPy arrays¶
While Python contains a few, general built-in tools, specialized tools are provided by libraries that have to be imported.
One of the most commonly used libraries is called NumPy
, which stands for Numerical Python. In general, we should be using this library when we are working with numbers.
We can access the functions provided by NumPy
using the import statement and module name (e.g. import numpy
). However, it is standard procedure to shorten the imported name to np
for better code readability. This is a widely adopted convention that we should follow so that anyone working with our code can easily understand it.
import numpy as np
Special types of lists called ndarrays
. We can define an array using the function np.array
with some parentheses and square brackets. Let’s define an array containing the populations of the twenty-two largest countries (to the nearest million).
population = np.array([70000000, 67000000, 213000000, 90000000, 274000000,
102000000, 221000000, 1380000000, 84000000, 329000000,
84000000, 110000000, 206000000, 165000000, 144000000,
97000000, 67000000, 126000000, 115000000, 83000000,
1411000000, 129000000])
Shape and size¶
Once our data is in ndarray
format, we can use NumPy
functions to investigate our data. We can find the total number of elements in the array by running:
np.size(population)
22
And we can find the shape (number of rows, columns) of the array by running:
np.shape(population)
(22,)
Statistics¶
We can also use NumPy
functions to find statistics of our data. For example:
# Mean
np.mean(population)
253045454.54545453
# Median
np.median(population)
120500000.0
# Max
np.max(population)
1411000000
# Min
np.min(population)
67000000
# Standard deviation
np.std(population)
367743423.0637056
# Sum
np.sum(population)
5567000000
Mathematical operations¶
NumPy
has functions for performing mathematical operations. First let’s define some arrays containing the populations of the largest twenty-two countries in 1960 and 2020.
# Populations of 22 largest countries in 1960
pop_1960 = np.array([667000000, 451000000, 181000000, 88000000, 45000000, 72000000,
45000000, 48000000, 120000000, 38000000, 93000000, 22000000,
26000000, 27000000, 33000000, 15000000, 27000000, 22000000,
73000000, 27000000, 47000000, 52000000])
# Populations of 22 largest countries in 2020
pop_2020 = np.array([1411000000, 1380000000, 329000000, 274000000, 221000000,
213000000, 206000000, 165000000, 144000000, 129000000,
126000000, 115000000, 110000000, 102000000, 97000000,
90000000, 84000000, 84000000, 83000000, 70000000,
67000000, 67000000])
# Population of world in 1960
world_1960 = 3032000000
# Population of world in 2020
world_2020 = 7762000000
Let’s say we wanted to find the population of each country as a percentage of the total population in 1960.
np.divide(pop_1960, world_1960) * 100
array([21.99868074, 14.87467018, 5.96965699, 2.90237467, 1.48416887,
2.37467018, 1.48416887, 1.58311346, 3.95778364, 1.25329815,
3.06728232, 0.72559367, 0.85751979, 0.89050132, 1.0883905 ,
0.49472296, 0.89050132, 0.72559367, 2.40765172, 0.89050132,
1.55013193, 1.71503958])
Note
An operation between an array and a single number or between arrays of two different sizes is termed broadcasting.
Now let’s calculate the population change between 1960 and 2020.
np.subtract(pop_2020, pop_1960)
array([744000000, 929000000, 148000000, 186000000, 176000000, 141000000,
161000000, 117000000, 24000000, 91000000, 33000000, 93000000,
84000000, 75000000, 64000000, 75000000, 57000000, 62000000,
10000000, 43000000, 20000000, 15000000])
Note
Since the arrays are the same size, NumPy performs an element-by-element operation (i.e. the first number of array1 is subtracted from the first number of array1, the second number of array1 is subtracted from the second number of array2… etc.).
What about the percentage change in population for each country?
np.divide(np.subtract(pop_2020, pop_1960), pop_1960) * 100
array([111.54422789, 205.98669623, 81.7679558 , 211.36363636,
391.11111111, 195.83333333, 357.77777778, 243.75 ,
20. , 239.47368421, 35.48387097, 422.72727273,
323.07692308, 277.77777778, 193.93939394, 500. ,
211.11111111, 281.81818182, 13.69863014, 159.25925926,
42.55319149, 28.84615385])
NumPy
has many other matematical operations. See here for a full list.
Searching¶
We can use NumPy
to find specific elements of our array. We can find the index of the smallest value in our array using the np.argmin
function.
# Define a new variable that contains population growth as a percentage of population
pop_growth = np.divide(np.subtract(pop_2020, pop_1960), pop_1960) * 100
# Find index for smallest growth
np.argmin(pop_growth)
18
# Find index for largest growth
np.argmax(pop_growth)
15
This is useful since the populations in our array correspond to countries in the list below:
country = ['China', 'India', 'United States', 'Indonesia', 'Pakistan', 'Brazil', 'Nigeria', 'Bangladesh',
'Russian Federation', 'Mexico', 'Japan', 'Ethopia', 'Philippines', 'Egypt', 'Vietnam', 'DR Congo',
'Turkey', 'Iran','Germany', 'Thailand', 'France', 'United Kingdom']
So we can use list indexing to find the country with the lowest growth rate between 1960 and 2020:
country[np.argmin(pop_growth)]
'Germany'
Or highest growth rate:
country[np.argmax(pop_growth)]
'DR Congo'
Multi-dimensional arrays¶
We will often use NumPy
in Spatial Data Science to perform operations on multi-dimensional arrays or matrices (i.e. arrays with 2, 3 or more dimensions). Let’s say we have a small grid of points that represent elevation values (i.e. a digital elevation model or DEM). We could represent it as a 2D ndarray
like so:
dem = np.array([[10, 9, 4, 4], [11, 8, 5, 4], [12, 7, 6, 4]])
print(dem)
[[10 9 4 4]
[11 8 5 4]
[12 7 6 4]]
This time our array has three rows and four columns (i.e. 3 x 4) and 12 elements.
np.shape(dem)
(3, 4)
np.size(dem)
12
So what happens when we perform mathematical operations (e.g. np.mean
) on the array? The default option is to compute the mean of all elements in the array.
np.mean(dem)
7.0
But we can also specify which axis we would like to perform the operation on. If wanted to calculate the mean along each column, we would add axis=0
as an argument. See the docs for np.mean
.
np.mean(dem, axis=0)
array([11., 8., 5., 4.])
Alternatively, we could calculate the mean along each row by adding axis=1
as an argument.
np.mean(dem, axis=1)
array([6.75, 7. , 7.25])
Or perform a different operation…
np.sum(dem, axis=1)
array([27, 28, 29])
The figure below shows which axis is which.
![cells](../_images/np_axes.png)
Note
In a 3D array, the third (or depth) axis would be axis=2
.
Matrix operations¶
We can perform element-by-element operations or broadcasting on the array.
np.add(dem, 1)
array([[11, 10, 5, 5],
[12, 9, 6, 5],
[13, 8, 7, 5]])
np.multiply(dem, 2)
array([[20, 18, 8, 8],
[22, 16, 10, 8],
[24, 14, 12, 8]])
Matrix indexing and slicing¶
We can index and slice NumPy arrays in the same way we slice Python lists - using square brackets. Remember that indexing in Python starts at 0.
dem
array([[10, 9, 4, 4],
[11, 8, 5, 4],
[12, 7, 6, 4]])
Since the array is 2D, we need to use two numbers to access elements in the array. The order is array[row, col]
So if we wanted the value in the second row and first column we would type:
dem[1,0]
11
We may also want to access the entire row or column. We can do this by using a colon sign which means “all”. To return the second column, we would type:
dem[:,1]
array([9, 8, 7])
Or the third row:
dem[2,:]
array([12, 7, 6])
Recall that leaving either side of the colon blank means start from (or go to) the end of the list. So if we wanted the values in the third row, but only in the second and third columns we could type:
dem[2,1:]
array([7, 6])
Comparison operators¶
If we wanted to return specific values, we can use comparison operators such as those in the table below:
Comparison operator |
Explanation |
---|---|
|
|
|
|
|
|
|
|
|
|
|
|
mask = (dem > 8)
dem[mask]
array([10, 9, 11, 12])
Note
Here we produced boolean mask which has the same shape as the original array but each value is either True
or False
(depending on whether the value was greater or less than 8). We then returned values of the original array where the mask was True
.
Likewise, we can also just get values equal to 10:
mask = (dem == 10)
dem[mask]
array([10])
Or not 10:
mask = (dem != 10)
dem[mask]
array([ 9, 4, 11, 8, 5, 12, 7, 6])
Matrix searching¶
Sometimes we want to find the location (or index) of specific values. To do this, we can combine our comparison operators with a useful function called np.where
.
# Find indices of values greater than 9
np.where(dem > 9)
(array([0, 1, 2]), array([0, 0, 0]))
This could be used find the number of values greater than 9 (similar to what we did with masks before).
dem[np.where(dem > 9)].size
3
We could also use this approach to derive values from other arrays so long as they have the same shape.
array = np.array([[[178, 189, 567], [145, 239, 445], [197, 345, 678]], [[56, 78, 190], [46, 10, 11], [6, 2, 1]], [[45, 118, 203], [72, 119, 34], [87, 9, 5]]])
np.mean(array, axis=2)
array([[311.33333333, 276.33333333, 406.66666667],
[108. , 22.33333333, 3. ],
[122. , 75. , 33.66666667]])
Acknowledgments¶
This demo was inspired by NumPy docs.