Pandas DataFrames I

One of the pillars of spatial data science is a Python library called Pandas, which provides functions to import tabular data from various file formats (e.g. .csv, .xlsx) and many other data manipulation operations such as merging, reshaping, selecting, cleaning, and wrangling. This week we will introduce some basic Pandas operations. In Week 7, we will progress to more advanced usage of Pandas.

Note

The name “Pandas” is derived from “panel data”, an econometrics term for data sets that include observations over multiple time periods.


Data structure

The main Pandas data structure is called a DataFrame, which looks similar to an Excel spreadsheet. A typical DataFrame is a 2-D array with rows, columns, indices, and column names. We can import Pandas and define a DataFrame like so:

import pandas as pd

df = pd.DataFrame({"city": ['Belfast', 'Cardiff', 'Edinburgh', 'London'],
                   "country": ['Northern Ireland', 'Wales', 'Scotland', 'England']})

df.head()
        city           country
0    Belfast  Northern Ireland
1    Cardiff             Wales
2  Edinburgh          Scotland
3     London           England

So what do we have here? We now have a small table containing countries in the United Kingdom and their capital cities. The table has two labeled columns, one called city and one called country. The table has four rows, with each row corresponding to an index between 0 and 3.

Reading data

In reality, we rarely make a table from scratch. Instead, we read files that already contain data and store their contents in variables. We can do this using read_csv, which is a general function for reading data files separated by commas, spaces, or other common delimiters. The function takes a path string and returns a DataFrame object.

df = pd.read_csv('data/euro_cities.csv')

# "df" stands for DataFrame
type(df)
pandas.core.frame.DataFrame

A concise way of checking what is contained in our df object is the head method, which returns the first five rows of the data.

df.head()
               city         country  population
0            Moscow          Russia    10381222
1            London  United Kingdom     7556900
2  Saint Petersburg          Russia     5028000
3            Berlin         Germany     3426354
4            Madrid           Spain     3255944

Alternatively, we could check the last five rows of the data:

df.tail()
        city   country  population
95    Tyumen    Russia      519119
96    Lisbon  Portugal      517802
97   Lipetsk    Russia      515655
98  Hannover   Germany      515140
99     Penza    Russia      512602

Or the last 10 rows:

df.tail(10)
            city    country  population
90      Orenburg     Russia      550204
91        Bremen    Germany      546501
92       Vilnius  Lithuania      542366
93  Novokuznetsk     Russia      539616
94       Ryazan'     Russia      520173
95        Tyumen     Russia      519119
96        Lisbon   Portugal      517802
97       Lipetsk     Russia      515655
98      Hannover    Germany      515140
99         Penza     Russia      512602

DataFrame properties

To check the data types of each column, we can use the Pandas attribute dtypes:

df.dtypes
city          object
country       object
population     int64
dtype: object

Here we see that population is an integer (with 64-bit precision) and the other two columns are objects (Pandas stores string data types as “objects” by default).

We can also retrieve the column names of the DataFrame using the columns attribute.

df.columns
Index(['city', 'country', 'population'], dtype='object')

And, just like NumPy arrays, we can find the shape of our DataFrame using the shape attribute.

df.shape
(100, 3)

Indexing

We can select specific columns based on the column names. The basic syntax is dataframe[value], where value can be a single column name or a list of column names.

df['city']
0               Moscow
1               London
2     Saint Petersburg
3               Berlin
4               Madrid
            ...       
95              Tyumen
96              Lisbon
97             Lipetsk
98            Hannover
99               Penza
Name: city, Length: 100, dtype: object
df[['city', 'country']]
                city         country
0             Moscow          Russia
1             London  United Kingdom
2   Saint Petersburg          Russia
3             Berlin         Germany
4             Madrid           Spain
..               ...             ...
95            Tyumen          Russia
96            Lisbon        Portugal
97           Lipetsk          Russia
98          Hannover         Germany
99             Penza          Russia

[100 rows x 2 columns]
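
Note the difference between the two: a single column name returns a 1-D Series, while a list of names (even a list of one) returns a DataFrame. A minimal sketch of this distinction, using a two-row sample of the same data:

```python
import pandas as pd

# Small two-row sample of the euro_cities data, for illustration
df = pd.DataFrame({"city": ['Moscow', 'London'],
                   "country": ['Russia', 'United Kingdom'],
                   "population": [10381222, 7556900]})

# A single column name returns a 1-D Series...
print(type(df['city']))    # <class 'pandas.core.series.Series'>

# ...while a list of column names returns a DataFrame
print(type(df[['city']]))  # <class 'pandas.core.frame.DataFrame'>
```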

We can select specific rows using iloc, which indexes rows by their integer position.

df.iloc[6]
city             Rome
country         Italy
population    2318895
Name: 6, dtype: object

The syntax for selecting the top 10 cities should be quite familiar.

df.iloc[0:10]
               city         country  population
0            Moscow          Russia    10381222
1            London  United Kingdom     7556900
2  Saint Petersburg          Russia     5028000
3            Berlin         Germany     3426354
4            Madrid           Spain     3255944
5              Kyiv         Ukraine     2797553
6              Rome           Italy     2318895
7             Paris          France     2138551
8         Bucharest         Romania     1877155
9             Minsk         Belarus     1742124

If we wanted to find the city with the highest/lowest population, we could use idxmax/idxmin, which return the index of the highest/lowest value in the column.

df['population'].idxmax()
0
index_high = df['population'].idxmax()

city_high = df.iloc[index_high]['city']

f"The city with the largest population in Europe is {city_high}"
'The city with the largest population in Europe is Moscow'
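
The same pattern works with idxmin. A minimal sketch, using a three-row sample of the same data:

```python
import pandas as pd

# Small sample of the euro_cities data, for illustration
df = pd.DataFrame({"city": ['Moscow', 'London', 'Penza'],
                   "country": ['Russia', 'United Kingdom', 'Russia'],
                   "population": [10381222, 7556900, 512602]})

index_low = df['population'].idxmin()   # index of the smallest value
city_low = df.iloc[index_low]['city']

print(f"The city with the smallest population in this sample is {city_low}")
# The city with the smallest population in this sample is Penza
```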

Descriptive statistics

Pandas provides basic functions to calculate descriptive statistics, similar to what we might usually do with Microsoft Excel.

df['population'].mean()
1174041.59
df['population'].max()
10381222
df['population'].min()
512602

We can get all descriptive statistics at once using the describe method:

df['population'].describe()
count    1.000000e+02
mean     1.174042e+06
std      1.326413e+06
min      5.126020e+05
25%      5.969205e+05
50%      7.817830e+05
75%      1.152821e+06
max      1.038122e+07
Name: population, dtype: float64

Here, the 25%, 50%, and 75% rows are the quartiles of the data (the 50% value is the median). Often the output of describe is more readable if we round these values to two decimal places.

round(df['population'].describe(), 2)
count         100.00
mean      1174041.59
std       1326413.44
min        512602.00
25%        596920.50
50%        781783.00
75%       1152820.75
max      10381222.00
Name: population, dtype: float64

Slicing

Let’s say we wanted to find how many cities in Germany made it into this top 100 list. We would first produce a Boolean mask (i.e. values equal to True or False) using the == comparison operator.

mask = df['country'] == 'Germany'
mask
0     False
1     False
2     False
3      True
4     False
      ...  
95    False
96    False
97    False
98     True
99    False
Name: country, Length: 100, dtype: bool

We could then apply this mask to our original DataFrame.

df[mask]
                 city  country  population
3              Berlin  Germany     3426354
11            Hamburg  Germany     1739117
21             Munich  Germany     1260391
40              Koeln  Germany      963395
64  Frankfurt am Main  Germany      650000
75              Essen  Germany      593085
77          Stuttgart  Germany      589793
78           Dortmund  Germany      588462
85        Duesseldorf  Germany      573057
91             Bremen  Germany      546501
98           Hannover  Germany      515140
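
Since True counts as 1 when summed, the mask itself can answer the original question of how many German cities are in the list (for the full table, the sum is 11). A minimal sketch with a three-row sample:

```python
import pandas as pd

# Small sample of the euro_cities data, for illustration
df = pd.DataFrame({"city": ['Berlin', 'Madrid', 'Hamburg'],
                   "country": ['Germany', 'Spain', 'Germany'],
                   "population": [3426354, 3255944, 1739117]})

mask = df['country'] == 'Germany'
print(mask.sum())   # 2 -> two German cities in this sample
```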

We could do something similar if we wanted to find cities with populations of over 2 million.

df[df['population'] > 2000000]
               city         country  population
0            Moscow          Russia    10381222
1            London  United Kingdom     7556900
2  Saint Petersburg          Russia     5028000
3            Berlin         Germany     3426354
4            Madrid           Spain     3255944
5              Kyiv         Ukraine     2797553
6              Rome           Italy     2318895
7             Paris          France     2138551

We can combine comparisons with logical operators to find cities with populations between 700,000 and 800,000:

df[(df['population'] < 800000) & (df['population'] > 700000)]
          city         country  population
48  Zaporizhia         Ukraine      796217
49   Marseille          France      794811
50        Lodz          Poland      768755
51      Krakow          Poland      755050
52        Riga          Latvia      742572
53   Amsterdam     Netherlands      741636
54  Nottingham  United Kingdom      729977
55        Lviv         Ukraine      717803
56     Sevilla           Spain      703206
57   Tol'yatti          Russia      702879
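
Pandas also provides a between method for this kind of range filter; note that, unlike the strict < and > comparisons above, between includes both endpoints by default. A minimal sketch with a three-row sample of the same data:

```python
import pandas as pd

# Small sample of the euro_cities data, for illustration
df = pd.DataFrame({"city": ['Riga', 'Amsterdam', 'Berlin'],
                   "country": ['Latvia', 'Netherlands', 'Germany'],
                   "population": [742572, 741636, 3426354]})

# Roughly equivalent to (df['population'] >= 700000) & (df['population'] <= 800000)
print(df[df['population'].between(700000, 800000)])
```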

Grouping

If we wanted to find the total population living in these cities for each country, we could write a for loop. To do this, we would have to produce a list of individual countries, loop over each country, compute the total, append it to a new list, and then put everything back into a new DataFrame.

This is what that would look like:

countries = pd.unique(df['country'])

total_pop = []
for country in countries:
    
    country_df = df[df['country'] == country]
    
    total_pop.append(country_df['population'].sum())
new_df = pd.DataFrame({'country': countries,
                       'population': total_pop})
new_df.head(10)
          country  population
0          Russia    39954312
1  United Kingdom    12029600
2         Germany    11445295
3           Spain     7637517
4         Ukraine     9453918
5           Italy     6614141
6          France     2933362
7         Romania     1877155
8         Belarus     1742124
9         Hungary     1741041

But a better way of doing this would be to use the groupby function.

df.groupby('country')['population'].sum()
country
Austria                    1691468
Belarus                    1742124
Belgium                    1019022
Bosnia and Herzegovina      696731
Bulgaria                   1152556
Croatia                     698966
Czech Republic             1165581
Denmark                    1153615
Finland                     558457
France                     2933362
Germany                   11445295
Greece                      664046
Hungary                    1741041
Ireland                    1024027
Italy                      6614141
Latvia                      742572
Lithuania                   542366
Moldova                     635994
Netherlands                1339835
Norway                      580000
Poland                     4431189
Portugal                    517802
Romania                    1877155
Russia                    39954312
Serbia                     1273651
Spain                      7637517
Sweden                     2087816
Ukraine                    9453918
United Kingdom            12029600
Name: population, dtype: int64

The first argument groupby accepts is the column we want to group our data by (country in our case). The next argument is the column (or list of columns) to summarize (population). Finally, the operation does nothing until we specify how we want to aggregate the grouped data (sum in our case).
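
sum is only one choice of aggregation; any standard reduction can follow the groupby. For example, count gives the number of cities per country and mean the average city population. A minimal sketch with a three-row sample of the same data:

```python
import pandas as pd

# Small sample of the euro_cities data, for illustration
df = pd.DataFrame({"city": ['Berlin', 'Hamburg', 'Madrid'],
                   "country": ['Germany', 'Germany', 'Spain'],
                   "population": [3426354, 1739117, 3255944]})

print(df.groupby('country')['population'].count())  # cities per country
print(df.groupby('country')['population'].mean())   # mean city population
```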

It’s actually nice to reset the index after using groupby so that we end up with a DataFrame (rather than a Series).

new_df = df.groupby('country')['population'].sum().reset_index()
new_df.head(10)
                  country  population
0                 Austria     1691468
1                 Belarus     1742124
2                 Belgium     1019022
3  Bosnia and Herzegovina      696731
4                Bulgaria     1152556
5                 Croatia      698966
6          Czech Republic     1165581
7                 Denmark     1153615
8                 Finland      558457
9                  France     2933362

We hope you will agree that the groupby method is much cleaner and more interpretable. It is also a lot faster, which may be important when we have very large DataFrames with millions of rows.

Note

To summarize multiple columns, we could pass a list of columns like so: df.groupby('country')[['city','population']].sum(). In our case this would not produce anything useful for city, because summing string values is not meaningful.

Sorting

Similar to lists, we can sort DataFrames using the sort_values method. It takes two key arguments, by and ascending, which determine the column to sort by and whether to sort in ascending or descending order. So to find the 10 largest cities in Europe by population, we could type:

df.sort_values(by='population', ascending=False).head(10)
               city         country  population
0            Moscow          Russia    10381222
1            London  United Kingdom     7556900
2  Saint Petersburg          Russia     5028000
3            Berlin         Germany     3426354
4            Madrid           Spain     3255944
5              Kyiv         Ukraine     2797553
6              Rome           Italy     2318895
7             Paris          France     2138551
8         Bucharest         Romania     1877155
9             Minsk         Belarus     1742124

Another way of doing this would be to use the nlargest or nsmallest methods (see the Pandas documentation for details).
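
A minimal sketch of nlargest and nsmallest, using a three-row sample of the same data (the first argument is the number of rows to return, the second the column to rank by):

```python
import pandas as pd

# Small sample of the euro_cities data, for illustration
df = pd.DataFrame({"city": ['Moscow', 'London', 'Penza'],
                   "country": ['Russia', 'United Kingdom', 'Russia'],
                   "population": [10381222, 7556900, 512602]})

# Equivalent to df.sort_values(by='population', ascending=False).head(2)
print(df.nlargest(2, 'population'))
print(df.nsmallest(1, 'population'))
```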

Saving

We can save our DataFrames just as easily as we read the original file, using the to_csv method:

new_df.to_csv('data/new.csv')
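
One thing to watch: by default, to_csv also writes the row index as an extra first column of the file. If that is not wanted, pass index=False. A minimal sketch, using a two-row example table and a hypothetical local path:

```python
import pandas as pd

# Small example table, for illustration
new_df = pd.DataFrame({'country': ['Germany', 'Spain'],
                       'population': [11445295, 7637517]})

# index=False omits the 0, 1, ... row labels from the output file
new_df.to_csv('new.csv', index=False)
```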