Pandas DataFrames I

One of the pillars of spatial data science is a Python library called Pandas, which provides functions to import tabular data from various file formats (e.g. .csv, .xlsx) and many other data manipulation operations such as merging, reshaping, selecting, cleaning, and wrangling. This week we will introduce some basic Pandas operations. In Week 7, we will progress to more advanced usage of Pandas.

Note

The name “Pandas” is derived from “panel data”, an econometrics term for data sets that include observations over multiple time periods.


Data structure

The main Pandas data structure is called a DataFrame, which looks similar to an Excel spreadsheet. A typical DataFrame is a 2-D array with rows, columns, indices, and column names. We can import Pandas and define a DataFrame like so:

import pandas as pd

df = pd.DataFrame({"city": ['Belfast', 'Cardiff', 'Edinburgh', 'London'],
                   "country": ['Northern Ireland', 'Wales', 'Scotland', 'England']})

df.head()
        city           country
0    Belfast  Northern Ireland
1    Cardiff             Wales
2  Edinburgh          Scotland
3     London           England

So what do we have here? We now have a small table containing countries in the United Kingdom and their capital cities. The table has two labeled columns, one called city and one called country. The table has four rows, with each row corresponding to an index between 0 and 3.

Reading data

In reality, we rarely make a table from scratch. Instead, we read files that already contain data and store their contents in variables. We can do this using read_csv, which is a general function for reading data files separated by commas, spaces, or other common delimiters. The function takes a path string and returns a DataFrame object.

df = pd.read_csv('data/euro_cities.csv')

# "df" stands for DataFrame
type(df)
pandas.core.frame.DataFrame

A concise way of checking what is contained in our df object is the head method, which returns the first five rows of the data.

df.head()
               city         country  population
0            Moscow          Russia    10381222
1            London  United Kingdom     7556900
2  Saint Petersburg          Russia     5028000
3            Berlin         Germany     3426354
4            Madrid           Spain     3255944

Alternatively, we could check the last five rows of the data:

df.tail()
        city   country  population
95    Tyumen    Russia      519119
96    Lisbon  Portugal      517802
97   Lipetsk    Russia      515655
98  Hannover   Germany      515140
99     Penza    Russia      512602

Or the last 10 rows:

df.tail(10)
            city    country  population
90      Orenburg     Russia      550204
91        Bremen    Germany      546501
92       Vilnius  Lithuania      542366
93  Novokuznetsk     Russia      539616
94       Ryazan'     Russia      520173
95        Tyumen     Russia      519119
96        Lisbon   Portugal      517802
97       Lipetsk     Russia      515655
98      Hannover    Germany      515140
99         Penza     Russia      512602

DataFrame properties

To check the data types of each column, we can use the Pandas attribute dtypes:

df.dtypes
city          object
country       object
population     int64
dtype: object

Here we see that population is an integer (with 64-bit precision) and the other two columns are objects (Pandas stores string data types as “objects” by default).

We can also retrieve the column names of the DataFrame using the columns attribute.

df.columns
Index(['city', 'country', 'population'], dtype='object')

And, just like NumPy arrays, we can find the shape of our DataFrame using the shape attribute.

df.shape
(100, 3)

Indexing

We can select specific columns based on the column names. The basic syntax is dataframe[value], where value can be a single column name or a list of column names.

df['city']
0               Moscow
1               London
2     Saint Petersburg
3               Berlin
4               Madrid
            ...       
95              Tyumen
96              Lisbon
97             Lipetsk
98            Hannover
99               Penza
Name: city, Length: 100, dtype: object
df[['city', 'country']]
                city         country
0             Moscow          Russia
1             London  United Kingdom
2   Saint Petersburg          Russia
3             Berlin         Germany
4             Madrid           Spain
..               ...             ...
95            Tyumen          Russia
96            Lisbon        Portugal
97           Lipetsk          Russia
98          Hannover         Germany
99             Penza          Russia

[100 rows x 2 columns]
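
Note the difference between the two: a single column name returns a 1-D Series, while a list of names (even a list of one) returns a DataFrame. A minimal sketch of this distinction, using a two-row sample of the same data:

```python
import pandas as pd

# Small two-row sample of the euro_cities data, for illustration
df = pd.DataFrame({"city": ['Moscow', 'London'],
                   "country": ['Russia', 'United Kingdom'],
                   "population": [10381222, 7556900]})

# A single column name returns a 1-D Series...
print(type(df['city']))    # <class 'pandas.core.series.Series'>

# ...while a list of column names returns a DataFrame
print(type(df[['city']]))  # <class 'pandas.core.frame.DataFrame'>
```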

We can select specific rows using iloc, which indexes rows by their integer position.

df.iloc[6]
city             Rome
country         Italy
population    2318895
Name: 6, dtype: object

The syntax for selecting the top 10 cities should be quite familiar.

df.iloc[0:10]
               city         country  population
0            Moscow          Russia    10381222
1            London  United Kingdom     7556900
2  Saint Petersburg          Russia     5028000
3            Berlin         Germany     3426354
4            Madrid           Spain     3255944
5              Kyiv         Ukraine     2797553
6              Rome           Italy     2318895
7             Paris          France     2138551
8         Bucharest         Romania     1877155
9             Minsk         Belarus     1742124

If we wanted to find the city with the highest/lowest population, we could use idxmax/idxmin, which return the index of the highest/lowest value in the column.

df['population'].idxmax()
0
index_high = df['population'].idxmax()

city_high = df.iloc[index_high]['city']

f"The city with the largest population in Europe is {city_high}"
'The city with the largest population in Europe is Moscow'
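
The same pattern works with idxmin. A minimal sketch, using a three-row sample of the same data:

```python
import pandas as pd

# Small sample of the euro_cities data, for illustration
df = pd.DataFrame({"city": ['Moscow', 'London', 'Penza'],
                   "country": ['Russia', 'United Kingdom', 'Russia'],
                   "population": [10381222, 7556900, 512602]})

index_low = df['population'].idxmin()   # index of the smallest value
city_low = df.iloc[index_low]['city']

print(f"The city with the smallest population in this sample is {city_low}")
# The city with the smallest population in this sample is Penza
```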

Descriptive statistics

Pandas provides basic functions to calculate descriptive statistics, similar to what we might usually do with Microsoft Excel.

df['population'].mean()
1174041.59
df['population'].max()
10381222
df['population'].min()
512602

We can get all descriptive statistics at once using the describe method:

df['population'].describe()
count    1.000000e+02
mean     1.174042e+06
std      1.326413e+06
min      5.126020e+05
25%      5.969205e+05
50%      7.817830e+05
75%      1.152821e+06
max      1.038122e+07
Name: population, dtype: float64

Here, the 25%, 50%, and 75% rows are the quartiles of the data (the 50% value is the median). Often the output of describe is more readable if we round these values to two decimal places.

round(df['population'].describe(), 2)
count         100.00
mean      1174041.59
std       1326413.44
min        512602.00
25%        596920.50
50%        781783.00
75%       1152820.75
max      10381222.00
Name: population, dtype: float64

Slicing

Let’s say we wanted to find how many cities in Germany made it into this top 100 list. We would first produce a Boolean mask (i.e. values equal to True or False) using the == comparison operator.

mask = df['country'] == 'Germany'
mask
0     False
1     False
2     False
3      True
4     False
      ...  
95    False
96    False
97    False
98     True
99    False
Name: country, Length: 100, dtype: bool

We could then apply this mask to our original DataFrame.

df[mask]
                 city  country  population
3              Berlin  Germany     3426354
11            Hamburg  Germany     1739117
21             Munich  Germany     1260391
40              Koeln  Germany      963395
64  Frankfurt am Main  Germany      650000
75              Essen  Germany      593085
77          Stuttgart  Germany      589793
78           Dortmund  Germany      588462
85        Duesseldorf  Germany      573057
91             Bremen  Germany      546501
98           Hannover  Germany      515140
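
Since True counts as 1 when summed, the mask itself can answer the original question of how many German cities are in the list (for the full table, the sum is 11). A minimal sketch with a three-row sample:

```python
import pandas as pd

# Small sample of the euro_cities data, for illustration
df = pd.DataFrame({"city": ['Berlin', 'Madrid', 'Hamburg'],
                   "country": ['Germany', 'Spain', 'Germany'],
                   "population": [3426354, 3255944, 1739117]})

mask = df['country'] == 'Germany'
print(mask.sum())   # 2 -> two German cities in this sample
```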

We could do something similar if we wanted to find cities with populations of over 2 million.

df[df['population'] > 2000000]
               city         country  population
0            Moscow          Russia    10381222
1            London  United Kingdom     7556900
2  Saint Petersburg          Russia     5028000
3            Berlin         Germany     3426354
4            Madrid           Spain     3255944
5              Kyiv         Ukraine     2797553
6              Rome           Italy     2318895
7             Paris          France     2138551

We can combine comparisons with logical operators to find cities with populations between 700,000 and 800,000:

df[(df['population'] < 800000) & (df['population'] > 700000)]
          city         country  population
48  Zaporizhia         Ukraine      796217
49   Marseille          France      794811
50        Lodz          Poland      768755
51      Krakow          Poland      755050
52        Riga          Latvia      742572
53   Amsterdam     Netherlands      741636
54  Nottingham  United Kingdom      729977
55        Lviv         Ukraine      717803
56     Sevilla           Spain      703206
57   Tol'yatti          Russia      702879
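
Pandas also provides a between method for this kind of range filter; note that, unlike the strict < and > comparisons above, between includes both endpoints by default. A minimal sketch with a three-row sample of the same data:

```python
import pandas as pd

# Small sample of the euro_cities data, for illustration
df = pd.DataFrame({"city": ['Riga', 'Amsterdam', 'Berlin'],
                   "country": ['Latvia', 'Netherlands', 'Germany'],
                   "population": [742572, 741636, 3426354]})

# Roughly equivalent to (df['population'] >= 700000) & (df['population'] <= 800000)
print(df[df['population'].between(700000, 800000)])
```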

Grouping

If we wanted to find the total population living in these cities for each country, we could write a for loop. To do this, we would have to produce a list of individual countries, loop over each country, compute the total, append it to a new list, and then put everything back into a new DataFrame.

This is what that would look like:

countries = pd.unique(df['country'])

total_pop = []
for country in countries:
    
    country_df = df[df['country'] == country]
    
    total_pop.append(country_df['population'].sum())
new_df = pd.DataFrame({'country': countries,
                       'population': total_pop})
new_df.head(10)
          country  population
0          Russia    39954312
1  United Kingdom    12029600
2         Germany    11445295
3           Spain     7637517
4         Ukraine     9453918
5           Italy     6614141
6          France     2933362
7         Romania     1877155
8         Belarus     1742124
9         Hungary     1741041

But a better way of doing this would be to use the groupby function.

df.groupby('country')['population'].sum()
country
Austria                    1691468
Belarus                    1742124
Belgium                    1019022
Bosnia and Herzegovina      696731
Bulgaria                   1152556
Croatia                     698966
Czech Republic             1165581
Denmark                    1153615
Finland                     558457
France                     2933362
Germany                   11445295
Greece                      664046
Hungary                    1741041
Ireland                    1024027
Italy                      6614141
Latvia                      742572
Lithuania                   542366
Moldova                     635994
Netherlands                1339835
Norway                      580000
Poland                     4431189
Portugal                    517802
Romania                    1877155
Russia                    39954312
Serbia                     1273651
Spain                      7637517
Sweden                     2087816
Ukraine                    9453918
United Kingdom            12029600
Name: population, dtype: int64

The first argument groupby accepts is the column we want to group our data by (country in our case). The next argument is the column (or list of columns) to summarize (population). Finally, the operation does nothing until we specify how we want to aggregate the grouped data (sum in our case).
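
sum is only one choice of aggregation; any standard reduction can follow the groupby. For example, count gives the number of cities per country and mean the average city population. A minimal sketch with a three-row sample of the same data:

```python
import pandas as pd

# Small sample of the euro_cities data, for illustration
df = pd.DataFrame({"city": ['Berlin', 'Hamburg', 'Madrid'],
                   "country": ['Germany', 'Germany', 'Spain'],
                   "population": [3426354, 1739117, 3255944]})

print(df.groupby('country')['population'].count())  # cities per country
print(df.groupby('country')['population'].mean())   # mean city population
```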

It’s actually nice to reset the index after using groupby so that we end up with a DataFrame (rather than a Series).

new_df = df.groupby('country')['population'].sum().reset_index()
new_df.head(10)
                  country  population
0                 Austria     1691468
1                 Belarus     1742124
2                 Belgium     1019022
3  Bosnia and Herzegovina      696731
4                Bulgaria     1152556
5                 Croatia      698966
6          Czech Republic     1165581
7                 Denmark     1153615
8                 Finland      558457
9                  France     2933362

We hope you will agree that the groupby method is much cleaner and more interpretable. It is also a lot faster, which may be important when we have very large DataFrames with millions of rows.

Note

To summarize multiple columns, we could pass a list of columns like so: df.groupby('country')[['city','population']].sum(). In our case this would not produce anything useful for city, because summing string values is not meaningful.

Sorting

Similar to lists, we can sort DataFrames using the sort_values method. It takes two key arguments, by and ascending, which determine the column to sort by and whether to sort in ascending or descending order. So to find the 10 largest cities in Europe by population, we could type:

df.sort_values(by='population', ascending=False).head(10)
               city         country  population
0            Moscow          Russia    10381222
1            London  United Kingdom     7556900
2  Saint Petersburg          Russia     5028000
3            Berlin         Germany     3426354
4            Madrid           Spain     3255944
5              Kyiv         Ukraine     2797553
6              Rome           Italy     2318895
7             Paris          France     2138551
8         Bucharest         Romania     1877155
9             Minsk         Belarus     1742124

Another way of doing this would be to use the nlargest or nsmallest methods (see the Pandas documentation for details).
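
A minimal sketch of nlargest and nsmallest, using a three-row sample of the same data (the first argument is the number of rows to return, the second the column to rank by):

```python
import pandas as pd

# Small sample of the euro_cities data, for illustration
df = pd.DataFrame({"city": ['Moscow', 'London', 'Penza'],
                   "country": ['Russia', 'United Kingdom', 'Russia'],
                   "population": [10381222, 7556900, 512602]})

# Equivalent to df.sort_values(by='population', ascending=False).head(2)
print(df.nlargest(2, 'population'))
print(df.nsmallest(1, 'population'))
```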

Saving

We can save our DataFrames just as easily as we read the original file, using the to_csv method:

new_df.to_csv('data/new.csv')
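
One thing to watch: by default, to_csv also writes the row index as an extra first column of the file. If that is not wanted, pass index=False. A minimal sketch, using a two-row example table and a hypothetical local path:

```python
import pandas as pd

# Small example table, for illustration
new_df = pd.DataFrame({'country': ['Germany', 'Spain'],
                       'population': [11445295, 7637517]})

# index=False omits the 0, 1, ... row labels from the output file
new_df.to_csv('new.csv', index=False)
```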