Pandas DataFrames I
One of the pillars of spatial data science is a Python library called Pandas, which provides functions to import tabular data from various file formats (e.g. `.csv`, `.xlsx`) and supports many other data manipulation operations such as merging, reshaping, selecting, cleaning, and wrangling. This week we will introduce some basic Pandas operations. In Week 7, we will progress to more advanced usage of Pandas.
Note
The name “Pandas” is derived from “panel data”, an econometrics term for data sets that include observations over multiple time periods.
![terminology](../_images/terminology.webp)
Data structure
The main Pandas data structure is called a DataFrame, which looks similar to an Excel spreadsheet. A typical DataFrame is a 2-D array with rows, columns, indices, and column names. We can import Pandas and define a DataFrame like so:
import pandas as pd

df = pd.DataFrame({"city": ['Belfast', 'Cardiff', 'Edinburgh', 'London'],
                   "country": ['Northern Ireland', 'Wales', 'Scotland', 'England']})
df.head()
|    | city      | country          |
|---:|-----------|------------------|
| 0  | Belfast   | Northern Ireland |
| 1  | Cardiff   | Wales            |
| 2  | Edinburgh | Scotland         |
| 3  | London    | England          |
So what do we have here? We now have a small table containing countries in the United Kingdom and their capital cities. The table has two labeled columns, one called `city` and one called `country`. The table has four rows, with each row corresponding to an index between 0 and 3.
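The same table can also be built row by row. A minimal sketch, using a list of dictionaries (one dictionary per row) rather than the dictionary of columns shown above:

```python
import pandas as pd

# Each dictionary becomes one row; keys become column names
rows = [
    {"city": "Belfast", "country": "Northern Ireland"},
    {"city": "Cardiff", "country": "Wales"},
    {"city": "Edinburgh", "country": "Scotland"},
    {"city": "London", "country": "England"},
]
df = pd.DataFrame(rows)
print(df)
```

Both forms produce an identical DataFrame; which one is more convenient depends on how the data arrives.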
Reading data
In reality, we rarely make a table from scratch. Instead, we read files that already contain data and store their contents in variables. We can do this using `read_csv`, a general function for reading data files separated by commas, spaces, or other common separators. The function takes a path string and returns a `DataFrame` object.
df = pd.read_csv('data/euro_cities.csv')
# "df" stands for DataFrame
type(df)
pandas.core.frame.DataFrame
![eurostar](../_images/eurostar.png)
A concise way of checking what is contained in our `df` object is the `head` method, which returns the first five rows of the data.
df.head()
|    | city             | country        | population |
|---:|------------------|----------------|-----------:|
| 0  | Moscow           | Russia         | 10381222   |
| 1  | London           | United Kingdom | 7556900    |
| 2  | Saint Petersburg | Russia         | 5028000    |
| 3  | Berlin           | Germany        | 3426354    |
| 4  | Madrid           | Spain          | 3255944    |
Alternatively, we could check the last five rows of the data:
df.tail()
|    | city     | country  | population |
|---:|----------|----------|-----------:|
| 95 | Tyumen   | Russia   | 519119     |
| 96 | Lisbon   | Portugal | 517802     |
| 97 | Lipetsk  | Russia   | 515655     |
| 98 | Hannover | Germany  | 515140     |
| 99 | Penza    | Russia   | 512602     |
Or the last 10 rows:
df.tail(10)
|    | city         | country   | population |
|---:|--------------|-----------|-----------:|
| 90 | Orenburg     | Russia    | 550204     |
| 91 | Bremen       | Germany   | 546501     |
| 92 | Vilnius      | Lithuania | 542366     |
| 93 | Novokuznetsk | Russia    | 539616     |
| 94 | Ryazan'      | Russia    | 520173     |
| 95 | Tyumen       | Russia    | 519119     |
| 96 | Lisbon       | Portugal  | 517802     |
| 97 | Lipetsk      | Russia    | 515655     |
| 98 | Hannover     | Germany   | 515140     |
| 99 | Penza        | Russia    | 512602     |
DataFrame properties
To check the data type of each column, we can use the `dtypes` attribute:
df.dtypes
city object
country object
population int64
dtype: object
Here we see that population is an integer (with 64-bit precision) and the other two columns are objects (Pandas stores string data as “objects” by default).

We can also retrieve the column names of the DataFrame using the `columns` attribute.
df.columns
Index(['city', 'country', 'population'], dtype='object')
And, just like NumPy arrays, we can find the shape of our DataFrame using the `shape` attribute.
df.shape
(100, 3)
Indexing
We can select specific columns by name. The basic syntax is `dataframe[value]`, where `value` can be a single column name or a list of column names.
df['city']
0 Moscow
1 London
2 Saint Petersburg
3 Berlin
4 Madrid
...
95 Tyumen
96 Lisbon
97 Lipetsk
98 Hannover
99 Penza
Name: city, Length: 100, dtype: object
df[['city', 'country']]
|     | city             | country        |
|----:|------------------|----------------|
| 0   | Moscow           | Russia         |
| 1   | London           | United Kingdom |
| 2   | Saint Petersburg | Russia         |
| 3   | Berlin           | Germany        |
| 4   | Madrid           | Spain          |
| ... | ...              | ...            |
| 95  | Tyumen           | Russia         |
| 96  | Lisbon           | Portugal       |
| 97  | Lipetsk          | Russia         |
| 98  | Hannover         | Germany        |
| 99  | Penza            | Russia         |

100 rows × 2 columns
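Note the difference between the two forms of selection above: a single column name returns a one-dimensional Series, while a list of column names (even a list of length one) returns a DataFrame. A minimal sketch with a toy DataFrame:

```python
import pandas as pd

df = pd.DataFrame({"city": ["Moscow", "London"],
                   "country": ["Russia", "United Kingdom"]})

# A single name returns a Series...
single = df["city"]
print(type(single))

# ...while a list of names returns a DataFrame
subset = df[["city"]]
print(type(subset))
```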
We can select specific rows using the `iloc` indexer.
df.iloc[6]
city Rome
country Italy
population 2318895
Name: 6, dtype: object
The syntax for selecting the top 10 cities should be quite familiar.
df.iloc[0:10]
|    | city             | country        | population |
|---:|------------------|----------------|-----------:|
| 0  | Moscow           | Russia         | 10381222   |
| 1  | London           | United Kingdom | 7556900    |
| 2  | Saint Petersburg | Russia         | 5028000    |
| 3  | Berlin           | Germany        | 3426354    |
| 4  | Madrid           | Spain          | 3255944    |
| 5  | Kyiv             | Ukraine        | 2797553    |
| 6  | Rome             | Italy          | 2318895    |
| 7  | Paris            | France         | 2138551    |
| 8  | Bucharest        | Romania        | 1877155    |
| 9  | Minsk            | Belarus        | 1742124    |
If we wanted to find the city with the highest or lowest population, we could use `idxmax`/`idxmin`, which return the index of the highest/lowest value in the column.
df['population'].idxmax()
0
index_high = df['population'].idxmax()
city_high = df.iloc[index_high]['city']
f"The city with the largest population in Europe is {city_high}"
'The city with the largest population in Europe is Moscow'
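The `idxmin` case works the same way. A minimal sketch using a small stand-in DataFrame (not the full `euro_cities` data):

```python
import pandas as pd

# Toy stand-in for the full dataset
df = pd.DataFrame({"city": ["Moscow", "London", "Penza"],
                   "population": [10381222, 7556900, 512602]})

index_low = df["population"].idxmin()   # row index of the smallest value
city_low = df.iloc[index_low]["city"]
print(f"The city with the smallest population is {city_low}")
```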
Descriptive statistics
Pandas provides basic functions to calculate descriptive statistics, similar to what we might usually do with Microsoft Excel.
df['population'].mean()
1174041.59
df['population'].max()
10381222
df['population'].min()
512602
We can get all descriptive statistics at once using the `describe` function:
df['population'].describe()
count 1.000000e+02
mean 1.174042e+06
std 1.326413e+06
min 5.126020e+05
25% 5.969205e+05
50% 7.817830e+05
75% 1.152821e+06
max 1.038122e+07
Name: population, dtype: float64
Here 25%, 50%, and 75% are the quartiles (i.e. the 25th, 50th, and 75th percentiles). Often the output of `describe` is more readable if we round these data to two decimal places.
round(df['population'].describe(), 2)
count 100.00
mean 1174041.59
std 1326413.44
min 512602.00
25% 596920.50
50% 781783.00
75% 1152820.75
max 10381222.00
Name: population, dtype: float64
Slicing
Let’s say we wanted to find how many cities in Germany made it into this top 100 list. We would first produce a Boolean mask (i.e. values equal to `True` or `False`) using the `==` comparison operator.
mask = df['country'] == 'Germany'
mask
0 False
1 False
2 False
3 True
4 False
...
95 False
96 False
97 False
98 True
99 False
Name: country, Length: 100, dtype: bool
We could then apply this mask to our original DataFrame.
df[mask]
|    | city              | country | population |
|---:|-------------------|---------|-----------:|
| 3  | Berlin            | Germany | 3426354    |
| 11 | Hamburg           | Germany | 1739117    |
| 21 | Munich            | Germany | 1260391    |
| 40 | Koeln             | Germany | 963395     |
| 64 | Frankfurt am Main | Germany | 650000     |
| 75 | Essen             | Germany | 593085     |
| 77 | Stuttgart         | Germany | 589793     |
| 78 | Dortmund          | Germany | 588462     |
| 85 | Duesseldorf       | Germany | 573057     |
| 91 | Bremen            | Germany | 546501     |
| 98 | Hannover          | Germany | 515140     |
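The mask can also answer the original question (“how many cities?”) directly, because `True` counts as 1 when summed. A minimal sketch with a toy DataFrame standing in for the full data:

```python
import pandas as pd

# Toy stand-in for the full dataset
df = pd.DataFrame({"city": ["Berlin", "Hamburg", "Moscow"],
                   "country": ["Germany", "Germany", "Russia"]})

mask = df["country"] == "Germany"

# Summing a Boolean mask counts the True values, i.e. the matching rows
n_germany = mask.sum()
print(n_germany)
```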
We could do something similar if we wanted to find cities with populations of over 2 million.
df[df['population'] > 2000000]
|    | city             | country        | population |
|---:|------------------|----------------|-----------:|
| 0  | Moscow           | Russia         | 10381222   |
| 1  | London           | United Kingdom | 7556900    |
| 2  | Saint Petersburg | Russia         | 5028000    |
| 3  | Berlin           | Germany        | 3426354    |
| 4  | Madrid           | Spain          | 3255944    |
| 5  | Kyiv             | Ukraine        | 2797553    |
| 6  | Rome             | Italy          | 2318895    |
| 7  | Paris            | France         | 2138551    |
We can add logical operators to find cities with populations between 700,000 and 800,000:
df[(df['population'] < 800000) & (df['population'] > 700000)]
|    | city       | country        | population |
|---:|------------|----------------|-----------:|
| 48 | Zaporizhia | Ukraine        | 796217     |
| 49 | Marseille  | France         | 794811     |
| 50 | Lodz       | Poland         | 768755     |
| 51 | Krakow     | Poland         | 755050     |
| 52 | Riga       | Latvia         | 742572     |
| 53 | Amsterdam  | Netherlands    | 741636     |
| 54 | Nottingham | United Kingdom | 729977     |
| 55 | Lviv       | Ukraine        | 717803     |
| 56 | Sevilla    | Spain          | 703206     |
| 57 | Tol'yatti  | Russia         | 702879     |
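Pandas also provides `Series.between` as a named shorthand for this kind of two-sided comparison. A minimal sketch with a toy DataFrame (note that `between` includes both bounds by default, unlike the strict `<`/`>` comparisons above):

```python
import pandas as pd

# Toy stand-in for the full dataset
df = pd.DataFrame({"city": ["Riga", "Moscow", "Sevilla"],
                   "population": [742572, 10381222, 703206]})

# between() is inclusive of both endpoints by default
in_range = df[df["population"].between(700000, 800000)]
print(in_range)
```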
Grouping
If we wanted to find the total population living in these cities for each country, we could write a `for` loop. To do this, we would have to produce a list of individual countries, loop over each country, compute the total, append it to a new list… and then put everything back into a new DataFrame.

This is what that would look like:
countries = pd.unique(df['country'])
total_pop = []

for country in countries:
    country_df = df[df['country'] == country]
    total_pop.append(country_df['population'].sum())

new_df = pd.DataFrame({'country': countries,
                       'population': total_pop})
new_df.head(10)
new_df.head(10)
|    | country        | population |
|---:|----------------|-----------:|
| 0  | Russia         | 39954312   |
| 1  | United Kingdom | 12029600   |
| 2  | Germany        | 11445295   |
| 3  | Spain          | 7637517    |
| 4  | Ukraine        | 9453918    |
| 5  | Italy          | 6614141    |
| 6  | France         | 2933362    |
| 7  | Romania        | 1877155    |
| 8  | Belarus        | 1742124    |
| 9  | Hungary        | 1741041    |
But a better way of doing this would be to use the `groupby` function.
df.groupby('country')['population'].sum()
country
Austria 1691468
Belarus 1742124
Belgium 1019022
Bosnia and Herzegovina 696731
Bulgaria 1152556
Croatia 698966
Czech Republic 1165581
Denmark 1153615
Finland 558457
France 2933362
Germany 11445295
Greece 664046
Hungary 1741041
Ireland 1024027
Italy 6614141
Latvia 742572
Lithuania 542366
Moldova 635994
Netherlands 1339835
Norway 580000
Poland 4431189
Portugal 517802
Romania 1877155
Russia 39954312
Serbia 1273651
Spain 7637517
Sweden 2087816
Ukraine 9453918
United Kingdom 12029600
Name: population, dtype: int64
The first argument `groupby` accepts is the column we want to group our data by (`country` in our case). The next argument is the column (or list of columns) to summarize (`population`). Finally, this function does nothing until we specify how we want to aggregate our data (`sum` in this case).
It’s often nice to reset the index after using `groupby` so that we end up with a DataFrame (rather than a Series).
new_df = df.groupby('country')['population'].sum().reset_index()
new_df.head(10)
|    | country                | population |
|---:|------------------------|-----------:|
| 0  | Austria                | 1691468    |
| 1  | Belarus                | 1742124    |
| 2  | Belgium                | 1019022    |
| 3  | Bosnia and Herzegovina | 696731     |
| 4  | Bulgaria               | 1152556    |
| 5  | Croatia                | 698966     |
| 6  | Czech Republic         | 1165581    |
| 7  | Denmark                | 1153615    |
| 8  | Finland                | 558457     |
| 9  | France                 | 2933362    |
We hope you will agree that the `groupby` method is much cleaner and more interpretable. It is also a lot faster, which may be important when we have very large DataFrames with millions of rows.
Note
To summarize multiple columns we could include a list of columns like so: `df.groupby('country')[['city','population']].sum()`, but this would not produce anything different in our case because a string variable cannot be summed.
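`groupby` is not limited to a single statistic: the `agg` method computes several summaries per group at once. A minimal sketch with a toy DataFrame standing in for the full data:

```python
import pandas as pd

# Toy stand-in for the full dataset
df = pd.DataFrame({"country": ["Germany", "Germany", "Russia"],
                   "population": [3426354, 1739117, 10381222]})

# agg() applies each named function to the grouped column,
# producing one output column per function
summary = df.groupby("country")["population"].agg(["sum", "mean", "count"])
print(summary)
```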
Sorting
Similar to lists, we can sort DataFrames using the `sort_values` function. This function takes two arguments, `by` and `ascending`, which determine which column and which order we would like to sort by. So to find the 10 largest cities in Europe by population we could type:
df.sort_values(by='population', ascending=False).head(10)
|    | city             | country        | population |
|---:|------------------|----------------|-----------:|
| 0  | Moscow           | Russia         | 10381222   |
| 1  | London           | United Kingdom | 7556900    |
| 2  | Saint Petersburg | Russia         | 5028000    |
| 3  | Berlin           | Germany        | 3426354    |
| 4  | Madrid           | Spain          | 3255944    |
| 5  | Kyiv             | Ukraine        | 2797553    |
| 6  | Rome             | Italy          | 2318895    |
| 7  | Paris            | France         | 2138551    |
| 8  | Bucharest        | Romania        | 1877155    |
| 9  | Minsk            | Belarus        | 1742124    |
Another way of doing this would be to use the `nlargest` or `nsmallest` function.
Saving
We can save our DataFrames just as easily as reading the original file, using the `to_csv` function:
new_df.to_csv('data/new.csv')
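One thing to keep in mind: by default `to_csv` writes the row index as an extra unnamed first column, which is often unwanted; passing `index=False` omits it. A minimal sketch that writes to an in-memory buffer instead of a file so the output can be inspected:

```python
import io
import pandas as pd

df = pd.DataFrame({"city": ["Belfast", "Cardiff"],
                   "country": ["Northern Ireland", "Wales"]})

# index=False drops the row index from the written output;
# a StringIO buffer stands in for a file path here
buffer = io.StringIO()
df.to_csv(buffer, index=False)
print(buffer.getvalue())
```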