Pandas DataFrames II
Contents
Pandas DataFrames II¶
Since Pandas
provides such an extensive set of tools for working with table data, it is worth doing another demo on it. In this demo, we will import a new dataset called air_quality_no2_long.csv
which provides air quality data (Nitrogen Dioxide) from the measurement stations FR04014, BETR801 and London Westminster in Paris, Antwerp, and London, respectively.
import pandas as pd
df1 = pd.read_csv('data/air_quality_no2_long.csv')
df1.head()
city | country | date.utc | location | parameter | value | unit | |
---|---|---|---|---|---|---|---|
0 | Paris | FR | 2019-06-21 00:00:00+00:00 | FR04014 | no2 | 20.0 | µg/m³ |
1 | Paris | FR | 2019-06-20 23:00:00+00:00 | FR04014 | no2 | 21.8 | µg/m³ |
2 | Paris | FR | 2019-06-20 22:00:00+00:00 | FR04014 | no2 | 26.5 | µg/m³ |
3 | Paris | FR | 2019-06-20 21:00:00+00:00 | FR04014 | no2 | 24.9 | µg/m³ |
4 | Paris | FR | 2019-06-20 20:00:00+00:00 | FR04014 | no2 | 21.4 | µg/m³ |
Let’s say we had another air quality dataset, except this time for particulate matter less than 2.5 micrometers (PM2.5).
df2 = pd.read_csv('data/air_quality_pm25_long.csv')
df2.head()
city | country | date.utc | location | parameter | value | unit | |
---|---|---|---|---|---|---|---|
0 | Antwerpen | BE | 2019-06-18 06:00:00+00:00 | BETR801 | pm25 | 18.0 | µg/m³ |
1 | Antwerpen | BE | 2019-06-17 08:00:00+00:00 | BETR801 | pm25 | 6.5 | µg/m³ |
2 | Antwerpen | BE | 2019-06-17 07:00:00+00:00 | BETR801 | pm25 | 18.5 | µg/m³ |
3 | Antwerpen | BE | 2019-06-17 06:00:00+00:00 | BETR801 | pm25 | 16.0 | µg/m³ |
4 | Antwerpen | BE | 2019-06-17 05:00:00+00:00 | BETR801 | pm25 | 7.5 | µg/m³ |
Combining data from multiple tables¶
If we wanted a single DataFrame that included both these tables, we could use the concat
function. The concat
function performs the operation along one of the axes (row-wise or column-wise). By default concatenation is along axis=0
, so the resulting table combines the rows of the input tables.
air_quality = pd.concat([df1, df2], axis=0)
Note
It’s useful to know the shape of the DataFrames before (and after) the concatenation so we know that it worked as expected.
df1.shape, df2.shape, air_quality.shape
((2068, 7), (1110, 7), (3178, 7))
We can also check that the DataFrames were concatenated properly by sorting by the date.utc
columns. If we do so, we will find both no2
and pm25
values contained in the parameter column.
air_quality.sort_values("date.utc").head(10)
city | country | date.utc | location | parameter | value | unit | |
---|---|---|---|---|---|---|---|
1109 | London | GB | 2019-05-07 01:00:00+00:00 | London Westminster | pm25 | 8.0 | µg/m³ |
2067 | London | GB | 2019-05-07 01:00:00+00:00 | London Westminster | no2 | 23.0 | µg/m³ |
1098 | Antwerpen | BE | 2019-05-07 01:00:00+00:00 | BETR801 | no2 | 50.5 | µg/m³ |
1003 | Paris | FR | 2019-05-07 01:00:00+00:00 | FR04014 | no2 | 25.0 | µg/m³ |
100 | Antwerpen | BE | 2019-05-07 01:00:00+00:00 | BETR801 | pm25 | 12.5 | µg/m³ |
2066 | London | GB | 2019-05-07 02:00:00+00:00 | London Westminster | no2 | 19.0 | µg/m³ |
1097 | Antwerpen | BE | 2019-05-07 02:00:00+00:00 | BETR801 | no2 | 45.0 | µg/m³ |
1002 | Paris | FR | 2019-05-07 02:00:00+00:00 | FR04014 | no2 | 27.7 | µg/m³ |
99 | Antwerpen | BE | 2019-05-07 02:00:00+00:00 | BETR801 | pm25 | 14.0 | µg/m³ |
1108 | London | GB | 2019-05-07 02:00:00+00:00 | London Westminster | pm25 | 8.0 | µg/m³ |
Joining multiple tables by a common identifier¶
The coordinates of the air quality measurement stations are stored in a separate file. It would be useful if we could add the coordinate information to the original air quality DataFrame. To do this, we can use the merge
function.
stations_coord = pd.read_csv('data/air_quality_stations.csv')
stations_coord.head()
location | coordinates.latitude | coordinates.longitude | |
---|---|---|---|
0 | BELAL01 | 51.23619 | 4.38522 |
1 | BELHB23 | 51.17030 | 4.34100 |
2 | BELLD01 | 51.10998 | 5.00486 |
3 | BELLD02 | 51.12038 | 5.02155 |
4 | BELR833 | 51.32766 | 4.36226 |
Both tables have the column location
in common which is used as a key
to combine the information. By choosing the left
merge, only the locations available in the air_quality
(left) table, i.e. FR04014
, BETR801
and London Westminster
, are returned in the new table.
air_quality = pd.merge(air_quality, stations_coord, how="left", on="location")
air_quality.head()
city | country | date.utc | location | parameter | value | unit | coordinates.latitude | coordinates.longitude | |
---|---|---|---|---|---|---|---|---|---|
0 | Paris | FR | 2019-06-21 00:00:00+00:00 | FR04014 | no2 | 20.0 | µg/m³ | 48.83724 | 2.3939 |
1 | Paris | FR | 2019-06-20 23:00:00+00:00 | FR04014 | no2 | 21.8 | µg/m³ | 48.83724 | 2.3939 |
2 | Paris | FR | 2019-06-20 22:00:00+00:00 | FR04014 | no2 | 26.5 | µg/m³ | 48.83724 | 2.3939 |
3 | Paris | FR | 2019-06-20 21:00:00+00:00 | FR04014 | no2 | 24.9 | µg/m³ | 48.83724 | 2.3939 |
4 | Paris | FR | 2019-06-20 20:00:00+00:00 | FR04014 | no2 | 21.4 | µg/m³ | 48.83724 | 2.3939 |
We have one more table to merge with our original DataFrame. This table is called air_quality_paramters
.
air_quality_parameters = pd.read_csv('data/air_quality_parameters.csv')
air_quality_parameters.head(6)
id | description | name | |
---|---|---|---|
0 | bc | Black Carbon | BC |
1 | co | Carbon Monoxide | CO |
2 | no2 | Nitrogen Dioxide | NO2 |
3 | o3 | Ozone | O3 |
4 | pm10 | Particulate matter less than 10 micrometers in... | PM10 |
5 | pm25 | Particulate matter less than 2.5 micrometers i... | PM2.5 |
Note that, compared to the previous example, there is no common column name. However, the parameter
column in the air_quality
table and the id
column in the air_quality_parameters
table both provide the measured variable in a common format. We could rename one of the columns. Alternatively, we could use the left_on
and right_on
arguments to make the link between the two tables.
air_quality = pd.merge(air_quality, air_quality_parameters,
how='left', left_on='parameter', right_on='id')
air_quality.head()
city | country | date.utc | location | parameter | value | unit | coordinates.latitude | coordinates.longitude | id | description | name | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Paris | FR | 2019-06-21 00:00:00+00:00 | FR04014 | no2 | 20.0 | µg/m³ | 48.83724 | 2.3939 | no2 | Nitrogen Dioxide | NO2 |
1 | Paris | FR | 2019-06-20 23:00:00+00:00 | FR04014 | no2 | 21.8 | µg/m³ | 48.83724 | 2.3939 | no2 | Nitrogen Dioxide | NO2 |
2 | Paris | FR | 2019-06-20 22:00:00+00:00 | FR04014 | no2 | 26.5 | µg/m³ | 48.83724 | 2.3939 | no2 | Nitrogen Dioxide | NO2 |
3 | Paris | FR | 2019-06-20 21:00:00+00:00 | FR04014 | no2 | 24.9 | µg/m³ | 48.83724 | 2.3939 | no2 | Nitrogen Dioxide | NO2 |
4 | Paris | FR | 2019-06-20 20:00:00+00:00 | FR04014 | no2 | 21.4 | µg/m³ | 48.83724 | 2.3939 | no2 | Nitrogen Dioxide | NO2 |
Dropping columns¶
There are various ways to drop one or multiple columns from a Dataframe. The easiest is the use the drop
function.
# First create a new variable so that we add it later
name = air_quality['name']
air_quality = air_quality.drop(['name'], axis=1)
air_quality.head()
city | country | date.utc | location | parameter | value | unit | coordinates.latitude | coordinates.longitude | id | description | |
---|---|---|---|---|---|---|---|---|---|---|---|
0 | Paris | FR | 2019-06-21 00:00:00+00:00 | FR04014 | no2 | 20.0 | µg/m³ | 48.83724 | 2.3939 | no2 | Nitrogen Dioxide |
1 | Paris | FR | 2019-06-20 23:00:00+00:00 | FR04014 | no2 | 21.8 | µg/m³ | 48.83724 | 2.3939 | no2 | Nitrogen Dioxide |
2 | Paris | FR | 2019-06-20 22:00:00+00:00 | FR04014 | no2 | 26.5 | µg/m³ | 48.83724 | 2.3939 | no2 | Nitrogen Dioxide |
3 | Paris | FR | 2019-06-20 21:00:00+00:00 | FR04014 | no2 | 24.9 | µg/m³ | 48.83724 | 2.3939 | no2 | Nitrogen Dioxide |
4 | Paris | FR | 2019-06-20 20:00:00+00:00 | FR04014 | no2 | 21.4 | µg/m³ | 48.83724 | 2.3939 | no2 | Nitrogen Dioxide |
Adding columns¶
Likewise, we can add columns. To do this we can define a new columns (i.e. name
) and set it equal to a list of values.
air_quality['name'] = name
air_quality.head()
city | country | date.utc | location | parameter | value | unit | coordinates.latitude | coordinates.longitude | id | description | name | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Paris | FR | 2019-06-21 00:00:00+00:00 | FR04014 | no2 | 20.0 | µg/m³ | 48.83724 | 2.3939 | no2 | Nitrogen Dioxide | NO2 |
1 | Paris | FR | 2019-06-20 23:00:00+00:00 | FR04014 | no2 | 21.8 | µg/m³ | 48.83724 | 2.3939 | no2 | Nitrogen Dioxide | NO2 |
2 | Paris | FR | 2019-06-20 22:00:00+00:00 | FR04014 | no2 | 26.5 | µg/m³ | 48.83724 | 2.3939 | no2 | Nitrogen Dioxide | NO2 |
3 | Paris | FR | 2019-06-20 21:00:00+00:00 | FR04014 | no2 | 24.9 | µg/m³ | 48.83724 | 2.3939 | no2 | Nitrogen Dioxide | NO2 |
4 | Paris | FR | 2019-06-20 20:00:00+00:00 | FR04014 | no2 | 21.4 | µg/m³ | 48.83724 | 2.3939 | no2 | Nitrogen Dioxide | NO2 |
Note
This will only work if the new data has the same number of rows as the original DataFrame.
Acknowledgments¶
This demo was inspired by the Pandas API reference tutorial.