Pandas DataFrames II¶

Since Pandas provides such an extensive set of tools for working with table data, it is worth doing another demo on it. In this demo, we will import a new dataset called air_quality_no2_long.csv which provides air quality data (Nitrogen Dioxide) from the measurement stations FR04014, BETR801 and London Westminster in Paris, Antwerp, and London, respectively.

import pandas as pd
df1 = pd.read_csv('data/air_quality_no2_long.csv')
df1.head()

	city	country	date.utc	location	parameter	value	unit
0	Paris	FR	2019-06-21 00:00:00+00:00	FR04014	no2	20.0	µg/m³
1	Paris	FR	2019-06-20 23:00:00+00:00	FR04014	no2	21.8	µg/m³
2	Paris	FR	2019-06-20 22:00:00+00:00	FR04014	no2	26.5	µg/m³
3	Paris	FR	2019-06-20 21:00:00+00:00	FR04014	no2	24.9	µg/m³
4	Paris	FR	2019-06-20 20:00:00+00:00	FR04014	no2	21.4	µg/m³

Let’s say we had another air quality dataset, except this time for particulate matter less than 2.5 micrometers (PM2.5).

df2 = pd.read_csv('data/air_quality_pm25_long.csv')
df2.head()

	city	country	date.utc	location	parameter	value	unit
0	Antwerpen	BE	2019-06-18 06:00:00+00:00	BETR801	pm25	18.0	µg/m³
1	Antwerpen	BE	2019-06-17 08:00:00+00:00	BETR801	pm25	6.5	µg/m³
2	Antwerpen	BE	2019-06-17 07:00:00+00:00	BETR801	pm25	18.5	µg/m³
3	Antwerpen	BE	2019-06-17 06:00:00+00:00	BETR801	pm25	16.0	µg/m³
4	Antwerpen	BE	2019-06-17 05:00:00+00:00	BETR801	pm25	7.5	µg/m³

Combining data from multiple tables¶

If we wanted a single DataFrame that included both these tables, we could use the concat function. The concat function performs the operation along one of the axes (row-wise or column-wise). By default concatenation is along axis=0, so the resulting table combines the rows of the input tables.

air_quality = pd.concat([df1, df2], axis=0)

Note

It’s useful to know the shape of the DataFrames before (and after) the concatenation so we know that it worked as expected.

df1.shape, df2.shape, air_quality.shape

((2068, 7), (1110, 7), (3178, 7))

We can also check that the DataFrames were concatenated properly by sorting by the date.utc columns. If we do so, we will find both no2 and pm25 values contained in the parameter column.

air_quality.sort_values("date.utc").head(10)

	city	country	date.utc	location	parameter	value	unit
1109	London	GB	2019-05-07 01:00:00+00:00	London Westminster	pm25	8.0	µg/m³
2067	London	GB	2019-05-07 01:00:00+00:00	London Westminster	no2	23.0	µg/m³
1098	Antwerpen	BE	2019-05-07 01:00:00+00:00	BETR801	no2	50.5	µg/m³
1003	Paris	FR	2019-05-07 01:00:00+00:00	FR04014	no2	25.0	µg/m³
100	Antwerpen	BE	2019-05-07 01:00:00+00:00	BETR801	pm25	12.5	µg/m³
2066	London	GB	2019-05-07 02:00:00+00:00	London Westminster	no2	19.0	µg/m³
1097	Antwerpen	BE	2019-05-07 02:00:00+00:00	BETR801	no2	45.0	µg/m³
1002	Paris	FR	2019-05-07 02:00:00+00:00	FR04014	no2	27.7	µg/m³
99	Antwerpen	BE	2019-05-07 02:00:00+00:00	BETR801	pm25	14.0	µg/m³
1108	London	GB	2019-05-07 02:00:00+00:00	London Westminster	pm25	8.0	µg/m³

Joining multiple tables by a common identifier¶

The coordinates of the air quality measurement stations are stored in a separate file. It would be useful if we could add the coordinate information to the original air quality DataFrame. To do this, we can use the merge function.

stations_coord = pd.read_csv('data/air_quality_stations.csv')
stations_coord.head()

	location	coordinates.latitude	coordinates.longitude
0	BELAL01	51.23619	4.38522
1	BELHB23	51.17030	4.34100
2	BELLD01	51.10998	5.00486
3	BELLD02	51.12038	5.02155
4	BELR833	51.32766	4.36226

Both tables have the column location in common which is used as a key to combine the information. By choosing the left merge, only the locations available in the air_quality (left) table, i.e. FR04014, BETR801 and London Westminster, are returned in the new table.

air_quality = pd.merge(air_quality, stations_coord, how="left", on="location")
air_quality.head()

	city	country	date.utc	location	parameter	value	unit	coordinates.latitude	coordinates.longitude
0	Paris	FR	2019-06-21 00:00:00+00:00	FR04014	no2	20.0	µg/m³	48.83724	2.3939
1	Paris	FR	2019-06-20 23:00:00+00:00	FR04014	no2	21.8	µg/m³	48.83724	2.3939
2	Paris	FR	2019-06-20 22:00:00+00:00	FR04014	no2	26.5	µg/m³	48.83724	2.3939
3	Paris	FR	2019-06-20 21:00:00+00:00	FR04014	no2	24.9	µg/m³	48.83724	2.3939
4	Paris	FR	2019-06-20 20:00:00+00:00	FR04014	no2	21.4	µg/m³	48.83724	2.3939

We have one more table to merge with our original DataFrame. This table is called air_quality_paramters.

air_quality_parameters = pd.read_csv('data/air_quality_parameters.csv')
air_quality_parameters.head(6)

	id	description	name
0	bc	Black Carbon	BC
1	co	Carbon Monoxide	CO
2	no2	Nitrogen Dioxide	NO2
3	o3	Ozone	O3
4	pm10	Particulate matter less than 10 micrometers in...	PM10
5	pm25	Particulate matter less than 2.5 micrometers i...	PM2.5

Note that, compared to the previous example, there is no common column name. However, the parameter column in the air_quality table and the id column in the air_quality_parameters table both provide the measured variable in a common format. We could rename one of the columns. Alternatively, we could use the left_on and right_on arguments to make the link between the two tables.

air_quality = pd.merge(air_quality, air_quality_parameters,
                       how='left', left_on='parameter', right_on='id')
air_quality.head()

	city	country	date.utc	location	parameter	value	unit	coordinates.latitude	coordinates.longitude	id	description	name
0	Paris	FR	2019-06-21 00:00:00+00:00	FR04014	no2	20.0	µg/m³	48.83724	2.3939	no2	Nitrogen Dioxide	NO2
1	Paris	FR	2019-06-20 23:00:00+00:00	FR04014	no2	21.8	µg/m³	48.83724	2.3939	no2	Nitrogen Dioxide	NO2
2	Paris	FR	2019-06-20 22:00:00+00:00	FR04014	no2	26.5	µg/m³	48.83724	2.3939	no2	Nitrogen Dioxide	NO2
3	Paris	FR	2019-06-20 21:00:00+00:00	FR04014	no2	24.9	µg/m³	48.83724	2.3939	no2	Nitrogen Dioxide	NO2
4	Paris	FR	2019-06-20 20:00:00+00:00	FR04014	no2	21.4	µg/m³	48.83724	2.3939	no2	Nitrogen Dioxide	NO2

Dropping columns¶

There are various ways to drop one or multiple columns from a Dataframe. The easiest is the use the drop function.

# First create a new variable so that we add it later
name = air_quality['name']

air_quality = air_quality.drop(['name'], axis=1)
air_quality.head()

	city	country	date.utc	location	parameter	value	unit	coordinates.latitude	coordinates.longitude	id	description
0	Paris	FR	2019-06-21 00:00:00+00:00	FR04014	no2	20.0	µg/m³	48.83724	2.3939	no2	Nitrogen Dioxide
1	Paris	FR	2019-06-20 23:00:00+00:00	FR04014	no2	21.8	µg/m³	48.83724	2.3939	no2	Nitrogen Dioxide
2	Paris	FR	2019-06-20 22:00:00+00:00	FR04014	no2	26.5	µg/m³	48.83724	2.3939	no2	Nitrogen Dioxide
3	Paris	FR	2019-06-20 21:00:00+00:00	FR04014	no2	24.9	µg/m³	48.83724	2.3939	no2	Nitrogen Dioxide
4	Paris	FR	2019-06-20 20:00:00+00:00	FR04014	no2	21.4	µg/m³	48.83724	2.3939	no2	Nitrogen Dioxide

Adding columns¶

Likewise, we can add columns. To do this we can define a new columns (i.e. name) and set it equal to a list of values.

air_quality['name'] = name
air_quality.head()

	city	country	date.utc	location	parameter	value	unit	coordinates.latitude	coordinates.longitude	id	description	name
0	Paris	FR	2019-06-21 00:00:00+00:00	FR04014	no2	20.0	µg/m³	48.83724	2.3939	no2	Nitrogen Dioxide	NO2
1	Paris	FR	2019-06-20 23:00:00+00:00	FR04014	no2	21.8	µg/m³	48.83724	2.3939	no2	Nitrogen Dioxide	NO2
2	Paris	FR	2019-06-20 22:00:00+00:00	FR04014	no2	26.5	µg/m³	48.83724	2.3939	no2	Nitrogen Dioxide	NO2
3	Paris	FR	2019-06-20 21:00:00+00:00	FR04014	no2	24.9	µg/m³	48.83724	2.3939	no2	Nitrogen Dioxide	NO2
4	Paris	FR	2019-06-20 20:00:00+00:00	FR04014	no2	21.4	µg/m³	48.83724	2.3939	no2	Nitrogen Dioxide	NO2

Note

This will only work if the new data has the same number of rows as the original DataFrame.

Acknowledgments¶

This demo was inspired by the Pandas API reference tutorial.

Programming for spatial data science

Pandas DataFrames II

Contents