Pandas DataFrames II

Since Pandas provides such an extensive set of tools for working with table data, it is worth doing another demo on it. In this demo, we will import a new dataset called air_quality_no2_long.csv which provides air quality data (Nitrogen Dioxide) from the measurement stations FR04014, BETR801 and London Westminster in Paris, Antwerp, and London, respectively.

import pandas as pd
df1 = pd.read_csv('data/air_quality_no2_long.csv')
df1.head()
city country date.utc location parameter value unit
0 Paris FR 2019-06-21 00:00:00+00:00 FR04014 no2 20.0 µg/m³
1 Paris FR 2019-06-20 23:00:00+00:00 FR04014 no2 21.8 µg/m³
2 Paris FR 2019-06-20 22:00:00+00:00 FR04014 no2 26.5 µg/m³
3 Paris FR 2019-06-20 21:00:00+00:00 FR04014 no2 24.9 µg/m³
4 Paris FR 2019-06-20 20:00:00+00:00 FR04014 no2 21.4 µg/m³

Let’s say we had another air quality dataset, except this time for particulate matter less than 2.5 micrometers (PM2.5).

df2 = pd.read_csv('data/air_quality_pm25_long.csv')
df2.head()
city country date.utc location parameter value unit
0 Antwerpen BE 2019-06-18 06:00:00+00:00 BETR801 pm25 18.0 µg/m³
1 Antwerpen BE 2019-06-17 08:00:00+00:00 BETR801 pm25 6.5 µg/m³
2 Antwerpen BE 2019-06-17 07:00:00+00:00 BETR801 pm25 18.5 µg/m³
3 Antwerpen BE 2019-06-17 06:00:00+00:00 BETR801 pm25 16.0 µg/m³
4 Antwerpen BE 2019-06-17 05:00:00+00:00 BETR801 pm25 7.5 µg/m³

Combining data from multiple tables

If we wanted a single DataFrame that included both these tables, we could use the concat function. The concat function performs the operation along one of the axes (row-wise or column-wise). By default concatenation is along axis=0, so the resulting table combines the rows of the input tables.

../_images/concat_row.svg
air_quality = pd.concat([df1, df2], axis=0)

Note

It’s useful to know the shape of the DataFrames before (and after) the concatenation so we know that it worked as expected.

df1.shape, df2.shape, air_quality.shape
((2068, 7), (1110, 7), (3178, 7))

We can also check that the DataFrames were concatenated properly by sorting by the date.utc columns. If we do so, we will find both no2 and pm25 values contained in the parameter column.

air_quality.sort_values("date.utc").head(10)
city country date.utc location parameter value unit
1109 London GB 2019-05-07 01:00:00+00:00 London Westminster pm25 8.0 µg/m³
2067 London GB 2019-05-07 01:00:00+00:00 London Westminster no2 23.0 µg/m³
1098 Antwerpen BE 2019-05-07 01:00:00+00:00 BETR801 no2 50.5 µg/m³
1003 Paris FR 2019-05-07 01:00:00+00:00 FR04014 no2 25.0 µg/m³
100 Antwerpen BE 2019-05-07 01:00:00+00:00 BETR801 pm25 12.5 µg/m³
2066 London GB 2019-05-07 02:00:00+00:00 London Westminster no2 19.0 µg/m³
1097 Antwerpen BE 2019-05-07 02:00:00+00:00 BETR801 no2 45.0 µg/m³
1002 Paris FR 2019-05-07 02:00:00+00:00 FR04014 no2 27.7 µg/m³
99 Antwerpen BE 2019-05-07 02:00:00+00:00 BETR801 pm25 14.0 µg/m³
1108 London GB 2019-05-07 02:00:00+00:00 London Westminster pm25 8.0 µg/m³

Joining multiple tables by a common identifier

The coordinates of the air quality measurement stations are stored in a separate file. It would be useful if we could add the coordinate information to the original air quality DataFrame. To do this, we can use the merge function.

../_images/merge_left.svg
stations_coord = pd.read_csv('data/air_quality_stations.csv')
stations_coord.head()
location coordinates.latitude coordinates.longitude
0 BELAL01 51.23619 4.38522
1 BELHB23 51.17030 4.34100
2 BELLD01 51.10998 5.00486
3 BELLD02 51.12038 5.02155
4 BELR833 51.32766 4.36226

Both tables have the column location in common which is used as a key to combine the information. By choosing the left merge, only the locations available in the air_quality (left) table, i.e. FR04014, BETR801 and London Westminster, are returned in the new table.

air_quality = pd.merge(air_quality, stations_coord, how="left", on="location")
air_quality.head()
city country date.utc location parameter value unit coordinates.latitude coordinates.longitude
0 Paris FR 2019-06-21 00:00:00+00:00 FR04014 no2 20.0 µg/m³ 48.83724 2.3939
1 Paris FR 2019-06-20 23:00:00+00:00 FR04014 no2 21.8 µg/m³ 48.83724 2.3939
2 Paris FR 2019-06-20 22:00:00+00:00 FR04014 no2 26.5 µg/m³ 48.83724 2.3939
3 Paris FR 2019-06-20 21:00:00+00:00 FR04014 no2 24.9 µg/m³ 48.83724 2.3939
4 Paris FR 2019-06-20 20:00:00+00:00 FR04014 no2 21.4 µg/m³ 48.83724 2.3939

We have one more table to merge with our original DataFrame. This table is called air_quality_paramters.

air_quality_parameters = pd.read_csv('data/air_quality_parameters.csv')
air_quality_parameters.head(6)
id description name
0 bc Black Carbon BC
1 co Carbon Monoxide CO
2 no2 Nitrogen Dioxide NO2
3 o3 Ozone O3
4 pm10 Particulate matter less than 10 micrometers in... PM10
5 pm25 Particulate matter less than 2.5 micrometers i... PM2.5

Note that, compared to the previous example, there is no common column name. However, the parameter column in the air_quality table and the id column in the air_quality_parameters table both provide the measured variable in a common format. We could rename one of the columns. Alternatively, we could use the left_on and right_on arguments to make the link between the two tables.

air_quality = pd.merge(air_quality, air_quality_parameters,
                       how='left', left_on='parameter', right_on='id')
air_quality.head()
city country date.utc location parameter value unit coordinates.latitude coordinates.longitude id description name
0 Paris FR 2019-06-21 00:00:00+00:00 FR04014 no2 20.0 µg/m³ 48.83724 2.3939 no2 Nitrogen Dioxide NO2
1 Paris FR 2019-06-20 23:00:00+00:00 FR04014 no2 21.8 µg/m³ 48.83724 2.3939 no2 Nitrogen Dioxide NO2
2 Paris FR 2019-06-20 22:00:00+00:00 FR04014 no2 26.5 µg/m³ 48.83724 2.3939 no2 Nitrogen Dioxide NO2
3 Paris FR 2019-06-20 21:00:00+00:00 FR04014 no2 24.9 µg/m³ 48.83724 2.3939 no2 Nitrogen Dioxide NO2
4 Paris FR 2019-06-20 20:00:00+00:00 FR04014 no2 21.4 µg/m³ 48.83724 2.3939 no2 Nitrogen Dioxide NO2

Dropping columns

There are various ways to drop one or multiple columns from a Dataframe. The easiest is the use the drop function.

# First create a new variable so that we add it later
name = air_quality['name']
air_quality = air_quality.drop(['name'], axis=1)
air_quality.head()
city country date.utc location parameter value unit coordinates.latitude coordinates.longitude id description
0 Paris FR 2019-06-21 00:00:00+00:00 FR04014 no2 20.0 µg/m³ 48.83724 2.3939 no2 Nitrogen Dioxide
1 Paris FR 2019-06-20 23:00:00+00:00 FR04014 no2 21.8 µg/m³ 48.83724 2.3939 no2 Nitrogen Dioxide
2 Paris FR 2019-06-20 22:00:00+00:00 FR04014 no2 26.5 µg/m³ 48.83724 2.3939 no2 Nitrogen Dioxide
3 Paris FR 2019-06-20 21:00:00+00:00 FR04014 no2 24.9 µg/m³ 48.83724 2.3939 no2 Nitrogen Dioxide
4 Paris FR 2019-06-20 20:00:00+00:00 FR04014 no2 21.4 µg/m³ 48.83724 2.3939 no2 Nitrogen Dioxide

Adding columns

Likewise, we can add columns. To do this we can define a new columns (i.e. name) and set it equal to a list of values.

air_quality['name'] = name
air_quality.head()
city country date.utc location parameter value unit coordinates.latitude coordinates.longitude id description name
0 Paris FR 2019-06-21 00:00:00+00:00 FR04014 no2 20.0 µg/m³ 48.83724 2.3939 no2 Nitrogen Dioxide NO2
1 Paris FR 2019-06-20 23:00:00+00:00 FR04014 no2 21.8 µg/m³ 48.83724 2.3939 no2 Nitrogen Dioxide NO2
2 Paris FR 2019-06-20 22:00:00+00:00 FR04014 no2 26.5 µg/m³ 48.83724 2.3939 no2 Nitrogen Dioxide NO2
3 Paris FR 2019-06-20 21:00:00+00:00 FR04014 no2 24.9 µg/m³ 48.83724 2.3939 no2 Nitrogen Dioxide NO2
4 Paris FR 2019-06-20 20:00:00+00:00 FR04014 no2 21.4 µg/m³ 48.83724 2.3939 no2 Nitrogen Dioxide NO2

Note

This will only work if the new data has the same number of rows as the original DataFrame.

Acknowledgments

This demo was inspired by the Pandas API reference tutorial.