This repository has been archived on 2024-10-22. You can view files and clone it, but cannot push or open issues or pull requests.
va/Assignment1/Assignment1.ipynb

{
"cells": [
{
"cell_type": "markdown",
"id": "46302376",
"metadata": {},
"source": [
"# S&DE Atelier - Visual Analytics\n",
"\n",
"# Assignment 1\n",
"\n",
"**Due** April 6, 2023 @23:55 \n",
"\n",
"**Contacts**: marco.dambros@usi.ch - carmen.armenti@usi.ch\n",
"\n",
"---\n",
"\n",
"The goal of this assignment is to use Python and Jupyter notebook to explore, analyze and visualize the datasets provided. To solve the assignment you should apply the knowledge you gained from the theoretical and practical lectures. In particular, when creating tabular or graphical representations you should apply the principles you learned from theoretical lectures and use the technologies presented during practical lectures. As for the visualization library, we suggest using the libraries presented in class (Seaborn, Matplotlib, Bokeh), but other libraries (e.g., plotly) may also be used. You should submit a Jupyter notebook (named `SurenameName_Assignment1.ipynb`) that contains your solutions and the steps followed to arrive at these solutions. Please follow the structure of the assignment to solve the exercises.\n",
"\n",
"The datasets you need to use are described in the **Datasets description** section."
]
},
{
"cell_type": "code",
"execution_count": 35,
"id": "fcf3beb9",
"metadata": {},
"outputs": [],
"source": [
"#%pip install pandas seaborn matplotlib bokeh ftfy geopandas jupyter_bokeh\n",
"\n",
"import pandas as pd\n",
"import numpy as np\n",
"import seaborn as sns\n",
"import matplotlib.pyplot as plt\n",
"import bokeh\n",
"import ftfy\n",
"import matplotlib as mpl\n",
"import geopandas as gpd\n",
"from bokeh.plotting import figure, show, output_notebook\n",
"from bokeh.models import GeoJSONDataSource, ColumnDataSource, Legend, BoxSelectTool, HoverTool, TapTool, CustomJS\n",
"from bokeh.layouts import gridplot, column, row"
]
},
{
"cell_type": "markdown",
"id": "6f271000",
"metadata": {},
"source": [
"## Exercise 1 - Data quality (15 points) 🧼\n",
"\n",
"In the Used Cars dataset identify the missing and invalid values for the columns: `vehicle type`, `price`, `brand`, and `month of registration`. If needed, standardize the information and convert them to unique values. Please specify for each column the number of missing or invalid instances. The prices are in euros and the range of accepted prices is between €1'000 and €100'000.\n",
"Once you identified missing/invalid values for the given columns, remove all rows where one or more columns have invalid/missing data.\n",
"Show the steps that you follow to reach the solution. You can choose your preferred approach/technology to clean the dataset (e.g., Python vanilla, Pandas, OpenRefine). "
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "a0af6847",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"('Ü', 'sloppy-windows-1252')"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# UTF-8 decoding fails because of this byte\n",
"ftfy.guess_bytes(b'\\xDC')"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "22ce9426",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>dateCrawled</th>\n",
" <th>name</th>\n",
" <th>seller</th>\n",
" <th>offerType</th>\n",
" <th>price</th>\n",
" <th>abtest</th>\n",
" <th>vehicleType</th>\n",
" <th>yearOfRegistration</th>\n",
" <th>gearbox</th>\n",
" <th>powerPS</th>\n",
" <th>model</th>\n",
" <th>kilometer</th>\n",
" <th>monthOfRegistration</th>\n",
" <th>fuelType</th>\n",
" <th>brand</th>\n",
" <th>notRepairedDamage</th>\n",
" <th>dateCreated</th>\n",
" <th>nrOfPictures</th>\n",
" <th>postalCode</th>\n",
" <th>lastSeen</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>2016-03-24 11:52:17</td>\n",
" <td>Golf_3_1.6</td>\n",
" <td>privat</td>\n",
" <td>Angebot</td>\n",
" <td>480</td>\n",
" <td>test</td>\n",
" <td>NaN</td>\n",
" <td>1993</td>\n",
" <td>manuell</td>\n",
" <td>0</td>\n",
" <td>golf</td>\n",
" <td>150000</td>\n",
" <td>0</td>\n",
" <td>benzin</td>\n",
" <td>volkswagen</td>\n",
" <td>NaN</td>\n",
" <td>2016-03-24 00:00:00</td>\n",
" <td>0</td>\n",
" <td>70435</td>\n",
" <td>2016-04-07 03:16:57</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2016-03-24 10:58:45</td>\n",
" <td>A5_Sportback_2.7_Tdi</td>\n",
" <td>privat</td>\n",
" <td>Angebot</td>\n",
" <td>18300</td>\n",
" <td>test</td>\n",
" <td>coupe</td>\n",
" <td>2011</td>\n",
" <td>manuell</td>\n",
" <td>190</td>\n",
" <td>NaN</td>\n",
" <td>125000</td>\n",
" <td>5</td>\n",
" <td>diesel</td>\n",
" <td>audi</td>\n",
" <td>ja</td>\n",
" <td>2016-03-24 00:00:00</td>\n",
" <td>0</td>\n",
" <td>66954</td>\n",
" <td>2016-04-07 01:46:50</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>2016-03-14 12:52:21</td>\n",
" <td>Jeep_Grand_Cherokee_\"Overland\"</td>\n",
" <td>privat</td>\n",
" <td>Angebot</td>\n",
" <td>9800</td>\n",
" <td>test</td>\n",
" <td>suv</td>\n",
" <td>2004</td>\n",
" <td>automatik</td>\n",
" <td>163</td>\n",
" <td>grand</td>\n",
" <td>125000</td>\n",
" <td>8</td>\n",
" <td>diesel</td>\n",
" <td>jeep</td>\n",
" <td>NaN</td>\n",
" <td>2016-03-14 00:00:00</td>\n",
" <td>0</td>\n",
" <td>90480</td>\n",
" <td>2016-04-05 12:47:46</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>2016-03-17 16:54:04</td>\n",
" <td>GOLF_4_1_4__3TÜRER</td>\n",
" <td>privat</td>\n",
" <td>Angebot</td>\n",
" <td>1500</td>\n",
" <td>test</td>\n",
" <td>kleinwagen</td>\n",
" <td>2001</td>\n",
" <td>manuell</td>\n",
" <td>75</td>\n",
" <td>golf</td>\n",
" <td>150000</td>\n",
" <td>6</td>\n",
" <td>benzin</td>\n",
" <td>volkswagen</td>\n",
" <td>nein</td>\n",
" <td>2016-03-17 00:00:00</td>\n",
" <td>0</td>\n",
" <td>91074</td>\n",
" <td>2016-03-17 17:40:17</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>2016-03-31 17:25:20</td>\n",
" <td>Skoda_Fabia_1.4_TDI_PD_Classic</td>\n",
" <td>privat</td>\n",
" <td>Angebot</td>\n",
" <td>3600</td>\n",
" <td>test</td>\n",
" <td>kleinwagen</td>\n",
" <td>2008</td>\n",
" <td>manuell</td>\n",
" <td>69</td>\n",
" <td>fabia</td>\n",
" <td>90000</td>\n",
" <td>7</td>\n",
" <td>diesel</td>\n",
" <td>skoda</td>\n",
" <td>nein</td>\n",
" <td>2016-03-31 00:00:00</td>\n",
" <td>0</td>\n",
" <td>60437</td>\n",
" <td>2016-04-06 10:17:21</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" dateCrawled name seller offerType \\\n",
"0 2016-03-24 11:52:17 Golf_3_1.6 privat Angebot \n",
"1 2016-03-24 10:58:45 A5_Sportback_2.7_Tdi privat Angebot \n",
"2 2016-03-14 12:52:21 Jeep_Grand_Cherokee_\"Overland\" privat Angebot \n",
"3 2016-03-17 16:54:04 GOLF_4_1_4__3TÜRER privat Angebot \n",
"4 2016-03-31 17:25:20 Skoda_Fabia_1.4_TDI_PD_Classic privat Angebot \n",
"\n",
" price abtest vehicleType yearOfRegistration gearbox powerPS model \\\n",
"0 480 test NaN 1993 manuell 0 golf \n",
"1 18300 test coupe 2011 manuell 190 NaN \n",
"2 9800 test suv 2004 automatik 163 grand \n",
"3 1500 test kleinwagen 2001 manuell 75 golf \n",
"4 3600 test kleinwagen 2008 manuell 69 fabia \n",
"\n",
" kilometer monthOfRegistration fuelType brand notRepairedDamage \\\n",
"0 150000 0 benzin volkswagen NaN \n",
"1 125000 5 diesel audi ja \n",
"2 125000 8 diesel jeep NaN \n",
"3 150000 6 benzin volkswagen nein \n",
"4 90000 7 diesel skoda nein \n",
"\n",
" dateCreated nrOfPictures postalCode lastSeen \n",
"0 2016-03-24 00:00:00 0 70435 2016-04-07 03:16:57 \n",
"1 2016-03-24 00:00:00 0 66954 2016-04-07 01:46:50 \n",
"2 2016-03-14 00:00:00 0 90480 2016-04-05 12:47:46 \n",
"3 2016-03-17 00:00:00 0 91074 2016-03-17 17:40:17 \n",
"4 2016-03-31 00:00:00 0 60437 2016-04-06 10:17:21 "
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Reading using windows-1252 works\n",
"df_used = pd.read_csv(\"./datasets/used_cars_dataset.csv\", encoding=\"windows-1252\")\n",
"df_used.head()"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "a332b6a5",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'dateCrawled': ['str'],\n",
" 'name': ['str'],\n",
" 'seller': ['str'],\n",
" 'offerType': ['str'],\n",
" 'price': ['int64'],\n",
" 'abtest': ['str'],\n",
" 'vehicleType': ['str', 'nan'],\n",
" 'yearOfRegistration': ['int64'],\n",
" 'gearbox': ['str', 'nan'],\n",
" 'powerPS': ['int64'],\n",
" 'model': ['str', 'nan'],\n",
" 'kilometer': ['int64'],\n",
" 'monthOfRegistration': ['int64'],\n",
" 'fuelType': ['str', 'nan'],\n",
" 'brand': ['str'],\n",
" 'notRepairedDamage': ['str', 'nan'],\n",
" 'dateCreated': ['str'],\n",
" 'nrOfPictures': ['int64'],\n",
" 'postalCode': ['int64'],\n",
" 'lastSeen': ['str']}"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Here I check the types and the presence of missing values for each column\n",
"types = {}\n",
"\n",
"for col in df_used.columns:\n",
" t = set([type(x).__name__ if type(x) != float or not np.isnan(x) else 'nan' for x in df_used[col].unique()])\n",
" types[col] = list(t)\n",
"\n",
"types"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "11bfa9a2",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"dateCrawled: []\n",
"name: []\n",
"seller: []\n",
"offerType: []\n",
"price: []\n",
"abtest: []\n",
"vehicleType: []\n",
"yearOfRegistration: []\n",
"gearbox: []\n",
"powerPS: []\n",
"model: []\n",
"kilometer: []\n",
"monthOfRegistration: []\n",
"fuelType: []\n",
"brand: []\n",
"notRepairedDamage: []\n",
"dateCreated: []\n",
"nrOfPictures: []\n",
"postalCode: []\n",
"lastSeen: []\n"
]
}
],
"source": [
"# Here I check for numeric values that have decimal digits (i.e. that are not integers).\n",
"for col in df_used.columns:\n",
" print(f\"{col}: {str([x for x in df_used[col].unique() if type(x) == float and not np.isnan(x) and round(x) != x])}\")\n",
"\n",
"# As shown, there are none, therefore we can use the Int64 dtype in numeric columns"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "f1c539c4",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"dateCrawled: False\n",
"name: False\n",
"seller: False\n",
"offerType: False\n",
"price: False\n",
"abtest: False\n",
"vehicleType: False\n",
"yearOfRegistration: False\n",
"gearbox: False\n",
"powerPS: False\n",
"model: False\n",
"kilometer: False\n",
"monthOfRegistration: False\n",
"fuelType: False\n",
"brand: False\n",
"notRepairedDamage: False\n",
"dateCreated: False\n",
"nrOfPictures: False\n",
"postalCode: False\n",
"lastSeen: False\n"
]
}
],
"source": [
"# Here I check if any column is unique to find potential candidates for the index\n",
"for col in df_used.columns:\n",
" print(f\"{col}: {df_used[col].is_unique}\")\n",
"\n",
"# None are unique, so I use the default numeric index"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "86074e70",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>dateCrawled</th>\n",
" <th>name</th>\n",
" <th>seller</th>\n",
" <th>offerType</th>\n",
" <th>price</th>\n",
" <th>abtest</th>\n",
" <th>vehicleType</th>\n",
" <th>yearOfRegistration</th>\n",
" <th>gearbox</th>\n",
" <th>powerPS</th>\n",
" <th>model</th>\n",
" <th>kilometer</th>\n",
" <th>monthOfRegistration</th>\n",
" <th>fuelType</th>\n",
" <th>brand</th>\n",
" <th>notRepairedDamage</th>\n",
" <th>dateCreated</th>\n",
" <th>nrOfPictures</th>\n",
" <th>postalCode</th>\n",
" <th>lastSeen</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>2016-03-24 11:52:17</td>\n",
" <td>Golf_3_1.6</td>\n",
" <td>privat</td>\n",
" <td>Angebot</td>\n",
" <td>480</td>\n",
" <td>test</td>\n",
" <td>&lt;NA&gt;</td>\n",
" <td>1993</td>\n",
" <td>manuell</td>\n",
" <td>0</td>\n",
" <td>golf</td>\n",
" <td>150000</td>\n",
" <td>0</td>\n",
" <td>benzin</td>\n",
" <td>volkswagen</td>\n",
" <td>&lt;NA&gt;</td>\n",
" <td>2016-03-24 00:00:00</td>\n",
" <td>0</td>\n",
" <td>70435</td>\n",
" <td>2016-04-07 03:16:57</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2016-03-24 10:58:45</td>\n",
" <td>A5_Sportback_2.7_Tdi</td>\n",
" <td>privat</td>\n",
" <td>Angebot</td>\n",
" <td>18300</td>\n",
" <td>test</td>\n",
" <td>coupe</td>\n",
" <td>2011</td>\n",
" <td>manuell</td>\n",
" <td>190</td>\n",
" <td>&lt;NA&gt;</td>\n",
" <td>125000</td>\n",
" <td>5</td>\n",
" <td>diesel</td>\n",
" <td>audi</td>\n",
" <td>ja</td>\n",
" <td>2016-03-24 00:00:00</td>\n",
" <td>0</td>\n",
" <td>66954</td>\n",
" <td>2016-04-07 01:46:50</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>2016-03-14 12:52:21</td>\n",
" <td>Jeep_Grand_Cherokee_\"Overland\"</td>\n",
" <td>privat</td>\n",
" <td>Angebot</td>\n",
" <td>9800</td>\n",
" <td>test</td>\n",
" <td>suv</td>\n",
" <td>2004</td>\n",
" <td>automatik</td>\n",
" <td>163</td>\n",
" <td>grand</td>\n",
" <td>125000</td>\n",
" <td>8</td>\n",
" <td>diesel</td>\n",
" <td>jeep</td>\n",
" <td>&lt;NA&gt;</td>\n",
" <td>2016-03-14 00:00:00</td>\n",
" <td>0</td>\n",
" <td>90480</td>\n",
" <td>2016-04-05 12:47:46</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>2016-03-17 16:54:04</td>\n",
" <td>GOLF_4_1_4__3TÜRER</td>\n",
" <td>privat</td>\n",
" <td>Angebot</td>\n",
" <td>1500</td>\n",
" <td>test</td>\n",
" <td>kleinwagen</td>\n",
" <td>2001</td>\n",
" <td>manuell</td>\n",
" <td>75</td>\n",
" <td>golf</td>\n",
" <td>150000</td>\n",
" <td>6</td>\n",
" <td>benzin</td>\n",
" <td>volkswagen</td>\n",
" <td>nein</td>\n",
" <td>2016-03-17 00:00:00</td>\n",
" <td>0</td>\n",
" <td>91074</td>\n",
" <td>2016-03-17 17:40:17</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>2016-03-31 17:25:20</td>\n",
" <td>Skoda_Fabia_1.4_TDI_PD_Classic</td>\n",
" <td>privat</td>\n",
" <td>Angebot</td>\n",
" <td>3600</td>\n",
" <td>test</td>\n",
" <td>kleinwagen</td>\n",
" <td>2008</td>\n",
" <td>manuell</td>\n",
" <td>69</td>\n",
" <td>fabia</td>\n",
" <td>90000</td>\n",
" <td>7</td>\n",
" <td>diesel</td>\n",
" <td>skoda</td>\n",
" <td>nein</td>\n",
" <td>2016-03-31 00:00:00</td>\n",
" <td>0</td>\n",
" <td>60437</td>\n",
" <td>2016-04-06 10:17:21</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" dateCrawled name seller offerType \\\n",
"0 2016-03-24 11:52:17 Golf_3_1.6 privat Angebot \n",
"1 2016-03-24 10:58:45 A5_Sportback_2.7_Tdi privat Angebot \n",
"2 2016-03-14 12:52:21 Jeep_Grand_Cherokee_\"Overland\" privat Angebot \n",
"3 2016-03-17 16:54:04 GOLF_4_1_4__3TÜRER privat Angebot \n",
"4 2016-03-31 17:25:20 Skoda_Fabia_1.4_TDI_PD_Classic privat Angebot \n",
"\n",
" price abtest vehicleType yearOfRegistration gearbox powerPS model \\\n",
"0 480 test <NA> 1993 manuell 0 golf \n",
"1 18300 test coupe 2011 manuell 190 <NA> \n",
"2 9800 test suv 2004 automatik 163 grand \n",
"3 1500 test kleinwagen 2001 manuell 75 golf \n",
"4 3600 test kleinwagen 2008 manuell 69 fabia \n",
"\n",
" kilometer monthOfRegistration fuelType brand notRepairedDamage \\\n",
"0 150000 0 benzin volkswagen <NA> \n",
"1 125000 5 diesel audi ja \n",
"2 125000 8 diesel jeep <NA> \n",
"3 150000 6 benzin volkswagen nein \n",
"4 90000 7 diesel skoda nein \n",
"\n",
" dateCreated nrOfPictures postalCode lastSeen \n",
"0 2016-03-24 00:00:00 0 70435 2016-04-07 03:16:57 \n",
"1 2016-03-24 00:00:00 0 66954 2016-04-07 01:46:50 \n",
"2 2016-03-14 00:00:00 0 90480 2016-04-05 12:47:46 \n",
"3 2016-03-17 00:00:00 0 91074 2016-03-17 17:40:17 \n",
"4 2016-03-31 00:00:00 0 60437 2016-04-06 10:17:21 "
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# I read the dataset again, this time using the column types found above\n",
"df_used = pd.read_csv(\"./datasets/used_cars_dataset.csv\", encoding=\"windows-1252\", dtype={\n",
" 'dateCrawled': str,\n",
" 'name': pd.StringDtype(),\n",
" 'seller': pd.StringDtype(),\n",
" 'offerType': pd.StringDtype(),\n",
" 'price': pd.Int64Dtype(),\n",
" 'abtest': pd.StringDtype(),\n",
" 'vehicleType': pd.StringDtype(),\n",
" 'yearOfRegistration': pd.Int64Dtype(),\n",
" 'gearbox': pd.StringDtype(),\n",
" 'powerPS': pd.Int64Dtype(),\n",
" 'model': pd.StringDtype(),\n",
" 'kilometer': pd.Int64Dtype(),\n",
" 'monthOfRegistration': pd.Int64Dtype(),\n",
" 'fuelType': pd.StringDtype(),\n",
" 'brand': pd.StringDtype(),\n",
" 'notRepairedDamage': pd.StringDtype(),\n",
" 'dateCreated': pd.StringDtype(),\n",
" 'nrOfPictures': pd.Int64Dtype(),\n",
" 'postalCode': pd.Int64Dtype(),\n",
" 'lastSeen': pd.StringDtype()\n",
"})\n",
"df_used.head()"
]
},
{
"cell_type": "markdown",
"id": "6a3f2455",
"metadata": {},
"source": [
"From here onwards, I investigate the missing and invalid values. If I find any invalid values, I replace them with `<NA>` so that they are encoded as missing values. This makes it easy to count and drop them all in one go."
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "8b6f9ce3",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"vehicleType: [<NA>, 'andere', 'bus', 'cabrio', 'coupe', 'kleinwagen', 'kombi', 'limousine', 'suv']\n",
"brand: ['BMW', 'alfa_romeo', 'audi', 'bmw', 'bmw ', 'chevrolet', 'chrysler', 'citroen', 'dacia', 'daewoo', 'daihatsu', 'fiat', 'ford', 'honda', 'hyundai', 'jaguar', 'jeep', 'kia', 'lada', 'lancia', 'land_rover', 'mazda', 'mercedes_benz', 'mini', 'mitsubishi', 'nissan', 'opel', 'peugeot', 'porsche', 'renault', 'rover', 'saab', 'seat', 'skoda', 'smart', 'sonstige_autos', 'subaru', 'suzuki', 'toyota', 'trabant', 'volkswagen', 'volvo']\n",
"monthOfRegistration: [0, 1, 10, 11, 12, 2, 3, 4, 5, 6, 7, 8, 9]\n"
]
}
],
"source": [
"# I look at the values of the indicated columns to find odd values. \n",
"# Indeed, some brand values use mixed case (BMW) and spaces ('bmw '). Additionally, \n",
"# a month of registration = 0 does not make sense when the other values are in the\n",
"# 1-12 range.\n",
"cols = [\"vehicleType\", \"brand\", \"monthOfRegistration\"]\n",
"\n",
"def print_col(col: str):\n",
" print(f\"{col}: {str(sorted(df_used[col].unique(), key=lambda x: str(x)))}\")\n",
"\n",
"for col in cols:\n",
" print_col(col)\n"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "98f8d101",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"brand: ['alfa_romeo', 'audi', 'bmw', 'chevrolet', 'chrysler', 'citroen', 'dacia', 'daewoo', 'daihatsu', 'fiat', 'ford', 'honda', 'hyundai', 'jaguar', 'jeep', 'kia', 'lada', 'lancia', 'land_rover', 'mazda', 'mercedes_benz', 'mini', 'mitsubishi', 'nissan', 'opel', 'peugeot', 'porsche', 'renault', 'rover', 'saab', 'seat', 'skoda', 'smart', 'sonstige_autos', 'subaru', 'suzuki', 'toyota', 'trabant', 'volkswagen', 'volvo']\n",
"monthOfRegistration: [1, 10, 11, 12, 2, 3, 4, 5, 6, 7, 8, 9, <NA>]\n"
]
}
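],
"source": [
"# Some brands are written using mixed case or with spaces, hence here I normalize to stripped lowercase\n",
"df_used.brand = df_used.brand.apply(lambda x: x if type(x) is not str else x.lower().strip())\n",
"print_col(\"brand\")\n",
"\n",
"# monthOfRegistration=0 is invalid, hence I mark it as missing.\n",
"# Note: this assignment blanks every column of the matching rows, not only monthOfRegistration,\n",
"# which is why each column reports at least 37675 missing values in the count further below.\n",
"df_used[df_used.monthOfRegistration == 0] = np.nan\n",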
"print_col(\"monthOfRegistration\")"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "f300f49d",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"notRepairedDamage: [<NA>, 'ja', 'nein']\n"
]
}
],
"source": [
"# This column only has 'ja' and 'nein' as non-missing values, we can convert it to a boolean\n",
"print_col(\"notRepairedDamage\")"
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "923c5354",
"metadata": {},
"outputs": [],
"source": [
"# Hence we map the column to boolean values\n",
"df_used.notRepairedDamage = df_used.notRepairedDamage.map({'ja': True, 'nein': False})"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "4b847b1f",
"metadata": {},
"outputs": [],
"source": [
"# Prices not in the 1000-100'000 range are invalid, hence I convert them to NaN\n",
"df_used.loc[(df_used.price.isna()) | (df_used.price < 1000) | (df_used.price > 100_000), \"price\"] = np.nan"
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "bf1f417d",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"dateCrawled 37675\n",
"name 37675\n",
"seller 37675\n",
"offerType 37675\n",
"price 101662\n",
"abtest 37675\n",
"vehicleType 60491\n",
"yearOfRegistration 37675\n",
"gearbox 47998\n",
"powerPS 37675\n",
"model 51550\n",
"kilometer 37675\n",
"monthOfRegistration 37675\n",
"fuelType 57286\n",
"brand 37675\n",
"notRepairedDamage 87440\n",
"dateCreated 37675\n",
"nrOfPictures 37675\n",
"postalCode 37675\n",
"lastSeen 37675\n",
"dtype: int64"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# This reports the number of values in each column that are missing or invalid\n",
"df_used.isna().sum()"
]
},
{
"cell_type": "code",
"execution_count": 14,
"id": "919e692f",
"metadata": {},
"outputs": [],
"source": [
"# Here I drop the rows with missing values and re-enumerate the remaining rows with the automatic numeric index\n",
"df_used = df_used.dropna().reset_index(drop=True)"
]
},
{
"cell_type": "markdown",
"id": "47a3929f",
"metadata": {},
"source": [
"## Exercise 2 - Data analysis (20 points) 📊\n",
"\n",
"1. We consider the norm to be that, for a given type of vehicle, on average the price of diesel is greater than the one of benzine. Provide a representation of the data which shows if, and to which extent, the various vehicle types conform to the norm.\n",
"What relationship are you showing? Please justify the choice of the representation and your answer.\n",
"2. Find an appropriate way to show and compare the range of prices for the following `brand`: **mercedes_benz**, **fiat**, **volvo**, **alfa_romeo** and **lancia**. Create a suitable graphical representation of this data. What kind of relationship are you showing? Describe what can be understood from the plot. Please justify your answer and your choice of the graphical representation.\n",
"\n",
"<aside>\n",
"💡 N.B. In this section you should work on the clean Used Cars dataset, without the missing and invalid data.\n",
"</aside>"
]
},
{
"cell_type": "markdown",
"id": "e2ae928d",
"metadata": {},
"source": [
"### 2.1\n",
"\n",
"By interpreting the following requirement:\n",
"\n",
"> on average the price of diesel is greater than the one of benzine\n",
"\n",
"as meaning that we expect the average price of diesel cars to be greater than the average price of _benzin_ cars for each car type, I choose to represent the relationship between each car type and the difference of these average values (i.e. $y=E({\\text{diesel}}) - E({\\text{benzin}})$, where a positive value of $y$ would confirm the expectation).\n",
"\n",
"To represent this relationship I choose to use a simple bar chart plotting these differences. I choose to plot a single series for the difference instead of both series for both fuel types to further focus the reader on the difference and not the values. Additionally, plotting the difference only makes comparing the difference value between car types easier as they are all aligned with the origin."
]
},
{
"cell_type": "code",
"execution_count": 15,
"id": "7cc5c90f",
"metadata": {},
"outputs": [
{
"data": {
"image/png": "(base64-encoded PNG data omitted)",
"text/plain": [
"<Figure size 900x600 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"df_diff = df_used \\\n",
" .loc[(df_used.fuelType == 'benzin') | (df_used.fuelType == 'diesel'), ['vehicleType', 'fuelType', 'price']]\\\n",
" .groupby(['vehicleType', 'fuelType']) \\\n",
" .mean() \\\n",
" .sort_values(['vehicleType', 'fuelType'], ascending=[True, True]) \\\n",
" .groupby('vehicleType') \\\n",
" .diff() \\\n",
" .reset_index() \\\n",
" .set_index('vehicleType')\n",
"\n",
"df_diff = df_diff.loc[df_diff.fuelType == 'diesel', ['price']].rename({'price': 'diffPrice'}, axis=1).reset_index()\n",
"\n",
"sns.set_theme(palette=\"hls\")\n",
"\n",
"# Initialize the matplotlib figure\n",
"f, ax = plt.subplots(figsize=(9, 6))\n",
"\n",
"# Plot the diesel - benzin average price difference for each vehicle type\n",
"sns.set_color_codes(\"pastel\")\n",
"sns.barplot(x=\"vehicleType\", y=\"diffPrice\", data=df_diff,\n",
" label=\"avg(diesel) - avg(benzin)\", color=sns.xkcd_rgb[\"ultramarine\"])\n",
"\n",
"# Add a legend and informative axis label\n",
"ax.legend(ncol=2, loc=\"lower right\", frameon=True)\n",
"ax.set(ylabel=\"Diesel - benzin difference\", ylim=[-1000, 6000], \n",
" xlabel=\"Vehicle type\")\n",
"sns.despine(left=True, bottom=True)"
]
},
{
"cell_type": "markdown",
"id": "7cc33371",
"metadata": {},
"source": [
"### 2.2\n",
2023-03-21 17:21:11 +00:00
"\n",
"To compare the range of prices between car brands, I choose to plot the distribution of car prices for each car brand. To achieve this, I choose to use a variant of the box plot called boxen plot, which ditches whiskers in favour of showing octiles, 16-tiles and so on with coloured rectangles similar to the inner quartiles with exponentially smaller heights.\n",
"\n",
"From the plot we can see that the `mercedes_benz` car type has the highest median price, and it also has the most right skewed price distribution out of all car brands. `volvo` has the second-highest average and also a skewed price distribution. Both `lancia` and `fiat` are instead more uniformly distributed towards lower prices, while `alfa_romeo` has a similar distribution however with some skewing towards the expensive side. [`trabant`](https://www.youtube.com/watch?v=npMKIUTa3uI) is the cheapest car type.\n",
"\n",
"I choose to use a box-plot style graph as it is an effective representation to show some salient characterististics for one-dimensional distributions, such as the median and the quartiles (25% percentile, 75% percentile). I choose a `boxenplot` in particular to better capture the right-skewedness of some distributions with the additional percentiles considered by the octiles (87.5%), 16-tiles (93.75%) and so on exponentially."
]
},
{
"cell_type": "code",
"execution_count": 16,
"id": "ca97e7c8",
"metadata": {},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAABh8AAAK5CAYAAACvwT+gAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjcuMSwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/bCgiHAAAACXBIWXMAAA9hAAAPYQGoP6dpAACSJElEQVR4nOzde5zc870/8PfsZTaCJlREfm5ZSxRV4tY6kiL0lJQoqhe3U1F16VY1SkNKhKqUqkZUXNL0Qh3ELXHr1UE4PZRSl5YQiaAaqUpckuzszszvj2022ea2m3x3vjO7z+fj4WHznZnv9/19z+zs7vc1n88nUywWiwEAAAAAAJCQqrQLAAAAAAAAuhfhAwAAAAAAkCjhAwAAAAAAkCjhAwAAAAAAkCjhAwAAAAAAkCjhAwAAAAAAkCjhAwAAAAAAkCjhAwAAAAAAkKiatAugeygWi/HPf34QhUIx7VJSU1WViY03Xr/H9yFCL5bSh2X0opU+tNKHZfSilT4soxet9GEZvWilD630YRm9aKUPy+hFK31YRi9a6UMrfVimqioTH/7wBqU5VkmOQreXyWSiqiqTdhmpqqrK6MO/6EUrfVhGL1rpQyt9WEYvWunDMnrRSh+W0YtW+tBKH5bRi1b6sIxetNKHZfSilT600odlStkD4QMAAAAAAJAo4QMAAAAAAJAo4QMAAAAAAJAo4QMAAAAAAJAo4QMAAAAAAJAo4QMAAAAAAJAo4QMAAAAAAJComrQLgI4qFouRy+XSLmOV8vmqWLKkOpqamqKlpVCy42az2chkMiU7HgAAAADAmggfqAjFYjEmTrw85sx5Je1Syk59fUM0No4SQAAAAAAAZcO0S1SEXC5XkuChUCjEm2++GW+++WYUCqUbvbAuZs+eVdYjQgAAAACAnsfIByrOSX16R20Xfci/pVCI376bjYiIT/XtHTVV5ZvPNRcjrl+4KO0yAAAAAABWIHyg4tRmImq7aIqhTCYT1Zmlx8lETVlPZVRMuwAAAAAAgJUq3491AwAAAAAAFUn4AAAAAAAAJEr4AAAAAAAAJEr4AAAAAAAAJEr4AAAAAAAAJEr4AAAAAAAAJEr4AAAAAAAAJEr4AAAAAAAAJEr4AAAAAAAAJEr4AAAAAAAAJEr4AAAAAAAAJEr4AAAAAAAAJEr4AAAAAAAAJEr4AAAAAAAAJEr4AAAAAAAAJKom7QLg3xWLxcjlcu225XJNKVVTGcqtP/l8VSxZUh1NTU3R0lJod1s2m41MJpNSZQAAAABAKQgfKCvFYjEmTrw85sx5Je1SKsrYsaPTLqHD6usborFxlAACAAAAALox0y5RVnK5nOChm5s9e9YKI1sAAAAAgO7FyAfK1tfqt4zaqtZPxzcXCvHj2a+nXFH5+lr9FlFbVd5ZYnOhGD+e/VraZQAAAAAAJSB8oGzVVmUiW+YX1MtFbVVVBfSqsOa7AAAAAADdQrlfrQQAAAAAACqM8AEAAAAAAEiU8AEAAAAAAEiU8AEAAAAAAEiU8AEAAAAAAEhUTdoFQEREsViMXC4XuVxT2qVQApX6PGez2chkMmmXAQAAAABlT/hA6orFYkyceHnMmfNK2qVQImPHjk67hLVSX98QjY2jBBAAAAAAsAamXSJ1uVxO8EBFmD17VuRyubTLAAAAAICyZ+QDZeX0wR+LK596Ju0y6GKj9tg1stXVaZfRYc35Qlz+xFNplwEAAAAAFUP4QFmpqTYYpyfIVldXVPgAAAAAAHSO8IFUtS40XZmLD9Mzdeb1ms9XxZIl1dHU1BQtLYUurGr1LJQNAAAAQKkJH0iNhaapRJW4WLaFsgGoRPl8Pu0SykZHe5HP56N6LUeXrstjy1l3PS8qg9cfAD2dOW5IjYWmoTQslA1ApXnppZlx6qmnxssvv5R2KanraC9efnlmnHfeWTFrVud7ti6PLWfd9byoDF5/ANDNRj5MnDgx7rzzznjggQciIuLZZ5+Ns88+O1
577bU47rjj4tvf/nbKFbK8YrGYdgnQaed8+sDI1lTGW2cun49LfvXb1q9Tnt6sXKagWhVTUwGUj3w+H7fccmMsXrw4br75xjjrrO/02E8Od7QX+Xw+pk69KZYsWRJTp97UqZ6ty2PLWXc9LyqD1x8AtKqMK2hr6dprr43a2tq47777YsMNN0y7HJZTKBRi0qQfpV0GdFq2pqZiwoflVeJ0UaW09db1cfLJX+/yAKLcQ5hSKYc+CJygfM2Y8WC89dZbERHx1lvz4pFHHox99z0g1ZrSMmNGx3oxY8aDMX/+2vVsXR5bzmbM6J7nRWWYMcPrDwAiunn4sHDhwthhhx1iq622SrsUllMsFuPKK38Qr702N+1SACIi4tVXZ8e5545KuwxKqFSB05qUQxBTDvRhmXLtRakCu4ULF8T9909vt+2+++6OXXfdPfr06dvlxy8nHe3FuvSsu/a7u54XlcHrDwCWqbjwYebMmXH55ZfHn/70p1i8eHH0798/jjnmmBg5cmS7+w0bNizeeOONiIi466674ve//31suOGGcdlll8VDDz0U//znP+NDH/pQHHDAATFmzJhYb731OnT84447LgYOHBgvvPBCzJ49O84///wYMWJE3HXXXTFlypSYM2dObLLJJvG5z30uTj755Kiuro7XX389DjjggPjhD38Y119/fcyaNSu22267uOyyy+JXv/pV/PKXv4yWlpb4zGc+E+eff37bH3b/8z//ExMnToyXX345+vfvH5/5zGfitNNOi2w2GxERCxYsiAkTJsQDDzwQ77zzTuy4447xzW9+Mz7+8Y8n2PHkNTU1xdy5c9IuA3qUC47+YmRrK+4tv8vlWlrigl/enHYZpEDgBJ1TX98QjY2jujyAmDbt9hUWV87nW2L69DviuONGruJR3VNHe7EuPeuu/e6u50Vl8PoDgGUq6krU4sWLY+TIkbHPPvvEzTffHNXV1TF16tT4/ve/H3vvvXe7+952221x2mmnxWabbRZjxoyJjTfeOBobG2PevHlx1VVXxYc//OH405/+FOeee25su+228eUvf7nDdUydOjUuu+yy2H777aNfv37xs5/9LC6//PIYPXp07LPPPvHnP/85LrzwwnjnnXdizJgxbY+74oor4nvf+1586EMfisbGxvjSl74U++67b9xwww3x+OOPxwUXXBBDhw6NYcOGxcMPPxxnnHFGnHPOOfEf//EfMXfu3Ljoooti9uzZMWHChMjn8zFy5Mhobm6Oyy67LDbeeOP4xS9+ESeeeGLcdNNN8bGPfSyptieuudnCt1Bq2dqaqKutTbuMsjbu1JMjq0fdXlNzc1ww6dq0y4CKM3v2rMjlclFXV9dlx3jppRfj6aefXGF7oVCIp556Ivbee0hsu+2gLjt+OeloL9alZ9213931vKgMXn8A0F7FhQ/HH398HHPMMbH++utHRMTpp58ekydPjhdffLHdfTfeeOOora2NXr16Rb9+/SIiYp999ok999wztt9++4iI2GKLLeLGG2+MmTNndqqOHXbYIQ499NCIaJ1C6Prrr49jjz02jjnmmIiIGDhwYCxYsCAuu+yyOP3009seN3LkyNhrr70iIuJTn/pU3HDDDXHhhRfGeuutFw0NDTFx4sR46aWXYtiwYXHNNdfE5z//+fjiF78YERFbbbVVjBs3Lv7rv/4rXn/99Zg1a1Y8//zzcffdd8egQa2/vIwbNy6effbZ+MlPfhITJkzo1DmVUm1tNu0SoMfJNbekXUJZyrUs68tYF6QBVqm+vqFt9G1XeeKJxyKTyUSxWFzhtkwmE3/84//1mIt2He3FuvSsu/a7u54XlcHrDwDaq6jwYeONN46jjz467rnnnvjLX/4Sc+fOjRdeeCEiWj9JsCZHH310PPDAA3HnnXfGnDlz4uWXX47XX389ttlmm07VsfXWW7d9/c9//jP+8Y9/xO67797uPnvttVc0NzfHK6+8Eh/+8IdXeFzv3r1jk002aT
fdU69evSKXax0V8Je//CWeeeaZuO2229puX/oLzKxZs2LmzJmx4YYbtgUPEa2/zOyxxx7xyCOPdOp8Sq2uri622mq
"text/plain": [
"<Figure size 1800x800 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"brands = ['mercedes_benz', 'fiat', 'volvo', 'alfa_romeo', 'lancia', 'trabant']\n",
"\n",
"df_price = df_used \\\n",
" .loc[df_used.brand.isin(brands), ['brand', 'price']] \\\n",
" .sort_values('brand', ascending=True)\n",
"\n",
"sns.set_theme(palette=\"hls\")\n",
"\n",
"# Initialize the matplotlib figure\n",
"f, ax = plt.subplots(figsize=(18, 8))\n",
"\n",
"mkfunc = lambda x, pos: '%1.0fk' % (x * 1e-3)\n",
"mkformatter = mpl.ticker.FuncFormatter(mkfunc)\n",
"ax.xaxis.set_major_formatter(mkformatter)\n",
"\n",
"# Draw a nested boxplot to show bills by day and time\n",
"sns.boxenplot(y=\"brand\", x=\"price\", data=df_price)\n",
"\n",
"ax.set(ylabel=\"\", xlim=[0, 100000], xticks=range(0, 105001, 5000),\n",
" xlabel=\"Distribution of prices per vehicle type and fuel type\")\n",
" \n",
"sns.despine(offset=10, trim=True)"
]
},
{
"attachments": {
"Banks%20-%20market%20cap.png": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAABgwAAAQoCAYAAADWsvWXAAAMP2lDQ1BJQ0MgUHJvZmlsZQAASImVVwdYU8kWnluSkEBoAQSkhN4E6QSQEkILvSPYCEmAUEIMBBU7uqjg2sUCNnRVRMEKiB2xswg27IsFFWVdLNiVNymg677yvfm+ufPff87858y5M/feAUDtJEckykXVAcgTForjQgLoY1NS6aSnAAUYoAIHADjcAhEzJiYCwDLU/r28uwEQaXvVXqr1z/7/WjR4/AIuAEgMxOm8Am4exAcBwKu4InEhAEQpbzalUCTFsAItMQwQ4oVSnCnHVVKcLsd7ZTYJcSyIWwFQUuFwxJkAqHZAnl7EzYQaqv0QOwp5AiEAanSIffPy8nkQp0FsDW1EEEv1Gek/6GT+TTN9WJPDyRzG8rnIilKgoECUy5n2f6bjf5e8XMmQD0tYVbLEoXHSOcO83czJD5diFYj7hOlR0RBrQvxBwJPZQ4xSsiShiXJ71IBbwII5AzoQO/I4geEQG0AcLMyNilDw6RmCYDbEcIWgUwWF7ASIdSFeyC8IilfYbBbnxyl8oQ0ZYhZTwZ/niGV+pb7uS3ISmQr911l8tkIfUy3OSkiGmAKxeZEgKQpiVYgdCnLiwxU2Y4qzWFFDNmJJnDR+c4jj+MKQALk+VpQhDo5T2JflFQzNF9ucJWBHKfD+wqyEUHl+sFYuRxY/nAvWwRcyE4d0+AVjI4bmwuMHBsnnjj3jCxPjFTofRIUBcfKxOEWUG6Owx035uSFS3hRi14KieMVYPKkQLki5Pp4hKoxJkMeJF2dzwmLk8eDLQARggUBABxJY00E+yAaC9r7GPngn7wkGHCAGmYAP7BXM0IhkWY8QXuNBMfgTIj4oGB4XIOvlgyLIfx1m5Vd7kCHrLZKNyAFPIM4D4SAX3ktko4TD3pLAY8gI/uGdAysXxpsLq7T/3/ND7HeGCZkIBSMZ8khXG7IkBhEDiaHEYKINro/74t54BLz6w+qMM3DPoXl8tyc8IXQSHhKuE7oJtyYJSsQ/RRkJuqF+sCIX6T/mAreEmm54AO4D1aEyroPrA3vcFfph4n7QsxtkWYq4pVmh/6T9txn88DQUdmRHMkoeQfYnW/88UtVW1W1YRZrrH/MjjzV9ON+s4Z6f/bN+yD4PtuE/W2ILsQPYOewUdgE7ijUCOnYCa8LasGNSPLy6HstW15C3OFk8OVBH8A9/Q09WmskCx1rHXscv8r5C/lTpOxqw8kXTxILMrEI6E34R+HS2kOswiu7s6OwCgPT7In99vYmVfTcQnbbv3Lw/APA5MTg4eOQ7F3YCgH0ecPsf/s5ZM+CnQxmA84e5EnGRnMOlFwJ8S6jBnaYHjIAZsIbzcQbuwBv4gyAQBqJBAkgBE2H0WXCdi8EUMAPMBaWgHCwDq8F6sAlsBTvBHrAfNIKj4BQ4Cy6BDnAd3IGrpwe8AP3gHfiMIAgJoSI0RA8xRiwQO8QZYSC+SBASgcQhKUgakokIEQkyA5mHlCMrkPXIFqQG2YccRk4hF5BO5BbyAOlFXiOfUAxVQbVQQ9QSHY0yUCYajiagE9BMdDJajM5Hl6Br0Wp0N9qAnkIvodfRbvQFOoABTBnTwUwwe4yBsbBoLBXLwMTYLKwMq8CqsTqsGT7nq1g31od9xIk4Dafj9nAFh+KJOBefjM/CF+Pr8Z14A96KX8Uf4P34NwKVYECwI3gR2ISxhEzCFEIpoYKwnXCIcAbupR7COyKRqEO0InrAvZhCzCZOJy4mbiDWE08SO4mPiAMkEkmPZEfyIUWTOKRCUilpHWk36QTpCqmH9EFJWclYyVkpWClVSahUolShtEvpuNIVpadKn8nqZAuyFzmazCNPIy8lbyM3ky+Te8ifKRoUK4oPJYGSTZlLWUupo5yh3KW8UVZWNlX2VI5VFijPUV6rvFf5vPID5Y8qmiq2KiyV8SoSlSUqO1ROqtxSeUOlUi2p/t
RUaiF1CbWGepp6n/pBlabqoMpW5anOVq1UbVC9ovpSjaxmocZUm6hWrFahdkDtslqfOlndUp2lzlGfpV6pfli9S31Ag6bhpBGtkaexWGOXxgWNZ5okTUvNIE2e5nzNrZqnNR/RMJoZjUXj0ubRttHO0Hq0iFpWWmytbK1yrT1a7Vr92prartpJ2lO1K7WPaXfrYDqWOmydXJ2lOvt1buh8GmE4gjmCP2LRiLoRV0a81x2p66/L1y3Trde9rvtJj64XpJejt1yvUe+ePq5vqx+rP0V/o/4Z/b6RWiO9R3JHlo3cP/K2AWpgaxBnMN1gq0GbwYChkWGIochwneFpwz4jHSN/o2yjVUbHjXqNaca+xgLjVcYnjJ/TtelMei59Lb2V3m9iYBJqIjHZYtJu8tnUyjTRtMS03vSeGcWMYZZhtsqsxazf3Ng80nyGea35bQuyBcMiy2KNxTmL95ZWlsmWCywbLZ9Z6VqxrYqtaq3uWlOt/awnW1dbX7Mh2jBscmw22HTYorZutlm2lbaX7VA7dzuB3Qa7zlGEUZ6jhKOqR3XZq9gz7Yvsa+0fOOg4RDiUODQ6vBxtPjp19PLR50Z/c3RzzHXc5njHSdMpzKnEqdnptbOtM9e50vmaC9Ul2GW2S5PLK1c7V77rRtebbjS3SLcFbi1uX9093MXude69HuYeaR5VHl0MLUYMYzHjvCfBM8BztudRz49e7l6FXvu9/vK2987x3uX9bIzVGP6YbWMe+Zj6cHy2+HT70n3TfDf7dvuZ+HH8qv0e+pv58/y3+z9l2jCzmbuZLwMcA8QBhwLes7xYM1knA7HAkMCywPYgzaDEoPVB94NNgzODa4P7Q9xCpoecDCWEhocuD+1iG7K57Bp2f5hH2Myw1nCV8Pjw9eEPI2wjxBHNkWhkWOTKyLtRFlHCqMZoEM2OXhl9L8YqZnLMkVhibExsZeyTOKe4GXHn4mnxk+J3xb9LCEhYmnAn0TpRktiSpJY0Pqkm6X1yYPKK5O6xo8fOHHspRT9FkNKUSkpNSt2eOjAuaNzqcT3j3caXjr8xwWrC1AkXJupPzJ14bJLaJM6kA2mEtOS0XWlfONGcas5AOju9Kr2fy+Ku4b7g+fNW8Xr5PvwV/KcZPhkrMp5l+mSuzOzN8suqyOoTsATrBa+yQ7M3Zb/Pic7ZkTOYm5xbn6eUl5Z3WKgpzBG25hvlT83vFNmJSkXdk70mr57cLw4Xby9ACiYUNBVqwR/5Nom15BfJgyLfosqiD1OSphyYqjFVOLVtmu20RdOeFgcX/zYdn86d3jLDZMbcGQ9mMmdumYXMSp/VMtts9vzZPXNC5uycS5mbM/f3EseSFSVv5yXPa55vOH/O/Ee/hPxSW6paKi7tWuC9YNNCfKFgYfsil0XrFn0r45VdLHcsryj/spi7+OKvTr+u/XVwScaS9qXuSzcuIy4TLrux3G/5zhUaK4pXPFoZubJhFX1V2aq3qyetvlDhWrFpDWWNZE332oi1TevM1y1b92V91vrrlQGV9VUGVYuq3m/gbbiy0X9j3SbDTeWbPm0WbL65JWRLQ7VldcVW4tairU+2JW079xvjt5rt+tvLt3/dIdzRvTNuZ2uNR03NLoNdS2vRWklt7+7xuzv2BO5pqrOv21KvU1++F+yV7H2+L23fjf3h+1sOMA7UHbQ4WHWIdqisAWmY1tDfmNXY3ZTS1Hk47HBLs3fzoSMOR3YcNTlaeUz72NLjlOPzjw+eKD4xcFJ0su9U5qlHLZNa7pwee/paa2xr+5nwM+fPBp89fY557sR5n/NHL3hdOHyRcbHxkvulhja3tkO/u/1+qN29veGyx+WmDs+O5s4xncev+F05dTXw6tlr7GuXrkdd77yReONm1/iu7pu8m89u5d56dbvo9uc7c+4S7pbdU79Xcd/gfvUfNn/Ud7t3H3sQ+KDtYfzDO4+4j148Lnj8pWf+E+qTiqfGT2ueOT872hvc2/F83POeF6IXn/tK/9
T4s+ql9cuDf/n/1dY/tr/nlfjV4OvFb/Te7Hjr+rZlIGbg/ru8d5/fl33Q+7DzI+PjuU/Jn55+nvKF9GXtV5uvzd/C
}
},
"cell_type": "markdown",
"id": "f4e84bcf",
"metadata": {},
"source": [
"## Exercise 3 - Data analysis (20 points) 📊\n",
"\n",
"The following graph represents the financial meltdown's impact on banks since the 2008 financial crisis began, and compares the market value of each bank as of 2007 - in blue - and 2009 - in green. The **main** purpose of the graph is to show the loss of each bank after the financial crisis and to enlight the little decline pre-versus-post meltdown of J.P. Morgan; the **secondary** purpose is to provide a sense of the relative sizes of the banks in terms of market value (e.g., J.P. Morgan is not a small bank).\n",
"Is there a better solution to achieve these two goals? How would you compare both the remaining market value of each bank after the loss caused by the crisis and their decline?\n",
"\n",
"List all the problems that you detect in the design of this graph with respect to the quantive message the graph is supposed to deliver.\n",
"\n",
"Propose and implement a different graph that delivers effectively the message.\n",
"\n",
"Use the data in the *market_value_decline* dataset to populate the new graph.\n",
"\n",
"![Banks%20-%20market%20cap.png](attachment:Banks%20-%20market%20cap.png)"
]
},
{
"cell_type": "markdown",
"id": "139bb76f",
"metadata": {},
"source": [
"<!-- List all the problems that you detect in the design of this graph with respect to the quantive message the graph is supposed to deliver. -->\n",
"The given graph is not suited to show quantitative values because it relies on the areas of two-dimensional objects to convey the magnitude of the values it shows. Humans are not suited to understand at a glance the difference between the areas of two objects, instead being capable to grasp much one-dimensional differences, like length. This is why the use of a bar chart would have been more suited to show these quantitative values.\n",
"\n",
"The given graph does not provide a convenient way to show the market decline of each bank. As we are required to mentally compute the difference of market value before and after the stock market crash, it is hard to compare the banks to see which one has lost the least or the most value. \n",
"\n",
"<!-- How would you compare both the remaining market value of each bank after the loss caused by the crisis and their decline? -->\n",
"As we want to convey that JP Morgan has one of the lowest relative market value differences, I would plot directly this difference as another bar chart.\n",
"\n",
"<!-- Is there a better solution to achieve these two goals? -->\n",
"<!-- Propose and implement a different graph that delivers effectively the message. -->\n",
"We can implement a better graph with a table lens bar chart showing both the relative market value decrease for each bank and the pre- and post-market collapse market values. The left side shows the former message (i.e. fulfilling the main purpose), while the right side shows the latter message (i.e. fulfilling the secondary purpose)."
]
},
{
"cell_type": "code",
"execution_count": 17,
"id": "eb956ed4",
"metadata": {},
"outputs": [],
"source": [
"df_m = pd.read_csv(\"./datasets/market_value_decline.csv\").rename(columns={\n",
" 'Unnamed: 0': 'bank',\n",
" 'market_value_2007': '2007',\n",
" 'market_value_2009': '2009'\n",
"})\n",
"\n",
"df_mkt = df_m\n",
"df_mkt[\"diff\"] = 100 * (df_mkt['2009'] - df_mkt['2007']) / df_mkt['2007']\n",
"df_mkt = df_mkt.sort_values(['diff'], ascending=False)\n",
"\n",
"# sort source DF according to new order by diff\n",
"df_m = df_m.reindex(df_mkt.index)"
]
},
{
"cell_type": "code",
"execution_count": 18,
"id": "4a29684b",
"metadata": {},
"outputs": [],
"source": [
"df_mval = pd.melt(df_m.loc[:, ['bank', '2007', '2009']], id_vars=['bank'], var_name='year', value_name='market_value')"
]
},
{
"cell_type": "code",
"execution_count": 19,
"id": "d3d58d25",
"metadata": {},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAABUgAAAKrCAYAAAAj9WcAAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjcuMSwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/bCgiHAAAACXBIWXMAAA9hAAAPYQGoP6dpAADcF0lEQVR4nOzdeVhUdf/G8XtmAMEFEDVxQc19V9yX0tyeXFPTtDRMSdPUyiUVdy1F3HOpNNNKzcq1zEwrMy0t0zSXx3DfwNwF00Bgzvz+8Oc8EViIDAMz79d1demc+Z5zPp8vgx1uzmKy2Ww2AQAAAAAAAIAbMju7AAAAAAAAAABwFgJSAAAAAAAAAG6LgBQAAAAAAACA2yIgBQAAAAAAAOC2CEgBAAAAAAAAuC0CUgAAAAAAAABui4AUAAAAAAAAgNsiIAUAAAAAAADgtjycXQAgSTabTdeu3ZJh2JxdSoYwm00KCMjlMj25Wj8SPWUX9JQ9uFpPrtaP5Lo95cuX29llIA1c7TgvK3HF7+2shPl1LObXsZhfx2J+HcsZx3mcQYoswWQyyWw2ObuMDGM2m1yqJ1frR6Kn7IKesgdX68nV+pFctydkD6722ctKXPF7Oythfh2L+XUs5texmF/Hcsa8EpACAAAAAAAAcFtcYg8AwP8zmx3/W2CLxZzsT1fgaj25Wj+Sa/eE7IGvl2O44vd2VuKM+TUMG5frAoATEJACAKA74WhePx+ZPSyZsj9fX59M2U9mcrWeXK0fyTV7QtZntRp89hyM+XWszJzfJKtVsTFxhKQAkMkISAEA0P+fPeph0daBsxRzLMrZ5QD4F/5liqrJ/CHOLgNpYLGY1Sd8iY6eveDsUoAsrWyxQC0aFSqz2URACgCZjIAUAIC/iDkWpauHTjq7DABwKUfPXtD+4+ecXQYAAECqCEgBAAAAAACAVBiGIas16W/LTIqPtygh4basVs74vl8Wi4fM5qx1/2wCUgAAAAAAAOAvbDabbty4pri4m6m+f+WKWYZhZHJVrsPHJ7d8fQNkMjn2IblpRUAKAAAAAAAA/MXdcDR37rzy8sqRIsizWEycPZoONptNCQm3dfPmdUmSn18+J1d0BwEpAAAAAAAA8P8Mw2oPR3Pn9k11jIeHWUlJnEGaHl5eOSRJN29eV548ebPE5fbOrwAAAAAAAADIIqxWq6T/BXnIeHfn9u/3d3UWAlIAAAAAAADgb7LK/TFdUVabWwJSAAAAAAAAAG6LgBQAAAAAAACA2yIgBQAAAAAAAOC2CEgzwPr169WlSxdVr15dwcHB6tSpkz7++OMM3cf169e1atWqDN1makJCQhQWFubw/QAAAAAAAABZgYezC8juVq9ercmTJ2v06NGqWbOmbDabduzYoUmTJunKlSsaOHBghuxn2rRpioqK0lNPPZUh2wMAAAAAAABAQPrAVqxYoU6dOqlz5872ZSVLltTFixe1dOnSDAtIbTZbhmwHAAAAAAAAWcubb87RmjUrtX79ZuXOndu+/P3339VHHy3TZ59t1vnzUVqwYL5+/XWfJKlmzdoaOHCQihQpah9//PgxLVnyjg4c2Kc//vhDefMG6LHHmurFF19SjhzekqRHHqml0NAXtGPH9zp16qRCQnqqV68+mdtwFsMl9g/IbDZr3759io2NTbb8hRde0CeffCJJOn/+vAYPHqz69eurUqVKatSokaZPny7DMCRJa9euVYsWLex/Vq5cWU8++aR++eUXSVJYWJjWrVunn3/+WeXKlZMkxcbGasyYMXr00UdVqVIl1a9fX2PGjFFcXJwkadeuXapYsaK2bdumtm3bqnLlymrZsqW++eYbe40JCQkKDw9X/fr1VbNmzWQ13XXixAn16dNHwcHBeuSRRzR06FBdvnzZ/n5ISIjGjh2rp556SrVq1dL69eszeIYBAAAAAABcW9u27ZWQcFvfffdNsuWbNm1U06b/0aVLF9Wv3/O6fv2aRo+eoL
CwsTp/Plr9+99ZJklXrlzRgAG9FR8fp1GjJmjGjLlq1uw/Wr36E61cmfxWkMuWvacWLR7XpElT1bhx00zrM6viDNIH1Lt3bw0ePFiNGjVS3bp1VatWLdWrV09VqlSRr6+vJOnFF19UgQIF9N577ylXrlzasmWLpkyZouDgYDVv3lyS9Pvvv+vjjz/W9OnTlStXLk2YMEFhYWH66quvNHr0aMXHx+vChQuaN2+epDuh6cWLFzV//nzly5dPe/fu1ahRo1S6dGn17NlTkmS1WjV9+nSNHj1ahQoV0qxZszRixAht375duXLl0qRJk/Ttt98qIiJChQsX1oIFC7Rnzx4FBQVJki5evKhu3bqpXbt2CgsLU1xcnObNm6euXbtqw4YNypkzpyRp1apVmj59usqVK6cCBQpk8lcAAAAAWV3z2pVUJqigs8tAOsXejNOl6zecXYbLK1ss0NklAHCi4sVLqHLlqtq0aaPatu0gSTp4cL+ios5qzJgJeu+9RfL29tYbb7ylXLnunGFaq1ZtdenSXitWLNOAAa/o5MnjKlOmnCZNmqqcOXNJkmrXrqs9e3Zp375fFBLS076/qlWD9fTTz2Z2m1kWAekDatmypQIDA7V06VLt2LFD27ZtkySVKFFC4eHhqlSpktq3b69WrVqpUKFCkqSePXtq0aJFOnLkiD0gTUxM1MSJE1WhQgVJUq9evTRgwABdvnxZDz30kLy9veXp6WkPIBs2bKjatWvbzygtWrSoli9frqNHjyarb9CgQapfv74kqX///tq8ebOOHj2qMmXKaO3atRo/frwaN24sSQoPD9dPP/1kX/ejjz5SYGCgxowZY1/2xhtvqF69etq0aZOefPJJSVKFChXUrl27jJ1YAAAAuATDsGrs8+2dXQYegGFYZTZbnF2GW0iyWmUY3F4NcFdt2z6hqVMn68KF3xUYWEgbN25QsWLFVblyVY0aNUzBwTWUI4e3kpKSJEk5c+ZS1arB2r17lySpTp16qlOnnpKSknTq1ElFR5/TiRPHdf36dfn6+iXbV5kyZTO9v6yMgDQDVK9eXdWrV5dhGIqMjNS2bdu0fPly9enTR19//bWeffZZbdq0SQcOHNCZM2d05MgRXblyJcXl7KVKlbL/PU+ePJLuBKep6datm7799lutW7dOp0+f1vHjxxUVFaWSJUsmG/fX13fvYZGYmKhTp04pMTFRVapUsb+fI0cOVaxY0f768OHDOnbsmIKDg5Nt8/bt2zpx4oT9dfHixdM0TwAAAHA/ZrNFW94cpZjok84uBengX6Skmg0I140bcbJajX9fwYVYLGb5+vpkau+GYSMgBdxY06b/0Zw5s7Rp0xd65pkQbd36tbp37ylJio2N0ZYtX2vLlq9TrOfvn1eSZBiGFi58U2vXrlJc3J966KGCqlixknLkyJHi2TY+Pj4O7yc7ISB9ABcuXNDChQvVt29fBQYGymw2q2LFiqpYsaKaN2+utm3bavv27Vq2bJni4+PVsmVLdezYUVWrVlX37t1TbM/LyyvFstQezmQYhvr27atjx46pbdu2at26tSpVqqSxY8emeZsmkynV7Xt4/O8jYRiG6tWrp/Hjx6fYxt0AV5K8vb1TvA8AAADcFRN9UldORzq7DDwAq9VQUpJ7BaR3uXPvADJXzpw51aRJM23d+o1KlSqtuLg4tWrVRtKdHKZmzTp65pmUl8VbLHfO8l++/H198smHGjZslBo3bmo/Ua5Pnx6Z10Q2RUD6ALy8vLRq1SoVKlRIL7zwQrL37t5/NDo6Wv/973+1Y8cO5c+fX5IUExOjq1ev3teT6e8GmpL022+/afv27Vq5cqWqVasm6c5ZoWfPnrXfP/TfPPzww8qRI4f27t1rv6w/KSlJkZGRqlu3riSpTJky2rhxowoVKmQPWmNiYjRixAj16tVL9erVS3P9AAAAAAAA+Gdt27bXxo2f65NPVqhWrbrKn//OrRarV6+h06dPqXTpsvaT22w2my
ZOHKOgoGIqU6acDhz4VQ8/XFJt2jxh397ly5d04sQJVahQMdX94Q6eYv8AAgIC1Lt3b82ZM0ezZ8/Wb7/9pnPnzmn
"text/plain": [
"<Figure size 1500x800 with 2 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"sns.set_theme(palette=\"hls\")\n",
"\n",
"# Initialize the matplotlib figure\n",
"f, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 8), sharey=True)\n",
"\n",
"ax2.xaxis.set_major_formatter(mpl.ticker.FuncFormatter(lambda x, pos: '%.0fB' % (x)))\n",
"ax1.xaxis.set_major_formatter(mpl.ticker.FuncFormatter(lambda x, pos: '%.0f%%' % (x)))\n",
"\n",
"sns.barplot(y=\"bank\", x=\"diff\", data=df_mkt, ax=ax1, color=sns.xkcd_rgb[\"purplish red\"])\n",
"sns.barplot(\n",
" data=df_mval, ax=ax2,\n",
" x=\"market_value\", y=\"bank\", hue=\"year\",\n",
" palette=[sns.xkcd_rgb[\"prussian blue\"], sns.xkcd_rgb[\"sienna\"]]\n",
")\n",
"\n",
"# Add a legend and informative axis label\n",
"ax2.set(ylabel=\"Institution\", xlim=[0, 300],\n",
" xlabel=\"Market value\")\n",
"ax1.set(ylabel=\"Institution\", xlim=[-1, 0], xticks=range(-100, 1, 10),\n",
" xlabel=\"Market value decrease\")\n",
"sns.despine(left=True, bottom=True)"
]
},
{
"cell_type": "markdown",
"id": "06e7f954",
"metadata": {},
"source": [
"## Exercise 4 - Data visualisation and exploration (30 points) 🔍\n",
"\n",
"You'll need to work with the *'airports'* and *airports-delays* datasets. Examine the datasets and perform cleansing if needed, before performing the exercise.\n",
"\n",
"1. Create a dataframe that provides, for each country, <del>the mean of flights delayed</del>. Display these information by binning the flights delayed in 6 bins. The resulting dataframe should have the countries as rows and the 6 bins as columns. For this exercise you cannot use pivot_table but only groupby. \n",
"\n",
"<span style=\"color: red\">According to answer of question to professor:</span>\n",
"> Bin by delay_duration value, compute delay mean per-bin per-country \n",
"\n",
"2. Create a dataframe from a*irports-delays* which shows for each continent and country:\n",
" 1. max, min and mean of **delay_duration**;\n",
" 2. mean, sum of **flights_cancelled**;\n",
" 3. mean, sum of **flights_delayed**;\n",
" 4. mean, sum of **flights_planned**.\n",
"\n",
"3. Show a representation of the relationship between the number of flights planned and the number of flights delayed for each continent. It should be possible to see the relationship and the presence of outliers for each continent. What do you observe? You may want to display the median of the values for a better explaination."
]
},
{
"cell_type": "code",
"execution_count": 20,
"id": "b4fde7e4",
"metadata": {},
"outputs": [],
"source": [
"df_del = pd.read_csv(\"./datasets/airports-delays.csv\", index_col='ID', sep=\";\", na_values=['\\\\N']) \\\n",
" .dropna(subset=['tz_database_timezone'])"
]
},
{
"cell_type": "code",
"execution_count": 21,
"id": "25391739",
"metadata": {},
"outputs": [],
"source": [
"def tz_to_continent(tz: str) -> str:\n",
" tz_mappings = {\n",
" 'Asia/': 'Asia',\n",
" 'Africa/': 'Africa',\n",
" 'America/': 'America',\n",
" 'Europe/': 'Europe',\n",
" 'Australia/': 'Oceania',\n",
" 'Pacific/': 'Oceania',\n",
" 'Antarctica/': 'Antarctica',\n",
" 'Arctic/Longyearbyen': 'Europe',\n",
" 'Atlantic/Azores': 'Europe',\n",
" 'Atlantic/Bermuda': 'America',\n",
" 'Atlantic/Canary': 'Africa',\n",
" 'Atlantic/Cape_Verde': 'Africa',\n",
" 'Atlantic/Faeroe': 'Europe',\n",
" 'Atlantic/Reykjavik': 'Europe',\n",
" 'Atlantic/St_Helena': 'Africa',\n",
" 'Atlantic/Stanley': 'America',\n",
" 'Indian/Antananarivo': 'Africa',\n",
" 'Indian/Chagos': 'Asia',\n",
" 'Indian/Christmas': 'Oceania',\n",
" 'Indian/Cocos': 'Oceania',\n",
" 'Indian/Comoro': 'Africa',\n",
" 'Indian/Mahe': 'Africa',\n",
" 'Indian/Maldives': 'Asia',\n",
" 'Indian/Mauritius': 'Africa',\n",
" 'Indian/Mayotte': 'Africa',\n",
" 'Indian/Reunion': 'Africa',\n",
" }\n",
" if type(tz) != str:\n",
" raise ValueError(\"tz not str\")\n",
" to_return = [v for (k, v) in tz_mappings.items() if tz.startswith(k)]\n",
" if len(to_return) == 0:\n",
" raise ValueError(f\"'{tz}' no continent found\")\n",
" return to_return[0]\n",
"\n",
"df_del[\"continent\"] = df_del[\"tz_database_timezone\"].apply(tz_to_continent)"
]
},
{
"cell_type": "code",
"execution_count": 22,
"id": "f8906707",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th>delay_duration_bin</th>\n",
" <th>(15.999, 30.0]</th>\n",
" <th>(30.0, 35.0]</th>\n",
" <th>(35.0, 41.0]</th>\n",
" <th>(41.0, 47.0]</th>\n",
" <th>(47.0, 59.0]</th>\n",
" <th>(59.0, 850.0]</th>\n",
" </tr>\n",
" <tr>\n",
" <th>country</th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>Afghanistan</th>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.00</td>\n",
" <td>44.0</td>\n",
" <td>0.000000</td>\n",
" <td>60.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Albania</th>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.00</td>\n",
" <td>0.0</td>\n",
" <td>56.000000</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Algeria</th>\n",
" <td>26.5</td>\n",
" <td>33.857143</td>\n",
" <td>38.75</td>\n",
" <td>43.0</td>\n",
" <td>51.200000</td>\n",
" <td>73.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>American Samoa</th>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.00</td>\n",
" <td>43.0</td>\n",
" <td>48.000000</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Angola</th>\n",
" <td>28.0</td>\n",
" <td>35.000000</td>\n",
" <td>36.00</td>\n",
" <td>45.0</td>\n",
" <td>51.666667</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
"delay_duration_bin (15.999, 30.0] (30.0, 35.0] (35.0, 41.0] (41.0, 47.0] \\\n",
"country \n",
"Afghanistan 0.0 0.000000 0.00 44.0 \n",
"Albania 0.0 0.000000 0.00 0.0 \n",
"Algeria 26.5 33.857143 38.75 43.0 \n",
"American Samoa 0.0 0.000000 0.00 43.0 \n",
"Angola 28.0 35.000000 36.00 45.0 \n",
"\n",
"delay_duration_bin (47.0, 59.0] (59.0, 850.0] \n",
"country \n",
"Afghanistan 0.000000 60.0 \n",
"Albania 56.000000 0.0 \n",
"Algeria 51.200000 73.0 \n",
"American Samoa 48.000000 0.0 \n",
"Angola 51.666667 0.0 "
]
},
"execution_count": 22,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_4_1 = df_del.copy()\n",
"\n",
"# The following statements bins the data by the value of delay_duration.\n",
"# The bins are chosen as equally-spaced percentile values of the data. This is done to \n",
"# better distribute the data between bins, as it is quite skewed towards low values\n",
"df_4_1[\"delay_duration_bin\"] = pd.qcut(df_del.delay_duration, 6)\n",
"\n",
"# The dataframe will contain countries as row indices, the 6 bins as columns and values\n",
"# corresponding to the mean delay_duration per country, per bin. When no delay_duration \n",
"# falls in a particular bin for some country, that bin has a value of 0\n",
"df_4_1 = df_4_1.loc[:, ['country', 'delay_duration', 'delay_duration_bin']] \\\n",
" .groupby(['country', 'delay_duration_bin']) \\\n",
" .mean() \\\n",
" .fillna(0) \\\n",
" .reset_index() \\\n",
" .pivot(index='country', columns='delay_duration_bin', values='delay_duration') \n",
"\n",
"df_4_1.head()"
]
},
{
"cell_type": "code",
"execution_count": 23,
"id": "a677ce07",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th></th>\n",
" <th>dur_min</th>\n",
" <th>dur_mean</th>\n",
" <th>dur_max</th>\n",
" <th>cancelled_sum</th>\n",
" <th>cancelled_mean</th>\n",
" <th>delayed_sum</th>\n",
" <th>delayed_mean</th>\n",
" <th>planned_sum</th>\n",
" <th>planned_mean</th>\n",
" </tr>\n",
" <tr>\n",
" <th>continent</th>\n",
" <th>country</th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th rowspan=\"5\" valign=\"top\">Africa</th>\n",
" <th>Algeria</th>\n",
" <td>26.0</td>\n",
" <td>43.739130</td>\n",
" <td>82.0</td>\n",
" <td>6</td>\n",
" <td>0.26087</td>\n",
" <td>360</td>\n",
" <td>15.652174</td>\n",
" <td>1864</td>\n",
" <td>81.043478</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Angola</th>\n",
" <td>28.0</td>\n",
" <td>42.714286</td>\n",
" <td>53.0</td>\n",
" <td>9</td>\n",
" <td>1.12500</td>\n",
" <td>97</td>\n",
" <td>12.125000</td>\n",
" <td>472</td>\n",
" <td>59.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Benin</th>\n",
" <td>69.0</td>\n",
" <td>69.000000</td>\n",
" <td>69.0</td>\n",
" <td>0</td>\n",
" <td>0.00000</td>\n",
" <td>7</td>\n",
" <td>7.000000</td>\n",
" <td>28</td>\n",
" <td>28.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Burkina Faso</th>\n",
" <td>35.0</td>\n",
" <td>35.000000</td>\n",
" <td>35.0</td>\n",
" <td>0</td>\n",
" <td>0.00000</td>\n",
" <td>18</td>\n",
" <td>18.000000</td>\n",
" <td>65</td>\n",
" <td>65.000000</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Cameroon</th>\n",
" <td>28.0</td>\n",
" <td>51.250000</td>\n",
" <td>83.0</td>\n",
" <td>3</td>\n",
" <td>0.75000</td>\n",
" <td>61</td>\n",
" <td>15.250000</td>\n",
" <td>339</td>\n",
" <td>84.750000</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" dur_min dur_mean dur_max cancelled_sum \\\n",
"continent country \n",
"Africa Algeria 26.0 43.739130 82.0 6 \n",
" Angola 28.0 42.714286 53.0 9 \n",
" Benin 69.0 69.000000 69.0 0 \n",
" Burkina Faso 35.0 35.000000 35.0 0 \n",
" Cameroon 28.0 51.250000 83.0 3 \n",
"\n",
" cancelled_mean delayed_sum delayed_mean \\\n",
"continent country \n",
"Africa Algeria 0.26087 360 15.652174 \n",
" Angola 1.12500 97 12.125000 \n",
" Benin 0.00000 7 7.000000 \n",
" Burkina Faso 0.00000 18 18.000000 \n",
" Cameroon 0.75000 61 15.250000 \n",
"\n",
" planned_sum planned_mean \n",
"continent country \n",
"Africa Algeria 1864 81.043478 \n",
" Angola 472 59.000000 \n",
" Benin 28 28.000000 \n",
" Burkina Faso 65 65.000000 \n",
" Cameroon 339 84.750000 "
]
},
"execution_count": 23,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# 4.2\n",
"df_4_2 = df_del.loc[:, ['country', 'continent', 'delay_duration', 'flights_cancelled', 'flights_delayed', 'flights_planned']] \\\n",
" .sort_values(['continent', 'country']) \\\n",
" .groupby(['continent', 'country']) \\\n",
" .agg(dur_min=('delay_duration', 'min'), \\\n",
" dur_mean=('delay_duration', 'mean'), \\\n",
" dur_max=('delay_duration', 'max'), \\\n",
" cancelled_sum=('flights_cancelled', 'sum'), \\\n",
" cancelled_mean=('flights_cancelled', 'mean'), \\\n",
" delayed_sum=('flights_delayed', 'sum'), \\\n",
" delayed_mean=('flights_delayed', 'mean'), \\\n",
" planned_sum=('flights_planned', 'sum'), \\\n",
" planned_mean=('flights_planned', 'mean'))\n",
" \n",
"df_4_2.head()"
]
},
{
"cell_type": "code",
"execution_count": 24,
"id": "a29b8c2f",
"metadata": {},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAABR4AAAK5CAYAAADZ4TKfAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjcuMSwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/bCgiHAAAACXBIWXMAAA9hAAAPYQGoP6dpAACtO0lEQVR4nOzdeXhTZfr/8c8JTSiFttCKoqwFZEdRXKAVZJCRgqDAuAAqiAOI+zIyiAKz6Iwg+h0RdVxwUHDBnwjIInTGDZAijjsoWChldXCgjF1oISk5vz8wsWnSNGnTJmner+vqZXLOs9zPc04TvPuccwzTNE0BAAAAAAAAQAhZwh0AAAAAAAAAgPqHxCMAAAAAAACAkCPxCAAAAAAAACDkSDwCAAAAAAAACDkSjwAAAAAAAABCjsQjAAAAAAAAgJAj8QgAAAAAAAAg5Eg8AgAAAAAAAAi5uHAHgLr3+eef66yz2ikuzhbuUABJUlmZXT/8sIfzEhGF8xKRiPMSkYpzE5GI8xKRiPMSkaiszK4zz0ytlbZZ8RijnE5nuEMA3FznI+clIgnnJSIR5yUiFecmIhHnJSIR5yUiUW2ejyQeAQAAAAAAAIQciUcAAAAAAAAAIUfiEQAAAAAAAEDIkXgEAAAAAAAAEHIkHgEAAAAAAACEXFy4AwAAAAAAAED0cTqdOnmyLNxhwI8GDeJksYRv3SGJRwAAAAAAAATMNE0VFh7V8ePHwh0KAhAf31hJSSkyDKPO+ybxCA+macrhcFRZRlKNTlir1RqWEx4AAAAAANRMYeFRnThRotNPP13x8Y34//sIZZqmjh8v1eHDh1VYKCUnp9Z5DCQe4cHhcGju3IdrvZ+pU2fKZrPVej8AAAAAACB0nM6TOn78mE4//XQ1a5YS7nBQhUaNGkmS/vvf/yoxsVmdX3bNw2UAAAAAAAAQkJMnT0qS4uMbhTkSBMp1rMJxP05WPKJS1w+bobg4z1WJjjK7Xl/9iCRp7LAZssYFvmqxrMyu136uCwAAAAAAoheXV0ePcB4rEo+oVFyczW9i0VrFfgAAAAAAAMQuLrUGAAAAAAAAoozr4b+RjMQjAAAAAAAAQu6mm8brppvGe20vLi7W2LGjdd555+qDD96XJF1++SA99NCDNe5zxYrl6tGjmw4ePFijdsrKyvTQQw/qoosu0MUXX6hPP93is9zixYt06aX91Lv3eXr++ec8xnHw4EH16NFNK1YsD7jfQOt8+OEHevDB6YEPKEy41BoAAAAAAAB14tixY7rllsn6/vvv9dRT89WvX39J0rx5T6lx4yZhju4XH3/8sd55Z4WmTLlVffr0VbduXb3KFBcXa+7cx3TppZdq/PgJatmypd5+e6l7f/PmzfXaa2+odevWIY/vlVdeCXmbtYHEIwAAAAAAAGrdL0nHHZo//xmlp6e793Xt2i2MkXkrKPhJkjRixEi1atXKZ5nCwgI5nU4NHHiZLrjgAq/9NptN5557bm2GGfG41BoAAAAAAAC1qqTkmKZMuUXff79Dzz77nEfSUZLPS5Szstbp3nvv0UUXXaD09D76wx9mqaSkxF3H6XTq+eef06BBA3XBBefrrrvuUEFBQZWxnDx5UkuWvKGRI69S797nadCggfrb3/5PJ06ckCQ99NCD7lgyMy/3ebn4ihXLdfnlv5YkzZw5Qz16eCdOfV02/dVXX2n8+Bt14YW9NWjQQC1evFgTJ97sdZn54cOHdd99v4z9j3/8g0pKjkk6dQn7Z5/9W5999m/16NFNn376aZVjDhcSjzEoGm4+WhtM04zZsQMAAAAAEC4lJSW69dYp2r79Oz3//Au66KKLAqr3pz/9UWeddZaeemq+Jky4WcuWva3nn3/Ovf+JJx7X3//+rH7zm6s1b95TSk5uqr/97f8Canf27Ed12WWDNH/+Mxo79nq9/vpruvPOO2Sapm65ZYpuuWWKJOnJJ5/SzJkzvdro3/9SPfnkU5KkW2
6Zotdee6PKfnfv3q2JE2+WJM2d+7huv/0OLVjwgr744guvsk8/PV8tWpyp+fOf1rhx47V06Vt65plnJEkzZ85U165d1bVrV7322hvq1i2yVouWx6XWMSgrK0vjx08Kdxh1yjRNLVq0QJI0btxEGYYR5ogAAAAAAKj/SktLddttU9zJtfIrFqvSv/+lmjr195KkPn36avPmzdqwYb3uvfc+FRYW6rXXXtX48Tfp1ltvkyRlZFyiw4f/q48//rjSNnNzd2nZsrd1zz33auLEU7mR9PR0NW/eXNOnP6CNGzeof/9L3fdl7Nq1q1q2bOnVTkpKirp2PXXfx9atWwd0SfWLL76gJk2a6LnnXlCjRo0kSWlp7XXDDWO9yv7615fr97+fJkm6+OI+ys7e5H7ATYcOHd33w4z0S7lZ8RiDDh8+LIfDEe4w6pTD4dCBA/t04MC+mBs7AAAAAADh8u2327Rr1y698spitWnTRg8++KCOHDkcUN1evXp5vD/jjDNUWloqSfrmm69VVlamSy8d4FFm8OBMv23++9+fSZKGDh3qsX3IkKFq0KCB/v3vfwcUW3V8+ukW9evX3510lE6N0Vdis3fv3h7vW7ZspaKiolqLrbaQeAQAAAAAAECtSEpK0ksvLdT555+vRx+drcLCAk2fPj2gW6HFx8d7vLdYLHI6nZLkvpdjs2bNPMo0b97cb5uuh8acdppnubi4ODVt2rRWk3tHjx5VSkqK1/bU1FSvbeWTk5Ln2KMJiUcAAAAAAADUik6dOqtz586SpHPOOVcTJ07S5s3ZWrjwHzVqt2nTUwnH/Px8j+0//fST33rJyU0lyWvVpcPh0E8//aSmTZvWKC5/zjijhVe80qmEZH1F4hEAAAAAAAB1YsqUW9WjR0899dQ8bd36TbXbOe+8XoqPj9c//7nOY/tHH33kt96FF14gSXr33Xc9tq9du1YnT57U+eefX+2YqnLBBRfo4483up+eLUnbt3+nAwcOBN1WgwbRkdLj4TIAAAAAAACoE3FxcZo9e46uueY3mjp1qpYufVtNmjQJup2EhMa65ZYpmj//KTVqlKCLLrpYGzdu0Pr1H/mt16FDR1111Qg9/fR8HT9+XL1799aOHTv07LPP6KKLLtYll/Sr5siqNnnyZK1bt1ZTptyi8eNvUlFRoebPf0oWiyXoh+AmJibp66+/0pYtn6hLl65KTk6upahrhsQjYtpHH72n7OwN6tKlmw4ePKDBg4dJkrKyVstqtSo//4hSU09TSUmJDMNQ27bttGPHd+rSpZtyc3fJ4bArPb2/BgwY5NFuTs4OZWWt9mivZ89e2rr1Kw0ePEydOnWp03GWj6emfefk7NDq1ctlGIauuGJEnY8FAAAAABDd2rVrp9/9bqoeeeTP+vOf/6THHptbrXYmTZqshIQELV68WIsXL1KvXufp/vun6uGH/+y33p///LDatGmj5cuXa8GCF3XGGWfohhtu1JQpt8piqb2VhG3atNXzz7+gJ554XPfdd49SUlI0adJkvfDC80pISAiqrbFjx+rbb7dpypRb9Mgjf9EVVwyrpahrxjADuZsn6pV77rlHd989VU2aJHnts9vtmjv3YUnS+BF/ljXO5rHfUWbXKytmVbrfn/J1p06dKZst8Lo1VX5crr5LSo7pySfneNzQtkmTRElScXHgN5M1DEP33DNNCQmNJUkOh11///s8FRUVerRnGIZM01RiYpJuvfVuWa11M/7y8dS0b4fDrmeffdI9P02aJOq22+6p8Vjs9uM6dGifWrRoI5stvuoKQB3gvEQk4rxEpOLcRCTivEQkqg/npcNh19Gjh9S2bTuvh7/Av08+2Syr1arevS9wbyssLFT//pfo/vun6oYbbqyVfo8fP669e/coJaWFz/9/t9uPq2VL/w/lqa7ouCAcqAVvvfW611O0iouLgko6SpJpmlq69A33+02bNrifglW+PVdfRUVFys7eWJPQg1I+npr2vWnTBo/5KS
6u27EAAAAAABCtvvvuO02ePEmLFy/SZ599pvfee0933HGbkpKSNHToFeEOr1ZwqXWMcjjsstvtXtt9basNddVPZf3
"text/plain": [
"<Figure size 1500x800 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"df_4_3 = df_del.loc[:, ['continent', 'flights_planned', 'flights_delayed']] \\\n",
" .rename(columns={'flights_planned': 'planned', 'flights_delayed': 'delayed'}) \\\n",
" .melt(id_vars=['continent'], value_vars=['planned', 'delayed'], var_name=\"Kind of flight\", \\\n",
" value_name=\"# of flights\") \\\n",
" .sort_values('continent')\n",
"\n",
"f, ax1 = plt.subplots(figsize=(15, 8))\n",
"\n",
"sns.set_theme(style=\"ticks\", palette=\"pastel\")\n",
"\n",
"# Draw a nested boxplot to show bills by day and time\n",
2023-03-28 12:00:00 +00:00
"ax1.set(xlim=[0, 700])\n",
"sns.boxplot(x=\"# of flights\", y=\"continent\",\n",
" hue=\"Kind of flight\", palette=[\"m\", \"g\"],\n",
" data=df_4_3)\n",
"sns.despine(offset=10, trim=True)"
]
},
{
"cell_type": "markdown",
"id": "04aa4de5",
"metadata": {},
"source": [
"I observe that in all continents there is a significant higher number of planned flights than the number of delayed flights. This can be determined by the inter-quartile range positions of both series' boxplots with respect to each other."
2023-03-21 17:21:11 +00:00
]
2023-03-20 11:12:57 +00:00
},
{
"cell_type": "markdown",
"id": "e2f9c1aa",
"metadata": {},
"source": [
"## Exercise 5 - Geospatial data analysis (35 points) 🌍\n",
"\n",
"Use the *airports*, *routes*, *countries* and *europe.geojson* files. Create an interactive map representation - related to European countries only - such that, when a country is selected the map shows the number of flights left from the country selected and directed to each of the other countries, if flights with those destinations exist. The information should be represented as a choropleth map, essentially dynamically creating it when a country is selected.\n",
"\n",
"**Hints**:\n",
"1. If `A` is a GeoDataFrame and `B` a DataFrame, the result of `A.merge(B,..)` is a GeoDataFrame, whereas the result of `B.merge(A,..)` is a DataFrame. The function `to_json()` on a DataFrame with a geometry column does **not** work.\n",
"2. When updating the map, to access the color mapper you can use the following method:\n",
"```\n",
"color_mapper = p.select_one(LinearColorMapper)\n",
"```\n",
"where `p` is the figure.\n",
"\n",
"3. You can discard Guernsey and Gibraltar that are not present in the geojson.\n",
"\n",
"\n",
"<aside>\n",
"💡 Note that you have all the information you need in the files mentioned above. \n",
"</aside>"
]
},
{
"cell_type": "code",
"execution_count": 194,
"id": "5d1fad2a",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th>country</th>\n",
" <th>Albania</th>\n",
" <th>Andorra</th>\n",
" <th>Austria</th>\n",
" <th>Belarus</th>\n",
" <th>Belgium</th>\n",
" <th>Bosnia and Herzegovina</th>\n",
" <th>Bulgaria</th>\n",
" <th>Croatia</th>\n",
" <th>Czech Republic</th>\n",
"      <th>Denmark</th>\n",
" <th>...</th>\n",
" <th>San Marino</th>\n",
" <th>Serbia</th>\n",
" <th>Slovakia</th>\n",
" <th>Slovenia</th>\n",
" <th>Spain</th>\n",
" <th>Sweden</th>\n",
" <th>Switzerland</th>\n",
" <th>Ukraine</th>\n",
" <th>United Kingdom</th>\n",
" <th>Vatican City</th>\n",
" </tr>\n",
" <tr>\n",
" <th>country_dest</th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>Albania</th>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>1.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
"      <td>...</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>1.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>1.0</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Andorra</th>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
"      <td>...</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Austria</th>\n",
" <td>1.0</td>\n",
" <td>0.0</td>\n",
" <td>15.0</td>\n",
" <td>2.0</td>\n",
" <td>2.0</td>\n",
" <td>1.0</td>\n",
" <td>3.0</td>\n",
" <td>6.0</td>\n",
" <td>1.0</td>\n",
" <td>5.0</td>\n",
"      <td>...</td>\n",
" <td>0.0</td>\n",
" <td>3.0</td>\n",
" <td>1.0</td>\n",
" <td>2.0</td>\n",
" <td>40.0</td>\n",
" <td>4.0</td>\n",
" <td>11.0</td>\n",
" <td>10.0</td>\n",
" <td>11.0</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Belarus</th>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>2.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>2.0</td>\n",
" <td>0.0</td>\n",
"      <td>...</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>1.0</td>\n",
" <td>1.0</td>\n",
" <td>1.0</td>\n",
" <td>2.0</td>\n",
" <td>1.0</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Belgium</th>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>2.0</td>\n",
" <td>0.0</td>\n",
" <td>1.0</td>\n",
" <td>0.0</td>\n",
" <td>4.0</td>\n",
" <td>5.0</td>\n",
" <td>4.0</td>\n",
" <td>5.0</td>\n",
"      <td>...</td>\n",
" <td>0.0</td>\n",
" <td>2.0</td>\n",
" <td>1.0</td>\n",
" <td>3.0</td>\n",
" <td>60.0</td>\n",
" <td>6.0</td>\n",
" <td>6.0</td>\n",
" <td>2.0</td>\n",
" <td>17.0</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>5 rows × 46 columns</p>\n",
"</div>"
],
"text/plain": [
"country Albania Andorra Austria Belarus Belgium \\\n",
"country_dest \n",
"Albania 0.0 0.0 1.0 0.0 0.0 \n",
"Andorra 0.0 0.0 0.0 0.0 0.0 \n",
"Austria 1.0 0.0 15.0 2.0 2.0 \n",
"Belarus 0.0 0.0 2.0 0.0 0.0 \n",
"Belgium 0.0 0.0 2.0 0.0 1.0 \n",
"\n",
"country Bosnia and Herzegovina Bulgaria Croatia Czech Republic \\\n",
"country_dest \n",
"Albania 0.0 0.0 0.0 0.0 \n",
"Andorra 0.0 0.0 0.0 0.0 \n",
"Austria 1.0 3.0 6.0 1.0 \n",
"Belarus 0.0 0.0 0.0 2.0 \n",
"Belgium 0.0 4.0 5.0 4.0 \n",
"\n",
"country Denmark ... San Marino Serbia Slovakia Slovenia Spain \\\n",
"country_dest ... \n",
"Albania 0.0 ... 0.0 0.0 0.0 1.0 0.0 \n",
"Andorra 0.0 ... 0.0 0.0 0.0 0.0 0.0 \n",
"Austria 5.0 ... 0.0 3.0 1.0 2.0 40.0 \n",
"Belarus 0.0 ... 0.0 0.0 0.0 0.0 1.0 \n",
"Belgium 5.0 ... 0.0 2.0 1.0 3.0 60.0 \n",
"\n",
"country Sweden Switzerland Ukraine United Kingdom Vatican City \n",
"country_dest \n",
"Albania 0.0 0.0 0.0 1.0 0.0 \n",
"Andorra 0.0 0.0 0.0 0.0 0.0 \n",
"Austria 4.0 11.0 10.0 11.0 0.0 \n",
"Belarus 1.0 1.0 2.0 1.0 0.0 \n",
"Belgium 6.0 6.0 2.0 17.0 0.0 \n",
"\n",
"[5 rows x 46 columns]"
]
},
"execution_count": 194,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_air = pd.read_csv(\"./datasets/airports.csv\", index_col='ID', na_values=['\\\\N'], dtype={'ID': pd.Int64Dtype()}) \\\n",
" .drop(columns=['latitude', 'longitude'])\n",
"\n",
"df_routes = pd.read_csv(\"./datasets/routes.csv\", na_values=['\\\\N'], sep=\";\") \\\n",
" .rename(lambda x: x.strip(), axis=1)\n",
"\n",
"# Note that I consider a country to be 'European' if its 'continent' column in countries.csv is equal to 'eu'\n",
"df_countries = pd.read_csv(\"./datasets/countries.csv\")\n",
"df_countries = df_countries.loc[df_countries.continent == 'eu', :] \\\n",
"    .rename(columns={'name': 'country'}).drop(columns=['continent'])\n",
"\n",
"df_countries.loc[df_countries.country == 'Faroe Is.', 'country'] = 'Faroe Islands'\n",
"\n",
"df_id_country = df_air.join(df_countries.set_index('country'), on='country', how='right', lsuffix='_air') \\\n",
" .reset_index(drop=True) \\\n",
" .loc[:, ['IATA', 'country']] \\\n",
" .set_index('IATA')\n",
"\n",
"# Right join twice with source airport country and destination airport country\n",
"# A right join assures we include all countries in the final dataframe\n",
"df_routes_count = df_routes \\\n",
" .loc[:, ['source_airport', 'destination_airport']] \\\n",
" .join(df_id_country, how='right', on='source_airport') \\\n",
" .join(df_id_country, how='right', on='destination_airport', rsuffix='_dest')\n",
"\n",
"# Count a row as a valid route only if both its source and destination airports are not NA.\n",
"# Otherwise the row is an artifact of the right joins: it gets a count of 0, so that\n",
"# countries without flights still appear in the final sum, with a total of 0 routes\n",
"df_routes_count['# routes'] = 0\n",
"df_routes_count.loc[df_routes_count.source_airport.notna() & \\\n",
" df_routes_count.destination_airport.notna(), '# routes'] = 1\n",
"\n",
"# destination as rows, source as columns\n",
"df_routes_count = df_routes_count \\\n",
" .groupby(['country_dest', 'country']).agg({'# routes': 'sum'}) \\\n",
" .rename(columns={0: '# routes'}) \\\n",
" .unstack() \\\n",
" .fillna(0) \\\n",
" .sort_values('country_dest')\n",
"\n",
"# Cast the counts to float and drop the outer column level for geopandas compatibility\n",
"df_routes_count = df_routes_count.astype(float)\n",
"df_routes_count.columns = df_routes_count.columns.droplevel(0)\n",
"df_routes_count.head()"
]
},
{
"cell_type": "code",
"execution_count": 195,
"id": "87bd101d",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>NAME</th>\n",
" <th>geometry</th>\n",
" <th>Albania</th>\n",
" <th>Andorra</th>\n",
" <th>Austria</th>\n",
" <th>Belarus</th>\n",
" <th>Belgium</th>\n",
" <th>Bosnia and Herzegovina</th>\n",
" <th>Bulgaria</th>\n",
" <th>Croatia</th>\n",
" <th>...</th>\n",
" <th>San Marino</th>\n",
" <th>Serbia</th>\n",
" <th>Slovakia</th>\n",
" <th>Slovenia</th>\n",
" <th>Spain</th>\n",
" <th>Sweden</th>\n",
" <th>Switzerland</th>\n",
" <th>Ukraine</th>\n",
" <th>United Kingdom</th>\n",
" <th>Vatican City</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>Albania</td>\n",
" <td>POLYGON ((19.43621 41.02107, 19.45055 41.06000...</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>1.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>...</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>1.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>1.0</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
"      <th>1</th>\n",
" <td>Bosnia and Herzegovina</td>\n",
" <td>POLYGON ((17.64984 42.88908, 17.57853 42.94382...</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>1.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>2.0</td>\n",
" <td>0.0</td>\n",
" <td>1.0</td>\n",
" <td>...</td>\n",
" <td>0.0</td>\n",
" <td>3.0</td>\n",
" <td>0.0</td>\n",
" <td>1.0</td>\n",
" <td>0.0</td>\n",
" <td>3.0</td>\n",
" <td>1.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
"      <th>2</th>\n",
" <td>Bulgaria</td>\n",
" <td>POLYGON ((27.87917 42.84110, 27.89500 42.80250...</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>3.0</td>\n",
" <td>0.0</td>\n",
" <td>4.0</td>\n",
" <td>0.0</td>\n",
" <td>6.0</td>\n",
" <td>0.0</td>\n",
" <td>...</td>\n",
" <td>0.0</td>\n",
" <td>2.0</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>8.0</td>\n",
" <td>0.0</td>\n",
" <td>1.0</td>\n",
" <td>0.0</td>\n",
" <td>10.0</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>Denmark</td>\n",
" <td>MULTIPOLYGON (((11.51389 54.82972, 11.56444 54...</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>5.0</td>\n",
" <td>0.0</td>\n",
" <td>5.0</td>\n",
" <td>1.0</td>\n",
" <td>1.0</td>\n",
" <td>6.0</td>\n",
" <td>...</td>\n",
" <td>0.0</td>\n",
" <td>2.0</td>\n",
" <td>0.0</td>\n",
" <td>2.0</td>\n",
" <td>31.0</td>\n",
" <td>9.0</td>\n",
" <td>7.0</td>\n",
" <td>0.0</td>\n",
" <td>22.0</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>Ireland</td>\n",
" <td>MULTIPOLYGON (((-9.65639 53.22222, -9.66333 53...</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>1.0</td>\n",
" <td>0.0</td>\n",
" <td>3.0</td>\n",
" <td>0.0</td>\n",
" <td>1.0</td>\n",
" <td>3.0</td>\n",
" <td>...</td>\n",
" <td>0.0</td>\n",
" <td>0.0</td>\n",
" <td>1.0</td>\n",
" <td>0.0</td>\n",
" <td>53.0</td>\n",
" <td>2.0</td>\n",
" <td>3.0</td>\n",
" <td>0.0</td>\n",
" <td>63.0</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>5 rows × 48 columns</p>\n",
"</div>"
],
"text/plain": [
" NAME geometry \\\n",
"0 Albania POLYGON ((19.43621 41.02107, 19.45055 41.06000... \n",
"1 Bosnia and Herzegovina POLYGON ((17.64984 42.88908, 17.57853 42.94382... \n",
"2 Bulgaria POLYGON ((27.87917 42.84110, 27.89500 42.80250... \n",
"3 Denmark MULTIPOLYGON (((11.51389 54.82972, 11.56444 54... \n",
"4 Ireland MULTIPOLYGON (((-9.65639 53.22222, -9.66333 53... \n",
"\n",
" Albania Andorra Austria Belarus Belgium Bosnia and Herzegovina \\\n",
"0 0.0 0.0 1.0 0.0 0.0 0.0 \n",
"1 0.0 0.0 1.0 0.0 0.0 2.0 \n",
"2 0.0 0.0 3.0 0.0 4.0 0.0 \n",
"3 0.0 0.0 5.0 0.0 5.0 1.0 \n",
"4 0.0 0.0 1.0 0.0 3.0 0.0 \n",
"\n",
" Bulgaria Croatia ... San Marino Serbia Slovakia Slovenia Spain \\\n",
"0 0.0 0.0 ... 0.0 0.0 0.0 1.0 0.0 \n",
"1 0.0 1.0 ... 0.0 3.0 0.0 1.0 0.0 \n",
"2 6.0 0.0 ... 0.0 2.0 0.0 0.0 8.0 \n",
"3 1.0 6.0 ... 0.0 2.0 0.0 2.0 31.0 \n",
"4 1.0 3.0 ... 0.0 0.0 1.0 0.0 53.0 \n",
"\n",
" Sweden Switzerland Ukraine United Kingdom Vatican City \n",
"0 0.0 0.0 0.0 1.0 0.0 \n",
"1 3.0 1.0 0.0 0.0 0.0 \n",
"2 0.0 1.0 0.0 10.0 0.0 \n",
"3 9.0 7.0 0.0 22.0 0.0 \n",
"4 2.0 3.0 0.0 63.0 0.0 \n",
"\n",
"[5 rows x 48 columns]"
]
},
"execution_count": 195,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Note the inner join to drop countries that we do not consider part of 'Europe'\n",
"# (according to the countries.csv file)\n",
"yurop = gpd.read_file(\"./datasets/europe.geojson\") \\\n",
"    .loc[:, ['NAME', 'geometry']] \\\n",
"    .set_index('NAME') \\\n",
"    .join(df_routes_count, how='inner') \\\n",
"    .reset_index(names='NAME')\n",
"\n",
"yurop.head()"
]
},
{
"cell_type": "code",
"execution_count": 218,
"id": "11612845",
"metadata": {},
"outputs": [
{
"data": {
"application/vnd.bokehjs_exec.v0+json": "",
"text/html": [
"<script id=\"p13596\">\n",
" (function() {\n",
" const xhr = new XMLHttpRequest()\n",
" xhr.responseType = 'blob';\n",
" xhr.open('GET', \"http://localhost:51173/autoload.js?bokeh-autoload-element=p13596&bokeh-absolute-url=http://localhost:51173&resources=none\", true);\n",
" xhr.onload = function (event) {\n",
" const script = document.createElement('script');\n",
" const src = URL.createObjectURL(event.target.response);\n",
" script.src = src;\n",
" document.body.appendChild(script);\n",
" };\n",
" xhr.send();\n",
" })();\n",
"</script>"
]
},
"metadata": {
"application/vnd.bokehjs_exec.v0+json": {
"server_id": "3422a6f3dd7d4e33b8dd93a4c485662e"
}
},
"output_type": "display_data"
}
],
"source": [
"from bokeh.events import Tap\n",
"from bokeh.application import Application\n",
"from bokeh.application.handlers import FunctionHandler\n",
"from bokeh.palettes import Reds\n",
2023-03-29 15:17:07 +00:00
"from bokeh.models import LinearColorMapper, ColorBar\n",
2023-03-29 14:08:27 +00:00
"from shapely import Point\n",
2023-03-29 15:17:07 +00:00
"from bokeh.models import Title\n",
2023-03-29 13:29:39 +00:00
"\n",
"def figure_flights(doc):\n",
2023-03-29 15:17:07 +00:00
" palette = Reds[8]\n",
2023-03-29 13:29:39 +00:00
" palette = palette[::-1]\n",
"\n",
" color_mapper = LinearColorMapper(palette = palette, low = 0, high = 600)\n",
2023-03-29 15:17:07 +00:00
" \n",
2023-03-29 13:29:39 +00:00
" color_bar = ColorBar(color_mapper = color_mapper, \n",
" width = 20, height = 600,\n",
" label_standoff = 8,\n",
" location = (0,0))\n",
"\n",
" \n",
2023-03-29 15:17:07 +00:00
" p = figure(title = '', \n",
2023-03-29 13:29:39 +00:00
" frame_height = 600,\n",
" frame_width = 800, \n",
" toolbar_location = 'below',\n",
" tools = \"pan, wheel_zoom, box_zoom, reset\")\n",
"\n",
2023-03-29 15:17:07 +00:00
" geo_ds = GeoJSONDataSource()\n",
2023-03-29 14:08:27 +00:00
" \n",
2023-03-29 13:29:39 +00:00
" plotted_districts = p.patches('xs','ys', source = geo_ds,\n",
" line_color = 'black', \n",
" line_width = 0.25)\n",
2023-03-29 15:17:07 +00:00
" \n",
" p.add_layout(Title(text=\"WARNING: color scale dynamically changes according to selected country\", \\\n",
" text_font_style=\"italic\"), 'above')\n",
2023-03-29 13:29:39 +00:00
"\n",
" p.patches(\"xs\",\"ys\", source = geo_ds,\n",
2023-03-29 15:17:07 +00:00
" fill_color = {\"field\" : \"flights\",\n",
2023-03-29 13:29:39 +00:00
" \"transform\" : color_mapper},\n",
" line_color = \"gray\", \n",
" line_width = 0.25, \n",
" fill_alpha = 1)\n",
" \n",
2023-03-29 14:08:27 +00:00
" p.xgrid.grid_line_color = None\n",
" p.ygrid.grid_line_color = None\n",
" p.axis.visible = False\n",
" \n",
2023-03-29 15:17:07 +00:00
" ht = HoverTool(renderers = [plotted_districts])\n",
" p.add_tools(ht)\n",
" \n",
" def set_gdf_as_datasource(gdf):\n",
" if gdf is None or len(gdf) == 0:\n",
" geo_ds.geojson = yurop.to_json()\n",
" ht.tooltips = [(\"Country\", \"@NAME\")]\n",
" color_mapper.high = 600\n",
" p.title.text = '# flights to each country: click to select country of departure'\n",
" else:\n",
" routes_from_country = gdf.iloc[0, 2:].to_frame(name='flights')\n",
" gdf_country_flights = yurop.set_index('NAME').loc[:, ['geometry']] \\\n",
" .join(routes_from_country) \\\n",
" .reset_index(names='NAME')\n",
"\n",
" country = gdf.iloc[0, :]['NAME']\n",
" max_flights = gdf_country_flights['flights'].max()\n",
" \n",
" geo_ds.geojson = gdf_country_flights.to_json()\n",
" ht.tooltips = [(\"Country\", \"@NAME\"), (f\"# flights from {country}\", \"@flights\")]\n",
" p.title.text = f\"# flights to each country from {country}: click to select another country of departure\"\n",
" \n",
" # The max value of the colorscale is dynamically computed to the ceiling of the first two significant \n",
" # digits of the max value\n",
" color_mapper.high = np.round(np.power(10, np.ceil(np.log10(max_flights) * 10) / 10) / 10) * 10\n",
" \n",
" set_gdf_as_datasource(None)\n",
2023-03-29 13:29:39 +00:00
"\n",
"    def on_tap(event):\n",
"        # Find the country whose geometry contains the clicked coordinates\n",
"        intersects = yurop[yurop.intersects(Point(event.x, event.y))]\n",
"        set_gdf_as_datasource(intersects)\n",
"    \n",
"    # Tap is a plot-level event: the handler is registered directly on the figure\n",
"    p.on_event(Tap, on_tap)\n",
" \n",
" p.add_layout(color_bar, \"right\")\n",
" doc.add_root(p)\n",
"\n",
"handler = FunctionHandler(figure_flights)\n",
"app = Application(handler)\n",
"\n",
"show(app)"
]
},
{
"cell_type": "markdown",
"id": "9b9c5983",
"metadata": {},
"source": [
"## Datasets description\n",
"\n",
"### **Used Cars**\n",
"\n",
"Please find the dataset in the datasets folder.\n",
"\n",
"This dataset is scraped from Ebay. The content of the dataset is in German, but it should not impose critical issues in understanding the data. The fields included in the dataset are as following:\n",
"\n",
"**dateCrawled**: when this ad was first crawled, all field-values are taken from this date\\\n",
"**name**: ”name” of the car\\\n",
"**seller**: private or dealer\\\n",
"**offerTypeprice**: the price in euro on the ad to sell the car\\\n",
"**abtestvehicleTypeyearOfRegistration** : at which year the car was first registered\\\n",
"**gearboxpowerPS**: power of the car in PS\\\n",
"**modelkilometer**: how many kilometers the car has driven\\\n",
"**monthOfRegistration**: at which month the car was first registered\\\n",
"**fuelType**: vehicle fuel type\\\n",
"**brand**: vehicle brand\\\n",
"**notRepairedDamage**: if the car has a damage which is not repaired yet\\\n",
"**dateCreated**: the date for which the ad at ebay was created\\\n",
"**nrOfPictures**: number of pictures in the ad\\\n",
"**postalCodelastSeenOnline**: when the crawler saw this ad last online\n",
"\n",
"### **Airports, Routes and Ariports Delays**\n",
"\n",
"Please find the datasets in the datasets folder.\n",
"\n",
"The datasets used in this section can be found in the datasets folder.\n",
"Datasets description are as follows.\n",
"\n",
"### **Airports**\n",
"\n",
"As of January 2017, the OpenFlights Airports Database contains over 10,000 airports, train stations and ferry terminals spanning the globe, as shown in the map above. Each entry contains the following information:\n",
"\n",
"**Airport ID**: Unique OpenFlights identifier for this airport\\\n",
"**Name**: Name of airport. May or may not contain the City name\\\n",
"**City**: Main city served by airport. May be spelled differently from Name\\\n",
"**Country**: Country or territory where airport is located. See Countries to cross-reference to ISO 3166-1 codes\\\n",
"**IATA**: 3-letter IATA code. Null if not assigned/unknown\\\n",
"**ICAO**: 4-letter ICAO code. Null if not assigned/unknown\\\n",
"**Latitude**: Decimal degrees, usually to six significant digits. Negative is South, positive is North\\\n",
"**Longitude**: Decimal degrees, usually to six significant digits. Negative is West, positive is East\\\n",
"**Altitude**: In feet\\\n",
"**Timezone**: Hours offset from UTC. Fractional hours are expressed as decimals, eg. India is 5.5\\\n",
"**DST**: Daylight savings time. One of E (Europe), A (US/Canada), S (South America), O (Australia), Z (New Zealand), N (None) or U (Unknown)\\\n",
"**Tz database time zone**: Timezone in \"tz\" (Olson) format, eg. \"America/Los_Angeles\"\\\n",
"**Type**: Type of the airport. Value \"airport\" for air terminals\\\n",
"**Source**: Source of the data. \"OurAirports\" for data sourced from OurAirports\n",
"\n",
"### **Airports Delays**\n",
"**Airport ID**: Unique OpenFlights identifier for this airport\\\n",
"**Name**: Name of airport. May or may not contain the City name\\\n",
"**City**: Main city served by airport. May be spelled differently from Name\\\n",
"**Country**: Country or territory where airport is located. See Countries to cross-reference to ISO 3166-1 codes\\\n",
"**IATA**: 3-letter IATA code. Null if not assigned/unknown\\\n",
"**ICAO**: 4-letter ICAO code. Null if not assigned/unknown\\\n",
"**Latitude**: Decimal degrees, usually to six significant digits. Negative is South, positive is North\\\n",
"**Longitude**: Decimal degrees, usually to six significant digits. Negative is West, positive is East\\\n",
"**Altitude**: In feet\\\n",
"**Timezone**: Hours offset from UTC. Fractional hours are expressed as decimals, eg. India is 5.5\\\n",
"**DST**: Daylight savings time. One of E (Europe), A (US/Canada), S (South America), O (Australia), Z (New Zealand), N (None) or U (Unknown)\\\n",
"**Tz database time zone**: Timezone in \"tz\" (Olson) format, eg. \"America/Los_Angeles\"\\\n",
"**Type**: Type of the airport. Value \"airport\" for air terminals\\\n",
"**Source**: Source of the data. \"OurAirports\" for data sourced from OurAirports\\\n",
"**Flights planned**: The number of fligths the related airport planned\\\n",
"**Flights cancelled**: The number of flights cancelled\\\n",
"**Flights delayed**: The number of flights delayed\\\n",
"**Delay duration**: The delay duration (in minutes)\n",
"\n",
"\n",
"### **Routes**\n",
"\n",
"As of June 2014, the OpenFlights/Airline Route Mapper Route Database contains 67663 routes between 3321 airports on 548 airlines spanning the globe, as shown in the map above. Each entry contains the following information:\n",
"\n",
"**Airline**: 2-letter (IATA) or 3-letter (ICAO) code of the airline\\\n",
"**Airline ID**: Unique OpenFlights identifier for airline (see Airline)\\\n",
"**Source airport**: 3-letter (IATA) or 4-letter (ICAO) code of the source airport\\\n",
"**Source airport ID**: Unique OpenFlights identifier for source airport (see Airport)\\\n",
"**Destination airport**: 3-letter (IATA) or 4-letter (ICAO) code of the destination airport\\\n",
"**Destination airport ID**: Unique OpenFlights identifier for destination airport (see Airport)\\\n",
"**Codeshare**: \"Y\" if this flight is a codeshare (that is, not operated by Airline, but another carrier), empty otherwise\\\n",
"**Stops**: Number of stops on this flight (\"0\" for direct)\\\n",
"**Equipment**: 3-letter codes for plane type(s) generally used on this flight, separated by spaces\\\n",
"The data is UTF-8 encoded. The special value \\N is used for \"NULL\" to indicate that no value is available, and is understood automatically by MySQL if imported\n",
"\n",
"\n",
"<aside>\n",
"💡 Notes:\n",
"\n",
"- Routes are directional: if an airline operates services from A to B and from B to A, both A-B and B-A are listed separately.\n",
"- Routes where one carrier operates both its own and codeshare flights are listed only once.\n",
"</aside>\n",
"\n",
"\n",
"### **Countries**\n",
"\n",
"Please find the dataset in the datasets folder.\n",
"\n",
"This dataset contains the information related to European countries.\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.10"
}
},
"nbformat": 4,
"nbformat_minor": 5
}