This repository has been archived on 2024-10-22. You can view files and clone it, but cannot push or open issues or pull requests.
va/Assignment1/Assignment1.ipynb

{
"cells": [
{
"cell_type": "markdown",
"id": "46302376",
"metadata": {},
"source": [
"# S&DE Atelier - Visual Analytics\n",
"\n",
"# Assignment 1\n",
"\n",
"**Due** April 6, 2023 @23:55 \n",
"\n",
"**Contacts**: marco.dambros@usi.ch - carmen.armenti@usi.ch\n",
"\n",
"---\n",
"\n",
"The goal of this assignment is to use Python and Jupyter notebooks to explore, analyze and visualize the datasets provided. To solve the assignment you should apply the knowledge you gained from the theoretical and practical lectures. In particular, when creating tabular or graphical representations you should apply the principles you learned in the theoretical lectures and use the technologies presented during the practical lectures. As for the visualization library, we suggest using the libraries presented in class (Seaborn, Matplotlib, Bokeh), but other libraries (e.g., plotly) may also be used. You should submit a Jupyter notebook (named `SurenameName_Assignment1.ipynb`) that contains your solutions and the steps followed to arrive at these solutions. Please follow the structure of the assignment to solve the exercises.\n",
"\n",
"The datasets you need to use are described in the **Datasets description** section."
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "fcf3beb9",
"metadata": {},
"outputs": [],
"source": [
"#%pip install pandas seaborn matplotlib bokeh ftfy\n",
"\n",
"import pandas as pd\n",
"import numpy as np\n",
"import seaborn as sns\n",
"import matplotlib.pyplot as plt\n",
"import bokeh\n",
"import ftfy\n",
"import matplotlib as mpl"
]
},
{
"cell_type": "markdown",
"id": "6f271000",
"metadata": {},
"source": [
"## Exercise 1 - Data quality (15 points) 🧼\n",
"\n",
"In the Used Cars dataset identify the missing and invalid values for the columns: `vehicle type`, `price`, `brand`, and `month of registration`. If needed, standardize the information and convert them to unique values. Please specify for each column the number of missing or invalid instances. The prices are in euros and the range of accepted prices is between €1'000 and €100'000.\n",
"Once you identified missing/invalid values for the given columns, remove all rows where one or more columns have invalid/missing data.\n",
"Show the steps that you follow to reach the solution. You can choose your preferred approach/technology to clean the dataset (e.g., Python vanilla, Pandas, OpenRefine). "
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "a0af6847",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"('Ü', 'sloppy-windows-1252')"
]
},
"execution_count": 2,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# UTF-8 decoding fails thanks to this byte\n",
"ftfy.guess_bytes(b'\\xDC')"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "22ce9426",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>dateCrawled</th>\n",
" <th>name</th>\n",
" <th>seller</th>\n",
" <th>offerType</th>\n",
" <th>price</th>\n",
" <th>abtest</th>\n",
" <th>vehicleType</th>\n",
" <th>yearOfRegistration</th>\n",
" <th>gearbox</th>\n",
" <th>powerPS</th>\n",
" <th>model</th>\n",
" <th>kilometer</th>\n",
" <th>monthOfRegistration</th>\n",
" <th>fuelType</th>\n",
" <th>brand</th>\n",
" <th>notRepairedDamage</th>\n",
" <th>dateCreated</th>\n",
" <th>nrOfPictures</th>\n",
" <th>postalCode</th>\n",
" <th>lastSeen</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>2016-03-24 11:52:17</td>\n",
" <td>Golf_3_1.6</td>\n",
" <td>privat</td>\n",
" <td>Angebot</td>\n",
" <td>480</td>\n",
" <td>test</td>\n",
" <td>NaN</td>\n",
" <td>1993</td>\n",
" <td>manuell</td>\n",
" <td>0</td>\n",
" <td>golf</td>\n",
" <td>150000</td>\n",
" <td>0</td>\n",
" <td>benzin</td>\n",
" <td>volkswagen</td>\n",
" <td>NaN</td>\n",
" <td>2016-03-24 00:00:00</td>\n",
" <td>0</td>\n",
" <td>70435</td>\n",
" <td>2016-04-07 03:16:57</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2016-03-24 10:58:45</td>\n",
" <td>A5_Sportback_2.7_Tdi</td>\n",
" <td>privat</td>\n",
" <td>Angebot</td>\n",
" <td>18300</td>\n",
" <td>test</td>\n",
" <td>coupe</td>\n",
" <td>2011</td>\n",
" <td>manuell</td>\n",
" <td>190</td>\n",
" <td>NaN</td>\n",
" <td>125000</td>\n",
" <td>5</td>\n",
" <td>diesel</td>\n",
" <td>audi</td>\n",
" <td>ja</td>\n",
" <td>2016-03-24 00:00:00</td>\n",
" <td>0</td>\n",
" <td>66954</td>\n",
" <td>2016-04-07 01:46:50</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>2016-03-14 12:52:21</td>\n",
" <td>Jeep_Grand_Cherokee_\"Overland\"</td>\n",
" <td>privat</td>\n",
" <td>Angebot</td>\n",
" <td>9800</td>\n",
" <td>test</td>\n",
" <td>suv</td>\n",
" <td>2004</td>\n",
" <td>automatik</td>\n",
" <td>163</td>\n",
" <td>grand</td>\n",
" <td>125000</td>\n",
" <td>8</td>\n",
" <td>diesel</td>\n",
" <td>jeep</td>\n",
" <td>NaN</td>\n",
" <td>2016-03-14 00:00:00</td>\n",
" <td>0</td>\n",
" <td>90480</td>\n",
" <td>2016-04-05 12:47:46</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>2016-03-17 16:54:04</td>\n",
" <td>GOLF_4_1_4__3TÜRER</td>\n",
" <td>privat</td>\n",
" <td>Angebot</td>\n",
" <td>1500</td>\n",
" <td>test</td>\n",
" <td>kleinwagen</td>\n",
" <td>2001</td>\n",
" <td>manuell</td>\n",
" <td>75</td>\n",
" <td>golf</td>\n",
" <td>150000</td>\n",
" <td>6</td>\n",
" <td>benzin</td>\n",
" <td>volkswagen</td>\n",
" <td>nein</td>\n",
" <td>2016-03-17 00:00:00</td>\n",
" <td>0</td>\n",
" <td>91074</td>\n",
" <td>2016-03-17 17:40:17</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>2016-03-31 17:25:20</td>\n",
" <td>Skoda_Fabia_1.4_TDI_PD_Classic</td>\n",
" <td>privat</td>\n",
" <td>Angebot</td>\n",
" <td>3600</td>\n",
" <td>test</td>\n",
" <td>kleinwagen</td>\n",
" <td>2008</td>\n",
" <td>manuell</td>\n",
" <td>69</td>\n",
" <td>fabia</td>\n",
" <td>90000</td>\n",
" <td>7</td>\n",
" <td>diesel</td>\n",
" <td>skoda</td>\n",
" <td>nein</td>\n",
" <td>2016-03-31 00:00:00</td>\n",
" <td>0</td>\n",
" <td>60437</td>\n",
" <td>2016-04-06 10:17:21</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" dateCrawled name seller offerType \\\n",
"0 2016-03-24 11:52:17 Golf_3_1.6 privat Angebot \n",
"1 2016-03-24 10:58:45 A5_Sportback_2.7_Tdi privat Angebot \n",
"2 2016-03-14 12:52:21 Jeep_Grand_Cherokee_\"Overland\" privat Angebot \n",
"3 2016-03-17 16:54:04 GOLF_4_1_4__3TÜRER privat Angebot \n",
"4 2016-03-31 17:25:20 Skoda_Fabia_1.4_TDI_PD_Classic privat Angebot \n",
"\n",
" price abtest vehicleType yearOfRegistration gearbox powerPS model \\\n",
"0 480 test NaN 1993 manuell 0 golf \n",
"1 18300 test coupe 2011 manuell 190 NaN \n",
"2 9800 test suv 2004 automatik 163 grand \n",
"3 1500 test kleinwagen 2001 manuell 75 golf \n",
"4 3600 test kleinwagen 2008 manuell 69 fabia \n",
"\n",
" kilometer monthOfRegistration fuelType brand notRepairedDamage \\\n",
"0 150000 0 benzin volkswagen NaN \n",
"1 125000 5 diesel audi ja \n",
"2 125000 8 diesel jeep NaN \n",
"3 150000 6 benzin volkswagen nein \n",
"4 90000 7 diesel skoda nein \n",
"\n",
" dateCreated nrOfPictures postalCode lastSeen \n",
"0 2016-03-24 00:00:00 0 70435 2016-04-07 03:16:57 \n",
"1 2016-03-24 00:00:00 0 66954 2016-04-07 01:46:50 \n",
"2 2016-03-14 00:00:00 0 90480 2016-04-05 12:47:46 \n",
"3 2016-03-17 00:00:00 0 91074 2016-03-17 17:40:17 \n",
"4 2016-03-31 00:00:00 0 60437 2016-04-06 10:17:21 "
]
},
"execution_count": 3,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Reading using windows-1252 works\n",
"df_used = pd.read_csv(\"./datasets/used_cars_dataset.csv\", encoding=\"windows-1252\")\n",
"df_used.head()"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "a332b6a5",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{'dateCrawled': ['str'],\n",
" 'name': ['str'],\n",
" 'seller': ['str'],\n",
" 'offerType': ['str'],\n",
" 'price': ['int64'],\n",
" 'abtest': ['str'],\n",
" 'vehicleType': ['str', 'nan'],\n",
" 'yearOfRegistration': ['int64'],\n",
" 'gearbox': ['str', 'nan'],\n",
" 'powerPS': ['int64'],\n",
" 'model': ['str', 'nan'],\n",
" 'kilometer': ['int64'],\n",
" 'monthOfRegistration': ['int64'],\n",
" 'fuelType': ['str', 'nan'],\n",
" 'brand': ['str'],\n",
" 'notRepairedDamage': ['str', 'nan'],\n",
" 'dateCreated': ['str'],\n",
" 'nrOfPictures': ['int64'],\n",
" 'postalCode': ['int64'],\n",
" 'lastSeen': ['str']}"
]
},
"execution_count": 4,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# Here I check the types and the presence of missing values for each column\n",
"types = {}\n",
"\n",
"for col in df_used.columns:\n",
" t = set([type(x).__name__ if type(x) != float or not np.isnan(x) else 'nan' for x in df_used[col].unique()])\n",
" types[col] = list(t)\n",
"\n",
"types"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "11bfa9a2",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"dateCrawled: []\n",
"name: []\n",
"seller: []\n",
"offerType: []\n",
"price: []\n",
"abtest: []\n",
"vehicleType: []\n",
"yearOfRegistration: []\n",
"gearbox: []\n",
"powerPS: []\n",
"model: []\n",
"kilometer: []\n",
"monthOfRegistration: []\n",
"fuelType: []\n",
"brand: []\n",
"notRepairedDamage: []\n",
"dateCreated: []\n",
"nrOfPictures: []\n",
"postalCode: []\n",
"lastSeen: []\n"
]
}
],
"source": [
"# Here I check for numeric values that have decimal digits (i.e. that are not integers).\n",
"for col in df_used.columns:\n",
" print(f\"{col}: {str([x for x in df_used[col].unique() if type(x) == float and not np.isnan(x) and round(x) != x])}\")\n",
"\n",
"# As shown, there are none, therefore we can use the Int64 dtype in numeric columns"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "f1c539c4",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"dateCrawled: False\n",
"name: False\n",
"seller: False\n",
"offerType: False\n",
"price: False\n",
"abtest: False\n",
"vehicleType: False\n",
"yearOfRegistration: False\n",
"gearbox: False\n",
"powerPS: False\n",
"model: False\n",
"kilometer: False\n",
"monthOfRegistration: False\n",
"fuelType: False\n",
"brand: False\n",
"notRepairedDamage: False\n",
"dateCreated: False\n",
"nrOfPictures: False\n",
"postalCode: False\n",
"lastSeen: False\n"
]
}
],
"source": [
"# Here I check if any column is unique to find potential candidates for the index\n",
"for col in df_used.columns:\n",
" print(f\"{col}: {df_used[col].is_unique}\")\n",
"\n",
"# None are unique, so I use the default numeric index"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "86074e70",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>dateCrawled</th>\n",
" <th>name</th>\n",
" <th>seller</th>\n",
" <th>offerType</th>\n",
" <th>price</th>\n",
" <th>abtest</th>\n",
" <th>vehicleType</th>\n",
" <th>yearOfRegistration</th>\n",
" <th>gearbox</th>\n",
" <th>powerPS</th>\n",
" <th>model</th>\n",
" <th>kilometer</th>\n",
" <th>monthOfRegistration</th>\n",
" <th>fuelType</th>\n",
" <th>brand</th>\n",
" <th>notRepairedDamage</th>\n",
" <th>dateCreated</th>\n",
" <th>nrOfPictures</th>\n",
" <th>postalCode</th>\n",
" <th>lastSeen</th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>0</th>\n",
" <td>2016-03-24 11:52:17</td>\n",
" <td>Golf_3_1.6</td>\n",
" <td>privat</td>\n",
" <td>Angebot</td>\n",
" <td>480</td>\n",
" <td>test</td>\n",
" <td>&lt;NA&gt;</td>\n",
" <td>1993</td>\n",
" <td>manuell</td>\n",
" <td>0</td>\n",
" <td>golf</td>\n",
" <td>150000</td>\n",
" <td>0</td>\n",
" <td>benzin</td>\n",
" <td>volkswagen</td>\n",
" <td>&lt;NA&gt;</td>\n",
" <td>2016-03-24 00:00:00</td>\n",
" <td>0</td>\n",
" <td>70435</td>\n",
" <td>2016-04-07 03:16:57</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1</th>\n",
" <td>2016-03-24 10:58:45</td>\n",
" <td>A5_Sportback_2.7_Tdi</td>\n",
" <td>privat</td>\n",
" <td>Angebot</td>\n",
" <td>18300</td>\n",
" <td>test</td>\n",
" <td>coupe</td>\n",
" <td>2011</td>\n",
" <td>manuell</td>\n",
" <td>190</td>\n",
" <td>&lt;NA&gt;</td>\n",
" <td>125000</td>\n",
" <td>5</td>\n",
" <td>diesel</td>\n",
" <td>audi</td>\n",
" <td>ja</td>\n",
" <td>2016-03-24 00:00:00</td>\n",
" <td>0</td>\n",
" <td>66954</td>\n",
" <td>2016-04-07 01:46:50</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2</th>\n",
" <td>2016-03-14 12:52:21</td>\n",
" <td>Jeep_Grand_Cherokee_\"Overland\"</td>\n",
" <td>privat</td>\n",
" <td>Angebot</td>\n",
" <td>9800</td>\n",
" <td>test</td>\n",
" <td>suv</td>\n",
" <td>2004</td>\n",
" <td>automatik</td>\n",
" <td>163</td>\n",
" <td>grand</td>\n",
" <td>125000</td>\n",
" <td>8</td>\n",
" <td>diesel</td>\n",
" <td>jeep</td>\n",
" <td>&lt;NA&gt;</td>\n",
" <td>2016-03-14 00:00:00</td>\n",
" <td>0</td>\n",
" <td>90480</td>\n",
" <td>2016-04-05 12:47:46</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3</th>\n",
" <td>2016-03-17 16:54:04</td>\n",
" <td>GOLF_4_1_4__3TÜRER</td>\n",
" <td>privat</td>\n",
" <td>Angebot</td>\n",
" <td>1500</td>\n",
" <td>test</td>\n",
" <td>kleinwagen</td>\n",
" <td>2001</td>\n",
" <td>manuell</td>\n",
" <td>75</td>\n",
" <td>golf</td>\n",
" <td>150000</td>\n",
" <td>6</td>\n",
" <td>benzin</td>\n",
" <td>volkswagen</td>\n",
" <td>nein</td>\n",
" <td>2016-03-17 00:00:00</td>\n",
" <td>0</td>\n",
" <td>91074</td>\n",
" <td>2016-03-17 17:40:17</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4</th>\n",
" <td>2016-03-31 17:25:20</td>\n",
" <td>Skoda_Fabia_1.4_TDI_PD_Classic</td>\n",
" <td>privat</td>\n",
" <td>Angebot</td>\n",
" <td>3600</td>\n",
" <td>test</td>\n",
" <td>kleinwagen</td>\n",
" <td>2008</td>\n",
" <td>manuell</td>\n",
" <td>69</td>\n",
" <td>fabia</td>\n",
" <td>90000</td>\n",
" <td>7</td>\n",
" <td>diesel</td>\n",
" <td>skoda</td>\n",
" <td>nein</td>\n",
" <td>2016-03-31 00:00:00</td>\n",
" <td>0</td>\n",
" <td>60437</td>\n",
" <td>2016-04-06 10:17:21</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" dateCrawled name seller offerType \\\n",
"0 2016-03-24 11:52:17 Golf_3_1.6 privat Angebot \n",
"1 2016-03-24 10:58:45 A5_Sportback_2.7_Tdi privat Angebot \n",
"2 2016-03-14 12:52:21 Jeep_Grand_Cherokee_\"Overland\" privat Angebot \n",
"3 2016-03-17 16:54:04 GOLF_4_1_4__3TÜRER privat Angebot \n",
"4 2016-03-31 17:25:20 Skoda_Fabia_1.4_TDI_PD_Classic privat Angebot \n",
"\n",
" price abtest vehicleType yearOfRegistration gearbox powerPS model \\\n",
"0 480 test <NA> 1993 manuell 0 golf \n",
"1 18300 test coupe 2011 manuell 190 <NA> \n",
"2 9800 test suv 2004 automatik 163 grand \n",
"3 1500 test kleinwagen 2001 manuell 75 golf \n",
"4 3600 test kleinwagen 2008 manuell 69 fabia \n",
"\n",
" kilometer monthOfRegistration fuelType brand notRepairedDamage \\\n",
"0 150000 0 benzin volkswagen <NA> \n",
"1 125000 5 diesel audi ja \n",
"2 125000 8 diesel jeep <NA> \n",
"3 150000 6 benzin volkswagen nein \n",
"4 90000 7 diesel skoda nein \n",
"\n",
" dateCreated nrOfPictures postalCode lastSeen \n",
"0 2016-03-24 00:00:00 0 70435 2016-04-07 03:16:57 \n",
"1 2016-03-24 00:00:00 0 66954 2016-04-07 01:46:50 \n",
"2 2016-03-14 00:00:00 0 90480 2016-04-05 12:47:46 \n",
"3 2016-03-17 00:00:00 0 91074 2016-03-17 17:40:17 \n",
"4 2016-03-31 00:00:00 0 60437 2016-04-06 10:17:21 "
]
},
"execution_count": 7,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"# I read the dataset again using the information about the column types I found\n",
"df_used = pd.read_csv(\"./datasets/used_cars_dataset.csv\", encoding=\"windows-1252\", dtype={\n",
" 'dateCrawled': str,\n",
" 'name': pd.StringDtype(),\n",
" 'seller': pd.StringDtype(),\n",
" 'offerType': pd.StringDtype(),\n",
" 'price': pd.Int64Dtype(),\n",
" 'abtest': pd.StringDtype(),\n",
" 'vehicleType': pd.StringDtype(),\n",
" 'yearOfRegistration': pd.Int64Dtype(),\n",
" 'gearbox': pd.StringDtype(),\n",
" 'powerPS': pd.Int64Dtype(),\n",
" 'model': pd.StringDtype(),\n",
" 'kilometer': pd.Int64Dtype(),\n",
" 'monthOfRegistration': pd.Int64Dtype(),\n",
" 'fuelType': pd.StringDtype(),\n",
" 'brand': pd.StringDtype(),\n",
" 'notRepairedDamage': pd.StringDtype(),\n",
" 'dateCreated': pd.StringDtype(),\n",
" 'nrOfPictures': pd.Int64Dtype(),\n",
" 'postalCode': pd.Int64Dtype(),\n",
" 'lastSeen': pd.StringDtype()\n",
"})\n",
"df_used.head()"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "6a3f2455",
"metadata": {},
"source": [
"From here onwards, I investigate the missing and invalid values. If I find any invalid values, I replace them with `<NA>` so that they are encoded the same way as missing values. This makes it easy to count and drop them all in one go."
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "8b6f9ce3",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"vehicleType: [<NA>, 'andere', 'bus', 'cabrio', 'coupe', 'kleinwagen', 'kombi', 'limousine', 'suv']\n",
"brand: ['BMW', 'alfa_romeo', 'audi', 'bmw', 'bmw ', 'chevrolet', 'chrysler', 'citroen', 'dacia', 'daewoo', 'daihatsu', 'fiat', 'ford', 'honda', 'hyundai', 'jaguar', 'jeep', 'kia', 'lada', 'lancia', 'land_rover', 'mazda', 'mercedes_benz', 'mini', 'mitsubishi', 'nissan', 'opel', 'peugeot', 'porsche', 'renault', 'rover', 'saab', 'seat', 'skoda', 'smart', 'sonstige_autos', 'subaru', 'suzuki', 'toyota', 'trabant', 'volkswagen', 'volvo']\n",
"monthOfRegistration: [0, 1, 10, 11, 12, 2, 3, 4, 5, 6, 7, 8, 9]\n"
]
}
],
"source": [
"# I look at the values of the indicated columns to find odd values. \n",
"# Indeed, some brand values use mixed case (BMW) and spaces ('bmw '). Additionally, \n",
"# a month of registration = 0 does not make sense when the other values are in the\n",
"# 1-12 range.\n",
"cols = [\"vehicleType\", \"brand\", \"monthOfRegistration\"]\n",
"\n",
"def print_col(col: str):\n",
" print(f\"{col}: {str(sorted(df_used[col].unique(), key=lambda x: str(x)))}\")\n",
"\n",
"for col in cols:\n",
" print_col(col)\n"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "98f8d101",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"brand: ['alfa_romeo', 'audi', 'bmw', 'chevrolet', 'chrysler', 'citroen', 'dacia', 'daewoo', 'daihatsu', 'fiat', 'ford', 'honda', 'hyundai', 'jaguar', 'jeep', 'kia', 'lada', 'lancia', 'land_rover', 'mazda', 'mercedes_benz', 'mini', 'mitsubishi', 'nissan', 'opel', 'peugeot', 'porsche', 'renault', 'rover', 'saab', 'seat', 'skoda', 'smart', 'sonstige_autos', 'subaru', 'suzuki', 'toyota', 'trabant', 'volkswagen', 'volvo']\n",
"monthOfRegistration: [1, 10, 11, 12, 2, 3, 4, 5, 6, 7, 8, 9, <NA>]\n"
]
}
],
"source": [
"# Some brands are written using mixed case or with spaces, hence here I normalize to stripped lowercase\n",
"df_used.brand = df_used.brand.apply(lambda x: x if type(x) is not str else x.lower().strip())\n",
"print_col(\"brand\")\n",
"\n",
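"# monthOfRegistration=0 is invalid, hence I mark it as NaN.\n",
"# Note: this assignment sets every column of the matching rows to NaN, not only monthOfRegistration\n",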
"df_used[df_used.monthOfRegistration == 0] = np.nan\n",
"print_col(\"monthOfRegistration\")"
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "f300f49d",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"notRepairedDamage: [<NA>, 'ja', 'nein']\n"
]
}
],
"source": [
"# This column only has 'ja' and 'nein' as non-missing values, we can convert it to a boolean\n",
"print_col(\"notRepairedDamage\")"
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "923c5354",
"metadata": {},
"outputs": [],
"source": [
"# Hence we map the column to boolean values\n",
"df_used.notRepairedDamage = df_used.notRepairedDamage.map({'ja': True, 'nein': False})"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "4b847b1f",
"metadata": {},
"outputs": [],
"source": [
"# Prices not in the 1000-100'000 range are invalid, hence I convert them to NaN\n",
"df_used.loc[(df_used.price.isna()) | (df_used.price < 1000) | (df_used.price > 100_000), \"price\"] = np.nan"
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "bf1f417d",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"dateCrawled 37675\n",
"name 37675\n",
"seller 37675\n",
"offerType 37675\n",
"price 101662\n",
"abtest 37675\n",
"vehicleType 60491\n",
"yearOfRegistration 37675\n",
"gearbox 47998\n",
"powerPS 37675\n",
"model 51550\n",
"kilometer 37675\n",
"monthOfRegistration 37675\n",
"fuelType 57286\n",
"brand 37675\n",
"notRepairedDamage 87440\n",
"dateCreated 37675\n",
"nrOfPictures 37675\n",
"postalCode 37675\n",
"lastSeen 37675\n",
"dtype: int64"
]
},
"execution_count": 13,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
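"# This reports the number of values in each column that are missing or invalid.\n",
"# Rows with monthOfRegistration == 0 were set to NaN in every column earlier, so each count includes them\n",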
"df_used.isna().sum()"
]
},
{
"cell_type": "code",
"execution_count": 14,
"id": "919e692f",
"metadata": {},
"outputs": [],
"source": [
"# Here I drop the rows with missing values and renumber the remaining rows with the automatic numeric index\n",
"df_used = df_used.dropna().reset_index(drop=True)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "47a3929f",
"metadata": {},
"source": [
"## Exercise 2 - Data analysis (20 points) 📊\n",
"\n",
"1. We consider the norm to be that, for a given type of vehicle, on average the price of diesel is greater than the one of benzine. Provide a representation of the data which shows if, and to which extent, the various vehicle types conform to the norm.\n",
"What relationship are you showing? Please justify the choice of the representation and your answer.\n",
"2. Find an appropriate way to show and compare the range of prices for the following `brand`: **mercedes_benz**, **fiat**, **volvo**, **alfa_romeo** and **lancia**. Create a suitable graphical representation of this data. What kind of relationship are you showing? Describe what can be understood from the plot. Please justify your answer and your choice of the graphical representation.\n",
"\n",
"<aside>\n",
"💡 N.B. In this section you should work on the clean Used Cars dataset, without the missing and invalid data.\n",
"</aside>"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "e2ae928d",
"metadata": {},
"source": [
"### 2.1\n",
"\n",
"By interpreting the following requirement:\n",
"\n",
"> on average the price of diesel is greater than the one of benzine\n",
"\n",
"as meaning that we expect the average price of diesel cars to be greater than the average price of _benzin_ cars for each car type, I choose to represent the relationship between each car type and the difference of these average values (i.e. $y=E({\\text{diesel}}) - E({\\text{benzin}})$, where a positive value of $y$ would confirm the expectation).\n",
"\n",
"To represent this relationship I choose to use a simple bar chart plotting these differences. I choose to plot a single series for the difference instead of both series for both fuel types to further focus the reader on the difference and not the values. Additionally, plotting the difference only makes comparing the difference value between car types easier as they are all aligned with the origin."
]
},
{
"cell_type": "code",
"execution_count": 15,
"id": "7cc5c90f",
"metadata": {},
"outputs": [
{
"data": {
"image/png": "(base64-encoded PNG omitted: bar chart of avg(diesel) - avg(benzin) price per vehicle type)",
"text/plain": [
"<Figure size 900x600 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"df_diff = df_used \\\n",
" .loc[(df_used.fuelType == 'benzin') | (df_used.fuelType == 'diesel'), ['vehicleType', 'fuelType', 'price']]\\\n",
" .groupby(['vehicleType', 'fuelType']) \\\n",
" .mean() \\\n",
" .sort_values(['vehicleType', 'fuelType'], ascending=[True, True]) \\\n",
" .groupby('vehicleType') \\\n",
" .diff() \\\n",
" .reset_index() \\\n",
" .set_index('vehicleType')\n",
"\n",
"df_diff = df_diff.loc[df_diff.fuelType == 'diesel', ['price']].rename({'price': 'diffPrice'}, axis=1).reset_index()\n",
"\n",
"sns.set_theme(palette=\"hls\")\n",
"\n",
"# Initialize the matplotlib figure\n",
"f, ax = plt.subplots(figsize=(9, 6))\n",
"\n",
"# Plot the difference between the average diesel and benzin prices\n",
"sns.set_color_codes(\"pastel\")\n",
"sns.barplot(x=\"vehicleType\", y=\"diffPrice\", data=df_diff,\n",
" label=\"avg(diesel) - avg(benzin)\", color=sns.xkcd_rgb[\"ultramarine\"])\n",
"\n",
"# Add a legend and informative axis label\n",
"ax.legend(ncol=2, loc=\"lower right\", frameon=True)\n",
"ax.set(ylabel=\"Diesel - benzin difference\", ylim=[-1000, 6000], \n",
" xlabel=\"Vehicle type\")\n",
"sns.despine(left=True, bottom=True)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "7cc33371",
"metadata": {},
"source": [
"### 2.2\n",
"\n",
"To compare the range of prices between car brands, I choose to plot the distribution of car prices for each brand. To achieve this, I use a variant of the box plot called the boxen plot, which replaces the whiskers with additional boxes for the octiles, 16-tiles and so on, drawn like the inner quartile boxes but with exponentially smaller heights.\n",
"\n",
"From the plot we can see that the `mercedes_benz` brand has the highest median price, and it also has the most right-skewed price distribution of all the brands. `volvo` has the second-highest average and also a skewed price distribution. Both `lancia` and `fiat` are instead more uniformly distributed towards lower prices, while `alfa_romeo` has a similar distribution, though with some skew towards the expensive side. [`trabant`](https://www.youtube.com/watch?v=npMKIUTa3uI) is the cheapest brand.\n",
"\n",
"I choose a box-plot style graph as it is an effective way to show some salient characteristics of one-dimensional distributions, such as the median and the quartiles (25th and 75th percentiles). I choose a `boxenplot` in particular to better capture the right-skewness of some distributions through the additional percentiles given by the octiles (87.5%), 16-tiles (93.75%) and so on."
]
},
{
"cell_type": "code",
"execution_count": 16,
"id": "ca97e7c8",
"metadata": {},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAABh8AAAK5CAYAAACvwT+gAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjUuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8qNh9FAAAACXBIWXMAAA9hAAAPYQGoP6dpAACPU0lEQVR4nOzdfZxc890//vfszs5GgoSWSBtkbVB11xRFSaNReokK6qLqpq2o+1RJuQQlQktKg2Zbd4n0hss9bVD0q1Uafr0RpUrbhEgoJVTFXZKd3dn5/ZErm6zc7SZn58zsPJ+PRx52z5k5533ec2Yk5zXn88kUi8ViAAAAAAAAJKQm7QIAAAAAAICeRfgAAAAAAAAkSvgAAAAAAAAkSvgAAAAAAAAkSvgAAAAAAAAkSvgAAAAAAAAkSvgAAAAAAAAkSvgAAAAAAAAkKpt2AfQMxWIx/vOf96OtrZh2KampqcnEhhv2qfo+ROjFEvqwlF4spg9L6cVi+rCYPiylF4vpw1J6sZg+LKUXi+nDYvqwlF4spg9L6cVi+rCUXixWU5OJD31o3dLsqyR7ocfLZDJRU5NJu4xU1dRk9OH/6MVi+rCUXiymD0vpxWL6sJg+LKUXi+nDUnqxmD4spReL6cNi+rCUXiymD0vpxWL6sJReLFbK4xc+AAAAAAAAiRI+AAAAAAAAiRI+AAAAAAAAiRI+AAAAAAAAiRI+AAAAAAAAiRI+AAAAAAAAiRI+AAAAAAAAicqmXQB0RbFYjHw+n3YZK1Qo1MSiRbXR3Nwcra1tJd13LpeLTCZT0n0CAAAAAKyM8IGKUSwWo6lpYsyd+0LapZSdhobGGD16jAACAAAAACgLhl2iYuTz+W4PHtra2uLVV1+NV199NdraSnv3wtqYM2d22d4RAgAAAABUH3c+UJGO69s76rrhS/6tbW3x4Du5iIjYp1/vyNaUdz7XUoyY/PaCtMsAAAAAAOhA+EBFqstE1HXDEEOZTCZqM0v2kYls2Q9jVEy7AAAAAACA5ZT317oBAAAAAICKI3wAAAAAAAASJXwAAAAAAAASJXwAAAAAAAASJXwAAAAAAAASJXwAAAAAAAASJXwAAAAAAAASJXwAAAAAAAASJXwAAAAAAAASJXwAAAAAAAASJXwAAAAAAAASJXwAAAAAAAASJXwAAAAAAAASJXwAAAAAAAASJXwAAAAAAAASlU27APigYrEY+Xx+ueX5fHMK1VSOcupPoVATixbVRnNzc7S2tnVYl8vlIpPJpFQZAAAAAFAKwgfKSrFYjKamiTF37gtpl1Jxxo0bm3YJndLQ0BijR48RQAAAAABAD2bYJcpKPp8XPPRwc+bMXuGdLQAAAABAz+HOB8rWKQ2bRl3N0m/Ht7S1xY/mvJxiReXtlIaBUVdTvnliS1sxfjTnn2mXAQAAAACUgPCBslVXk4lcGV9MLzd1NTVl3q+21T8EAAAAAOgRyvlKJQAAAAAAUIGEDwAAAAAAQKKEDwAAAAAAQKKEDwAAAAAAQKKEDwAAAAAAQKKyaRcAERHFYjHy+Xzk881pl0IJVOrrnMvlIpPJpF0GAAAAAJQ94QOpKxaL0dQ0MebOfSHtUiiRcePGpl3CGmloaIzRo8cIIAAAAABgNQy7ROry+bzggYowZ87syOfzaZcBAAAAAGXPnQ+UlVOH7BCTnnw67TLoZmN2/kTkamvTLqPTWgptMXHGk2mXAQAAAAAVQ/hAWcnWuhmnGuRqaysqfAAAAAAAukb4QKoWTzRdmZMPU526cr4WCjWxaFFtNDc3R2trWzdWtWomygYAAACg1IQPpMZE01SiSpws20TZAFSiQqGQdgllo7O9KBQKUbuGd5euzXPLWU89LiqD8w+AameMG1JjomkoDRNlA1BpnntuVpx00knx/PPPpV1K6jrbi+efnxXnnXdmzJ7d9Z6tzXPLWU89LiqD8w8AetidD01NTfHzn/88HnrooYiIePbZZ+Oss8
6KuXPnxt577x0/+MEPUq6QZRWLxbRLgC47+/Ofi1y2Mj4684VCXPLAg4t/Tnl4s3IZgmplDE0FUD4KhULceuuNsXDhwrjllhvjzDO/XbXfHO5sLwqFQtx++02xaNGiuP32m7rUs7V5bjnrqcdFZXD+AcBilXEFbQ1dddVVkclk4t57741111037XJYRltbW1x99ZVplwFdlstmKyZ8WFYlDhdVSptv3hAnnPCNbg8gyj2EKaVy6IXQCcrT9OkPx+uvvx4REa+/Pi8effThGDZs71RrSsv06Z3rxfTpD8cbb6xZz9bmueVs+vSeeVxUhunTnX8AENHDw4d33nknPv7xj8egQYPSLoVlFIvFmDTp+/HPf76UdikAERHx4otz4pxzxqRdBiVWqtBpVcohhCkH+rBUufaiVGHd22/Pj/vvv7vDsvvuuyc+8Ymdom/fft2+/3LS2V6sTc96ar976nFRGZx/ALBUxYUPzz33XFxxxRXxxBNPxPvvvx8DBgyIo446Kr761a92eNzw4cPjlVdeiYiIX/ziF/Gzn/0shgwZEk1NTXH//ffHa6+9Fn369Ik99tgjzjvvvNhggw06tf+xY8fGe++9FwsWLIinnnoqTjjhhDjhhBPi4Ycfjquuuiqee+656NOnT3zhC1+I008/Perr6yMiYuutt46JEyfGbbfdFk899VT0798/zj777IiI+N73vhfz5s2LnXfeOS699NLYcMMNIyJi9uzZMWHChJgxY0b06dMndt111xg7dmxstNFGEbH4Vs4bbrghbr755vjXv/4VH/nIR+LYY4+Nww47LJFed5fm5uZ46aW5aZcBVeWCIw6PXF3FfeR3u3xra1zwv7ekXQYpETpB5zU0NMbo0WO6PYCYNu3O5SZXLhRa4+6774qjjx7VrfsuN53txdr0rKf2u6ceF5XB+QcAS1XUlaiFCxfGMcccE7vttlvcdNNNkc1m484774yLL744PvWpT3V47B133BEnn3xybLLJJnHuuedG375949JLL43f/OY3MWHChBg4cGA899xzcdZZZ8XVV18d55xzTqfrePDBB+PMM8+M8847L3r16hW//vWv4xvf+EaMHj06JkyYEC+++GJccMEF8corr0RTU1P7877zne/E+PHj4zvf+U5ccskl8a1vfSsGDx4cl112WSxYsCBOPfXUmDx5cpx11lkxb968OOKII2L//fePsWPHxsKFC6OpqSkOP/zwuOeee6J3794xYcKEmDZtWpx33nmx/fbbx2OPPRYXXnhhNDc3x9FHH51Y35PW0mLiWyi1XF026uvq0i6jrI0/6YTI6VFVaG5piQuuvjbtMqCizJkzO/L5fPsXa7rDc8/NjKeeemK55W1tbfHkkzNi9933jMGDt+q2/ZeTzvZibXrWU/vdU4+LyuD8A4COKi58+MpXvhJHHHFE+xwOo0ePjmuvvTZmzpzZ4bEbbrhh1NXVRa9evdrvFNh+++1j3333bQ8qPvrRj8aee+653HNXp2/fvvH1r3+9/fdTTz019tlnnzjllFMiImKLLbaIYrEYJ510UsyePTsaGxsjIuLggw+Oz3/+8xERcfjhh8dDDz0Up59+euywww4REbHHHnvErFmzIiLi5ptvjo033jjOP//89v1ceeWVsdtuu8UDDzwQ++67b9x8880xduzYOOCAAyIiYtCgQfHPf/4zrrnmmjjqqKPKdhzrurpc2iVA1cm3tKZdQlnKty7tyzgXowFWqqGhMXK57v073IwZf4xMJhPFYnG5dZlMJh5//A9Vc9Gus71Ym5711H731OOiMjj/AKCjigofNtxwwzjiiCPivvvui3/84x/x4osvxt///veIWPxNgtU58MAD4/e//31cfvnlMXfu3Jg9e3a88MILsfPOO3epjs0337zD77NmzYr999+/w7JddtklIiJmzpzZHj40NDS0r+/Vq1dERGy66abty+rr6yOfX3xXwN/+9reYPXt2DBkypMN2m5ub2+
tuaWmJnXbaqcP6nXfeOX784x/Hm2++GR/+8Ie7dFylUl9fH5ttNsjQS1BCF9xkaCH4oHKY8yGbrYl+/XrH/PkLymp
"text/plain": [
"<Figure size 1800x800 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"brands = ['mercedes_benz', 'fiat', 'volvo', 'alfa_romeo', 'lancia', 'trabant']\n",
"\n",
"df_price = df_used \\\n",
" .loc[df_used.brand.isin(brands), ['brand', 'price']] \\\n",
" .sort_values('brand', ascending=True)\n",
"\n",
"sns.set_theme(palette=\"hls\")\n",
"\n",
"# Initialize the matplotlib figure\n",
"f, ax = plt.subplots(figsize=(18, 8))\n",
"\n",
"mkfunc = lambda x, pos: '%1.0fk' % (x * 1e-3)\n",
"mkformatter = mpl.ticker.FuncFormatter(mkfunc)\n",
"ax.xaxis.set_major_formatter(mkformatter)\n",
"\n",
"# Draw a nested boxplot to show bills by day and time\n",
"sns.boxenplot(y=\"brand\", x=\"price\", data=df_price)\n",
"\n",
"ax.set(ylabel=\"\", xlim=[0, 100000], xticks=range(0, 105001, 5000),\n",
" xlabel=\"Distribution of prices per vehicle type and fuel type\")\n",
" \n",
"sns.despine(offset=10, trim=True)"
]
},
{
"attachments": {
"Banks%20-%20market%20cap.png": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAABgwAAAQoCAYAAADWsvWXAAAMP2lDQ1BJQ0MgUHJvZmlsZQAASImVVwdYU8kWnluSkEBoAQSkhN4E6QSQEkILvSPYCEmAUEIMBBU7uqjg2sUCNnRVRMEKiB2xswg27IsFFWVdLNiVNymg677yvfm+ufPff87858y5M/feAUDtJEckykXVAcgTForjQgLoY1NS6aSnAAUYoAIHADjcAhEzJiYCwDLU/r28uwEQaXvVXqr1z/7/WjR4/AIuAEgMxOm8Am4exAcBwKu4InEhAEQpbzalUCTFsAItMQwQ4oVSnCnHVVKcLsd7ZTYJcSyIWwFQUuFwxJkAqHZAnl7EzYQaqv0QOwp5AiEAanSIffPy8nkQp0FsDW1EEEv1Gek/6GT+TTN9WJPDyRzG8rnIilKgoECUy5n2f6bjf5e8XMmQD0tYVbLEoXHSOcO83czJD5diFYj7hOlR0RBrQvxBwJPZQ4xSsiShiXJ71IBbwII5AzoQO/I4geEQG0AcLMyNilDw6RmCYDbEcIWgUwWF7ASIdSFeyC8IilfYbBbnxyl8oQ0ZYhZTwZ/niGV+pb7uS3ISmQr911l8tkIfUy3OSkiGmAKxeZEgKQpiVYgdCnLiwxU2Y4qzWFFDNmJJnDR+c4jj+MKQALk+VpQhDo5T2JflFQzNF9ucJWBHKfD+wqyEUHl+sFYuRxY/nAvWwRcyE4d0+AVjI4bmwuMHBsnnjj3jCxPjFTofRIUBcfKxOEWUG6Owx035uSFS3hRi14KieMVYPKkQLki5Pp4hKoxJkMeJF2dzwmLk8eDLQARggUBABxJY00E+yAaC9r7GPngn7wkGHCAGmYAP7BXM0IhkWY8QXuNBMfgTIj4oGB4XIOvlgyLIfx1m5Vd7kCHrLZKNyAFPIM4D4SAX3ktko4TD3pLAY8gI/uGdAysXxpsLq7T/3/ND7HeGCZkIBSMZ8khXG7IkBhEDiaHEYKINro/74t54BLz6w+qMM3DPoXl8tyc8IXQSHhKuE7oJtyYJSsQ/RRkJuqF+sCIX6T/mAreEmm54AO4D1aEyroPrA3vcFfph4n7QsxtkWYq4pVmh/6T9txn88DQUdmRHMkoeQfYnW/88UtVW1W1YRZrrH/MjjzV9ON+s4Z6f/bN+yD4PtuE/W2ILsQPYOewUdgE7ijUCOnYCa8LasGNSPLy6HstW15C3OFk8OVBH8A9/Q09WmskCx1rHXscv8r5C/lTpOxqw8kXTxILMrEI6E34R+HS2kOswiu7s6OwCgPT7In99vYmVfTcQnbbv3Lw/APA5MTg4eOQ7F3YCgH0ecPsf/s5ZM+CnQxmA84e5EnGRnMOlFwJ8S6jBnaYHjIAZsIbzcQbuwBv4gyAQBqJBAkgBE2H0WXCdi8EUMAPMBaWgHCwDq8F6sAlsBTvBHrAfNIKj4BQ4Cy6BDnAd3IGrpwe8AP3gHfiMIAgJoSI0RA8xRiwQO8QZYSC+SBASgcQhKUgakokIEQkyA5mHlCMrkPXIFqQG2YccRk4hF5BO5BbyAOlFXiOfUAxVQbVQQ9QSHY0yUCYajiagE9BMdDJajM5Hl6Br0Wp0N9qAnkIvodfRbvQFOoABTBnTwUwwe4yBsbBoLBXLwMTYLKwMq8CqsTqsGT7nq1g31od9xIk4Dafj9nAFh+KJOBefjM/CF+Pr8Z14A96KX8Uf4P34NwKVYECwI3gR2ISxhEzCFEIpoYKwnXCIcAbupR7COyKRqEO0InrAvZhCzCZOJy4mbiDWE08SO4mPiAMkEkmPZEfyIUWTOKRCUilpHWk36QTpCqmH9EFJWclYyVkpWClVSahUolShtEvpuNIVpadKn8nqZAuyFzmazCNPIy8lbyM3ky+Te8ifKRoUK4oPJYGSTZlLWUupo5yh3KW8UVZWNlX2VI5VFijPUV6rvFf5vPID5Y8qmiq2KiyV8SoSlSUqO1ROqtxSeUOlUi2p/t
RUaiF1CbWGepp6n/pBlabqoMpW5anOVq1UbVC9ovpSjaxmocZUm6hWrFahdkDtslqfOlndUp2lzlGfpV6pfli9S31Ag6bhpBGtkaexWGOXxgWNZ5okTUvNIE2e5nzNrZqnNR/RMJoZjUXj0ubRttHO0Hq0iFpWWmytbK1yrT1a7Vr92prartpJ2lO1K7WPaXfrYDqWOmydXJ2lOvt1buh8GmE4gjmCP2LRiLoRV0a81x2p66/L1y3Trde9rvtJj64XpJejt1yvUe+ePq5vqx+rP0V/o/4Z/b6RWiO9R3JHlo3cP/K2AWpgaxBnMN1gq0GbwYChkWGIochwneFpwz4jHSN/o2yjVUbHjXqNaca+xgLjVcYnjJ/TtelMei59Lb2V3m9iYBJqIjHZYtJu8tnUyjTRtMS03vSeGcWMYZZhtsqsxazf3Ng80nyGea35bQuyBcMiy2KNxTmL95ZWlsmWCywbLZ9Z6VqxrYqtaq3uWlOt/awnW1dbX7Mh2jBscmw22HTYorZutlm2lbaX7VA7dzuB3Qa7zlGEUZ6jhKOqR3XZq9gz7Yvsa+0fOOg4RDiUODQ6vBxtPjp19PLR50Z/c3RzzHXc5njHSdMpzKnEqdnptbOtM9e50vmaC9Ul2GW2S5PLK1c7V77rRtebbjS3SLcFbi1uX9093MXude69HuYeaR5VHl0MLUYMYzHjvCfBM8BztudRz49e7l6FXvu9/vK2987x3uX9bIzVGP6YbWMe+Zj6cHy2+HT70n3TfDf7dvuZ+HH8qv0e+pv58/y3+z9l2jCzmbuZLwMcA8QBhwLes7xYM1knA7HAkMCywPYgzaDEoPVB94NNgzODa4P7Q9xCpoecDCWEhocuD+1iG7K57Bp2f5hH2Myw1nCV8Pjw9eEPI2wjxBHNkWhkWOTKyLtRFlHCqMZoEM2OXhl9L8YqZnLMkVhibExsZeyTOKe4GXHn4mnxk+J3xb9LCEhYmnAn0TpRktiSpJY0Pqkm6X1yYPKK5O6xo8fOHHspRT9FkNKUSkpNSt2eOjAuaNzqcT3j3caXjr8xwWrC1AkXJupPzJ14bJLaJM6kA2mEtOS0XWlfONGcas5AOju9Kr2fy+Ku4b7g+fNW8Xr5PvwV/KcZPhkrMp5l+mSuzOzN8suqyOoTsATrBa+yQ7M3Zb/Pic7ZkTOYm5xbn6eUl5Z3WKgpzBG25hvlT83vFNmJSkXdk70mr57cLw4Xby9ACiYUNBVqwR/5Nom15BfJgyLfosqiD1OSphyYqjFVOLVtmu20RdOeFgcX/zYdn86d3jLDZMbcGQ9mMmdumYXMSp/VMtts9vzZPXNC5uycS5mbM/f3EseSFSVv5yXPa55vOH/O/Ee/hPxSW6paKi7tWuC9YNNCfKFgYfsil0XrFn0r45VdLHcsryj/spi7+OKvTr+u/XVwScaS9qXuSzcuIy4TLrux3G/5zhUaK4pXPFoZubJhFX1V2aq3qyetvlDhWrFpDWWNZE332oi1TevM1y1b92V91vrrlQGV9VUGVYuq3m/gbbiy0X9j3SbDTeWbPm0WbL65JWRLQ7VldcVW4tairU+2JW079xvjt5rt+tvLt3/dIdzRvTNuZ2uNR03NLoNdS2vRWklt7+7xuzv2BO5pqrOv21KvU1++F+yV7H2+L23fjf3h+1sOMA7UHbQ4WHWIdqisAWmY1tDfmNXY3ZTS1Hk47HBLs3fzoSMOR3YcNTlaeUz72NLjlOPzjw+eKD4xcFJ0su9U5qlHLZNa7pwee/paa2xr+5nwM+fPBp89fY557sR5n/NHL3hdOHyRcbHxkvulhja3tkO/u/1+qN29veGyx+WmDs+O5s4xncev+F05dTXw6tlr7GuXrkdd77yReONm1/iu7pu8m89u5d56dbvo9uc7c+4S7pbdU79Xcd/gfvUfNn/Ud7t3H3sQ+KDtYfzDO4+4j148Lnj8pWf+E+qTiqfGT2ueOT872hvc2/F83POeF6IXn/tK/9
T4s+ql9cuDf/n/1dY/tr/nlfjV4OvFb/Te7Hjr+rZlIGbg/ru8d5/fl33Q+7DzI+PjuU/Jn55+nvKF9GXtV5uvzd/C
}
},
"cell_type": "markdown",
"id": "f4e84bcf",
"metadata": {},
"source": [
"## Exercise 3 - Data analysis (20 points) 📊\n",
"\n",
"The following graph represents the financial meltdown's impact on banks since the 2008 financial crisis began, and compares the market value of each bank as of 2007 - in blue - and 2009 - in green. The **main** purpose of the graph is to show the loss of each bank after the financial crisis and to enlight the little decline pre-versus-post meltdown of J.P. Morgan; the **secondary** purpose is to provide a sense of the relative sizes of the banks in terms of market value (e.g., J.P. Morgan is not a small bank).\n",
"Is there a better solution to achieve these two goals? How would you compare both the remaining market value of each bank after the loss caused by the crisis and their decline?\n",
"\n",
"List all the problems that you detect in the design of this graph with respect to the quantive message the graph is supposed to deliver.\n",
"\n",
"Propose and implement a different graph that delivers effectively the message.\n",
"\n",
"Use the data in the *market_value_decline* dataset to populate the new graph.\n",
"\n",
"![Banks%20-%20market%20cap.png](attachment:Banks%20-%20market%20cap.png)"
]
},
{
"cell_type": "code",
"execution_count": 17,
"id": "eb956ed4",
"metadata": {},
"outputs": [],
"source": [
"df_m = pd.read_csv(\"./datasets/market_value_decline.csv\").rename(columns={\n",
" 'Unnamed: 0': 'bank',\n",
" 'market_value_2007': '2007',\n",
" 'market_value_2009': '2009'\n",
"})\n",
"\n",
"df_mkt = df_m\n",
"df_mkt[\"diff\"] = 100 * (df_mkt['2009'] - df_mkt['2007']) / df_mkt['2007']\n",
"df_mkt = df_mkt.sort_values(['diff'], ascending=False)\n",
"\n",
"# sort source DF according to new order by diff\n",
"df_m = df_m.reindex(df_mkt.index)"
]
},
{
"cell_type": "code",
"execution_count": 18,
"id": "4a29684b",
"metadata": {},
"outputs": [],
"source": [
"df_mval = pd.melt(df_m.loc[:, ['bank', '2007', '2009']], id_vars=['bank'], var_name='year', value_name='market_value')"
]
},
{
"cell_type": "code",
"execution_count": 19,
"id": "d3d58d25",
"metadata": {},
"outputs": [
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAABUgAAAKrCAYAAAAj9WcAAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjUuMiwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy8qNh9FAAAACXBIWXMAAA9hAAAPYQGoP6dpAADVp0lEQVR4nOzdd1jV9f//8cc5B0FUwL0QNRTRnOTCrJxljtTUTDEsTctSyxwJmrPCPRIrzRy5cpv6tRypWTYsy1G5caLiCsQBAuec3x/+PJ8ILEQOB865366rS857Pp+vc6A3D97DYLVarQIAAAAAAAAAF2R0dAEAAAAAAAAA4CgEpAAAAAAAAABcFgEpAAAAAAAAAJdFQAoAAAAAAADAZRGQAgAAAAAAAHBZBKQAAAAAAAAAXBYBKQAAAAAAAACXRUAKAAAAAAAAwGW5OboAQJKsVqv++uumLBaro0vJEkajQYUL53eanpytH4mecgt6yh2crSdn60dy3p6KFCng6DKQAc52nJeTOOP3dk7C+NoX42tfjK99Mb725YjjPM4gRY5gMBhkNBocXUaWMRoNTtWTs/Uj0VNuQU+5g7P15Gz9SM7bE3IHZ/vs5STO+L2dkzC+9sX42hfja1+Mr305YlwJSAEAAAAAAAC4LC6xBwDg/zMa7f9XYJPJmOpfZ+BsPTlbP5Jz94TcgffLPpzxezsnccT4WixWLtcFAAcgIAUAQHfC0UI+njK6mbJlf97entmyn+zkbD05Wz+Sc/aEnM9stvDZszPG176yc3xTzGZdi0sgJAWAbEZACgCA/v/Zo24m7eg3VXHHoh1dDoD/UDCgjJrMHOjoMpABJpNRvSPm6eiZGEeXAuRolcqW1JxhPWU0GghIASCbEZACAPA3cceidfWPE44uAwCcytEzMdp//KyjywAAAEgXASkAAAAAAACQDovFIrM55R/TDEpMNCkp6bbMZs74vl8mk5uMxpx1/2wCUgAAAAAAAOBvrFar4uP/UkLCjXTnX7lilMViyeaqnIenZwF5exeWwWDfh+RmFAEpAAAAAAAA8Dd3w9ECBQrJ3d0jTZBnMhk4ezQTrFarkpJu68aNWEmSj08RB1d0BwEpAAAAAAAA8P9ZLGZbOFqggHe6y7i5GZWSwhmkmeHu7iFJunEjVl5ehXLE5faOrwAAAAAAAADIIcxms6T/BXnIenfH9p/3d3UUAlIAAAAAAADgH3LK/TGdUU4bWwJSAAAAAAAAAC6LgBQAAAAAAACAyyIgBQAAAAAAAOCyCEizwIYNG/T8888rKChIQUFB6tixo5YtW5al+4iNjdXKlSuzdJvpCQ0NVVhYmN33AwAAAAAAAOQEbo4uILdbtWqV3nvvPQ0bNkx169aV1WrVjz/+qPfff19XrlxRv379smQ/EydOVHR0tJ577rks2R4AAAAAAAAAAtIHtnTpUnXq1EmdO3e2TfP391dMTIwWLlyYZQGp1WrNku0AAAAAAAAgZ/nwww+0evUKrV+/WQUKFLBNX7RogRYunKf16zfrwoVzmjVrpvbt2ytJql27rvr1GyBf3zK25Y8fP6Z58z7RgQN7df36dRUqVFiNGzfVa6/1l4dHXknSY4/VUc+er+iHH3bp7NnT6to1VC+91Ct7G85huMT+ARmNRv3222+6du1aqum9e/fW8uXLJUkxMTEaPHiwHn30UVWtWlWNGjXStGnTZLFYJElr1qxR06ZNtXbtWj355JOqVq2aOnbsqL1773zgw8LCtHbtWv38888KDAyUJMXHx2vUqFFq1KiRqlatqoYNG2rUqFFKTEyUJO3evVuBgYHauXOn2rRpo2rVqql169basWOHrcakpCRFRESoQYMGqlOnjqZMmWKr6a6oqCj17t1bQUFBeuyxxzRo0CBdvnzZNj80NFTDhg3Tc889pzp16uiLL77I2gEGAAAAAABwcm3atFNS0m19883XqaZv3rxRTZo00+XLl9Snz8uKjf
1Lw4ePUljYCJ0/f06vv35nmiRduXJFffv2UmJigoYNG63Jk2eoadPmWrVquZYvX5pqu599NleNGzfV6NHv6/HHG2dXmzkWZ5A+oN69e2vAgAF64oknVL9+fdWpU0fBwcGqXr26vL29JUmvvvqqihQporlz56pAgQL65ptv9N5776l69epq3ry5JOnSpUtatmyZJk2apDx58mj06NEaOnSoNm/erOHDhysxMVExMTGKjIyUJA0dOlQxMTGaMWOGihQpon379ik8PFz+/v568cUXbfVNmjRJw4cPV5EiRTR16lQNHjxY3377rfLnz6/33ntP27dv1/jx41W6dGnNmjVLe/bskZ+fnyTp4sWLCgkJUevWrRUWFqaEhARFRkaqS5cu2rBhg/LlyyfpTsA7adIkVa5cWUWLFs3O4QcAAEAu0LxuVQX4lXB0GcikazcSdCk23tFlOL1KZUs6ugQADlSuXHlVq1ZDmzZ9qTZt2kuSDh78Q6dOndSQIcM0f/4ceXh4aPr0j5Q//50zTOvUqavOndtp6dJF6tv3TZ04cVwBAYF6993xtmXq1q2vX3/9Rfv2/abu3Xva9vfww9X0wgsvZXebORYB6QNq0aKFli9frkWLFmnXrl3auXOnJKl8+fKKiIhQ1apV1a5dO7Vo0UK+vr6S7px1+cknn+jIkSO2gDQ5OVmjR49WlSpVJN0JVfv27avLly+rePHiyps3r/LkyaNixYpJkho2bKg6deqocuXKkqQyZcpo8eLFOnLkSKr6BgwYoAYNGti+bteunY4ePaqAgACtWbPGdhaqJEVERGj37t22dT///HMVL15cI0eOtE2bPn26goODtWnTJnXo0EGSVKVKFT3zzDNZO7AAAABwChaLWSNebufoMvAALBazjEaTo8twCSlmsywWbq8GuKo2bdpqwoT3deHCeZUqVVpffvl/8vUto5o1gzRiRJgeeaS2PDzyKiUlRZKUL19+1agRpF9+uZPl1KsXrHr1gpWSkqIzZ07r7Nkzioo6ptjYWHl7+6TaV4UKFbO9v5yMgDQL1KhRQ5MmTZLVatXRo0e1c+dOLVy4UL1799bWrVv1wgsvaNOmTfrss890+vRpHT58WJcuXUpzOXuFChVsX3t5eUm6E5ymJyQkRNu3b9e6det05swZHT16VGfPnlX58uVTLefv72/7+u49LJKTk3Xy5EklJyerevXqtvkeHh62gFaSDh48qKioKAUFBaXa5u3btxUVFWV7Xa5cuYwMEwAAAFyQ0WjStg+HKe7cCUeXgkwo6OuvZn0jFB+fILPZ8t8rOBGTyShvb89s7d1isRKQAi6sadOn9MEHU7V585fq1u1Fbd++VZ07d5UkXbsWp23btmrbtq1p1itYsJAkyWKxaPbsD7VmzUolJNxS8eIl9PDDVeXh4ZHm2TaFChW2f0O5CAHpA4iJidGcOXP0yiuvqESJEjIYDAoMDFRgYKCaNWumVq1a6bvvvtPChQuVkJCgli1bql27dhoxYoS6deuWZnvu7u5ppqX3cCar1ao+ffroyJEjeuaZZ9SiRQsNHDhQI0aMyPQ273Jz+99HwmKxKDg4WKNGjUqz3N0AV5Ly5s17z+0BAAAAcedO6Mqpw44uAw/AbLYoJcW1AtK7XLl3ANkrX758atKkmXbs+FoBAYG6ceO6nn66jaQ7OUzt2vXUtesLadYzme6c5b948QItX75EgweHq3HjZrYT5Xr37p59TeRSBKQPwN3dXcuXL1fJkiXVu3fvVPPufgijo6P1559/6vvvv7fdnzMuLk5Xr169ryfTGwwG29cHDx7Uzp07tWLFCtWsWVPSnbNCz5w5Y7t/6H+pUKGCPDw89Ouvv9ou009JSdHhw4dVv359SVJAQIC+/PJLlSpVyha0xsXFaejQoerRo4eCg4MzXD8AAAAAAAD+XZs27fTllxv0+eeL9MgjdVSy5J37E9eq9YhOnTqpihUr2U5us1qtGjt2hMqU8VNAQKAOHN
inhx7yV5s2/7u1zeXLlxQVFaUqVR52SD+5BU+xfwCFCxdWr169NH36dE2bNk2HDh3S2bNntWPHDvXr10/169fXE08
"text/plain": [
"<Figure size 1500x800 with 2 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"sns.set_theme(palette=\"hls\")\n",
"\n",
"# Initialize the matplotlib figure\n",
"f, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 8), sharey=True)\n",
"\n",
"ax2.xaxis.set_major_formatter(mpl.ticker.FuncFormatter(lambda x, pos: '%.0fB' % (x)))\n",
"ax1.xaxis.set_major_formatter(mpl.ticker.FuncFormatter(lambda x, pos: '%.0f%%' % (x)))\n",
"\n",
"sns.barplot(y=\"bank\", x=\"diff\", data=df_mkt, ax=ax1, color=sns.xkcd_rgb[\"purplish red\"])\n",
"sns.barplot(\n",
" data=df_mval, ax=ax2,\n",
" x=\"market_value\", y=\"bank\", hue=\"year\",\n",
" palette=[sns.xkcd_rgb[\"prussian blue\"], sns.xkcd_rgb[\"sienna\"]]\n",
")\n",
"\n",
"# Add a legend and informative axis label\n",
"ax2.set(ylabel=\"Institution\", xlim=[0, 300],\n",
" xlabel=\"Market value\")\n",
"ax1.set(ylabel=\"Institution\", xlim=[-1, 0], xticks=range(-100, 1, 10),\n",
" xlabel=\"Market value decrease\")\n",
"sns.despine(left=True, bottom=True)"
]
},
{
"attachments": {},
"cell_type": "markdown",
"id": "06e7f954",
"metadata": {},
"source": [
"## Exercise 4 - Data visualisation and exploration (30 points) 🔍\n",
"\n",
"You'll need to work with the *'airports'* and *airports-delays* datasets. Examine the datasets and perform cleansing if needed, before performing the exercise.\n",
"\n",
"1. Create a dataframe that provides, for each country, <del>the mean of flights delayed</del>. Display these information by binning the flights delayed in 6 bins. The resulting dataframe should have the countries as rows and the 6 bins as columns. For this exercise you cannot use pivot_table but only groupby. \n",
"\n",
"<span style=\"color: red\">According to answer of question to professor:</span>\n",
"> Bin by delay_duration value, compute delay mean per-bin per-country \n",
"\n",
"2. Create a dataframe from a*irports-delays* which shows for each continent and country:\n",
" 1. max, min and mean of **delay_duration**;\n",
" 2. mean, sum of **flights_cancelled**;\n",
" 3. mean, sum of **flights_delayed**;\n",
" 4. mean, sum of **flights_planned**.\n",
"\n",
"3. Show a representation of the relationship between the number of flights planned and the number of flights delayed for each continent. It should be possible to see the relationship and the presence of outliers for each continent. What do you observe? You may want to display the median of the values for a better explaination."
]
},
{
"cell_type": "code",
"execution_count": 53,
"id": "b4fde7e4",
"metadata": {},
"outputs": [],
"source": [
"df_air = pd.read_csv(\"./datasets/airports.csv\", index_col='ID', na_values=['\\\\N'])\n",
"df_del = pd.read_csv(\"./datasets/airports-delays.csv\", index_col='ID', sep=\";\", na_values=['\\\\N'])"
]
},
{
"cell_type": "code",
"execution_count": 81,
"id": "f8906707",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th>delay_duration_bin</th>\n",
" <th>(15.999, 30.0]</th>\n",
" <th>(30.0, 35.0]</th>\n",
" <th>(35.0, 41.0]</th>\n",
" <th>(41.0, 47.0]</th>\n",
" <th>(47.0, 59.0]</th>\n",
" <th>(59.0, 850.0]</th>\n",
" </tr>\n",
" <tr>\n",
" <th>country</th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>Afghanistan</th>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.00</td>\n",
" <td>44.0</td>\n",
" <td>0.000000</td>\n",
" <td>60.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Albania</th>\n",
" <td>18.5</td>\n",
" <td>31.000000</td>\n",
" <td>0.00</td>\n",
" <td>0.0</td>\n",
" <td>56.000000</td>\n",
" <td>63.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Algeria</th>\n",
" <td>26.5</td>\n",
" <td>33.857143</td>\n",
" <td>38.75</td>\n",
" <td>43.0</td>\n",
" <td>51.200000</td>\n",
" <td>73.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>American Samoa</th>\n",
" <td>0.0</td>\n",
" <td>0.000000</td>\n",
" <td>0.00</td>\n",
" <td>43.0</td>\n",
" <td>48.000000</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Angola</th>\n",
" <td>28.0</td>\n",
" <td>34.500000</td>\n",
" <td>36.00</td>\n",
" <td>45.0</td>\n",
" <td>51.666667</td>\n",
" <td>0.0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
"delay_duration_bin (15.999, 30.0] (30.0, 35.0] (35.0, 41.0] (41.0, 47.0] \\\n",
"country \n",
"Afghanistan 0.0 0.000000 0.00 44.0 \n",
"Albania 18.5 31.000000 0.00 0.0 \n",
"Algeria 26.5 33.857143 38.75 43.0 \n",
"American Samoa 0.0 0.000000 0.00 43.0 \n",
"Angola 28.0 34.500000 36.00 45.0 \n",
"\n",
"delay_duration_bin (47.0, 59.0] (59.0, 850.0] \n",
"country \n",
"Afghanistan 0.000000 60.0 \n",
"Albania 56.000000 63.0 \n",
"Algeria 51.200000 73.0 \n",
"American Samoa 48.000000 0.0 \n",
"Angola 51.666667 0.0 "
]
},
"execution_count": 81,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_4_1 = df_del.copy()\n",
"\n",
"# The following statements bins the data by the value of delay_duration.\n",
"# The bins are chosen as equally-spaced percentile values of the data. This is done to \n",
"# better distribute the data between bins, as it is quite skewed towards low values\n",
"df_4_1[\"delay_duration_bin\"] = pd.qcut(df_del.delay_duration, 6)\n",
"\n",
"# The dataframe will contain countries as row indices, the 6 bins as columns and values\n",
"# corresponding to the mean delay_duration per country, per bin. When no delay_duration \n",
"# falls in a particular bin for some country, that bin has a value of 0\n",
"df_4_1 = df_4_1.loc[:, ['country', 'delay_duration', 'delay_duration_bin']] \\\n",
" .groupby(['country', 'delay_duration_bin']) \\\n",
" .mean() \\\n",
" .fillna(0) \\\n",
" .reset_index() \\\n",
" .pivot(index='country', columns='delay_duration_bin', values='delay_duration') \n",
"\n",
"df_4_1.head()"
]
},
{
"cell_type": "code",
"execution_count": 82,
"id": "a677ce07",
"metadata": {},
"outputs": [],
"source": [
"# 4.2\n",
"# TODO: continents\n",
"df_4_2 = df_del.loc[:, ['country', 'delay_duration', 'flights_cancelled', 'flights_delayed', 'flights_planned']] \\\n",
" .groupby('country') \\\n",
" .agg(dur_min=('delay_duration', 'min'), \\\n",
" dur_mean=('delay_duration', 'mean'), \\\n",
" dur_max=('delay_duration', 'max'), \\\n",
" cancelled_sum=('flights_cancelled', 'sum'), \\\n",
" cancelled_mean=('flights_cancelled', 'mean'), \\\n",
" delayed_sum=('flights_delayed', 'sum'), \\\n",
" delayed_mean=('flights_delayed', 'mean'), \\\n",
" planned_sum=('flights_planned', 'sum'), \\\n",
" planned_mean=('flights_planned', 'mean'))"
]
},
{
"cell_type": "code",
"execution_count": 59,
"id": "a29b8c2f",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>airport_name</th>\n",
" <th>city</th>\n",
" <th>country</th>\n",
" <th>IATA</th>\n",
" <th>ICAO</th>\n",
" <th>latitude</th>\n",
" <th>longitude</th>\n",
" <th>altitude</th>\n",
" <th>timezone</th>\n",
" <th>DST</th>\n",
" <th>tz_database_timezone</th>\n",
" <th>type</th>\n",
" <th>source</th>\n",
" <th>flights_planned</th>\n",
" <th>flights_cancelled</th>\n",
" <th>flights_delayed</th>\n",
" <th>delay_duration</th>\n",
" </tr>\n",
" <tr>\n",
" <th>ID</th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>1600</th>\n",
" <td>Bar Yehuda Airfield</td>\n",
" <td>Metzada</td>\n",
" <td>Israel</td>\n",
" <td>MTZ</td>\n",
" <td>LLMZ</td>\n",
" <td>31.328199</td>\n",
" <td>35.388599</td>\n",
" <td>-1266</td>\n",
" <td>2.0</td>\n",
" <td>E</td>\n",
" <td>Asia/Jerusalem</td>\n",
" <td>airport</td>\n",
" <td>OurAirports</td>\n",
" <td>62</td>\n",
" <td>0</td>\n",
" <td>9</td>\n",
" <td>32.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1595</th>\n",
" <td>Ein Yahav Airfield</td>\n",
" <td>Eyn-yahav</td>\n",
" <td>Israel</td>\n",
" <td>EIY</td>\n",
" <td>LLEY</td>\n",
" <td>30.621700</td>\n",
" <td>35.203300</td>\n",
" <td>-164</td>\n",
" <td>2.0</td>\n",
" <td>E</td>\n",
" <td>Asia/Jerusalem</td>\n",
" <td>airport</td>\n",
" <td>OurAirports</td>\n",
" <td>56</td>\n",
" <td>0</td>\n",
" <td>7</td>\n",
" <td>24.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7646</th>\n",
" <td>Jacqueline Cochran Regional Airport</td>\n",
" <td>Palm Springs</td>\n",
" <td>United States</td>\n",
" <td>TRM</td>\n",
" <td>KTRM</td>\n",
" <td>33.626701</td>\n",
" <td>-116.160004</td>\n",
" <td>-115</td>\n",
" <td>-8.0</td>\n",
" <td>A</td>\n",
" <td>America/Los_Angeles</td>\n",
" <td>airport</td>\n",
" <td>OurAirports</td>\n",
" <td>60</td>\n",
" <td>0</td>\n",
" <td>7</td>\n",
" <td>28.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>4357</th>\n",
" <td>Atyrau Airport</td>\n",
" <td>Atyrau</td>\n",
" <td>Kazakhstan</td>\n",
" <td>GUW</td>\n",
" <td>UATG</td>\n",
" <td>47.121899</td>\n",
" <td>51.821400</td>\n",
" <td>-72</td>\n",
" <td>5.0</td>\n",
" <td>U</td>\n",
" <td>Asia/Oral</td>\n",
" <td>airport</td>\n",
" <td>OurAirports</td>\n",
" <td>71</td>\n",
" <td>0</td>\n",
" <td>9</td>\n",
" <td>35.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>2151</th>\n",
" <td>Ramsar Airport</td>\n",
" <td>Ramsar</td>\n",
" <td>Iran</td>\n",
" <td>RZR</td>\n",
" <td>OINR</td>\n",
" <td>36.909901</td>\n",
" <td>50.679600</td>\n",
" <td>-70</td>\n",
" <td>3.5</td>\n",
" <td>E</td>\n",
" <td>Asia/Tehran</td>\n",
" <td>airport</td>\n",
" <td>OurAirports</td>\n",
" <td>62</td>\n",
" <td>1</td>\n",
" <td>6</td>\n",
" <td>47.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>...</th>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" <td>...</td>\n",
" </tr>\n",
" <tr>\n",
" <th>3039</th>\n",
" <td>Lengpui Airport</td>\n",
" <td>Aizwal</td>\n",
" <td>India</td>\n",
" <td>AJL</td>\n",
" <td>VELP</td>\n",
" <td>23.840599</td>\n",
" <td>92.619698</td>\n",
" <td>1398</td>\n",
" <td>5.5</td>\n",
" <td>N</td>\n",
" <td>Asia/Calcutta</td>\n",
" <td>airport</td>\n",
" <td>OurAirports</td>\n",
" <td>118</td>\n",
" <td>0</td>\n",
" <td>23</td>\n",
" <td>38.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>1670</th>\n",
" <td>Emmen Air Base</td>\n",
" <td>Emmen</td>\n",
" <td>Switzerland</td>\n",
" <td>EML</td>\n",
" <td>LSME</td>\n",
" <td>47.092444</td>\n",
" <td>8.305184</td>\n",
" <td>1400</td>\n",
" <td>1.0</td>\n",
" <td>E</td>\n",
" <td>Europe/Zurich</td>\n",
" <td>airport</td>\n",
" <td>OurAirports</td>\n",
" <td>124</td>\n",
" <td>0</td>\n",
" <td>19</td>\n",
" <td>38.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>6215</th>\n",
" <td>Long Lellang Airport</td>\n",
" <td>Long Datih</td>\n",
" <td>Malaysia</td>\n",
" <td>LGL</td>\n",
" <td>WBGF</td>\n",
" <td>3.421000</td>\n",
" <td>115.153999</td>\n",
" <td>1400</td>\n",
" <td>8.0</td>\n",
" <td>N</td>\n",
" <td>Asia/Kuala_Lumpur</td>\n",
" <td>airport</td>\n",
" <td>OurAirports</td>\n",
" <td>126</td>\n",
" <td>0</td>\n",
" <td>18</td>\n",
" <td>32.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>7375</th>\n",
" <td>Minaçu Airport</td>\n",
" <td>Minacu</td>\n",
" <td>Brazil</td>\n",
" <td>MQH</td>\n",
" <td>SBMC</td>\n",
" <td>-13.549100</td>\n",
" <td>-48.195301</td>\n",
" <td>1401</td>\n",
" <td>-3.0</td>\n",
" <td>S</td>\n",
" <td>America/Sao_Paulo</td>\n",
" <td>airport</td>\n",
" <td>OurAirports</td>\n",
" <td>119</td>\n",
" <td>1</td>\n",
" <td>25</td>\n",
" <td>48.0</td>\n",
" </tr>\n",
" <tr>\n",
" <th>9253</th>\n",
" <td>Bubovice Airport</td>\n",
" <td>Bubovice</td>\n",
" <td>Czech Republic</td>\n",
" <td>NaN</td>\n",
" <td>LKBU</td>\n",
" <td>49.974400</td>\n",
" <td>14.178100</td>\n",
" <td>1401</td>\n",
" <td>1.0</td>\n",
" <td>E</td>\n",
" <td>Europe/Prague</td>\n",
" <td>airport</td>\n",
" <td>OurAirports</td>\n",
" <td>128</td>\n",
" <td>0</td>\n",
" <td>15</td>\n",
" <td>32.0</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"<p>6029 rows × 17 columns</p>\n",
"</div>"
],
"text/plain": [
" airport_name city country IATA \\\n",
"ID \n",
"1600 Bar Yehuda Airfield Metzada Israel MTZ \n",
"1595 Ein Yahav Airfield Eyn-yahav Israel EIY \n",
"7646 Jacqueline Cochran Regional Airport Palm Springs United States TRM \n",
"4357 Atyrau Airport Atyrau Kazakhstan GUW \n",
"2151 Ramsar Airport Ramsar Iran RZR \n",
"... ... ... ... ... \n",
"3039 Lengpui Airport Aizwal India AJL \n",
"1670 Emmen Air Base Emmen Switzerland EML \n",
"6215 Long Lellang Airport Long Datih Malaysia LGL \n",
"7375 Minaçu Airport Minacu Brazil MQH \n",
"9253 Bubovice Airport Bubovice Czech Republic NaN \n",
"\n",
" ICAO latitude longitude altitude timezone DST \\\n",
"ID \n",
"1600 LLMZ 31.328199 35.388599 -1266 2.0 E \n",
"1595 LLEY 30.621700 35.203300 -164 2.0 E \n",
"7646 KTRM 33.626701 -116.160004 -115 -8.0 A \n",
"4357 UATG 47.121899 51.821400 -72 5.0 U \n",
"2151 OINR 36.909901 50.679600 -70 3.5 E \n",
"... ... ... ... ... ... .. \n",
"3039 VELP 23.840599 92.619698 1398 5.5 N \n",
"1670 LSME 47.092444 8.305184 1400 1.0 E \n",
"6215 WBGF 3.421000 115.153999 1400 8.0 N \n",
"7375 SBMC -13.549100 -48.195301 1401 -3.0 S \n",
"9253 LKBU 49.974400 14.178100 1401 1.0 E \n",
"\n",
" tz_database_timezone type source flights_planned \\\n",
"ID \n",
"1600 Asia/Jerusalem airport OurAirports 62 \n",
"1595 Asia/Jerusalem airport OurAirports 56 \n",
"7646 America/Los_Angeles airport OurAirports 60 \n",
"4357 Asia/Oral airport OurAirports 71 \n",
"2151 Asia/Tehran airport OurAirports 62 \n",
"... ... ... ... ... \n",
"3039 Asia/Calcutta airport OurAirports 118 \n",
"1670 Europe/Zurich airport OurAirports 124 \n",
"6215 Asia/Kuala_Lumpur airport OurAirports 126 \n",
"7375 America/Sao_Paulo airport OurAirports 119 \n",
"9253 Europe/Prague airport OurAirports 128 \n",
"\n",
" flights_cancelled flights_delayed delay_duration \n",
"ID \n",
"1600 0 9 32.0 \n",
"1595 0 7 24.0 \n",
"7646 0 7 28.0 \n",
"4357 0 9 35.0 \n",
"2151 1 6 47.0 \n",
"... ... ... ... \n",
"3039 0 23 38.0 \n",
"1670 0 19 38.0 \n",
"6215 0 18 32.0 \n",
"7375 1 25 48.0 \n",
"9253 0 15 32.0 \n",
"\n",
"[6029 rows x 17 columns]"
]
},
"execution_count": 59,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_del"
]
},
{
"cell_type": "markdown",
"id": "e2f9c1aa",
"metadata": {},
"source": [
"## Exercise 5 - Geospatial data analysis (35 points) 🌍\n",
"\n",
"Use the *airports*, *routes*, *countries* and *europe.geojson* files. Create an interactive map representation - related to European countries only - such that, when a country is selected the map shows the number of flights left from the country selected and directed to each of the other countries, if flights with those destinations exist. The information should be represented as a choropleth map, essentially dynamically creating it when a country is selected.\n",
"\n",
"**Hints**:\n",
"1. If `A` is a GeoDataFrame and `B` a DataFrame, the result of `A.merge(B,..)` is a GeoDataFrame, whereas the result of `B.merge(A,..)` is a DataFrame. The function `to_json()` on a DataFrame with a geometry column does **not** work.\n",
"2. When updating the map, to access the color mapper you can use the following method:\n",
"```\n",
"color_mapper = p.select_one(LinearColorMapper)\n",
"```\n",
"where `p` is the figure.\n",
"\n",
"3. You can discard Guernsey and Gibraltar that are not present in the geojson.\n",
"\n",
"\n",
"<aside>\n",
"💡 Note that you have all the information you need in the files mentioned above. \n",
"</aside>"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "386875c8",
"metadata": {},
"outputs": [],
"source": []
},
{
"attachments": {},
"cell_type": "markdown",
"id": "9b9c5983",
"metadata": {},
"source": [
"## Datasets description\n",
"\n",
"### **Used Cars**\n",
"\n",
"Please find the dataset in the datasets folder.\n",
"\n",
"This dataset is scraped from Ebay. The content of the dataset is in German, but it should not impose critical issues in understanding the data. The fields included in the dataset are as following:\n",
"\n",
"**dateCrawled**: when this ad was first crawled, all field-values are taken from this date\\\n",
"**name**: ”name” of the car\\\n",
"**seller**: private or dealer\\\n",
"**offerTypeprice**: the price in euro on the ad to sell the car\\\n",
"**abtestvehicleTypeyearOfRegistration** : at which year the car was first registered\\\n",
"**gearboxpowerPS**: power of the car in PS\\\n",
"**modelkilometer**: how many kilometers the car has driven\\\n",
"**monthOfRegistration**: at which month the car was first registered\\\n",
"**fuelType**: vehicle fuel type\\\n",
"**brand**: vehicle brand\\\n",
"**notRepairedDamage**: if the car has a damage which is not repaired yet\\\n",
"**dateCreated**: the date for which the ad at ebay was created\\\n",
"**nrOfPictures**: number of pictures in the ad\\\n",
"**postalCodelastSeenOnline**: when the crawler saw this ad last online\n",
"\n",
"### **Airports, Routes and Ariports Delays**\n",
"\n",
"Please find the datasets in the datasets folder.\n",
"\n",
"The datasets used in this section can be found in the datasets folder.\n",
"Datasets description are as follows.\n",
"\n",
"### **Airports**\n",
"\n",
"As of January 2017, the OpenFlights Airports Database contains over 10,000 airports, train stations and ferry terminals spanning the globe, as shown in the map above. Each entry contains the following information:\n",
"\n",
"**Airport ID**: Unique OpenFlights identifier for this airport\\\n",
"**Name**: Name of airport. May or may not contain the City name\\\n",
"**City**: Main city served by airport. May be spelled differently from Name\\\n",
"**Country**: Country or territory where airport is located. See Countries to cross-reference to ISO 3166-1 codes\\\n",
"**IATA**: 3-letter IATA code. Null if not assigned/unknown\\\n",
"**ICAO**: 4-letter ICAO code. Null if not assigned/unknown\\\n",
"**Latitude**: Decimal degrees, usually to six significant digits. Negative is South, positive is North\\\n",
"**Longitude**: Decimal degrees, usually to six significant digits. Negative is West, positive is East\\\n",
"**Altitude**: In feet\\\n",
"**Timezone**: Hours offset from UTC. Fractional hours are expressed as decimals, eg. India is 5.5\\\n",
"**DST**: Daylight savings time. One of E (Europe), A (US/Canada), S (South America), O (Australia), Z (New Zealand), N (None) or U (Unknown)\\\n",
"**Tz database time zone**: Timezone in \"tz\" (Olson) format, eg. \"America/Los_Angeles\"\\\n",
"**Type**: Type of the airport. Value \"airport\" for air terminals\\\n",
"**Source**: Source of the data. \"OurAirports\" for data sourced from OurAirports\n",
"\n",
"### **Airports Delays**\n",
"**Airport ID**: Unique OpenFlights identifier for this airport\\\n",
"**Name**: Name of airport. May or may not contain the City name\\\n",
"**City**: Main city served by airport. May be spelled differently from Name\\\n",
"**Country**: Country or territory where airport is located. See Countries to cross-reference to ISO 3166-1 codes\\\n",
"**IATA**: 3-letter IATA code. Null if not assigned/unknown\\\n",
"**ICAO**: 4-letter ICAO code. Null if not assigned/unknown\\\n",
"**Latitude**: Decimal degrees, usually to six significant digits. Negative is South, positive is North\\\n",
"**Longitude**: Decimal degrees, usually to six significant digits. Negative is West, positive is East\\\n",
"**Altitude**: In feet\\\n",
"**Timezone**: Hours offset from UTC. Fractional hours are expressed as decimals, eg. India is 5.5\\\n",
"**DST**: Daylight savings time. One of E (Europe), A (US/Canada), S (South America), O (Australia), Z (New Zealand), N (None) or U (Unknown)\\\n",
"**Tz database time zone**: Timezone in \"tz\" (Olson) format, eg. \"America/Los_Angeles\"\\\n",
"**Type**: Type of the airport. Value \"airport\" for air terminals\\\n",
"**Source**: Source of the data. \"OurAirports\" for data sourced from OurAirports\\\n",
"**Flights planned**: The number of fligths the related airport planned\\\n",
"**Flights cancelled**: The number of flights cancelled\\\n",
"**Flights delayed**: The number of flights delayed\\\n",
"**Delay duration**: The delay duration (in minutes)\n",
"\n",
"\n",
"### **Routes**\n",
"\n",
"As of June 2014, the OpenFlights/Airline Route Mapper Route Database contains 67663 routes between 3321 airports on 548 airlines spanning the globe, as shown in the map above. Each entry contains the following information:\n",
"\n",
"**Airline**: 2-letter (IATA) or 3-letter (ICAO) code of the airline\\\n",
"**Airline ID**: Unique OpenFlights identifier for airline (see Airline)\\\n",
"**Source airport**: 3-letter (IATA) or 4-letter (ICAO) code of the source airport\\\n",
"**Source airport ID**: Unique OpenFlights identifier for source airport (see Airport)\\\n",
"**Destination airport**: 3-letter (IATA) or 4-letter (ICAO) code of the destination airport\\\n",
"**Destination airport ID**: Unique OpenFlights identifier for destination airport (see Airport)\\\n",
"**Codeshare**: \"Y\" if this flight is a codeshare (that is, not operated by Airline, but another carrier), empty otherwise\\\n",
"**Stops**: Number of stops on this flight (\"0\" for direct)\\\n",
"**Equipment**: 3-letter codes for plane type(s) generally used on this flight, separated by spaces\\\n",
"The data is UTF-8 encoded. The special value \\N is used for \"NULL\" to indicate that no value is available, and is understood automatically by MySQL if imported\n",
"\n",
"\n",
"<aside>\n",
"💡 Notes:\n",
"\n",
"- Routes are directional: if an airline operates services from A to B and from B to A, both A-B and B-A are listed separately.\n",
"- Routes where one carrier operates both its own and codeshare flights are listed only once.\n",
"</aside>\n",
"\n",
"\n",
"### **Countries**\n",
"\n",
"Please find the dataset in the datasets folder.\n",
"\n",
"This dataset contains the information related to European countries.\n"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "base",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.13"
}
},
"nbformat": 4,
"nbformat_minor": 5
}