va/Assignment1/Assignment1.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "46302376",
   "metadata": {},
   "source": [
    "# S&DE Atelier - Visual Analytics\n",
    "\n",
    "# Assignment 1\n",
    "\n",
    "**Due** April 6, 2023 @23:55 \n",
    "\n",
    "**Contacts**: marco.dambros@usi.ch - carmen.armenti@usi.ch\n",
    "\n",
    "---\n",
    "\n",
    "The goal of this assignment is to use Python and Jupyter Notebook to explore, analyze, and visualize the provided datasets. To solve the assignment, you should apply the knowledge you gained from the theoretical and practical lectures. In particular, when creating tabular or graphical representations, you should apply the principles you learned in the theoretical lectures and use the technologies presented during the practical lectures. As for the visualization library, we suggest using the libraries presented in class (Seaborn, Matplotlib, Bokeh), but other libraries (e.g., plotly) are also allowed. You should submit a Jupyter notebook (named `SurenameName_Assignment1.ipynb`) that contains your solutions and the steps followed to arrive at these solutions. Please follow the structure of the assignment to solve the exercises.\n",
    "\n",
    "The datasets you need to use are described in the **Datasets description** section."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 35,
   "id": "fcf3beb9",
   "metadata": {},
   "outputs": [],
   "source": [
    "#%pip install pandas seaborn matplotlib bokeh ftfy geopandas jupyter_bokeh\n",
    "\n",
    "import pandas as pd\n",
    "import numpy as np\n",
    "import seaborn as sns\n",
    "import matplotlib.pyplot as plt\n",
    "import bokeh\n",
    "import ftfy\n",
    "import matplotlib as mpl\n",
    "import geopandas as gpd\n",
    "from bokeh.plotting import figure, show, output_notebook\n",
    "from bokeh.models import GeoJSONDataSource, ColumnDataSource, Legend, BoxSelectTool, HoverTool, TapTool, CustomJS\n",
    "from bokeh.layouts import gridplot, column, row"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6f271000",
   "metadata": {},
   "source": [
    "## Exercise 1 - Data quality (15 points) 🧼\n",
    "\n",
    "In the Used Cars dataset, identify the missing and invalid values for the columns `vehicle type`, `price`, `brand`, and `month of registration`. If needed, standardize the information and convert it to unique values. Please specify for each column the number of missing or invalid instances. The prices are in euros and the range of accepted prices is between €1'000 and €100'000.\n",
    "Once you identified missing/invalid values for the given columns, remove all rows where one or more columns have invalid/missing data.\n",
    "Show the steps that you follow to reach the solution. You can choose your preferred approach/technology to clean the dataset (e.g., Python vanilla, Pandas, OpenRefine). "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "a0af6847",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "('Ü', 'sloppy-windows-1252')"
      ]
     },
     "execution_count": 2,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# UTF-8 decoding fails because of this byte; ftfy guesses that it is Windows-1252 encoded\n",
    "ftfy.guess_bytes(b'\\xDC')"
   ]
  },
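  {
   "cell_type": "markdown",
   "id": "enc-sketch-01",
   "metadata": {},
   "source": [
    "As a minimal sketch of why the encoding matters, the snippet below uses only the Python standard library and is independent of the dataset. It assumes nothing beyond the single byte inspected above: on its own, `0xDC` is not valid UTF-8, while Windows-1252 maps it to `Ü`.\n",
    "\n",
    "```python\n",
    "raw = b'\\xDC'\n",
    "\n",
    "# In UTF-8, 0xDC is a lead byte that must be followed by a continuation byte,\n",
    "# so decoding it on its own raises an error\n",
    "try:\n",
    "    raw.decode('utf-8')\n",
    "except UnicodeDecodeError as err:\n",
    "    print(f'UTF-8 failed: {err.reason}')\n",
    "\n",
    "# Windows-1252 is a single-byte encoding, so the same byte decodes directly\n",
    "print(raw.decode('windows-1252'))\n",
    "```\n",
    "\n",
    "This is why the CSV is read with `encoding=\"windows-1252\"` in the next cell."
   ]
  },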
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "22ce9426",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>dateCrawled</th>\n",
       "      <th>name</th>\n",
       "      <th>seller</th>\n",
       "      <th>offerType</th>\n",
       "      <th>price</th>\n",
       "      <th>abtest</th>\n",
       "      <th>vehicleType</th>\n",
       "      <th>yearOfRegistration</th>\n",
       "      <th>gearbox</th>\n",
       "      <th>powerPS</th>\n",
       "      <th>model</th>\n",
       "      <th>kilometer</th>\n",
       "      <th>monthOfRegistration</th>\n",
       "      <th>fuelType</th>\n",
       "      <th>brand</th>\n",
       "      <th>notRepairedDamage</th>\n",
       "      <th>dateCreated</th>\n",
       "      <th>nrOfPictures</th>\n",
       "      <th>postalCode</th>\n",
       "      <th>lastSeen</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>2016-03-24 11:52:17</td>\n",
       "      <td>Golf_3_1.6</td>\n",
       "      <td>privat</td>\n",
       "      <td>Angebot</td>\n",
       "      <td>480</td>\n",
       "      <td>test</td>\n",
       "      <td>NaN</td>\n",
       "      <td>1993</td>\n",
       "      <td>manuell</td>\n",
       "      <td>0</td>\n",
       "      <td>golf</td>\n",
       "      <td>150000</td>\n",
       "      <td>0</td>\n",
       "      <td>benzin</td>\n",
       "      <td>volkswagen</td>\n",
       "      <td>NaN</td>\n",
       "      <td>2016-03-24 00:00:00</td>\n",
       "      <td>0</td>\n",
       "      <td>70435</td>\n",
       "      <td>2016-04-07 03:16:57</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2016-03-24 10:58:45</td>\n",
       "      <td>A5_Sportback_2.7_Tdi</td>\n",
       "      <td>privat</td>\n",
       "      <td>Angebot</td>\n",
       "      <td>18300</td>\n",
       "      <td>test</td>\n",
       "      <td>coupe</td>\n",
       "      <td>2011</td>\n",
       "      <td>manuell</td>\n",
       "      <td>190</td>\n",
       "      <td>NaN</td>\n",
       "      <td>125000</td>\n",
       "      <td>5</td>\n",
       "      <td>diesel</td>\n",
       "      <td>audi</td>\n",
       "      <td>ja</td>\n",
       "      <td>2016-03-24 00:00:00</td>\n",
       "      <td>0</td>\n",
       "      <td>66954</td>\n",
       "      <td>2016-04-07 01:46:50</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2016-03-14 12:52:21</td>\n",
       "      <td>Jeep_Grand_Cherokee_\"Overland\"</td>\n",
       "      <td>privat</td>\n",
       "      <td>Angebot</td>\n",
       "      <td>9800</td>\n",
       "      <td>test</td>\n",
       "      <td>suv</td>\n",
       "      <td>2004</td>\n",
       "      <td>automatik</td>\n",
       "      <td>163</td>\n",
       "      <td>grand</td>\n",
       "      <td>125000</td>\n",
       "      <td>8</td>\n",
       "      <td>diesel</td>\n",
       "      <td>jeep</td>\n",
       "      <td>NaN</td>\n",
       "      <td>2016-03-14 00:00:00</td>\n",
       "      <td>0</td>\n",
       "      <td>90480</td>\n",
       "      <td>2016-04-05 12:47:46</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>2016-03-17 16:54:04</td>\n",
       "      <td>GOLF_4_1_4__3TÜRER</td>\n",
       "      <td>privat</td>\n",
       "      <td>Angebot</td>\n",
       "      <td>1500</td>\n",
       "      <td>test</td>\n",
       "      <td>kleinwagen</td>\n",
       "      <td>2001</td>\n",
       "      <td>manuell</td>\n",
       "      <td>75</td>\n",
       "      <td>golf</td>\n",
       "      <td>150000</td>\n",
       "      <td>6</td>\n",
       "      <td>benzin</td>\n",
       "      <td>volkswagen</td>\n",
       "      <td>nein</td>\n",
       "      <td>2016-03-17 00:00:00</td>\n",
       "      <td>0</td>\n",
       "      <td>91074</td>\n",
       "      <td>2016-03-17 17:40:17</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>2016-03-31 17:25:20</td>\n",
       "      <td>Skoda_Fabia_1.4_TDI_PD_Classic</td>\n",
       "      <td>privat</td>\n",
       "      <td>Angebot</td>\n",
       "      <td>3600</td>\n",
       "      <td>test</td>\n",
       "      <td>kleinwagen</td>\n",
       "      <td>2008</td>\n",
       "      <td>manuell</td>\n",
       "      <td>69</td>\n",
       "      <td>fabia</td>\n",
       "      <td>90000</td>\n",
       "      <td>7</td>\n",
       "      <td>diesel</td>\n",
       "      <td>skoda</td>\n",
       "      <td>nein</td>\n",
       "      <td>2016-03-31 00:00:00</td>\n",
       "      <td>0</td>\n",
       "      <td>60437</td>\n",
       "      <td>2016-04-06 10:17:21</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "           dateCrawled                            name  seller offerType  \\\n",
       "0  2016-03-24 11:52:17                      Golf_3_1.6  privat   Angebot   \n",
       "1  2016-03-24 10:58:45            A5_Sportback_2.7_Tdi  privat   Angebot   \n",
       "2  2016-03-14 12:52:21  Jeep_Grand_Cherokee_\"Overland\"  privat   Angebot   \n",
       "3  2016-03-17 16:54:04              GOLF_4_1_4__3TÜRER  privat   Angebot   \n",
       "4  2016-03-31 17:25:20  Skoda_Fabia_1.4_TDI_PD_Classic  privat   Angebot   \n",
       "\n",
       "   price abtest vehicleType  yearOfRegistration    gearbox  powerPS  model  \\\n",
       "0    480   test         NaN                1993    manuell        0   golf   \n",
       "1  18300   test       coupe                2011    manuell      190    NaN   \n",
       "2   9800   test         suv                2004  automatik      163  grand   \n",
       "3   1500   test  kleinwagen                2001    manuell       75   golf   \n",
       "4   3600   test  kleinwagen                2008    manuell       69  fabia   \n",
       "\n",
       "   kilometer  monthOfRegistration fuelType       brand notRepairedDamage  \\\n",
       "0     150000                    0   benzin  volkswagen               NaN   \n",
       "1     125000                    5   diesel        audi                ja   \n",
       "2     125000                    8   diesel        jeep               NaN   \n",
       "3     150000                    6   benzin  volkswagen              nein   \n",
       "4      90000                    7   diesel       skoda              nein   \n",
       "\n",
       "           dateCreated  nrOfPictures  postalCode             lastSeen  \n",
       "0  2016-03-24 00:00:00             0       70435  2016-04-07 03:16:57  \n",
       "1  2016-03-24 00:00:00             0       66954  2016-04-07 01:46:50  \n",
       "2  2016-03-14 00:00:00             0       90480  2016-04-05 12:47:46  \n",
       "3  2016-03-17 00:00:00             0       91074  2016-03-17 17:40:17  \n",
       "4  2016-03-31 00:00:00             0       60437  2016-04-06 10:17:21  "
      ]
     },
     "execution_count": 3,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Reading using windows-1252 works\n",
    "df_used = pd.read_csv(\"./datasets/used_cars_dataset.csv\", encoding=\"windows-1252\")\n",
    "df_used.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "a332b6a5",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'dateCrawled': ['str'],\n",
       " 'name': ['str'],\n",
       " 'seller': ['str'],\n",
       " 'offerType': ['str'],\n",
       " 'price': ['int64'],\n",
       " 'abtest': ['str'],\n",
       " 'vehicleType': ['str', 'nan'],\n",
       " 'yearOfRegistration': ['int64'],\n",
       " 'gearbox': ['str', 'nan'],\n",
       " 'powerPS': ['int64'],\n",
       " 'model': ['str', 'nan'],\n",
       " 'kilometer': ['int64'],\n",
       " 'monthOfRegistration': ['int64'],\n",
       " 'fuelType': ['str', 'nan'],\n",
       " 'brand': ['str'],\n",
       " 'notRepairedDamage': ['str', 'nan'],\n",
       " 'dateCreated': ['str'],\n",
       " 'nrOfPictures': ['int64'],\n",
       " 'postalCode': ['int64'],\n",
       " 'lastSeen': ['str']}"
      ]
     },
     "execution_count": 4,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Here I check the types and the presence of missing values for each column\n",
    "types = {}\n",
    "\n",
    "for col in df_used.columns:\n",
    "    t = set([type(x).__name__ if type(x) != float or not np.isnan(x) else 'nan' for x in df_used[col].unique()])\n",
    "    types[col] = list(t)\n",
    "\n",
    "types"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "11bfa9a2",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "dateCrawled: []\n",
      "name: []\n",
      "seller: []\n",
      "offerType: []\n",
      "price: []\n",
      "abtest: []\n",
      "vehicleType: []\n",
      "yearOfRegistration: []\n",
      "gearbox: []\n",
      "powerPS: []\n",
      "model: []\n",
      "kilometer: []\n",
      "monthOfRegistration: []\n",
      "fuelType: []\n",
      "brand: []\n",
      "notRepairedDamage: []\n",
      "dateCreated: []\n",
      "nrOfPictures: []\n",
      "postalCode: []\n",
      "lastSeen: []\n"
     ]
    }
   ],
   "source": [
    "# Here I check for numeric values that have decimal digits (i.e. that are not integers).\n",
    "for col in df_used.columns:\n",
    "    print(f\"{col}: {str([x for x in df_used[col].unique() if type(x) == float and not np.isnan(x) and round(x) != x])}\")\n",
    "\n",
    "# As shown, there are none, therefore we can use the Int64 dtype in numeric columns"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 6,
   "id": "f1c539c4",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "dateCrawled: False\n",
      "name: False\n",
      "seller: False\n",
      "offerType: False\n",
      "price: False\n",
      "abtest: False\n",
      "vehicleType: False\n",
      "yearOfRegistration: False\n",
      "gearbox: False\n",
      "powerPS: False\n",
      "model: False\n",
      "kilometer: False\n",
      "monthOfRegistration: False\n",
      "fuelType: False\n",
      "brand: False\n",
      "notRepairedDamage: False\n",
      "dateCreated: False\n",
      "nrOfPictures: False\n",
      "postalCode: False\n",
      "lastSeen: False\n"
     ]
    }
   ],
   "source": [
    "# Here I check if any column is unique to find potential candidates for the index\n",
    "for col in df_used.columns:\n",
    "    print(f\"{col}: {df_used[col].is_unique}\")\n",
    "\n",
    "# None are unique, so I use the default numeric index"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 7,
   "id": "86074e70",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>dateCrawled</th>\n",
       "      <th>name</th>\n",
       "      <th>seller</th>\n",
       "      <th>offerType</th>\n",
       "      <th>price</th>\n",
       "      <th>abtest</th>\n",
       "      <th>vehicleType</th>\n",
       "      <th>yearOfRegistration</th>\n",
       "      <th>gearbox</th>\n",
       "      <th>powerPS</th>\n",
       "      <th>model</th>\n",
       "      <th>kilometer</th>\n",
       "      <th>monthOfRegistration</th>\n",
       "      <th>fuelType</th>\n",
       "      <th>brand</th>\n",
       "      <th>notRepairedDamage</th>\n",
       "      <th>dateCreated</th>\n",
       "      <th>nrOfPictures</th>\n",
       "      <th>postalCode</th>\n",
       "      <th>lastSeen</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>2016-03-24 11:52:17</td>\n",
       "      <td>Golf_3_1.6</td>\n",
       "      <td>privat</td>\n",
       "      <td>Angebot</td>\n",
       "      <td>480</td>\n",
       "      <td>test</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>1993</td>\n",
       "      <td>manuell</td>\n",
       "      <td>0</td>\n",
       "      <td>golf</td>\n",
       "      <td>150000</td>\n",
       "      <td>0</td>\n",
       "      <td>benzin</td>\n",
       "      <td>volkswagen</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>2016-03-24 00:00:00</td>\n",
       "      <td>0</td>\n",
       "      <td>70435</td>\n",
       "      <td>2016-04-07 03:16:57</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>2016-03-24 10:58:45</td>\n",
       "      <td>A5_Sportback_2.7_Tdi</td>\n",
       "      <td>privat</td>\n",
       "      <td>Angebot</td>\n",
       "      <td>18300</td>\n",
       "      <td>test</td>\n",
       "      <td>coupe</td>\n",
       "      <td>2011</td>\n",
       "      <td>manuell</td>\n",
       "      <td>190</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>125000</td>\n",
       "      <td>5</td>\n",
       "      <td>diesel</td>\n",
       "      <td>audi</td>\n",
       "      <td>ja</td>\n",
       "      <td>2016-03-24 00:00:00</td>\n",
       "      <td>0</td>\n",
       "      <td>66954</td>\n",
       "      <td>2016-04-07 01:46:50</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>2016-03-14 12:52:21</td>\n",
       "      <td>Jeep_Grand_Cherokee_\"Overland\"</td>\n",
       "      <td>privat</td>\n",
       "      <td>Angebot</td>\n",
       "      <td>9800</td>\n",
       "      <td>test</td>\n",
       "      <td>suv</td>\n",
       "      <td>2004</td>\n",
       "      <td>automatik</td>\n",
       "      <td>163</td>\n",
       "      <td>grand</td>\n",
       "      <td>125000</td>\n",
       "      <td>8</td>\n",
       "      <td>diesel</td>\n",
       "      <td>jeep</td>\n",
       "      <td>&lt;NA&gt;</td>\n",
       "      <td>2016-03-14 00:00:00</td>\n",
       "      <td>0</td>\n",
       "      <td>90480</td>\n",
       "      <td>2016-04-05 12:47:46</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>2016-03-17 16:54:04</td>\n",
       "      <td>GOLF_4_1_4__3TÜRER</td>\n",
       "      <td>privat</td>\n",
       "      <td>Angebot</td>\n",
       "      <td>1500</td>\n",
       "      <td>test</td>\n",
       "      <td>kleinwagen</td>\n",
       "      <td>2001</td>\n",
       "      <td>manuell</td>\n",
       "      <td>75</td>\n",
       "      <td>golf</td>\n",
       "      <td>150000</td>\n",
       "      <td>6</td>\n",
       "      <td>benzin</td>\n",
       "      <td>volkswagen</td>\n",
       "      <td>nein</td>\n",
       "      <td>2016-03-17 00:00:00</td>\n",
       "      <td>0</td>\n",
       "      <td>91074</td>\n",
       "      <td>2016-03-17 17:40:17</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>2016-03-31 17:25:20</td>\n",
       "      <td>Skoda_Fabia_1.4_TDI_PD_Classic</td>\n",
       "      <td>privat</td>\n",
       "      <td>Angebot</td>\n",
       "      <td>3600</td>\n",
       "      <td>test</td>\n",
       "      <td>kleinwagen</td>\n",
       "      <td>2008</td>\n",
       "      <td>manuell</td>\n",
       "      <td>69</td>\n",
       "      <td>fabia</td>\n",
       "      <td>90000</td>\n",
       "      <td>7</td>\n",
       "      <td>diesel</td>\n",
       "      <td>skoda</td>\n",
       "      <td>nein</td>\n",
       "      <td>2016-03-31 00:00:00</td>\n",
       "      <td>0</td>\n",
       "      <td>60437</td>\n",
       "      <td>2016-04-06 10:17:21</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "           dateCrawled                            name  seller offerType  \\\n",
       "0  2016-03-24 11:52:17                      Golf_3_1.6  privat   Angebot   \n",
       "1  2016-03-24 10:58:45            A5_Sportback_2.7_Tdi  privat   Angebot   \n",
       "2  2016-03-14 12:52:21  Jeep_Grand_Cherokee_\"Overland\"  privat   Angebot   \n",
       "3  2016-03-17 16:54:04              GOLF_4_1_4__3TÜRER  privat   Angebot   \n",
       "4  2016-03-31 17:25:20  Skoda_Fabia_1.4_TDI_PD_Classic  privat   Angebot   \n",
       "\n",
       "   price abtest vehicleType  yearOfRegistration    gearbox  powerPS  model  \\\n",
       "0    480   test        <NA>                1993    manuell        0   golf   \n",
       "1  18300   test       coupe                2011    manuell      190   <NA>   \n",
       "2   9800   test         suv                2004  automatik      163  grand   \n",
       "3   1500   test  kleinwagen                2001    manuell       75   golf   \n",
       "4   3600   test  kleinwagen                2008    manuell       69  fabia   \n",
       "\n",
       "   kilometer  monthOfRegistration fuelType       brand notRepairedDamage  \\\n",
       "0     150000                    0   benzin  volkswagen              <NA>   \n",
       "1     125000                    5   diesel        audi                ja   \n",
       "2     125000                    8   diesel        jeep              <NA>   \n",
       "3     150000                    6   benzin  volkswagen              nein   \n",
       "4      90000                    7   diesel       skoda              nein   \n",
       "\n",
       "           dateCreated  nrOfPictures  postalCode             lastSeen  \n",
       "0  2016-03-24 00:00:00             0       70435  2016-04-07 03:16:57  \n",
       "1  2016-03-24 00:00:00             0       66954  2016-04-07 01:46:50  \n",
       "2  2016-03-14 00:00:00             0       90480  2016-04-05 12:47:46  \n",
       "3  2016-03-17 00:00:00             0       91074  2016-03-17 17:40:17  \n",
       "4  2016-03-31 00:00:00             0       60437  2016-04-06 10:17:21  "
      ]
     },
     "execution_count": 7,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# I read the dataset again, using the column types found above\n",
    "df_used = pd.read_csv(\"./datasets/used_cars_dataset.csv\", encoding=\"windows-1252\", dtype={\n",
    "    'dateCrawled': str,\n",
    "    'name': pd.StringDtype(),\n",
    "    'seller': pd.StringDtype(),\n",
    "    'offerType': pd.StringDtype(),\n",
    "    'price': pd.Int64Dtype(),\n",
    "    'abtest': pd.StringDtype(),\n",
    "    'vehicleType': pd.StringDtype(),\n",
    "    'yearOfRegistration': pd.Int64Dtype(),\n",
    "    'gearbox': pd.StringDtype(),\n",
    "    'powerPS': pd.Int64Dtype(),\n",
    "    'model': pd.StringDtype(),\n",
    "    'kilometer': pd.Int64Dtype(),\n",
    "    'monthOfRegistration': pd.Int64Dtype(),\n",
    "    'fuelType': pd.StringDtype(),\n",
    "    'brand': pd.StringDtype(),\n",
    "    'notRepairedDamage': pd.StringDtype(),\n",
    "    'dateCreated': pd.StringDtype(),\n",
    "    'nrOfPictures': pd.Int64Dtype(),\n",
    "    'postalCode': pd.Int64Dtype(),\n",
    "    'lastSeen': pd.StringDtype()\n",
    "})\n",
    "df_used.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "6a3f2455",
   "metadata": {},
   "source": [
    "From here onwards, I investigate the missing and invalid values. Whenever I find an invalid value, I replace it with `<NA>` so that it is encoded like a missing value. This makes it easy to count and drop them all in one go."
   ]
  },
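  {
   "cell_type": "markdown",
   "id": "na-sketch-01",
   "metadata": {},
   "source": [
    "The idea can be sketched on a small toy frame. The data and the names `toy` and `clean` are assumptions made for illustration and are not taken from the Used Cars dataset. Invalid values are overwritten with `pd.NA` one column at a time, so that a single `isna().sum()` counts missing and invalid entries together.\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "\n",
    "# Toy data for illustration only\n",
    "toy = pd.DataFrame({\n",
    "    'price': pd.array([500, 2500, 150000, 8000], dtype='Int64'),\n",
    "    'monthOfRegistration': pd.array([3, 0, 7, 12], dtype='Int64'),\n",
    "})\n",
    "\n",
    "# Mark invalid values as missing, touching only the affected column\n",
    "toy.loc[(toy.price < 1000) | (toy.price > 100000), 'price'] = pd.NA\n",
    "toy.loc[toy.monthOfRegistration == 0, 'monthOfRegistration'] = pd.NA\n",
    "\n",
    "# Missing and invalid values are now counted together, per column\n",
    "print(toy.isna().sum())\n",
    "\n",
    "# Drop every row with at least one missing value in the checked columns\n",
    "clean = toy.dropna(subset=['price', 'monthOfRegistration']).reset_index(drop=True)\n",
    "print(clean)\n",
    "```\n",
    "\n",
    "Selecting the column in `.loc` keeps the per-column counts meaningful, because a row that is invalid in one column is not blanked out in the others."
   ]
  },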
  {
   "cell_type": "code",
   "execution_count": 8,
   "id": "8b6f9ce3",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "vehicleType: [<NA>, 'andere', 'bus', 'cabrio', 'coupe', 'kleinwagen', 'kombi', 'limousine', 'suv']\n",
      "brand: ['BMW', 'alfa_romeo', 'audi', 'bmw', 'bmw ', 'chevrolet', 'chrysler', 'citroen', 'dacia', 'daewoo', 'daihatsu', 'fiat', 'ford', 'honda', 'hyundai', 'jaguar', 'jeep', 'kia', 'lada', 'lancia', 'land_rover', 'mazda', 'mercedes_benz', 'mini', 'mitsubishi', 'nissan', 'opel', 'peugeot', 'porsche', 'renault', 'rover', 'saab', 'seat', 'skoda', 'smart', 'sonstige_autos', 'subaru', 'suzuki', 'toyota', 'trabant', 'volkswagen', 'volvo']\n",
      "monthOfRegistration: [0, 1, 10, 11, 12, 2, 3, 4, 5, 6, 7, 8, 9]\n"
     ]
    }
   ],
   "source": [
    "# I look at the values of the indicated columns to find odd values. \n",
    "# Indeed, some brand values use mixed case (BMW) and spaces ('bmw '). Additionally, \n",
    "# a month of registration = 0 does not make sense when the other values are in the\n",
    "# 1-12 range.\n",
    "cols = [\"vehicleType\", \"brand\", \"monthOfRegistration\"]\n",
    "\n",
    "def print_col(col: str):\n",
    "    print(f\"{col}: {str(sorted(df_used[col].unique(), key=lambda x: str(x)))}\")\n",
    "\n",
    "for col in cols:\n",
    "    print_col(col)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 9,
   "id": "98f8d101",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "brand: ['alfa_romeo', 'audi', 'bmw', 'chevrolet', 'chrysler', 'citroen', 'dacia', 'daewoo', 'daihatsu', 'fiat', 'ford', 'honda', 'hyundai', 'jaguar', 'jeep', 'kia', 'lada', 'lancia', 'land_rover', 'mazda', 'mercedes_benz', 'mini', 'mitsubishi', 'nissan', 'opel', 'peugeot', 'porsche', 'renault', 'rover', 'saab', 'seat', 'skoda', 'smart', 'sonstige_autos', 'subaru', 'suzuki', 'toyota', 'trabant', 'volkswagen', 'volvo']\n",
      "monthOfRegistration: [1, 10, 11, 12, 2, 3, 4, 5, 6, 7, 8, 9, <NA>]\n"
     ]
    }
   ],
   "source": [
    "# Some brands are written using mixed case or with spaces, hence here I normalize to stripped lowercase\n",
    "df_used.brand = df_used.brand.str.lower().str.strip()\n",
    "print_col(\"brand\")\n",
    "\n",
    "# monthOfRegistration=0 is invalid, hence I mark it as missing in that column only\n",
    "# (assigning to df_used[mask] without a column would blank out the entire row)\n",
    "df_used.loc[df_used.monthOfRegistration == 0, \"monthOfRegistration\"] = pd.NA\n",
    "print_col(\"monthOfRegistration\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 10,
   "id": "f300f49d",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "notRepairedDamage: [<NA>, 'ja', 'nein']\n"
     ]
    }
   ],
   "source": [
    "# This column only has 'ja' and 'nein' as non-missing values, so we can convert it to a boolean\n",
    "print_col(\"notRepairedDamage\")"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 11,
   "id": "923c5354",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Hence we map the column to boolean values\n",
    "df_used.notRepairedDamage = df_used.notRepairedDamage.map({'ja': True, 'nein': False})"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 12,
   "id": "4b847b1f",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Prices not in the 1000-100'000 range are invalid, hence I convert them to NaN\n",
    "df_used.loc[(df_used.price.isna()) | (df_used.price < 1000) | (df_used.price > 100_000), \"price\"] = np.nan"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 13,
   "id": "bf1f417d",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "dateCrawled             37675\n",
       "name                    37675\n",
       "seller                  37675\n",
       "offerType               37675\n",
       "price                  101662\n",
       "abtest                  37675\n",
       "vehicleType             60491\n",
       "yearOfRegistration      37675\n",
       "gearbox                 47998\n",
       "powerPS                 37675\n",
       "model                   51550\n",
       "kilometer               37675\n",
       "monthOfRegistration     37675\n",
       "fuelType                57286\n",
       "brand                   37675\n",
       "notRepairedDamage       87440\n",
       "dateCreated             37675\n",
       "nrOfPictures            37675\n",
       "postalCode              37675\n",
       "lastSeen                37675\n",
       "dtype: int64"
      ]
     },
     "execution_count": 13,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# This reports the number of values in each column that are missing or invalid\n",
    "df_used.isna().sum()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 14,
   "id": "919e692f",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Here I drop the missing values and i re-enumerate all the rows with the automatic numeric index\n",
    "df_used = df_used.dropna().reset_index(drop=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "47a3929f",
   "metadata": {},
   "source": [
    "## Exercise 2 - Data analysis (20 points) 📊\n",
    "\n",
    "1. We consider the norm to be that, for a given type of vehicle, on average the price of diesel is greater than the one of benzine. Provide a representation of the data which shows if, and to which extent, the various vehicle types conform to the norm.\n",
    "What relationship are you showing? Please justify the choice of the representation and your answer.\n",
    "2. Find an appropriate way to show and compare the range of prices for the following `brand`: **mercedes_benz**, **fiat**, **volvo**, **alfa_romeo** and **lancia**. Create a suitable graphical representation of this data. What kind of relationship are you showing? Describe what can be understood from the plot. Please justify your answer and your choice of the graphical representation.\n",
    "\n",
    "<aside>\n",
    "💡 N.B. In this section you should work on the clean Used Cars dataset, without the missing and invalid data.\n",
    "</aside>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e2ae928d",
   "metadata": {},
   "source": [
    "### 2.1\n",
    "\n",
    "By interpreting the following requirement:\n",
    "\n",
    "> on average the price of diesel is greater than the one of benzine\n",
    "\n",
    "as meaning that we expect the average price of diesel cars to be greater than the average cars of _benzin_ cars for each car type, I choose to represent the relationship between each car type and the difference of these average values (i.e. $y=E({\\text{diesel}}) - E({\\text{benzin}})$, where a positive value of $y$ would confirm the expectation).\n",
    "\n",
    "To represent this relationship I choose to use a simple bar chart plotting these differences. I choose to plot a single series for the difference instead of both series for both fuel types to further focus the reader on the difference and not the values. Additionally, plotting the difference only makes comparing the difference value between car types easier as they are all aligned with the origin."
   ]
  },
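  {
   "cell_type": "markdown",
   "id": "diff-sketch-01",
   "metadata": {},
   "source": [
    "A minimal sketch of how such a difference can be computed with pandas is shown here on toy data. The frame `toy` and its values are assumptions made for illustration, and this is not necessarily the exact computation used in the cell that follows.\n",
    "\n",
    "```python\n",
    "import pandas as pd\n",
    "\n",
    "# Toy data for illustration only\n",
    "toy = pd.DataFrame({\n",
    "    'vehicleType': ['suv', 'suv', 'kombi', 'kombi', 'kombi'],\n",
    "    'fuelType': ['diesel', 'benzin', 'diesel', 'benzin', 'benzin'],\n",
    "    'price': [12000, 9000, 5000, 6000, 8000],\n",
    "})\n",
    "\n",
    "# Mean price per (vehicle type, fuel type), with fuel types as columns\n",
    "means = toy.groupby(['vehicleType', 'fuelType'])['price'].mean().unstack('fuelType')\n",
    "\n",
    "# Difference of the means: positive values conform to the norm\n",
    "diff = means['diesel'] - means['benzin']\n",
    "print(diff)\n",
    "```\n",
    "\n",
    "The resulting Series is indexed by vehicle type, so it can be passed directly to a bar chart with one bar per type."
   ]
  },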
  {
   "cell_type": "code",
   "execution_count": 15,
   "id": "7cc5c90f",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAAAxoAAAIXCAYAAAAbqSg4AAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjcuMSwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/bCgiHAAAACXBIWXMAAA9hAAAPYQGoP6dpAABnXklEQVR4nO3dd3gU1dvG8Xs3ISSUUEIJUqRIAkg3iSBFOkhRIKAgHaMgCAJSBaQoiHQBIyIgRZEiTYo0QRAMoaigBAhdLNQQagrJ7vsHv+zLmgC7YUKy+P1cF5fJnDNnnz1ZJ3tn5uyYrFarVQAAAABgIHN6FwAAAADg8UPQAAAAAGA4ggYAAAAAwxE0AAAAABiOoAEAAADAcAQNAAAAAIYjaAAAAAAwHEEDAAAAgOEIGgAAAAAM5zJBY9WqVWrcuLHKlSunJk2a6LvvvrO1/fnnn+rWrZsqV66s6tWra+rUqUpMTLTb/6uvvlLdunVVvnx5vfrqq4qIiLBrd2QMAAAAAI5xiaCxevVqDR06VO3atdO6devUtGlT9evXT7/88otu376t1157TZK0ePFijRw5Ul9//bU++eQT2/4rV67U+PHj9fbbb2vFihUqVKiQunTpoqioKElyaAwAAAAAjjNZrVZrehdxP1arVXXr1lXDhg01aNAg2/bXXntNQUFBKliwoIYMGaKdO3cqR44ckqQlS5Zo/PjxCgsLk4eHhxo2bKh69eppwIABkqSEhATVq1dPbdu2Vbdu3bR27doHjgEAAADAcRn+jMapU6f0119/qVmzZnbb58yZo27dumnfvn16+umnbQFBkqpUqaIbN27o8OHDunz5sk6fPq2qVava2t3d3RUQEKC9e/dK0gPHAAAAAOAclwgaknTr1i299tprqlq1qlq3bq2tW7dKks6dOydfX1+7ffLlyydJ+ueff3Tu3DlJUoECBZL1SWp70BgAAAAAnJPhg8aNGzckSYMGDVLTpk01d+5cVatWTT169FBYWJhiY2OTXdqUOXNmSVJcXJxiYmIkKcU+cXFxkvTAMVIrg1+VBgAAAKQZ9/Qu4EEyZcok6c6ajBYtWkiSSpcurYiICH3xxRfy9PRUfHy83T5J4SBLlizy9PSUpBT7eHl5SdIDx0itqKibMptNqd4fAAAAyGhy5crqUL8MHzTy588vSfLz87Pb/tRTT+mHH35QUFCQIiMj7douXLhg2zfpkqkLFy6oRIkSdn2Sxvb19b3vGKllsVhlsXBWAwAAAP89Gf7SqaefflpZs2bVgQMH7LZHRkaqSJEiCgwMVEREhO0SK0navXu3smbNqlKlSsnHx0fFihVTeHi4rT0hIUH79u1TYGCgJD1wDAAAAADOyfBBw9PTUyEhIfrkk0+0du1a/fHHH/r000+1a9cudenSRfXq1VPevHnVp08fHTlyRFu2bNHkyZPVtWtX27qLrl276osvvtDKlSt1/Phxvfvuu4qNjVWrVq0kyaExAAAAADguw99HI8kXX3yhL7/8UufPn1eJEiXUq1cv1atXT5J05swZjRo1Svv27VOOHDnUqlUr9erVS2bz/+eoOXPmaMGCBYqOjlbZsmU1bNgwlS5d2tbuyBjOunjxeuqfMAAAAJAB5c2b3aF+LhM0XBFBAwAAAI8bR4NGhr90CgAAAIDryfCfOgUAAACkBbPZxK0I/sXIT00laAAAAOA/x2w2KWeOLHJz5wKfuyUmWBR99ZYhYYOgAQAAgP8cs9kkN3ezhvTYppOR0eldToZQ3C+nPgytLbPZRNAAAAAAHsbJyGgd+e1yepfxWOJcEQAAAADDETQAAAAAGI6gAQAAAMBwBA0AAAAAhiNoAAAAADAcQQMAAACA4QgaAAAAAAxH0AAAAABgOIIGAAAAAMMRNAAAAAAYjqABAAAAwHAEDQAAAACGI2gAAAAAMBxBAwAAAIDhCBoAAAAADEfQAAAAAGA4ggYAAAAAwxE0AAAAABiO
oAEAAADAcAQNAAAAAIYjaAAAAAAwHEEDAAAAgOEIGgAAAAAMR9AAAAAAYDiCBgAAAADDETQAAAAAGI6gAQAAAMBwBA0AAAAAhiNoAAAAADAcQQMAAACA4QgaAAAAAAxH0AAAAABgOIIGAAAAAMMRNAAAAAAYjqABAAAAwHAEDQAAAACGI2gAAAAAMBxBAwAAAIDhCBoAAAAADEfQAAAAAGA4ggYAAAAAwxE0AAAAABiOoAEAAADAcAQNAAAAAIYjaAAAAAAwHEEDAAAAgOEIGgAAAAAMR9AAAAAAYDiCBgAAAADDuUTQOH/+vPz9/ZP9W7FihSTp8OHDat++vSpWrKg6depowYIFdvtbLBZNmzZNNWrUUMWKFfX666/r7Nmzdn0eNAYAAAAAx7mndwGOOHLkiDJnzqwtW7bIZDLZtmfPnl1XrlxRly5dVKdOHY0aNUq//vqrRo0apaxZsyo4OFiSFBoaqkWLFmncuHHy9fXVhAkTFBISojVr1sjDw8OhMQAAAAA4ziWCRmRkpIoWLap8+fIla5s/f74yZcqk0aNHy93dXSVKlNCZM2c0a9YsBQcHKz4+XnPnzlX//v1Vq1YtSdKUKVNUo0YNbdq0SU2bNtXSpUvvOwYAAAAA57jEpVNHjx5ViRIlUmzbt2+fgoKC5O7+/5mpSpUqOn36tC5duqQjR47o5s2bqlq1qq3d29tbZcqU0d69ex0aAwAAAIBzXOaMRq5cudSuXTudOnVKTz75pN58803VrFlT586dk5+fn13/pDMf//zzj86dOydJKlCgQLI+SW0PGiNPnjypqttsNslsNj24IwAAAB4pNzeX+Ht7ujBqbjJ80EhISNDJkyf11FNPafDgwcqWLZvWrVunN954Q1988YViY2Pl4eFht0/mzJklSXFxcYqJiZGkFPtcvXpVkh44Rmrlzp3Vbk0JAAAAkNF5e3sZMk6GDxru7u4KDw+Xm5ubPD09JUlly5bVsWPHNGfOHHl6eio+Pt5un6RwkCVLFts+8fHxtq+T+nh53ZnEB42RWlFRNzmjAQAAkAG5uZkNe0P9uLl2LUaJiZZ7tufKldWhcTJ80JCkrFmTP5mSJUtq586d8vX11YULF+zakr7Pnz+/EhISbNuKFCli18ff31+SHjhGalksVlks1lTvDwAAADxqiYkWJSTcO2g4KsNfnHbs2DFVrlxZ4eHhdtt///13PfXUUwoMDNT+/fuVmJhoa9u9e7eKFSsmHx8flSpVStmyZbPb/9q1a4qIiFBgYKAkPXAMAAAAAM7J8EGjRIkSKl68uEaPHq19+/bpxIkT+vDDD/Xrr7/qzTffVHBwsG7cuKGhQ4fq+PHjWrFihebNm6du3bpJurM2o3379po4caK+//57HTlyRH379pWvr68aNGggSQ8cAwAAAIBzTFarNcNf23Pp0iVNmjRJP/74o65du6YyZcqof//+CggIkCQdPHhQY8aMUUREhPLmzauuXbuqffv2tv0TExM1efJkrVixQrGxsQoMDNR7772nQoUK2fo8aIzUuHjx+kPtDwAAgLTh7m5WrlxZ9Uq9lTry2+X0LidDKFXOR0u2tNCVKzfve+lU3rzZHRrPJYKGqyJoAAAAZEwEjeSMDhoZ/tIpAAAAAK6HoAEAAADAcAQNAAAAAIYjaAAAAAAwHEEDAAAAgOEIGgAAAAAMR9AAAAAAYDiCBgAAAADDETQAAAAAGI6gAQAAAMBwBA0AAAAAhiNoAAAAADAcQQMAAACA4QgaAAAAAAxH0AAAAABgOIIGAAAAAMMRNAAAAAAYjqABAAAAwHAEDQAAAACGI2gAAAAAMBxBAwAAAIDhCBoAAAAADEfQAAAAAGA4ggYAAAAAwxE0AAAAABiOoAEAAADAcAQNAAAAAIYjaAAAAAAwHEEDAAAAgOEIGgAAAAAMR9AAAAAAYDiCBgAAAADDETQAAAAAGI6gAQAAAMBwBA0AAAAAhiNoAAAAADAcQQMAAACA4Qga
AAAAAAxH0AAAAABgOIIGAAAAAMMRNAAAAAAYjqABAAAAwHAEDQAAAACGI2gAAAAAMBxBAwAAAIDhCBoAAAAADEfQAAAAAGA
      "text/plain": [
       "<Figure size 900x600 with 1 Axes>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "df_diff = df_used \\\n",
    "    .loc[(df_used.fuelType == 'benzin') | (df_used.fuelType == 'diesel'), ['vehicleType', 'fuelType', 'price']]\\\n",
    "    .groupby(['vehicleType', 'fuelType']) \\\n",
    "    .mean() \\\n",
    "    .sort_values(['vehicleType', 'fuelType'], ascending=[True, True]) \\\n",
    "    .groupby('vehicleType') \\\n",
    "    .diff() \\\n",
    "    .reset_index() \\\n",
    "    .set_index('vehicleType')\n",
    "\n",
    "df_diff = df_diff.loc[df_diff.fuelType == 'diesel', ['price']].rename({'price': 'diffPrice'}, axis=1).reset_index()\n",
    "\n",
    "sns.set_theme(palette=\"hls\")\n",
    "\n",
    "# Initialize the matplotlib figure\n",
    "f, ax = plt.subplots(figsize=(9, 6))\n",
    "\n",
    "# Plot the total crashes\n",
    "sns.set_color_codes(\"pastel\")\n",
    "sns.barplot(x=\"vehicleType\", y=\"diffPrice\", data=df_diff,\n",
    "            label=\"avg(diesel) - avg(benzin)\", color=sns.xkcd_rgb[\"ultramarine\"])\n",
    "\n",
    "# Add a legend and informative axis label\n",
    "ax.legend(ncol=2, loc=\"lower right\", frameon=True)\n",
    "ax.set(ylabel=\"Diesel - benzin difference\", ylim=[-1000, 6000], \n",
    "       xlabel=\"Vehicle type\")\n",
    "sns.despine(left=True, bottom=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "7cc33371",
   "metadata": {},
   "source": [
    "### 2.2\n",
    "\n",
    "To compare the range of prices between car brands, I choose to plot the distribution of car prices for each car brand. To achieve this, I choose to use a variant of the box plot called boxen plot, which ditches whiskers in favour of showing octiles, 16-tiles and so on with coloured rectangles similar to the inner quartiles with exponentially smaller heights.\n",
    "\n",
    "From the plot we can see that the `mercedes_benz` car type has the highest median price, and it also has the most right skewed price distribution out of all car brands. `volvo` has the second-highest average and also a skewed price distribution. Both `lancia` and `fiat` are instead more uniformly distributed towards lower prices, while `alfa_romeo` has a similar distribution however with some skewing towards the expensive side. [`trabant`](https://www.youtube.com/watch?v=npMKIUTa3uI) is the cheapest car type.\n",
    "\n",
    "I choose to use a box-plot style graph as it is an effective representation to show some salient characterististics for one-dimensional distributions, such as the median and the quartiles (25% percentile, 75% percentile). I choose a `boxenplot` in particular to better capture the right-skewedness of some distributions with the additional percentiles considered by the octiles (87.5%), 16-tiles (93.75%) and so on exponentially."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 16,
   "id": "ca97e7c8",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAABh8AAAK5CAYAAACvwT+gAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjcuMSwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/bCgiHAAAACXBIWXMAAA9hAAAPYQGoP6dpAACSJElEQVR4nOzde5zc870/8PfsZTaCJlREfm5ZSxRV4tY6kiL0lJQoqhe3U1F16VY1SkNKhKqUqkZUXNL0Qh3ELXHr1UE4PZRSl5YQiaAaqUpckuzszszvj2022ea2m3x3vjO7z+fj4WHznZnv9/19z+zs7vc1n88nUywWiwEAAAAAAJCQqrQLAAAAAAAAuhfhAwAAAAAAkCjhAwAAAAAAkCjhAwAAAAAAkCjhAwAAAAAAkCjhAwAAAAAAkCjhAwAAAAAAkCjhAwAAAAAAkKiatAugeygWi/HPf34QhUIx7VJSU1WViY03Xr/H9yFCL5bSh2X0opU+tNKHZfSilT4soxet9GEZvWilD630YRm9aKUPy+hFK31YRi9a6UMrfVimqioTH/7wBqU5VkmOQreXyWSiqiqTdhmpqqrK6MO/6EUrfVhGL1rpQyt9WEYvWunDMnrRSh+W0YtW+tBKH5bRi1b6sIxetNKHZfSilT600odlStkD4QMAAAAAAJAo4QMAAAAAAJAo4QMAAAAAAJAo4QMAAAAAAJAo4QMAAAAAAJAo4QMAAAAAAJAo4QMAAAAAAJComrQLgI4qFouRy+XSLmOV8vmqWLKkOpqamqKlpVCy42az2chkMiU7HgAAAADAmggfqAjFYjEmTrw85sx5Je1Syk59fUM0No4SQAAAAAAAZcO0S1SEXC5XkuChUCjEm2++GW+++WYUCqUbvbAuZs+eVdYjQgAAAACAnsfIByrOSX16R20Xfci/pVCI376bjYiIT/XtHTVV5ZvPNRcjrl+4KO0yAAAAAABWIHyg4tRmImq7aIqhTCYT1Zmlx8lETVlPZVRMuwAAAAAAgJUq3491AwAAAAAAFUn4AAAAAAAAJEr4AAAAAAAAJEr4AAAAAAAAJEr4AAAAAAAAJEr4AAAAAAAAJEr4AAAAAAAAJEr4AAAAAAAAJEr4AAAAAAAAJEr4AAAAAAAAJEr4AAAAAAAAJEr4AAAAAAAAJEr4AAAAAAAAJEr4AAAAAAAAJEr4AAAAAAAAJKom7QLg3xWLxcjlcu225XJNKVVTGcqtP/l8VSxZUh1NTU3R0lJod1s2m41MJpNSZQAAAABAKQgfKCvFYjEmTrw85sx5Je1SKsrYsaPTLqHD6usborFxlAACAAAAALox0y5RVnK5nOChm5s9e9YKI1sAAAAAgO7FyAfK1tfqt4zaqtZPxzcXCvHj2a+nXFH5+lr9FlFbVd5ZYnOhGD+e/VraZQAAAAAAJSB8oGzVVmUiW+YX1MtFbVVVBfSqsOa7AAAAAADdQrlfrQQAAAAAACqM8AEAAAAAAEiU8AEAAAAAAEiU8AEAAAAAAEiU8AEAAAAAAEhUTdoFQEREsViMXC4XuVxT2qVQApX6PGez2chkMmmXAQAAAABlT/hA6orFYkyceHnMmfNK2qVQImPHjk67hLVSX98QjY2jBBAAAAAAsAamXSJ1uVxO8EBFmD17VuRyubTLAAAAAICyZ+QDZeX0wR+LK596Ju0y6GKj9tg1stXVaZfRYc35Qlz+xFNplwEAAAAAFUP4QFmpqTYYpyfIVldXVPgAAAAAAHSO8IFUtS40XZmLD9Mzdeb1ms9XxZIl1dHU1BQtLYUurGr1LJQNAAAAQKkJH0iNhaapRJW4WLaFsgGoRPl8Pu0SykZHe5HP56N6LUeXrstjy1l3PS8qg9cfAD2dOW5IjYWmoTQslA1ApXnppZlx6qmnxssvv5R2KanraC9efnlmnHfeWTFrVud7ti6PLWfd9byoDF5/ANDNRj5MnDgx7rzzznjggQciIuLZZ5+N
s88+O1577bU47rjj4tvf/nbKFbK8YrGYdgnQaed8+sDI1lTGW2cun49LfvXb1q9Tnt6sXKagWhVTUwGUj3w+H7fccmMsXrw4br75xjjrrO/02E8Od7QX+Xw+pk69KZYsWRJTp97UqZ6ty2PLWXc9LyqD1x8AtKqMK2hr6dprr43a2tq47777YsMNN0y7HJZTKBRi0qQfpV0GdFq2pqZiwoflVeJ0UaW09db1cfLJX+/yAKLcQ5hSKYc+CJygfM2Y8WC89dZbERHx1lvz4pFHHox99z0g1ZrSMmNGx3oxY8aDMX/+2vVsXR5bzmbM6J7nRWWYMcPrDwAiunn4sHDhwthhhx1iq622SrsUllMsFuPKK38Qr702N+1SACIi4tVXZ8e5545KuwxKqFSB05qUQxBTDvRhmXLtRakCu4ULF8T9909vt+2+++6OXXfdPfr06dvlxy8nHe3FuvSsu/a7u54XlcHrDwCWqbjwYebMmXH55ZfHn/70p1i8eHH0798/jjnmmBg5cmS7+w0bNizeeOONiIi466674ve//31suOGGcdlll8VDDz0U//znP+NDH/pQHHDAATFmzJhYb731OnT84447LgYOHBgvvPBCzJ49O84///wYMWJE3HXXXTFlypSYM2dObLLJJvG5z30uTj755Kiuro7XX389DjjggPjhD38Y119/fcyaNSu22267uOyyy+JXv/pV/PKXv4yWlpb4zGc+E+eff37bH3b/8z//ExMnToyXX345+vfvH5/5zGfitNNOi2w2GxERCxYsiAkTJsQDDzwQ77zzTuy4447xzW9+Mz7+8Y8n2PHkNTU1xdy5c9IuA3qUC47+YmRrK+4tv8vlWlrigl/enHYZpEDgBJ1TX98QjY2jujyAmDbt9hUWV87nW2L69DviuONGruJR3VNHe7EuPeuu/e6u50Vl8PoDgGUq6krU4sWLY+TIkbHPPvvEzTffHNXV1TF16tT4/ve/H3vvvXe7+952221x2mmnxWabbRZjxoyJjTfeOBobG2PevHlx1VVXxYc//OH405/+FOeee25su+228eUvf7nDdUydOjUuu+yy2H777aNfv37xs5/9LC6//PIYPXp07LPPPvHnP/85LrzwwnjnnXdizJgxbY+74oor4nvf+1586EMfisbGxvjSl74U++67b9xwww3x+OOPxwUXXBBDhw6NYcOGxcMPPxxnnHFGnHPOOfEf//EfMXfu3Ljoooti9uzZMWHChMjn8zFy5Mhobm6Oyy67LDbeeOP4xS9+ESeeeGLcdNNN8bGPfSyptieuudnCt1Bq2dqaqKutTbuMsjbu1JMjq0fdXlNzc1ww6dq0y4CKM3v2rMjlclFXV9dlx3jppRfj6aefXGF7oVCIp556Ivbee0hsu+2gLjt+OeloL9alZ9213931vKgMXn8A0F7FhQ/HH398HHPMMbH++utHRMTpp58ekydPjhdffLHdfTfeeOOora2NXr16Rb9+/SIiYp999ok999wztt9++4iI2GKLLeLGG2+MmTNndqqOHXbYIQ499NCIaJ1C6Prrr49jjz02jjnmmIiIGDhwYCxYsCAuu+yyOP3009seN3LkyNhrr70iIuJTn/pU3HDDDXHhhRfGeuutFw0NDTFx4sR46aWXYtiwYXHNNdfE5z//+fjiF78YERFbbbVVjBs3Lv7rv/4rXn/99Zg1a1Y8//zzcffdd8egQa2/vIwbNy6effbZ+MlPfhITJkzo1DmVUm1tNu0SoMfJNbekXUJZyrUs68tYF6QBVqm+vqFt9G1XeeKJxyKTyUSxWFzhtkwmE3/84//1mIt2He3FuvSsu/a7u54XlcHrDwDaq6jwYeONN46jjz467rnnnvjLX/4Sc+fOjRdeeCEiWj9JsCZHH310PPDAA3HnnXfGnDlz4uWXX47XX389ttlmm07VsfXWW7d9/c9//jP+8Y9/xO67797uPnvttVc0NzfHK6+8Eh/+8IdXeFzv3r1j
k002aTfdU69evSKXax0V8Je//CWeeeaZuO2229puX/oLzKxZs2LmzJmx4YYbtgUPEa2/zOyxxx7xyCOPdOp8Sq2uri622mq
      "text/plain": [
       "<Figure size 1800x800 with 1 Axes>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "brands = ['mercedes_benz', 'fiat', 'volvo', 'alfa_romeo', 'lancia', 'trabant']\n",
    "\n",
    "df_price = df_used \\\n",
    "    .loc[df_used.brand.isin(brands), ['brand', 'price']] \\\n",
    "    .sort_values('brand', ascending=True)\n",
    "\n",
    "sns.set_theme(palette=\"hls\")\n",
    "\n",
    "# Initialize the matplotlib figure\n",
    "f, ax = plt.subplots(figsize=(18, 8))\n",
    "\n",
    "mkfunc = lambda x, pos: '%1.0fk' % (x * 1e-3)\n",
    "mkformatter = mpl.ticker.FuncFormatter(mkfunc)\n",
    "ax.xaxis.set_major_formatter(mkformatter)\n",
    "\n",
    "# Draw a nested boxplot to show bills by day and time\n",
    "sns.boxenplot(y=\"brand\", x=\"price\", data=df_price)\n",
    "\n",
    "ax.set(ylabel=\"\", xlim=[0, 100000], xticks=range(0, 105001, 5000),\n",
    "       xlabel=\"Distribution of prices per vehicle type and fuel type\")\n",
    "       \n",
    "sns.despine(offset=10, trim=True)"
   ]
  },
  {
   "attachments": {
    "Banks%20-%20market%20cap.png": {
     "image/png": "iVBORw0KGgoAAAANSUhEUgAABgwAAAQoCAYAAADWsvWXAAAMP2lDQ1BJQ0MgUHJvZmlsZQAASImVVwdYU8kWnluSkEBoAQSkhN4E6QSQEkILvSPYCEmAUEIMBBU7uqjg2sUCNnRVRMEKiB2xswg27IsFFWVdLNiVNymg677yvfm+ufPff87858y5M/feAUDtJEckykXVAcgTForjQgLoY1NS6aSnAAUYoAIHADjcAhEzJiYCwDLU/r28uwEQaXvVXqr1z/7/WjR4/AIuAEgMxOm8Am4exAcBwKu4InEhAEQpbzalUCTFsAItMQwQ4oVSnCnHVVKcLsd7ZTYJcSyIWwFQUuFwxJkAqHZAnl7EzYQaqv0QOwp5AiEAanSIffPy8nkQp0FsDW1EEEv1Gek/6GT+TTN9WJPDyRzG8rnIilKgoECUy5n2f6bjf5e8XMmQD0tYVbLEoXHSOcO83czJD5diFYj7hOlR0RBrQvxBwJPZQ4xSsiShiXJ71IBbwII5AzoQO/I4geEQG0AcLMyNilDw6RmCYDbEcIWgUwWF7ASIdSFeyC8IilfYbBbnxyl8oQ0ZYhZTwZ/niGV+pb7uS3ISmQr911l8tkIfUy3OSkiGmAKxeZEgKQpiVYgdCnLiwxU2Y4qzWFFDNmJJnDR+c4jj+MKQALk+VpQhDo5T2JflFQzNF9ucJWBHKfD+wqyEUHl+sFYuRxY/nAvWwRcyE4d0+AVjI4bmwuMHBsnnjj3jCxPjFTofRIUBcfKxOEWUG6Owx035uSFS3hRi14KieMVYPKkQLki5Pp4hKoxJkMeJF2dzwmLk8eDLQARggUBABxJY00E+yAaC9r7GPngn7wkGHCAGmYAP7BXM0IhkWY8QXuNBMfgTIj4oGB4XIOvlgyLIfx1m5Vd7kCHrLZKNyAFPIM4D4SAX3ktko4TD3pLAY8gI/uGdAysXxpsLq7T/3/ND7HeGCZkIBSMZ8khXG7IkBhEDiaHEYKINro/74t54BLz6w+qMM3DPoXl8tyc8IXQSHhKuE7oJtyYJSsQ/RRkJuqF+sCIX6T/mAreEmm54AO4D1aEyroPrA3vcFfph4n7QsxtkWYq4pVmh/6T9txn88DQUdmRHMkoeQfYnW/88UtVW1W1YRZrrH/MjjzV9ON+s4Z6f/bN+yD4PtuE/W2ILsQPYOewUdgE7ijUCOnYCa8LasGNSPLy6HstW15C3OFk8OVBH8A9/Q09WmskCx1rHXscv8r5C/lTpOxqw8kXTxILMrEI6E34R+HS2kOswiu7s6OwCgPT7In99vYmVfTcQnbbv3Lw/APA5MTg4eOQ7F3YCgH0ecPsf/s5ZM+CnQxmA84e5EnGRnMOlFwJ8S6jBnaYHjIAZsIbzcQbuwBv4gyAQBqJBAkgBE2H0WXCdi8EUMAPMBaWgHCwDq8F6sAlsBTvBHrAfNIKj4BQ4Cy6BDnAd3IGrpwe8AP3gHfiMIAgJoSI0RA8xRiwQO8QZYSC+SBASgcQhKUgakokIEQkyA5mHlCMrkPXIFqQG2YccRk4hF5BO5BbyAOlFXiOfUAxVQbVQQ9QSHY0yUCYajiagE9BMdDJajM5Hl6Br0Wp0N9qAnkIvodfRbvQFOoABTBnTwUwwe4yBsbBoLBXLwMTYLKwMq8CqsTqsGT7nq1g31od9xIk4Dafj9nAFh+KJOBefjM/CF+Pr8Z14A96KX8Uf4P34NwKVYECwI3gR2ISxhEzCFEIpoYKwnXCIcAbupR7COyKRqEO0InrAvZhCzCZOJy4mbiDWE08SO4mPiAMkEkmPZEfyIUWTOKRCUilpHWk36QTpCqmH9EFJWclYyVkpWClVSahUolShtEvpuNIVpadKn8nqZAuyFzmazCNPIy8lbyM3ky+Te8ifKRoUK4oPJYGSTZlLWUupo5yh3KW8UVZWNlX2VI5VFijPUV6rvFf5vPID5Y8qmiq2KiyV8SoSlSUqO1ROqtxSeUOlU
i2p/tRUaiF1CbWGepp6n/pBlabqoMpW5anOVq1UbVC9ovpSjaxmocZUm6hWrFahdkDtslqfOlndUp2lzlGfpV6pfli9S31Ag6bhpBGtkaexWGOXxgWNZ5okTUvNIE2e5nzNrZqnNR/RMJoZjUXj0ubRttHO0Hq0iFpWWmytbK1yrT1a7Vr92prartpJ2lO1K7WPaXfrYDqWOmydXJ2lOvt1buh8GmE4gjmCP2LRiLoRV0a81x2p66/L1y3Trde9rvtJj64XpJejt1yvUe+ePq5vqx+rP0V/o/4Z/b6RWiO9R3JHlo3cP/K2AWpgaxBnMN1gq0GbwYChkWGIochwneFpwz4jHSN/o2yjVUbHjXqNaca+xgLjVcYnjJ/TtelMei59Lb2V3m9iYBJqIjHZYtJu8tnUyjTRtMS03vSeGcWMYZZhtsqsxazf3Ng80nyGea35bQuyBcMiy2KNxTmL95ZWlsmWCywbLZ9Z6VqxrYqtaq3uWlOt/awnW1dbX7Mh2jBscmw22HTYorZutlm2lbaX7VA7dzuB3Qa7zlGEUZ6jhKOqR3XZq9gz7Yvsa+0fOOg4RDiUODQ6vBxtPjp19PLR50Z/c3RzzHXc5njHSdMpzKnEqdnptbOtM9e50vmaC9Ul2GW2S5PLK1c7V77rRtebbjS3SLcFbi1uX9093MXude69HuYeaR5VHl0MLUYMYzHjvCfBM8BztudRz49e7l6FXvu9/vK2987x3uX9bIzVGP6YbWMe+Zj6cHy2+HT70n3TfDf7dvuZ+HH8qv0e+pv58/y3+z9l2jCzmbuZLwMcA8QBhwLes7xYM1knA7HAkMCywPYgzaDEoPVB94NNgzODa4P7Q9xCpoecDCWEhocuD+1iG7K57Bp2f5hH2Myw1nCV8Pjw9eEPI2wjxBHNkWhkWOTKyLtRFlHCqMZoEM2OXhl9L8YqZnLMkVhibExsZeyTOKe4GXHn4mnxk+J3xb9LCEhYmnAn0TpRktiSpJY0Pqkm6X1yYPKK5O6xo8fOHHspRT9FkNKUSkpNSt2eOjAuaNzqcT3j3caXjr8xwWrC1AkXJupPzJ14bJLaJM6kA2mEtOS0XWlfONGcas5AOju9Kr2fy+Ku4b7g+fNW8Xr5PvwV/KcZPhkrMp5l+mSuzOzN8suqyOoTsATrBa+yQ7M3Zb/Pic7ZkTOYm5xbn6eUl5Z3WKgpzBG25hvlT83vFNmJSkXdk70mr57cLw4Xby9ACiYUNBVqwR/5Nom15BfJgyLfosqiD1OSphyYqjFVOLVtmu20RdOeFgcX/zYdn86d3jLDZMbcGQ9mMmdumYXMSp/VMtts9vzZPXNC5uycS5mbM/f3EseSFSVv5yXPa55vOH/O/Ee/hPxSW6paKi7tWuC9YNNCfKFgYfsil0XrFn0r45VdLHcsryj/spi7+OKvTr+u/XVwScaS9qXuSzcuIy4TLrux3G/5zhUaK4pXPFoZubJhFX1V2aq3qyetvlDhWrFpDWWNZE332oi1TevM1y1b92V91vrrlQGV9VUGVYuq3m/gbbiy0X9j3SbDTeWbPm0WbL65JWRLQ7VldcVW4tairU+2JW079xvjt5rt+tvLt3/dIdzRvTNuZ2uNR03NLoNdS2vRWklt7+7xuzv2BO5pqrOv21KvU1++F+yV7H2+L23fjf3h+1sOMA7UHbQ4WHWIdqisAWmY1tDfmNXY3ZTS1Hk47HBLs3fzoSMOR3YcNTlaeUz72NLjlOPzjw+eKD4xcFJ0su9U5qlHLZNa7pwee/paa2xr+5nwM+fPBp89fY557sR5n/NHL3hdOHyRcbHxkvulhja3tkO/u/1+qN29veGyx+WmDs+O5s4xncev+F05dTXw6tlr7GuXrkdd77yReONm1/iu7pu8m89u5d56dbvo9uc7c+4S7pbdU79Xcd/gfvUfNn/Ud7t3H3sQ+KDtYfzDO4+4j148Lnj8pWf+E+qTiqfGT2ueOT872hvc2/F83POeF6IXn
/tK/9T4s+ql9cuDf/n/1dY/tr/nlfjV4OvFb/Te7Hjr+rZlIGbg/ru8d5/fl33Q+7DzI+PjuU/Jn55+nvKF9GXtV5uvzd/C
    }
   },
   "cell_type": "markdown",
   "id": "f4e84bcf",
   "metadata": {},
   "source": [
    "## Exercise 3 - Data analysis (20 points) 📊\n",
    "\n",
    "The following graph represents the financial meltdown's impact on banks since the 2008 financial crisis began, and compares the market value of each bank as of 2007 - in blue - and 2009 - in green. The **main** purpose of the graph is to show the loss of each bank after the financial crisis and to enlight the little decline pre-versus-post meltdown of J.P. Morgan; the **secondary** purpose is to provide a sense of the relative sizes of the banks in terms of market value (e.g., J.P. Morgan is not a small bank).\n",
    "Is there a better solution to achieve these two goals? How would you compare both the remaining market value of each bank after the loss caused by the crisis and their decline?\n",
    "\n",
    "List all the problems that you detect in the design of this graph with respect to the quantive message the graph is supposed to deliver.\n",
    "\n",
    "Propose and implement a different graph that delivers effectively the message.\n",
    "\n",
    "Use the data in the ‘*market_value_decline’* dataset to populate the new graph.\n",
    "\n",
    "![Banks%20-%20market%20cap.png](attachment:Banks%20-%20market%20cap.png)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "139bb76f",
   "metadata": {},
   "source": [
    "<!-- List all the problems that you detect in the design of this graph with respect to the quantive message the graph is supposed to deliver. -->\n",
    "The given graph is not suited to show quantitative values because it relies on the areas of two-dimensional objects to convey the magnitude of the values it shows. Humans are not suited to understand at a glance the difference between the areas of two objects, instead being capable to grasp much one-dimensional differences, like length. This is why the use of a bar chart would have been more suited to show these quantitative values.\n",
    "\n",
    "The given graph does not provide a convenient way to show the market decline of each bank. As we are required to mentally compute the difference of market value before and after the stock market crash, it is hard to compare the banks to see which one has lost the least or the most value. \n",
    "\n",
    "<!-- How would you compare both the remaining market value of each bank after the loss caused by the crisis and their decline? -->\n",
    "As we want to convey that JP Morgan has one of the lowest relative market value differences, I would plot directly this difference as another bar chart.\n",
    "\n",
    "<!-- Is there a better solution to achieve these two goals? -->\n",
    "<!-- Propose and implement a different graph that delivers effectively the message. -->\n",
    "We can implement a better graph with a table lens bar chart showing both the relative market value decrease for each bank and the pre- and post-market collapse market values. The left side shows the former message (i.e. fulfilling the main purpose), while the right side shows the latter message (i.e. fulfilling the secondary purpose)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 17,
   "id": "eb956ed4",
   "metadata": {},
   "outputs": [],
   "source": [
    "df_m = pd.read_csv(\"./datasets/market_value_decline.csv\").rename(columns={\n",
    "       'Unnamed: 0': 'bank',\n",
    "       'market_value_2007': '2007',\n",
    "       'market_value_2009': '2009'\n",
    "})\n",
    "\n",
    "df_mkt = df_m\n",
    "df_mkt[\"diff\"] = 100 * (df_mkt['2009'] - df_mkt['2007']) / df_mkt['2007']\n",
    "df_mkt = df_mkt.sort_values(['diff'], ascending=False)\n",
    "\n",
    "# sort source DF according to new order by diff\n",
    "df_m = df_m.reindex(df_mkt.index)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 18,
   "id": "4a29684b",
   "metadata": {},
   "outputs": [],
   "source": [
    "df_mval = pd.melt(df_m.loc[:, ['bank', '2007', '2009']], id_vars=['bank'], var_name='year', value_name='market_value')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 19,
   "id": "d3d58d25",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAABUgAAAKrCAYAAAAj9WcAAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjcuMSwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/bCgiHAAAACXBIWXMAAA9hAAAPYQGoP6dpAADcF0lEQVR4nOzdeVhUdf/G8XtmAMEFEDVxQc19V9yX0tyeXFPTtDRMSdPUyiUVdy1F3HOpNNNKzcq1zEwrMy0t0zSXx3DfwNwF00Bgzvz+8Oc8EViIDAMz79d1demc+Z5zPp8vgx1uzmKy2Ww2AQAAAAAAAIAbMju7AAAAAAAAAABwFgJSAAAAAAAAAG6LgBQAAAAAAACA2yIgBQAAAAAAAOC2CEgBAAAAAAAAuC0CUgAAAAAAAABui4AUAAAAAAAAgNsiIAUAAAAAAADgtjycXQAgSTabTdeu3ZJh2JxdSoYwm00KCMjlMj25Wj8SPWUX9JQ9uFpPrtaP5Lo95cuX29llIA1c7TgvK3HF7+2shPl1LObXsZhfx2J+HcsZx3mcQYoswWQyyWw2ObuMDGM2m1yqJ1frR6Kn7IKesgdX68nV+pFctydkD6722ctKXPF7Oythfh2L+XUs5texmF/Hcsa8EpACAAAAAAAAcFtcYg8AwP8zmx3/W2CLxZzsT1fgaj25Wj+Sa/eE7IGvl2O44vd2VuKM+TUMG5frAoATEJACAKA74WhePx+ZPSyZsj9fX59M2U9mcrWeXK0fyTV7QtZntRp89hyM+XWszJzfJKtVsTFxhKQAkMkISAEA0P+fPeph0daBsxRzLMrZ5QD4F/5liqrJ/CHOLgNpYLGY1Sd8iY6eveDsUoAsrWyxQC0aFSqz2URACgCZjIAUAIC/iDkWpauHTjq7DABwKUfPXtD+4+ecXQYAAECqCEgBAAAAAACAVBiGIas16W/LTIqPtygh4basVs74vl8Wi4fM5qx1/2wCUgAAAAAAAOAvbDabbty4pri4m6m+f+WKWYZhZHJVrsPHJ7d8fQNkMjn2IblpRUAKAAAAAAAA/MXdcDR37rzy8sqRIsizWEycPZoONptNCQm3dfPmdUmSn18+J1d0BwEpAAAAAAAA8P8Mw2oPR3Pn9k11jIeHWUlJnEGaHl5eOSRJN29eV548ebPE5fbOrwAAAAAAAADIIqxWq6T/BXnIeHfn9u/3d3UWAlIAAAAAAADgb7LK/TFdUVabWwJSAAAAAAAAAG6LgBQAAAAAAACA2yIgBQAAAAAAAOC2CEgzwPr169WlSxdVr15dwcHB6tSpkz7++OMM3cf169e1atWqDN1makJCQhQWFubw/QAAAAAAAABZgYezC8juVq9ercmTJ2v06NGqWbOmbDabduzYoUmTJunKlSsaOHBghuxn2rRpioqK0lNPPZUh2wMAAAAAAABAQPrAVqxYoU6dOqlz5872ZSVLltTFixe1dOnSDAtIbTZbhmwHAAAAAAAAWcubb87RmjUrtX79ZuXOndu+/P3339VHHy3TZ59t1vnzUVqwYL5+/XWfJKlmzdoaOHCQihQpah9//PgxLVnyjg4c2Kc//vhDefMG6LHHmurFF19SjhzekqRHHqml0NAXtGPH9zp16qRCQnqqV68+mdtwFsMl9g/IbDZr3759io2NTbb8hRde0CeffCJJOn/+vAYPHqz69eurUqVKatSokaZPny7DMCRJa9euVYsWLex/Vq5cWU8++aR++eUXSVJYWJjWrVunn3/+WeXKlZMkxcbGasyYMXr00UdVqVIl1a9fX2PGjFFcXJwkadeuXapYsaK2bdumtm3bqnLlymrZsqW++eYbe40JCQkKDw9X/fr1VbNmzWQ13XXixAn16dNHwcHBeuSRRzR06FBdvnzZ/n5ISIjGjh2rp556SrVq1dL69eszeIYBAAAAAABcW9u27ZWQcFvfffdNsuWbNm1U06b/0aVLF9Wv3/O6fv2a
Ro+eoLCwsTp/Plr9+99ZJklXrlzRgAG9FR8fp1GjJmjGjLlq1uw/Wr36E61cmfxWkMuWvacWLR7XpElT1bhx00zrM6viDNIH1Lt3bw0ePFiNGjVS3bp1VatWLdWrV09VqlSRr6+vJOnFF19UgQIF9N577ylXrlzasmWLpkyZouDgYDVv3lyS9Pvvv+vjjz/W9OnTlStXLk2YMEFhYWH66quvNHr0aMXHx+vChQuaN2+epDuh6cWLFzV//nzly5dPe/fu1ahRo1S6dGn17NlTkmS1WjV9+nSNHj1ahQoV0qxZszRixAht375duXLl0qRJk/Ttt98qIiJChQsX1oIFC7Rnzx4FBQVJki5evKhu3bqpXbt2CgsLU1xcnObNm6euXbtqw4YNypkzpyRp1apVmj59usqVK6cCBQpk8lcAAAAAWV3z2pVUJqigs8tAOsXejNOl6zecXYbLK1ss0NklAHCi4sVLqHLlqtq0aaPatu0gSTp4cL+ios5qzJgJeu+9RfL29tYbb7ylXLnunGFaq1ZtdenSXitWLNOAAa/o5MnjKlOmnCZNmqqcOXNJkmrXrqs9e3Zp375fFBLS076/qlWD9fTTz2Z2m1kWAekDatmypQIDA7V06VLt2LFD27ZtkySVKFFC4eHhqlSpktq3b69WrVqpUKFCkqSePXtq0aJFOnLkiD0gTUxM1MSJE1WhQgVJUq9evTRgwABdvnxZDz30kLy9veXp6WkPIBs2bKjatWvbzygtWrSoli9frqNHjyarb9CgQapfv74kqX///tq8ebOOHj2qMmXKaO3atRo/frwaN24sSQoPD9dPP/1kX/ejjz5SYGCgxowZY1/2xhtvqF69etq0aZOefPJJSVKFChXUrl27jJ1YAAAAuATDsGrs8+2dXQYegGFYZTZbnF2GW0iyWmUY3F4NcFdt2z6hqVMn68KF3xUYWEgbN25QsWLFVblyVY0aNUzBwTWUI4e3kpKSJEk5c+ZS1arB2r17lySpTp16qlOnnpKSknTq1ElFR5/TiRPHdf36dfn6+iXbV5kyZTO9v6yMgDQDVK9eXdWrV5dhGIqMjNS2bdu0fPly9enTR19//bWeffZZbdq0SQcOHNCZM2d05MgRXblyJcXl7KVKlbL/PU+ePJLuBKep6datm7799lutW7dOp0+f1vHjxxUVFaWSJUsmG/fX13fvYZGYmKhTp04pMTFRVapUsb+fI0cOVaxY0f768OHDOnbsmIKDg5Nt8/bt2zpx4oT9dfHixdM0TwAAAHA/ZrNFW94cpZjok84uBengX6Skmg0I140bcbJajX9fwYVYLGb5+vpkau+GYSMgBdxY06b/0Zw5s7Rp0xd65pkQbd36tbp37ylJio2N0ZYtX2vLlq9TrOfvn1eSZBiGFi58U2vXrlJc3J966KGCqlixknLkyJHi2TY+Pj4O7yc7ISB9ABcuXNDChQvVt29fBQYGymw2q2LFiqpYsaKaN2+utm3bavv27Vq2bJni4+PVsmVLdezYUVWrVlX37t1TbM/LyyvFstQezmQYhvr27atjx46pbdu2at26tSpVqqSxY8emeZsmkynV7Xt4/O8jYRiG6tWrp/Hjx6fYxt0AV5K8vb1TvA8AAADcFRN9UldORzq7DDwAq9VQUpJ7BaR3uXPvADJXzpw51aRJM23d+o1KlSqtuLg4tWrVRtKdHKZmzTp65pmUl8VbLHfO8l++/H198smHGjZslBo3bmo/Ua5Pnx6Z10Q2RUD6ALy8vLRq1SoVKlRIL7zwQrL37t5/NDo6Wv/973+1Y8cO5c+fX5IUExOjq1ev3teT6e8GmpL022+/afv27Vq5cqWqVasm6c5ZoWfPnrXfP/TfPPzww8qRI4f27t1rv6w/KSlJkZGRqlu3riSpTJky2rhxowoVKmQPWmNiYjRixAj16tVL9erVS3P9AAAAAAAA+Gdt27bXxo2f65NPVqhWrbrKn//OrRarV6+h06dPqXTpsvaT
22w2myZOHKOgoGIqU6acDhz4VQ8/XFJt2jxh397ly5d04sQJVahQMdX94Q6eYv8AAgIC1Lt3b82ZM0ezZ8/Wb7/9pnPnzmn
      "text/plain": [
       "<Figure size 1500x800 with 2 Axes>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "sns.set_theme(palette=\"hls\")\n",
    "\n",
    "# Initialize the matplotlib figure\n",
    "f, (ax1, ax2) = plt.subplots(1, 2, figsize=(15, 8), sharey=True)\n",
    "\n",
    "ax2.xaxis.set_major_formatter(mpl.ticker.FuncFormatter(lambda x, pos: '%.0fB' % (x)))\n",
    "ax1.xaxis.set_major_formatter(mpl.ticker.FuncFormatter(lambda x, pos: '%.0f%%' % (x)))\n",
    "\n",
    "sns.barplot(y=\"bank\", x=\"diff\", data=df_mkt, ax=ax1, color=sns.xkcd_rgb[\"purplish red\"])\n",
    "sns.barplot(\n",
    "    data=df_mval, ax=ax2,\n",
    "    x=\"market_value\", y=\"bank\", hue=\"year\",\n",
    "    palette=[sns.xkcd_rgb[\"prussian blue\"], sns.xkcd_rgb[\"sienna\"]]\n",
    ")\n",
    "\n",
    "# Add a legend and informative axis label\n",
    "ax2.set(ylabel=\"Institution\", xlim=[0, 300],\n",
    "       xlabel=\"Market value\")\n",
    "ax1.set(ylabel=\"Institution\", xlim=[-1, 0], xticks=range(-100, 1, 10),\n",
    "       xlabel=\"Market value decrease\")\n",
    "sns.despine(left=True, bottom=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "06e7f954",
   "metadata": {},
   "source": [
    "## Exercise 4 - Data visualisation and exploration (30 points) 🔍\n",
    "\n",
    "You'll need to work with the *'airports'* and *‘airports-delays’* datasets. Examine the datasets and perform cleansing if needed, before performing the exercise.\n",
    "\n",
    "1. Create a dataframe that provides, for each country, <del>the mean of flights delayed</del>. Display these information by binning the flights delayed in 6 bins. The resulting dataframe should have the countries as rows and the 6 bins as columns. For this exercise you cannot use pivot_table but only groupby. \n",
    "\n",
    "<span style=\"color: red\">According to answer of question to professor:</span>\n",
    "> Bin by delay_duration value, compute delay mean per-bin per-country \n",
    "\n",
    "2. Create a dataframe from ‘a*irports-delays’* which shows for each continent and country:\n",
    "    1. max, min and mean of ‘**delay_duration**’;\n",
    "    2. mean, sum of ‘**flights_cancelled**’;\n",
    "    3. mean, sum of ‘**flights_delayed**’;\n",
    "    4. mean, sum of ‘**flights_planned**.\n",
    "\n",
    "3. Show a representation of the relationship between the number of flights planned and the number of flights delayed for each continent. It should be possible to see the relationship and the presence of outliers for each continent. What do you observe? You may want to display the median of the values for a better explaination."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 20,
   "id": "b4fde7e4",
   "metadata": {},
   "outputs": [],
   "source": [
    "df_del = pd.read_csv(\"./datasets/airports-delays.csv\", index_col='ID', sep=\";\", na_values=['\\\\N']) \\\n",
    "    .dropna(subset=['tz_database_timezone'])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 21,
   "id": "25391739",
   "metadata": {},
   "outputs": [],
   "source": [
    "def tz_to_continent(tz: str) -> str:\n",
    "    tz_mappings = {\n",
    "        'Asia/': 'Asia',\n",
    "        'Africa/': 'Africa',\n",
    "        'America/': 'America',\n",
    "        'Europe/': 'Europe',\n",
    "        'Australia/': 'Oceania',\n",
    "        'Pacific/': 'Oceania',\n",
    "        'Antarctica/': 'Antarctica',\n",
    "        'Arctic/Longyearbyen': 'Europe',\n",
    "        'Atlantic/Azores': 'Europe',\n",
    "        'Atlantic/Bermuda': 'America',\n",
    "        'Atlantic/Canary': 'Africa',\n",
    "        'Atlantic/Cape_Verde': 'Africa',\n",
    "        'Atlantic/Faeroe': 'Europe',\n",
    "        'Atlantic/Reykjavik': 'Europe',\n",
    "        'Atlantic/St_Helena': 'Africa',\n",
    "        'Atlantic/Stanley': 'America',\n",
    "        'Indian/Antananarivo': 'Africa',\n",
    "        'Indian/Chagos': 'Asia',\n",
    "        'Indian/Christmas': 'Oceania',\n",
    "        'Indian/Cocos': 'Oceania',\n",
    "        'Indian/Comoro': 'Africa',\n",
    "        'Indian/Mahe': 'Africa',\n",
    "        'Indian/Maldives': 'Asia',\n",
    "        'Indian/Mauritius': 'Africa',\n",
    "        'Indian/Mayotte': 'Africa',\n",
    "        'Indian/Reunion': 'Africa',\n",
    "    }\n",
    "    if type(tz) != str:\n",
    "        raise ValueError(\"tz not str\")\n",
    "    to_return = [v for (k, v) in tz_mappings.items() if tz.startswith(k)]\n",
    "    if len(to_return) == 0:\n",
    "        raise ValueError(f\"'{tz}' no continent found\")\n",
    "    return to_return[0]\n",
    "\n",
    "df_del[\"continent\"] = df_del[\"tz_database_timezone\"].apply(tz_to_continent)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 22,
   "id": "f8906707",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th>delay_duration_bin</th>\n",
       "      <th>(15.999, 30.0]</th>\n",
       "      <th>(30.0, 35.0]</th>\n",
       "      <th>(35.0, 41.0]</th>\n",
       "      <th>(41.0, 47.0]</th>\n",
       "      <th>(47.0, 59.0]</th>\n",
       "      <th>(59.0, 850.0]</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>country</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>Afghanistan</th>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.00</td>\n",
       "      <td>44.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>60.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Albania</th>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.00</td>\n",
       "      <td>0.0</td>\n",
       "      <td>56.000000</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Algeria</th>\n",
       "      <td>26.5</td>\n",
       "      <td>33.857143</td>\n",
       "      <td>38.75</td>\n",
       "      <td>43.0</td>\n",
       "      <td>51.200000</td>\n",
       "      <td>73.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>American Samoa</th>\n",
       "      <td>0.0</td>\n",
       "      <td>0.000000</td>\n",
       "      <td>0.00</td>\n",
       "      <td>43.0</td>\n",
       "      <td>48.000000</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Angola</th>\n",
       "      <td>28.0</td>\n",
       "      <td>35.000000</td>\n",
       "      <td>36.00</td>\n",
       "      <td>45.0</td>\n",
       "      <td>51.666667</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "delay_duration_bin  (15.999, 30.0]  (30.0, 35.0]  (35.0, 41.0]  (41.0, 47.0]  \\\n",
       "country                                                                        \n",
       "Afghanistan                    0.0      0.000000          0.00          44.0   \n",
       "Albania                        0.0      0.000000          0.00           0.0   \n",
       "Algeria                       26.5     33.857143         38.75          43.0   \n",
       "American Samoa                 0.0      0.000000          0.00          43.0   \n",
       "Angola                        28.0     35.000000         36.00          45.0   \n",
       "\n",
       "delay_duration_bin  (47.0, 59.0]  (59.0, 850.0]  \n",
       "country                                          \n",
       "Afghanistan             0.000000           60.0  \n",
       "Albania                56.000000            0.0  \n",
       "Algeria                51.200000           73.0  \n",
       "American Samoa         48.000000            0.0  \n",
       "Angola                 51.666667            0.0  "
      ]
     },
     "execution_count": 22,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_4_1 = df_del.copy()\n",
    "\n",
    "# The following statements bins the data by the value of delay_duration.\n",
    "# The bins are chosen as equally-spaced percentile values of the data. This is done to \n",
    "# better distribute the data between bins, as it is quite skewed towards low values\n",
    "df_4_1[\"delay_duration_bin\"] = pd.qcut(df_del.delay_duration, 6)\n",
    "\n",
    "# The dataframe will contain countries as row indices, the 6 bins as columns and values\n",
    "# corresponding to the mean delay_duration per country, per bin. When no delay_duration \n",
    "# falls in a particular bin for some country, that bin has a value of 0\n",
    "df_4_1 = df_4_1.loc[:, ['country', 'delay_duration', 'delay_duration_bin']] \\\n",
    "    .groupby(['country', 'delay_duration_bin']) \\\n",
    "    .mean() \\\n",
    "    .fillna(0) \\\n",
    "    .reset_index() \\\n",
    "    .pivot(index='country', columns='delay_duration_bin', values='delay_duration') \n",
    "\n",
    "df_4_1.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 23,
   "id": "a677ce07",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th>dur_min</th>\n",
       "      <th>dur_mean</th>\n",
       "      <th>dur_max</th>\n",
       "      <th>cancelled_sum</th>\n",
       "      <th>cancelled_mean</th>\n",
       "      <th>delayed_sum</th>\n",
       "      <th>delayed_mean</th>\n",
       "      <th>planned_sum</th>\n",
       "      <th>planned_mean</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>continent</th>\n",
       "      <th>country</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th rowspan=\"5\" valign=\"top\">Africa</th>\n",
       "      <th>Algeria</th>\n",
       "      <td>26.0</td>\n",
       "      <td>43.739130</td>\n",
       "      <td>82.0</td>\n",
       "      <td>6</td>\n",
       "      <td>0.26087</td>\n",
       "      <td>360</td>\n",
       "      <td>15.652174</td>\n",
       "      <td>1864</td>\n",
       "      <td>81.043478</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Angola</th>\n",
       "      <td>28.0</td>\n",
       "      <td>42.714286</td>\n",
       "      <td>53.0</td>\n",
       "      <td>9</td>\n",
       "      <td>1.12500</td>\n",
       "      <td>97</td>\n",
       "      <td>12.125000</td>\n",
       "      <td>472</td>\n",
       "      <td>59.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Benin</th>\n",
       "      <td>69.0</td>\n",
       "      <td>69.000000</td>\n",
       "      <td>69.0</td>\n",
       "      <td>0</td>\n",
       "      <td>0.00000</td>\n",
       "      <td>7</td>\n",
       "      <td>7.000000</td>\n",
       "      <td>28</td>\n",
       "      <td>28.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Burkina Faso</th>\n",
       "      <td>35.0</td>\n",
       "      <td>35.000000</td>\n",
       "      <td>35.0</td>\n",
       "      <td>0</td>\n",
       "      <td>0.00000</td>\n",
       "      <td>18</td>\n",
       "      <td>18.000000</td>\n",
       "      <td>65</td>\n",
       "      <td>65.000000</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Cameroon</th>\n",
       "      <td>28.0</td>\n",
       "      <td>51.250000</td>\n",
       "      <td>83.0</td>\n",
       "      <td>3</td>\n",
       "      <td>0.75000</td>\n",
       "      <td>61</td>\n",
       "      <td>15.250000</td>\n",
       "      <td>339</td>\n",
       "      <td>84.750000</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "</div>"
      ],
      "text/plain": [
       "                        dur_min   dur_mean  dur_max  cancelled_sum  \\\n",
       "continent country                                                    \n",
       "Africa    Algeria          26.0  43.739130     82.0              6   \n",
       "          Angola           28.0  42.714286     53.0              9   \n",
       "          Benin            69.0  69.000000     69.0              0   \n",
       "          Burkina Faso     35.0  35.000000     35.0              0   \n",
       "          Cameroon         28.0  51.250000     83.0              3   \n",
       "\n",
       "                        cancelled_mean  delayed_sum  delayed_mean  \\\n",
       "continent country                                                   \n",
       "Africa    Algeria              0.26087          360     15.652174   \n",
       "          Angola               1.12500           97     12.125000   \n",
       "          Benin                0.00000            7      7.000000   \n",
       "          Burkina Faso         0.00000           18     18.000000   \n",
       "          Cameroon             0.75000           61     15.250000   \n",
       "\n",
       "                        planned_sum  planned_mean  \n",
       "continent country                                  \n",
       "Africa    Algeria              1864     81.043478  \n",
       "          Angola                472     59.000000  \n",
       "          Benin                  28     28.000000  \n",
       "          Burkina Faso           65     65.000000  \n",
       "          Cameroon              339     84.750000  "
      ]
     },
     "execution_count": 23,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# 4.2\n",
    "df_4_2 = df_del.loc[:, ['country', 'continent', 'delay_duration', 'flights_cancelled', 'flights_delayed', 'flights_planned']] \\\n",
    "    .sort_values(['continent', 'country']) \\\n",
    "    .groupby(['continent', 'country']) \\\n",
    "    .agg(dur_min=('delay_duration', 'min'), \\\n",
    "        dur_mean=('delay_duration', 'mean'), \\\n",
    "        dur_max=('delay_duration', 'max'), \\\n",
    "        cancelled_sum=('flights_cancelled', 'sum'), \\\n",
    "        cancelled_mean=('flights_cancelled', 'mean'), \\\n",
    "        delayed_sum=('flights_delayed', 'sum'), \\\n",
    "        delayed_mean=('flights_delayed', 'mean'), \\\n",
    "        planned_sum=('flights_planned', 'sum'), \\\n",
    "        planned_mean=('flights_planned', 'mean'))\n",
    "    \n",
    "df_4_2.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 24,
   "id": "a29b8c2f",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "image/png": "iVBORw0KGgoAAAANSUhEUgAABR4AAAK5CAYAAADZ4TKfAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjcuMSwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/bCgiHAAAACXBIWXMAAA9hAAAPYQGoP6dpAACtO0lEQVR4nOzdeXhTZfr/8c8JTSiFttCKoqwFZEdRXKAVZJCRgqDAuAAqiAOI+zIyiAKz6Iwg+h0RdVxwUHDBnwjIInTGDZAijjsoWChldXCgjF1oISk5vz8wsWnSNGnTJmner+vqZXLOs9zPc04TvPuccwzTNE0BAAAAAAAAQAhZwh0AAAAAAAAAgPqHxCMAAAAAAACAkCPxCAAAAAAAACDkSDwCAAAAAAAACDkSjwAAAAAAAABCjsQjAAAAAAAAgJAj8QgAAAAAAAAg5Eg8AgAAAAAAAAi5uHAHgLr3+eef66yz2ikuzhbuUABJUlmZXT/8sIfzEhGF8xKRiPMSkYpzE5GI8xKRiPMSkaiszK4zz0ytlbZZ8RijnE5nuEMA3FznI+clIgnnJSIR5yUiFecmIhHnJSIR5yUiUW2ejyQeAQAAAAAAAIQciUcAAAAAAAAAIUfiEQAAAAAAAEDIkXgEAAAAAAAAEHIkHgEAAAAAAACEXFy4AwAAAAAAAED0cTqdOnmyLNxhwI8GDeJksYRv3SGJRwAAAAAAAATMNE0VFh7V8ePHwh0KAhAf31hJSSkyDKPO+ybxCA+macrhcFRZRlKNTlir1RqWEx4AAAAAANRMYeFRnThRotNPP13x8Y34//sIZZqmjh8v1eHDh1VYKCUnp9Z5DCQe4cHhcGju3IdrvZ+pU2fKZrPVej8AAAAAACB0nM6TOn78mE4//XQ1a5YS7nBQhUaNGkmS/vvf/yoxsVmdX3bNw2UAAAAAAAAQkJMnT0qS4uMbhTkSBMp1rMJxP05WPKJS1w+bobg4z1WJjjK7Xl/9iCRp7LAZssYFvmqxrMyu136uCwAAAAAAoheXV0ePcB4rEo+oVFyczW9i0VrFfgAAAAAAAMQuLrUGAAAAAAAAoozr4b+RjMQjAAAAAAAAQu6mm8brppvGe20vLi7W2LGjdd555+qDD96XJF1++SA99NCDNe5zxYrl6tGjmw4ePFijdsrKyvTQQw/qoosu0MUXX6hPP93is9zixYt06aX91Lv3eXr++ec8xnHw4EH16NFNK1YsD7jfQOt8+OEHevDB6YEPKEy41BoAAAAAAAB14tixY7rllsn6/vvv9dRT89WvX39J0rx5T6lx4yZhju4XH3/8sd55Z4WmTLlVffr0VbduXb3KFBcXa+7cx3TppZdq/PgJatmypd5+e6l7f/PmzfXaa2+odevWIY/vlVdeCXmbtYHEIwAAAAAAAGrdL0nHHZo//xmlp6e793Xt2i2MkXkrKPhJkjRixEi1atXKZ5nCwgI5nU4NHHiZLrjgAq/9NptN5557bm2GGfG41BoAAAAAAAC1qqTkmKZMuUXff79Dzz77nEfSUZLPS5Szstbp3nvv0UUXXaD09D76wx9mqaSkxF3H6XTq+eef06BBA3XBBefrrrvuUEFBQZWxnDx5UkuWvKGRI69S797nadCggfrb3/5PJ06ckCQ99NCD7lgyMy/3ebn4ihXLdfnlv5YkzZw5Qz16eCdOfV02/dVXX2n8+Bt14YW9NWjQQC1evFgTJ97sdZn54cOHdd99v4z9j3/8g0pKjkk6dQn7Z5/9W5999m/16NFNn376aZVjDhcSjzEoGm4+WhtM04zZsQMAAAAAEC4lJSW69dYp2r79Oz3//Au66KKLAqr3pz/9UWeddZaeemq+Jky4WcuWva3nn3/Ovf+JJx7X3//+rH7zm6s1b95TSk5uqr/97f8Canf27Ed12WWDNH/+Mxo79nq9/vpruvPOO2Sapm65ZYpuuWWKJOnJJ5/SzJkzvdro3/9SPfnk
U5KkW26Zotdee6PKfnfv3q2JE2+WJM2d+7huv/0OLVjwgr744guvsk8/PV8tWpyp+fOf1rhx47V06Vt65plnJEkzZ85U165d1bVrV7322hvq1i2yVouWx6XWMSgrK0vjx08Kdxh1yjRNLVq0QJI0btxEGYYR5ogAAAAAAKj/SktLddttU9zJtfIrFqvSv/+lmjr195KkPn36avPmzdqwYb3uvfc+FRYW6rXXXtX48Tfp1ltvkyRlZFyiw4f/q48//rjSNnNzd2nZsrd1zz33auLEU7mR9PR0NW/eXNOnP6CNGzeof/9L3fdl7Nq1q1q2bOnVTkpKirp2PXXfx9atWwd0SfWLL76gJk2a6LnnXlCjRo0kSWlp7XXDDWO9yv7615fr97+fJkm6+OI+ys7e5H7ATYcOHd33w4z0S7lZ8RiDDh8+LIfDEe4w6pTD4dCBA/t04MC+mBs7AAAAAADh8u2327Rr1y698spitWnTRg8++KCOHDkcUN1evXp5vD/jjDNUWloqSfrmm69VVlamSy8d4FFm8OBMv23++9+fSZKGDh3qsX3IkKFq0KCB/v3vfwcUW3V8+ukW9evX3510lE6N0Vdis3fv3h7vW7ZspaKiolqLrbaQeAQAAAAAAECtSEpK0ksvLdT555+vRx+drcLCAk2fPj2gW6HFx8d7vLdYLHI6nZLkvpdjs2bNPMo0b97cb5uuh8acdppnubi4ODVt2rRWk3tHjx5VSkqK1/bU1FSvbeWTk5Ln2KMJiUcAAAAAAADUik6dOqtz586SpHPOOVcTJ07S5s3ZWrjwHzVqt2nTUwnH/Px8j+0//fST33rJyU0lyWvVpcPh0E8//aSmTZvWKC5/zjijhVe80qmEZH1F4hEAAAAAAAB1YsqUW9WjR0899dQ8bd36TbXbOe+8XoqPj9c//7nOY/tHH33kt96FF14gSXr33Xc9tq9du1YnT57U+eefX+2YqnLBBRfo4483up+eLUnbt3+nAwcOBN1WgwbRkdLj4TIAAAAAAACoE3FxcZo9e46uueY3mjp1qpYufVtNmjQJup2EhMa65ZYpmj//KTVqlKCLLrpYGzdu0Pr1H/mt16FDR1111Qg9/fR8HT9+XL1799aOHTv07LPP6KKLLtYll/Sr5siqNnnyZK1bt1ZTptyi8eNvUlFRoebPf0oWiyXoh+AmJibp66+/0pYtn6hLl65KTk6upahrhsQjYtpHH72n7OwN6tKlmw4ePKDBg4dJkrKyVstqtSo//4hSU09TSUmJDMNQ27bttGPHd+rSpZtyc3fJ4bArPb2/BgwY5NFuTs4OZWWt9mivZ89e2rr1Kw0ePEydOnWp03GWj6emfefk7NDq1ctlGIauuGJEnY8FAAAAABDd2rVrp9/9bqoeeeTP+vOf/6THHptbrXYmTZqshIQELV68WIsXL1KvXufp/vun6uGH/+y33p///LDatGmj5cuXa8GCF3XGGWfohhtu1JQpt8piqb2VhG3atNXzz7+gJ554XPfdd49SUlI0adJkvfDC80pISAiqrbFjx+rbb7dpypRb9Mgjf9EVVwyrpahrxjADuZsn6pV77rlHd989VU2aJHnts9vtmjv3YUnS+BF/ljXO5rHfUWbXKytmVbrfn/J1p06dKZst8Lo1VX5crr5LSo7pySfneNzQtkmTRElScXHgN5M1DEP33DNNCQmNJUkOh11///s8FRUVerRnGIZM01RiYpJuvfVuWa11M/7y8dS0b4fDrmeffdI9P02aJOq22+6p8Vjs9uM6dGifWrRoI5stvuoKQB3gvEQk4rxEpOLcRCTivEQkqg/npcNh19Gjh9S2bTuvh7/Av08+2Syr1arevS9wbyssLFT//pfo/vun6oYbbqyVfo8fP669e/coJaWFz/9/t9uPq2VL/w/lqa7ouCAcqAVvvfW611O0iouLgko6SpJpmlq69A33+02bNrifglW+PVdfRUVFys7eWJPQg1I+npr2vWnT
Bo/5KS6u27EAAAAAABCtvvvuO02ePEmLFy/SZ599pvfee0933HGbkpKSNHToFeEOr1ZwqXWMcjjsstvtXtt9basNddVPZf3
      "text/plain": [
       "<Figure size 1500x800 with 1 Axes>"
      ]
     },
     "metadata": {},
     "output_type": "display_data"
    }
   ],
   "source": [
    "df_4_3 = df_del.loc[:, ['continent', 'flights_planned', 'flights_delayed']] \\\n",
    "    .rename(columns={'flights_planned': 'planned', 'flights_delayed': 'delayed'}) \\\n",
    "    .melt(id_vars=['continent'], value_vars=['planned', 'delayed'], var_name=\"Kind of flight\", \\\n",
    "        value_name=\"# of flights\") \\\n",
    "    .sort_values('continent')\n",
    "\n",
    "f, ax1 = plt.subplots(figsize=(15, 8))\n",
    "\n",
    "sns.set_theme(style=\"ticks\", palette=\"pastel\")\n",
    "\n",
    "# Draw a nested boxplot to show bills by day and time\n",
    "ax1.set(xlim=[0, 700])\n",
    "sns.boxplot(x=\"# of flights\", y=\"continent\",\n",
    "            hue=\"Kind of flight\", palette=[\"m\", \"g\"],\n",
    "            data=df_4_3)\n",
    "sns.despine(offset=10, trim=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "04aa4de5",
   "metadata": {},
   "source": [
    "I observe that in all continents there is a significant higher number of planned flights than the number of delayed flights. This can be determined by the inter-quartile range positions of both series' boxplots with respect to each other."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e2f9c1aa",
   "metadata": {},
   "source": [
    "## Exercise 5 - Geospatial data analysis (35 points) 🌍\n",
    "\n",
    "Use the ‘*airports’*, ‘*routes’*, ’*countries*’ and ’*europe.geojson*’ files. Create an interactive map representation - related to European countries only - such that, when a country is selected the map shows the number of flights left from the country selected and directed to each of the other countries, if flights with those destinations exist. The information should be represented as a choropleth map, essentially dynamically creating it when a country is selected.\n",
    "\n",
    "**Hints**:\n",
    "1. If `A` is a GeoDataFrame and `B` a DataFrame, the result of `A.merge(B,..)` is a GeoDataFrame, whereas the result of `B.merge(A,..)` is a DataFrame. The function `to_json()` on a DataFrame with a geometry column does **not** work.\n",
    "2. When updating the map, to access the color mapper you can use the following method:\n",
    "```\n",
    "color_mapper = p.select_one(LinearColorMapper)\n",
    "```\n",
    "where `p` is the figure.\n",
    "\n",
    "3. You can discard Guernsey and Gibraltar that are not present in the geojson.\n",
    "\n",
    "\n",
    "<aside>\n",
    "💡 Note that you have all the information you need in the files mentioned above. \n",
    "</aside>"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 122,
   "id": "5d1fad2a",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th>country</th>\n",
       "      <th>Albania</th>\n",
       "      <th>Andorra</th>\n",
       "      <th>Austria</th>\n",
       "      <th>Belarus</th>\n",
       "      <th>Belgium</th>\n",
       "      <th>Bosnia and Herzegovina</th>\n",
       "      <th>Bulgaria</th>\n",
       "      <th>Croatia</th>\n",
       "      <th>Cyprus</th>\n",
       "      <th>Czech Republic</th>\n",
       "      <th>...</th>\n",
       "      <th>San Marino</th>\n",
       "      <th>Serbia</th>\n",
       "      <th>Slovakia</th>\n",
       "      <th>Slovenia</th>\n",
       "      <th>Spain</th>\n",
       "      <th>Sweden</th>\n",
       "      <th>Switzerland</th>\n",
       "      <th>Ukraine</th>\n",
       "      <th>United Kingdom</th>\n",
       "      <th>Vatican City</th>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>country_dest</th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "      <th></th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>Albania</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Andorra</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Austria</th>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>15</td>\n",
       "      <td>2</td>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "      <td>3</td>\n",
       "      <td>6</td>\n",
       "      <td>3</td>\n",
       "      <td>1</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>3</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>40</td>\n",
       "      <td>4</td>\n",
       "      <td>11</td>\n",
       "      <td>10</td>\n",
       "      <td>11</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Belarus</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>2</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>1</td>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>Belgium</th>\n",
       "      <td>0</td>\n",
       "      <td>0</td>\n",
       "      <td>2</td>\n",
       "      <td>0</td>\n",
       "      <td>1</td>\n",
       "      <td>0</td>\n",
       "      <td>4</td>\n",
       "      <td>5</td>\n",
       "      <td>2</td>\n",
       "      <td>4</td>\n",
       "      <td>...</td>\n",
       "      <td>0</td>\n",
       "      <td>2</td>\n",
       "      <td>1</td>\n",
       "      <td>3</td>\n",
       "      <td>60</td>\n",
       "      <td>6</td>\n",
       "      <td>6</td>\n",
       "      <td>2</td>\n",
       "      <td>17</td>\n",
       "      <td>0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>5 rows × 47 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "country       Albania  Andorra  Austria  Belarus  Belgium  \\\n",
       "country_dest                                                \n",
       "Albania             0        0        1        0        0   \n",
       "Andorra             0        0        0        0        0   \n",
       "Austria             1        0       15        2        2   \n",
       "Belarus             0        0        2        0        0   \n",
       "Belgium             0        0        2        0        1   \n",
       "\n",
       "country       Bosnia and Herzegovina  Bulgaria  Croatia  Cyprus  \\\n",
       "country_dest                                                      \n",
       "Albania                            0         0        0       0   \n",
       "Andorra                            0         0        0       0   \n",
       "Austria                            1         3        6       3   \n",
       "Belarus                            0         0        0       1   \n",
       "Belgium                            0         4        5       2   \n",
       "\n",
       "country       Czech Republic  ...  San Marino  Serbia  Slovakia  Slovenia  \\\n",
       "country_dest                  ...                                           \n",
       "Albania                    0  ...           0       0         0         1   \n",
       "Andorra                    0  ...           0       0         0         0   \n",
       "Austria                    1  ...           0       3         1         2   \n",
       "Belarus                    2  ...           0       0         0         0   \n",
       "Belgium                    4  ...           0       2         1         3   \n",
       "\n",
       "country       Spain  Sweden  Switzerland  Ukraine  United Kingdom  \\\n",
       "country_dest                                                        \n",
       "Albania           0       0            0        0               1   \n",
       "Andorra           0       0            0        0               0   \n",
       "Austria          40       4           11       10              11   \n",
       "Belarus           1       1            1        2               1   \n",
       "Belgium          60       6            6        2              17   \n",
       "\n",
       "country       Vatican City  \n",
       "country_dest                \n",
       "Albania                  0  \n",
       "Andorra                  0  \n",
       "Austria                  0  \n",
       "Belarus                  0  \n",
       "Belgium                  0  \n",
       "\n",
       "[5 rows x 47 columns]"
      ]
     },
     "execution_count": 122,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "df_air = pd.read_csv(\"./datasets/airports.csv\", index_col='ID', na_values=['\\\\N'], dtype={'ID': pd.Int64Dtype()}) \\\n",
    "    .drop(columns=['latitude', 'longitude'])\n",
    "\n",
    "df_routes = pd.read_csv(\"./datasets/routes.csv\", na_values=['\\\\N'], sep=\";\") \\\n",
    "    .rename(lambda x: x.strip(), axis=1)\n",
    "\n",
    "df_countries = pd.read_csv(\"./datasets/countries.csv\") \\\n",
    "    .rename(columns={'name': 'country'}).drop(columns=['continent'])\n",
    "\n",
    "df_countries.loc[df_countries.country == 'Faroe Is.', 'country'] = 'Faroe Islands'\n",
    "\n",
    "df_id_country = df_air.join(df_countries.set_index('country'), on='country', how='right', lsuffix='_air') \\\n",
    "    .reset_index(drop=True) \\\n",
    "    .loc[:, ['IATA', 'country']] \\\n",
    "    .set_index('IATA')\n",
    "\n",
    "# Right join twice with source airport country and destination airport country\n",
    "# A right join assures we include all countries in the final dataframe\n",
    "df_routes_count = df_routes \\\n",
    "    .loc[:, ['source_airport', 'destination_airport']] \\\n",
    "    .join(df_id_country, how='right', on='source_airport') \\\n",
    "    .join(df_id_country, how='right', on='destination_airport', rsuffix='_dest')\n",
    "\n",
    "# Count only a pair of notna source and destination airport as a valid route\n",
    "# When this is not a case the row is an artifact of the right join. We assign 0\n",
    "# as a value so that in the final sum the value will still appear to include \n",
    "# no-flight countries, albeit with a total number of routes to 0\n",
    "df_routes_count['# routes'] = 0\n",
    "df_routes_count.loc[df_routes_count.source_airport.notna() & \\\n",
    "                    df_routes_count.destination_airport.notna(), '# routes'] = 1\n",
    "\n",
    "# destination as rows, source as columns\n",
    "df_routes_count = df_routes_count \\\n",
    "    .groupby(['country_dest', 'country']).agg({'# routes': 'sum'}) \\\n",
    "    .rename(columns={0: '# routes'}) \\\n",
    "    .unstack() \\\n",
    "    .fillna(0) \\\n",
    "    .sort_values('country_dest')\n",
    "\n",
    "# Change type of cells and remove column level for geopandas compatibility\n",
    "df_routes_count = df_routes_count[df_routes_count.columns].astype(int)\n",
    "df_routes_count.columns = df_routes_count.columns.droplevel(0)\n",
    "df_routes_count.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 128,
   "id": "75225ed4",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/html": [
       "<div>\n",
       "<style scoped>\n",
       "    .dataframe tbody tr th:only-of-type {\n",
       "        vertical-align: middle;\n",
       "    }\n",
       "\n",
       "    .dataframe tbody tr th {\n",
       "        vertical-align: top;\n",
       "    }\n",
       "\n",
       "    .dataframe thead th {\n",
       "        text-align: right;\n",
       "    }\n",
       "</style>\n",
       "<table border=\"1\" class=\"dataframe\">\n",
       "  <thead>\n",
       "    <tr style=\"text-align: right;\">\n",
       "      <th></th>\n",
       "      <th>NAME</th>\n",
       "      <th>geometry</th>\n",
       "      <th>Albania</th>\n",
       "      <th>Andorra</th>\n",
       "      <th>Austria</th>\n",
       "      <th>Belarus</th>\n",
       "      <th>Belgium</th>\n",
       "      <th>Bosnia and Herzegovina</th>\n",
       "      <th>Bulgaria</th>\n",
       "      <th>Croatia</th>\n",
       "      <th>...</th>\n",
       "      <th>San Marino</th>\n",
       "      <th>Serbia</th>\n",
       "      <th>Slovakia</th>\n",
       "      <th>Slovenia</th>\n",
       "      <th>Spain</th>\n",
       "      <th>Sweden</th>\n",
       "      <th>Switzerland</th>\n",
       "      <th>Ukraine</th>\n",
       "      <th>United Kingdom</th>\n",
       "      <th>Vatican City</th>\n",
       "    </tr>\n",
       "  </thead>\n",
       "  <tbody>\n",
       "    <tr>\n",
       "      <th>0</th>\n",
       "      <td>Azerbaijan</td>\n",
       "      <td>MULTIPOLYGON (((45.08332 39.76804, 45.26639 39...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>1</th>\n",
       "      <td>Albania</td>\n",
       "      <td>POLYGON ((19.43621 41.02107, 19.45055 41.06000...</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>...</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>2</th>\n",
       "      <td>Armenia</td>\n",
       "      <td>MULTIPOLYGON (((45.57305 40.63249, 45.52888 40...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>...</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "      <td>NaN</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>3</th>\n",
       "      <td>Bosnia and Herzegovina</td>\n",
       "      <td>POLYGON ((17.64984 42.88908, 17.57853 42.94382...</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>2.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>...</td>\n",
       "      <td>0.0</td>\n",
       "      <td>3.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>3.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "    <tr>\n",
       "      <th>4</th>\n",
       "      <td>Bulgaria</td>\n",
       "      <td>POLYGON ((27.87917 42.84110, 27.89500 42.80250...</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>3.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>4.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>6.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>...</td>\n",
       "      <td>0.0</td>\n",
       "      <td>2.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>8.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>1.0</td>\n",
       "      <td>0.0</td>\n",
       "      <td>10.0</td>\n",
       "      <td>0.0</td>\n",
       "    </tr>\n",
       "  </tbody>\n",
       "</table>\n",
       "<p>5 rows × 49 columns</p>\n",
       "</div>"
      ],
      "text/plain": [
       "                     NAME                                           geometry  \\\n",
       "0              Azerbaijan  MULTIPOLYGON (((45.08332 39.76804, 45.26639 39...   \n",
       "1                 Albania  POLYGON ((19.43621 41.02107, 19.45055 41.06000...   \n",
       "2                 Armenia  MULTIPOLYGON (((45.57305 40.63249, 45.52888 40...   \n",
       "3  Bosnia and Herzegovina  POLYGON ((17.64984 42.88908, 17.57853 42.94382...   \n",
       "4                Bulgaria  POLYGON ((27.87917 42.84110, 27.89500 42.80250...   \n",
       "\n",
       "   Albania  Andorra  Austria  Belarus  Belgium  Bosnia and Herzegovina  \\\n",
       "0      NaN      NaN      NaN      NaN      NaN                     NaN   \n",
       "1      0.0      0.0      1.0      0.0      0.0                     0.0   \n",
       "2      NaN      NaN      NaN      NaN      NaN                     NaN   \n",
       "3      0.0      0.0      1.0      0.0      0.0                     2.0   \n",
       "4      0.0      0.0      3.0      0.0      4.0                     0.0   \n",
       "\n",
       "   Bulgaria  Croatia  ...  San Marino  Serbia  Slovakia  Slovenia  Spain  \\\n",
       "0       NaN      NaN  ...         NaN     NaN       NaN       NaN    NaN   \n",
       "1       0.0      0.0  ...         0.0     0.0       0.0       1.0    0.0   \n",
       "2       NaN      NaN  ...         NaN     NaN       NaN       NaN    NaN   \n",
       "3       0.0      1.0  ...         0.0     3.0       0.0       1.0    0.0   \n",
       "4       6.0      0.0  ...         0.0     2.0       0.0       0.0    8.0   \n",
       "\n",
       "   Sweden  Switzerland  Ukraine  United Kingdom  Vatican City  \n",
       "0     NaN          NaN      NaN             NaN           NaN  \n",
       "1     0.0          0.0      0.0             1.0           0.0  \n",
       "2     NaN          NaN      NaN             NaN           NaN  \n",
       "3     3.0          1.0      0.0             0.0           0.0  \n",
       "4     0.0          1.0      0.0            10.0           0.0  \n",
       "\n",
       "[5 rows x 49 columns]"
      ]
     },
     "execution_count": 128,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "# Some countries have column values = 'NaN'. These countries are not in Europe according to the dataset,\n",
    "# but I choose to include them in the map as 'no data' (i.e. in grey)\n",
    "yurop = gpd.read_file(\"./datasets/europe.geojson\") \\\n",
    "    .loc[:, ['NAME', 'geometry']] \\\n",
    "    .set_index('NAME') \\\n",
    "    .join(df_routes_count, how='left') \\\n",
    "    .reset_index()\n",
    "\n",
    "yurop.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 166,
   "id": "11612845",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "application/vnd.bokehjs_exec.v0+json": "",
      "text/html": [
       "<script id=\"p7925\">\n",
       "  (function() {\n",
       "    const xhr = new XMLHttpRequest()\n",
       "    xhr.responseType = 'blob';\n",
       "    xhr.open('GET', \"http://localhost:50851/autoload.js?bokeh-autoload-element=p7925&bokeh-absolute-url=http://localhost:50851&resources=none\", true);\n",
       "    xhr.onload = function (event) {\n",
       "      const script = document.createElement('script');\n",
       "      const src = URL.createObjectURL(event.target.response);\n",
       "      script.src = src;\n",
       "      document.body.appendChild(script);\n",
       "    };\n",
       "    xhr.send();\n",
       "  })();\n",
       "</script>"
      ]
     },
     "metadata": {
      "application/vnd.bokehjs_exec.v0+json": {
       "server_id": "de3272d393da49aba1d3fd3516574d15"
      }
     },
     "output_type": "display_data"
    },
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Index(['NAME', 'geometry', 'routes'], dtype='object')\n",
      "Index(['NAME', 'geometry', 'routes'], dtype='object')\n",
      "Index(['NAME', 'geometry', 'routes'], dtype='object')\n",
      "Index(['NAME', 'geometry', 'routes'], dtype='object')\n",
      "Index(['NAME', 'geometry', 'routes'], dtype='object')\n",
      "Index(['NAME', 'geometry', 'routes'], dtype='object')\n",
      "Index(['NAME', 'geometry', 'routes'], dtype='object')\n",
      "Index(['NAME', 'geometry', 'routes'], dtype='object')\n",
      "Index(['NAME', 'geometry', 'routes'], dtype='object')\n",
      "Index(['NAME', 'geometry', 'routes'], dtype='object')\n",
      "Index(['NAME', 'geometry', 'routes'], dtype='object')\n",
      "Index(['NAME', 'geometry', 'routes'], dtype='object')\n",
      "Index(['NAME', 'geometry', 'routes'], dtype='object')\n",
      "Index(['NAME', 'geometry', 'routes'], dtype='object')\n",
      "Index(['NAME', 'geometry', 'routes'], dtype='object')\n",
      "Index(['NAME', 'geometry', 'routes'], dtype='object')\n",
      "Index(['NAME', 'geometry', 'routes'], dtype='object')\n",
      "Index(['NAME', 'geometry', 'routes'], dtype='object')\n",
      "Index(['NAME', 'geometry', 'routes'], dtype='object')\n"
     ]
    }
   ],
   "source": [
    "from bokeh.events import Tap\n",
    "from bokeh.models.widgets import MultiSelect, Slider, DateRangeSlider\n",
    "from bokeh.layouts import row\n",
    "from bokeh.application import Application\n",
    "from bokeh.application.handlers import FunctionHandler\n",
    "from bokeh.models.ranges import Range1d\n",
    "from bokeh.palettes import Reds\n",
    "from bokeh.models import LinearColorMapper, LogColorMapper, ColorBar\n",
    "from shapely import Point\n",
    "\n",
    "yurop_json = yurop.to_json()\n",
    "\n",
    "def figure_flights(doc):\n",
    "    palette = Reds[6]\n",
    "    palette = palette[::-1]\n",
    "\n",
    "    color_mapper = LinearColorMapper(palette = palette, low = 0, high = 600)\n",
    "\n",
    "    color_bar = ColorBar(color_mapper = color_mapper, \n",
    "                         width = 20, height = 600,\n",
    "                         label_standoff = 8,\n",
    "                         location = (0,0))\n",
    "\n",
    "    \n",
    "    p = figure(title = 'Number of flights to each country', \n",
    "           frame_height = 600,\n",
    "           frame_width = 800, \n",
    "           toolbar_location = 'below',\n",
    "           tools = \"pan, wheel_zoom, box_zoom, reset\")\n",
    "\n",
    "    geo_ds = GeoJSONDataSource(geojson=yurop_json)\n",
    "    \n",
    "    plotted_districts = p.patches('xs','ys', source = geo_ds,\n",
    "                    line_color = 'black', \n",
    "                    line_width = 0.25)\n",
    "\n",
    "    p.patches(\"xs\",\"ys\", source = geo_ds,\n",
    "                   fill_color = {\"field\" : \"routes\",\n",
    "                                 \"transform\" : color_mapper},\n",
    "                   line_color = \"gray\", \n",
    "                   line_width = 0.25, \n",
    "                   fill_alpha = 1)\n",
    "    \n",
    "    p.xgrid.grid_line_color = None\n",
    "    p.ygrid.grid_line_color = None\n",
    "    p.axis.visible = False\n",
    "    \n",
    "    p.add_tools(HoverTool(renderers = [plotted_districts],\n",
    "                        tooltips = [(\"Country\",\"@NAME\"),(\"# routes\",\"@routes\")]))\n",
    "\n",
    "    tool = TapTool()\n",
    "    \n",
    "    def event(x):\n",
    "        # Figure out the country that intersects the coordinates we clicked\n",
    "        intersects = yurop[yurop.intersects(Point(x.x, x.y))]\n",
    "        if len(intersects) == 0:\n",
    "            return\n",
    "        \n",
    "        routes_from_country = intersects.iloc[0, 2:].to_frame(name='routes')\n",
    "        gdf_country_flights = yurop.set_index('NAME').loc[:, ['geometry']] \\\n",
    "            .join(routes_from_country) \\\n",
    "            .reset_index()\n",
    "        print(gdf_country_flights.columns)\n",
    "        \n",
    "        geo_ds_country = gdf_country_flights.to_json()\n",
    "        \n",
    "        geo_ds.geojson = geo_ds_country\n",
    "\n",
    "    tap = p.add_tools()\n",
    "    p.on_event(Tap, event)\n",
    "    \n",
    "    p.add_layout(color_bar, \"right\")\n",
    "    doc.add_root(p)\n",
    "\n",
    "handler = FunctionHandler(figure_flights)\n",
    "app = Application(handler)\n",
    "\n",
    "show(app)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9b9c5983",
   "metadata": {},
   "source": [
    "## Datasets description\n",
    "\n",
    "### **Used Cars**\n",
    "\n",
    "Please find the dataset in the datasets folder.\n",
    "\n",
    "This dataset is scraped from Ebay. The content of the dataset is in German, but it should not impose critical issues in understanding the data. The fields included in the dataset are as following:\n",
    "\n",
    "**dateCrawled**: when this ad was first crawled, all field-values are taken from this date\\\n",
    "**name**: ”name” of the car\\\n",
    "**seller**: private or dealer\\\n",
    "**offerTypeprice**: the price in euro on the ad to sell the car\\\n",
    "**abtestvehicleTypeyearOfRegistration** : at which year the car was first registered\\\n",
    "**gearboxpowerPS**: power of the car in PS\\\n",
    "**modelkilometer**: how many kilometers the car has driven\\\n",
    "**monthOfRegistration**: at which month the car was first registered\\\n",
    "**fuelType**: vehicle fuel type\\\n",
    "**brand**: vehicle brand\\\n",
    "**notRepairedDamage**: if the car has a damage which is not repaired yet\\\n",
    "**dateCreated**: the date for which the ad at ebay was created\\\n",
    "**nrOfPictures**: number of pictures in the ad\\\n",
    "**postalCodelastSeenOnline**: when the crawler saw this ad last online\n",
    "\n",
    "### **Airports, Routes and Ariports Delays**\n",
    "\n",
    "Please find the datasets in the datasets folder.\n",
    "\n",
    "The datasets used in this section can be found in the datasets folder.\n",
    "Datasets description are as follows.\n",
    "\n",
    "### **Airports**\n",
    "\n",
    "As of January 2017, the OpenFlights Airports Database contains over 10,000 airports, train stations and ferry terminals spanning the globe, as shown in the map above. Each entry contains the following information:\n",
    "\n",
    "**Airport ID**: Unique OpenFlights identifier for this airport\\\n",
    "**Name**: Name of airport. May or may not contain the City name\\\n",
    "**City**: Main city served by airport. May be spelled differently from Name\\\n",
    "**Country**: Country or territory where airport is located. See Countries to cross-reference to ISO 3166-1 codes\\\n",
    "**IATA**: 3-letter IATA code. Null if not assigned/unknown\\\n",
    "**ICAO**: 4-letter ICAO code. Null if not assigned/unknown\\\n",
    "**Latitude**: Decimal degrees, usually to six significant digits. Negative is South, positive is North\\\n",
    "**Longitude**: Decimal degrees, usually to six significant digits. Negative is West, positive is East\\\n",
    "**Altitude**: In feet\\\n",
    "**Timezone**: Hours offset from UTC. Fractional hours are expressed as decimals, eg. India is 5.5\\\n",
    "**DST**: Daylight savings time. One of E (Europe), A (US/Canada), S (South America), O (Australia), Z (New Zealand), N (None) or U (Unknown)\\\n",
    "**Tz database time zone**: Timezone in \"tz\" (Olson) format, eg. \"America/Los_Angeles\"\\\n",
    "**Type**: Type of the airport. Value \"airport\" for air terminals\\\n",
    "**Source**: Source of the data. \"OurAirports\" for data sourced from OurAirports\n",
    "\n",
    "### **Airports Delays**\n",
    "**Airport ID**: Unique OpenFlights identifier for this airport\\\n",
    "**Name**: Name of airport. May or may not contain the City name\\\n",
    "**City**: Main city served by airport. May be spelled differently from Name\\\n",
    "**Country**: Country or territory where airport is located. See Countries to cross-reference to ISO 3166-1 codes\\\n",
    "**IATA**: 3-letter IATA code. Null if not assigned/unknown\\\n",
    "**ICAO**: 4-letter ICAO code. Null if not assigned/unknown\\\n",
    "**Latitude**: Decimal degrees, usually to six significant digits. Negative is South, positive is North\\\n",
    "**Longitude**: Decimal degrees, usually to six significant digits. Negative is West, positive is East\\\n",
    "**Altitude**: In feet\\\n",
    "**Timezone**: Hours offset from UTC. Fractional hours are expressed as decimals, eg. India is 5.5\\\n",
    "**DST**: Daylight savings time. One of E (Europe), A (US/Canada), S (South America), O (Australia), Z (New Zealand), N (None) or U (Unknown)\\\n",
    "**Tz database time zone**: Timezone in \"tz\" (Olson) format, eg. \"America/Los_Angeles\"\\\n",
    "**Type**: Type of the airport. Value \"airport\" for air terminals\\\n",
    "**Source**: Source of the data. \"OurAirports\" for data sourced from OurAirports\\\n",
    "**Flights planned**: The number of fligths the related airport planned\\\n",
    "**Flights cancelled**: The number of flights cancelled\\\n",
    "**Flights delayed**: The number of flights delayed\\\n",
    "**Delay duration**: The delay duration (in minutes)\n",
    "\n",
    "\n",
    "### **Routes**\n",
    "\n",
    "As of June 2014, the OpenFlights/Airline Route Mapper Route Database contains 67663 routes between 3321 airports on 548 airlines spanning the globe, as shown in the map above. Each entry contains the following information:\n",
    "\n",
    "**Airline**: 2-letter (IATA) or 3-letter (ICAO) code of the airline\\\n",
    "**Airline ID**: Unique OpenFlights identifier for airline (see Airline)\\\n",
    "**Source airport**: 3-letter (IATA) or 4-letter (ICAO) code of the source airport\\\n",
    "**Source airport ID**: Unique OpenFlights identifier for source airport (see Airport)\\\n",
    "**Destination airport**: 3-letter (IATA) or 4-letter (ICAO) code of the destination airport\\\n",
    "**Destination airport ID**: Unique OpenFlights identifier for destination airport (see Airport)\\\n",
    "**Codeshare**: \"Y\" if this flight is a codeshare (that is, not operated by Airline, but another carrier), empty otherwise\\\n",
    "**Stops**: Number of stops on this flight (\"0\" for direct)\\\n",
    "**Equipment**: 3-letter codes for plane type(s) generally used on this flight, separated by spaces\\\n",
    "The data is UTF-8 encoded. The special value \\N is used for \"NULL\" to indicate that no value is available, and is understood automatically by MySQL if imported\n",
    "\n",
    "\n",
    "<aside>\n",
    "💡 Notes:\n",
    "\n",
    "- Routes are directional: if an airline operates services from A to B and from B to A, both A-B and B-A are listed separately.\n",
    "- Routes where one carrier operates both its own and codeshare flights are listed only once.\n",
    "</aside>\n",
    "\n",
    "\n",
    "### **Countries**\n",
    "\n",
    "Please find the dataset in the datasets folder.\n",
    "\n",
    "This dataset contains the information related to European countries.\n"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.10.10"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}