From 5a59262a91d855fb71f901de512feafc584d075a Mon Sep 17 00:00:00 2001
From: Claudio Maggioni <maggicl@usi.ch>
Date: Tue, 30 May 2023 17:52:00 +0200
Subject: [PATCH] hw3: done 1-4, wip 5

---
 Assignment3/.gitignore                        |    3 +
 Assignment3/MaggioniClaudio_Assignment3.ipynb | 1107 +++++++++++++++++
 Assignment3/requirements.txt                  |    4 +
 3 files changed, 1114 insertions(+)
 create mode 100644 Assignment3/.gitignore
 create mode 100644 Assignment3/MaggioniClaudio_Assignment3.ipynb
 create mode 100644 Assignment3/requirements.txt

diff --git a/Assignment3/.gitignore b/Assignment3/.gitignore
new file mode 100644
index 0000000..4bd34d4
--- /dev/null
+++ b/Assignment3/.gitignore
@@ -0,0 +1,3 @@
+.env/
+data/
+!data/.gitkeep
diff --git a/Assignment3/MaggioniClaudio_Assignment3.ipynb b/Assignment3/MaggioniClaudio_Assignment3.ipynb
new file mode 100644
index 0000000..11c3ac6
--- /dev/null
+++ b/Assignment3/MaggioniClaudio_Assignment3.ipynb
@@ -0,0 +1,1107 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "23b48f71",
+   "metadata": {},
+   "source": [
+    "# S&DE Atelier - Visual Analytics\n",
+    "\n",
+    "# Assignment 3\n",
+    "\n",
+    "**Due** June 2, 2023 @23:55\n",
+    "\n",
+    "**Contacts**: [marco.dambros@usi.ch](mailto:marco.dambros@usi.ch) - [carmen.armenti@usi.ch](mailto:carmen.armenti@usi.ch)\n",
+    "\n",
+    "---\n",
+    "\n",
+    "The goal of this assignment is to use Spark in Jupyter notebooks (PySpark). The files `trip_data.csv`, `trip_fare.csv` and `nyc_boroughs.geojson` can be found in the following folder: [Assignment3-data](https://usi365-my.sharepoint.com/:f:/g/personal/armenc_usi_ch/Ejp7sb8QAMROoWe0XUDcAkMBoqUFk-w2Vgroup025NhAww?e=TFG5CD). You should clean the data if needed. \n",
+    "\n",
+    "Note that you can use Spark [window functions](https://spark.apache.org/docs/latest/sql-ref-syntax-qry-select-window.html) whenever applicable.  \n",
+    "\n",
+    "Please name your file as `SurnameName_Assignment3.ipynb`."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "id": "9f434eb8",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Import the basic spark library\n",
+    "from pyspark.sql import SparkSession\n",
+    "from pyspark.sql.functions import col"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "id": "4a3188f4",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "#import sys\n",
+    "#!{sys.executable} -m pip install geospark"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "id": "b9a87a5c",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "Setting default log level to \"WARN\".\n",
+      "To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).\n",
+      "23/05/30 15:56:43 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable\n"
+     ]
+    }
+   ],
+   "source": [
+    "# Create an entry point to the PySpark Application\n",
+    "spark = SparkSession.builder \\\n",
+    "      .config(\"spark.driver.bindAddress\", \"127.0.0.1\") \\\n",
+    "      .config(\"spark.driver.memory\", \"16g\") \\\n",
+    "      .config(\"spark.executor.memory\", \"16g\") \\\n",
+    "      .config(\"spark.executor.cores\", \"4\") \\\n",
+    "      .config(\"spark.executor.memory\", \"16g\") \\\n",
+    "      .master(\"local\") \\\n",
+    "      .appName(\"MaggioniClaudio_Assignment3\") \\\n",
+    "      .getOrCreate()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "536a6cc4",
+   "metadata": {},
+   "source": [
+    "### Exercise 1\n",
+    "Join the `trip_data` and `trip_fare` dataframes into one and consider only data on 2013-01-01."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "id": "9fc094c8",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "def sanitize_column_names(df):\n",
+    "    for original, renamed in [(x, x.strip().replace(\" \", \"_\"),) for x in df.columns]:\n",
+    "        df = df.withColumnRenamed(original, renamed)\n",
+    "    return df"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "id": "afe8000d",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "                                                                                \r"
+     ]
+    }
+   ],
+   "source": [
+    "df_trip_data = spark.read \\\n",
+    "    .option(\"header\", True) \\\n",
+    "    .csv(\"data/trip_data.csv\", inferSchema=True)\n",
+    "\n",
+    "df_trip_data = sanitize_column_names(df_trip_data)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 6,
+   "id": "4dfe92f6",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "                                                                                \r"
+     ]
+    }
+   ],
+   "source": [
+    "df_trip_fare = spark.read \\\n",
+    "    .option(\"header\", True) \\\n",
+    "    .csv(\"data/trip_fare.csv\", inferSchema=True)\n",
+    "\n",
+    "df_trip_fare = sanitize_column_names(df_trip_fare)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 7,
+   "id": "d76abc83",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "+--------------------+--------------------+---------+---------+------------------+-------------------+-------------------+---------------+-----------------+-------------+----------------+---------------+-----------------+----------------+\n",
+      "|           medallion|        hack_license|vendor_id|rate_code|store_and_fwd_flag|    pickup_datetime|   dropoff_datetime|passenger_count|trip_time_in_secs|trip_distance|pickup_longitude|pickup_latitude|dropoff_longitude|dropoff_latitude|\n",
+      "+--------------------+--------------------+---------+---------+------------------+-------------------+-------------------+---------------+-----------------+-------------+----------------+---------------+-----------------+----------------+\n",
+      "|89D227B655E5C82AE...|BA96DE419E711691B...|      CMT|        1|                 N|2013-01-01 15:11:48|2013-01-01 15:18:10|              4|              382|          1.0|      -73.978165|      40.757977|       -73.989838|       40.751171|\n",
+      "|0BD7C8F5BA12B88E0...|9FD8F69F0804BDB55...|      CMT|        1|                 N|2013-01-06 00:18:35|2013-01-06 00:22:54|              1|              259|          1.5|      -74.006683|      40.731781|       -73.994499|        40.75066|\n",
+      "|0BD7C8F5BA12B88E0...|9FD8F69F0804BDB55...|      CMT|        1|                 N|2013-01-05 18:49:41|2013-01-05 18:54:23|              1|              282|          1.1|      -74.004707|       40.73777|       -74.009834|       40.726002|\n",
+      "|DFD2202EE08F7A8DC...|51EE87E3205C985EF...|      CMT|        1|                 N|2013-01-07 23:54:15|2013-01-07 23:58:20|              2|              244|          0.7|      -73.974602|      40.759945|       -73.984734|       40.759388|\n",
+      "|DFD2202EE08F7A8DC...|51EE87E3205C985EF...|      CMT|        1|                 N|2013-01-07 23:25:03|2013-01-07 23:34:24|              1|              560|          2.1|       -73.97625|      40.748528|       -74.002586|       40.747868|\n",
+      "|20D9ECB2CA0767CF7...|598CCE5B9C1918568...|      CMT|        1|                 N|2013-01-07 15:27:48|2013-01-07 15:38:37|              1|              648|          1.7|      -73.966743|      40.764252|       -73.983322|       40.743763|\n",
+      "|496644932DF393260...|513189AD756FF14FE...|      CMT|        1|                 N|2013-01-08 11:01:15|2013-01-08 11:08:14|              1|              418|          0.8|      -73.995804|      40.743977|       -74.007416|       40.744343|\n",
+      "|0B57B9633A2FECD3D...|CCD4367B417ED6634...|      CMT|        1|                 N|2013-01-07 12:39:18|2013-01-07 13:10:56|              3|             1898|         10.7|      -73.989937|      40.756775|        -73.86525|        40.77063|\n",
+      "|2C0E91FF20A856C89...|1DA2F6543A62B8ED9...|      CMT|        1|                 N|2013-01-07 18:15:47|2013-01-07 18:20:47|              1|              299|          0.8|      -73.980072|      40.743137|       -73.982712|       40.735336|\n",
+      "|2D4B95E2FA7B2E851...|CD2F522EEE1FF5F5A...|      CMT|        1|                 N|2013-01-07 15:33:28|2013-01-07 15:49:26|              2|              957|          2.5|      -73.977936|      40.786983|       -73.952919|        40.80637|\n",
+      "|E12F6AF991172EAC3...|06918214E951FA000...|      CMT|        1|                 N|2013-01-08 13:11:52|2013-01-08 13:19:50|              1|              477|          1.3|      -73.982452|      40.773167|       -73.964134|       40.773815|\n",
+      "|E12F6AF991172EAC3...|06918214E951FA000...|      CMT|        1|                 N|2013-01-08 09:50:05|2013-01-08 10:02:54|              1|              768|          0.7|       -73.99556|      40.749294|       -73.988686|       40.759052|\n",
+      "|78FFD9CD0CDA541F3...|E949C583ECF62C8F0...|      CMT|        1|                 N|2013-01-10 12:07:08|2013-01-10 12:17:29|              1|              620|          2.3|      -73.971497|      40.791321|       -73.964478|       40.775921|\n",
+      "|237F49C3ECC11F502...|93C363DDF8ED9385D...|      CMT|        1|                 N|2013-01-07 07:35:47|2013-01-07 07:46:00|              1|              612|          2.3|       -73.98851|      40.774307|       -73.981094|       40.755325|\n",
+      "|3349F919AA8AE5DC9...|7CE849FEF67514F08...|      CMT|        1|                 N|2013-01-10 15:42:29|2013-01-10 16:04:02|              1|             1293|          3.2|      -73.994911|      40.723221|       -73.971558|       40.761612|\n",
+      "|3349F919AA8AE5DC9...|7CE849FEF67514F08...|      CMT|        1|                 N|2013-01-10 14:27:28|2013-01-10 14:45:21|              1|             1073|          4.4|      -74.010391|      40.708702|       -73.987846|       40.756104|\n",
+      "|4C005EEBAA7BF26B8...|351BE7D984BE17DB2...|      CMT|        1|                 N|2013-01-07 22:09:59|2013-01-07 22:19:50|              1|              591|          1.7|      -73.973732|      40.756287|       -73.998413|       40.756832|\n",
+      "|7D99C30FCE69B1A9D...|460C3F57DD9CB2265...|      CMT|        1|                 N|2013-01-07 17:18:16|2013-01-07 17:20:55|              1|              158|          0.7|      -73.968925|      40.767704|        -73.96199|       40.776566|\n",
+      "|E6FBF80668FE0611A...|36773E80775F26CD1...|      CMT|        1|                 N|2013-01-07 06:08:51|2013-01-07 06:13:14|              1|              262|          1.7|       -73.96212|      40.769737|       -73.979561|        40.75539|\n",
+      "|0C5296F3C8B16E702...|D2363240A9295EF57...|      CMT|        1|                 N|2013-01-07 22:25:46|2013-01-07 22:36:56|              1|              669|          2.3|      -73.989708|      40.756714|       -73.977615|       40.787575|\n",
+      "+--------------------+--------------------+---------+---------+------------------+-------------------+-------------------+---------------+-----------------+-------------+----------------+---------------+-----------------+----------------+\n",
+      "only showing top 20 rows\n",
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "df_trip_data.show()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 8,
+   "id": "3c7ccbd4",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "+--------------------+--------------------+---------+-------------------+------------+-----------+---------+-------+----------+------------+------------+\n",
+      "|           medallion|        hack_license|vendor_id|    pickup_datetime|payment_type|fare_amount|surcharge|mta_tax|tip_amount|tolls_amount|total_amount|\n",
+      "+--------------------+--------------------+---------+-------------------+------------+-----------+---------+-------+----------+------------+------------+\n",
+      "|89D227B655E5C82AE...|BA96DE419E711691B...|      CMT|2013-01-01 15:11:48|         CSH|        6.5|      0.0|    0.5|       0.0|         0.0|         7.0|\n",
+      "|0BD7C8F5BA12B88E0...|9FD8F69F0804BDB55...|      CMT|2013-01-06 00:18:35|         CSH|        6.0|      0.5|    0.5|       0.0|         0.0|         7.0|\n",
+      "|0BD7C8F5BA12B88E0...|9FD8F69F0804BDB55...|      CMT|2013-01-05 18:49:41|         CSH|        5.5|      1.0|    0.5|       0.0|         0.0|         7.0|\n",
+      "|DFD2202EE08F7A8DC...|51EE87E3205C985EF...|      CMT|2013-01-07 23:54:15|         CSH|        5.0|      0.5|    0.5|       0.0|         0.0|         6.0|\n",
+      "|DFD2202EE08F7A8DC...|51EE87E3205C985EF...|      CMT|2013-01-07 23:25:03|         CSH|        9.5|      0.5|    0.5|       0.0|         0.0|        10.5|\n",
+      "|20D9ECB2CA0767CF7...|598CCE5B9C1918568...|      CMT|2013-01-07 15:27:48|         CSH|        9.5|      0.0|    0.5|       0.0|         0.0|        10.0|\n",
+      "|496644932DF393260...|513189AD756FF14FE...|      CMT|2013-01-08 11:01:15|         CSH|        6.0|      0.0|    0.5|       0.0|         0.0|         6.5|\n",
+      "|0B57B9633A2FECD3D...|CCD4367B417ED6634...|      CMT|2013-01-07 12:39:18|         CSH|       34.0|      0.0|    0.5|       0.0|         4.8|        39.3|\n",
+      "|2C0E91FF20A856C89...|1DA2F6543A62B8ED9...|      CMT|2013-01-07 18:15:47|         CSH|        5.5|      1.0|    0.5|       0.0|         0.0|         7.0|\n",
+      "|2D4B95E2FA7B2E851...|CD2F522EEE1FF5F5A...|      CMT|2013-01-07 15:33:28|         CSH|       13.0|      0.0|    0.5|       0.0|         0.0|        13.5|\n",
+      "|E12F6AF991172EAC3...|06918214E951FA000...|      CMT|2013-01-08 13:11:52|         CSH|        7.5|      0.0|    0.5|       0.0|         0.0|         8.0|\n",
+      "|E12F6AF991172EAC3...|06918214E951FA000...|      CMT|2013-01-08 09:50:05|         CSH|        9.0|      0.0|    0.5|       0.0|         0.0|         9.5|\n",
+      "|78FFD9CD0CDA541F3...|E949C583ECF62C8F0...|      CMT|2013-01-10 12:07:08|         CSH|        9.5|      0.0|    0.5|       0.0|         0.0|        10.0|\n",
+      "|237F49C3ECC11F502...|93C363DDF8ED9385D...|      CMT|2013-01-07 07:35:47|         CSH|       10.0|      0.0|    0.5|       0.0|         0.0|        10.5|\n",
+      "|3349F919AA8AE5DC9...|7CE849FEF67514F08...|      CMT|2013-01-10 15:42:29|         CSH|       15.5|      0.0|    0.5|       0.0|         0.0|        16.0|\n",
+      "|3349F919AA8AE5DC9...|7CE849FEF67514F08...|      CMT|2013-01-10 14:27:28|         CSH|       16.5|      0.0|    0.5|       0.0|         0.0|        17.0|\n",
+      "|4C005EEBAA7BF26B8...|351BE7D984BE17DB2...|      CMT|2013-01-07 22:09:59|         CSH|        9.0|      0.5|    0.5|       0.0|         0.0|        10.0|\n",
+      "|7D99C30FCE69B1A9D...|460C3F57DD9CB2265...|      CMT|2013-01-07 17:18:16|         CSH|        4.5|      1.0|    0.5|       0.0|         0.0|         6.0|\n",
+      "|E6FBF80668FE0611A...|36773E80775F26CD1...|      CMT|2013-01-07 06:08:51|         CSH|        7.0|      0.0|    0.5|       0.0|         0.0|         7.5|\n",
+      "|0C5296F3C8B16E702...|D2363240A9295EF57...|      CMT|2013-01-07 22:25:46|         CSH|       10.5|      0.5|    0.5|       0.0|         0.0|        11.5|\n",
+      "+--------------------+--------------------+---------+-------------------+------------+-----------+---------+-------+----------+------------+------------+\n",
+      "only showing top 20 rows\n",
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "df_trip_fare.show()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 9,
+   "id": "61e21d2a",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "df_left = df_trip_data.filter(col('pickup_datetime').startswith(\"2013-01-01 \"))\n",
+    "df_right = df_trip_fare.filter(col('pickup_datetime').startswith(\"2013-01-01 \"))\n",
+    "\n",
+    "df_joined = df_left.join(df_right, ['medallion', 'pickup_datetime']).cache()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 10,
+   "id": "d73ab313",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "[Stage 7:====================================================>    (12 + 1) / 13]\r"
+     ]
+    },
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "+--------------------+-------------------+--------------------+---------+---------+------------------+-------------------+---------------+-----------------+-------------+----------------+---------------+-----------------+----------------+--------------------+---------+------------+-----------+---------+-------+----------+------------+------------+\n",
+      "|           medallion|    pickup_datetime|        hack_license|vendor_id|rate_code|store_and_fwd_flag|   dropoff_datetime|passenger_count|trip_time_in_secs|trip_distance|pickup_longitude|pickup_latitude|dropoff_longitude|dropoff_latitude|        hack_license|vendor_id|payment_type|fare_amount|surcharge|mta_tax|tip_amount|tolls_amount|total_amount|\n",
+      "+--------------------+-------------------+--------------------+---------+---------+------------------+-------------------+---------------+-----------------+-------------+----------------+---------------+-----------------+----------------+--------------------+---------+------------+-----------+---------+-------+----------+------------+------------+\n",
+      "|000318C2E3E638158...|2013-01-01 20:46:00|91CE3B3A2F548CD8A...|      VTS|        1|              null|2013-01-01 20:56:00|              5|              600|         1.35|      -73.989677|      40.756554|       -73.970673|       40.752541|91CE3B3A2F548CD8A...|      VTS|         CRD|        8.5|      0.5|    0.5|       1.8|         0.0|        11.3|\n",
+      "|00790C7BAD30B7A9E...|2013-01-01 04:26:00|3EF1ED607505C991D...|      VTS|        1|              null|2013-01-01 04:59:00|              1|             1980|        10.99|      -73.996811|      40.716587|       -73.949448|       40.827671|3EF1ED607505C991D...|      VTS|         CRD|       36.5|      0.5|    0.5|      9.25|         0.0|       46.75|\n",
+      "|00A1EA0E8CD47CE24...|2013-01-01 06:09:50|4FD770C068437BBA9...|      CMT|        1|                 N|2013-01-01 06:29:03|              1|             1153|          5.8|       -73.89653|      40.759472|       -73.952698|       40.780788|4FD770C068437BBA9...|      CMT|         CRD|       20.5|      0.0|    0.5|       4.0|         0.0|        25.0|\n",
+      "+--------------------+-------------------+--------------------+---------+---------+------------------+-------------------+---------------+-----------------+-------------+----------------+---------------+-----------------+----------------+--------------------+---------+------------+-----------+---------+-------+----------+------------+------------+\n",
+      "only showing top 3 rows\n",
+      "\n"
+     ]
+    },
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "                                                                                \r"
+     ]
+    }
+   ],
+   "source": [
+    "df_joined.show(3)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "5f246287",
+   "metadata": {},
+   "source": [
+    "### Exercise 2\n",
+    "Consider only Manhattan, Bronx and Brooklyn districts. Then create a dataframe that shows the total number of trips *within* the same district and *across* all the other districts mentioned before.\n",
+    "\n",
+    "For example, for Manhattan borough you should consider the total number of the following trips:\n",
+    "- Manhattan → Manhattan\n",
+    "- Manhattan → Brooklyn\n",
+    "- Manhattan → Bronx\n",
+    "\n",
+    "You should then do the same for Bronx and Brooklyn boroughs."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 11,
+   "id": "97e35f13",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from pyspark.sql import types as T\n",
+    "from pyspark.sql import functions as F\n",
+    "from shapely.geometry import Polygon, Point\n",
+    "from typing import Optional\n",
+    "\n",
+    "df_boroughs = spark.read \\\n",
+    "    .option(\"multiline\", \"true\") \\\n",
+    "    .json(r'data/nyc-boroughs.geojson')\n",
+    "\n",
+    "df_boroughs = df_boroughs.select(F.explode(df_boroughs.features).alias(\"feature\"))\n",
+    "\n",
+    "boroughs_list = df_boroughs.select( \\\n",
+    "    df_boroughs.feature.properties.borough.alias(\"borough\"), \\\n",
+    "    df_boroughs.feature.geometry.coordinates.alias(\"coordinates\")).collect()\n",
+    "\n",
+    "boroughs_list: list[tuple[str, list[Polygon]]] = \\\n",
+    "    [(r.borough, [Polygon(shell=p) for p in r.coordinates]) for r in boroughs_list]\n",
+    "\n",
+    "@F.udf(returnType=T.StringType())\n",
+    "def get_borough(lon: float, lat: float) -> Optional[str]:\n",
+    "    global boroughs_list\n",
+    "\n",
+    "    if lon is None or lat is None:\n",
+    "        return None\n",
+    "\n",
+    "    point = Point(lon, lat)\n",
+    "    \n",
+    "    for b in boroughs_list:\n",
+    "        for p in b[1]:\n",
+    "            if p.contains(point):\n",
+    "                return b[0]\n",
+    "    return None"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 12,
+   "id": "b12aa2ec",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Tag each trip with its pickup and dropoff borough using the UDF\n",
+    "df_with_bor = df_joined \\\n",
+    "    .withColumn(\"pickup_borough\", get_borough(\"pickup_longitude\", \"pickup_latitude\")) \\\n",
+    "    .withColumn(\"dropoff_borough\", get_borough(\"dropoff_longitude\", \"dropoff_latitude\")) \\\n",
+    "    .cache()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 13,
+   "id": "9d386ada-5bd0-4db5-9ac7-e675f371682c",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "                                                                                \r"
+     ]
+    },
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Before borough join: 412630\n"
+     ]
+    },
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "[Stage 20:=====================================================>(199 + 1) / 200]\r"
+     ]
+    },
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "After borough join:412630\n"
+     ]
+    },
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "                                                                                \r"
+     ]
+    }
+   ],
+   "source": [
+    "print(\"Before borough join: \" + str(df_joined.count())) \n",
+    "print(\"After borough join:\" + str(df_with_bor.count()))"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 20,
+   "id": "9c14ad76-388a-454a-96c0-bf38765ce0dd",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "[Stage 55:=================================================>    (185 + 1) / 200]\r"
+     ]
+    },
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "+--------------+---------------+------+\n",
+      "|pickup_borough|dropoff_borough|count |\n",
+      "+--------------+---------------+------+\n",
+      "|Bronx         |Bronx          |487   |\n",
+      "|Bronx         |Brooklyn       |6     |\n",
+      "|Bronx         |Manhattan      |284   |\n",
+      "|Brooklyn      |Bronx          |57    |\n",
+      "|Brooklyn      |Brooklyn       |10454 |\n",
+      "|Brooklyn      |Manhattan      |6408  |\n",
+      "|Manhattan     |Bronx          |2779  |\n",
+      "|Manhattan     |Brooklyn       |14396 |\n",
+      "|Manhattan     |Manhattan      |319706|\n",
+      "+--------------+---------------+------+\n",
+      "\n"
+     ]
+    },
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "                                                                                \r"
+     ]
+    }
+   ],
+   "source": [
+    "def isin(var, values):\n",
+    "    cond = (var == values[0])\n",
+    "    for i in range(1, len(values)):\n",
+    "        cond = cond | (var == values[i])\n",
+    "    return cond\n",
+    "\n",
+    "boroughs = [\"Manhattan\", \"Bronx\", \"Brooklyn\"]\n",
+    "df_ex2 = df_with_bor \\\n",
+    "    .where((isin(df_with_bor.pickup_borough, boroughs)) & (isin(df_with_bor.dropoff_borough, boroughs))) \\\n",
+    "    .groupBy(\"pickup_borough\", \"dropoff_borough\") \\\n",
+    "    .count() \\\n",
+    "    .orderBy(\"pickup_borough\", \"dropoff_borough\")\n",
+    "df_ex2.show(truncate=False)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "21bd4ac8",
+   "metadata": {},
+   "source": [
+    "### Exercise 3\n",
+    "Imagine you are a taxi driver and one day you can work only two hours. Assume the data is representative of a typical working day. Which hours of the day - retrieved from `pickup_datetime` - would you choose to work based on the fare and tip amount?"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 47,
+   "id": "46d191e1-fd13-4de3-8851-5e10a7319286",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "[Stage 215:===============================================>     (181 + 1) / 200]\r"
+     ]
+    },
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "+-----------+------------------+\n",
+      "|pickup_hour|fare_and_tip_total|\n",
+      "+-----------+------------------+\n",
+      "|1          |453700.23         |\n",
+      "|2          |418415.82         |\n",
+      "|0          |390741.27         |\n",
+      "|3          |367018.78         |\n",
+      "|14         |286852.68         |\n",
+      "|15         |278953.43         |\n",
+      "|4          |272856.05         |\n",
+      "|18         |269648.14         |\n",
+      "|13         |263915.72         |\n",
+      "|17         |258134.56         |\n",
+      "|16         |246552.73         |\n",
+      "|12         |238716.32         |\n",
+      "|19         |234377.86         |\n",
+      "|20         |211402.98         |\n",
+      "|21         |208110.83         |\n",
+      "|22         |204481.56         |\n",
+      "|11         |194952.87         |\n",
+      "|5          |180075.5          |\n",
+      "|23         |158957.41         |\n",
+      "|10         |146400.51         |\n",
+      "+-----------+------------------+\n",
+      "only showing top 20 rows\n",
+      "\n"
+     ]
+    },
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "                                                                                \r"
+     ]
+    }
+   ],
+   "source": [
+    "df_ex3 = df_joined.select( \\\n",
+    "    F.hour(df_joined.pickup_datetime).alias('pickup_hour'), \\\n",
+    "    F.col(\"fare_amount\"), \\\n",
+    "    F.col(\"tip_amount\")) \\\n",
+    "    .groupby(\"pickup_hour\") \\\n",
+    "    .agg(F.round(F.sum(F.col(\"fare_amount\") + F.col(\"tip_amount\")), 2).alias('fare_and_tip_total')) \\\n",
+    "    .select(\"pickup_hour\", \"fare_and_tip_total\") \\\n",
+    "    .sort(F.desc(\"fare_and_tip_total\"))\n",
+    "\n",
+    "df_ex3.show(truncate=False)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "ffbbaf04-65b5-4fc2-879f-a2a8bcc87519",
+   "metadata": {},
+   "source": [
+    "Given the table above, I would work during the hours starting at **1 AM** and **2 AM**, since they have the highest total fare and tip amounts. This peak may be specific to the chosen date, `2013-01-01`, because of the New Year's celebrations."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "b24e0922",
+   "metadata": {},
+   "source": [
+    "### Exercise 4\n",
+    "Provide a graphical representation to compare the average fare amount for trips _within_ and _across_ all the districts. You may want to have a look at: https://docs.bokeh.org/en/latest/docs/user_guide/topics/categorical.html#heatmaps."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 72,
+   "id": "0643d9e4",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "                                                                                \r"
+     ]
+    }
+   ],
+   "source": [
+    "import pandas as pd\n",
+    "\n",
+    "ex4_data = df_with_bor \\\n",
+    "    .withColumn(\"pickup_borough\", F.coalesce(F.col(\"pickup_borough\"), F.lit(\"Unknown\"))) \\\n",
+    "    .withColumn(\"dropoff_borough\", F.coalesce(F.col(\"dropoff_borough\"), F.lit(\"Unknown\"))) \\\n",
+    "    .groupBy(\"pickup_borough\", \"dropoff_borough\") \\\n",
+    "    .agg(F.mean(F.col('fare_amount')).alias('mean_fare_amount')) \\\n",
+    "    .collect()\n",
+    "\n",
+    "df_ex4 = pd.DataFrame(\n",
+    "    [(row.pickup_borough, row.dropoff_borough, row.mean_fare_amount)\n",
+    "     for row in ex4_data],\n",
+    "    columns=['pickup_borough', 'dropoff_borough', 'mean_fare']\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 79,
+   "id": "2cba45e6-7ad1-4044-b9f0-81943c1cf547",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from math import pi\n",
+    "from bokeh.models import BasicTicker, PrintfTickFormatter\n",
+    "from bokeh.plotting import figure, show\n",
+    "from bokeh.transform import linear_cmap\n",
+    "\n",
+    "pickup = sorted(df_ex4['pickup_borough'].unique())\n",
+    "dropoff = sorted(df_ex4['dropoff_borough'].unique(), reverse=True)\n",
+    "\n",
+    "colors = [\"#75968f\", \"#a5bab7\", \"#c9d9d3\", \"#e2e2e2\", \"#dfccce\", \"#ddb7b1\", \"#cc7878\", \"#933b41\", \"#550b1d\"]\n",
+    "\n",
+    "p = figure(title=\"Mean NYC Taxi fares on 2013-01-01\",\n",
+    "           x_range=pickup, y_range=dropoff,\n",
+    "           x_axis_location=\"above\", width=900, height=900,\n",
+    "           tools=\"hover,save,pan,box_zoom,reset,wheel_zoom\", toolbar_location='below',\n",
+    "           tooltips=[ \\\n",
+    "               ('Pickup Borough', '@pickup_borough'), \\\n",
+    "               ('Dropoff Borough', '@dropoff_borough'), \\\n",
+    "               ('Average Fare Amount', '$@mean_fare')])\n",
+    "\n",
+    "p.grid.grid_line_color = None\n",
+    "p.axis.axis_line_color = None\n",
+    "p.axis.major_tick_line_color = None\n",
+    "p.axis.major_label_text_font_size = \"14px\"\n",
+    "p.axis.major_label_standoff = 0\n",
+    "p.xaxis.major_label_orientation = pi / 3\n",
+    "\n",
+    "r = p.rect(x=\"pickup_borough\", y=\"dropoff_borough\", width=1, height=1, source=df_ex4,\n",
+    "           fill_color=linear_cmap(\"mean_fare\", colors, low=df_ex4.mean_fare.min(), high=df_ex4.mean_fare.max()),\n",
+    "           line_color=None)\n",
+    "\n",
+    "p.add_layout(r.construct_color_bar(\n",
+    "    major_label_text_font_size=\"14px\",\n",
+    "    ticker=BasicTicker(desired_num_ticks=len(colors)),\n",
+    "    formatter=PrintfTickFormatter(format=\"$%d\"),\n",
+    "    label_standoff=6,\n",
+    "    border_line_color=None,\n",
+    "    padding=5\n",
+    "), 'right')\n",
+    "\n",
+    "show(p)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "9b4a8445",
+   "metadata": {},
+   "source": [
+    "### Exercise 5\n",
+    "Find the average amount of tolls per hour for trips within the following districts: Manhattan, Bronx, Brooklyn, Queens. Show a graphical representation of the data and report whether there is any trend or peak during the day. Overall, which district has the largest amount of tolls?"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 88,
+   "id": "b80cbb2d",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "[Stage 313:====================================================>(197 + 1) / 200]\r"
+     ]
+    },
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "+--------------+---------------+----+--------------------+\n",
+      "|pickup_borough|dropoff_borough|hour|   mean_tolls_amount|\n",
+      "+--------------+---------------+----+--------------------+\n",
+      "|         Bronx|          Bronx|   0|                 0.0|\n",
+      "|         Bronx|          Bronx|   1|                 0.0|\n",
+      "|         Bronx|          Bronx|   2|                 0.0|\n",
+      "|         Bronx|          Bronx|   3|                 0.0|\n",
+      "|         Bronx|          Bronx|   4|                 0.0|\n",
+      "|         Bronx|          Bronx|   5| 0.14545454545454545|\n",
+      "|         Bronx|          Bronx|   6|                 0.0|\n",
+      "|         Bronx|          Bronx|   7|                 0.0|\n",
+      "|         Bronx|          Bronx|   8|                 0.0|\n",
+      "|         Bronx|          Bronx|   9|                 0.0|\n",
+      "|         Bronx|          Bronx|  10|  1.1388888888888888|\n",
+      "|         Bronx|          Bronx|  11|  0.6857142857142857|\n",
+      "|         Bronx|          Bronx|  12|                 0.0|\n",
+      "|         Bronx|          Bronx|  13|                 0.0|\n",
+      "|         Bronx|          Bronx|  14|                 0.0|\n",
+      "|         Bronx|          Bronx|  15|                 0.0|\n",
+      "|         Bronx|          Bronx|  16|                 0.0|\n",
+      "|         Bronx|          Bronx|  17|                0.96|\n",
+      "|         Bronx|          Bronx|  18|                0.48|\n",
+      "|         Bronx|          Bronx|  19|                 0.0|\n",
+      "|         Bronx|          Bronx|  20|  0.6857142857142857|\n",
+      "|         Bronx|          Bronx|  21|  0.3692307692307692|\n",
+      "|         Bronx|          Bronx|  22|                 0.0|\n",
+      "|         Bronx|          Bronx|  23|                 0.0|\n",
+      "|         Bronx|       Brooklyn|   1|                 4.8|\n",
+      "|         Bronx|       Brooklyn|   2|                 0.0|\n",
+      "|         Bronx|       Brooklyn|   6|                 0.0|\n",
+      "|         Bronx|       Brooklyn|  12|                 0.0|\n",
+      "|         Bronx|       Brooklyn|  17|                 4.8|\n",
+      "|         Bronx|      Manhattan|   0|                 0.0|\n",
+      "|         Bronx|      Manhattan|   1|                 0.0|\n",
+      "|         Bronx|      Manhattan|   2|                 0.0|\n",
+      "|         Bronx|      Manhattan|   3|                 0.0|\n",
+      "|         Bronx|      Manhattan|   4| 0.18333333333333335|\n",
+      "|         Bronx|      Manhattan|   5|                 0.0|\n",
+      "|         Bronx|      Manhattan|   6|                 0.0|\n",
+      "|         Bronx|      Manhattan|   7|                 0.0|\n",
+      "|         Bronx|      Manhattan|   8|                 0.0|\n",
+      "|         Bronx|      Manhattan|   9|                 0.0|\n",
+      "|         Bronx|      Manhattan|  10|                 0.0|\n",
+      "|         Bronx|      Manhattan|  11|                 0.0|\n",
+      "|         Bronx|      Manhattan|  12|                 0.0|\n",
+      "|         Bronx|      Manhattan|  13|                 0.0|\n",
+      "|         Bronx|      Manhattan|  14|                 0.0|\n",
+      "|         Bronx|      Manhattan|  15|                 0.0|\n",
+      "|         Bronx|      Manhattan|  16|                 0.0|\n",
+      "|         Bronx|      Manhattan|  17|                 0.0|\n",
+      "|         Bronx|      Manhattan|  18|                 0.0|\n",
+      "|         Bronx|      Manhattan|  20|                 0.0|\n",
+      "|         Bronx|      Manhattan|  21|                 0.0|\n",
+      "|         Bronx|      Manhattan|  22|                 0.0|\n",
+      "|         Bronx|      Manhattan|  23|                 0.0|\n",
+      "|         Bronx|         Queens|   0|                 4.8|\n",
+      "|         Bronx|         Queens|   1|                 2.4|\n",
+      "|         Bronx|         Queens|   2|                 4.8|\n",
+      "|         Bronx|         Queens|   3|                 4.8|\n",
+      "|         Bronx|         Queens|   5|  3.5999999999999996|\n",
+      "|         Bronx|         Queens|   6|                 2.4|\n",
+      "|         Bronx|         Queens|   7|                 4.8|\n",
+      "|         Bronx|         Queens|  12|                 4.8|\n",
+      "|         Bronx|         Queens|  15|                 4.8|\n",
+      "|      Brooklyn|          Bronx|   0|                1.92|\n",
+      "|      Brooklyn|          Bronx|   1|   2.742857142857143|\n",
+      "|      Brooklyn|          Bronx|   2|  1.3499999999999999|\n",
+      "|      Brooklyn|          Bronx|   3|  1.3833333333333335|\n",
+      "|      Brooklyn|          Bronx|   4|  1.5999999999999999|\n",
+      "|      Brooklyn|          Bronx|   5|                 2.4|\n",
+      "|      Brooklyn|          Bronx|   6|  1.5999999999999999|\n",
+      "|      Brooklyn|          Bronx|   7|                 1.2|\n",
+      "|      Brooklyn|          Bronx|   8|                 0.0|\n",
+      "|      Brooklyn|          Bronx|  10|                 0.0|\n",
+      "|      Brooklyn|          Bronx|  11|                 4.8|\n",
+      "|      Brooklyn|          Bronx|  12|                 2.2|\n",
+      "|      Brooklyn|          Bronx|  18|                 0.0|\n",
+      "|      Brooklyn|          Bronx|  23|                 0.0|\n",
+      "|      Brooklyn|       Brooklyn|   0|                 0.0|\n",
+      "|      Brooklyn|       Brooklyn|   1|0.005357142857142857|\n",
+      "|      Brooklyn|       Brooklyn|   2|                 0.0|\n",
+      "|      Brooklyn|       Brooklyn|   3|0.019872701555869874|\n",
+      "|      Brooklyn|       Brooklyn|   4|0.009352189781021899|\n",
+      "|      Brooklyn|       Brooklyn|   5|                 0.0|\n",
+      "|      Brooklyn|       Brooklyn|   6|                 0.0|\n",
+      "|      Brooklyn|       Brooklyn|   7|                 0.0|\n",
+      "|      Brooklyn|       Brooklyn|   8|                 0.0|\n",
+      "|      Brooklyn|       Brooklyn|   9| 0.11851851851851851|\n",
+      "|      Brooklyn|       Brooklyn|  10|                 0.0|\n",
+      "|      Brooklyn|       Brooklyn|  11|                0.04|\n",
+      "|      Brooklyn|       Brooklyn|  12|                 0.0|\n",
+      "|      Brooklyn|       Brooklyn|  13|                 0.0|\n",
+      "|      Brooklyn|       Brooklyn|  14| 0.02711864406779661|\n",
+      "|      Brooklyn|       Brooklyn|  15|0.028402366863905324|\n",
+      "|      Brooklyn|       Brooklyn|  16|                 0.0|\n",
+      "|      Brooklyn|       Brooklyn|  17| 0.02711864406779661|\n",
+      "|      Brooklyn|       Brooklyn|  18|0.020600858369098713|\n",
+      "|      Brooklyn|       Brooklyn|  19|0.021052631578947368|\n",
+      "|      Brooklyn|       Brooklyn|  20|                 0.0|\n",
+      "|      Brooklyn|       Brooklyn|  21| 0.04324324324324324|\n",
+      "|      Brooklyn|       Brooklyn|  22| 0.05704697986577181|\n",
+      "|      Brooklyn|       Brooklyn|  23|                 0.0|\n",
+      "|      Brooklyn|      Manhattan|   0| 0.04419889502762431|\n",
+      "|      Brooklyn|      Manhattan|   1|  0.0632860040567951|\n",
+      "|      Brooklyn|      Manhattan|   2| 0.05387755102040815|\n",
+      "|      Brooklyn|      Manhattan|   3| 0.07449748743718591|\n",
+      "|      Brooklyn|      Manhattan|   4|0.038554216867469876|\n",
+      "|      Brooklyn|      Manhattan|   5|0.018532818532818532|\n",
+      "|      Brooklyn|      Manhattan|   6| 0.08372093023255812|\n",
+      "|      Brooklyn|      Manhattan|   7|                 0.0|\n",
+      "|      Brooklyn|      Manhattan|   8|                 0.0|\n",
+      "|      Brooklyn|      Manhattan|   9| 0.05581395348837209|\n",
+      "|      Brooklyn|      Manhattan|  10| 0.04403669724770642|\n",
+      "|      Brooklyn|      Manhattan|  11| 0.07218045112781954|\n",
+      "|      Brooklyn|      Manhattan|  12|                 0.0|\n",
+      "|      Brooklyn|      Manhattan|  13| 0.02981366459627329|\n",
+      "|      Brooklyn|      Manhattan|  14| 0.05962732919254658|\n",
+      "|      Brooklyn|      Manhattan|  15|                 0.0|\n",
+      "|      Brooklyn|      Manhattan|  16| 0.11290322580645161|\n",
+      "|      Brooklyn|      Manhattan|  17| 0.12314102564102562|\n",
+      "|      Brooklyn|      Manhattan|  18|                 0.0|\n",
+      "|      Brooklyn|      Manhattan|  19|                 0.0|\n",
+      "|      Brooklyn|      Manhattan|  20|                0.04|\n",
+      "|      Brooklyn|      Manhattan|  21| 0.08495575221238938|\n",
+      "|      Brooklyn|      Manhattan|  22| 0.04033613445378151|\n",
+      "|      Brooklyn|      Manhattan|  23|                 0.0|\n",
+      "|      Brooklyn|         Queens|   0|                 0.0|\n",
+      "|      Brooklyn|         Queens|   1|0.010526315789473684|\n",
+      "|      Brooklyn|         Queens|   2| 0.02513089005235602|\n",
+      "|      Brooklyn|         Queens|   3|0.026666666666666665|\n",
+      "|      Brooklyn|         Queens|   4|0.012413793103448275|\n",
+      "|      Brooklyn|         Queens|   5| 0.12258064516129033|\n",
+      "|      Brooklyn|         Queens|   6| 0.02857142857142857|\n",
+      "|      Brooklyn|         Queens|   7|                 0.0|\n",
+      "|      Brooklyn|         Queens|   8|                 0.0|\n",
+      "|      Brooklyn|         Queens|   9|                 0.0|\n",
+      "|      Brooklyn|         Queens|  10|                 0.0|\n",
+      "|      Brooklyn|         Queens|  11|                 0.0|\n",
+      "|      Brooklyn|         Queens|  12|  0.1846153846153846|\n",
+      "|      Brooklyn|         Queens|  13|                 0.0|\n",
+      "|      Brooklyn|         Queens|  14|                 0.0|\n",
+      "|      Brooklyn|         Queens|  15| 0.11707317073170731|\n",
+      "|      Brooklyn|         Queens|  16|                 0.0|\n",
+      "|      Brooklyn|         Queens|  17|                 0.0|\n",
+      "|      Brooklyn|         Queens|  18|                 0.0|\n",
+      "|      Brooklyn|         Queens|  19|                 0.0|\n",
+      "|      Brooklyn|         Queens|  20|                 0.0|\n",
+      "|      Brooklyn|         Queens|  21|                 0.0|\n",
+      "|      Brooklyn|         Queens|  22|                 0.0|\n",
+      "|      Brooklyn|         Queens|  23|                 0.0|\n",
+      "|     Manhattan|          Bronx|   0|  0.2533333333333334|\n",
+      "|     Manhattan|          Bronx|   1|  0.2715277777777779|\n",
+      "|     Manhattan|          Bronx|   2|  0.2100628930817611|\n",
+      "|     Manhattan|          Bronx|   3|  0.2696428571428573|\n",
+      "|     Manhattan|          Bronx|   4| 0.15384615384615388|\n",
+      "|     Manhattan|          Bronx|   5| 0.05527638190954774|\n",
+      "|     Manhattan|          Bronx|   6| 0.08096590909090909|\n",
+      "|     Manhattan|          Bronx|   7|  0.1333333333333333|\n",
+      "|     Manhattan|          Bronx|   8| 0.18133333333333335|\n",
+      "|     Manhattan|          Bronx|   9|               0.165|\n",
+      "|     Manhattan|          Bronx|  10|  0.3578947368421052|\n",
+      "|     Manhattan|          Bronx|  11|  0.3674418604651163|\n",
+      "|     Manhattan|          Bronx|  12| 0.43902439024390244|\n",
+      "|     Manhattan|          Bronx|  13| 0.22999999999999998|\n",
+      "|     Manhattan|          Bronx|  14|  0.2619047619047619|\n",
+      "|     Manhattan|          Bronx|  15|  0.2490566037735849|\n",
+      "|     Manhattan|          Bronx|  16|  0.5236170212765957|\n",
+      "|     Manhattan|          Bronx|  17| 0.23749999999999996|\n",
+      "|     Manhattan|          Bronx|  18|  0.2925925925925926|\n",
+      "|     Manhattan|          Bronx|  19|  0.1543859649122807|\n",
+      "|     Manhattan|          Bronx|  20| 0.14666666666666667|\n",
+      "|     Manhattan|          Bronx|  21| 0.20909090909090908|\n",
+      "|     Manhattan|          Bronx|  22|                0.29|\n",
+      "|     Manhattan|          Bronx|  23| 0.13609999999999997|\n",
+      "|     Manhattan|       Brooklyn|   0| 0.20921052631578962|\n",
+      "|     Manhattan|       Brooklyn|   1| 0.24647709320695127|\n",
+      "|     Manhattan|       Brooklyn|   2|  0.2537931034482761|\n",
+      "|     Manhattan|       Brooklyn|   3|   0.168358208955224|\n",
+      "|     Manhattan|       Brooklyn|   4| 0.14059939301972688|\n",
+      "|     Manhattan|       Brooklyn|   5| 0.11757188498402552|\n",
+      "|     Manhattan|       Brooklyn|   6|  0.1429467084639498|\n",
+      "|     Manhattan|       Brooklyn|   7| 0.12403433476394847|\n",
+      "|     Manhattan|       Brooklyn|   8|  0.1471264367816092|\n",
+      "|     Manhattan|       Brooklyn|   9| 0.16633663366336635|\n",
+      "|     Manhattan|       Brooklyn|  10| 0.11267605633802817|\n",
+      "|     Manhattan|       Brooklyn|  11| 0.18585657370517925|\n",
+      "|     Manhattan|       Brooklyn|  12| 0.19136212624584714|\n",
+      "|     Manhattan|       Brooklyn|  13| 0.15789473684210523|\n",
+      "|     Manhattan|       Brooklyn|  14|  0.2719999999999999|\n",
+      "|     Manhattan|       Brooklyn|  15|  0.2133333333333333|\n",
+      "|     Manhattan|       Brooklyn|  16|  0.2842105263157894|\n",
+      "|     Manhattan|       Brooklyn|  17|  0.2565139949109414|\n",
+      "|     Manhattan|       Brooklyn|  18| 0.18093126385809308|\n",
+      "|     Manhattan|       Brooklyn|  19|  0.1438972162740899|\n",
+      "|     Manhattan|       Brooklyn|  20| 0.13136842105263155|\n",
+      "|     Manhattan|       Brooklyn|  21|  0.1684405458089668|\n",
+      "|     Manhattan|       Brooklyn|  22| 0.16958041958041953|\n",
+      "|     Manhattan|       Brooklyn|  23| 0.09829351535836177|\n",
+      "|     Manhattan|      Manhattan|   0|0.002124846378776963|\n",
+      "|     Manhattan|      Manhattan|   1|0.003388822829964328|\n",
+      "|     Manhattan|      Manhattan|   2|0.002282543352601...|\n",
+      "|     Manhattan|      Manhattan|   3|6.617317182593092E-4|\n",
+      "|     Manhattan|      Manhattan|   4| 0.00711096245505477|\n",
+      "|     Manhattan|      Manhattan|   5|0.004739558892538714|\n",
+      "|     Manhattan|      Manhattan|   6|0.008770792827824583|\n",
+      "|     Manhattan|      Manhattan|   7| 0.01721972031287035|\n",
+      "|     Manhattan|      Manhattan|   8|0.007416208104052026|\n",
+      "|     Manhattan|      Manhattan|   9|0.008730447435431065|\n",
+      "|     Manhattan|      Manhattan|  10|0.007606766828344964|\n",
+      "|     Manhattan|      Manhattan|  11|0.003766874141136529|\n",
+      "|     Manhattan|      Manhattan|  12|0.002688551972247...|\n",
+      "|     Manhattan|      Manhattan|  13|0.002815919789692486|\n",
+      "|     Manhattan|      Manhattan|  14|0.003850092535471...|\n",
+      "|     Manhattan|      Manhattan|  15|0.008035703139629235|\n",
+      "|     Manhattan|      Manhattan|  16|  0.0056893032117583|\n",
+      "|     Manhattan|      Manhattan|  17|0.009296927493738926|\n",
+      "|     Manhattan|      Manhattan|  18|0.006115517819238...|\n",
+      "|     Manhattan|      Manhattan|  19|0.006486187125358352|\n",
+      "|     Manhattan|      Manhattan|  20|0.008908519239407095|\n",
+      "|     Manhattan|      Manhattan|  21|0.004213675213675213|\n",
+      "|     Manhattan|      Manhattan|  22|0.005885259631490787|\n",
+      "|     Manhattan|      Manhattan|  23|0.008152764067127342|\n",
+      "|     Manhattan|         Queens|   0|  0.8684324324324318|\n",
+      "|     Manhattan|         Queens|   1|  0.8232996323529406|\n",
+      "|     Manhattan|         Queens|   2|  0.8496747967479669|\n",
+      "|     Manhattan|         Queens|   3|   0.920373626373625|\n",
+      "|     Manhattan|         Queens|   4|  0.9509571209800902|\n",
+      "|     Manhattan|         Queens|   5|  1.2870841487279827|\n",
+      "|     Manhattan|         Queens|   6|  1.7025057208237966|\n",
+      "|     Manhattan|         Queens|   7|  2.1997175866495486|\n",
+      "|     Manhattan|         Queens|   8|  2.7828251121076213|\n",
+      "|     Manhattan|         Queens|   9|  2.6930985915492927|\n",
+      "|     Manhattan|         Queens|  10|   2.625207296849084|\n",
+      "|     Manhattan|         Queens|  11|  2.9828428571428574|\n",
+      "|     Manhattan|         Queens|  12|   3.070651685393257|\n",
+      "|     Manhattan|         Queens|  13|   2.920602536997886|\n",
+      "|     Manhattan|         Queens|  14|   3.059551760939169|\n",
+      "|     Manhattan|         Queens|  15|  3.2354977876106217|\n",
+      "|     Manhattan|         Queens|  16|  2.8950213371265985|\n",
+      "|     Manhattan|         Queens|  17|  2.6199999999999966|\n",
+      "|     Manhattan|         Queens|  18|   2.130339321357284|\n",
+      "|     Manhattan|         Queens|  19|  1.8387186629526464|\n",
+      "|     Manhattan|         Queens|  20|  1.0089171974522302|\n",
+      "|     Manhattan|         Queens|  21|  0.8297852760736203|\n",
+      "|     Manhattan|         Queens|  22|  0.6545454545454548|\n",
+      "|     Manhattan|         Queens|  23|  0.5005434782608698|\n",
+      "|        Queens|          Bronx|   0|   4.547368421052631|\n",
+      "|        Queens|          Bronx|   1|  2.9999999999999996|\n",
+      "|        Queens|          Bronx|   2|   2.742857142857143|\n",
+      "|        Queens|          Bronx|   3|  2.8799999999999994|\n",
+      "|        Queens|          Bronx|   4|  3.1999999999999997|\n",
+      "|        Queens|          Bronx|   5|  3.2842105263157886|\n",
+      "|        Queens|          Bronx|   6|  3.1999999999999997|\n",
+      "|        Queens|          Bronx|   7|   3.519999999999999|\n",
+      "|        Queens|          Bronx|   8|   3.756521739130434|\n",
+      "|        Queens|          Bronx|   9|   4.799999999999999|\n",
+      "|        Queens|          Bronx|  10|    4.26611111111111|\n",
+      "|        Queens|          Bronx|  11|   3.899999999999999|\n",
+      "|        Queens|          Bronx|  12|                 4.8|\n",
+      "|        Queens|          Bronx|  13|   4.499999999999999|\n",
+      "|        Queens|          Bronx|  14|   4.718749999999999|\n",
+      "|        Queens|          Bronx|  15|   4.669999999999999|\n",
+      "|        Queens|          Bronx|  16|   4.114285714285713|\n",
+      "|        Queens|          Bronx|  17|   4.799999999999998|\n",
+      "|        Queens|          Bronx|  18|    4.44090909090909|\n",
+      "|        Queens|          Bronx|  19|   4.235294117647058|\n",
+      "|        Queens|          Bronx|  20|   4.457142857142856|\n",
+      "|        Queens|          Bronx|  21|   4.199999999999999|\n",
+      "|        Queens|          Bronx|  22|   4.477272727272726|\n",
+      "|        Queens|          Bronx|  23|   4.319999999999999|\n",
+      "|        Queens|       Brooklyn|   0|                 0.0|\n",
+      "|        Queens|       Brooklyn|   1|                 0.0|\n",
+      "|        Queens|       Brooklyn|   2|                 0.0|\n",
+      "|        Queens|       Brooklyn|   3|                 0.0|\n",
+      "|        Queens|       Brooklyn|   4|                 0.0|\n",
+      "|        Queens|       Brooklyn|   5|                 0.0|\n",
+      "|        Queens|       Brooklyn|   6|                 0.0|\n",
+      "|        Queens|       Brooklyn|   7|                 0.0|\n",
+      "|        Queens|       Brooklyn|   8|                 0.0|\n",
+      "|        Queens|       Brooklyn|   9|                 0.0|\n",
+      "|        Queens|       Brooklyn|  10|                 0.0|\n",
+      "|        Queens|       Brooklyn|  11|                 0.0|\n",
+      "|        Queens|       Brooklyn|  12|                 0.0|\n",
+      "|        Queens|       Brooklyn|  13|                 0.0|\n",
+      "|        Queens|       Brooklyn|  14|                 0.0|\n",
+      "|        Queens|       Brooklyn|  15|                 0.0|\n",
+      "|        Queens|       Brooklyn|  16|                 0.0|\n",
+      "|        Queens|       Brooklyn|  17|                 0.0|\n",
+      "|        Queens|       Brooklyn|  18| 0.01846153846153846|\n",
+      "|        Queens|       Brooklyn|  19|                 0.0|\n",
+      "|        Queens|       Brooklyn|  20|                 0.0|\n",
+      "|        Queens|       Brooklyn|  21|                 0.0|\n",
+      "|        Queens|       Brooklyn|  22|                 0.0|\n",
+      "|        Queens|       Brooklyn|  23|0.019433198380566803|\n",
+      "|        Queens|      Manhattan|   0|  1.9786259541984754|\n",
+      "|        Queens|      Manhattan|   1|  0.9882352941176481|\n",
+      "|        Queens|      Manhattan|   2|  0.6832740213523135|\n",
+      "|        Queens|      Manhattan|   3|   0.672689075630252|\n",
+      "|        Queens|      Manhattan|   4|  0.8727272727272726|\n",
+      "|        Queens|      Manhattan|   5|   2.020737327188942|\n",
+      "|        Queens|      Manhattan|   6|   1.513492063492065|\n",
+      "|        Queens|      Manhattan|   7|  2.2232824427480935|\n",
+      "|        Queens|      Manhattan|   8|  2.3165217391304362|\n",
+      "|        Queens|      Manhattan|   9|  2.2579770992366432|\n",
+      "|        Queens|      Manhattan|  10|   2.782300884955749|\n",
+      "|        Queens|      Manhattan|  11|   3.039658848614068|\n",
+      "|        Queens|      Manhattan|  12|   3.084337349397588|\n",
+      "|        Queens|      Manhattan|  13|   3.301075268817201|\n",
+      "|        Queens|      Manhattan|  14|   3.456075808249721|\n",
+      "|        Queens|      Manhattan|  15|  3.4173983739837372|\n",
+      "|        Queens|      Manhattan|  16|  3.3323693803159182|\n",
+      "|        Queens|      Manhattan|  17|   3.358361774744028|\n",
+      "|        Queens|      Manhattan|  18|  3.2230088495575226|\n",
+      "|        Queens|      Manhattan|  19|  3.1127427184466017|\n",
+      "|        Queens|      Manhattan|  20|  3.1380410022779053|\n",
+      "|        Queens|      Manhattan|  21|    3.19478935698448|\n",
+      "|        Queens|      Manhattan|  22|  3.0503001200480195|\n",
+      "|        Queens|      Manhattan|  23|   2.954719764011798|\n",
+      "|        Queens|         Queens|   0| 0.03692307692307692|\n",
+      "|        Queens|         Queens|   1|0.010015174506828527|\n",
+      "|        Queens|         Queens|   2|0.012598425196850394|\n",
+      "|        Queens|         Queens|   3|0.005755395683453...|\n",
+      "|        Queens|         Queens|   4| 0.07384937238493722|\n",
+      "|        Queens|         Queens|   5| 0.04725274725274725|\n",
+      "|        Queens|         Queens|   6| 0.08010471204188482|\n",
+      "|        Queens|         Queens|   7|               0.096|\n",
+      "|        Queens|         Queens|   8| 0.07384615384615384|\n",
+      "|        Queens|         Queens|   9| 0.13531746031746034|\n",
+      "|        Queens|         Queens|  10| 0.14015444015444017|\n",
+      "|        Queens|         Queens|  11| 0.16809338521400777|\n",
+      "|        Queens|         Queens|  12| 0.09125475285171103|\n",
+      "|        Queens|         Queens|  13| 0.12818991097922847|\n",
+      "|        Queens|         Queens|  14| 0.15837563451776646|\n",
+      "|        Queens|         Queens|  15| 0.16179775280898873|\n",
+      "|        Queens|         Queens|  16|  0.3113513513513512|\n",
+      "|        Queens|         Queens|  17| 0.21036585365853655|\n",
+      "|        Queens|         Queens|  18| 0.15960451977401127|\n",
+      "|        Queens|         Queens|  19| 0.20064308681672022|\n",
+      "|        Queens|         Queens|  20| 0.12923076923076923|\n",
+      "|        Queens|         Queens|  21| 0.10892307692307691|\n",
+      "|        Queens|         Queens|  22|0.061224489795918366|\n",
+      "|        Queens|         Queens|  23| 0.07164179104477612|\n",
+      "+--------------+---------------+----+--------------------+\n",
+      "\n"
+     ]
+    },
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "                                                                                \r"
+     ]
+    }
+   ],
+   "source": [
+    "boroughs_ex5 = [\"Manhattan\", \"Bronx\", \"Brooklyn\", \"Queens\"]\n",
+    "\n",
+    "df_ex5 = df_with_bor \\\n",
+    "    .where((isin(df_with_bor.pickup_borough, boroughs_ex5)) & (isin(df_with_bor.dropoff_borough, boroughs_ex5))) \\\n",
+    "    .withColumn(\"hour\", F.hour(F.from_utc_timestamp(F.col(\"pickup_datetime\"), 'UTC'))) \\\n",
+    "    .groupBy(\"pickup_borough\", \"dropoff_borough\", \"hour\") \\\n",
+    "    .agg(F.mean(F.col('tolls_amount')).alias('mean_tolls_amount')) \\\n",
+    "    .select(F.col('pickup_borough'), F.col('dropoff_borough'), F.col('hour'), F.col('mean_tolls_amount')) \\\n",
+    "    .orderBy(\"pickup_borough\", \"dropoff_borough\", \"hour\") \n",
+    "\n",
+    "df_ex5.show(len(boroughs_ex5) ** 2 * 24)  # at most 4 x 4 borough pairs x 24 hours"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "884b4cf9",
+   "metadata": {},
+   "source": [
+    "### Exercise 6\n",
+    "Create a dataframe that for each district shows the shortest and longest `trip_distance` starting and ending in the same district. What is the length of the longest and shortest trips in Manhattan?"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "0aa8d795",
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  },
+  {
+   "cell_type": "markdown",
+   "id": "756da7e4",
+   "metadata": {},
+   "source": [
+    "### Exercise 7\n",
+    "Consider only the trips _within_ districts. What are the first and second-most expensive\n",
+    "trips - based on `total_amount` - in every district?"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "ca83556d",
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  },
+  {
+   "cell_type": "markdown",
+   "id": "4f1e0800",
+   "metadata": {},
+   "source": [
+    "### Exercise 8\n",
+    "Create a dataframe where each row represents a driver, and there is one column per district.\n",
+    "For each driver-district, the dataframe provides the maximum number of consecutive trips\n",
+    "for the given driver, within the given district. \n",
+    "\n",
+    "For example, if for driver A we have (sorted by time):\n",
+    "- Trip 1: Bronx → Bronx\n",
+    "- Trip 2: Bronx → Bronx\n",
+    "- Trip 3: Bronx → Manhattan\n",
+    "- Trip 4: Manhattan → Bronx.\n",
+    "    \n",
+    "The maximum number of consecutive trips for Bronx is 2."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "edde38bb",
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.11.3"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
diff --git a/Assignment3/requirements.txt b/Assignment3/requirements.txt
new file mode 100644
index 0000000..1dd3cea
--- /dev/null
+++ b/Assignment3/requirements.txt
@@ -0,0 +1,4 @@
+jupyterlab==4.0.1
+pyspark==3.4.0
+shapely==2.0.1
+bokeh==3.1.1
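
Note on Exercise 8 (still empty in this patch): the worked example in the exercise text (driver A, four trips, expected maximum of 2 for Bronx) can be checked with a plain-Python sketch before writing the Spark version. This is not the notebook's solution and does not use PySpark; the function name `max_consecutive_trips` and the `(pickup_borough, dropoff_borough)` tuple layout are assumptions made for illustration only. It assumes one driver's trips are already sorted by time and that a trip crossing boroughs breaks any run.

```python
from itertools import groupby


def max_consecutive_trips(trips):
    """Return {borough: longest run of consecutive trips inside that borough}.

    `trips` holds one driver's trips, already sorted by pickup time, as
    (pickup_borough, dropoff_borough) pairs (hypothetical layout).
    """
    # Label each trip with its borough if it starts and ends there, else None.
    labels = [pickup if pickup == dropoff else None for pickup, dropoff in trips]

    best = {}
    # groupby yields one group per run of identical consecutive labels.
    for borough, run in groupby(labels):
        if borough is None:
            continue  # cross-borough trips only serve to break runs
        length = sum(1 for _ in run)
        best[borough] = max(best.get(borough, 0), length)
    return best


# The example from the exercise text, for driver A.
driver_a = [
    ("Bronx", "Bronx"),
    ("Bronx", "Bronx"),
    ("Bronx", "Manhattan"),
    ("Manhattan", "Bronx"),
]
print(max_consecutive_trips(driver_a))  # {'Bronx': 2}
```

A Spark version would likely need a window partitioned by driver and ordered by pickup time to detect where each run starts, which is where the window functions mentioned in the assignment header come in.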