2023-05-30 15:52:00 +00:00
{
"cells": [
{
"cell_type": "markdown",
"id": "23b48f71",
"metadata": {},
"source": [
"# S&DE Atelier - Visual Analytics\n",
"\n",
"# Assignment 3\n",
"\n",
"**Due** June 2, 2023 @23:55\n",
"\n",
"**Contacts**: [marco.dambros@usi.ch](mailto:marco.dambros@usi.ch) - [carmen.armenti@usi.ch](mailto:carmen.armenti@usi.ch)\n",
"\n",
"---\n",
"\n",
"The goal of this assignment is to use Spark in Jupyter notebooks (PySpark). The files `trip_data.csv`, `trip_fare.csv` and `nyc_boroughs.geojson` can be found in the following folder: [Assignment3-data](https://usi365-my.sharepoint.com/:f:/g/personal/armenc_usi_ch/Ejp7sb8QAMROoWe0XUDcAkMBoqUFk-w2Vgroup025NhAww?e=TFG5CD). You should clean the data if needed. \n",
"\n",
"Note that you can use Spark [window functions](https://spark.apache.org/docs/latest/sql-ref-syntax-qry-select-window.html) whenever applicable. \n",
"\n",
"Please name your file as `SurnameName_Assignment3.ipynb`."
]
},
{
"cell_type": "code",
"execution_count": 28,
"id": "9f434eb8",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<style>\n",
" .bk-notebook-logo {\n",
" display: block;\n",
" width: 20px;\n",
" height: 20px;\n",
" background-image: url(data:image/png;base64,iVBORw0KGgoAAAANSUhEUgAAABQAAAAUCAYAAACNiR0NAAAABHNCSVQICAgIfAhkiAAAAAlwSFlzAAALEgAACxIB0t1+/AAAABx0RVh0U29mdHdhcmUAQWRvYmUgRmlyZXdvcmtzIENTNui8sowAAAOkSURBVDiNjZRtaJVlGMd/1/08zzln5zjP1LWcU9N0NkN8m2CYjpgQYQXqSs0I84OLIC0hkEKoPtiH3gmKoiJDU7QpLgoLjLIQCpEsNJ1vqUOdO7ppbuec5+V+rj4ctwzd8IIbbi6u+8f1539dt3A78eXC7QizUF7gyV1fD1Yqg4JWz84yffhm0qkFqBogB9rM8tZdtwVsPUhWhGcFJngGeWrPzHm5oaMmkfEg1usvLFyc8jLRqDOMru7AyC8saQr7GG7f5fvDeH7Ej8CM66nIF+8yngt6HWaKh7k49Soy9nXurCi1o3qUbS3zWfrYeQDTB/Qj6kX6Ybhw4B+bOYoLKCC9H3Nu/leUTZ1JdRWkkn2ldcCamzrcf47KKXdAJllSlxAOkRgyHsGC/zRday5Qld9DyoM4/q/rUoy/CXh3jzOu3bHUVZeU+DEn8FInkPBFlu3+nW3Nw0mk6vCDiWg8CeJaxEwuHS3+z5RgY+YBR6V1Z1nxSOfoaPa4LASWxxdNp+VWTk7+4vzaou8v8PN+xo+KY2xsw6une2frhw05CTYOmQvsEhjhWjn0bmXPjpE1+kplmmkP3suftwTubK9Vq22qKmrBhpY4jvd5afdRA3wGjFAgcnTK2s4hY0/GPNIb0nErGMCRxWOOX64Z8RAC4oCXdklmEvcL8o0BfkNK4lUg9HTl+oPlQxdNo3Mg4Nv175e/1LDGzZen30MEjRUtmXSfiTVu1kK8W4txyV6BMKlbgk3lMwYCiusNy9fVfvvwMxv8Ynl6vxoByANLTWplvuj/nF9m2+PDtt1eiHPBr1oIfhCChQMBw6Aw0UulqTKZdfVvfG7VcfIqLG9bcldL/+pdWTLxLUy8Qq38heUIjh4XlzZxzQm19lLFlr8vdQ97rjZVOLf8nclzckbcD4wxXMidpX30sFd37Fv/GtwwhzhxGVAprjbg0gCAEeIgwCZyTV2Z1REEW8O4py0wsjeloKoMr6iCY6dP92H6Vw/oTyICIthibxjm/DfN9lVz8IqtqKYLUXfoKVMVQVVJOElGjrnnUt9T9wbgp8AyYKaGlqingHZU/uG2NTZSVqwHQTWkx9hxjkpWDaCg6Ckj5qebgBVbT3V3NNXMSiWSDdGV3hrtzla7J+duwPOToIg42ChPQOQjspnSlp1V+Gjdged7+8UN5CRAV7a5EdFNwCjEaBR27b3W890TE7g24NAP/mMDXRWrGoFPQI9ls/MWO2dWFAar/xcOIImbbpA3zgAAAABJRU5ErkJggg==);\n",
" }\n",
" </style>\n",
" <div>\n",
" <a href=\"https://bokeh.org\" target=\"_blank\" class=\"bk-notebook-logo\"></a>\n",
" <span id=\"db708107-1acc-4d1f-9761-d8a8b49e0317\">Loading BokehJS ...</span>\n",
" </div>\n"
]
},
"metadata": {},
"output_type": "display_data"
},
{
"data": {
"application/javascript": [
"(function(root) {\n",
" function now() {\n",
" return new Date();\n",
" }\n",
"\n",
" const force = true;\n",
"\n",
" if (typeof root._bokeh_onload_callbacks === \"undefined\" || force === true) {\n",
" root._bokeh_onload_callbacks = [];\n",
" root._bokeh_is_loading = undefined;\n",
" }\n",
"\n",
"const JS_MIME_TYPE = 'application/javascript';\n",
" const HTML_MIME_TYPE = 'text/html';\n",
" const EXEC_MIME_TYPE = 'application/vnd.bokehjs_exec.v0+json';\n",
" const CLASS_NAME = 'output_bokeh rendered_html';\n",
"\n",
" /**\n",
" * Render data to the DOM node\n",
" */\n",
" function render(props, node) {\n",
" const script = document.createElement(\"script\");\n",
" node.appendChild(script);\n",
" }\n",
"\n",
" /**\n",
" * Handle when an output is cleared or removed\n",
" */\n",
" function handleClearOutput(event, handle) {\n",
" const cell = handle.cell;\n",
"\n",
" const id = cell.output_area._bokeh_element_id;\n",
" const server_id = cell.output_area._bokeh_server_id;\n",
" // Clean up Bokeh references\n",
" if (id != null && id in Bokeh.index) {\n",
" Bokeh.index[id].model.document.clear();\n",
" delete Bokeh.index[id];\n",
" }\n",
"\n",
" if (server_id !== undefined) {\n",
" // Clean up Bokeh references\n",
" const cmd_clean = \"from bokeh.io.state import curstate; print(curstate().uuid_to_server['\" + server_id + \"'].get_sessions()[0].document.roots[0]._id)\";\n",
" cell.notebook.kernel.execute(cmd_clean, {\n",
" iopub: {\n",
" output: function(msg) {\n",
" const id = msg.content.text.trim();\n",
" if (id in Bokeh.index) {\n",
" Bokeh.index[id].model.document.clear();\n",
" delete Bokeh.index[id];\n",
" }\n",
" }\n",
" }\n",
" });\n",
" // Destroy server and session\n",
" const cmd_destroy = \"import bokeh.io.notebook as ion; ion.destroy_server('\" + server_id + \"')\";\n",
" cell.notebook.kernel.execute(cmd_destroy);\n",
" }\n",
" }\n",
"\n",
" /**\n",
" * Handle when a new output is added\n",
" */\n",
" function handleAddOutput(event, handle) {\n",
" const output_area = handle.output_area;\n",
" const output = handle.output;\n",
"\n",
" // limit handleAddOutput to display_data with EXEC_MIME_TYPE content only\n",
" if ((output.output_type != \"display_data\") || (!Object.prototype.hasOwnProperty.call(output.data, EXEC_MIME_TYPE))) {\n",
" return\n",
" }\n",
"\n",
" const toinsert = output_area.element.find(\".\" + CLASS_NAME.split(' ')[0]);\n",
"\n",
" if (output.metadata[EXEC_MIME_TYPE][\"id\"] !== undefined) {\n",
" toinsert[toinsert.length - 1].firstChild.textContent = output.data[JS_MIME_TYPE];\n",
" // store reference to embed id on output_area\n",
" output_area._bokeh_element_id = output.metadata[EXEC_MIME_TYPE][\"id\"];\n",
" }\n",
" if (output.metadata[EXEC_MIME_TYPE][\"server_id\"] !== undefined) {\n",
" const bk_div = document.createElement(\"div\");\n",
" bk_div.innerHTML = output.data[HTML_MIME_TYPE];\n",
" const script_attrs = bk_div.children[0].attributes;\n",
" for (let i = 0; i < script_attrs.length; i++) {\n",
" toinsert[toinsert.length - 1].firstChild.setAttribute(script_attrs[i].name, script_attrs[i].value);\n",
" toinsert[toinsert.length - 1].firstChild.textContent = bk_div.children[0].textContent\n",
" }\n",
" // store reference to server id on output_area\n",
" output_area._bokeh_server_id = output.metadata[EXEC_MIME_TYPE][\"server_id\"];\n",
" }\n",
" }\n",
"\n",
" function register_renderer(events, OutputArea) {\n",
"\n",
" function append_mime(data, metadata, element) {\n",
" // create a DOM node to render to\n",
" const toinsert = this.create_output_subarea(\n",
" metadata,\n",
" CLASS_NAME,\n",
" EXEC_MIME_TYPE\n",
" );\n",
" this.keyboard_manager.register_events(toinsert);\n",
" // Render to node\n",
" const props = {data: data, metadata: metadata[EXEC_MIME_TYPE]};\n",
" render(props, toinsert[toinsert.length - 1]);\n",
" element.append(toinsert);\n",
" return toinsert\n",
" }\n",
"\n",
" /* Handle when an output is cleared or removed */\n",
" events.on('clear_output.CodeCell', handleClearOutput);\n",
" events.on('delete.Cell', handleClearOutput);\n",
"\n",
" /* Handle when a new output is added */\n",
" events.on('output_added.OutputArea', handleAddOutput);\n",
"\n",
" /**\n",
" * Register the mime type and append_mime function with output_area\n",
" */\n",
" OutputArea.prototype.register_mime_type(EXEC_MIME_TYPE, append_mime, {\n",
" /* Is output safe? */\n",
" safe: true,\n",
" /* Index of renderer in `output_area.display_order` */\n",
" index: 0\n",
" });\n",
" }\n",
"\n",
" // register the mime type if in Jupyter Notebook environment and previously unregistered\n",
" if (root.Jupyter !== undefined) {\n",
" const events = require('base/js/events');\n",
" const OutputArea = require('notebook/js/outputarea').OutputArea;\n",
"\n",
" if (OutputArea.prototype.mime_types().indexOf(EXEC_MIME_TYPE) == -1) {\n",
" register_renderer(events, OutputArea);\n",
" }\n",
" }\n",
" if (typeof (root._bokeh_timeout) === \"undefined\" || force === true) {\n",
" root._bokeh_timeout = Date.now() + 5000;\n",
" root._bokeh_failed_load = false;\n",
" }\n",
"\n",
" const NB_LOAD_WARNING = {'data': {'text/html':\n",
" \"<div style='background-color: #fdd'>\\n\"+\n",
" \"<p>\\n\"+\n",
" \"BokehJS does not appear to have successfully loaded. If loading BokehJS from CDN, this \\n\"+\n",
" \"may be due to a slow or bad network connection. Possible fixes:\\n\"+\n",
" \"</p>\\n\"+\n",
" \"<ul>\\n\"+\n",
" \"<li>re-rerun `output_notebook()` to attempt to load from CDN again, or</li>\\n\"+\n",
" \"<li>use INLINE resources instead, as so:</li>\\n\"+\n",
" \"</ul>\\n\"+\n",
" \"<code>\\n\"+\n",
" \"from bokeh.resources import INLINE\\n\"+\n",
" \"output_notebook(resources=INLINE)\\n\"+\n",
" \"</code>\\n\"+\n",
" \"</div>\"}};\n",
"\n",
" function display_loaded() {\n",
" const el = document.getElementById(\"db708107-1acc-4d1f-9761-d8a8b49e0317\");\n",
" if (el != null) {\n",
" el.textContent = \"BokehJS is loading...\";\n",
" }\n",
" if (root.Bokeh !== undefined) {\n",
" if (el != null) {\n",
" el.textContent = \"BokehJS \" + root.Bokeh.version + \" successfully loaded.\";\n",
" }\n",
" } else if (Date.now() < root._bokeh_timeout) {\n",
" setTimeout(display_loaded, 100)\n",
" }\n",
" }\n",
"\n",
" function run_callbacks() {\n",
" try {\n",
" root._bokeh_onload_callbacks.forEach(function(callback) {\n",
" if (callback != null)\n",
" callback();\n",
" });\n",
" } finally {\n",
" delete root._bokeh_onload_callbacks\n",
" }\n",
" console.debug(\"Bokeh: all callbacks have finished\");\n",
" }\n",
"\n",
" function load_libs(css_urls, js_urls, callback) {\n",
" if (css_urls == null) css_urls = [];\n",
" if (js_urls == null) js_urls = [];\n",
"\n",
" root._bokeh_onload_callbacks.push(callback);\n",
" if (root._bokeh_is_loading > 0) {\n",
" console.debug(\"Bokeh: BokehJS is being loaded, scheduling callback at\", now());\n",
" return null;\n",
" }\n",
" if (js_urls == null || js_urls.length === 0) {\n",
" run_callbacks();\n",
" return null;\n",
" }\n",
" console.debug(\"Bokeh: BokehJS not loaded, scheduling load and callback at\", now());\n",
" root._bokeh_is_loading = css_urls.length + js_urls.length;\n",
"\n",
" function on_load() {\n",
" root._bokeh_is_loading--;\n",
" if (root._bokeh_is_loading === 0) {\n",
" console.debug(\"Bokeh: all BokehJS libraries/stylesheets loaded\");\n",
" run_callbacks()\n",
" }\n",
" }\n",
"\n",
" function on_error(url) {\n",
" console.error(\"failed to load \" + url);\n",
" }\n",
"\n",
" for (let i = 0; i < css_urls.length; i++) {\n",
" const url = css_urls[i];\n",
" const element = document.createElement(\"link\");\n",
" element.onload = on_load;\n",
" element.onerror = on_error.bind(null, url);\n",
" element.rel = \"stylesheet\";\n",
" element.type = \"text/css\";\n",
" element.href = url;\n",
" console.debug(\"Bokeh: injecting link tag for BokehJS stylesheet: \", url);\n",
" document.body.appendChild(element);\n",
" }\n",
"\n",
" for (let i = 0; i < js_urls.length; i++) {\n",
" const url = js_urls[i];\n",
" const element = document.createElement('script');\n",
" element.onload = on_load;\n",
" element.onerror = on_error.bind(null, url);\n",
" element.async = false;\n",
" element.src = url;\n",
" console.debug(\"Bokeh: injecting script tag for BokehJS library: \", url);\n",
" document.head.appendChild(element);\n",
" }\n",
" };\n",
"\n",
" function inject_raw_css(css) {\n",
" const element = document.createElement(\"style\");\n",
" element.appendChild(document.createTextNode(css));\n",
" document.body.appendChild(element);\n",
" }\n",
"\n",
" const js_urls = [\"https://cdn.bokeh.org/bokeh/release/bokeh-3.1.1.min.js\", \"https://cdn.bokeh.org/bokeh/release/bokeh-gl-3.1.1.min.js\", \"https://cdn.bokeh.org/bokeh/release/bokeh-widgets-3.1.1.min.js\", \"https://cdn.bokeh.org/bokeh/release/bokeh-tables-3.1.1.min.js\", \"https://cdn.bokeh.org/bokeh/release/bokeh-mathjax-3.1.1.min.js\"];\n",
" const css_urls = [];\n",
"\n",
" const inline_js = [ function(Bokeh) {\n",
" Bokeh.set_log_level(\"info\");\n",
" },\n",
"function(Bokeh) {\n",
" }\n",
" ];\n",
"\n",
" function run_inline_js() {\n",
" if (root.Bokeh !== undefined || force === true) {\n",
" for (let i = 0; i < inline_js.length; i++) {\n",
" inline_js[i].call(root, root.Bokeh);\n",
" }\n",
"if (force === true) {\n",
" display_loaded();\n",
" }} else if (Date.now() < root._bokeh_timeout) {\n",
" setTimeout(run_inline_js, 100);\n",
" } else if (!root._bokeh_failed_load) {\n",
" console.log(\"Bokeh: BokehJS failed to load within specified timeout.\");\n",
" root._bokeh_failed_load = true;\n",
" } else if (force !== true) {\n",
" const cell = $(document.getElementById(\"db708107-1acc-4d1f-9761-d8a8b49e0317\")).parents('.cell').data().cell;\n",
" cell.output_area.append_execute_result(NB_LOAD_WARNING)\n",
" }\n",
" }\n",
"\n",
" if (root._bokeh_is_loading === 0) {\n",
" console.debug(\"Bokeh: BokehJS loaded, going straight to plotting\");\n",
" run_inline_js();\n",
" } else {\n",
" load_libs(css_urls, js_urls, function() {\n",
" console.debug(\"Bokeh: BokehJS plotting callback run at\", now());\n",
" run_inline_js();\n",
" });\n",
" }\n",
"}(window));"
],
"application/vnd.bokehjs_load.v0+json": "(function(root) {\n function now() {\n return new Date();\n }\n\n const force = true;\n\n if (typeof root._bokeh_onload_callbacks === \"undefined\" || force === true) {\n root._bokeh_onload_callbacks = [];\n root._bokeh_is_loading = undefined;\n }\n\n\n if (typeof (root._bokeh_timeout) === \"undefined\" || force === true) {\n root._bokeh_timeout = Date.now() + 5000;\n root._bokeh_failed_load = false;\n }\n\n const NB_LOAD_WARNING = {'data': {'text/html':\n \"<div style='background-color: #fdd'>\\n\"+\n \"<p>\\n\"+\n \"BokehJS does not appear to have successfully loaded. If loading BokehJS from CDN, this \\n\"+\n \"may be due to a slow or bad network connection. Possible fixes:\\n\"+\n \"</p>\\n\"+\n \"<ul>\\n\"+\n \"<li>re-rerun `output_notebook()` to attempt to load from CDN again, or</li>\\n\"+\n \"<li>use INLINE resources instead, as so:</li>\\n\"+\n \"</ul>\\n\"+\n \"<code>\\n\"+\n \"from bokeh.resources import INLINE\\n\"+\n \"output_notebook(resources=INLINE)\\n\"+\n \"</code>\\n\"+\n \"</div>\"}};\n\n function display_loaded() {\n const el = document.getElementById(\"db708107-1acc-4d1f-9761-d8a8b49e0317\");\n if (el != null) {\n el.textContent = \"BokehJS is loading...\";\n }\n if (root.Bokeh !== undefined) {\n if (el != null) {\n el.textContent = \"BokehJS \" + root.Bokeh.version + \" successfully loaded.\";\n }\n } else if (Date.now() < root._bokeh_timeout) {\n setTimeout(display_loaded, 100)\n }\n }\n\n function run_callbacks() {\n try {\n root._bokeh_onload_callbacks.forEach(function(callback) {\n if (callback != null)\n callback();\n });\n } finally {\n delete root._bokeh_onload_callbacks\n }\n console.debug(\"Bokeh: all callbacks have finished\");\n }\n\n function load_libs(css_urls, js_urls, callback) {\n if (css_urls == null) css_urls = [];\n if (js_urls == null) js_urls = [];\n\n root._bokeh_onload_callbacks.push(callback);\n if (root._bokeh_is_loading > 0) {\n console.debug(\"Bokeh: BokehJS is being loaded, 
scheduling callback at\", now());\n return null;\n }\n if (js_urls == null || js_urls.length === 0) {\n run_callbacks();\n return null;\n }\n console.debug(\"Bokeh: BokehJS not loaded, scheduling load and callback at\", now());\n root._bokeh_is_loading = css_urls.length + js_urls.length;\n\n function on_load() {\n root._bokeh_is_loading--;\n if (root._bokeh_is_loading === 0) {\n console.debug(\"Bokeh: all BokehJS libraries/stylesheets loaded\");\n run_callbacks()\n }\n }\n\n function on_error(url) {\n console.error(\"failed to load \" + url);\n }\n\n for (let i = 0; i < css_urls.length; i++) {\n const url = css_urls[i];\n const element = document.createElement(\"link\");\n element.onload = on_load;\n element.onerror = on_error.bind(null, url);\n element.rel = \"stylesheet\";\n element.type = \"text/css\";\n element.href = url;\n console.debug(\"Bokeh: injecting link tag for BokehJS stylesheet: \", url);\n document.body.appendChild(element);\n }\n\n for (let i = 0; i < js_urls.length; i++) {\n const url = js_urls[i];\n const element = document.createElement('script');\n element.onload = on_load;\n element.onerror = on_error.bind(null, url);\n element.async = false;\n element.src = url;\n console.debug(\"Bokeh: injecting script tag for BokehJS library: \", url);\n document.head.appendChild(element);\n }\n };\n\n function inject_raw_css(css) {\n const element = document.createElement(\"style\");\n element.appendChild(document.createTextNode(css));\n document.body.appendChild(element);\n }\n\n const js_urls = [\"https://cdn.bokeh.org/bokeh/release/bokeh-3.1.1.min.js\", \"https://cdn.bokeh.org/bokeh/release/bokeh-gl-3.1.1.min.js\", \"https://cdn.bokeh.org/bokeh/release/bokeh-widgets-3.1.1.mi
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"# Import the basic spark library\n",
"from pyspark.sql import SparkSession\n",
"from pyspark.sql.functions import col\n",
"from math import pi\n",
"from bokeh.models import BasicTicker, PrintfTickFormatter\n",
"from bokeh.plotting import figure, show\n",
"from bokeh.transform import linear_cmap\n",
"from pyspark.sql import types as T\n",
"from pyspark.sql import functions as F\n",
"from pyspark.sql import Window\n",
"from shapely.geometry import Polygon, Point\n",
"from typing import Tuple, List\n",
"import pandas as pd\n",
"import seaborn as sns\n",
"import matplotlib.pyplot as plt\n",
"import matplotlib as mpl\n",
"from bokeh.io import output_notebook\n",
"import sys\n",
"\n",
"output_notebook()\n",
"\n",
"# required libraries and versions, uncomment to install\n",
"#!{sys.executable} -m pip install jupyterlab==4.0.1 pyspark==3.4.0 shapely==2.0.1 bokeh==3.1.1 seaborn==0.12.2 shrek==0.0.2"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "b9a87a5c",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Setting default log level to \"WARN\".\n",
"To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).\n",
"23/05/31 17:51:15 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable\n"
]
}
],
"source": [
"# Create an entry point to the PySpark Application\n",
"spark = SparkSession.builder \\\n",
" .config(\"spark.driver.bindAddress\", \"127.0.0.1\") \\\n",
" .config(\"spark.driver.memory\", \"16g\") \\\n",
" .config(\"spark.executor.memory\", \"16g\") \\\n",
" .config(\"spark.executor.cores\", \"4\") \\\n",
" .config(\"spark.executor.memory\", \"16g\") \\\n",
" .master(\"local\") \\\n",
" .appName(\"MaggioniClaudio_Assignment3\") \\\n",
" .getOrCreate()"
]
},
{
"cell_type": "markdown",
"id": "536a6cc4",
"metadata": {},
"source": [
"### Exercise 1\n",
"Join the `trip_data` and `trip_fare` dataframes into one and consider only data on 2013-01-01."
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "9fc094c8",
"metadata": {},
"outputs": [],
"source": [
"def sanitize_column_names(df):\n",
" for original, renamed in [(x, x.strip().replace(\" \", \"_\"),) for x in df.columns]:\n",
" df = df.withColumnRenamed(original, renamed)\n",
" return df"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "afe8000d",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
" \r"
]
}
],
"source": [
"df_trip_data = spark.read \\\n",
" .option(\"header\", True) \\\n",
" .csv(\"data/trip_data.csv\", inferSchema=True)\n",
"\n",
"df_trip_data = sanitize_column_names(df_trip_data)"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "4dfe92f6",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
" \r"
]
}
],
"source": [
"df_trip_fare = spark.read \\\n",
" .option(\"header\", True) \\\n",
" .csv(\"data/trip_fare.csv\", inferSchema=True)\n",
"\n",
"df_trip_fare = sanitize_column_names(df_trip_fare)"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "d76abc83",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"+--------------------+--------------------+---------+---------+------------------+-------------------+-------------------+---------------+-----------------+-------------+----------------+---------------+-----------------+----------------+\n",
"| medallion| hack_license|vendor_id|rate_code|store_and_fwd_flag| pickup_datetime| dropoff_datetime|passenger_count|trip_time_in_secs|trip_distance|pickup_longitude|pickup_latitude|dropoff_longitude|dropoff_latitude|\n",
"+--------------------+--------------------+---------+---------+------------------+-------------------+-------------------+---------------+-----------------+-------------+----------------+---------------+-----------------+----------------+\n",
"|89D227B655E5C82AE...|BA96DE419E711691B...| CMT| 1| N|2013-01-01 15:11:48|2013-01-01 15:18:10| 4| 382| 1.0| -73.978165| 40.757977| -73.989838| 40.751171|\n",
"|0BD7C8F5BA12B88E0...|9FD8F69F0804BDB55...| CMT| 1| N|2013-01-06 00:18:35|2013-01-06 00:22:54| 1| 259| 1.5| -74.006683| 40.731781| -73.994499| 40.75066|\n",
"|0BD7C8F5BA12B88E0...|9FD8F69F0804BDB55...| CMT| 1| N|2013-01-05 18:49:41|2013-01-05 18:54:23| 1| 282| 1.1| -74.004707| 40.73777| -74.009834| 40.726002|\n",
"|DFD2202EE08F7A8DC...|51EE87E3205C985EF...| CMT| 1| N|2013-01-07 23:54:15|2013-01-07 23:58:20| 2| 244| 0.7| -73.974602| 40.759945| -73.984734| 40.759388|\n",
"|DFD2202EE08F7A8DC...|51EE87E3205C985EF...| CMT| 1| N|2013-01-07 23:25:03|2013-01-07 23:34:24| 1| 560| 2.1| -73.97625| 40.748528| -74.002586| 40.747868|\n",
"|20D9ECB2CA0767CF7...|598CCE5B9C1918568...| CMT| 1| N|2013-01-07 15:27:48|2013-01-07 15:38:37| 1| 648| 1.7| -73.966743| 40.764252| -73.983322| 40.743763|\n",
"|496644932DF393260...|513189AD756FF14FE...| CMT| 1| N|2013-01-08 11:01:15|2013-01-08 11:08:14| 1| 418| 0.8| -73.995804| 40.743977| -74.007416| 40.744343|\n",
"|0B57B9633A2FECD3D...|CCD4367B417ED6634...| CMT| 1| N|2013-01-07 12:39:18|2013-01-07 13:10:56| 3| 1898| 10.7| -73.989937| 40.756775| -73.86525| 40.77063|\n",
"|2C0E91FF20A856C89...|1DA2F6543A62B8ED9...| CMT| 1| N|2013-01-07 18:15:47|2013-01-07 18:20:47| 1| 299| 0.8| -73.980072| 40.743137| -73.982712| 40.735336|\n",
"|2D4B95E2FA7B2E851...|CD2F522EEE1FF5F5A...| CMT| 1| N|2013-01-07 15:33:28|2013-01-07 15:49:26| 2| 957| 2.5| -73.977936| 40.786983| -73.952919| 40.80637|\n",
"|E12F6AF991172EAC3...|06918214E951FA000...| CMT| 1| N|2013-01-08 13:11:52|2013-01-08 13:19:50| 1| 477| 1.3| -73.982452| 40.773167| -73.964134| 40.773815|\n",
"|E12F6AF991172EAC3...|06918214E951FA000...| CMT| 1| N|2013-01-08 09:50:05|2013-01-08 10:02:54| 1| 768| 0.7| -73.99556| 40.749294| -73.988686| 40.759052|\n",
"|78FFD9CD0CDA541F3...|E949C583ECF62C8F0...| CMT| 1| N|2013-01-10 12:07:08|2013-01-10 12:17:29| 1| 620| 2.3| -73.971497| 40.791321| -73.964478| 40.775921|\n",
"|237F49C3ECC11F502...|93C363DDF8ED9385D...| CMT| 1| N|2013-01-07 07:35:47|2013-01-07 07:46:00| 1| 612| 2.3| -73.98851| 40.774307| -73.981094| 40.755325|\n",
"|3349F919AA8AE5DC9...|7CE849FEF67514F08...| CMT| 1| N|2013-01-10 15:42:29|2013-01-10 16:04:02| 1| 1293| 3.2| -73.994911| 40.723221| -73.971558| 40.761612|\n",
"|3349F919AA8AE5DC9...|7CE849FEF67514F08...| CMT| 1| N|2013-01-10 14:27:28|2013-01-10 14:45:21| 1| 1073| 4.4| -74.010391| 40.708702| -73.987846| 40.756104|\n",
"|4C005EEBAA7BF26B8...|351BE7D984BE17DB2...| CMT| 1| N|2013-01-07 22:09:59|2013-01-07 22:19:50| 1| 591| 1.7| -73.973732| 40.756287| -73.998413| 40.756832|\n",
"|7D99C30FCE69B1A9D...|460C3F57DD9CB2265...| CMT| 1| N|2013-01-07 17:18:16|2013-01-07 17:20:55| 1| 158| 0.7| -73.968925| 40.767704| -73.96199| 40.776566|\n",
"|E6FBF80668FE0611A...|36773E80775F26CD1...| CMT| 1| N|2013-01-07 06:08:51|2013-01-07 06:13:14| 1| 262| 1.7| -73.96212| 40.769737| -73.979561| 40.75539|\n",
"|0C5296F3C8B16E702...|D2363240A9295EF57...| CMT| 1| N|2013-01-07 22:25:46|2013-01-07 22:36:56| 1| 669| 2.3| -73.989708| 40.756714| -73.977615| 40.787575|\n",
"+--------------------+--------------------+---------+---------+------------------+-------------------+-------------------+---------------+-----------------+-------------+----------------+---------------+-----------------+----------------+\n",
"only showing top 20 rows\n",
"\n"
]
}
],
"source": [
"df_trip_data.show()"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "3c7ccbd4",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"+--------------------+--------------------+---------+-------------------+------------+-----------+---------+-------+----------+------------+------------+\n",
"| medallion| hack_license|vendor_id| pickup_datetime|payment_type|fare_amount|surcharge|mta_tax|tip_amount|tolls_amount|total_amount|\n",
"+--------------------+--------------------+---------+-------------------+------------+-----------+---------+-------+----------+------------+------------+\n",
"|89D227B655E5C82AE...|BA96DE419E711691B...| CMT|2013-01-01 15:11:48| CSH| 6.5| 0.0| 0.5| 0.0| 0.0| 7.0|\n",
"|0BD7C8F5BA12B88E0...|9FD8F69F0804BDB55...| CMT|2013-01-06 00:18:35| CSH| 6.0| 0.5| 0.5| 0.0| 0.0| 7.0|\n",
"|0BD7C8F5BA12B88E0...|9FD8F69F0804BDB55...| CMT|2013-01-05 18:49:41| CSH| 5.5| 1.0| 0.5| 0.0| 0.0| 7.0|\n",
"|DFD2202EE08F7A8DC...|51EE87E3205C985EF...| CMT|2013-01-07 23:54:15| CSH| 5.0| 0.5| 0.5| 0.0| 0.0| 6.0|\n",
"|DFD2202EE08F7A8DC...|51EE87E3205C985EF...| CMT|2013-01-07 23:25:03| CSH| 9.5| 0.5| 0.5| 0.0| 0.0| 10.5|\n",
"|20D9ECB2CA0767CF7...|598CCE5B9C1918568...| CMT|2013-01-07 15:27:48| CSH| 9.5| 0.0| 0.5| 0.0| 0.0| 10.0|\n",
"|496644932DF393260...|513189AD756FF14FE...| CMT|2013-01-08 11:01:15| CSH| 6.0| 0.0| 0.5| 0.0| 0.0| 6.5|\n",
"|0B57B9633A2FECD3D...|CCD4367B417ED6634...| CMT|2013-01-07 12:39:18| CSH| 34.0| 0.0| 0.5| 0.0| 4.8| 39.3|\n",
"|2C0E91FF20A856C89...|1DA2F6543A62B8ED9...| CMT|2013-01-07 18:15:47| CSH| 5.5| 1.0| 0.5| 0.0| 0.0| 7.0|\n",
"|2D4B95E2FA7B2E851...|CD2F522EEE1FF5F5A...| CMT|2013-01-07 15:33:28| CSH| 13.0| 0.0| 0.5| 0.0| 0.0| 13.5|\n",
"|E12F6AF991172EAC3...|06918214E951FA000...| CMT|2013-01-08 13:11:52| CSH| 7.5| 0.0| 0.5| 0.0| 0.0| 8.0|\n",
"|E12F6AF991172EAC3...|06918214E951FA000...| CMT|2013-01-08 09:50:05| CSH| 9.0| 0.0| 0.5| 0.0| 0.0| 9.5|\n",
"|78FFD9CD0CDA541F3...|E949C583ECF62C8F0...| CMT|2013-01-10 12:07:08| CSH| 9.5| 0.0| 0.5| 0.0| 0.0| 10.0|\n",
"|237F49C3ECC11F502...|93C363DDF8ED9385D...| CMT|2013-01-07 07:35:47| CSH| 10.0| 0.0| 0.5| 0.0| 0.0| 10.5|\n",
"|3349F919AA8AE5DC9...|7CE849FEF67514F08...| CMT|2013-01-10 15:42:29| CSH| 15.5| 0.0| 0.5| 0.0| 0.0| 16.0|\n",
"|3349F919AA8AE5DC9...|7CE849FEF67514F08...| CMT|2013-01-10 14:27:28| CSH| 16.5| 0.0| 0.5| 0.0| 0.0| 17.0|\n",
"|4C005EEBAA7BF26B8...|351BE7D984BE17DB2...| CMT|2013-01-07 22:09:59| CSH| 9.0| 0.5| 0.5| 0.0| 0.0| 10.0|\n",
"|7D99C30FCE69B1A9D...|460C3F57DD9CB2265...| CMT|2013-01-07 17:18:16| CSH| 4.5| 1.0| 0.5| 0.0| 0.0| 6.0|\n",
"|E6FBF80668FE0611A...|36773E80775F26CD1...| CMT|2013-01-07 06:08:51| CSH| 7.0| 0.0| 0.5| 0.0| 0.0| 7.5|\n",
"|0C5296F3C8B16E702...|D2363240A9295EF57...| CMT|2013-01-07 22:25:46| CSH| 10.5| 0.5| 0.5| 0.0| 0.0| 11.5|\n",
"+--------------------+--------------------+---------+-------------------+------------+-----------+---------+-------+----------+------------+------------+\n",
"only showing top 20 rows\n",
"\n"
]
}
],
"source": [
"df_trip_fare.show()"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "61e21d2a",
"metadata": {},
"outputs": [],
"source": [
"df_left = df_trip_data.filter(col('pickup_datetime').startswith(\"2013-01-01 \"))\n",
"df_right = df_trip_fare.filter(col('pickup_datetime').startswith(\"2013-01-01 \"))\n",
"\n",
"df_joined = df_left.join(df_right, ['medallion', 'pickup_datetime']).cache()"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "d73ab313",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"[Stage 7:====================================================> (12 + 1) / 13]\r"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"+--------------------+-------------------+--------------------+---------+---------+------------------+-------------------+---------------+-----------------+-------------+----------------+---------------+-----------------+----------------+--------------------+---------+------------+-----------+---------+-------+----------+------------+------------+\n",
"| medallion| pickup_datetime| hack_license|vendor_id|rate_code|store_and_fwd_flag| dropoff_datetime|passenger_count|trip_time_in_secs|trip_distance|pickup_longitude|pickup_latitude|dropoff_longitude|dropoff_latitude| hack_license|vendor_id|payment_type|fare_amount|surcharge|mta_tax|tip_amount|tolls_amount|total_amount|\n",
"+--------------------+-------------------+--------------------+---------+---------+------------------+-------------------+---------------+-----------------+-------------+----------------+---------------+-----------------+----------------+--------------------+---------+------------+-----------+---------+-------+----------+------------+------------+\n",
"|000318C2E3E638158...|2013-01-01 20:46:00|91CE3B3A2F548CD8A...| VTS| 1| null|2013-01-01 20:56:00| 5| 600| 1.35| -73.989677| 40.756554| -73.970673| 40.752541|91CE3B3A2F548CD8A...| VTS| CRD| 8.5| 0.5| 0.5| 1.8| 0.0| 11.3|\n",
"|00790C7BAD30B7A9E...|2013-01-01 04:26:00|3EF1ED607505C991D...| VTS| 1| null|2013-01-01 04:59:00| 1| 1980| 10.99| -73.996811| 40.716587| -73.949448| 40.827671|3EF1ED607505C991D...| VTS| CRD| 36.5| 0.5| 0.5| 9.25| 0.0| 46.75|\n",
"|00A1EA0E8CD47CE24...|2013-01-01 06:09:50|4FD770C068437BBA9...| CMT| 1| N|2013-01-01 06:29:03| 1| 1153| 5.8| -73.89653| 40.759472| -73.952698| 40.780788|4FD770C068437BBA9...| CMT| CRD| 20.5| 0.0| 0.5| 4.0| 0.0| 25.0|\n",
"+--------------------+-------------------+--------------------+---------+---------+------------------+-------------------+---------------+-----------------+-------------+----------------+---------------+-----------------+----------------+--------------------+---------+------------+-----------+---------+-------+----------+------------+------------+\n",
"only showing top 3 rows\n",
"\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
" \r"
]
}
],
"source": [
"df_joined.show(3)"
]
},
{
"cell_type": "markdown",
"id": "5f246287",
"metadata": {},
"source": [
"### Exercise 2\n",
"Consider only Manhattan, Bronx and Brooklyn districts. Then create a dataframe that shows the total number of trips *within* the same district and *across* all the other districts mentioned before.\n",
"\n",
"For example, for Manhattan borough you should consider the total number of the following trips:\n",
"- Manhattan → Manhattan\n",
"- Manhattan → Brooklyn\n",
"- Manhattan → Bronx\n",
"\n",
"You should then do the same for Bronx and Brooklyn boroughs."
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "97e35f13",
"metadata": {},
"outputs": [],
"source": [
"df_boroughs = spark.read \\\n",
" .option(\"multiline\", \"true\") \\\n",
" .json(r'data/nyc-boroughs.geojson')\n",
"\n",
"df_boroughs = df_boroughs.select(F.explode(df_boroughs.features).alias(\"feature\"))\n",
"\n",
"boroughs_list = df_boroughs.select( \\\n",
" df_boroughs.feature.properties.borough.alias(\"borough\"), \\\n",
" df_boroughs.feature.geometry.coordinates.alias(\"coordinates\")).collect()\n",
"\n",
"boroughs_list: list[tuple[str, list[Polygon]]] = \\\n",
" [(r.borough, [Polygon(shell=p) for p in r.coordinates]) for r in boroughs_list]\n",
"\n",
"@F.udf(returnType=T.StringType())\n",
"def get_borough(lon: float, lat: float) -> bool:\n",
" global boroughs_list\n",
"\n",
" if lon is None or lat is None:\n",
" return None\n",
"\n",
" point = Point(lon, lat)\n",
" \n",
" for b in boroughs_list:\n",
" for p in b[1]:\n",
" if p.contains(point):\n",
" return b[0]\n",
" return None"
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "b12aa2ec",
"metadata": {},
"outputs": [],
"source": [
"# use UDF as join condition\n",
"df_with_bor = df_joined \\\n",
" .withColumn(\"pickup_borough\", get_borough(\"pickup_longitude\", \"pickup_latitude\")) \\\n",
" .withColumn(\"dropoff_borough\", get_borough(\"dropoff_longitude\", \"dropoff_latitude\")) \\\n",
" .cache()"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "9c14ad76-388a-454a-96c0-bf38765ce0dd",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"[Stage 13:=====================================================>(199 + 1) / 200]\r"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"+--------------+---------------+------+\n",
"|pickup_borough|dropoff_borough|count |\n",
"+--------------+---------------+------+\n",
"|Bronx |Bronx |487 |\n",
"|Bronx |Brooklyn |6 |\n",
"|Bronx |Manhattan |284 |\n",
"|Brooklyn |Bronx |57 |\n",
"|Brooklyn |Brooklyn |10454 |\n",
"|Brooklyn |Manhattan |6408 |\n",
"|Manhattan |Bronx |2779 |\n",
"|Manhattan |Brooklyn |14396 |\n",
"|Manhattan |Manhattan |319706|\n",
"+--------------+---------------+------+\n",
"\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
" \r"
]
}
],
"source": [
"def isin(var, values):\n",
" cond = (var == values[0])\n",
" for i in range(0, len(values)):\n",
" cond = cond | (var == values[i])\n",
" return cond\n",
"\n",
"boroughs = [\"Manhattan\", \"Bronx\", \"Brooklyn\"]\n",
"df_ex2 = df_with_bor \\\n",
" .where((isin(df_with_bor.pickup_borough, boroughs)) & (isin(df_with_bor.dropoff_borough, boroughs))) \\\n",
" .groupBy(\"pickup_borough\", \"dropoff_borough\") \\\n",
" .count() \\\n",
" .orderBy(\"pickup_borough\", \"dropoff_borough\")\n",
"df_ex2.show(truncate=False)"
]
},
{
"cell_type": "markdown",
"id": "21bd4ac8",
"metadata": {},
"source": [
"### Exercise 3\n",
"Imagine you are a taxi driver and one day you can work only two hours. Assume the data is representative of a typical working day. Which hours of the day - retrieved from `pickup_datetime` - would you choose to work based on the fare and tip amount?"
]
},
{
"cell_type": "code",
2023-05-31 15:49:12 +00:00
"execution_count": 13,
2023-05-30 15:52:00 +00:00
"id": "46d191e1-fd13-4de3-8851-5e10a7319286",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"[Stage 20:==================================================> (187 + 1) / 200]\r"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"+-----------+------------------+\n",
"|pickup_hour|fare_and_tip_total|\n",
"+-----------+------------------+\n",
"| 1| 453700.23|\n",
"| 2| 418415.82|\n",
"| 0| 390741.27|\n",
"| 3| 367018.78|\n",
"| 14| 286852.68|\n",
"| 15| 278953.43|\n",
"| 4| 272856.05|\n",
"| 18| 269648.14|\n",
"| 13| 263915.72|\n",
"| 17| 258134.56|\n",
"| 16| 246552.73|\n",
"| 12| 238716.32|\n",
"| 19| 234377.86|\n",
"| 20| 211402.98|\n",
"| 21| 208110.83|\n",
"| 22| 204481.56|\n",
"| 11| 194952.87|\n",
"| 5| 180075.5|\n",
"| 23| 158957.41|\n",
"| 10| 146400.51|\n",
"| 6| 135810.97|\n",
"| 7| 118466.26|\n",
"| 9| 111925.58|\n",
"| 8| 99021.68|\n",
"+-----------+------------------+\n",
"\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
" \r"
]
}
],
"source": [
"df_ex3 = df_joined.select( \\\n",
" F.hour(F.from_utc_timestamp(df_joined.pickup_datetime, 'UTC')).alias('pickup_hour'), \\\n",
" F.col(\"fare_amount\"), \\\n",
" F.col(\"tip_amount\")) \\\n",
" .groupby(\"pickup_hour\") \\\n",
" .agg(F.round(F.sum(F.col(\"fare_amount\") + F.col(\"tip_amount\")), 2).alias('fare_and_tip_total')) \\\n",
" .select(\"pickup_hour\", \"fare_and_tip_total\") \\\n",
" .sort(F.desc(\"fare_and_tip_total\"))\n",
"\n",
"df_ex3.show(24)"
]
},
{
"cell_type": "markdown",
"id": "ffbbaf04-65b5-4fc2-879f-a2a8bcc87519",
"metadata": {},
"source": [
"Given the table above I would choose to work at **1 AM** and **2 AM** as they are the most profitable hours based on total fare and tip amount. This may be the case for the chosen date `2013-01-01` because of the new year celebrations."
]
},
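{
"cell_type": "markdown",
"id": "ex3-top-hours-sketch",
"metadata": {},
"source": [
"As a cross-check of the choice above, here is a minimal plain-Python sketch (no Spark involved) that ranks the hourly totals copied from the table. The names `hourly_totals`, `top_hours` and `best_consecutive_window` are illustrative and not part of the assignment. The sketch assumes that the two working hours may either be picked freely or be required to be back to back.\n",
"\n",
"```python\n",
"# Hourly fare + tip totals copied from the table above (hour of day -> total)\n",
"hourly_totals = {\n",
"    0: 390741.27, 1: 453700.23, 2: 418415.82, 3: 367018.78,\n",
"    4: 272856.05, 5: 180075.5, 6: 135810.97, 7: 118466.26,\n",
"    8: 99021.68, 9: 111925.58, 10: 146400.51, 11: 194952.87,\n",
"    12: 238716.32, 13: 263915.72, 14: 286852.68, 15: 278953.43,\n",
"    16: 246552.73, 17: 258134.56, 18: 269648.14, 19: 234377.86,\n",
"    20: 211402.98, 21: 208110.83, 22: 204481.56, 23: 158957.41,\n",
"}\n",
"\n",
"\n",
"def top_hours(totals: dict[int, float], n: int = 2) -> list[int]:\n",
"    # Return the n hours with the largest totals, best first\n",
"    return sorted(totals, key=totals.get, reverse=True)[:n]\n",
"\n",
"\n",
"def best_consecutive_window(totals: dict[int, float]) -> tuple[int, int]:\n",
"    # Return the back-to-back pair (h, h + 1) with the largest combined total;\n",
"    # the window wraps around midnight, so (23, 0) is considered as well\n",
"    return max(\n",
"        ((h, (h + 1) % 24) for h in range(24)),\n",
"        key=lambda pair: totals[pair[0]] + totals[pair[1]],\n",
"    )\n",
"\n",
"\n",
"print(top_hours(hourly_totals))\n",
"print(best_consecutive_window(hourly_totals))\n",
"```\n",
"\n",
"With these totals, both criteria select the hours starting at 1 AM and 2 AM."
]
},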
{
"cell_type": "markdown",
"id": "b24e0922",
"metadata": {},
"source": [
"### Exercise 4\n",
"Provide a graphical representation to compare the average fare amount for trips _within_ and _across_ all the districts. You may want to have a look at: https://docs.bokeh.org/en/latest/docs/user_guide/topics/categorical.html#heatmaps."
]
},
{
"cell_type": "code",
"execution_count": 14,
"id": "0643d9e4",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
" \r"
]
}
],
"source": [
"ex4_data = df_with_bor \\\n",
" .withColumn(\"pickup_borough\", F.coalesce(F.col(\"pickup_borough\"), F.lit(\"Unknown\"))) \\\n",
" .withColumn(\"dropoff_borough\", F.coalesce(F.col(\"dropoff_borough\"), F.lit(\"Unknown\"))) \\\n",
" .groupBy(\"pickup_borough\", \"dropoff_borough\") \\\n",
" .agg(F.mean(F.col('fare_amount')).alias('mean_fare_amount')) \\\n",
" .collect()\n",
"\n",
"df_ex4 = pd.DataFrame()\n",
"for i, row in enumerate(ex4_data):\n",
" df_ex4.loc[i, 'pickup_borough'] = row.pickup_borough\n",
" df_ex4.loc[i, 'dropoff_borough'] = row.dropoff_borough\n",
" df_ex4.loc[i, 'mean_fare'] = row.mean_fare_amount"
]
},
{
"cell_type": "code",
"execution_count": 25,
"id": "2cba45e6-7ad1-4044-b9f0-81943c1cf547",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[Bokeh heatmap] Mean NYC Taxi fares on 2013-01-01: pickup borough (x axis) vs. dropoff borough (y axis), colored by mean fare; color scale from 2.5 to 72.43"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"pickup = list(sorted(df_ex4['pickup_borough'].unique()))\n",
"dropoff = list(reversed(sorted(df_ex4['dropoff_borough'].unique())))\n",
"\n",
"colors = [\"#75968f\", \"#a5bab7\", \"#c9d9d3\", \"#e2e2e2\", \"#dfccce\", \"#ddb7b1\", \"#cc7878\", \"#933b41\", \"#550b1d\"]\n",
"\n",
"p = figure(title=f\"Mean NYC Taxi fares on 2013-01-01\",\n",
" x_range=pickup, y_range=dropoff,\n",
" x_axis_location=\"above\", width=900, height=900,\n",
" tools=\"hover,save,pan,box_zoom,reset,wheel_zoom\", toolbar_location='below',\n",
" tooltips=[ \\\n",
" ('Pickup Borough', '@pickup_borough'), \\\n",
" ('Dropoff Borough', '@dropoff_borough'), \\\n",
" ('Average Fare Amount', '$@mean_fare')])\n",
"\n",
"p.grid.grid_line_color = None\n",
"p.axis.axis_line_color = None\n",
"p.axis.major_tick_line_color = None\n",
"p.axis.major_label_text_font_size = \"14px\"\n",
"p.axis.major_label_standoff = 0\n",
"p.xaxis.major_label_orientation = pi / 3\n",
"\n",
"r = p.rect(x=\"pickup_borough\", y=\"dropoff_borough\", width=1, height=1, source=df_ex4,\n",
" fill_color=linear_cmap(\"mean_fare\", colors, low=df_ex4.mean_fare.min(), high=df_ex4.mean_fare.max()),\n",
" line_color=None)\n",
"\n",
"p.add_layout(r.construct_color_bar(\n",
" major_label_text_font_size=\"14px\",\n",
" ticker=BasicTicker(desired_num_ticks=len(colors)),\n",
" formatter=PrintfTickFormatter(format=\"$%d\"),\n",
" label_standoff=6,\n",
" border_line_color=None,\n",
" padding=5\n",
"), 'right')\n",
"\n",
"show(p)"
]
},
{
"cell_type": "markdown",
"id": "9b4a8445",
"metadata": {},
"source": [
"### Exercise 5\n",
"Find the average amount of tolls per hour for trips within the following districts: Manhattan, Bronx, Brooklyn, Queens. Show a graphical representation of the data and report if there is any trend or peak during the day. Overall which district has the largest amount of tolls?"
]
},
{
"cell_type": "code",
"execution_count": 16,
"id": "b80cbb2d",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
" \r"
]
},
{
"data": {
"text/plain": [
"<Axes: xlabel='Hour of day of 2013-01-01', ylabel='Mean toll amount'>"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"image/png": "iVBORw0KGgoAAAANSUhEUgAABboAAAKnCAYAAABAjvvfAAAAOXRFWHRTb2Z0d2FyZQBNYXRwbG90bGliIHZlcnNpb24zLjcuMSwgaHR0cHM6Ly9tYXRwbG90bGliLm9yZy/bCgiHAAAACXBIWXMAAA9hAAAPYQGoP6dpAAEAAElEQVR4nOzdeXjU9b3+//szM0kme0I2khASdsIqaqW4EAE9aJVW7aketYJY7bF1wVrb4tdTrbbVY4/aDf35VauotWrbY/u11WoVBJRFQUQFw54EyB6yJ2Sd+f0x+QykbJkwM5+ZzPNxXXNdZNYXEQm585r7bbjdbrcAAAAAAAAAAAhTNqsHAAAAAAAAAADgVBB0AwAAAAAAAADCGkE3AAAAAAAAACCsEXQDAAAAAAAAAMIaQTcAAAAAAAAAIKwRdAMAAAAAAAAAwhpBNwAAAAAAAAAgrBF0AwAAAAAAAADCmsPqAYLN5XKpoqJCiYmJMgzD6nEAAAAAAAAAAMfgdrvV0tKinJwc2Wwn3tmOuKC7oqJCeXl5Vo8BAAAAAAAAABiA/fv3a8SIESe8T8QF3YmJiZI8n5ykpCSLpwEAAAAAAAAAHEtzc7Py8vK8me6JRFzQbdaVJCUlEXQDAAAAAAAAQIgbSAU1h1ECAAAAAAAAAMIaQTcAAAAAAAAAIKwRdAMAAAAAAAAAwlrEdXQDAAAAAAAACG1ut1s9PT3q7e21ehQEWFRUlOx2+yk/D0E3AAAAAAAAgJDR1dWlyspKtbe3Wz0KgsAwDI0YMUIJCQmn9DwE3QAAAAAAAABCgsvlUklJiex2u3JychQdHS3DMKweCwHidrtVW1urAwcOaNy4cae02U3QDQAAAAAAACAkdHV1yeVyKS8vT3FxcVaPgyDIyMhQaWmpuru7Tyno5jBKAAAAAAAAACHFZiO2jBT+2tjnTwwAAAAAAAAAIKwRdAMAAAAAAADAMZx//vm64447rB5j0MJ9fl8QdAMAAAAAAAAAwhpBNwAAAAAAAAAEQW9vr1wul9VjDEkE3QAAAAAAAABwHD09Pbr11luVnJys9PR0/fjHP5bb7ZYkNTQ0aOHChUpNTVVcXJwuvvhi7dq1y/vY5cuXKyUlRa+//romTZqkmJgY7du376SP+8lPfqLTTjut3xy/+tWvVFBQ0G+u22+/XSkpKUpLS9OPfvQjLVq0SJdddlm/x7lcLv3whz/UsGHDNHz4cP3kJz/x96coJBB0AwAAAAAAAMBxPP/883I4HProo4/061//Wo899pieeeYZSdL111+vTZs26fXXX9f69evldrv1la98Rd3d3d7Ht7e36+GHH9Yzzzyjbdu2KTMzc0CPO5mHH35YL730kp577jmtXbtWzc3N+utf/3rM+ePj4/Xhhx/qF7/4hR544AG98847p/x5CTUOqwcAAAAAAAAAgFCVl5enX/7ylzIMQxMmTNDnn3+uX/7ylzr//PP1+uuva+3atTr77LMlSS+99JLy8vL017/+Vd/4xjckSd3d3XriiSc0ffp0SdKuXbsG9LiT+e1vf6u7775bl19+uSRp2bJlevPNN4+637Rp03TfffdJksaNG6dly5ZpxYoVuvDCC0/tExNi2OgGAAAAAAAAgOP48pe/LMMwvB/PmjVLu3bt0hdffCGHw6GZM2d6b0tLS9OECRNUXFzsvS46OlrTpk3zflxcXDygx51IU1OTqqurddZZZ3mvs9vtOuOMM46675GvLUnZ2dmqqakZ0OuEE4JuAAAAAAAAAAiQ2NjYfkH5QNhsNm8PuMmXWpMjRUVF9fvYMIwheSAmQTcAAAAAAAAAHMeHH37Y7+MNGzZo3LhxmjRpknp6evrdfvDgQe3YsUOTJk067vMVFhae9HEZGRmqqqrqF3Zv2bLF++vk5GRlZWVp48aN3ut6e3u1efPmQf8+wx1BNwAAAAAAAAAcx759+3TnnXdqx44devnll/Xb3/5WS5Ys0bhx4/
S1r31NN910kz744AN9+umn+uY3v6nc3Fx97WtfO+7zDeRx559/vmpra/WLX/xCe/bs0eOPP65//OMf/Z7ntttu00MPPaT/9//+n3bs2KElS5aooaHB5+3xoYKgGwAAAAAAAACOY+HChTp06JDOOuss3XLLLVqyZIm+/e1vS5Kee+45nXHGGbr00ks1a9Ysud1uvfnmm0fVhfyrkz2usLBQTzzxhB5//HFNnz5dH330ke66665+z/GjH/1IV199tRYuXKhZs2YpISFB8+fPl9PpDMwnIsQZ7n8texnimpublZycrKamJiUlJVk9DgAAAAAAAIA+HR0dKikp0ahRoyI2sB0sl8ulwsJCXXnllfrpT39q9TgDdqL/5r5kuY5ADgkAAAAAAAAA8L+ysjL985//VFFRkTo7O7Vs2TKVlJTommuusXo0S1BdAgAAAAAAAABhxmazafny5frSl76kc845R59//rneffddFRYWWj2aJdjoBgAAAAAAAIAwk5eXp7Vr11o9RshgoxsAAAAAAAAAENYIugEAAAAAAAAAYY2gGwAAAAAAAAAQ1gi6AQAAAAAAAABhjaAbAAAAAAAAABDWCLoBAAAAAAAAAGGNoBsAAAAAAAAAENYIugEAAAAAAADgFF1//fUyDMN7SUtL00UXXaTPPvvM6tEiAkE3AAAAAAAAAPjBRRddpMrKSlVWVmrFihVyOBy69NJLj3v/7u7uIE43tBF0AwAAAAAAAIAfxMTEaPjw4Ro+fLhOO+00LV26VPv371dtba1KS0tlGIZeffVVFRUVyel06qWXXpLL5dIDDzygESNGKCYmRqeddpreeust73Oaj3vttdc0Z84cxcXFafr06Vq/fr33PjfccIOmTZumzs5OSVJXV5dmzJihhQsXBv1zYBVLg+41a9ZowYIFysnJkWEY+utf/zrgx65du1YOh0OnnXZawOYDAAAAAAAAYC232632rh5LLm63e9Bzt7a26ve//73Gjh2rtLQ07/VLly7VkiVLVFxcrPnz5+vXv/61Hn30UT3yyCP67LPPNH/+fH31q1/Vrl27+j3fPffco7vuuktbtmzR+PHjdfXVV6unp0eS9Jvf/EZtbW1aunSp976NjY1atmzZoOcPNw4rX7ytrU3Tp0/XDTfcoCuuuGLAj2tsbNTChQs1b948VVdXB3BCAAAAAAAAAFY61N2rSfe+bclrf/HAfMVFDzxC/fvf/66EhARJnuwzOztbf//732WzHd43vuOOO/ploY888oh+9KMf6T/+4z8kSQ8//LDee+89/epXv9Ljjz/uvd9dd92lSy65RJJ0//33a/Lkydq9e7cmTpyohIQE/f73v1dRUZESExP1q1/9Su+9956SkpJO6fcfTiwNui+++GJdfPHFPj/u5ptv1jXXXCO73e7TFjgAAAAAAAAABMqcOXP0//1//58kqaGhQU888YQuvvhiffTRR977nHnmmd5fNzc3q6KiQuecc06/5znnnHP06aef9rtu2rRp3l9nZ2dLkmpqajRx4kRJ0qxZs3TXXXfppz/9qX70ox/p3HPP9e9vLsRZGnQPxnPPPae9e/fq97//vX72s5+d9P6dnZ3ebhrJ84cHAAAAAAAAQHiIjbLriwfmW/bavoiPj9fYsWO9Hz/zzDNKTk7W008/rRtvvNF7n8GIiory/towDEmSy+XyXudyubR27VrZ7Xbt3r17UK8RzsIq6N61a5eWLl2q999/Xw7HwEZ/6KGHdP/99wd4MgAAAAAAAACBYBiGT/UhocQwDNlsNh06dOiYtyclJSknJ0dr165VUVGR9/q1a9fqrLPO8um1/ud//kfbt2/X6tWrNX/+fD333HNavHjxKc0fTiw9jNIXvb29uuaaa3T//fdr/PjxA37c3XffraamJu9l//79AZwSAAAAAAAAQKTq7OxUVVWVqqqqVFxcrNtuu02tra1asGDBcR/zgx/8QA8//LBeffVV7dixQ0uXLtWWLVu0ZM
mSAb/uJ598onvvvVfPPPOMzjnnHD322GNasmSJ9u7d64/fVlgImx+FtLS0aNOmTfrkk0906623SvKs47vdbjkcDv3
"text/plain": [
"<Figure size 1800x800 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"boroughs_ex5 = [\"Manhattan\", \"Bronx\", \"Brooklyn\", \"Queens\"]\n",
"\n",
"ex5_data = df_with_bor \\\n",
" .where((isin(df_with_bor.pickup_borough, boroughs_ex5)) & (df_with_bor.pickup_borough == df_with_bor.dropoff_borough)) \\\n",
" .withColumn(\"hour\", F.hour(F.from_utc_timestamp(F.col(\"pickup_datetime\"), 'UTC'))) \\\n",
" .groupBy(\"pickup_borough\", \"dropoff_borough\", \"hour\") \\\n",
" .agg(F.mean(F.col('tolls_amount')).alias('mean_tolls_amount')) \\\n",
" .select(F.col('pickup_borough').alias('borough'), F.col('hour'), F.col('mean_tolls_amount')) \\\n",
" .orderBy(\"borough\", \"hour\") \\\n",
" .collect()\n",
"\n",
"df_ex5 = pd.DataFrame()\n",
"for i, row in enumerate(ex5_data):\n",
" df_ex5.loc[i, 'borough'] = row.borough\n",
" df_ex5.loc[i, 'hour'] = row.hour\n",
" df_ex5.loc[i, 'mean_tolls_amount'] = row.mean_tolls_amount\n",
"\n",
"# Initialize the matplotlib figure\n",
"f, ax = plt.subplots(figsize=(18, 8))\n",
"ax.set(ylabel=\"Mean toll amount\", ylim=[0, 1.5], xticks=range(24), \n",
" xlabel=\"Hour of day of 2013-01-01\")\n",
"sns.lineplot(data=df_ex5, x=\"hour\", y=\"mean_tolls_amount\", hue=\"borough\")"
]
},
{
"cell_type": "markdown",
"id": "a575ddfa-5b39-4871-ad02-e81439eb13e6",
"metadata": {},
"source": [
"For trips within _Bronx_, there are several toll amount peaks, namely in decreasing order of magnitude between 10 AM and 11 AM, at 5 PM, at 8 PM and at 5 AM. Trips within _Queens_ show a steady toll amount increase peaking at 4 PM and then decreasing again. "
]
},
{
"cell_type": "code",
"execution_count": 17,
"id": "d0153a6e-3da5-49a2-8772-488c7d364ac2",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>mean_tolls_amount</th>\n",
" </tr>\n",
" <tr>\n",
" <th>borough</th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>Bronx</th>\n",
" <td>4.465003</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Queens</th>\n",
" <td>2.672513</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Brooklyn</th>\n",
" <td>0.417684</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Manhattan</th>\n",
" <td>0.145958</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" mean_tolls_amount\n",
"borough \n",
"Bronx 4.465003\n",
"Queens 2.672513\n",
"Brooklyn 0.417684\n",
"Manhattan 0.145958"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_ex5.groupby(\"borough\").sum().loc[:, [\"mean_tolls_amount\"]].sort_values(\"mean_tolls_amount\", ascending=False)"
]
},
{
"cell_type": "markdown",
"id": "501ef279-4b61-43d5-aa33-5c20d5354bb7",
"metadata": {},
"source": [
"As shown by the table above, _Bronx_ is the borough with the overall highest toll amounts for within-borough trips on 2013-01-01."
]
},
{
"cell_type": "markdown",
"id": "884b4cf9",
"metadata": {},
"source": [
"### Exercise 6\n",
"Create a dataframe that for each district shows the shortest and longest `trip_distance` starting and ending in the same district. What is the length of the longest and shortest trips in Manhattan?"
]
},
{
"cell_type": "code",
"execution_count": 18,
"id": "0aa8d795",
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"[Stage 50:=================================================> (183 + 1) / 200]\r"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"+-------------+-----------------+-----------------+\n",
"|borough |min_trip_distance|max_trip_distance|\n",
"+-------------+-----------------+-----------------+\n",
"|Bronx |0.0 |20.0 |\n",
"|Brooklyn |0.0 |80.5 |\n",
"|Manhattan |0.0 |100.0 |\n",
"|Queens |0.0 |98.7 |\n",
"|Staten Island|0.0 |5.7 |\n",
"+-------------+-----------------+-----------------+\n",
"\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
" \r"
]
}
],
"source": [
"df_ex6 = df_with_bor \\\n",
" .where((df_with_bor.pickup_borough == df_with_bor.dropoff_borough) & (df_with_bor.pickup_borough.isNotNull())) \\\n",
" .groupBy(\"pickup_borough\") \\\n",
" .agg(F.min('trip_distance').alias('min_trip_distance'), F.max('trip_distance').alias('max_trip_distance')) \\\n",
" .withColumnRenamed(\"pickup_borough\", \"borough\") \\\n",
" .orderBy(\"borough\")\n",
"\n",
"df_ex6.show(truncate=False)"
]
},
{
"cell_type": "markdown",
"id": "7a903390-2ef0-45da-8d76-f992d43a53b1",
"metadata": {},
"source": [
"The shortest trip within _Manhattan_ has distance $= 0$ while the longest one has distance $= 100$."
]
},
{
"cell_type": "markdown",
"id": "756da7e4",
"metadata": {},
"source": [
"### Exercise 7\n",
"Consider only the trips _within_ districts. What are the first and second-most expensive\n",
"trips - based on `total_amount` - in every district?"
]
},
{
"cell_type": "code",
"execution_count": 19,
"id": "ca83556d",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"23/05/31 18:21:27 WARN package: Truncated the string representation of a plan since it was too large. This behavior can be adjusted by setting 'spark.sql.debug.maxToStringFields'.\n",
"[Stage 61:> (0 + 1) / 1]\r"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"+--------------------+-------------------+--------------------+---------+---------+------------------+-------------------+---------------+-----------------+-------------+----------------+---------------+-----------------+----------------+--------------------+---------+------------+-----------+---------+-------+----------+------------+------------+-------------+---------------+----+\n",
"| medallion| pickup_datetime| hack_license|vendor_id|rate_code|store_and_fwd_flag| dropoff_datetime|passenger_count|trip_time_in_secs|trip_distance|pickup_longitude|pickup_latitude|dropoff_longitude|dropoff_latitude| hack_license|vendor_id|payment_type|fare_amount|surcharge|mta_tax|tip_amount|tolls_amount|total_amount| borough|dropoff_borough|rank|\n",
"+--------------------+-------------------+--------------------+---------+---------+------------------+-------------------+---------------+-----------------+-------------+----------------+---------------+-----------------+----------------+--------------------+---------+------------+-----------+---------+-------+----------+------------+------------+-------------+---------------+----+\n",
"|157E792C9A2041556...|2013-01-01 04:12:46|5AE2BD64DE046BC5C...| CMT| 5| N|2013-01-01 04:13:30| 3| 44| 0.0| -73.884644| 40.856674| -73.884636| 40.856693|5AE2BD64DE046BC5C...| CMT| CRD| 70.0| 0.0| 0.0| 14.0| 0.0| 84.0| Bronx| Bronx| 1|\n",
"|0984728E985ADC092...|2013-01-01 05:02:38|A3E9537FA108A49E4...| CMT| 5| N|2013-01-01 05:03:53| 2| 75| 0.2| -73.896759| 40.886013| -73.89904| 40.887779|A3E9537FA108A49E4...| CMT| CRD| 80.0| 0.0| 0.0| 0.0| 0.0| 80.0| Bronx| Bronx| 2|\n",
"|2D84EC6CD02550324...|2013-01-01 04:25:00|AC22E37790A7E433E...| VTS| 5| null|2013-01-01 04:25:00| 2| 0| 0.12| -73.939285| 40.723331| -73.939285| 40.723343|AC22E37790A7E433E...| VTS| CRD| 136.0| 0.0| 0.0| 34.0| 10.25| 180.25| Brooklyn| Brooklyn| 1|\n",
"|2A7C1AF76D40C1D22...|2013-01-01 03:56:00|7EAD01D87E93BA1E5...| VTS| 5| null|2013-01-01 03:56:00| 1| 0| 0.0| -73.983307| 40.679096| -73.983307| 40.6791|7EAD01D87E93BA1E5...| VTS| CRD| 100.0| 0.0| 0.0| 20.0| 10.25| 130.25| Brooklyn| Brooklyn| 2|\n",
"|152CBE18BB178155B...|2013-01-01 03:59:34|46B7AEDD5C8ECFF1E...| CMT| 5| N|2013-01-01 04:00:41| 3| 66| 0.0| -73.976433| 40.746506| -73.976433| 40.746506|46B7AEDD5C8ECFF1E...| CMT| DIS| 500.0| 0.0| 0.0| 0.0| 0.0| 500.0| Manhattan| Manhattan| 1|\n",
"|152CBE18BB178155B...|2013-01-01 04:03:32|46B7AEDD5C8ECFF1E...| CMT| 5| N|2013-01-01 04:04:51| 1| 79| 0.0| -73.976433| 40.746506| -73.976433| 40.746506|46B7AEDD5C8ECFF1E...| CMT| CSH| 475.0| 0.0| 0.0| 0.0| 0.0| 475.0| Manhattan| Manhattan| 2|\n",
"|FA189EABBB4058AC0...|2013-01-01 11:10:12|4E557EC0844425C75...| CMT| 5| N|2013-01-01 11:13:34| 2| 202| 2.2| -73.875206| 40.773304| -73.912048| 40.769394|4E557EC0844425C75...| CMT| CRD| 123.0| 0.0| 0.0| 15.0| 0.0| 138.0| Queens| Queens| 1|\n",
"|6B22AE697469CEA3D...|2013-01-01 03:53:00|52169A073CB4E5B1D...| VTS| 1| null|2013-01-01 04:35:00| 2| 2520| 17.24| -73.902206| 40.775982| -73.769829| 40.778721|52169A073CB4E5B1D...| VTS| CRD| 52.5| 0.5| 0.5| 62.5| 0.0| 116.0| Queens| Queens| 2|\n",
"|25C8D6B5EFFDE4FA5...|2013-01-01 02:41:48|70CD78D1142589EF0...| CMT| 5| N|2013-01-01 02:42:22| 1| 33| 0.0| -74.092506| 40.594997| -74.092484| 40.595036|70CD78D1142589EF0...| CMT| CRD| 73.0| 0.0| 0.0| 18.25| 0.0| 91.25|Staten Island| Staten Island| 1|\n",
"|B0B78CD05C8A1737E...|2013-01-01 03:42:19|B104BA3D279D230A0...| CMT| 5| N|2013-01-01 03:43:38| 2| 78| 0.0| -74.149422| 40.612503| -74.149399| 40.61248|B104BA3D279D230A0...| CMT| CRD| 89.6| 0.0| 0.0| 0.0| 0.0| 89.6|Staten Island| Staten Island| 2|\n",
"+--------------------+-------------------+--------------------+---------+---------+------------------+-------------------+---------------+-----------------+-------------+----------------+---------------+-----------------+----------------+--------------------+---------+------------+-----------+---------+-------+----------+------------+------------+-------------+---------------+----+\n",
"\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
" \r"
]
}
],
"source": [
"w = Window.partitionBy(\"borough\").orderBy(F.col(\"total_amount\").desc())\n",
"\n",
"df_ex7 = df_with_bor \\\n",
" .where((df_with_bor.pickup_borough == df_with_bor.dropoff_borough) & (df_with_bor.pickup_borough.isNotNull())) \\\n",
" .withColumnRenamed(\"pickup_borough\", \"borough\") \\\n",
" .withColumn(\"rank\", F.rank().over(w)) \\\n",
" .where(F.col(\"rank\") <= 2) \\\n",
"\n",
"df_ex7.show()"
]
},
{
"cell_type": "markdown",
"id": "6f88a475-1ef1-4b5d-829c-fb33e4b71d76",
"metadata": {},
"source": [
"The dataframe above shows the most expensive (with `rank` $=1$) and second most expensive (with `rank` $=2$) within-district trip data for each district."
]
},
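{
"cell_type": "markdown",
"id": "ex7-rank-ties-sketch",
"metadata": {},
"source": [
"Note that `rank` assigns the same rank to trips with an equal `total_amount`, so a district can contribute more than two rows when there is a tie. The following plain-Python sketch imitates that behavior on made-up amounts. `rank_desc` is a hypothetical helper used only for illustration and is not part of Spark.\n",
"\n",
"```python\n",
"def rank_desc(values: list[float]) -> list[int]:\n",
"    # Rank values in descending order the way SQL RANK() does:\n",
"    # equal values share a rank and the ranks after a tie are skipped\n",
"    ordered = sorted(values, reverse=True)\n",
"    # index() returns the first position of a value, so ties map to the same rank\n",
"    return [ordered.index(v) + 1 for v in values]\n",
"\n",
"\n",
"# Made-up total amounts for one district, with a tie for second place\n",
"amounts = [500.0, 475.0, 475.0, 80.0]\n",
"ranks = rank_desc(amounts)\n",
"print(ranks)\n",
"\n",
"# Keeping rank <= 2, as in the filter above, retains three trips here\n",
"kept = [a for a, r in zip(amounts, ranks) if r <= 2]\n",
"print(kept)\n",
"```\n",
"\n",
"If exactly two rows per district are required, `row_number` could be used instead of `rank`, at the cost of breaking ties arbitrarily."
]
},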
{
"cell_type": "markdown",
"id": "4f1e0800",
"metadata": {},
"source": [
"### Exercise 8\n",
"Create a dataframe where each row represents a driver, and there is one column per district.\n",
"For each driver-district, the dataframe provides the maximum number of consecutive trips\n",
"for the given driver, within the given district. \n",
"\n",
"For example, if for driver A we have (sorted by time):\n",
"- Trip 1: Bronx → Bronx\n",
"- Trip 2: Bronx → Bronx\n",
"- Trip 3: Bronx → Manhattan\n",
"- Trip 4: Manhattan → Bronx.\n",
" \n",
"The maximum number of consecutive trips for Bronx is 2."
]
},
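{
"cell_type": "markdown",
"id": "ex8-consecutive-trips-sketch",
"metadata": {},
"source": [
"The counting rule in the example above can be sketched in plain Python, without Spark, before it is expressed with window functions. This is only an illustration, under the assumption that a driver's trips are already sorted by pickup time. The names `Trip`, `max_consecutive_trips` and `driver_a` are hypothetical, and this is not necessarily how the solution below computes the result.\n",
"\n",
"```python\n",
"from collections import defaultdict\n",
"\n",
"# A trip is a (pickup_borough, dropoff_borough) pair; trips are assumed sorted by pickup time\n",
"Trip = tuple[str, str]\n",
"\n",
"\n",
"def max_consecutive_trips(trips: list[Trip]) -> dict[str, int]:\n",
"    # Longest run of back-to-back trips that start and end in the same borough\n",
"    longest: dict[str, int] = defaultdict(int)\n",
"    current_borough = None  # borough of the run in progress, if any\n",
"    current_len = 0\n",
"\n",
"    for pickup, dropoff in trips:\n",
"        if pickup == dropoff:\n",
"            # Extend the run if it is in the same borough, otherwise start a new one\n",
"            current_len = current_len + 1 if pickup == current_borough else 1\n",
"            current_borough = pickup\n",
"            longest[pickup] = max(longest[pickup], current_len)\n",
"        else:\n",
"            # A cross-borough trip interrupts any run in progress\n",
"            current_borough, current_len = None, 0\n",
"\n",
"    return dict(longest)\n",
"\n",
"\n",
"# The four trips of driver A from the example above\n",
"driver_a = [\n",
"    ('Bronx', 'Bronx'),\n",
"    ('Bronx', 'Bronx'),\n",
"    ('Bronx', 'Manhattan'),\n",
"    ('Manhattan', 'Bronx'),\n",
"]\n",
"print(max_consecutive_trips(driver_a))\n",
"```\n",
"\n",
"For driver A this yields 2 for the Bronx, matching the example. Boroughs without any within-borough trip are simply absent from the result."
]
},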
{
"cell_type": "code",
"execution_count": 20,
"id": "edde38bb",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"[Stage 84:> (0 + 1) / 1]\r"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"+--------------------------------+-----+--------+---------+------+-------------+\n",
"|medallion |Bronx|Brooklyn|Manhattan|Queens|Staten Island|\n",
"+--------------------------------+-----+--------+---------+------+-------------+\n",
"|35E11D9D2AE5C8A80261CF6A309BD9FD|0 |1 |15 |5 |0 |\n",
"|DA350783B6954CC672B3830F3A40C0F7|0 |0 |27 |0 |0 |\n",
"|35B2F21FAF5E53F1EB17848E7DC82055|0 |3 |8 |3 |0 |\n",
"|6695FB6E06F7D99F56B579A27759B6F2|0 |1 |21 |0 |0 |\n",
"|36372627462019376C639E270076E599|0 |0 |5 |2 |0 |\n",
"|EF882BDAF03D4151746F1A5A235FC454|0 |0 |17 |1 |0 |\n",
"|846DFE2D59F6E76ECE92959C7827FC12|0 |1 |25 |0 |0 |\n",
"|9B69C5971F62F151BB1C412B35090015|0 |0 |13 |1 |0 |\n",
"|0F621E366CFE63044BFED29EA126CDB9|0 |1 |11 |1 |0 |\n",
"|87EB479F55B88D47C643E19A11B4BEBF|0 |1 |19 |0 |0 |\n",
"|4EE5F2532F57F21244FCC00EEFC37BBC|0 |0 |14 |1 |0 |\n",
"|4F4CA97166A04A4551611769E2C01016|0 |0 |13 |1 |0 |\n",
"|DB1964B903773868E191176E8EF47946|0 |0 |6 |0 |0 |\n",
"|B01A3E26873C4B5145DED29355E1CEFD|0 |3 |20 |1 |0 |\n",
"|F49F752E7E9CAAE41B953FB96E25D059|1 |0 |8 |2 |0 |\n",
"|D72C164FE66ADFFFE94472B10DA5F9E3|0 |1 |20 |1 |0 |\n",
"|E1BD31C1BF8DDCFCB288ACD0A5B8015C|0 |2 |3 |1 |0 |\n",
"|80F732B990A7E37633782074F64AEF8B|0 |0 |13 |2 |0 |\n",
"|F9B3A00E6DDCA4F8BF2560DFF36B9E91|0 |0 |15 |0 |0 |\n",
"|1E8EDF1C2EF489B7AB3712977C7C08B5|0 |0 |2 |1 |0 |\n",
"+--------------------------------+-----+--------+---------+------+-------------+\n",
"only showing top 20 rows\n",
"\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
" \r"
]
}
],
"source": [
"w_ex8 = Window.partitionBy(\"medallion\").orderBy(F.col(\"pickup_datetime\"))\n",
"\n",
"@F.udf(returnType=T.IntegerType())\n",
"def max_consecutive_rank_seq_len(ranks: list[int]) -> int:\n",
" if len(ranks) <= 1:\n",
" return len(ranks)\n",
"\n",
" longest_len = 0\n",
" start = 0\n",
" \n",
" for i, rank in enumerate(ranks):\n",
" if i == 0:\n",
" continue\n",
" if rank - 1 != ranks[i - 1]:\n",
" longest_len = max(i - start, longest_len)\n",
" start = i\n",
" \n",
" longest_len = max(len(ranks) - start, longest_len) \n",
" return longest_len\n",
"\n",
"df_ex8 = df_with_bor \\\n",
" .select(\"medallion\", \"pickup_borough\", \"dropoff_borough\", \"pickup_datetime\") \\\n",
" .withColumn(\"tripNo\", F.rank().over(w_ex8)) \\\n",
" .where((F.col(\"pickup_borough\") == F.col(\"dropoff_borough\")) & (F.col(\"pickup_borough\").isNotNull())) \\\n",
" .select(F.col(\"medallion\"), F.col(\"pickup_borough\").alias(\"borough\"), F.col(\"tripNo\")) \\\n",
" .groupBy(\"medallion\", \"borough\").agg(max_consecutive_rank_seq_len(F.collect_list(\"tripNo\")).alias('maxTrips')) \\\n",
" .groupBy(\"medallion\").pivot(\"borough\").sum(\"maxTrips\") \\\n",
" .fillna(value=0)\n",
"\n",
"df_ex8.show(truncate=False)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2261d0e4-cf9d-4190-836b-32981b8ceb64",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.3"
}
},
"nbformat": 4,
"nbformat_minor": 5
}