From 5a59262a91d855fb71f901de512feafc584d075a Mon Sep 17 00:00:00 2001 From: Claudio Maggioni Date: Tue, 30 May 2023 17:52:00 +0200 Subject: [PATCH] hw3: done 1-4, wip 5 --- Assignment3/.gitignore | 3 + Assignment3/MaggioniClaudio_Assignment3.ipynb | 1107 +++++++++++++++++ Assignment3/requirements.txt | 4 + 3 files changed, 1114 insertions(+) create mode 100644 Assignment3/.gitignore create mode 100644 Assignment3/MaggioniClaudio_Assignment3.ipynb create mode 100644 Assignment3/requirements.txt diff --git a/Assignment3/.gitignore b/Assignment3/.gitignore new file mode 100644 index 0000000..4bd34d4 --- /dev/null +++ b/Assignment3/.gitignore @@ -0,0 +1,3 @@ +.env/ +data/ +!data/.gitkeep diff --git a/Assignment3/MaggioniClaudio_Assignment3.ipynb b/Assignment3/MaggioniClaudio_Assignment3.ipynb new file mode 100644 index 0000000..11c3ac6 --- /dev/null +++ b/Assignment3/MaggioniClaudio_Assignment3.ipynb @@ -0,0 +1,1107 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "id": "23b48f71", + "metadata": {}, + "source": [ + "# S&DE Atelier - Visual Analytics\n", + "\n", + "# Assignment 3\n", + "\n", + "**Due** June 2, 2023 @23:55\n", + "\n", + "**Contacts**: [marco.dambros@usi.ch](mailto:marco.dambros@usi.ch) - [carmen.armenti@usi.ch](mailto:carmen.armenti@usi.ch)\n", + "\n", + "---\n", + "\n", + "The goal of this assignment is to use Spark in Jupyter notebooks (PySpark). The files `trip_data.csv`, `trip_fare.csv` and `nyc_boroughs.geojson` can be found in the following folder: [Assignment3-data](https://usi365-my.sharepoint.com/:f:/g/personal/armenc_usi_ch/Ejp7sb8QAMROoWe0XUDcAkMBoqUFk-w2Vgroup025NhAww?e=TFG5CD). You should clean the data if needed. \n", + "\n", + "Note that you can use Spark [window functions](https://spark.apache.org/docs/latest/sql-ref-syntax-qry-select-window.html) whenever applicable. \n", + "\n", + "Please name your file as `SurnameName_Assignment3.ipynb`." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 1, + "id": "9f434eb8", + "metadata": {}, + "outputs": [], + "source": [ + "# Import the basic spark library\n", + "from pyspark.sql import SparkSession\n", + "from pyspark.sql.functions import col" + ] + }, + { + "cell_type": "code", + "execution_count": 2, + "id": "4a3188f4", + "metadata": {}, + "outputs": [], + "source": [ + "#import sys\n", + "#!{sys.executable} -m pip install geospark" + ] + }, + { + "cell_type": "code", + "execution_count": 3, + "id": "b9a87a5c", + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "Setting default log level to \"WARN\".\n", + "To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).\n", + "23/05/30 15:56:43 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable\n" + ] + } + ], + "source": [ + "# Create an entry point to the PySpark Application\n", + "spark = SparkSession.builder \\\n", + " .config(\"spark.driver.bindAddress\", \"127.0.0.1\") \\\n", + " .config(\"spark.driver.memory\", \"16g\") \\\n", + " .config(\"spark.executor.memory\", \"16g\") \\\n", + " .config(\"spark.executor.cores\", \"4\") \\\n", + " .config(\"spark.executor.memory\", \"16g\") \\\n", + " .master(\"local\") \\\n", + " .appName(\"MaggioniClaudio_Assignment3\") \\\n", + " .getOrCreate()" + ] + }, + { + "cell_type": "markdown", + "id": "536a6cc4", + "metadata": {}, + "source": [ + "### Exercise 1\n", + "Join the `trip_data` and `trip_fare` dataframes into one and consider only data on 2013-01-01." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 4, + "id": "9fc094c8", + "metadata": {}, + "outputs": [], + "source": [ + "def sanitize_column_names(df):\n", + " for original, renamed in [(x, x.strip().replace(\" \", \"_\"),) for x in df.columns]:\n", + " df = df.withColumnRenamed(original, renamed)\n", + " return df" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "id": "afe8000d", + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + " \r" + ] + } + ], + "source": [ + "df_trip_data = spark.read \\\n", + " .option(\"header\", True) \\\n", + " .csv(\"data/trip_data.csv\", inferSchema=True)\n", + "\n", + "df_trip_data = sanitize_column_names(df_trip_data)" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "id": "4dfe92f6", + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + " \r" + ] + } + ], + "source": [ + "df_trip_fare = spark.read \\\n", + " .option(\"header\", True) \\\n", + " .csv(\"data/trip_fare.csv\", inferSchema=True)\n", + "\n", + "df_trip_fare = sanitize_column_names(df_trip_fare)" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "id": "d76abc83", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "+--------------------+--------------------+---------+---------+------------------+-------------------+-------------------+---------------+-----------------+-------------+----------------+---------------+-----------------+----------------+\n", + "| medallion| hack_license|vendor_id|rate_code|store_and_fwd_flag| pickup_datetime| dropoff_datetime|passenger_count|trip_time_in_secs|trip_distance|pickup_longitude|pickup_latitude|dropoff_longitude|dropoff_latitude|\n", + 
"+--------------------+--------------------+---------+---------+------------------+-------------------+-------------------+---------------+-----------------+-------------+----------------+---------------+-----------------+----------------+\n", + "|89D227B655E5C82AE...|BA96DE419E711691B...| CMT| 1| N|2013-01-01 15:11:48|2013-01-01 15:18:10| 4| 382| 1.0| -73.978165| 40.757977| -73.989838| 40.751171|\n", + "|0BD7C8F5BA12B88E0...|9FD8F69F0804BDB55...| CMT| 1| N|2013-01-06 00:18:35|2013-01-06 00:22:54| 1| 259| 1.5| -74.006683| 40.731781| -73.994499| 40.75066|\n", + "|0BD7C8F5BA12B88E0...|9FD8F69F0804BDB55...| CMT| 1| N|2013-01-05 18:49:41|2013-01-05 18:54:23| 1| 282| 1.1| -74.004707| 40.73777| -74.009834| 40.726002|\n", + "|DFD2202EE08F7A8DC...|51EE87E3205C985EF...| CMT| 1| N|2013-01-07 23:54:15|2013-01-07 23:58:20| 2| 244| 0.7| -73.974602| 40.759945| -73.984734| 40.759388|\n", + "|DFD2202EE08F7A8DC...|51EE87E3205C985EF...| CMT| 1| N|2013-01-07 23:25:03|2013-01-07 23:34:24| 1| 560| 2.1| -73.97625| 40.748528| -74.002586| 40.747868|\n", + "|20D9ECB2CA0767CF7...|598CCE5B9C1918568...| CMT| 1| N|2013-01-07 15:27:48|2013-01-07 15:38:37| 1| 648| 1.7| -73.966743| 40.764252| -73.983322| 40.743763|\n", + "|496644932DF393260...|513189AD756FF14FE...| CMT| 1| N|2013-01-08 11:01:15|2013-01-08 11:08:14| 1| 418| 0.8| -73.995804| 40.743977| -74.007416| 40.744343|\n", + "|0B57B9633A2FECD3D...|CCD4367B417ED6634...| CMT| 1| N|2013-01-07 12:39:18|2013-01-07 13:10:56| 3| 1898| 10.7| -73.989937| 40.756775| -73.86525| 40.77063|\n", + "|2C0E91FF20A856C89...|1DA2F6543A62B8ED9...| CMT| 1| N|2013-01-07 18:15:47|2013-01-07 18:20:47| 1| 299| 0.8| -73.980072| 40.743137| -73.982712| 40.735336|\n", + "|2D4B95E2FA7B2E851...|CD2F522EEE1FF5F5A...| CMT| 1| N|2013-01-07 15:33:28|2013-01-07 15:49:26| 2| 957| 2.5| -73.977936| 40.786983| -73.952919| 40.80637|\n", + "|E12F6AF991172EAC3...|06918214E951FA000...| CMT| 1| N|2013-01-08 13:11:52|2013-01-08 13:19:50| 1| 477| 1.3| -73.982452| 40.773167| -73.964134| 
40.773815|\n", + "|E12F6AF991172EAC3...|06918214E951FA000...| CMT| 1| N|2013-01-08 09:50:05|2013-01-08 10:02:54| 1| 768| 0.7| -73.99556| 40.749294| -73.988686| 40.759052|\n", + "|78FFD9CD0CDA541F3...|E949C583ECF62C8F0...| CMT| 1| N|2013-01-10 12:07:08|2013-01-10 12:17:29| 1| 620| 2.3| -73.971497| 40.791321| -73.964478| 40.775921|\n", + "|237F49C3ECC11F502...|93C363DDF8ED9385D...| CMT| 1| N|2013-01-07 07:35:47|2013-01-07 07:46:00| 1| 612| 2.3| -73.98851| 40.774307| -73.981094| 40.755325|\n", + "|3349F919AA8AE5DC9...|7CE849FEF67514F08...| CMT| 1| N|2013-01-10 15:42:29|2013-01-10 16:04:02| 1| 1293| 3.2| -73.994911| 40.723221| -73.971558| 40.761612|\n", + "|3349F919AA8AE5DC9...|7CE849FEF67514F08...| CMT| 1| N|2013-01-10 14:27:28|2013-01-10 14:45:21| 1| 1073| 4.4| -74.010391| 40.708702| -73.987846| 40.756104|\n", + "|4C005EEBAA7BF26B8...|351BE7D984BE17DB2...| CMT| 1| N|2013-01-07 22:09:59|2013-01-07 22:19:50| 1| 591| 1.7| -73.973732| 40.756287| -73.998413| 40.756832|\n", + "|7D99C30FCE69B1A9D...|460C3F57DD9CB2265...| CMT| 1| N|2013-01-07 17:18:16|2013-01-07 17:20:55| 1| 158| 0.7| -73.968925| 40.767704| -73.96199| 40.776566|\n", + "|E6FBF80668FE0611A...|36773E80775F26CD1...| CMT| 1| N|2013-01-07 06:08:51|2013-01-07 06:13:14| 1| 262| 1.7| -73.96212| 40.769737| -73.979561| 40.75539|\n", + "|0C5296F3C8B16E702...|D2363240A9295EF57...| CMT| 1| N|2013-01-07 22:25:46|2013-01-07 22:36:56| 1| 669| 2.3| -73.989708| 40.756714| -73.977615| 40.787575|\n", + "+--------------------+--------------------+---------+---------+------------------+-------------------+-------------------+---------------+-----------------+-------------+----------------+---------------+-----------------+----------------+\n", + "only showing top 20 rows\n", + "\n" + ] + } + ], + "source": [ + "df_trip_data.show()" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "id": "3c7ccbd4", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + 
"+--------------------+--------------------+---------+-------------------+------------+-----------+---------+-------+----------+------------+------------+\n", + "| medallion| hack_license|vendor_id| pickup_datetime|payment_type|fare_amount|surcharge|mta_tax|tip_amount|tolls_amount|total_amount|\n", + "+--------------------+--------------------+---------+-------------------+------------+-----------+---------+-------+----------+------------+------------+\n", + "|89D227B655E5C82AE...|BA96DE419E711691B...| CMT|2013-01-01 15:11:48| CSH| 6.5| 0.0| 0.5| 0.0| 0.0| 7.0|\n", + "|0BD7C8F5BA12B88E0...|9FD8F69F0804BDB55...| CMT|2013-01-06 00:18:35| CSH| 6.0| 0.5| 0.5| 0.0| 0.0| 7.0|\n", + "|0BD7C8F5BA12B88E0...|9FD8F69F0804BDB55...| CMT|2013-01-05 18:49:41| CSH| 5.5| 1.0| 0.5| 0.0| 0.0| 7.0|\n", + "|DFD2202EE08F7A8DC...|51EE87E3205C985EF...| CMT|2013-01-07 23:54:15| CSH| 5.0| 0.5| 0.5| 0.0| 0.0| 6.0|\n", + "|DFD2202EE08F7A8DC...|51EE87E3205C985EF...| CMT|2013-01-07 23:25:03| CSH| 9.5| 0.5| 0.5| 0.0| 0.0| 10.5|\n", + "|20D9ECB2CA0767CF7...|598CCE5B9C1918568...| CMT|2013-01-07 15:27:48| CSH| 9.5| 0.0| 0.5| 0.0| 0.0| 10.0|\n", + "|496644932DF393260...|513189AD756FF14FE...| CMT|2013-01-08 11:01:15| CSH| 6.0| 0.0| 0.5| 0.0| 0.0| 6.5|\n", + "|0B57B9633A2FECD3D...|CCD4367B417ED6634...| CMT|2013-01-07 12:39:18| CSH| 34.0| 0.0| 0.5| 0.0| 4.8| 39.3|\n", + "|2C0E91FF20A856C89...|1DA2F6543A62B8ED9...| CMT|2013-01-07 18:15:47| CSH| 5.5| 1.0| 0.5| 0.0| 0.0| 7.0|\n", + "|2D4B95E2FA7B2E851...|CD2F522EEE1FF5F5A...| CMT|2013-01-07 15:33:28| CSH| 13.0| 0.0| 0.5| 0.0| 0.0| 13.5|\n", + "|E12F6AF991172EAC3...|06918214E951FA000...| CMT|2013-01-08 13:11:52| CSH| 7.5| 0.0| 0.5| 0.0| 0.0| 8.0|\n", + "|E12F6AF991172EAC3...|06918214E951FA000...| CMT|2013-01-08 09:50:05| CSH| 9.0| 0.0| 0.5| 0.0| 0.0| 9.5|\n", + "|78FFD9CD0CDA541F3...|E949C583ECF62C8F0...| CMT|2013-01-10 12:07:08| CSH| 9.5| 0.0| 0.5| 0.0| 0.0| 10.0|\n", + "|237F49C3ECC11F502...|93C363DDF8ED9385D...| CMT|2013-01-07 07:35:47| CSH| 10.0| 0.0| 
0.5| 0.0| 0.0| 10.5|\n", + "|3349F919AA8AE5DC9...|7CE849FEF67514F08...| CMT|2013-01-10 15:42:29| CSH| 15.5| 0.0| 0.5| 0.0| 0.0| 16.0|\n", + "|3349F919AA8AE5DC9...|7CE849FEF67514F08...| CMT|2013-01-10 14:27:28| CSH| 16.5| 0.0| 0.5| 0.0| 0.0| 17.0|\n", + "|4C005EEBAA7BF26B8...|351BE7D984BE17DB2...| CMT|2013-01-07 22:09:59| CSH| 9.0| 0.5| 0.5| 0.0| 0.0| 10.0|\n", + "|7D99C30FCE69B1A9D...|460C3F57DD9CB2265...| CMT|2013-01-07 17:18:16| CSH| 4.5| 1.0| 0.5| 0.0| 0.0| 6.0|\n", + "|E6FBF80668FE0611A...|36773E80775F26CD1...| CMT|2013-01-07 06:08:51| CSH| 7.0| 0.0| 0.5| 0.0| 0.0| 7.5|\n", + "|0C5296F3C8B16E702...|D2363240A9295EF57...| CMT|2013-01-07 22:25:46| CSH| 10.5| 0.5| 0.5| 0.0| 0.0| 11.5|\n", + "+--------------------+--------------------+---------+-------------------+------------+-----------+---------+-------+----------+------------+------------+\n", + "only showing top 20 rows\n", + "\n" + ] + } + ], + "source": [ + "df_trip_fare.show()" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "id": "61e21d2a", + "metadata": {}, + "outputs": [], + "source": [ + "df_left = df_trip_data.filter(col('pickup_datetime').startswith(\"2013-01-01 \"))\n", + "df_right = df_trip_fare.filter(col('pickup_datetime').startswith(\"2013-01-01 \"))\n", + "\n", + "df_joined = df_left.join(df_right, ['medallion', 'pickup_datetime']).cache()" + ] + }, + { + "cell_type": "code", + "execution_count": 10, + "id": "d73ab313", + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "[Stage 7:====================================================> (12 + 1) / 13]\r" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + 
"+--------------------+-------------------+--------------------+---------+---------+------------------+-------------------+---------------+-----------------+-------------+----------------+---------------+-----------------+----------------+--------------------+---------+------------+-----------+---------+-------+----------+------------+------------+\n", + "| medallion| pickup_datetime| hack_license|vendor_id|rate_code|store_and_fwd_flag| dropoff_datetime|passenger_count|trip_time_in_secs|trip_distance|pickup_longitude|pickup_latitude|dropoff_longitude|dropoff_latitude| hack_license|vendor_id|payment_type|fare_amount|surcharge|mta_tax|tip_amount|tolls_amount|total_amount|\n", + "+--------------------+-------------------+--------------------+---------+---------+------------------+-------------------+---------------+-----------------+-------------+----------------+---------------+-----------------+----------------+--------------------+---------+------------+-----------+---------+-------+----------+------------+------------+\n", + "|000318C2E3E638158...|2013-01-01 20:46:00|91CE3B3A2F548CD8A...| VTS| 1| null|2013-01-01 20:56:00| 5| 600| 1.35| -73.989677| 40.756554| -73.970673| 40.752541|91CE3B3A2F548CD8A...| VTS| CRD| 8.5| 0.5| 0.5| 1.8| 0.0| 11.3|\n", + "|00790C7BAD30B7A9E...|2013-01-01 04:26:00|3EF1ED607505C991D...| VTS| 1| null|2013-01-01 04:59:00| 1| 1980| 10.99| -73.996811| 40.716587| -73.949448| 40.827671|3EF1ED607505C991D...| VTS| CRD| 36.5| 0.5| 0.5| 9.25| 0.0| 46.75|\n", + "|00A1EA0E8CD47CE24...|2013-01-01 06:09:50|4FD770C068437BBA9...| CMT| 1| N|2013-01-01 06:29:03| 1| 1153| 5.8| -73.89653| 40.759472| -73.952698| 40.780788|4FD770C068437BBA9...| CMT| CRD| 20.5| 0.0| 0.5| 4.0| 0.0| 25.0|\n", + 
"+--------------------+-------------------+--------------------+---------+---------+------------------+-------------------+---------------+-----------------+-------------+----------------+---------------+-----------------+----------------+--------------------+---------+------------+-----------+---------+-------+----------+------------+------------+\n", + "only showing top 3 rows\n", + "\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + " \r" + ] + } + ], + "source": [ + "df_joined.show(3)" + ] + }, + { + "cell_type": "markdown", + "id": "5f246287", + "metadata": {}, + "source": [ + "### Exercise 2\n", + "Consider only Manhattan, Bronx and Brooklyn districts. Then create a dataframe that shows the total number of trips *within* the same district and *across* all the other districts mentioned before.\n", + "\n", + "For example, for Manhattan borough you should consider the total number of the following trips:\n", + "- Manhattan → Manhattan\n", + "- Manhattan → Brooklyn\n", + "- Manhattan → Bronx\n", + "\n", + "You should then do the same for Bronx and Brooklyn boroughs." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 11, + "id": "97e35f13", + "metadata": {}, + "outputs": [], + "source": [ + "from pyspark.sql import types as T\n", + "from pyspark.sql import functions as F\n", + "from shapely.geometry import Polygon, Point\n", + "from typing import Optional\n", + "\n", + "df_boroughs = spark.read \\\n", + " .option(\"multiline\", \"true\") \\\n", + " .json(r'data/nyc-boroughs.geojson')\n", + "\n", + "df_boroughs = df_boroughs.select(F.explode(df_boroughs.features).alias(\"feature\"))\n", + "\n", + "boroughs_list = df_boroughs.select( \\\n", + " df_boroughs.feature.properties.borough.alias(\"borough\"), \\\n", + " df_boroughs.feature.geometry.coordinates.alias(\"coordinates\")).collect()\n", + "\n", + "boroughs_list: list[tuple[str, list[Polygon]]] = \\\n", + " [(r.borough, [Polygon(shell=p) for p in r.coordinates]) for r in boroughs_list]\n", + "\n", + "@F.udf(returnType=T.StringType())\n", + "def get_borough(lon: float, lat: float) -> Optional[str]:\n", + " global boroughs_list\n", + "\n", + " if lon is None or lat is None:\n", + " return None\n", + "\n", + " point = Point(lon, lat)\n", + " \n", + " for b in boroughs_list:\n", + " for p in b[1]:\n", + " if p.contains(point):\n", + " return b[0]\n", + " return None" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "id": "b12aa2ec", + "metadata": {}, + "outputs": [], + "source": [ + "# label each trip with its pickup and dropoff borough via the UDF\n", + "df_with_bor = df_joined \\\n", + " .withColumn(\"pickup_borough\", get_borough(\"pickup_longitude\", \"pickup_latitude\")) \\\n", + " .withColumn(\"dropoff_borough\", get_borough(\"dropoff_longitude\", \"dropoff_latitude\")) \\\n", + " .cache()" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "id": "9d386ada-5bd0-4db5-9ac7-e675f371682c", + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + " \r" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Before borough join: 
412630\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + "[Stage 20:=====================================================>(199 + 1) / 200]\r" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "After borough join:412630\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + " \r" + ] + } + ], + "source": [ + "print(\"Before borough join: \" + str(df_joined.count())) \n", + "print(\"After borough join:\" + str(df_with_bor.count()))" + ] + }, + { + "cell_type": "code", + "execution_count": 20, + "id": "9c14ad76-388a-454a-96c0-bf38765ce0dd", + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "[Stage 55:=================================================> (185 + 1) / 200]\r" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "+--------------+---------------+------+\n", + "|pickup_borough|dropoff_borough|count |\n", + "+--------------+---------------+------+\n", + "|Bronx |Bronx |487 |\n", + "|Bronx |Brooklyn |6 |\n", + "|Bronx |Manhattan |284 |\n", + "|Brooklyn |Bronx |57 |\n", + "|Brooklyn |Brooklyn |10454 |\n", + "|Brooklyn |Manhattan |6408 |\n", + "|Manhattan |Bronx |2779 |\n", + "|Manhattan |Brooklyn |14396 |\n", + "|Manhattan |Manhattan |319706|\n", + "+--------------+---------------+------+\n", + "\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + " \r" + ] + } + ], + "source": [ + "def isin(var, values):\n", + " cond = (var == values[0])\n", + " for i in range(0, len(values)):\n", + " cond = cond | (var == values[i])\n", + " return cond\n", + "\n", + "boroughs = [\"Manhattan\", \"Bronx\", \"Brooklyn\"]\n", + "df_ex2 = df_with_bor \\\n", + " .where((isin(df_with_bor.pickup_borough, boroughs)) & (isin(df_with_bor.dropoff_borough, boroughs))) \\\n", + " .groupBy(\"pickup_borough\", \"dropoff_borough\") \\\n", + " .count() \\\n", + " .orderBy(\"pickup_borough\", \"dropoff_borough\")\n", + 
"df_ex2.show(truncate=False)" + ] + }, + { + "cell_type": "markdown", + "id": "21bd4ac8", + "metadata": {}, + "source": [ + "### Exercise 3\n", + "Imagine you are a taxi driver and one day you can work only two hours. Assume the data is representative of a typical working day. Which hours of the day - retrieved from `pickup_datetime` - would you choose to work based on the fare and tip amount?" + ] + }, + { + "cell_type": "code", + "execution_count": 47, + "id": "46d191e1-fd13-4de3-8851-5e10a7319286", + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "[Stage 215:===============================================> (181 + 1) / 200]\r" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "+-----------+------------------+\n", + "|pickup_hour|fare_and_tip_total|\n", + "+-----------+------------------+\n", + "|1 |453700.23 |\n", + "|2 |418415.82 |\n", + "|0 |390741.27 |\n", + "|3 |367018.78 |\n", + "|14 |286852.68 |\n", + "|15 |278953.43 |\n", + "|4 |272856.05 |\n", + "|18 |269648.14 |\n", + "|13 |263915.72 |\n", + "|17 |258134.56 |\n", + "|16 |246552.73 |\n", + "|12 |238716.32 |\n", + "|19 |234377.86 |\n", + "|20 |211402.98 |\n", + "|21 |208110.83 |\n", + "|22 |204481.56 |\n", + "|11 |194952.87 |\n", + "|5 |180075.5 |\n", + "|23 |158957.41 |\n", + "|10 |146400.51 |\n", + "+-----------+------------------+\n", + "only showing top 20 rows\n", + "\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + " \r" + ] + } + ], + "source": [ + "df_ex3 = df_joined.select( \\\n", + " F.hour(F.from_utc_timestamp(df_joined.pickup_datetime, 'UTC')).alias('pickup_hour'), \\\n", + " F.col(\"fare_amount\"), \\\n", + " F.col(\"tip_amount\")) \\\n", + " .groupby(\"pickup_hour\") \\\n", + " .agg(F.round(F.sum(F.col(\"fare_amount\") + F.col(\"tip_amount\")), 2).alias('fare_and_tip_total')) \\\n", + " .select(\"pickup_hour\", \"fare_and_tip_total\") \\\n", + " .sort(F.desc(\"fare_and_tip_total\"))\n", + 
"\n", + "df_ex3.show(truncate=False)" + ] + }, + { + "cell_type": "markdown", + "id": "ffbbaf04-65b5-4fc2-879f-a2a8bcc87519", + "metadata": {}, + "source": [ + "Given the table above, I would work during the hours starting at **1 AM** and **2 AM**, since they have the highest total fare and tip amounts. This ranking is probably specific to the chosen date, `2013-01-01`, because of New Year's celebrations." + ] + }, + { + "cell_type": "markdown", + "id": "b24e0922", + "metadata": {}, + "source": [ + "### Exercise 4\n", + "Provide a graphical representation to compare the average fare amount for trips _within_ and _across_ all the districts. You may want to have a look at: https://docs.bokeh.org/en/latest/docs/user_guide/topics/categorical.html#heatmaps." + ] + }, + { + "cell_type": "code", + "execution_count": 72, + "id": "0643d9e4", + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + " \r" + ] + } + ], + "source": [ + "import pandas as pd\n", + "\n", + "ex4_data = df_with_bor \\\n", + " .withColumn(\"pickup_borough\", F.coalesce(F.col(\"pickup_borough\"), F.lit(\"Unknown\"))) \\\n", + " .withColumn(\"dropoff_borough\", F.coalesce(F.col(\"dropoff_borough\"), F.lit(\"Unknown\"))) \\\n", + " .groupBy(\"pickup_borough\", \"dropoff_borough\") \\\n", + " .agg(F.mean(F.col('fare_amount')).alias('mean_fare_amount')) \\\n", + " .collect()\n", + "\n", + "df_ex4 = pd.DataFrame()\n", + "for i, row in enumerate(ex4_data):\n", + " df_ex4.loc[i, 'pickup_borough'] = row.pickup_borough\n", + " df_ex4.loc[i, 'dropoff_borough'] = row.dropoff_borough\n", + " df_ex4.loc[i, 'mean_fare'] = row.mean_fare_amount" + ] + }, + { + "cell_type": "code", + "execution_count": 79, + "id": "2cba45e6-7ad1-4044-b9f0-81943c1cf547", + "metadata": {}, + "outputs": [], + "source": [ + "from math import pi\n", + "from bokeh.models import BasicTicker, PrintfTickFormatter\n", + "from bokeh.plotting import figure, show\n", + "from bokeh.transform import linear_cmap\n", + 
"\n", + "pickup = list(sorted(df_ex4['pickup_borough'].unique()))\n", + "dropoff = list(reversed(sorted(df_ex4['dropoff_borough'].unique())))\n", + "\n", + "colors = [\"#75968f\", \"#a5bab7\", \"#c9d9d3\", \"#e2e2e2\", \"#dfccce\", \"#ddb7b1\", \"#cc7878\", \"#933b41\", \"#550b1d\"]\n", + "\n", + "p = figure(title=f\"Mean NYC Taxi fares on 2013-01-01\",\n", + " x_range=pickup, y_range=dropoff,\n", + " x_axis_location=\"above\", width=900, height=900,\n", + " tools=\"hover,save,pan,box_zoom,reset,wheel_zoom\", toolbar_location='below',\n", + " tooltips=[ \\\n", + " ('Pickup Borough', '@pickup_borough'), \\\n", + " ('Dropoff Borough', '@dropoff_borough'), \\\n", + " ('Average Fare Amount', '$@mean_fare')])\n", + "\n", + "p.grid.grid_line_color = None\n", + "p.axis.axis_line_color = None\n", + "p.axis.major_tick_line_color = None\n", + "p.axis.major_label_text_font_size = \"14px\"\n", + "p.axis.major_label_standoff = 0\n", + "p.xaxis.major_label_orientation = pi / 3\n", + "\n", + "r = p.rect(x=\"pickup_borough\", y=\"dropoff_borough\", width=1, height=1, source=df_ex4,\n", + " fill_color=linear_cmap(\"mean_fare\", colors, low=df_ex4.mean_fare.min(), high=df_ex4.mean_fare.max()),\n", + " line_color=None)\n", + "\n", + "p.add_layout(r.construct_color_bar(\n", + " major_label_text_font_size=\"14px\",\n", + " ticker=BasicTicker(desired_num_ticks=len(colors)),\n", + " formatter=PrintfTickFormatter(format=\"$%d\"),\n", + " label_standoff=6,\n", + " border_line_color=None,\n", + " padding=5\n", + "), 'right')\n", + "\n", + "show(p)" + ] + }, + { + "cell_type": "markdown", + "id": "9b4a8445", + "metadata": {}, + "source": [ + "### Exercise 5\n", + "Find the average amount of tolls per hour for trips within the following districts: Manhattan, Bronx, Brooklyn, Queens. Show a graphical representation of the data and report if there is any trend or peak during the day. Overall which district has the largest amount of tolls?" 
+ ] + }, + { + "cell_type": "code", + "execution_count": 88, + "id": "b80cbb2d", + "metadata": {}, + "outputs": [ + { + "name": "stderr", + "output_type": "stream", + "text": [ + "[Stage 313:====================================================>(197 + 1) / 200]\r" + ] + }, + { + "name": "stdout", + "output_type": "stream", + "text": [ + "+--------------+---------------+----+--------------------+\n", + "|pickup_borough|dropoff_borough|hour| mean_tolls_amount|\n", + "+--------------+---------------+----+--------------------+\n", + "| Bronx| Bronx| 0| 0.0|\n", + "| Bronx| Bronx| 1| 0.0|\n", + "| Bronx| Bronx| 2| 0.0|\n", + "| Bronx| Bronx| 3| 0.0|\n", + "| Bronx| Bronx| 4| 0.0|\n", + "| Bronx| Bronx| 5| 0.14545454545454545|\n", + "| Bronx| Bronx| 6| 0.0|\n", + "| Bronx| Bronx| 7| 0.0|\n", + "| Bronx| Bronx| 8| 0.0|\n", + "| Bronx| Bronx| 9| 0.0|\n", + "| Bronx| Bronx| 10| 1.1388888888888888|\n", + "| Bronx| Bronx| 11| 0.6857142857142857|\n", + "| Bronx| Bronx| 12| 0.0|\n", + "| Bronx| Bronx| 13| 0.0|\n", + "| Bronx| Bronx| 14| 0.0|\n", + "| Bronx| Bronx| 15| 0.0|\n", + "| Bronx| Bronx| 16| 0.0|\n", + "| Bronx| Bronx| 17| 0.96|\n", + "| Bronx| Bronx| 18| 0.48|\n", + "| Bronx| Bronx| 19| 0.0|\n", + "| Bronx| Bronx| 20| 0.6857142857142857|\n", + "| Bronx| Bronx| 21| 0.3692307692307692|\n", + "| Bronx| Bronx| 22| 0.0|\n", + "| Bronx| Bronx| 23| 0.0|\n", + "| Bronx| Brooklyn| 1| 4.8|\n", + "| Bronx| Brooklyn| 2| 0.0|\n", + "| Bronx| Brooklyn| 6| 0.0|\n", + "| Bronx| Brooklyn| 12| 0.0|\n", + "| Bronx| Brooklyn| 17| 4.8|\n", + "| Bronx| Manhattan| 0| 0.0|\n", + "| Bronx| Manhattan| 1| 0.0|\n", + "| Bronx| Manhattan| 2| 0.0|\n", + "| Bronx| Manhattan| 3| 0.0|\n", + "| Bronx| Manhattan| 4| 0.18333333333333335|\n", + "| Bronx| Manhattan| 5| 0.0|\n", + "| Bronx| Manhattan| 6| 0.0|\n", + "| Bronx| Manhattan| 7| 0.0|\n", + "| Bronx| Manhattan| 8| 0.0|\n", + "| Bronx| Manhattan| 9| 0.0|\n", + "| Bronx| Manhattan| 10| 0.0|\n", + "| Bronx| Manhattan| 11| 0.0|\n", + "| Bronx| 
Manhattan| 12| 0.0|\n", + "| Bronx| Manhattan| 13| 0.0|\n", + "| Bronx| Manhattan| 14| 0.0|\n", + "| Bronx| Manhattan| 15| 0.0|\n", + "| Bronx| Manhattan| 16| 0.0|\n", + "| Bronx| Manhattan| 17| 0.0|\n", + "| Bronx| Manhattan| 18| 0.0|\n", + "| Bronx| Manhattan| 20| 0.0|\n", + "| Bronx| Manhattan| 21| 0.0|\n", + "| Bronx| Manhattan| 22| 0.0|\n", + "| Bronx| Manhattan| 23| 0.0|\n", + "| Bronx| Queens| 0| 4.8|\n", + "| Bronx| Queens| 1| 2.4|\n", + "| Bronx| Queens| 2| 4.8|\n", + "| Bronx| Queens| 3| 4.8|\n", + "| Bronx| Queens| 5| 3.5999999999999996|\n", + "| Bronx| Queens| 6| 2.4|\n", + "| Bronx| Queens| 7| 4.8|\n", + "| Bronx| Queens| 12| 4.8|\n", + "| Bronx| Queens| 15| 4.8|\n", + "| Brooklyn| Bronx| 0| 1.92|\n", + "| Brooklyn| Bronx| 1| 2.742857142857143|\n", + "| Brooklyn| Bronx| 2| 1.3499999999999999|\n", + "| Brooklyn| Bronx| 3| 1.3833333333333335|\n", + "| Brooklyn| Bronx| 4| 1.5999999999999999|\n", + "| Brooklyn| Bronx| 5| 2.4|\n", + "| Brooklyn| Bronx| 6| 1.5999999999999999|\n", + "| Brooklyn| Bronx| 7| 1.2|\n", + "| Brooklyn| Bronx| 8| 0.0|\n", + "| Brooklyn| Bronx| 10| 0.0|\n", + "| Brooklyn| Bronx| 11| 4.8|\n", + "| Brooklyn| Bronx| 12| 2.2|\n", + "| Brooklyn| Bronx| 18| 0.0|\n", + "| Brooklyn| Bronx| 23| 0.0|\n", + "| Brooklyn| Brooklyn| 0| 0.0|\n", + "| Brooklyn| Brooklyn| 1|0.005357142857142857|\n", + "| Brooklyn| Brooklyn| 2| 0.0|\n", + "| Brooklyn| Brooklyn| 3|0.019872701555869874|\n", + "| Brooklyn| Brooklyn| 4|0.009352189781021899|\n", + "| Brooklyn| Brooklyn| 5| 0.0|\n", + "| Brooklyn| Brooklyn| 6| 0.0|\n", + "| Brooklyn| Brooklyn| 7| 0.0|\n", + "| Brooklyn| Brooklyn| 8| 0.0|\n", + "| Brooklyn| Brooklyn| 9| 0.11851851851851851|\n", + "| Brooklyn| Brooklyn| 10| 0.0|\n", + "| Brooklyn| Brooklyn| 11| 0.04|\n", + "| Brooklyn| Brooklyn| 12| 0.0|\n", + "| Brooklyn| Brooklyn| 13| 0.0|\n", + "| Brooklyn| Brooklyn| 14| 0.02711864406779661|\n", + "| Brooklyn| Brooklyn| 15|0.028402366863905324|\n", + "| Brooklyn| Brooklyn| 16| 0.0|\n", + "| Brooklyn| 
Brooklyn| 17| 0.02711864406779661|\n", + "| Brooklyn| Brooklyn| 18|0.020600858369098713|\n", + "| Brooklyn| Brooklyn| 19|0.021052631578947368|\n", + "| Brooklyn| Brooklyn| 20| 0.0|\n", + "| Brooklyn| Brooklyn| 21| 0.04324324324324324|\n", + "| Brooklyn| Brooklyn| 22| 0.05704697986577181|\n", + "| Brooklyn| Brooklyn| 23| 0.0|\n", + "| Brooklyn| Manhattan| 0| 0.04419889502762431|\n", + "| Brooklyn| Manhattan| 1| 0.0632860040567951|\n", + "| Brooklyn| Manhattan| 2| 0.05387755102040815|\n", + "| Brooklyn| Manhattan| 3| 0.07449748743718591|\n", + "| Brooklyn| Manhattan| 4|0.038554216867469876|\n", + "| Brooklyn| Manhattan| 5|0.018532818532818532|\n", + "| Brooklyn| Manhattan| 6| 0.08372093023255812|\n", + "| Brooklyn| Manhattan| 7| 0.0|\n", + "| Brooklyn| Manhattan| 8| 0.0|\n", + "| Brooklyn| Manhattan| 9| 0.05581395348837209|\n", + "| Brooklyn| Manhattan| 10| 0.04403669724770642|\n", + "| Brooklyn| Manhattan| 11| 0.07218045112781954|\n", + "| Brooklyn| Manhattan| 12| 0.0|\n", + "| Brooklyn| Manhattan| 13| 0.02981366459627329|\n", + "| Brooklyn| Manhattan| 14| 0.05962732919254658|\n", + "| Brooklyn| Manhattan| 15| 0.0|\n", + "| Brooklyn| Manhattan| 16| 0.11290322580645161|\n", + "| Brooklyn| Manhattan| 17| 0.12314102564102562|\n", + "| Brooklyn| Manhattan| 18| 0.0|\n", + "| Brooklyn| Manhattan| 19| 0.0|\n", + "| Brooklyn| Manhattan| 20| 0.04|\n", + "| Brooklyn| Manhattan| 21| 0.08495575221238938|\n", + "| Brooklyn| Manhattan| 22| 0.04033613445378151|\n", + "| Brooklyn| Manhattan| 23| 0.0|\n", + "| Brooklyn| Queens| 0| 0.0|\n", + "| Brooklyn| Queens| 1|0.010526315789473684|\n", + "| Brooklyn| Queens| 2| 0.02513089005235602|\n", + "| Brooklyn| Queens| 3|0.026666666666666665|\n", + "| Brooklyn| Queens| 4|0.012413793103448275|\n", + "| Brooklyn| Queens| 5| 0.12258064516129033|\n", + "| Brooklyn| Queens| 6| 0.02857142857142857|\n", + "| Brooklyn| Queens| 7| 0.0|\n", + "| Brooklyn| Queens| 8| 0.0|\n", + "| Brooklyn| Queens| 9| 0.0|\n", + "| Brooklyn| Queens| 10| 0.0|\n", + "| 
Brooklyn| Queens| 11| 0.0|\n", + "| Brooklyn| Queens| 12| 0.1846153846153846|\n", + "| Brooklyn| Queens| 13| 0.0|\n", + "| Brooklyn| Queens| 14| 0.0|\n", + "| Brooklyn| Queens| 15| 0.11707317073170731|\n", + "| Brooklyn| Queens| 16| 0.0|\n", + "| Brooklyn| Queens| 17| 0.0|\n", + "| Brooklyn| Queens| 18| 0.0|\n", + "| Brooklyn| Queens| 19| 0.0|\n", + "| Brooklyn| Queens| 20| 0.0|\n", + "| Brooklyn| Queens| 21| 0.0|\n", + "| Brooklyn| Queens| 22| 0.0|\n", + "| Brooklyn| Queens| 23| 0.0|\n", + "| Manhattan| Bronx| 0| 0.2533333333333334|\n", + "| Manhattan| Bronx| 1| 0.2715277777777779|\n", + "| Manhattan| Bronx| 2| 0.2100628930817611|\n", + "| Manhattan| Bronx| 3| 0.2696428571428573|\n", + "| Manhattan| Bronx| 4| 0.15384615384615388|\n", + "| Manhattan| Bronx| 5| 0.05527638190954774|\n", + "| Manhattan| Bronx| 6| 0.08096590909090909|\n", + "| Manhattan| Bronx| 7| 0.1333333333333333|\n", + "| Manhattan| Bronx| 8| 0.18133333333333335|\n", + "| Manhattan| Bronx| 9| 0.165|\n", + "| Manhattan| Bronx| 10| 0.3578947368421052|\n", + "| Manhattan| Bronx| 11| 0.3674418604651163|\n", + "| Manhattan| Bronx| 12| 0.43902439024390244|\n", + "| Manhattan| Bronx| 13| 0.22999999999999998|\n", + "| Manhattan| Bronx| 14| 0.2619047619047619|\n", + "| Manhattan| Bronx| 15| 0.2490566037735849|\n", + "| Manhattan| Bronx| 16| 0.5236170212765957|\n", + "| Manhattan| Bronx| 17| 0.23749999999999996|\n", + "| Manhattan| Bronx| 18| 0.2925925925925926|\n", + "| Manhattan| Bronx| 19| 0.1543859649122807|\n", + "| Manhattan| Bronx| 20| 0.14666666666666667|\n", + "| Manhattan| Bronx| 21| 0.20909090909090908|\n", + "| Manhattan| Bronx| 22| 0.29|\n", + "| Manhattan| Bronx| 23| 0.13609999999999997|\n", + "| Manhattan| Brooklyn| 0| 0.20921052631578962|\n", + "| Manhattan| Brooklyn| 1| 0.24647709320695127|\n", + "| Manhattan| Brooklyn| 2| 0.2537931034482761|\n", + "| Manhattan| Brooklyn| 3| 0.168358208955224|\n", + "| Manhattan| Brooklyn| 4| 0.14059939301972688|\n", + "| Manhattan| Brooklyn| 5| 
0.11757188498402552|\n", + "| Manhattan| Brooklyn| 6| 0.1429467084639498|\n", + "| Manhattan| Brooklyn| 7| 0.12403433476394847|\n", + "| Manhattan| Brooklyn| 8| 0.1471264367816092|\n", + "| Manhattan| Brooklyn| 9| 0.16633663366336635|\n", + "| Manhattan| Brooklyn| 10| 0.11267605633802817|\n", + "| Manhattan| Brooklyn| 11| 0.18585657370517925|\n", + "| Manhattan| Brooklyn| 12| 0.19136212624584714|\n", + "| Manhattan| Brooklyn| 13| 0.15789473684210523|\n", + "| Manhattan| Brooklyn| 14| 0.2719999999999999|\n", + "| Manhattan| Brooklyn| 15| 0.2133333333333333|\n", + "| Manhattan| Brooklyn| 16| 0.2842105263157894|\n", + "| Manhattan| Brooklyn| 17| 0.2565139949109414|\n", + "| Manhattan| Brooklyn| 18| 0.18093126385809308|\n", + "| Manhattan| Brooklyn| 19| 0.1438972162740899|\n", + "| Manhattan| Brooklyn| 20| 0.13136842105263155|\n", + "| Manhattan| Brooklyn| 21| 0.1684405458089668|\n", + "| Manhattan| Brooklyn| 22| 0.16958041958041953|\n", + "| Manhattan| Brooklyn| 23| 0.09829351535836177|\n", + "| Manhattan| Manhattan| 0|0.002124846378776963|\n", + "| Manhattan| Manhattan| 1|0.003388822829964328|\n", + "| Manhattan| Manhattan| 2|0.002282543352601...|\n", + "| Manhattan| Manhattan| 3|6.617317182593092E-4|\n", + "| Manhattan| Manhattan| 4| 0.00711096245505477|\n", + "| Manhattan| Manhattan| 5|0.004739558892538714|\n", + "| Manhattan| Manhattan| 6|0.008770792827824583|\n", + "| Manhattan| Manhattan| 7| 0.01721972031287035|\n", + "| Manhattan| Manhattan| 8|0.007416208104052026|\n", + "| Manhattan| Manhattan| 9|0.008730447435431065|\n", + "| Manhattan| Manhattan| 10|0.007606766828344964|\n", + "| Manhattan| Manhattan| 11|0.003766874141136529|\n", + "| Manhattan| Manhattan| 12|0.002688551972247...|\n", + "| Manhattan| Manhattan| 13|0.002815919789692486|\n", + "| Manhattan| Manhattan| 14|0.003850092535471...|\n", + "| Manhattan| Manhattan| 15|0.008035703139629235|\n", + "| Manhattan| Manhattan| 16| 0.0056893032117583|\n", + "| Manhattan| Manhattan| 17|0.009296927493738926|\n", 
+ "| Manhattan| Manhattan| 18|0.006115517819238...|\n", + "| Manhattan| Manhattan| 19|0.006486187125358352|\n", + "| Manhattan| Manhattan| 20|0.008908519239407095|\n", + "| Manhattan| Manhattan| 21|0.004213675213675213|\n", + "| Manhattan| Manhattan| 22|0.005885259631490787|\n", + "| Manhattan| Manhattan| 23|0.008152764067127342|\n", + "| Manhattan| Queens| 0| 0.8684324324324318|\n", + "| Manhattan| Queens| 1| 0.8232996323529406|\n", + "| Manhattan| Queens| 2| 0.8496747967479669|\n", + "| Manhattan| Queens| 3| 0.920373626373625|\n", + "| Manhattan| Queens| 4| 0.9509571209800902|\n", + "| Manhattan| Queens| 5| 1.2870841487279827|\n", + "| Manhattan| Queens| 6| 1.7025057208237966|\n", + "| Manhattan| Queens| 7| 2.1997175866495486|\n", + "| Manhattan| Queens| 8| 2.7828251121076213|\n", + "| Manhattan| Queens| 9| 2.6930985915492927|\n", + "| Manhattan| Queens| 10| 2.625207296849084|\n", + "| Manhattan| Queens| 11| 2.9828428571428574|\n", + "| Manhattan| Queens| 12| 3.070651685393257|\n", + "| Manhattan| Queens| 13| 2.920602536997886|\n", + "| Manhattan| Queens| 14| 3.059551760939169|\n", + "| Manhattan| Queens| 15| 3.2354977876106217|\n", + "| Manhattan| Queens| 16| 2.8950213371265985|\n", + "| Manhattan| Queens| 17| 2.6199999999999966|\n", + "| Manhattan| Queens| 18| 2.130339321357284|\n", + "| Manhattan| Queens| 19| 1.8387186629526464|\n", + "| Manhattan| Queens| 20| 1.0089171974522302|\n", + "| Manhattan| Queens| 21| 0.8297852760736203|\n", + "| Manhattan| Queens| 22| 0.6545454545454548|\n", + "| Manhattan| Queens| 23| 0.5005434782608698|\n", + "| Queens| Bronx| 0| 4.547368421052631|\n", + "| Queens| Bronx| 1| 2.9999999999999996|\n", + "| Queens| Bronx| 2| 2.742857142857143|\n", + "| Queens| Bronx| 3| 2.8799999999999994|\n", + "| Queens| Bronx| 4| 3.1999999999999997|\n", + "| Queens| Bronx| 5| 3.2842105263157886|\n", + "| Queens| Bronx| 6| 3.1999999999999997|\n", + "| Queens| Bronx| 7| 3.519999999999999|\n", + "| Queens| Bronx| 8| 3.756521739130434|\n", + "| Queens| 
Bronx| 9| 4.799999999999999|\n", + "| Queens| Bronx| 10| 4.26611111111111|\n", + "| Queens| Bronx| 11| 3.899999999999999|\n", + "| Queens| Bronx| 12| 4.8|\n", + "| Queens| Bronx| 13| 4.499999999999999|\n", + "| Queens| Bronx| 14| 4.718749999999999|\n", + "| Queens| Bronx| 15| 4.669999999999999|\n", + "| Queens| Bronx| 16| 4.114285714285713|\n", + "| Queens| Bronx| 17| 4.799999999999998|\n", + "| Queens| Bronx| 18| 4.44090909090909|\n", + "| Queens| Bronx| 19| 4.235294117647058|\n", + "| Queens| Bronx| 20| 4.457142857142856|\n", + "| Queens| Bronx| 21| 4.199999999999999|\n", + "| Queens| Bronx| 22| 4.477272727272726|\n", + "| Queens| Bronx| 23| 4.319999999999999|\n", + "| Queens| Brooklyn| 0| 0.0|\n", + "| Queens| Brooklyn| 1| 0.0|\n", + "| Queens| Brooklyn| 2| 0.0|\n", + "| Queens| Brooklyn| 3| 0.0|\n", + "| Queens| Brooklyn| 4| 0.0|\n", + "| Queens| Brooklyn| 5| 0.0|\n", + "| Queens| Brooklyn| 6| 0.0|\n", + "| Queens| Brooklyn| 7| 0.0|\n", + "| Queens| Brooklyn| 8| 0.0|\n", + "| Queens| Brooklyn| 9| 0.0|\n", + "| Queens| Brooklyn| 10| 0.0|\n", + "| Queens| Brooklyn| 11| 0.0|\n", + "| Queens| Brooklyn| 12| 0.0|\n", + "| Queens| Brooklyn| 13| 0.0|\n", + "| Queens| Brooklyn| 14| 0.0|\n", + "| Queens| Brooklyn| 15| 0.0|\n", + "| Queens| Brooklyn| 16| 0.0|\n", + "| Queens| Brooklyn| 17| 0.0|\n", + "| Queens| Brooklyn| 18| 0.01846153846153846|\n", + "| Queens| Brooklyn| 19| 0.0|\n", + "| Queens| Brooklyn| 20| 0.0|\n", + "| Queens| Brooklyn| 21| 0.0|\n", + "| Queens| Brooklyn| 22| 0.0|\n", + "| Queens| Brooklyn| 23|0.019433198380566803|\n", + "| Queens| Manhattan| 0| 1.9786259541984754|\n", + "| Queens| Manhattan| 1| 0.9882352941176481|\n", + "| Queens| Manhattan| 2| 0.6832740213523135|\n", + "| Queens| Manhattan| 3| 0.672689075630252|\n", + "| Queens| Manhattan| 4| 0.8727272727272726|\n", + "| Queens| Manhattan| 5| 2.020737327188942|\n", + "| Queens| Manhattan| 6| 1.513492063492065|\n", + "| Queens| Manhattan| 7| 2.2232824427480935|\n", + "| Queens| Manhattan| 8| 
2.3165217391304362|\n", + "| Queens| Manhattan| 9| 2.2579770992366432|\n", + "| Queens| Manhattan| 10| 2.782300884955749|\n", + "| Queens| Manhattan| 11| 3.039658848614068|\n", + "| Queens| Manhattan| 12| 3.084337349397588|\n", + "| Queens| Manhattan| 13| 3.301075268817201|\n", + "| Queens| Manhattan| 14| 3.456075808249721|\n", + "| Queens| Manhattan| 15| 3.4173983739837372|\n", + "| Queens| Manhattan| 16| 3.3323693803159182|\n", + "| Queens| Manhattan| 17| 3.358361774744028|\n", + "| Queens| Manhattan| 18| 3.2230088495575226|\n", + "| Queens| Manhattan| 19| 3.1127427184466017|\n", + "| Queens| Manhattan| 20| 3.1380410022779053|\n", + "| Queens| Manhattan| 21| 3.19478935698448|\n", + "| Queens| Manhattan| 22| 3.0503001200480195|\n", + "| Queens| Manhattan| 23| 2.954719764011798|\n", + "| Queens| Queens| 0| 0.03692307692307692|\n", + "| Queens| Queens| 1|0.010015174506828527|\n", + "| Queens| Queens| 2|0.012598425196850394|\n", + "| Queens| Queens| 3|0.005755395683453...|\n", + "| Queens| Queens| 4| 0.07384937238493722|\n", + "| Queens| Queens| 5| 0.04725274725274725|\n", + "| Queens| Queens| 6| 0.08010471204188482|\n", + "| Queens| Queens| 7| 0.096|\n", + "| Queens| Queens| 8| 0.07384615384615384|\n", + "| Queens| Queens| 9| 0.13531746031746034|\n", + "| Queens| Queens| 10| 0.14015444015444017|\n", + "| Queens| Queens| 11| 0.16809338521400777|\n", + "| Queens| Queens| 12| 0.09125475285171103|\n", + "| Queens| Queens| 13| 0.12818991097922847|\n", + "| Queens| Queens| 14| 0.15837563451776646|\n", + "| Queens| Queens| 15| 0.16179775280898873|\n", + "| Queens| Queens| 16| 0.3113513513513512|\n", + "| Queens| Queens| 17| 0.21036585365853655|\n", + "| Queens| Queens| 18| 0.15960451977401127|\n", + "| Queens| Queens| 19| 0.20064308681672022|\n", + "| Queens| Queens| 20| 0.12923076923076923|\n", + "| Queens| Queens| 21| 0.10892307692307691|\n", + "| Queens| Queens| 22|0.061224489795918366|\n", + "| Queens| Queens| 23| 0.07164179104477612|\n", + 
"+--------------+---------------+----+--------------------+\n", + "\n" + ] + }, + { + "name": "stderr", + "output_type": "stream", + "text": [ + " \r" + ] + } + ], + "source": [ + "boroughs_ex5 = [\"Manhattan\", \"Bronx\", \"Brooklyn\", \"Queens\"]\n", + "\n", + "df_ex5 = df_with_bor \\\n", + " .where((isin(df_with_bor.pickup_borough, boroughs_ex5)) & (isin(df_with_bor.dropoff_borough, boroughs_ex5))) \\\n", + " .withColumn(\"hour\", F.hour(F.from_utc_timestamp(F.col(\"pickup_datetime\"), 'UTC'))) \\\n", + " .groupBy(\"pickup_borough\", \"dropoff_borough\", \"hour\") \\\n", + " .agg(F.mean(F.col('tolls_amount')).alias('mean_tolls_amount')) \\\n", + " .select(F.col('pickup_borough'), F.col('dropoff_borough'), F.col('hour'), F.col('mean_tolls_amount')) \\\n", + " .orderBy(\"pickup_borough\", \"dropoff_borough\", \"hour\") \n", + "\n", + "df_ex5.show(25 * 24 * 2)" + ] + }, + { + "cell_type": "markdown", + "id": "884b4cf9", + "metadata": {}, + "source": [ + "### Exercise 6\n", + "Create a dataframe that for each district shows the shortest and longest `trip_distance` starting and ending in the same district. What is the length of the longest and shortest trips in Manhattan?" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "0aa8d795", + "metadata": {}, + "outputs": [], + "source": [] + }, + { + "cell_type": "markdown", + "id": "756da7e4", + "metadata": {}, + "source": [ + "### Exercise 7\n", + "Consider only the trips _within_ districts. What are the first and second-most expensive\n", + "trips - based on `total_amount` - in every district?" 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "ca83556d", + "metadata": {}, + "outputs": [], + "source": [] + }, + { + "cell_type": "markdown", + "id": "4f1e0800", + "metadata": {}, + "source": [ + "### Exercise 8\n", + "Create a dataframe where each row represents a driver, and there is one column per district.\n", + "For each driver-district, the dataframe provides the maximum number of consecutive trips\n", + "for the given driver, within the given district. \n", + "\n", + "For example, if for driver A we have (sorted by time):\n", + "- Trip 1: Bronx → Bronx\n", + "- Trip 2: Bronx → Bronx\n", + "- Trip 3: Bronx → Manhattan\n", + "- Trip 4: Manhattan → Bronx.\n", + " \n", + "The maximum number of consecutive trips for Bronx is 2." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "edde38bb", + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.11.3" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +} diff --git a/Assignment3/requirements.txt b/Assignment3/requirements.txt new file mode 100644 index 0000000..1dd3cea --- /dev/null +++ b/Assignment3/requirements.txt @@ -0,0 +1,4 @@ +jupyterlab==4.0.1 +pyspark==3.4.0 +shapely==2.0.1 +bokeh==3.1.1