2023-05-30 15:52:00 +00:00
{
"cells": [
{
"cell_type": "markdown",
"id": "23b48f71",
"metadata": {},
"source": [
"# S&DE Atelier - Visual Analytics\n",
"\n",
"# Assignment 3\n",
"\n",
"**Due** June 2, 2023 @23:55\n",
"\n",
"**Contacts**: [marco.dambros@usi.ch](mailto:marco.dambros@usi.ch) - [carmen.armenti@usi.ch](mailto:carmen.armenti@usi.ch)\n",
"\n",
"---\n",
"\n",
"The goal of this assignment is to use Spark in Jupyter notebooks (PySpark). The files `trip_data.csv`, `trip_fare.csv` and `nyc_boroughs.geojson` can be found in the following folder: [Assignment3-data](https://usi365-my.sharepoint.com/:f:/g/personal/armenc_usi_ch/Ejp7sb8QAMROoWe0XUDcAkMBoqUFk-w2Vgroup025NhAww?e=TFG5CD). You should clean the data if needed. \n",
"\n",
"Note that you can use Spark [window functions](https://spark.apache.org/docs/latest/sql-ref-syntax-qry-select-window.html) whenever applicable. \n",
"\n",
"Please name your file as `SurnameName_Assignment3.ipynb`."
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "9f434eb8",
"metadata": {},
"outputs": [],
"source": [
"# Import the basic spark library\n",
"from pyspark.sql import SparkSession\n",
"from pyspark.sql.functions import col\n",
"from math import pi\n",
"from bokeh.models import BasicTicker, PrintfTickFormatter\n",
"from bokeh.plotting import figure, show\n",
"from bokeh.transform import linear_cmap\n",
"from pyspark.sql import types as T\n",
"from pyspark.sql import functions as F\n",
"from pyspark.sql import Window\n",
"from shapely.geometry import Polygon, Point\n",
"from typing import Tuple, List\n",
"import pandas as pd\n",
"import seaborn as sns\n",
"import matplotlib.pyplot as plt\n",
"import matplotlib as mpl"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "b9a87a5c",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"Setting default log level to \"WARN\".\n",
"To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).\n",
"23/05/31 16:53:08 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable\n"
]
}
],
"source": [
"# Create an entry point to the PySpark Application\n",
"spark = SparkSession.builder \\\n",
" .config(\"spark.driver.bindAddress\", \"127.0.0.1\") \\\n",
" .config(\"spark.driver.memory\", \"16g\") \\\n",
" .config(\"spark.executor.memory\", \"16g\") \\\n",
" .config(\"spark.executor.cores\", \"4\") \\\n",
" .config(\"spark.executor.memory\", \"16g\") \\\n",
" .master(\"local\") \\\n",
" .appName(\"MaggioniClaudio_Assignment3\") \\\n",
" .getOrCreate()"
]
},
{
"cell_type": "markdown",
"id": "536a6cc4",
"metadata": {},
"source": [
"### Exercise 1\n",
"Join the `trip_data` and `trip_fare` dataframes into one and consider only data on 2013-01-01."
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "9fc094c8",
"metadata": {},
"outputs": [],
"source": [
"def sanitize_column_names(df):\n",
" for original, renamed in [(x, x.strip().replace(\" \", \"_\"),) for x in df.columns]:\n",
" df = df.withColumnRenamed(original, renamed)\n",
" return df"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "afe8000d",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
" \r"
]
}
],
"source": [
"df_trip_data = spark.read \\\n",
" .option(\"header\", True) \\\n",
" .csv(\"data/trip_data.csv\", inferSchema=True)\n",
"\n",
"df_trip_data = sanitize_column_names(df_trip_data)"
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "4dfe92f6",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
" \r"
]
}
],
"source": [
"df_trip_fare = spark.read \\\n",
" .option(\"header\", True) \\\n",
" .csv(\"data/trip_fare.csv\", inferSchema=True)\n",
"\n",
"df_trip_fare = sanitize_column_names(df_trip_fare)"
]
},
{
"cell_type": "code",
"execution_count": 6,
"id": "d76abc83",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"+--------------------+--------------------+---------+---------+------------------+-------------------+-------------------+---------------+-----------------+-------------+----------------+---------------+-----------------+----------------+\n",
"| medallion| hack_license|vendor_id|rate_code|store_and_fwd_flag| pickup_datetime| dropoff_datetime|passenger_count|trip_time_in_secs|trip_distance|pickup_longitude|pickup_latitude|dropoff_longitude|dropoff_latitude|\n",
"+--------------------+--------------------+---------+---------+------------------+-------------------+-------------------+---------------+-----------------+-------------+----------------+---------------+-----------------+----------------+\n",
"|89D227B655E5C82AE...|BA96DE419E711691B...| CMT| 1| N|2013-01-01 15:11:48|2013-01-01 15:18:10| 4| 382| 1.0| -73.978165| 40.757977| -73.989838| 40.751171|\n",
"|0BD7C8F5BA12B88E0...|9FD8F69F0804BDB55...| CMT| 1| N|2013-01-06 00:18:35|2013-01-06 00:22:54| 1| 259| 1.5| -74.006683| 40.731781| -73.994499| 40.75066|\n",
"|0BD7C8F5BA12B88E0...|9FD8F69F0804BDB55...| CMT| 1| N|2013-01-05 18:49:41|2013-01-05 18:54:23| 1| 282| 1.1| -74.004707| 40.73777| -74.009834| 40.726002|\n",
"|DFD2202EE08F7A8DC...|51EE87E3205C985EF...| CMT| 1| N|2013-01-07 23:54:15|2013-01-07 23:58:20| 2| 244| 0.7| -73.974602| 40.759945| -73.984734| 40.759388|\n",
"|DFD2202EE08F7A8DC...|51EE87E3205C985EF...| CMT| 1| N|2013-01-07 23:25:03|2013-01-07 23:34:24| 1| 560| 2.1| -73.97625| 40.748528| -74.002586| 40.747868|\n",
"|20D9ECB2CA0767CF7...|598CCE5B9C1918568...| CMT| 1| N|2013-01-07 15:27:48|2013-01-07 15:38:37| 1| 648| 1.7| -73.966743| 40.764252| -73.983322| 40.743763|\n",
"|496644932DF393260...|513189AD756FF14FE...| CMT| 1| N|2013-01-08 11:01:15|2013-01-08 11:08:14| 1| 418| 0.8| -73.995804| 40.743977| -74.007416| 40.744343|\n",
"|0B57B9633A2FECD3D...|CCD4367B417ED6634...| CMT| 1| N|2013-01-07 12:39:18|2013-01-07 13:10:56| 3| 1898| 10.7| -73.989937| 40.756775| -73.86525| 40.77063|\n",
"|2C0E91FF20A856C89...|1DA2F6543A62B8ED9...| CMT| 1| N|2013-01-07 18:15:47|2013-01-07 18:20:47| 1| 299| 0.8| -73.980072| 40.743137| -73.982712| 40.735336|\n",
"|2D4B95E2FA7B2E851...|CD2F522EEE1FF5F5A...| CMT| 1| N|2013-01-07 15:33:28|2013-01-07 15:49:26| 2| 957| 2.5| -73.977936| 40.786983| -73.952919| 40.80637|\n",
"|E12F6AF991172EAC3...|06918214E951FA000...| CMT| 1| N|2013-01-08 13:11:52|2013-01-08 13:19:50| 1| 477| 1.3| -73.982452| 40.773167| -73.964134| 40.773815|\n",
"|E12F6AF991172EAC3...|06918214E951FA000...| CMT| 1| N|2013-01-08 09:50:05|2013-01-08 10:02:54| 1| 768| 0.7| -73.99556| 40.749294| -73.988686| 40.759052|\n",
"|78FFD9CD0CDA541F3...|E949C583ECF62C8F0...| CMT| 1| N|2013-01-10 12:07:08|2013-01-10 12:17:29| 1| 620| 2.3| -73.971497| 40.791321| -73.964478| 40.775921|\n",
"|237F49C3ECC11F502...|93C363DDF8ED9385D...| CMT| 1| N|2013-01-07 07:35:47|2013-01-07 07:46:00| 1| 612| 2.3| -73.98851| 40.774307| -73.981094| 40.755325|\n",
"|3349F919AA8AE5DC9...|7CE849FEF67514F08...| CMT| 1| N|2013-01-10 15:42:29|2013-01-10 16:04:02| 1| 1293| 3.2| -73.994911| 40.723221| -73.971558| 40.761612|\n",
"|3349F919AA8AE5DC9...|7CE849FEF67514F08...| CMT| 1| N|2013-01-10 14:27:28|2013-01-10 14:45:21| 1| 1073| 4.4| -74.010391| 40.708702| -73.987846| 40.756104|\n",
"|4C005EEBAA7BF26B8...|351BE7D984BE17DB2...| CMT| 1| N|2013-01-07 22:09:59|2013-01-07 22:19:50| 1| 591| 1.7| -73.973732| 40.756287| -73.998413| 40.756832|\n",
"|7D99C30FCE69B1A9D...|460C3F57DD9CB2265...| CMT| 1| N|2013-01-07 17:18:16|2013-01-07 17:20:55| 1| 158| 0.7| -73.968925| 40.767704| -73.96199| 40.776566|\n",
"|E6FBF80668FE0611A...|36773E80775F26CD1...| CMT| 1| N|2013-01-07 06:08:51|2013-01-07 06:13:14| 1| 262| 1.7| -73.96212| 40.769737| -73.979561| 40.75539|\n",
"|0C5296F3C8B16E702...|D2363240A9295EF57...| CMT| 1| N|2013-01-07 22:25:46|2013-01-07 22:36:56| 1| 669| 2.3| -73.989708| 40.756714| -73.977615| 40.787575|\n",
"+--------------------+--------------------+---------+---------+------------------+-------------------+-------------------+---------------+-----------------+-------------+----------------+---------------+-----------------+----------------+\n",
"only showing top 20 rows\n",
"\n"
]
}
],
"source": [
"df_trip_data.show()"
]
},
{
"cell_type": "code",
"execution_count": 7,
"id": "3c7ccbd4",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"+--------------------+--------------------+---------+-------------------+------------+-----------+---------+-------+----------+------------+------------+\n",
"| medallion| hack_license|vendor_id| pickup_datetime|payment_type|fare_amount|surcharge|mta_tax|tip_amount|tolls_amount|total_amount|\n",
"+--------------------+--------------------+---------+-------------------+------------+-----------+---------+-------+----------+------------+------------+\n",
"|89D227B655E5C82AE...|BA96DE419E711691B...| CMT|2013-01-01 15:11:48| CSH| 6.5| 0.0| 0.5| 0.0| 0.0| 7.0|\n",
"|0BD7C8F5BA12B88E0...|9FD8F69F0804BDB55...| CMT|2013-01-06 00:18:35| CSH| 6.0| 0.5| 0.5| 0.0| 0.0| 7.0|\n",
"|0BD7C8F5BA12B88E0...|9FD8F69F0804BDB55...| CMT|2013-01-05 18:49:41| CSH| 5.5| 1.0| 0.5| 0.0| 0.0| 7.0|\n",
"|DFD2202EE08F7A8DC...|51EE87E3205C985EF...| CMT|2013-01-07 23:54:15| CSH| 5.0| 0.5| 0.5| 0.0| 0.0| 6.0|\n",
"|DFD2202EE08F7A8DC...|51EE87E3205C985EF...| CMT|2013-01-07 23:25:03| CSH| 9.5| 0.5| 0.5| 0.0| 0.0| 10.5|\n",
"|20D9ECB2CA0767CF7...|598CCE5B9C1918568...| CMT|2013-01-07 15:27:48| CSH| 9.5| 0.0| 0.5| 0.0| 0.0| 10.0|\n",
"|496644932DF393260...|513189AD756FF14FE...| CMT|2013-01-08 11:01:15| CSH| 6.0| 0.0| 0.5| 0.0| 0.0| 6.5|\n",
"|0B57B9633A2FECD3D...|CCD4367B417ED6634...| CMT|2013-01-07 12:39:18| CSH| 34.0| 0.0| 0.5| 0.0| 4.8| 39.3|\n",
"|2C0E91FF20A856C89...|1DA2F6543A62B8ED9...| CMT|2013-01-07 18:15:47| CSH| 5.5| 1.0| 0.5| 0.0| 0.0| 7.0|\n",
"|2D4B95E2FA7B2E851...|CD2F522EEE1FF5F5A...| CMT|2013-01-07 15:33:28| CSH| 13.0| 0.0| 0.5| 0.0| 0.0| 13.5|\n",
"|E12F6AF991172EAC3...|06918214E951FA000...| CMT|2013-01-08 13:11:52| CSH| 7.5| 0.0| 0.5| 0.0| 0.0| 8.0|\n",
"|E12F6AF991172EAC3...|06918214E951FA000...| CMT|2013-01-08 09:50:05| CSH| 9.0| 0.0| 0.5| 0.0| 0.0| 9.5|\n",
"|78FFD9CD0CDA541F3...|E949C583ECF62C8F0...| CMT|2013-01-10 12:07:08| CSH| 9.5| 0.0| 0.5| 0.0| 0.0| 10.0|\n",
"|237F49C3ECC11F502...|93C363DDF8ED9385D...| CMT|2013-01-07 07:35:47| CSH| 10.0| 0.0| 0.5| 0.0| 0.0| 10.5|\n",
"|3349F919AA8AE5DC9...|7CE849FEF67514F08...| CMT|2013-01-10 15:42:29| CSH| 15.5| 0.0| 0.5| 0.0| 0.0| 16.0|\n",
"|3349F919AA8AE5DC9...|7CE849FEF67514F08...| CMT|2013-01-10 14:27:28| CSH| 16.5| 0.0| 0.5| 0.0| 0.0| 17.0|\n",
"|4C005EEBAA7BF26B8...|351BE7D984BE17DB2...| CMT|2013-01-07 22:09:59| CSH| 9.0| 0.5| 0.5| 0.0| 0.0| 10.0|\n",
"|7D99C30FCE69B1A9D...|460C3F57DD9CB2265...| CMT|2013-01-07 17:18:16| CSH| 4.5| 1.0| 0.5| 0.0| 0.0| 6.0|\n",
"|E6FBF80668FE0611A...|36773E80775F26CD1...| CMT|2013-01-07 06:08:51| CSH| 7.0| 0.0| 0.5| 0.0| 0.0| 7.5|\n",
"|0C5296F3C8B16E702...|D2363240A9295EF57...| CMT|2013-01-07 22:25:46| CSH| 10.5| 0.5| 0.5| 0.0| 0.0| 11.5|\n",
"+--------------------+--------------------+---------+-------------------+------------+-----------+---------+-------+----------+------------+------------+\n",
"only showing top 20 rows\n",
"\n"
]
}
],
"source": [
"df_trip_fare.show()"
]
},
{
"cell_type": "code",
"execution_count": 8,
"id": "61e21d2a",
"metadata": {},
"outputs": [],
"source": [
"df_left = df_trip_data.filter(col('pickup_datetime').startswith(\"2013-01-01 \"))\n",
"df_right = df_trip_fare.filter(col('pickup_datetime').startswith(\"2013-01-01 \"))\n",
"\n",
"df_joined = df_left.join(df_right, ['medallion', 'pickup_datetime']).cache()"
]
},
{
"cell_type": "code",
"execution_count": 9,
"id": "d73ab313",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"[Stage 7:====================================================> (12 + 1) / 13]\r"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"+--------------------+-------------------+--------------------+---------+---------+------------------+-------------------+---------------+-----------------+-------------+----------------+---------------+-----------------+----------------+--------------------+---------+------------+-----------+---------+-------+----------+------------+------------+\n",
"| medallion| pickup_datetime| hack_license|vendor_id|rate_code|store_and_fwd_flag| dropoff_datetime|passenger_count|trip_time_in_secs|trip_distance|pickup_longitude|pickup_latitude|dropoff_longitude|dropoff_latitude| hack_license|vendor_id|payment_type|fare_amount|surcharge|mta_tax|tip_amount|tolls_amount|total_amount|\n",
"+--------------------+-------------------+--------------------+---------+---------+------------------+-------------------+---------------+-----------------+-------------+----------------+---------------+-----------------+----------------+--------------------+---------+------------+-----------+---------+-------+----------+------------+------------+\n",
"|000318C2E3E638158...|2013-01-01 20:46:00|91CE3B3A2F548CD8A...| VTS| 1| null|2013-01-01 20:56:00| 5| 600| 1.35| -73.989677| 40.756554| -73.970673| 40.752541|91CE3B3A2F548CD8A...| VTS| CRD| 8.5| 0.5| 0.5| 1.8| 0.0| 11.3|\n",
"|00790C7BAD30B7A9E...|2013-01-01 04:26:00|3EF1ED607505C991D...| VTS| 1| null|2013-01-01 04:59:00| 1| 1980| 10.99| -73.996811| 40.716587| -73.949448| 40.827671|3EF1ED607505C991D...| VTS| CRD| 36.5| 0.5| 0.5| 9.25| 0.0| 46.75|\n",
"|00A1EA0E8CD47CE24...|2013-01-01 06:09:50|4FD770C068437BBA9...| CMT| 1| N|2013-01-01 06:29:03| 1| 1153| 5.8| -73.89653| 40.759472| -73.952698| 40.780788|4FD770C068437BBA9...| CMT| CRD| 20.5| 0.0| 0.5| 4.0| 0.0| 25.0|\n",
"+--------------------+-------------------+--------------------+---------+---------+------------------+-------------------+---------------+-----------------+-------------+----------------+---------------+-----------------+----------------+--------------------+---------+------------+-----------+---------+-------+----------+------------+------------+\n",
"only showing top 3 rows\n",
"\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
" \r"
]
}
],
"source": [
"df_joined.show(3)"
]
},
{
"cell_type": "markdown",
"id": "5f246287",
"metadata": {},
"source": [
"### Exercise 2\n",
"Consider only Manhattan, Bronx and Brooklyn districts. Then create a dataframe that shows the total number of trips *within* the same district and *across* all the other districts mentioned before.\n",
"\n",
"For example, for Manhattan borough you should consider the total number of the following trips:\n",
"- Manhattan → Manhattan\n",
"- Manhattan → Brooklyn\n",
"- Manhattan → Bronx\n",
"\n",
"You should then do the same for Bronx and Brooklyn boroughs."
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "97e35f13",
"metadata": {},
"outputs": [],
"source": [
"df_boroughs = spark.read \\\n",
" .option(\"multiline\", \"true\") \\\n",
" .json(r'data/nyc-boroughs.geojson')\n",
"\n",
"df_boroughs = df_boroughs.select(F.explode(df_boroughs.features).alias(\"feature\"))\n",
"\n",
"boroughs_list = df_boroughs.select( \\\n",
" df_boroughs.feature.properties.borough.alias(\"borough\"), \\\n",
" df_boroughs.feature.geometry.coordinates.alias(\"coordinates\")).collect()\n",
"\n",
"boroughs_list: list[tuple[str, list[Polygon]]] = \\\n",
" [(r.borough, [Polygon(shell=p) for p in r.coordinates]) for r in boroughs_list]\n",
"\n",
"@F.udf(returnType=T.StringType())\n",
"def get_borough(lon: float, lat: float) -> bool:\n",
" global boroughs_list\n",
"\n",
" if lon is None or lat is None:\n",
" return None\n",
"\n",
" point = Point(lon, lat)\n",
" \n",
" for b in boroughs_list:\n",
" for p in b[1]:\n",
" if p.contains(point):\n",
" return b[0]\n",
" return None"
]
},
{
"cell_type": "code",
"execution_count": 11,
"id": "b12aa2ec",
"metadata": {},
"outputs": [],
"source": [
"# use UDF as join condition\n",
"df_with_bor = df_joined \\\n",
" .withColumn(\"pickup_borough\", get_borough(\"pickup_longitude\", \"pickup_latitude\")) \\\n",
" .withColumn(\"dropoff_borough\", get_borough(\"dropoff_longitude\", \"dropoff_latitude\")) \\\n",
" .cache()"
]
},
{
"cell_type": "code",
"execution_count": 12,
"id": "9c14ad76-388a-454a-96c0-bf38765ce0dd",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"[Stage 13:=====================================================>(199 + 1) / 200]\r"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"+--------------+---------------+------+\n",
"|pickup_borough|dropoff_borough|count |\n",
"+--------------+---------------+------+\n",
"|Bronx |Bronx |487 |\n",
"|Bronx |Brooklyn |6 |\n",
"|Bronx |Manhattan |284 |\n",
"|Brooklyn |Bronx |57 |\n",
"|Brooklyn |Brooklyn |10454 |\n",
"|Brooklyn |Manhattan |6408 |\n",
"|Manhattan |Bronx |2779 |\n",
"|Manhattan |Brooklyn |14396 |\n",
"|Manhattan |Manhattan |319706|\n",
"+--------------+---------------+------+\n",
"\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
" \r"
]
}
],
"source": [
"def isin(var, values):\n",
" cond = (var == values[0])\n",
" for i in range(0, len(values)):\n",
" cond = cond | (var == values[i])\n",
" return cond\n",
"\n",
"boroughs = [\"Manhattan\", \"Bronx\", \"Brooklyn\"]\n",
"df_ex2 = df_with_bor \\\n",
" .where((isin(df_with_bor.pickup_borough, boroughs)) & (isin(df_with_bor.dropoff_borough, boroughs))) \\\n",
" .groupBy(\"pickup_borough\", \"dropoff_borough\") \\\n",
" .count() \\\n",
" .orderBy(\"pickup_borough\", \"dropoff_borough\")\n",
"df_ex2.show(truncate=False)"
]
},
{
"cell_type": "markdown",
"id": "21bd4ac8",
"metadata": {},
"source": [
"### Exercise 3\n",
"Imagine you are a taxi driver and one day you can work only two hours. Assume the data is representative of a typical working day. Which hours of the day - retrieved from `pickup_datetime` - would you choose to work based on the fare and tip amount?"
]
},
{
"cell_type": "code",
"execution_count": 13,
"id": "46d191e1-fd13-4de3-8851-5e10a7319286",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"[Stage 20:=================================================> (184 + 1) / 200]\r"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"+-----------+------------------+\n",
"|pickup_hour|fare_and_tip_total|\n",
"+-----------+------------------+\n",
"| 1| 453700.23|\n",
"| 2| 418415.82|\n",
"| 0| 390741.27|\n",
"| 3| 367018.78|\n",
"| 14| 286852.68|\n",
"| 15| 278953.43|\n",
"| 4| 272856.05|\n",
"| 18| 269648.14|\n",
"| 13| 263915.72|\n",
"| 17| 258134.56|\n",
"| 16| 246552.73|\n",
"| 12| 238716.32|\n",
"| 19| 234377.86|\n",
"| 20| 211402.98|\n",
"| 21| 208110.83|\n",
"| 22| 204481.56|\n",
"| 11| 194952.87|\n",
"| 5| 180075.5|\n",
"| 23| 158957.41|\n",
"| 10| 146400.51|\n",
"| 6| 135810.97|\n",
"| 7| 118466.26|\n",
"| 9| 111925.58|\n",
"| 8| 99021.68|\n",
"+-----------+------------------+\n",
"\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
" \r"
]
}
],
"source": [
"df_ex3 = df_joined.select( \\\n",
" F.hour(F.from_utc_timestamp(df_joined.pickup_datetime, 'UTC')).alias('pickup_hour'), \\\n",
" F.col(\"fare_amount\"), \\\n",
" F.col(\"tip_amount\")) \\\n",
" .groupby(\"pickup_hour\") \\\n",
" .agg(F.round(F.sum(F.col(\"fare_amount\") + F.col(\"tip_amount\")), 2).alias('fare_and_tip_total')) \\\n",
" .select(\"pickup_hour\", \"fare_and_tip_total\") \\\n",
" .sort(F.desc(\"fare_and_tip_total\"))\n",
"\n",
"df_ex3.show(24)"
]
},
{
"cell_type": "markdown",
"id": "ffbbaf04-65b5-4fc2-879f-a2a8bcc87519",
"metadata": {},
"source": [
"Given the table above I would choose to work at **1 AM** and **2 AM** as they are the most profitable hours based on total fare and tip amount. This may be the case for the chosen date `2013-01-01` because of the new year celebrations."
]
},
{
"cell_type": "markdown",
"id": "b24e0922",
"metadata": {},
"source": [
"### Exercise 4\n",
"Provide a graphical representation to compare the average fare amount for trips _within_ and _across_ all the districts. You may want to have a look at: https://docs.bokeh.org/en/latest/docs/user_guide/topics/categorical.html#heatmaps."
]
},
{
"cell_type": "code",
"execution_count": 14,
"id": "0643d9e4",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
" \r"
]
}
],
"source": [
"ex4_data = df_with_bor \\\n",
" .withColumn(\"pickup_borough\", F.coalesce(F.col(\"pickup_borough\"), F.lit(\"Unknown\"))) \\\n",
" .withColumn(\"dropoff_borough\", F.coalesce(F.col(\"dropoff_borough\"), F.lit(\"Unknown\"))) \\\n",
" .groupBy(\"pickup_borough\", \"dropoff_borough\") \\\n",
" .agg(F.mean(F.col('fare_amount')).alias('mean_fare_amount')) \\\n",
" .collect()\n",
"\n",
"df_ex4 = pd.DataFrame()\n",
"for i, row in enumerate(ex4_data):\n",
" df_ex4.loc[i, 'pickup_borough'] = row.pickup_borough\n",
" df_ex4.loc[i, 'dropoff_borough'] = row.dropoff_borough\n",
" df_ex4.loc[i, 'mean_fare'] = row.mean_fare_amount"
]
},
{
"cell_type": "code",
"execution_count": 15,
"id": "2cba45e6-7ad1-4044-b9f0-81943c1cf547",
"metadata": {},
"outputs": [],
"source": [
"pickup = list(sorted(df_ex4['pickup_borough'].unique()))\n",
"dropoff = list(reversed(sorted(df_ex4['dropoff_borough'].unique())))\n",
"\n",
"colors = [\"#75968f\", \"#a5bab7\", \"#c9d9d3\", \"#e2e2e2\", \"#dfccce\", \"#ddb7b1\", \"#cc7878\", \"#933b41\", \"#550b1d\"]\n",
"\n",
"p = figure(title=f\"Mean NYC Taxi fares on 2013-01-01\",\n",
" x_range=pickup, y_range=dropoff,\n",
" x_axis_location=\"above\", width=900, height=900,\n",
" tools=\"hover,save,pan,box_zoom,reset,wheel_zoom\", toolbar_location='below',\n",
" tooltips=[ \\\n",
" ('Pickup Borough', '@pickup_borough'), \\\n",
" ('Dropoff Borough', '@dropoff_borough'), \\\n",
" ('Average Fare Amount', '$@mean_fare')])\n",
"\n",
"p.grid.grid_line_color = None\n",
"p.axis.axis_line_color = None\n",
"p.axis.major_tick_line_color = None\n",
"p.axis.major_label_text_font_size = \"14px\"\n",
"p.axis.major_label_standoff = 0\n",
"p.xaxis.major_label_orientation = pi / 3\n",
"\n",
"r = p.rect(x=\"pickup_borough\", y=\"dropoff_borough\", width=1, height=1, source=df_ex4,\n",
" fill_color=linear_cmap(\"mean_fare\", colors, low=df_ex4.mean_fare.min(), high=df_ex4.mean_fare.max()),\n",
" line_color=None)\n",
"\n",
"p.add_layout(r.construct_color_bar(\n",
" major_label_text_font_size=\"14px\",\n",
" ticker=BasicTicker(desired_num_ticks=len(colors)),\n",
" formatter=PrintfTickFormatter(format=\"$%d\"),\n",
" label_standoff=6,\n",
" border_line_color=None,\n",
" padding=5\n",
"), 'right')\n",
"\n",
"show(p)"
]
},
{
"cell_type": "markdown",
"id": "9b4a8445",
"metadata": {},
"source": [
"### Exercise 5\n",
"Find the average amount of tolls per hour for trips within the following districts: Manhattan, Bronx, Brooklyn, Queens. Show a graphical representation of the data and report if there is any trend or peak during the day. Overall which district has the largest amount of tolls?"
]
},
{
"cell_type": "code",
"execution_count": 16,
"id": "b80cbb2d",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
" \r"
]
},
{
"data": {
"text/plain": [
"<Axes: xlabel='Hour of day of 2013-01-01', ylabel='Mean toll amount'>"
]
},
"execution_count": 16,
"metadata": {},
"output_type": "execute_result"
},
{
"data": {
"text/plain": [
"<Figure size 1800x800 with 1 Axes>"
]
},
"metadata": {},
"output_type": "display_data"
}
],
"source": [
"boroughs_ex5 = [\"Manhattan\", \"Bronx\", \"Brooklyn\", \"Queens\"]\n",
"\n",
"ex5_data = df_with_bor \\\n",
" .where((isin(df_with_bor.pickup_borough, boroughs_ex5)) & (df_with_bor.pickup_borough == df_with_bor.dropoff_borough)) \\\n",
" .withColumn(\"hour\", F.hour(F.from_utc_timestamp(F.col(\"pickup_datetime\"), 'UTC'))) \\\n",
" .groupBy(\"pickup_borough\", \"dropoff_borough\", \"hour\") \\\n",
" .agg(F.mean(F.col('tolls_amount')).alias('mean_tolls_amount')) \\\n",
" .select(F.col('pickup_borough').alias('borough'), F.col('hour'), F.col('mean_tolls_amount')) \\\n",
" .orderBy(\"borough\", \"hour\") \\\n",
" .collect()\n",
"\n",
"# Build the pandas DataFrame in one step instead of filling it row by row\n",
"df_ex5 = pd.DataFrame(\n",
"    [(row.borough, row.hour, row.mean_tolls_amount) for row in ex5_data],\n",
"    columns=['borough', 'hour', 'mean_tolls_amount'])\n",
"\n",
"# Initialize the matplotlib figure\n",
"f, ax = plt.subplots(figsize=(18, 8))\n",
"ax.set(ylabel=\"Mean toll amount\", ylim=[0, 1.5], xticks=range(24), \n",
" xlabel=\"Hour of day of 2013-01-01\")\n",
"sns.lineplot(data=df_ex5, x=\"hour\", y=\"mean_tolls_amount\", hue=\"borough\")"
]
},
{
"cell_type": "markdown",
"id": "a575ddfa-5b39-4871-ad02-e81439eb13e6",
"metadata": {},
"source": [
"For trips within _Bronx_, there are several toll amount peaks, namely in decreasing order of magnitude between 10 AM and 11 AM, at 5 PM, at 8 PM and at 5 AM. Trips within _Queens_ show a steady toll amount increase peaking at 4 PM and then decreasing again. "
]
},
{
"cell_type": "code",
"execution_count": 17,
"id": "d0153a6e-3da5-49a2-8772-488c7d364ac2",
"metadata": {},
"outputs": [
{
"data": {
"text/html": [
"<div>\n",
"<style scoped>\n",
" .dataframe tbody tr th:only-of-type {\n",
" vertical-align: middle;\n",
" }\n",
"\n",
" .dataframe tbody tr th {\n",
" vertical-align: top;\n",
" }\n",
"\n",
" .dataframe thead th {\n",
" text-align: right;\n",
" }\n",
"</style>\n",
"<table border=\"1\" class=\"dataframe\">\n",
" <thead>\n",
" <tr style=\"text-align: right;\">\n",
" <th></th>\n",
" <th>mean_tolls_amount</th>\n",
" </tr>\n",
" <tr>\n",
" <th>borough</th>\n",
" <th></th>\n",
" </tr>\n",
" </thead>\n",
" <tbody>\n",
" <tr>\n",
" <th>Bronx</th>\n",
" <td>4.465003</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Queens</th>\n",
" <td>2.672513</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Brooklyn</th>\n",
" <td>0.417684</td>\n",
" </tr>\n",
" <tr>\n",
" <th>Manhattan</th>\n",
" <td>0.145958</td>\n",
" </tr>\n",
" </tbody>\n",
"</table>\n",
"</div>"
],
"text/plain": [
" mean_tolls_amount\n",
"borough \n",
"Bronx 4.465003\n",
"Queens 2.672513\n",
"Brooklyn 0.417684\n",
"Manhattan 0.145958"
]
},
"execution_count": 17,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"df_ex5.groupby(\"borough\").sum().loc[:, [\"mean_tolls_amount\"]].sort_values(\"mean_tolls_amount\", ascending=False)"
]
},
{
"cell_type": "markdown",
"id": "501ef279-4b61-43d5-aa33-5c20d5354bb7",
"metadata": {},
"source": [
"As shown by the table above, _Bronx_ is the borough with the overall highest toll amounts for within-borough trips on 2013-01-01."
2023-05-30 15:52:00 +00:00
]
},
{
"cell_type": "markdown",
"id": "884b4cf9",
"metadata": {},
"source": [
"### Exercise 6\n",
"Create a dataframe that for each district shows the shortest and longest `trip_distance` starting and ending in the same district. What is the length of the longest and shortest trips in Manhattan?"
]
},
{
"cell_type": "code",
2023-05-31 15:49:12 +00:00
"execution_count": 18,
2023-05-30 15:52:00 +00:00
"id": "0aa8d795",
2023-05-31 15:49:12 +00:00
"metadata": {
"scrolled": true
},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"[Stage 50:===================================================> (189 + 1) / 200]\r"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"+-------------+-----------------+-----------------+\n",
"|borough      |min_trip_distance|max_trip_distance|\n",
"+-------------+-----------------+-----------------+\n",
"|Bronx        |0.0              |20.0             |\n",
"|Brooklyn     |0.0              |80.5             |\n",
"|Manhattan    |0.0              |100.0            |\n",
"|Queens       |0.0              |98.7             |\n",
"|Staten Island|0.0              |5.7              |\n",
"+-------------+-----------------+-----------------+\n",
"\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
" \r"
]
}
],
"source": [
"# Keep trips that start and end in the same (known) borough, then aggregate per borough\n",
"df_ex6 = df_with_bor \\\n",
"    .where((df_with_bor.pickup_borough == df_with_bor.dropoff_borough) & (df_with_bor.pickup_borough.isNotNull())) \\\n",
"    .groupBy(\"pickup_borough\") \\\n",
"    .agg(F.min('trip_distance').alias('min_trip_distance'), F.max('trip_distance').alias('max_trip_distance')) \\\n",
"    .withColumnRenamed(\"pickup_borough\", \"borough\") \\\n",
"    .orderBy(\"borough\")\n",
"\n",
"df_ex6.show(truncate=False)"
]
},
{
"cell_type": "markdown",
"id": "7a903390-2ef0-45da-8d76-f992d43a53b1",
2023-05-30 15:52:00 +00:00
"metadata": {},
2023-05-31 15:49:12 +00:00
"source": [
"The shortest trip within _Manhattan_ has a `trip_distance` of $0$, while the longest has a `trip_distance` of $100$."
]
2023-05-30 15:52:00 +00:00
},
{
"cell_type": "markdown",
"id": "756da7e4",
"metadata": {},
"source": [
"### Exercise 7\n",
"Consider only the trips _within_ districts. What are the first and second-most expensive\n",
"trips - based on `total_amount` - in every district?"
]
},
{
"cell_type": "code",
2023-05-31 15:49:12 +00:00
"execution_count": 26,
2023-05-30 15:52:00 +00:00
"id": "ca83556d",
"metadata": {},
2023-05-31 15:49:12 +00:00
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"[Stage 100:==================================================> (192 + 1) / 200]\r"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"+--------------------+-------------------+--------------------+---------+---------+------------------+-------------------+---------------+-----------------+-------------+----------------+---------------+-----------------+----------------+--------------------+---------+------------+-----------+---------+-------+----------+------------+------------+-------------+---------------+----+\n",
"| medallion| pickup_datetime| hack_license|vendor_id|rate_code|store_and_fwd_flag| dropoff_datetime|passenger_count|trip_time_in_secs|trip_distance|pickup_longitude|pickup_latitude|dropoff_longitude|dropoff_latitude| hack_license|vendor_id|payment_type|fare_amount|surcharge|mta_tax|tip_amount|tolls_amount|total_amount| borough|dropoff_borough|rank|\n",
"+--------------------+-------------------+--------------------+---------+---------+------------------+-------------------+---------------+-----------------+-------------+----------------+---------------+-----------------+----------------+--------------------+---------+------------+-----------+---------+-------+----------+------------+------------+-------------+---------------+----+\n",
"|157E792C9A2041556...|2013-01-01 04:12:46|5AE2BD64DE046BC5C...| CMT| 5| N|2013-01-01 04:13:30| 3| 44| 0.0| -73.884644| 40.856674| -73.884636| 40.856693|5AE2BD64DE046BC5C...| CMT| CRD| 70.0| 0.0| 0.0| 14.0| 0.0| 84.0| Bronx| Bronx| 1|\n",
"|0984728E985ADC092...|2013-01-01 05:02:38|A3E9537FA108A49E4...| CMT| 5| N|2013-01-01 05:03:53| 2| 75| 0.2| -73.896759| 40.886013| -73.89904| 40.887779|A3E9537FA108A49E4...| CMT| CRD| 80.0| 0.0| 0.0| 0.0| 0.0| 80.0| Bronx| Bronx| 2|\n",
"|2D84EC6CD02550324...|2013-01-01 04:25:00|AC22E37790A7E433E...| VTS| 5| null|2013-01-01 04:25:00| 2| 0| 0.12| -73.939285| 40.723331| -73.939285| 40.723343|AC22E37790A7E433E...| VTS| CRD| 136.0| 0.0| 0.0| 34.0| 10.25| 180.25| Brooklyn| Brooklyn| 1|\n",
"|2A7C1AF76D40C1D22...|2013-01-01 03:56:00|7EAD01D87E93BA1E5...| VTS| 5| null|2013-01-01 03:56:00| 1| 0| 0.0| -73.983307| 40.679096| -73.983307| 40.6791|7EAD01D87E93BA1E5...| VTS| CRD| 100.0| 0.0| 0.0| 20.0| 10.25| 130.25| Brooklyn| Brooklyn| 2|\n",
"|152CBE18BB178155B...|2013-01-01 03:59:34|46B7AEDD5C8ECFF1E...| CMT| 5| N|2013-01-01 04:00:41| 3| 66| 0.0| -73.976433| 40.746506| -73.976433| 40.746506|46B7AEDD5C8ECFF1E...| CMT| DIS| 500.0| 0.0| 0.0| 0.0| 0.0| 500.0| Manhattan| Manhattan| 1|\n",
"|152CBE18BB178155B...|2013-01-01 04:03:32|46B7AEDD5C8ECFF1E...| CMT| 5| N|2013-01-01 04:04:51| 1| 79| 0.0| -73.976433| 40.746506| -73.976433| 40.746506|46B7AEDD5C8ECFF1E...| CMT| CSH| 475.0| 0.0| 0.0| 0.0| 0.0| 475.0| Manhattan| Manhattan| 2|\n",
"|FA189EABBB4058AC0...|2013-01-01 11:10:12|4E557EC0844425C75...| CMT| 5| N|2013-01-01 11:13:34| 2| 202| 2.2| -73.875206| 40.773304| -73.912048| 40.769394|4E557EC0844425C75...| CMT| CRD| 123.0| 0.0| 0.0| 15.0| 0.0| 138.0| Queens| Queens| 1|\n",
"|6B22AE697469CEA3D...|2013-01-01 03:53:00|52169A073CB4E5B1D...| VTS| 1| null|2013-01-01 04:35:00| 2| 2520| 17.24| -73.902206| 40.775982| -73.769829| 40.778721|52169A073CB4E5B1D...| VTS| CRD| 52.5| 0.5| 0.5| 62.5| 0.0| 116.0| Queens| Queens| 2|\n",
"|25C8D6B5EFFDE4FA5...|2013-01-01 02:41:48|70CD78D1142589EF0...| CMT| 5| N|2013-01-01 02:42:22| 1| 33| 0.0| -74.092506| 40.594997| -74.092484| 40.595036|70CD78D1142589EF0...| CMT| CRD| 73.0| 0.0| 0.0| 18.25| 0.0| 91.25|Staten Island| Staten Island| 1|\n",
"|B0B78CD05C8A1737E...|2013-01-01 03:42:19|B104BA3D279D230A0...| CMT| 5| N|2013-01-01 03:43:38| 2| 78| 0.0| -74.149422| 40.612503| -74.149399| 40.61248|B104BA3D279D230A0...| CMT| CRD| 89.6| 0.0| 0.0| 0.0| 0.0| 89.6|Staten Island| Staten Island| 2|\n",
"+--------------------+-------------------+--------------------+---------+---------+------------------+-------------------+---------------+-----------------+-------------+----------------+---------------+-----------------+----------------+--------------------+---------+------------+-----------+---------+-------+----------+------------+------------+-------------+---------------+----+\n",
"\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
" \r"
]
}
],
"source": [
"# Rank trips within each borough by total_amount, most expensive first\n",
"w = Window.partitionBy(\"borough\").orderBy(F.col(\"total_amount\").desc())\n",
"\n",
"# Keep within-borough trips only, then retain the two top-ranked trips per borough\n",
"df_ex7 = df_with_bor \\\n",
"    .where((df_with_bor.pickup_borough == df_with_bor.dropoff_borough) & (df_with_bor.pickup_borough.isNotNull())) \\\n",
"    .withColumnRenamed(\"pickup_borough\", \"borough\") \\\n",
"    .withColumn(\"rank\", F.rank().over(w)) \\\n",
"    .where(F.col(\"rank\") <= 2)\n",
"\n",
"df_ex7.show()"
]
},
{
"cell_type": "markdown",
"id": "6f88a475-1ef1-4b5d-829c-fb33e4b71d76",
"metadata": {},
"source": [
"The dataframe above lists, for each district, the most expensive (`rank` $=1$) and second-most expensive (`rank` $=2$) within-district trips by `total_amount`."
]
2023-05-30 15:52:00 +00:00
},
{
"cell_type": "markdown",
"id": "4f1e0800",
"metadata": {},
"source": [
"### Exercise 8\n",
"Create a dataframe where each row represents a driver, and there is one column per district.\n",
"For each driver-district, the dataframe provides the maximum number of consecutive trips\n",
"for the given driver, within the given district. \n",
"\n",
"For example, if for driver A we have (sorted by time):\n",
"- Trip 1: Bronx → Bronx\n",
"- Trip 2: Bronx → Bronx\n",
"- Trip 3: Bronx → Manhattan\n",
"- Trip 4: Manhattan → Bronx.\n",
" \n",
"The maximum number of consecutive trips for Bronx is 2."
]
},
{
"cell_type": "code",
2023-05-31 15:49:12 +00:00
"execution_count": 50,
2023-05-30 15:52:00 +00:00
"id": "edde38bb",
"metadata": {},
2023-05-31 15:49:12 +00:00
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"[Stage 265:> (0 + 1) / 1]\r"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"+--------------------+-----+--------+---------+------+-------------+\n",
"| medallion|Bronx|Brooklyn|Manhattan|Queens|Staten Island|\n",
"+--------------------+-----+--------+---------+------+-------------+\n",
"|35E11D9D2AE5C8A80...| 0| 1| 15| 5| 0|\n",
"|DA350783B6954CC67...| 0| 0| 27| 0| 0|\n",
"|35B2F21FAF5E53F1E...| 0| 3| 8| 3| 0|\n",
"|6695FB6E06F7D99F5...| 0| 1| 21| 0| 0|\n",
"|36372627462019376...| 0| 0| 5| 2| 0|\n",
"|EF882BDAF03D41517...| 0| 0| 17| 1| 0|\n",
"|846DFE2D59F6E76EC...| 0| 1| 25| 0| 0|\n",
"|9B69C5971F62F151B...| 0| 0| 13| 1| 0|\n",
"|0F621E366CFE63044...| 0| 1| 11| 1| 0|\n",
"|87EB479F55B88D47C...| 0| 1| 19| 0| 0|\n",
"|4EE5F2532F57F2124...| 0| 0| 14| 1| 0|\n",
"|4F4CA97166A04A455...| 0| 0| 13| 1| 0|\n",
"|DB1964B903773868E...| 0| 0| 6| 0| 0|\n",
"|B01A3E26873C4B514...| 0| 3| 20| 1| 0|\n",
"|F49F752E7E9CAAE41...| 1| 0| 8| 2| 0|\n",
"|D72C164FE66ADFFFE...| 0| 1| 20| 1| 0|\n",
"|E1BD31C1BF8DDCFCB...| 0| 2| 3| 1| 0|\n",
"|80F732B990A7E3763...| 0| 0| 13| 2| 0|\n",
"|F9B3A00E6DDCA4F8B...| 0| 0| 15| 0| 0|\n",
"|1E8EDF1C2EF489B7A...| 0| 0| 2| 1| 0|\n",
"|6AFD7E44A278CFD00...| 0| 0| 4| 1| 0|\n",
"|27E7626D5A223B479...| 0| 0| 21| 0| 0|\n",
"|DDCBE3295F4678F61...| 0| 0| 4| 0| 0|\n",
"|EB6F0753E865DA0AB...| 0| 3| 13| 1| 0|\n",
"|963BEE5F306952D20...| 0| 0| 15| 1| 0|\n",
"|ADFCF211DDD6D7885...| 0| 0| 12| 2| 0|\n",
"|BF46B95E44ED3BE1B...| 0| 0| 17| 0| 0|\n",
"|DCE32B5E6CAD1AFEB...| 0| 2| 12| 2| 0|\n",
"|7D4F34EF0A251F3A6...| 0| 4| 6| 1| 0|\n",
"|764CA5AE502C0FEC9...| 0| 1| 13| 0| 0|\n",
"|4D0A5B1BD7C0B459D...| 0| 1| 12| 1| 0|\n",
"|ED9B774735449ABBE...| 0| 2| 9| 0| 0|\n",
"|198109D0AF980C5BC...| 0| 0| 16| 0| 0|\n",
"|F0BC746C7DD8C0BC9...| 0| 1| 6| 0| 0|\n",
"|223670562219093D6...| 1| 0| 9| 0| 0|\n",
"|59DF6039EC312EE6D...| 0| 2| 7| 0| 0|\n",
"|A02946A94C960AF04...| 0| 0| 7| 0| 0|\n",
"|15162141EA7436635...| 0| 0| 9| 1| 0|\n",
"|5803D6EAD49AEAA82...| 0| 0| 16| 1| 0|\n",
"|618BB39CEEAE5E9A6...| 0| 1| 12| 1| 0|\n",
"|B9E10026AAC457AA6...| 0| 1| 8| 0| 0|\n",
"|E7C49B0A85D992BF1...| 0| 0| 28| 0| 0|\n",
"|4E8142153D6520C41...| 0| 0| 14| 0| 0|\n",
"|72EAFBA3FB9F0507C...| 0| 1| 11| 0| 0|\n",
"|7550D0BD520A691EC...| 0| 0| 7| 0| 0|\n",
"|A5A2F3BDEA888D6A7...| 0| 0| 7| 2| 0|\n",
"|7F82F9083BCBA1011...| 0| 1| 6| 0| 0|\n",
"|586D9BD604B923DA3...| 0| 0| 28| 0| 0|\n",
"|06EAD4C8D98202F1E...| 0| 1| 17| 0| 0|\n",
"|D563F5CC514A87541...| 1| 0| 8| 2| 0|\n",
"|496036713FC662D71...| 0| 0| 13| 0| 0|\n",
"|595917A7813CC80DA...| 0| 1| 22| 0| 0|\n",
"|DAF60A90A00F8FE30...| 0| 0| 20| 0| 0|\n",
"|C251B99766928BB4A...| 0| 0| 7| 2| 0|\n",
"|B59C6B4E3CFAB9EDF...| 0| 0| 33| 0| 0|\n",
"|1109955CCAABCBCE1...| 0| 0| 3| 0| 0|\n",
"|56CF5E3DD6328847A...| 0| 0| 5| 1| 0|\n",
"|5CCB4924B158F945B...| 0| 0| 24| 0| 0|\n",
"|73039762E0F4B253E...| 0| 0| 22| 0| 0|\n",
"|C0D5941A4A93777E9...| 0| 0| 15| 1| 0|\n",
"|57E8E649531AB8807...| 0| 0| 5| 0| 0|\n",
"|911B6F71706854496...| 0| 0| 9| 0| 0|\n",
"|B2B089B939CB4A0A6...| 0| 0| 15| 0| 0|\n",
"|753BC0484097BB236...| 0| 0| 11| 0| 0|\n",
"|47D63452A91E1705F...| 0| 0| 6| 1| 0|\n",
"|BEA5A07E7B365D7F6...| 0| 0| 20| 0| 0|\n",
"|34CE2E3B6B1E89A38...| 0| 1| 9| 0| 0|\n",
"|5B9AB2A961429F558...| 0| 1| 24| 1| 0|\n",
"|72AAE2B8FF50AF611...| 0| 1| 24| 0| 0|\n",
"|4A9DED62DD8EA1E19...| 1| 0| 20| 0| 0|\n",
"|9771700E1AE5E87B2...| 0| 0| 7| 1| 0|\n",
"|C50532B1D6B517BCB...| 0| 2| 8| 1| 0|\n",
"|4F9B5CF4F0FC8835D...| 0| 2| 16| 0| 0|\n",
"|2B8C6434EB5875E58...| 0| 0| 18| 0| 0|\n",
"|BA57B240D0EEE2F43...| 0| 2| 36| 2| 0|\n",
"|4A17962CB3E106E57...| 0| 1| 12| 1| 0|\n",
"|286EFDDA8BBA68C50...| 0| 0| 22| 0| 0|\n",
"|98EDCE7D6FB0741BD...| 0| 7| 5| 0| 0|\n",
"|F0C30DB1889710471...| 0| 3| 7| 0| 0|\n",
"|B6585890F68EE0270...| 0| 0| 26| 1| 0|\n",
"|DC8694A18613057F7...| 0| 2| 9| 2| 0|\n",
"|41EB945E62B7F03D9...| 0| 0| 15| 1| 0|\n",
"|4CD65097EFB67A8D6...| 0| 0| 11| 0| 0|\n",
"|08E9F5633328D780C...| 0| 5| 18| 1| 0|\n",
"|8D708B5B292FB555F...| 1| 0| 21| 1| 0|\n",
"|F4BB93A9C7E2E0A47...| 0| 0| 8| 0| 0|\n",
"|9586875D692663562...| 0| 4| 4| 1| 0|\n",
"|552CCF061B871F717...| 0| 2| 10| 0| 0|\n",
"|167C661512D5AA2C5...| 0| 0| 13| 0| 0|\n",
"|1C8CB1A88201C4E83...| 0| 0| 16| 0| 0|\n",
"|D50F5974294A3AC41...| 0| 0| 4| 0| 0|\n",
"|5205D3FE7D57F5494...| 0| 0| 6| 0| 0|\n",
"|B50F660464E5D0649...| 0| 0| 5| 3| 0|\n",
"|2EBD87EE737D1AB90...| 0| 1| 12| 0| 0|\n",
"|5EE2C4D3BF57BDB45...| 0| 0| 13| 0| 0|\n",
"|C12F3B53D695B3195...| 0| 1| 17| 0| 0|\n",
"|3C698F44315B54EF2...| 0| 0| 19| 0| 0|\n",
"|652979D8BB6F2409F...| 0| 0| 6| 1| 0|\n",
"|667F9BBD97EADE903...| 0| 0| 16| 0| 0|\n",
"|725D9245A61E2C54D...| 0| 0| 31| 0| 0|\n",
"+--------------------+-----+--------+---------+------+-------------+\n",
"only showing top 100 rows\n",
"\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
" \r"
]
}
],
"source": [
"# Number each taxi's trips chronologically; `medallion` is used as the driver key here\n",
"w_ex8 = Window.partitionBy(\"medallion\").orderBy(F.col(\"pickup_datetime\"))\n",
"\n",
"@F.udf(returnType=T.IntegerType())\n",
"def max_consecutive_rank_seq_len(ranks: list[int]) -> int:\n",
"    # collect_list does not guarantee element order, so sort before scanning\n",
"    ranks = sorted(ranks)\n",
"    if len(ranks) <= 1:\n",
"        return len(ranks)\n",
"\n",
"    longest_len = 0\n",
"    start = 0  # index where the current run begins\n",
"\n",
"    for i in range(1, len(ranks)):\n",
"        # A gap in the trip numbers ends the current run\n",
"        if ranks[i] - 1 != ranks[i - 1]:\n",
"            longest_len = max(i - start, longest_len)\n",
"            start = i\n",
"\n",
"    # Account for the run that reaches the end of the list\n",
"    return max(len(ranks) - start, longest_len)\n",
"\n",
"# Keep the trip numbers of within-borough trips, find the longest run per\n",
"# (medallion, borough), then pivot the boroughs into columns\n",
"df_ex8 = df_with_bor \\\n",
"    .select(\"medallion\", \"pickup_borough\", \"dropoff_borough\", \"pickup_datetime\") \\\n",
"    .withColumn(\"tripNo\", F.rank().over(w_ex8)) \\\n",
"    .where((F.col(\"pickup_borough\") == F.col(\"dropoff_borough\")) & (F.col(\"pickup_borough\").isNotNull())) \\\n",
"    .select(F.col(\"medallion\"), F.col(\"pickup_borough\").alias(\"borough\"), F.col(\"tripNo\")) \\\n",
"    .groupBy(\"medallion\", \"borough\").agg(max_consecutive_rank_seq_len(F.collect_list(\"tripNo\")).alias('maxTrips')) \\\n",
"    .groupBy(\"medallion\").pivot(\"borough\").sum(\"maxTrips\") \\\n",
"    .fillna(value=0)\n",
"\n",
"df_ex8.show(100)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2261d0e4-cf9d-4190-836b-32981b8ceb64",
"metadata": {},
2023-05-30 15:52:00 +00:00
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.3"
}
},
"nbformat": 4,
"nbformat_minor": 5
}