{
|
|
"cells": [
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "23b48f71",
|
|
"metadata": {},
|
|
"source": [
|
|
"# S&DE Atelier - Visual Analytics\n",
|
|
"\n",
|
|
"# Assignment 3\n",
|
|
"\n",
|
|
"**Due** June 2, 2023 @23:55\n",
|
|
"\n",
|
|
"**Contacts**: [marco.dambros@usi.ch](mailto:marco.dambros@usi.ch) - [carmen.armenti@usi.ch](mailto:carmen.armenti@usi.ch)\n",
|
|
"\n",
|
|
"---\n",
|
|
"\n",
|
|
"The goal of this assignment is to use Spark in Jupyter notebooks (PySpark). The files `trip_data.csv`, `trip_fare.csv` and `nyc_boroughs.geojson` can be found in the following folder: [Assignment3-data](https://usi365-my.sharepoint.com/:f:/g/personal/armenc_usi_ch/Ejp7sb8QAMROoWe0XUDcAkMBoqUFk-w2Vgroup025NhAww?e=TFG5CD). You should clean the data if needed. \n",
|
|
"\n",
|
|
"Note that you can use Spark [window functions](https://spark.apache.org/docs/latest/sql-ref-syntax-qry-select-window.html) whenever applicable. \n",
|
|
"\n",
|
|
"Please name your file as `SurnameName_Assignment3.ipynb`."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 1,
|
|
"id": "9f434eb8",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"# Import the basic spark library\n",
|
|
"from pyspark.sql import SparkSession\n",
|
|
"from pyspark.sql.functions import col"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 2,
|
|
"id": "4a3188f4",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"#import sys\n",
|
|
"#!{sys.executable} -m pip install geospark"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 3,
|
|
"id": "b9a87a5c",
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"name": "stderr",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"Setting default log level to \"WARN\".\n",
|
|
"To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).\n",
|
|
"23/05/30 15:56:43 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable\n"
|
|
]
|
|
}
|
|
],
|
|
"source": [
|
|
"# Create an entry point to the PySpark Application\n",
|
|
"spark = SparkSession.builder \\\n",
|
|
" .config(\"spark.driver.bindAddress\", \"127.0.0.1\") \\\n",
|
|
" .config(\"spark.driver.memory\", \"16g\") \\\n",
"    .config(\"spark.executor.memory\", \"16g\") \\\n",
"    .config(\"spark.executor.cores\", \"4\") \\\n",
|
|
" .master(\"local\") \\\n",
|
|
" .appName(\"MaggioniClaudio_Assignment3\") \\\n",
|
|
" .getOrCreate()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "536a6cc4",
|
|
"metadata": {},
|
|
"source": [
|
|
"### Exercise 1\n",
|
|
"Join the `trip_data` and `trip_fare` dataframes into one and consider only data on 2013-01-01."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 4,
|
|
"id": "9fc094c8",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
"def sanitize_column_names(df):\n",
"    # Strip surrounding whitespace from column names and replace spaces with underscores\n",
"    for original in df.columns:\n",
"        df = df.withColumnRenamed(original, original.strip().replace(\" \", \"_\"))\n",
"    return df"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 5,
|
|
"id": "afe8000d",
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"name": "stderr",
|
|
"output_type": "stream",
|
|
"text": [
|
|
" \r"
|
|
]
|
|
}
|
|
],
|
|
"source": [
|
|
"df_trip_data = spark.read \\\n",
|
|
" .option(\"header\", True) \\\n",
|
|
" .csv(\"data/trip_data.csv\", inferSchema=True)\n",
|
|
"\n",
|
|
"df_trip_data = sanitize_column_names(df_trip_data)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 6,
|
|
"id": "4dfe92f6",
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"name": "stderr",
|
|
"output_type": "stream",
|
|
"text": [
|
|
" \r"
|
|
]
|
|
}
|
|
],
|
|
"source": [
|
|
"df_trip_fare = spark.read \\\n",
|
|
" .option(\"header\", True) \\\n",
|
|
" .csv(\"data/trip_fare.csv\", inferSchema=True)\n",
|
|
"\n",
|
|
"df_trip_fare = sanitize_column_names(df_trip_fare)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 7,
|
|
"id": "d76abc83",
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"+--------------------+--------------------+---------+---------+------------------+-------------------+-------------------+---------------+-----------------+-------------+----------------+---------------+-----------------+----------------+\n",
|
|
"| medallion| hack_license|vendor_id|rate_code|store_and_fwd_flag| pickup_datetime| dropoff_datetime|passenger_count|trip_time_in_secs|trip_distance|pickup_longitude|pickup_latitude|dropoff_longitude|dropoff_latitude|\n",
|
|
"+--------------------+--------------------+---------+---------+------------------+-------------------+-------------------+---------------+-----------------+-------------+----------------+---------------+-----------------+----------------+\n",
|
|
"|89D227B655E5C82AE...|BA96DE419E711691B...| CMT| 1| N|2013-01-01 15:11:48|2013-01-01 15:18:10| 4| 382| 1.0| -73.978165| 40.757977| -73.989838| 40.751171|\n",
|
|
"|0BD7C8F5BA12B88E0...|9FD8F69F0804BDB55...| CMT| 1| N|2013-01-06 00:18:35|2013-01-06 00:22:54| 1| 259| 1.5| -74.006683| 40.731781| -73.994499| 40.75066|\n",
|
|
"|0BD7C8F5BA12B88E0...|9FD8F69F0804BDB55...| CMT| 1| N|2013-01-05 18:49:41|2013-01-05 18:54:23| 1| 282| 1.1| -74.004707| 40.73777| -74.009834| 40.726002|\n",
|
|
"|DFD2202EE08F7A8DC...|51EE87E3205C985EF...| CMT| 1| N|2013-01-07 23:54:15|2013-01-07 23:58:20| 2| 244| 0.7| -73.974602| 40.759945| -73.984734| 40.759388|\n",
|
|
"|DFD2202EE08F7A8DC...|51EE87E3205C985EF...| CMT| 1| N|2013-01-07 23:25:03|2013-01-07 23:34:24| 1| 560| 2.1| -73.97625| 40.748528| -74.002586| 40.747868|\n",
|
|
"|20D9ECB2CA0767CF7...|598CCE5B9C1918568...| CMT| 1| N|2013-01-07 15:27:48|2013-01-07 15:38:37| 1| 648| 1.7| -73.966743| 40.764252| -73.983322| 40.743763|\n",
|
|
"|496644932DF393260...|513189AD756FF14FE...| CMT| 1| N|2013-01-08 11:01:15|2013-01-08 11:08:14| 1| 418| 0.8| -73.995804| 40.743977| -74.007416| 40.744343|\n",
|
|
"|0B57B9633A2FECD3D...|CCD4367B417ED6634...| CMT| 1| N|2013-01-07 12:39:18|2013-01-07 13:10:56| 3| 1898| 10.7| -73.989937| 40.756775| -73.86525| 40.77063|\n",
|
|
"|2C0E91FF20A856C89...|1DA2F6543A62B8ED9...| CMT| 1| N|2013-01-07 18:15:47|2013-01-07 18:20:47| 1| 299| 0.8| -73.980072| 40.743137| -73.982712| 40.735336|\n",
|
|
"|2D4B95E2FA7B2E851...|CD2F522EEE1FF5F5A...| CMT| 1| N|2013-01-07 15:33:28|2013-01-07 15:49:26| 2| 957| 2.5| -73.977936| 40.786983| -73.952919| 40.80637|\n",
|
|
"|E12F6AF991172EAC3...|06918214E951FA000...| CMT| 1| N|2013-01-08 13:11:52|2013-01-08 13:19:50| 1| 477| 1.3| -73.982452| 40.773167| -73.964134| 40.773815|\n",
|
|
"|E12F6AF991172EAC3...|06918214E951FA000...| CMT| 1| N|2013-01-08 09:50:05|2013-01-08 10:02:54| 1| 768| 0.7| -73.99556| 40.749294| -73.988686| 40.759052|\n",
|
|
"|78FFD9CD0CDA541F3...|E949C583ECF62C8F0...| CMT| 1| N|2013-01-10 12:07:08|2013-01-10 12:17:29| 1| 620| 2.3| -73.971497| 40.791321| -73.964478| 40.775921|\n",
|
|
"|237F49C3ECC11F502...|93C363DDF8ED9385D...| CMT| 1| N|2013-01-07 07:35:47|2013-01-07 07:46:00| 1| 612| 2.3| -73.98851| 40.774307| -73.981094| 40.755325|\n",
|
|
"|3349F919AA8AE5DC9...|7CE849FEF67514F08...| CMT| 1| N|2013-01-10 15:42:29|2013-01-10 16:04:02| 1| 1293| 3.2| -73.994911| 40.723221| -73.971558| 40.761612|\n",
|
|
"|3349F919AA8AE5DC9...|7CE849FEF67514F08...| CMT| 1| N|2013-01-10 14:27:28|2013-01-10 14:45:21| 1| 1073| 4.4| -74.010391| 40.708702| -73.987846| 40.756104|\n",
|
|
"|4C005EEBAA7BF26B8...|351BE7D984BE17DB2...| CMT| 1| N|2013-01-07 22:09:59|2013-01-07 22:19:50| 1| 591| 1.7| -73.973732| 40.756287| -73.998413| 40.756832|\n",
|
|
"|7D99C30FCE69B1A9D...|460C3F57DD9CB2265...| CMT| 1| N|2013-01-07 17:18:16|2013-01-07 17:20:55| 1| 158| 0.7| -73.968925| 40.767704| -73.96199| 40.776566|\n",
|
|
"|E6FBF80668FE0611A...|36773E80775F26CD1...| CMT| 1| N|2013-01-07 06:08:51|2013-01-07 06:13:14| 1| 262| 1.7| -73.96212| 40.769737| -73.979561| 40.75539|\n",
|
|
"|0C5296F3C8B16E702...|D2363240A9295EF57...| CMT| 1| N|2013-01-07 22:25:46|2013-01-07 22:36:56| 1| 669| 2.3| -73.989708| 40.756714| -73.977615| 40.787575|\n",
|
|
"+--------------------+--------------------+---------+---------+------------------+-------------------+-------------------+---------------+-----------------+-------------+----------------+---------------+-----------------+----------------+\n",
|
|
"only showing top 20 rows\n",
|
|
"\n"
|
|
]
|
|
}
|
|
],
|
|
"source": [
|
|
"df_trip_data.show()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 8,
|
|
"id": "3c7ccbd4",
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"+--------------------+--------------------+---------+-------------------+------------+-----------+---------+-------+----------+------------+------------+\n",
|
|
"| medallion| hack_license|vendor_id| pickup_datetime|payment_type|fare_amount|surcharge|mta_tax|tip_amount|tolls_amount|total_amount|\n",
|
|
"+--------------------+--------------------+---------+-------------------+------------+-----------+---------+-------+----------+------------+------------+\n",
|
|
"|89D227B655E5C82AE...|BA96DE419E711691B...| CMT|2013-01-01 15:11:48| CSH| 6.5| 0.0| 0.5| 0.0| 0.0| 7.0|\n",
|
|
"|0BD7C8F5BA12B88E0...|9FD8F69F0804BDB55...| CMT|2013-01-06 00:18:35| CSH| 6.0| 0.5| 0.5| 0.0| 0.0| 7.0|\n",
|
|
"|0BD7C8F5BA12B88E0...|9FD8F69F0804BDB55...| CMT|2013-01-05 18:49:41| CSH| 5.5| 1.0| 0.5| 0.0| 0.0| 7.0|\n",
|
|
"|DFD2202EE08F7A8DC...|51EE87E3205C985EF...| CMT|2013-01-07 23:54:15| CSH| 5.0| 0.5| 0.5| 0.0| 0.0| 6.0|\n",
|
|
"|DFD2202EE08F7A8DC...|51EE87E3205C985EF...| CMT|2013-01-07 23:25:03| CSH| 9.5| 0.5| 0.5| 0.0| 0.0| 10.5|\n",
|
|
"|20D9ECB2CA0767CF7...|598CCE5B9C1918568...| CMT|2013-01-07 15:27:48| CSH| 9.5| 0.0| 0.5| 0.0| 0.0| 10.0|\n",
|
|
"|496644932DF393260...|513189AD756FF14FE...| CMT|2013-01-08 11:01:15| CSH| 6.0| 0.0| 0.5| 0.0| 0.0| 6.5|\n",
|
|
"|0B57B9633A2FECD3D...|CCD4367B417ED6634...| CMT|2013-01-07 12:39:18| CSH| 34.0| 0.0| 0.5| 0.0| 4.8| 39.3|\n",
|
|
"|2C0E91FF20A856C89...|1DA2F6543A62B8ED9...| CMT|2013-01-07 18:15:47| CSH| 5.5| 1.0| 0.5| 0.0| 0.0| 7.0|\n",
|
|
"|2D4B95E2FA7B2E851...|CD2F522EEE1FF5F5A...| CMT|2013-01-07 15:33:28| CSH| 13.0| 0.0| 0.5| 0.0| 0.0| 13.5|\n",
|
|
"|E12F6AF991172EAC3...|06918214E951FA000...| CMT|2013-01-08 13:11:52| CSH| 7.5| 0.0| 0.5| 0.0| 0.0| 8.0|\n",
|
|
"|E12F6AF991172EAC3...|06918214E951FA000...| CMT|2013-01-08 09:50:05| CSH| 9.0| 0.0| 0.5| 0.0| 0.0| 9.5|\n",
|
|
"|78FFD9CD0CDA541F3...|E949C583ECF62C8F0...| CMT|2013-01-10 12:07:08| CSH| 9.5| 0.0| 0.5| 0.0| 0.0| 10.0|\n",
|
|
"|237F49C3ECC11F502...|93C363DDF8ED9385D...| CMT|2013-01-07 07:35:47| CSH| 10.0| 0.0| 0.5| 0.0| 0.0| 10.5|\n",
|
|
"|3349F919AA8AE5DC9...|7CE849FEF67514F08...| CMT|2013-01-10 15:42:29| CSH| 15.5| 0.0| 0.5| 0.0| 0.0| 16.0|\n",
|
|
"|3349F919AA8AE5DC9...|7CE849FEF67514F08...| CMT|2013-01-10 14:27:28| CSH| 16.5| 0.0| 0.5| 0.0| 0.0| 17.0|\n",
|
|
"|4C005EEBAA7BF26B8...|351BE7D984BE17DB2...| CMT|2013-01-07 22:09:59| CSH| 9.0| 0.5| 0.5| 0.0| 0.0| 10.0|\n",
|
|
"|7D99C30FCE69B1A9D...|460C3F57DD9CB2265...| CMT|2013-01-07 17:18:16| CSH| 4.5| 1.0| 0.5| 0.0| 0.0| 6.0|\n",
|
|
"|E6FBF80668FE0611A...|36773E80775F26CD1...| CMT|2013-01-07 06:08:51| CSH| 7.0| 0.0| 0.5| 0.0| 0.0| 7.5|\n",
|
|
"|0C5296F3C8B16E702...|D2363240A9295EF57...| CMT|2013-01-07 22:25:46| CSH| 10.5| 0.5| 0.5| 0.0| 0.0| 11.5|\n",
|
|
"+--------------------+--------------------+---------+-------------------+------------+-----------+---------+-------+----------+------------+------------+\n",
|
|
"only showing top 20 rows\n",
|
|
"\n"
|
|
]
|
|
}
|
|
],
|
|
"source": [
|
|
"df_trip_fare.show()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 9,
|
|
"id": "61e21d2a",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"df_left = df_trip_data.filter(col('pickup_datetime').startswith(\"2013-01-01 \"))\n",
|
|
"df_right = df_trip_fare.filter(col('pickup_datetime').startswith(\"2013-01-01 \"))\n",
|
|
"\n",
|
|
"df_joined = df_left.join(df_right, ['medallion', 'pickup_datetime']).cache()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 10,
|
|
"id": "d73ab313",
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"name": "stderr",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"[Stage 7:====================================================> (12 + 1) / 13]\r"
|
|
]
|
|
},
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"+--------------------+-------------------+--------------------+---------+---------+------------------+-------------------+---------------+-----------------+-------------+----------------+---------------+-----------------+----------------+--------------------+---------+------------+-----------+---------+-------+----------+------------+------------+\n",
|
|
"| medallion| pickup_datetime| hack_license|vendor_id|rate_code|store_and_fwd_flag| dropoff_datetime|passenger_count|trip_time_in_secs|trip_distance|pickup_longitude|pickup_latitude|dropoff_longitude|dropoff_latitude| hack_license|vendor_id|payment_type|fare_amount|surcharge|mta_tax|tip_amount|tolls_amount|total_amount|\n",
|
|
"+--------------------+-------------------+--------------------+---------+---------+------------------+-------------------+---------------+-----------------+-------------+----------------+---------------+-----------------+----------------+--------------------+---------+------------+-----------+---------+-------+----------+------------+------------+\n",
|
|
"|000318C2E3E638158...|2013-01-01 20:46:00|91CE3B3A2F548CD8A...| VTS| 1| null|2013-01-01 20:56:00| 5| 600| 1.35| -73.989677| 40.756554| -73.970673| 40.752541|91CE3B3A2F548CD8A...| VTS| CRD| 8.5| 0.5| 0.5| 1.8| 0.0| 11.3|\n",
|
|
"|00790C7BAD30B7A9E...|2013-01-01 04:26:00|3EF1ED607505C991D...| VTS| 1| null|2013-01-01 04:59:00| 1| 1980| 10.99| -73.996811| 40.716587| -73.949448| 40.827671|3EF1ED607505C991D...| VTS| CRD| 36.5| 0.5| 0.5| 9.25| 0.0| 46.75|\n",
|
|
"|00A1EA0E8CD47CE24...|2013-01-01 06:09:50|4FD770C068437BBA9...| CMT| 1| N|2013-01-01 06:29:03| 1| 1153| 5.8| -73.89653| 40.759472| -73.952698| 40.780788|4FD770C068437BBA9...| CMT| CRD| 20.5| 0.0| 0.5| 4.0| 0.0| 25.0|\n",
|
|
"+--------------------+-------------------+--------------------+---------+---------+------------------+-------------------+---------------+-----------------+-------------+----------------+---------------+-----------------+----------------+--------------------+---------+------------+-----------+---------+-------+----------+------------+------------+\n",
|
|
"only showing top 3 rows\n",
|
|
"\n"
|
|
]
|
|
},
|
|
{
|
|
"name": "stderr",
|
|
"output_type": "stream",
|
|
"text": [
|
|
" \r"
|
|
]
|
|
}
|
|
],
|
|
"source": [
|
|
"df_joined.show(3)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "5f246287",
|
|
"metadata": {},
|
|
"source": [
|
|
"### Exercise 2\n",
|
|
"Consider only Manhattan, Bronx and Brooklyn districts. Then create a dataframe that shows the total number of trips *within* the same district and *across* all the other districts mentioned before.\n",
|
|
"\n",
|
|
"For example, for Manhattan borough you should consider the total number of the following trips:\n",
|
|
"- Manhattan → Manhattan\n",
|
|
"- Manhattan → Brooklyn\n",
|
|
"- Manhattan → Bronx\n",
|
|
"\n",
|
|
"You should then do the same for Bronx and Brooklyn boroughs."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 11,
|
|
"id": "97e35f13",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
"from typing import Optional\n",
"\n",
"from pyspark.sql import types as T\n",
"from pyspark.sql import functions as F\n",
"from shapely.geometry import Polygon, Point\n",
"\n",
"df_boroughs = spark.read \\\n",
"    .option(\"multiline\", \"true\") \\\n",
"    .json(r'data/nyc-boroughs.geojson')\n",
"\n",
"# One row per GeoJSON feature\n",
"df_boroughs = df_boroughs.select(F.explode(df_boroughs.features).alias(\"feature\"))\n",
"\n",
"borough_rows = df_boroughs.select(\n",
"    df_boroughs.feature.properties.borough.alias(\"borough\"),\n",
"    df_boroughs.feature.geometry.coordinates.alias(\"coordinates\")).collect()\n",
"\n",
"# Build the shapely polygons once on the driver; the UDF below reads this list\n",
"boroughs_list: list[tuple[str, list[Polygon]]] = \\\n",
"    [(r.borough, [Polygon(shell=p) for p in r.coordinates]) for r in borough_rows]\n",
"\n",
"@F.udf(returnType=T.StringType())\n",
"def get_borough(lon: float, lat: float) -> Optional[str]:\n",
"    # Return the name of the borough containing the point, or None if no borough matches\n",
"    if lon is None or lat is None:\n",
"        return None\n",
"\n",
"    point = Point(lon, lat)\n",
"\n",
"    for name, polygons in boroughs_list:\n",
"        for polygon in polygons:\n",
"            if polygon.contains(point):\n",
"                return name\n",
"    return None"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 12,
|
|
"id": "b12aa2ec",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
"# Add the pickup and dropoff borough of each trip using the UDF\n",
|
|
"df_with_bor = df_joined \\\n",
|
|
" .withColumn(\"pickup_borough\", get_borough(\"pickup_longitude\", \"pickup_latitude\")) \\\n",
|
|
" .withColumn(\"dropoff_borough\", get_borough(\"dropoff_longitude\", \"dropoff_latitude\")) \\\n",
|
|
" .cache()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 13,
|
|
"id": "9d386ada-5bd0-4db5-9ac7-e675f371682c",
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"name": "stderr",
|
|
"output_type": "stream",
|
|
"text": [
|
|
" \r"
|
|
]
|
|
},
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"Before borough join: 412630\n"
|
|
]
|
|
},
|
|
{
|
|
"name": "stderr",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"[Stage 20:=====================================================>(199 + 1) / 200]\r"
|
|
]
|
|
},
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"After borough join:412630\n"
|
|
]
|
|
},
|
|
{
|
|
"name": "stderr",
|
|
"output_type": "stream",
|
|
"text": [
|
|
" \r"
|
|
]
|
|
}
|
|
],
|
|
"source": [
|
|
"print(\"Before borough join: \" + str(df_joined.count())) \n",
|
|
"print(\"After borough join:\" + str(df_with_bor.count()))"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 20,
|
|
"id": "9c14ad76-388a-454a-96c0-bf38765ce0dd",
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"name": "stderr",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"[Stage 55:=================================================> (185 + 1) / 200]\r"
|
|
]
|
|
},
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"+--------------+---------------+------+\n",
|
|
"|pickup_borough|dropoff_borough|count |\n",
|
|
"+--------------+---------------+------+\n",
|
|
"|Bronx |Bronx |487 |\n",
|
|
"|Bronx |Brooklyn |6 |\n",
|
|
"|Bronx |Manhattan |284 |\n",
|
|
"|Brooklyn |Bronx |57 |\n",
|
|
"|Brooklyn |Brooklyn |10454 |\n",
|
|
"|Brooklyn |Manhattan |6408 |\n",
|
|
"|Manhattan |Bronx |2779 |\n",
|
|
"|Manhattan |Brooklyn |14396 |\n",
|
|
"|Manhattan |Manhattan |319706|\n",
|
|
"+--------------+---------------+------+\n",
|
|
"\n"
|
|
]
|
|
},
|
|
{
|
|
"name": "stderr",
|
|
"output_type": "stream",
|
|
"text": [
|
|
" \r"
|
|
]
|
|
}
|
|
],
|
|
"source": [
"def isin(var, values):\n",
"    # True when the column equals any of the given values\n",
"    return var.isin(values)\n",
"\n",
"boroughs = [\"Manhattan\", \"Bronx\", \"Brooklyn\"]\n",
"\n",
"# Keep trips that start and end in the selected boroughs, then count each pair\n",
"df_ex2 = df_with_bor \\\n",
"    .where(isin(df_with_bor.pickup_borough, boroughs) & isin(df_with_bor.dropoff_borough, boroughs)) \\\n",
"    .groupBy(\"pickup_borough\", \"dropoff_borough\") \\\n",
"    .count() \\\n",
"    .orderBy(\"pickup_borough\", \"dropoff_borough\")\n",
"df_ex2.show(truncate=False)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "21bd4ac8",
|
|
"metadata": {},
|
|
"source": [
|
|
"### Exercise 3\n",
|
|
"Imagine you are a taxi driver and one day you can work only two hours. Assume the data is representative of a typical working day. Which hours of the day - retrieved from `pickup_datetime` - would you choose to work based on the fare and tip amount?"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 47,
|
|
"id": "46d191e1-fd13-4de3-8851-5e10a7319286",
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"name": "stderr",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"[Stage 215:===============================================> (181 + 1) / 200]\r"
|
|
]
|
|
},
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"+-----------+------------------+\n",
|
|
"|pickup_hour|fare_and_tip_total|\n",
|
|
"+-----------+------------------+\n",
|
|
"|1 |453700.23 |\n",
|
|
"|2 |418415.82 |\n",
|
|
"|0 |390741.27 |\n",
|
|
"|3 |367018.78 |\n",
|
|
"|14 |286852.68 |\n",
|
|
"|15 |278953.43 |\n",
|
|
"|4 |272856.05 |\n",
|
|
"|18 |269648.14 |\n",
|
|
"|13 |263915.72 |\n",
|
|
"|17 |258134.56 |\n",
|
|
"|16 |246552.73 |\n",
|
|
"|12 |238716.32 |\n",
|
|
"|19 |234377.86 |\n",
|
|
"|20 |211402.98 |\n",
|
|
"|21 |208110.83 |\n",
|
|
"|22 |204481.56 |\n",
|
|
"|11 |194952.87 |\n",
|
|
"|5 |180075.5 |\n",
|
|
"|23 |158957.41 |\n",
|
|
"|10 |146400.51 |\n",
|
|
"+-----------+------------------+\n",
|
|
"only showing top 20 rows\n",
|
|
"\n"
|
|
]
|
|
},
|
|
{
|
|
"name": "stderr",
|
|
"output_type": "stream",
|
|
"text": [
|
|
" \r"
|
|
]
|
|
}
|
|
],
|
|
"source": [
"# Total fare plus tip collected in each pickup hour, most profitable hours first\n",
"df_ex3 = df_joined \\\n",
"    .withColumn(\"pickup_hour\", F.hour(\"pickup_datetime\")) \\\n",
"    .groupBy(\"pickup_hour\") \\\n",
"    .agg(F.round(F.sum(F.col(\"fare_amount\") + F.col(\"tip_amount\")), 2).alias(\"fare_and_tip_total\")) \\\n",
"    .sort(F.desc(\"fare_and_tip_total\"))\n",
"\n",
"df_ex3.show(truncate=False)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "ffbbaf04-65b5-4fc2-879f-a2a8bcc87519",
|
|
"metadata": {},
|
|
"source": [
"Given the table above, I would work during the hours starting at **1 AM** and **2 AM**, since they have the highest combined fare and tip totals. This ranking may be specific to the chosen date, `2013-01-01`, because of the New Year celebrations."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "b24e0922",
|
|
"metadata": {},
|
|
"source": [
|
|
"### Exercise 4\n",
|
|
"Provide a graphical representation to compare the average fare amount for trips _within_ and _across_ all the districts. You may want to have a look at: https://docs.bokeh.org/en/latest/docs/user_guide/topics/categorical.html#heatmaps."
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 72,
|
|
"id": "0643d9e4",
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"name": "stderr",
|
|
"output_type": "stream",
|
|
"text": [
|
|
" \r"
|
|
]
|
|
}
|
|
],
|
|
"source": [
"import pandas as pd\n",
"\n",
"# Mean fare per (pickup, dropoff) borough pair; points matching no borough are labelled \"Unknown\"\n",
"df_ex4 = df_with_bor \\\n",
"    .withColumn(\"pickup_borough\", F.coalesce(F.col(\"pickup_borough\"), F.lit(\"Unknown\"))) \\\n",
"    .withColumn(\"dropoff_borough\", F.coalesce(F.col(\"dropoff_borough\"), F.lit(\"Unknown\"))) \\\n",
"    .groupBy(\"pickup_borough\", \"dropoff_borough\") \\\n",
"    .agg(F.mean(\"fare_amount\").alias(\"mean_fare\")) \\\n",
"    .toPandas()"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 79,
|
|
"id": "2cba45e6-7ad1-4044-b9f0-81943c1cf547",
|
|
"metadata": {},
|
|
"outputs": [],
|
|
"source": [
|
|
"from math import pi\n",
|
|
"from bokeh.models import BasicTicker, PrintfTickFormatter\n",
|
|
"from bokeh.plotting import figure, show\n",
|
|
"from bokeh.transform import linear_cmap\n",
|
|
"\n",
|
|
"pickup = list(sorted(df_ex4['pickup_borough'].unique()))\n",
|
|
"dropoff = list(reversed(sorted(df_ex4['dropoff_borough'].unique())))\n",
|
|
"\n",
|
|
"colors = [\"#75968f\", \"#a5bab7\", \"#c9d9d3\", \"#e2e2e2\", \"#dfccce\", \"#ddb7b1\", \"#cc7878\", \"#933b41\", \"#550b1d\"]\n",
|
|
"\n",
"p = figure(title=\"Mean NYC Taxi fares on 2013-01-01\",\n",
|
|
" x_range=pickup, y_range=dropoff,\n",
|
|
" x_axis_location=\"above\", width=900, height=900,\n",
|
|
" tools=\"hover,save,pan,box_zoom,reset,wheel_zoom\", toolbar_location='below',\n",
|
|
" tooltips=[ \\\n",
|
|
" ('Pickup Borough', '@pickup_borough'), \\\n",
|
|
" ('Dropoff Borough', '@dropoff_borough'), \\\n",
|
|
" ('Average Fare Amount', '$@mean_fare')])\n",
|
|
"\n",
|
|
"p.grid.grid_line_color = None\n",
|
|
"p.axis.axis_line_color = None\n",
|
|
"p.axis.major_tick_line_color = None\n",
|
|
"p.axis.major_label_text_font_size = \"14px\"\n",
|
|
"p.axis.major_label_standoff = 0\n",
|
|
"p.xaxis.major_label_orientation = pi / 3\n",
|
|
"\n",
|
|
"r = p.rect(x=\"pickup_borough\", y=\"dropoff_borough\", width=1, height=1, source=df_ex4,\n",
|
|
" fill_color=linear_cmap(\"mean_fare\", colors, low=df_ex4.mean_fare.min(), high=df_ex4.mean_fare.max()),\n",
|
|
" line_color=None)\n",
|
|
"\n",
|
|
"p.add_layout(r.construct_color_bar(\n",
|
|
" major_label_text_font_size=\"14px\",\n",
|
|
" ticker=BasicTicker(desired_num_ticks=len(colors)),\n",
|
|
" formatter=PrintfTickFormatter(format=\"$%d\"),\n",
|
|
" label_standoff=6,\n",
|
|
" border_line_color=None,\n",
|
|
" padding=5\n",
|
|
"), 'right')\n",
|
|
"\n",
|
|
"show(p)"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "markdown",
|
|
"id": "9b4a8445",
|
|
"metadata": {},
|
|
"source": [
|
|
"### Exercise 5\n",
|
|
"Find the average amount of tolls per hour for trips within the following districts: Manhattan, Bronx, Brooklyn, Queens. Show a graphical representation of the data and report if there is any trend or peak during the day. Overall which district has the largest amount of tolls?"
|
|
]
|
|
},
|
|
{
|
|
"cell_type": "code",
|
|
"execution_count": 88,
|
|
"id": "b80cbb2d",
|
|
"metadata": {},
|
|
"outputs": [
|
|
{
|
|
"name": "stderr",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"[Stage 313:====================================================>(197 + 1) / 200]\r"
|
|
]
|
|
},
|
|
{
|
|
"name": "stdout",
|
|
"output_type": "stream",
|
|
"text": [
|
|
"+--------------+---------------+----+--------------------+\n",
|
|
"|pickup_borough|dropoff_borough|hour| mean_tolls_amount|\n",
|
|
"+--------------+---------------+----+--------------------+\n",
|
|
"| Bronx| Bronx| 0| 0.0|\n",
|
|
"| Bronx| Bronx| 1| 0.0|\n",
|
|
"| Bronx| Bronx| 2| 0.0|\n",
|
|
"| Bronx| Bronx| 3| 0.0|\n",
|
|
"| Bronx| Bronx| 4| 0.0|\n",
|
|
"| Bronx| Bronx| 5| 0.14545454545454545|\n",
|
|
"| Bronx| Bronx| 6| 0.0|\n",
|
|
"| Bronx| Bronx| 7| 0.0|\n",
|
|
"| Bronx| Bronx| 8| 0.0|\n",
|
|
"| Bronx| Bronx| 9| 0.0|\n",
|
|
"| Bronx| Bronx| 10| 1.1388888888888888|\n",
|
|
"| Bronx| Bronx| 11| 0.6857142857142857|\n",
|
|
"| Bronx| Bronx| 12| 0.0|\n",
|
|
"| Bronx| Bronx| 13| 0.0|\n",
|
|
"| Bronx| Bronx| 14| 0.0|\n",
|
|
"| Bronx| Bronx| 15| 0.0|\n",
|
|
"| Bronx| Bronx| 16| 0.0|\n",
|
|
"| Bronx| Bronx| 17| 0.96|\n",
|
|
"| Bronx| Bronx| 18| 0.48|\n",
|
|
"| Bronx| Bronx| 19| 0.0|\n",
|
|
"| Bronx| Bronx| 20| 0.6857142857142857|\n",
|
|
"| Bronx| Bronx| 21| 0.3692307692307692|\n",
|
|
"| Bronx| Bronx| 22| 0.0|\n",
|
|
"| Bronx| Bronx| 23| 0.0|\n",
|
|
"| Bronx| Brooklyn| 1| 4.8|\n",
|
|
"| Bronx| Brooklyn| 2| 0.0|\n",
|
|
"| Bronx| Brooklyn| 6| 0.0|\n",
|
|
"| Bronx| Brooklyn| 12| 0.0|\n",
|
|
"| Bronx| Brooklyn| 17| 4.8|\n",
|
|
"| Bronx| Manhattan| 0| 0.0|\n",
|
|
"| Bronx| Manhattan| 1| 0.0|\n",
|
|
"| Bronx| Manhattan| 2| 0.0|\n",
|
|
"| Bronx| Manhattan| 3| 0.0|\n",
|
|
"| Bronx| Manhattan| 4| 0.18333333333333335|\n",
|
|
"| Bronx| Manhattan| 5| 0.0|\n",
|
|
"| Bronx| Manhattan| 6| 0.0|\n",
|
|
"| Bronx| Manhattan| 7| 0.0|\n",
|
|
"| Bronx| Manhattan| 8| 0.0|\n",
|
|
"| Bronx| Manhattan| 9| 0.0|\n",
|
|
"| Bronx| Manhattan| 10| 0.0|\n",
|
|
"| Bronx| Manhattan| 11| 0.0|\n",
|
|
"| Bronx| Manhattan| 12| 0.0|\n",
|
|
"| Bronx| Manhattan| 13| 0.0|\n",
|
|
"| Bronx| Manhattan| 14| 0.0|\n",
|
|
"| Bronx| Manhattan| 15| 0.0|\n",
|
|
"| Bronx| Manhattan| 16| 0.0|\n",
|
|
"| Bronx| Manhattan| 17| 0.0|\n",
|
|
"| Bronx| Manhattan| 18| 0.0|\n",
|
|
"| Bronx| Manhattan| 20| 0.0|\n",
|
|
"| Bronx| Manhattan| 21| 0.0|\n",
|
|
"| Bronx| Manhattan| 22| 0.0|\n",
|
|
"| Bronx| Manhattan| 23| 0.0|\n",
|
|
"| Bronx| Queens| 0| 4.8|\n",
|
|
"| Bronx| Queens| 1| 2.4|\n",
|
|
"| Bronx| Queens| 2| 4.8|\n",
|
|
"| Bronx| Queens| 3| 4.8|\n",
|
|
"| Bronx| Queens| 5| 3.5999999999999996|\n",
|
|
"| Bronx| Queens| 6| 2.4|\n",
|
|
"| Bronx| Queens| 7| 4.8|\n",
|
|
"| Bronx| Queens| 12| 4.8|\n",
|
|
"| Bronx| Queens| 15| 4.8|\n",
|
|
"| Brooklyn| Bronx| 0| 1.92|\n",
|
|
"| Brooklyn| Bronx| 1| 2.742857142857143|\n",
|
|
"| Brooklyn| Bronx| 2| 1.3499999999999999|\n",
|
|
"| Brooklyn| Bronx| 3| 1.3833333333333335|\n",
|
|
"| Brooklyn| Bronx| 4| 1.5999999999999999|\n",
|
|
"| Brooklyn| Bronx| 5| 2.4|\n",
|
|
"| Brooklyn| Bronx| 6| 1.5999999999999999|\n",
|
|
"| Brooklyn| Bronx| 7| 1.2|\n",
|
|
"| Brooklyn| Bronx| 8| 0.0|\n",
|
|
"| Brooklyn| Bronx| 10| 0.0|\n",
|
|
"| Brooklyn| Bronx| 11| 4.8|\n",
|
|
"| Brooklyn| Bronx| 12| 2.2|\n",
|
|
"| Brooklyn| Bronx| 18| 0.0|\n",
|
|
"| Brooklyn| Bronx| 23| 0.0|\n",
|
|
"| Brooklyn| Brooklyn| 0| 0.0|\n",
|
|
"| Brooklyn| Brooklyn| 1|0.005357142857142857|\n",
|
|
"| Brooklyn| Brooklyn| 2| 0.0|\n",
|
|
"| Brooklyn| Brooklyn| 3|0.019872701555869874|\n",
|
|
"| Brooklyn| Brooklyn| 4|0.009352189781021899|\n",
|
|
"| Brooklyn| Brooklyn| 5| 0.0|\n",
|
|
"| Brooklyn| Brooklyn| 6| 0.0|\n",
|
|
"| Brooklyn| Brooklyn| 7| 0.0|\n",
|
|
"| Brooklyn| Brooklyn| 8| 0.0|\n",
|
|
"| Brooklyn| Brooklyn| 9| 0.11851851851851851|\n",
|
|
"| Brooklyn| Brooklyn| 10| 0.0|\n",
|
|
"| Brooklyn| Brooklyn| 11| 0.04|\n",
|
|
"| Brooklyn| Brooklyn| 12| 0.0|\n",
|
|
"| Brooklyn| Brooklyn| 13| 0.0|\n",
|
|
"| Brooklyn| Brooklyn| 14| 0.02711864406779661|\n",
|
|
"| Brooklyn| Brooklyn| 15|0.028402366863905324|\n",
|
|
"| Brooklyn| Brooklyn| 16| 0.0|\n",
|
|
"| Brooklyn| Brooklyn| 17| 0.02711864406779661|\n",
|
|
"| Brooklyn| Brooklyn| 18|0.020600858369098713|\n",
|
|
"| Brooklyn| Brooklyn| 19|0.021052631578947368|\n",
|
|
"| Brooklyn| Brooklyn| 20| 0.0|\n",
|
|
"| Brooklyn| Brooklyn| 21| 0.04324324324324324|\n",
|
|
"| Brooklyn| Brooklyn| 22| 0.05704697986577181|\n",
|
|
"| Brooklyn| Brooklyn| 23| 0.0|\n",
|
|
"| Brooklyn| Manhattan| 0| 0.04419889502762431|\n",
|
|
"| Brooklyn| Manhattan| 1| 0.0632860040567951|\n",
|
|
"| Brooklyn| Manhattan| 2| 0.05387755102040815|\n",
|
|
"| Brooklyn| Manhattan| 3| 0.07449748743718591|\n",
|
|
"| Brooklyn| Manhattan| 4|0.038554216867469876|\n",
|
|
"| Brooklyn| Manhattan| 5|0.018532818532818532|\n",
|
|
"| Brooklyn| Manhattan| 6| 0.08372093023255812|\n",
|
|
"| Brooklyn| Manhattan| 7| 0.0|\n",
|
|
"| Brooklyn| Manhattan| 8| 0.0|\n",
|
|
"| Brooklyn| Manhattan| 9| 0.05581395348837209|\n",
|
|
"| Brooklyn| Manhattan| 10| 0.04403669724770642|\n",
|
|
"| Brooklyn| Manhattan| 11| 0.07218045112781954|\n",
|
|
"| Brooklyn| Manhattan| 12| 0.0|\n",
|
|
"| Brooklyn| Manhattan| 13| 0.02981366459627329|\n",
|
|
"| Brooklyn| Manhattan| 14| 0.05962732919254658|\n",
|
|
"| Brooklyn| Manhattan| 15| 0.0|\n",
|
|
"| Brooklyn| Manhattan| 16| 0.11290322580645161|\n",
|
|
"| Brooklyn| Manhattan| 17| 0.12314102564102562|\n",
|
|
"| Brooklyn| Manhattan| 18| 0.0|\n",
|
|
"| Brooklyn| Manhattan| 19| 0.0|\n",
|
|
"| Brooklyn| Manhattan| 20| 0.04|\n",
|
|
"| Brooklyn| Manhattan| 21| 0.08495575221238938|\n",
|
|
"| Brooklyn| Manhattan| 22| 0.04033613445378151|\n",
|
|
"| Brooklyn| Manhattan| 23| 0.0|\n",
|
|
"| Brooklyn| Queens| 0| 0.0|\n",
|
|
"| Brooklyn| Queens| 1|0.010526315789473684|\n",
|
|
"| Brooklyn| Queens| 2| 0.02513089005235602|\n",
"| Brooklyn| Queens| 3|0.026666666666666665|\n",
"| Brooklyn| Queens| 4|0.012413793103448275|\n",
"| Brooklyn| Queens| 5| 0.12258064516129033|\n",
"| Brooklyn| Queens| 6| 0.02857142857142857|\n",
"| Brooklyn| Queens| 7| 0.0|\n",
"| Brooklyn| Queens| 8| 0.0|\n",
"| Brooklyn| Queens| 9| 0.0|\n",
"| Brooklyn| Queens| 10| 0.0|\n",
"| Brooklyn| Queens| 11| 0.0|\n",
"| Brooklyn| Queens| 12| 0.1846153846153846|\n",
"| Brooklyn| Queens| 13| 0.0|\n",
"| Brooklyn| Queens| 14| 0.0|\n",
"| Brooklyn| Queens| 15| 0.11707317073170731|\n",
"| Brooklyn| Queens| 16| 0.0|\n",
"| Brooklyn| Queens| 17| 0.0|\n",
"| Brooklyn| Queens| 18| 0.0|\n",
"| Brooklyn| Queens| 19| 0.0|\n",
"| Brooklyn| Queens| 20| 0.0|\n",
"| Brooklyn| Queens| 21| 0.0|\n",
"| Brooklyn| Queens| 22| 0.0|\n",
"| Brooklyn| Queens| 23| 0.0|\n",
"| Manhattan| Bronx| 0| 0.2533333333333334|\n",
"| Manhattan| Bronx| 1| 0.2715277777777779|\n",
"| Manhattan| Bronx| 2| 0.2100628930817611|\n",
"| Manhattan| Bronx| 3| 0.2696428571428573|\n",
"| Manhattan| Bronx| 4| 0.15384615384615388|\n",
"| Manhattan| Bronx| 5| 0.05527638190954774|\n",
"| Manhattan| Bronx| 6| 0.08096590909090909|\n",
"| Manhattan| Bronx| 7| 0.1333333333333333|\n",
"| Manhattan| Bronx| 8| 0.18133333333333335|\n",
"| Manhattan| Bronx| 9| 0.165|\n",
"| Manhattan| Bronx| 10| 0.3578947368421052|\n",
"| Manhattan| Bronx| 11| 0.3674418604651163|\n",
"| Manhattan| Bronx| 12| 0.43902439024390244|\n",
"| Manhattan| Bronx| 13| 0.22999999999999998|\n",
"| Manhattan| Bronx| 14| 0.2619047619047619|\n",
"| Manhattan| Bronx| 15| 0.2490566037735849|\n",
"| Manhattan| Bronx| 16| 0.5236170212765957|\n",
"| Manhattan| Bronx| 17| 0.23749999999999996|\n",
"| Manhattan| Bronx| 18| 0.2925925925925926|\n",
"| Manhattan| Bronx| 19| 0.1543859649122807|\n",
"| Manhattan| Bronx| 20| 0.14666666666666667|\n",
"| Manhattan| Bronx| 21| 0.20909090909090908|\n",
"| Manhattan| Bronx| 22| 0.29|\n",
"| Manhattan| Bronx| 23| 0.13609999999999997|\n",
"| Manhattan| Brooklyn| 0| 0.20921052631578962|\n",
"| Manhattan| Brooklyn| 1| 0.24647709320695127|\n",
"| Manhattan| Brooklyn| 2| 0.2537931034482761|\n",
"| Manhattan| Brooklyn| 3| 0.168358208955224|\n",
"| Manhattan| Brooklyn| 4| 0.14059939301972688|\n",
"| Manhattan| Brooklyn| 5| 0.11757188498402552|\n",
"| Manhattan| Brooklyn| 6| 0.1429467084639498|\n",
"| Manhattan| Brooklyn| 7| 0.12403433476394847|\n",
"| Manhattan| Brooklyn| 8| 0.1471264367816092|\n",
"| Manhattan| Brooklyn| 9| 0.16633663366336635|\n",
"| Manhattan| Brooklyn| 10| 0.11267605633802817|\n",
"| Manhattan| Brooklyn| 11| 0.18585657370517925|\n",
"| Manhattan| Brooklyn| 12| 0.19136212624584714|\n",
"| Manhattan| Brooklyn| 13| 0.15789473684210523|\n",
"| Manhattan| Brooklyn| 14| 0.2719999999999999|\n",
"| Manhattan| Brooklyn| 15| 0.2133333333333333|\n",
"| Manhattan| Brooklyn| 16| 0.2842105263157894|\n",
"| Manhattan| Brooklyn| 17| 0.2565139949109414|\n",
"| Manhattan| Brooklyn| 18| 0.18093126385809308|\n",
"| Manhattan| Brooklyn| 19| 0.1438972162740899|\n",
"| Manhattan| Brooklyn| 20| 0.13136842105263155|\n",
"| Manhattan| Brooklyn| 21| 0.1684405458089668|\n",
"| Manhattan| Brooklyn| 22| 0.16958041958041953|\n",
"| Manhattan| Brooklyn| 23| 0.09829351535836177|\n",
"| Manhattan| Manhattan| 0|0.002124846378776963|\n",
"| Manhattan| Manhattan| 1|0.003388822829964328|\n",
"| Manhattan| Manhattan| 2|0.002282543352601...|\n",
"| Manhattan| Manhattan| 3|6.617317182593092E-4|\n",
"| Manhattan| Manhattan| 4| 0.00711096245505477|\n",
"| Manhattan| Manhattan| 5|0.004739558892538714|\n",
"| Manhattan| Manhattan| 6|0.008770792827824583|\n",
"| Manhattan| Manhattan| 7| 0.01721972031287035|\n",
"| Manhattan| Manhattan| 8|0.007416208104052026|\n",
"| Manhattan| Manhattan| 9|0.008730447435431065|\n",
"| Manhattan| Manhattan| 10|0.007606766828344964|\n",
"| Manhattan| Manhattan| 11|0.003766874141136529|\n",
"| Manhattan| Manhattan| 12|0.002688551972247...|\n",
"| Manhattan| Manhattan| 13|0.002815919789692486|\n",
"| Manhattan| Manhattan| 14|0.003850092535471...|\n",
"| Manhattan| Manhattan| 15|0.008035703139629235|\n",
"| Manhattan| Manhattan| 16| 0.0056893032117583|\n",
"| Manhattan| Manhattan| 17|0.009296927493738926|\n",
"| Manhattan| Manhattan| 18|0.006115517819238...|\n",
"| Manhattan| Manhattan| 19|0.006486187125358352|\n",
"| Manhattan| Manhattan| 20|0.008908519239407095|\n",
"| Manhattan| Manhattan| 21|0.004213675213675213|\n",
"| Manhattan| Manhattan| 22|0.005885259631490787|\n",
"| Manhattan| Manhattan| 23|0.008152764067127342|\n",
"| Manhattan| Queens| 0| 0.8684324324324318|\n",
"| Manhattan| Queens| 1| 0.8232996323529406|\n",
"| Manhattan| Queens| 2| 0.8496747967479669|\n",
"| Manhattan| Queens| 3| 0.920373626373625|\n",
"| Manhattan| Queens| 4| 0.9509571209800902|\n",
"| Manhattan| Queens| 5| 1.2870841487279827|\n",
"| Manhattan| Queens| 6| 1.7025057208237966|\n",
"| Manhattan| Queens| 7| 2.1997175866495486|\n",
"| Manhattan| Queens| 8| 2.7828251121076213|\n",
"| Manhattan| Queens| 9| 2.6930985915492927|\n",
"| Manhattan| Queens| 10| 2.625207296849084|\n",
"| Manhattan| Queens| 11| 2.9828428571428574|\n",
"| Manhattan| Queens| 12| 3.070651685393257|\n",
"| Manhattan| Queens| 13| 2.920602536997886|\n",
"| Manhattan| Queens| 14| 3.059551760939169|\n",
"| Manhattan| Queens| 15| 3.2354977876106217|\n",
"| Manhattan| Queens| 16| 2.8950213371265985|\n",
"| Manhattan| Queens| 17| 2.6199999999999966|\n",
"| Manhattan| Queens| 18| 2.130339321357284|\n",
"| Manhattan| Queens| 19| 1.8387186629526464|\n",
"| Manhattan| Queens| 20| 1.0089171974522302|\n",
"| Manhattan| Queens| 21| 0.8297852760736203|\n",
"| Manhattan| Queens| 22| 0.6545454545454548|\n",
"| Manhattan| Queens| 23| 0.5005434782608698|\n",
"| Queens| Bronx| 0| 4.547368421052631|\n",
"| Queens| Bronx| 1| 2.9999999999999996|\n",
"| Queens| Bronx| 2| 2.742857142857143|\n",
"| Queens| Bronx| 3| 2.8799999999999994|\n",
"| Queens| Bronx| 4| 3.1999999999999997|\n",
"| Queens| Bronx| 5| 3.2842105263157886|\n",
"| Queens| Bronx| 6| 3.1999999999999997|\n",
"| Queens| Bronx| 7| 3.519999999999999|\n",
"| Queens| Bronx| 8| 3.756521739130434|\n",
"| Queens| Bronx| 9| 4.799999999999999|\n",
"| Queens| Bronx| 10| 4.26611111111111|\n",
"| Queens| Bronx| 11| 3.899999999999999|\n",
"| Queens| Bronx| 12| 4.8|\n",
"| Queens| Bronx| 13| 4.499999999999999|\n",
"| Queens| Bronx| 14| 4.718749999999999|\n",
"| Queens| Bronx| 15| 4.669999999999999|\n",
"| Queens| Bronx| 16| 4.114285714285713|\n",
"| Queens| Bronx| 17| 4.799999999999998|\n",
"| Queens| Bronx| 18| 4.44090909090909|\n",
"| Queens| Bronx| 19| 4.235294117647058|\n",
"| Queens| Bronx| 20| 4.457142857142856|\n",
"| Queens| Bronx| 21| 4.199999999999999|\n",
"| Queens| Bronx| 22| 4.477272727272726|\n",
"| Queens| Bronx| 23| 4.319999999999999|\n",
"| Queens| Brooklyn| 0| 0.0|\n",
"| Queens| Brooklyn| 1| 0.0|\n",
"| Queens| Brooklyn| 2| 0.0|\n",
"| Queens| Brooklyn| 3| 0.0|\n",
"| Queens| Brooklyn| 4| 0.0|\n",
"| Queens| Brooklyn| 5| 0.0|\n",
"| Queens| Brooklyn| 6| 0.0|\n",
"| Queens| Brooklyn| 7| 0.0|\n",
"| Queens| Brooklyn| 8| 0.0|\n",
"| Queens| Brooklyn| 9| 0.0|\n",
"| Queens| Brooklyn| 10| 0.0|\n",
"| Queens| Brooklyn| 11| 0.0|\n",
"| Queens| Brooklyn| 12| 0.0|\n",
"| Queens| Brooklyn| 13| 0.0|\n",
"| Queens| Brooklyn| 14| 0.0|\n",
"| Queens| Brooklyn| 15| 0.0|\n",
"| Queens| Brooklyn| 16| 0.0|\n",
"| Queens| Brooklyn| 17| 0.0|\n",
"| Queens| Brooklyn| 18| 0.01846153846153846|\n",
"| Queens| Brooklyn| 19| 0.0|\n",
"| Queens| Brooklyn| 20| 0.0|\n",
"| Queens| Brooklyn| 21| 0.0|\n",
"| Queens| Brooklyn| 22| 0.0|\n",
"| Queens| Brooklyn| 23|0.019433198380566803|\n",
"| Queens| Manhattan| 0| 1.9786259541984754|\n",
"| Queens| Manhattan| 1| 0.9882352941176481|\n",
"| Queens| Manhattan| 2| 0.6832740213523135|\n",
"| Queens| Manhattan| 3| 0.672689075630252|\n",
"| Queens| Manhattan| 4| 0.8727272727272726|\n",
"| Queens| Manhattan| 5| 2.020737327188942|\n",
"| Queens| Manhattan| 6| 1.513492063492065|\n",
"| Queens| Manhattan| 7| 2.2232824427480935|\n",
"| Queens| Manhattan| 8| 2.3165217391304362|\n",
"| Queens| Manhattan| 9| 2.2579770992366432|\n",
"| Queens| Manhattan| 10| 2.782300884955749|\n",
"| Queens| Manhattan| 11| 3.039658848614068|\n",
"| Queens| Manhattan| 12| 3.084337349397588|\n",
"| Queens| Manhattan| 13| 3.301075268817201|\n",
"| Queens| Manhattan| 14| 3.456075808249721|\n",
"| Queens| Manhattan| 15| 3.4173983739837372|\n",
"| Queens| Manhattan| 16| 3.3323693803159182|\n",
"| Queens| Manhattan| 17| 3.358361774744028|\n",
"| Queens| Manhattan| 18| 3.2230088495575226|\n",
"| Queens| Manhattan| 19| 3.1127427184466017|\n",
"| Queens| Manhattan| 20| 3.1380410022779053|\n",
"| Queens| Manhattan| 21| 3.19478935698448|\n",
"| Queens| Manhattan| 22| 3.0503001200480195|\n",
"| Queens| Manhattan| 23| 2.954719764011798|\n",
"| Queens| Queens| 0| 0.03692307692307692|\n",
"| Queens| Queens| 1|0.010015174506828527|\n",
"| Queens| Queens| 2|0.012598425196850394|\n",
"| Queens| Queens| 3|0.005755395683453...|\n",
"| Queens| Queens| 4| 0.07384937238493722|\n",
"| Queens| Queens| 5| 0.04725274725274725|\n",
"| Queens| Queens| 6| 0.08010471204188482|\n",
"| Queens| Queens| 7| 0.096|\n",
"| Queens| Queens| 8| 0.07384615384615384|\n",
"| Queens| Queens| 9| 0.13531746031746034|\n",
"| Queens| Queens| 10| 0.14015444015444017|\n",
"| Queens| Queens| 11| 0.16809338521400777|\n",
"| Queens| Queens| 12| 0.09125475285171103|\n",
"| Queens| Queens| 13| 0.12818991097922847|\n",
"| Queens| Queens| 14| 0.15837563451776646|\n",
"| Queens| Queens| 15| 0.16179775280898873|\n",
"| Queens| Queens| 16| 0.3113513513513512|\n",
"| Queens| Queens| 17| 0.21036585365853655|\n",
"| Queens| Queens| 18| 0.15960451977401127|\n",
"| Queens| Queens| 19| 0.20064308681672022|\n",
"| Queens| Queens| 20| 0.12923076923076923|\n",
"| Queens| Queens| 21| 0.10892307692307691|\n",
"| Queens| Queens| 22|0.061224489795918366|\n",
"| Queens| Queens| 23| 0.07164179104477612|\n",
"+--------------+---------------+----+--------------------+\n",
"\n"
]
},
{
"name": "stderr",
"output_type": "stream",
"text": [
" \r"
]
}
],
"source": [
"# Boroughs to keep for both pickup and dropoff.\n",
"boroughs_ex5 = [\"Manhattan\", \"Bronx\", \"Brooklyn\", \"Queens\"]\n",
"\n",
"# Keep trips whose pickup and dropoff are both in the selected boroughs,\n",
"# then average the tolls per (pickup borough, dropoff borough, pickup hour).\n",
"df_ex5 = df_with_bor \\\n",
"    .where(F.col(\"pickup_borough\").isin(boroughs_ex5)\n",
"           & F.col(\"dropoff_borough\").isin(boroughs_ex5)) \\\n",
"    .withColumn(\"hour\", F.hour(F.from_utc_timestamp(F.col(\"pickup_datetime\"), 'UTC'))) \\\n",
"    .groupBy(\"pickup_borough\", \"dropoff_borough\", \"hour\") \\\n",
"    .agg(F.mean(\"tolls_amount\").alias(\"mean_tolls_amount\")) \\\n",
"    .orderBy(\"pickup_borough\", \"dropoff_borough\", \"hour\")\n",
"\n",
"# At most 4 x 4 borough pairs x 24 hours = 384 rows, so this prints the whole result.\n",
"df_ex5.show(len(boroughs_ex5) ** 2 * 24)"
]
},
{
"cell_type": "markdown",
"id": "884b4cf9",
"metadata": {},
"source": [
"### Exercise 6\n",
"Create a dataframe that, for each district, shows the shortest and longest `trip_distance` among trips starting and ending in that district. What are the lengths of the longest and shortest trips in Manhattan?"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0aa8d795",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"id": "756da7e4",
"metadata": {},
"source": [
"### Exercise 7\n",
"Consider only the trips _within_ districts. What are the most expensive and second-most expensive\n",
"trips - based on `total_amount` - in every district?"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ca83556d",
"metadata": {},
"outputs": [],
"source": []
},
{
"cell_type": "markdown",
"id": "4f1e0800",
"metadata": {},
"source": [
"### Exercise 8\n",
"Create a dataframe where each row represents a driver, and there is one column per district.\n",
"For each driver-district pair, the dataframe provides the maximum number of consecutive trips\n",
"that the driver made within that district, i.e. trips that both start and end there.\n",
"\n",
"For example, if for driver A we have (sorted by time):\n",
"- Trip 1: Bronx → Bronx\n",
"- Trip 2: Bronx → Bronx\n",
"- Trip 3: Bronx → Manhattan\n",
"- Trip 4: Manhattan → Bronx\n",
"\n",
"The maximum number of consecutive trips for Bronx is 2 (Trips 1 and 2)."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "edde38bb",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.3"
}
},
"nbformat": 4,
"nbformat_minor": 5
}