hw2 (part 1): done

2023-04-24 23:18:10 +02:00 · 2023-04-24 23:18:10 +02:00 · 4a5a71cb28
commit 4a5a71cb28
parent b961d91fa6
5 changed files with 129 additions and 22 deletions
--- a/Assignment2_part1/queries/query2a.http
+++ b/Assignment2_part1/queries/query2a.http
@@ -12,12 +12,12 @@ GET /restaurants/_search
      "should": [
        {
          "match": {
-            "ratingText": "Very Good"
+            "ratingText.keyword": "Very Good"
          }
        },
        {
          "match": {
-            "ratingText": "Excellent"
+            "ratingText.keyword": "Excellent"
          }
        }
      ],
@@ -30,4 +30,4 @@ GET /restaurants/_search
      ]
    }
  }
-}
+}
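Note on this hunk: assuming `ratingText` is analyzed with the default standard analyzer (the mappings themselves are not shown in this commit), a `match` on the `text` field splits "Very Good" into the tokens `very` and `good` and accepts either one, so restaurants rated just "Good" would also match. Targeting `ratingText.keyword` compares the whole string instead. As a hedged sketch, not the query in `query2a.http`, the same exact-match condition could also be written as a single `terms` filter:

```
# Sketch only: assumes the `.keyword` sub-field that the report's mappings describe.
# The `name` constraints of the real query are omitted here.
GET /restaurants/_search
{
  "query": {
    "bool": {
      "filter": [
        { "terms": { "ratingText.keyword": ["Very Good", "Excellent"] } }
      ]
    }
  }
}
```

Placing the condition in `filter` context skips scoring for it, which seems reasonable because an exact rating match either holds or does not.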
--- a/Assignment2_part1/report/.gitignore
+++ b/Assignment2_part1/report/.gitignore
@@ -0,0 +1 @@
+_tmp.md
--- a/Assignment2_part1/report/build.sh
+++ b/Assignment2_part1/report/build.sh
@@ -5,4 +5,5 @@ set -e
 SCRIPT_DIR=$(cd -- "$( dirname -- "${BASH_SOURCE[0]}" )" &> /dev/null && pwd)

 cd "$SCRIPT_DIR"
-pandoc main.md -o main.pdf
+m4 -I"$SCRIPT_DIR" main.md > _tmp.md
+pandoc _tmp.md -o main.pdf
--- a/Assignment2_part1/report/main.md
+++ b/Assignment2_part1/report/main.md
@@ -4,6 +4,8 @@ title: Visual Analytics -- Assignment 2 -- Part 1
 geometry: margin=2cm,bottom=3cm
 ---

+changequote(`{{', `}}')
+
 # Indexing

 The first step of indexing is to convert the given CSV dataset (stored in 
@@ -13,27 +15,13 @@ HTTP request body of Elasticsearch document insertion requests.
 The conversion is performed by the script `./convert.sh`. The converted file 
 is stored in `data/restaurants.jsonl`.

-The gist of the conversion script is the following invocation of the _jq_ tool:
+The sources of `./convert.sh` are listed below:

 ```shell
-jq -s --raw-input --raw-output \
-  'split("\n") | .[1:-1] | map(split(",")) |
-    map({
-      "id": .[0],
-      "name": .[1],
-      "city": .[2],
-      "location": {
-        "lon": .[8] | sub("^\"\\["; "") | sub("\\s*"; "") | tonumber,
-        "lat": .[9] | sub("\\]\"$"; "") | sub("\\s*"; "") | tonumber,
-      },
-      "averageCostForTwo": .[3],
-      "aggregateRating": .[4],
-      "ratingText": .[5],
-      "votes": .[6],
-      "date": .[7]
-    })' "$input"
+include({{../convert.sh}})
 ```

+The gist of the conversion script is its invocation of the _jq_ tool.
 Here the CSV file is read as raw text, split into lines, has its first and 
 last line discarded (as they are respectively the CSV header and a terminating
 blank line), split into columns by the `,` (comma) delimiter character,
@@ -69,8 +57,125 @@ The upload script, in order:
    document is `POST`ed at the URI `/restaurants/_doc/{id}` where `{id}` is 
    the value of the `id` field for the document/line.

+The sources of the upload script are listed below:
+
+```shell
+include({{../upload.sh}})
+```
+
 The mappings map the `id` field to type `long`, all other numeric fields to type
 `float`, the `location` field to type `geo_point`, and the `date` field to type
 `date` by using non-strict ISO 8601 with optional time as a parsing format. All 
 string fields are stored as type `text`, while also defining a `.keyword` alias
-for each to allow exact match queries on each field. 
+for each to allow exact match queries on each field. 
+
+9499 documents are imported.
+
+# Queries
+
+## We would like to get those restaurants that have 'pizza' in the name and not 'pasta'. Get only the restaurants that have been reviewed at least as ‘Very Good’.
+
+This query is implemented with the following Dev Tools command:
+
+```
+include({{../queries/query2a.http}})
+```
+
+The query searches for all documents whose `name` text field contains the word "pizza" in its bag-of-words,
+whose `ratingText` field is exactly "Very Good" or "Excellent" (the only rating above "Very Good"),
+and whose name does not match "pasta" in text search.
+
+239 hits are returned.
+
+## Which are the 5 most expensive restaurants whose reviews were done in 2018? We are interested in reviews which refer only to places within 20 km from Athens (33.9259, -83.3389) and would like to look at the 5 most expensive.
+
+The following query answers the question:
+
+```
+include({{../queries/query2b.http}})
+```
+
+The 5 most expensive restaurants are in descending order:
+
+- "Five & Ten" (id 108)
+- "The National" (id 114)
+- "DePalma's Italian Cafe - East Side" (id 107)
+- "Shokitini" (id 112)
+- "Choo Choo Eastside" (id 102).
+
+## Get all restaurants which contain the substring 'pizz' in the restaurant name but that contain neither 'pizza' nor 'pizzeria'.
+
+The following query answers the question:
+
+```
+include({{../queries/query2c.http}})
+```
+
+The query specifies the required constraints using regular expressions instead of plain `match` constraints in order to search for letter sequences
+within words, rather than only matching whole words in the bag-of-words.
+
+Only one restaurant is returned: "[Pizzoccheri](https://youtu.be/Mq0IqiFXIZQ)" (id 2237).
+
+# Aggregations
+
+## Show the number of restaurants reviewed as 'Good' aggregated by number of votes. Please consider the following ranges: from 0 to 250, from 250 to 500, from 500 to 750, from 750 to 1000. For each bucket we would like to know the minimum and maximum value of the average cost per 2.
+
+This query outputs the answer:
+
+```
+include({{../queries/query3a.http}})
+```
+
+The query does not face the shard size problem, since none of the aggregations it uses are affected by it.
+Each shard can compute the exact document count, minimum and maximum for every fixed range, and merging the
+partial results (summing counts, taking the minimum of minima and the maximum of maxima) is trivially
+correct without approximation when implemented in a distributed MapReduce-like workflow.
+
+The answer found is:
+
+- Range $[0, 250]$: document count $=2060$, minimum cost $=0$, maximum cost $=350000$
+- Range $[250, 500]$: document count $=583$, minimum cost $=10$, maximum cost $=450000$
+- Range $[500, 750]$: document count $=217$, minimum cost $=10$, maximum cost $=5000$
+- Range $[750, 1000]$: document count $=99$, minimum cost $=10$, maximum cost $=200000$
+
+## We are interested in cities which have not less than 10 restaurants and restaurants that have at least 100 votes. Which are the 7 cities with the highest average restaurant price (cost for two)?
+
+This query implements a way to fetch the answer:
+
+```
+include({{../queries/query3b.http}})
+```
+
+The document counts returned by the `terms` aggregator are vulnerable to the shard size problem,
+so to ensure correct counts `shard_size` is set to an upper bound of the total document count.
+The output can be verified as correct because it reports a `doc_count_error_upper_bound` of 0. Additionally,
+the `size` parameter of the `terms` aggregator is set to the same value so as not to limit the number of buckets
+computed by the aggregator.
+
+The cities found, in decreasing order of average `averageCostForTwo`, are:
+
+- "Jakarta", document count $=16$, average price $=308437.5$
+- "Colombo", document count $=14$, average price $=2535.714285714286$
+- "Hyderabad", document count $=17$, average price $=1358.8235294117646$
+- "Pune", document count $=20$, average price $=1337.5$
+- "Jaipur", document count $=18$, average price $=1316.6666666666667$
+- "Kolkata", document count $=20$, average price $=1272.5$
+- "Bangalore", document count $=20$, average price $=1232.5$
+
+## Show the highest number of votes for different rating types in descending order. You should consider only restaurants that are within 9000 km of New Delhi (28.642449499999998, 77.10684570000001).
+
+This query provides the solution:
+
+```
+include({{../queries/query3c.http}})
+```
+
+The `terms` aggregator in this query is subject to the same shard size problem described for the previous question.
+
+The results are:
+
+- Rating "Excellent", document count $=196$, maximum vote count $=10934$
+- Rating "Very Good", document count $=834$, maximum vote count $=7931$
+- Rating "Good", document count $=1902$, maximum vote count $=4914$
+- Rating "Average", document count $=3683$, maximum vote count $=2460$
+- Rating "Poor", document count $=180$, maximum vote count $=2412$
+- Rating "Not rated", document count $=2135$, maximum vote count $=3$
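Note on the report's shard size discussion for the city question: the included `query3b.http` is not part of this diff, so the request below is only a hedged sketch of how such a `terms` aggregation might be configured. The aggregation names `by_city`, `avg_cost` and `top_7` and the bound `10000` are assumptions chosen for illustration (any value at or above the 9499 imported documents would serve the same purpose), and `city.keyword` assumes the `.keyword` sub-field described in the mappings paragraph.

```
# Sketch only, not the contents of query3b.http.
# `by_city`, `avg_cost`, `top_7` and the bound 10000 are illustrative assumptions.
GET /restaurants/_search
{
  "size": 0,
  "query": { "range": { "votes": { "gte": 100 } } },
  "aggs": {
    "by_city": {
      "terms": {
        "field": "city.keyword",
        "size": 10000,
        "shard_size": 10000,
        "min_doc_count": 10
      },
      "aggs": {
        "avg_cost": { "avg": { "field": "averageCostForTwo" } },
        "top_7": {
          "bucket_sort": {
            "sort": [ { "avg_cost": { "order": "desc" } } ],
            "size": 7
          }
        }
      }
    }
  }
}
```

With `size` and `shard_size` both at or above the number of distinct cities, every shard returns all of its buckets, so the response should report a `doc_count_error_upper_bound` of 0, as the report observes. Whether the 10-restaurant threshold should be applied before or after the vote filter is a reading of the question that this sketch does not settle.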
--- a/Assignment2_part1/report/main.pdf
+++ b/Assignment2_part1/report/main.pdf