hw2 (part 1): done, report done for ex1
parent 92b7029703
commit b961d91fa6

4 changed files with 120 additions and 0 deletions

Assignment2_part1/queries/query3c.http (normal file, 36 lines)

@@ -0,0 +1,36 @@
GET /restaurants/_search
{
  "size": 0,
  "query": {
    "geo_distance": {
      "distance": "9000km",
      "location": {
        "lat": 28.642449499999998,
        "lon": 77.10684570000001
      }
    }
  },
  "aggs": {
    "city_term": {
      "terms": {
        "field": "ratingText.keyword",
        "size": 10000,
        "shard_size": 10000
      },
      "aggs": {
        "max_vote_count": {
          "max": {
            "field": "votes"
          }
        },
        "max_vote_bucket_sort": {
          "bucket_sort": {
            "sort": [
              { "max_vote_count": { "order": "desc" } }
            ]
          }
        }
      }
    }
  }
}
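
A hypothetical way to exercise the same search with curl against a local
HTTPS installation (placeholder credentials; `query3c.json` stands for the
JSON request body above saved to a file):

```shell
# Placeholder host and credentials; query3c.json is the body shown above.
curl -k -u elastic:changeme "https://localhost:9200/restaurants/_search" \
  -H 'Content-Type: application/json' -d @query3c.json
```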

Assignment2_part1/report/build.sh (executable file, 8 lines)

@@ -0,0 +1,8 @@
#!/bin/bash

set -e

SCRIPT_DIR=$(cd -- "$( dirname -- "${BASH_SOURCE[0]}" )" &> /dev/null && pwd)

cd "$SCRIPT_DIR"
pandoc main.md -o main.pdf

Assignment2_part1/report/main.md (normal file, 76 lines)

@@ -0,0 +1,76 @@
---
author: Claudio Maggioni
title: Visual Analytics -- Assignment 2 -- Part 1
geometry: margin=2cm,bottom=3cm
---

# Indexing

The first step of indexing is to convert the given CSV dataset (stored in
`data/restaurants.csv`) into a JSON-lines file which can be directly used as
the HTTP request body of Elasticsearch document insertion requests.

The conversion is performed by the script `./convert.sh`. The converted file
is stored in `data/restaurants.jsonl`.

The gist of the conversion script is the following invocation of the _jq_
tool:
```shell
jq -s --raw-input --raw-output \
  'split("\n") | .[1:-1] | map(split(",")) |
  map({
    "id": .[0],
    "name": .[1],
    "city": .[2],
    "location": {
      "lon": .[8] | sub("^\"\\["; "") | sub("\\s*"; "") | tonumber,
      "lat": .[9] | sub("\\]\"$"; "") | sub("\\s*"; "") | tonumber
    },
    "averageCostForTwo": .[3],
    "aggregateRating": .[4],
    "ratingText": .[5],
    "votes": .[6],
    "date": .[7]
  })' "$input"
```

Here the CSV file is read as raw text, split into lines, stripped of its
first and last line (respectively the CSV header and a terminating blank
line), and split into columns on the `,` (comma) delimiter; each line is then
converted into a JSON object by _jq_. Note that _jq_ is invoked in slurp mode
(`-s`) so that the whole input is processed in one go.
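
As an illustration, a minimal sketch of the first stages of this pipeline on
a hypothetical two-row sample (not the actual dataset):

```shell
# Header row, one data row, and a terminating newline, as in the real file.
printf 'id,name,city\n1,Foo,Delhi\n' |
  jq -s --raw-input -c 'split("\n") | .[1:-1] | map(split(","))'
# => [["1","Foo","Delhi"]]
```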

Location coordinate strings are represented in the CSV with the pattern:

```
"[{longitude}, {latitude}]"
```

(with `{longitude}` and `{latitude}` being two JSON-formatted `float`s).
Therefore, the comma split performed by _jq_ divides each cell value into two
pieces. I exploit this side effect by simply removing the spurious
non-numeric characters (like `[`, `]`, `"` and whitespace), converting the
obtained strings into `float`s, and storing them in the `lon` and `lat`
properties of `location`.
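
For instance, the cleanup applied to the two halves of a hypothetical cell
`"[77.1068457, 28.6424495]"` after the comma split works as follows:

```shell
# First half: strip the leading quote and bracket, then parse the number.
jq -n '"\"[77.1068457" | sub("^\"\\["; "") | sub("\\s*"; "") | tonumber'
# => 77.1068457

# Second half: strip the trailing bracket and quote, then the leading space.
jq -n '" 28.6424495]\"" | sub("\\]\"$"; "") | sub("\\s*"; "") | tonumber'
# => 28.6424495
```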

After the conversion, the JSON-lines dataset is uploaded as an
_Elasticsearch_ index named `restaurants` by the script `upload.sh`. The
script assumes _Elasticsearch_ is deployed locally, served over HTTPS, and
with HTTP basic authentication turned on. Installation parameters for my
machine are hardcoded in variables at the start of the script and should be
adapted to the local installation before running it.

The upload script, in order (see the sketch after this list):

- Tries to `DELETE` the `/restaurants` index (ignoring failures, e.g. if the
  index does not exist) and then `POST`s it again, so that documents are
  stored in a fresh index.
- `POST`s the field mappings, defined in the `mappings.json` file, at the URI
  `/restaurants/_mappings/`.
- Reads the lines of the dataset one by one and `POST`s the corresponding
  document at the URI `/restaurants/_doc/{id}`, where `{id}` is the value of
  the `id` field for the document/line.
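
A minimal sketch of these calls, assuming hypothetical placeholder values for
the hardcoded installation parameters (`ES_URL`, `ES_USER` and `ES_PASS` are
illustrative names, not the script's actual variables):

```shell
#!/bin/bash
# Hypothetical installation parameters; adapt to the local setup.
# -k skips TLS verification for a self-signed local certificate.
ES_URL="https://localhost:9200"
ES_USER="elastic"
ES_PASS="changeme"

# Recreate the index, ignoring a failing DELETE (e.g. index does not exist).
# Note: current Elasticsearch versions expect PUT for index creation.
curl -k -u "$ES_USER:$ES_PASS" -X DELETE "$ES_URL/restaurants" || true
curl -k -u "$ES_USER:$ES_PASS" -X PUT "$ES_URL/restaurants"

# Upload the field mappings.
curl -k -u "$ES_USER:$ES_PASS" -X POST "$ES_URL/restaurants/_mappings" \
  -H 'Content-Type: application/json' -d @mappings.json

# Insert each document under its own id.
while IFS= read -r line; do
  id=$(jq -r '.id' <<< "$line")
  curl -k -u "$ES_USER:$ES_PASS" -X POST "$ES_URL/restaurants/_doc/$id" \
    -H 'Content-Type: application/json' -d "$line"
done < data/restaurants.jsonl
```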

The mappings map the `id` field to type `long`, all other numeric fields to
type `float`, the `location` field to type `geo_point`, and the `date` field
to type `date`, using non-strict ISO 8601 with optional time as the parsing
format. All string fields are stored as type `text`, while also defining a
`.keyword` sub-field for each one to allow exact-match queries on every
string field.
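
A plausible shape for `mappings.json`, consistent with this description
(abbreviated; every field name except those mentioned above is an
assumption):

```json
{
  "properties": {
    "id":       { "type": "long" },
    "votes":    { "type": "float" },
    "location": { "type": "geo_point" },
    "date":     { "type": "date", "format": "date_optional_time" },
    "name": {
      "type": "text",
      "fields": { "keyword": { "type": "keyword" } }
    }
  }
}
```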

Assignment2_part1/report/main.pdf (binary file)
Binary file not shown.