---
author: Claudio Maggioni
title: Visual Analytics -- Assignment 2 -- Part 1
geometry: margin=2cm,bottom=3cm
---

# Indexing

The first step of indexing is to convert the given CSV dataset (stored in
`data/restaurants.csv`) into a JSON-lines file which can be directly used as the
HTTP request body of Elasticsearch document insertion requests.

The conversion is performed by the script `./convert.sh`. The converted file
is stored in `data/restaurants.jsonl`.

The gist of the conversion script is the following invocation of the _jq_ tool:

```shell
jq -s --raw-input --raw-output \
  'split("\n") | .[1:-1] | map(split(",")) |  # split into lines, drop header and trailing blank line, split into columns
  map({
    "id": .[0],
    "name": .[1],
    "city": .[2],
    "location": {
      # strip the quote/bracket characters and surrounding whitespace, then convert to numbers
      "lon": .[8] | sub("^\"\\["; "") | sub("\\s*"; "") | tonumber,
      "lat": .[9] | sub("\\]\"$"; "") | sub("\\s*"; "") | tonumber
    },
    "averageCostForTwo": .[3],
    "aggregateRating": .[4],
    "ratingText": .[5],
    "votes": .[6],
    "date": .[7]
  })' "$input"
```

Here the CSV file is read as raw text, split into lines, has its first and
last line discarded (as they are respectively the CSV header and a terminating
blank line), split into columns by the `,` (comma) delimiter character,
and each line is converted into a JSON object by _jq_. Note that
_jq_ is invoked in slurp mode (`-s`) so that the whole input is processed in one go.

Location coordinate strings are represented in the CSV with the pattern:

```
"[{longitude}, {latitude}]"
```

(with `{longitude}` and `{latitude}` being two JSON-formatted `float`s).
Therefore, the comma split performed by _jq_ divides each location cell into two pieces.
I exploit this side effect
by simply removing the spurious non-numeric characters (like `[]"` and space),
converting the obtained strings into `float`s and storing them in the `lon` and `lat`
properties of `location`.
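
To illustrate this clean-up (this is not part of `convert.sh`, and the sample row and
coordinate values are made up for illustration), the same _jq_ substitutions can be
applied to a single CSV line:

```shell
# Hypothetical CSV row; after the comma split, fields 8 and 9 hold the two coordinate halves.
echo '1,Foo,Delhi,100,4.5,Excellent,314,2020-01-01,"[77.1, 28.7]"' |
  jq -R 'split(",")
         | { "lon": (.[8] | sub("^\"\\["; "") | sub("\\s*"; "") | tonumber),
             "lat": (.[9] | sub("\\]\"$"; "") | sub("\\s*"; "") | tonumber) }'
# yields lon 77.1 and lat 28.7
```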

After the conversion, the JSON-lines dataset is uploaded as an _Elasticsearch_ index
named `restaurants` by the script `upload.sh`. The script assumes _Elasticsearch_ is
deployed locally, is reachable over HTTPS, and has HTTP basic authentication enabled.
Installation parameters for my machine are hardcoded in variables at the start
of the script and should be adapted to the local installation before running it.
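
For reference, the hardcoded parameters have roughly the following shape (the
variable names and values here are placeholders, not the real ones):

```shell
# Placeholder connection parameters (adapt to the local installation).
ES_URL="https://localhost:9200"     # base URL of the local Elasticsearch instance
ES_USER="elastic"                   # HTTP basic authentication user
ES_PASS="changeme"                  # HTTP basic authentication password
DATA_FILE="data/restaurants.jsonl"  # converted dataset to upload
```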

The upload script performs the following steps, in order (a sketch of the
corresponding HTTP calls is given after the list):

- It tries to `DELETE` (ignoring failures, e.g. if the index does not exist) and then
  `POST`s the `/restaurants` index, which will be used to store the documents.
- Field mappings are `POST`ed at the URI `/restaurants/_mappings/`. The mappings
  are defined in the `mappings.json` file.
- The lines of the dataset are read one by one, and each corresponding
  document is `POST`ed at the URI `/restaurants/_doc/{id}`, where `{id}` is
  the value of the `id` field for the document/line.
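
A rough sketch of these steps with _curl_ (this is not the actual `upload.sh`; it
reuses the placeholder variables from the snippet above, passes `-k` to tolerate a
self-signed local certificate, and uses `PUT` for index creation, which recent
Elasticsearch versions require):

```shell
# Sketch of the upload steps; variable names are the placeholders defined above.

# 1. Drop the index if it already exists (failures ignored), then create it.
curl -k -u "$ES_USER:$ES_PASS" -X DELETE "$ES_URL/restaurants" || true
curl -k -u "$ES_USER:$ES_PASS" -X PUT "$ES_URL/restaurants"

# 2. Define the field mappings from mappings.json.
curl -k -u "$ES_USER:$ES_PASS" -X PUT "$ES_URL/restaurants/_mappings" \
  -H 'Content-Type: application/json' --data-binary @mappings.json

# 3. Insert one document per JSON line, using its "id" field as the document id.
while read -r doc; do
  id="$(echo "$doc" | jq -r '.id')"
  curl -k -u "$ES_USER:$ES_PASS" -X POST "$ES_URL/restaurants/_doc/$id" \
    -H 'Content-Type: application/json' --data-binary "$doc"
done < "$DATA_FILE"
```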

The mappings map the `id` field to type `long`, all other numeric fields to type
`float`, the `location` field to type `geo_point`, and the `date` field to type
`date`, using non-strict ISO 8601 with optional time as the parsing format. All
string fields are stored as type `text`, while also defining a `.keyword` multi-field
for each of them to allow exact-match queries.
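
Based on this description, `mappings.json` plausibly has the following shape (a
sketch reconstructed from the prose above, not the file itself; the field names
follow the _jq_ conversion, and the exact date format string is an assumption):

```json
{
  "properties": {
    "id":                { "type": "long" },
    "name":              { "type": "text", "fields": { "keyword": { "type": "keyword" } } },
    "city":              { "type": "text", "fields": { "keyword": { "type": "keyword" } } },
    "location":          { "type": "geo_point" },
    "averageCostForTwo": { "type": "float" },
    "aggregateRating":   { "type": "float" },
    "ratingText":        { "type": "text", "fields": { "keyword": { "type": "keyword" } } },
    "votes":             { "type": "float" },
    "date":              { "type": "date", "format": "date_optional_time" }
  }
}
```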
|