% vim: set ts=2 sw=2 et tw=80:
\documentclass{scrartcl}
\usepackage{hyperref}
\usepackage{parskip}
\usepackage{minted}
\usepackage[utf8]{inputenc}
\usepackage{subcaption}
\usepackage{graphicx}
\usepackage{float}
\setlength{\parindent}{0pt}
\usepackage[margin=2.5cm]{geometry}
\title{\textit{Image Search IR System} \\\vspace{0.3cm}
\Large{WS2020-21 Information Retrieval Project}}
\author{Claudio Maggioni}
\begin{document}
\maketitle
\tableofcontents
\listoffigures
\newpage
\section{Introduction}
This report is a summary of the work I have done to create the ``Image Search IR
system'', a proof-of-concept IR system implementing the ``Image Search
Engine'' project (project \#13).
The project is built on a simple
\textit{Scrapy}-\textit{Solr}-\textit{HTML5+CSS+JS} stack. Installation
instructions, an in-depth look at the project components for scraping, indexing,
and displaying the results, and finally the user evaluation report can all be
found in the following sections.
\section{Installation instructions}
\subsection{Project repository}
The project Git repository is located here:
\url{https://git.maggioni.xyz/maggicl/IRProject}.
\subsection{Solr installation}
The installation of the project and the population of the test collection with
the scraped documents are automated by a single script. The script requires you
to have downloaded \textit{Solr} version 8.6.2 as a ZIP file, i.e.\ the same
\textit{Solr} ZIP we had to download during the lab lectures. Should you need to
download a copy of the ZIP file, you can find it here: \url{https://maggioni.xyz/solr-8.6.2.zip}.
Clone the project's git repository and position yourself with a shell on the
project's root directory. Then execute this command:
% linenos
\begin{minted}[frame=lines,framesep=2mm]{bash}
./solr_install.sh {ZIP path}
\end{minted}
where \texttt{\{ZIP path\}} is the path of the ZIP file mentioned earlier. This
will install \textit{Solr}, start it, and populate it with the test collection.
\subsection{UI installation}
In order to start the UI, open the file \texttt{ui/index.html} with your
browser of choice. In order to use the UI, it is necessary to bypass
Cross-Origin Resource Sharing (CORS) security checks by downloading and
enabling a ``CORS everywhere'' extension. I suggest
\href{https://addons.mozilla.org/en-US/firefox/addon/cors-everywhere/}{this one} for
Mozilla Firefox and derivatives.
\subsection{Run the website scrapers}
A prerequisite to run the Flickr crawler is to have a working Scrapy Splash
instance listening on \texttt{localhost:8050}. This can be achieved by
executing this Docker command, should a Docker installation be available:
\begin{minted}[frame=lines,framesep=2mm]{bash}
docker run -p 8050:8050 scrapinghub/splash
\end{minted}
In order to run all the website scrapers, run the script \texttt{./scrape.sh}
with no arguments.
\section{Scraping}
The three websites chosen to be scraped were \url{flickr.com}, a user-centric
image sharing service aimed at photography amateurs and professionals,
\url{123rf.com}, a stock image website, and \url{shutterstock.com}, another
stock image website.
The stock photo websites were scraped with standard scraping technology using
plain \texttt{scrapy}, while \textit{Flickr} was scraped with browser emulation
through \texttt{scrapy-splash} in order to execute JavaScript code and scrape
infinite-scroll paginated data.
I would like to point out that, in order to save space, I scraped only image
links and not the images themselves. Should any of the content I scraped be
deleted from the services listed above, some results may point to images that
no longer exist.

As a final note, since some websites are not so kind to bots in their
\texttt{robots.txt} file (\textit{Flickr} in particular blocks all bots except
Google), ``robots.txt compliance'' has been turned off for all scrapers and the
user agent has been changed to mimic a normal browser.
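For reference, disabling ``robots.txt compliance'' and overriding the user
agent boils down to two standard settings in Scrapy's configuration module.
The snippet below is only a sketch of these settings; the exact user agent
string used by the project may differ:
\begin{minted}[frame=lines,framesep=2mm]{python}
# Excerpt of Scrapy settings (illustrative values).

# Do not fetch or honour robots.txt.
ROBOTSTXT_OBEY = False

# Present the crawler as a regular desktop browser instead of the default
# Scrapy user agent.
USER_AGENT = (
    "Mozilla/5.0 (X11; Linux x86_64; rv:83.0) Gecko/20100101 Firefox/83.0"
)
\end{minted}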
All scraper implementations and related files are located in the directory
\texttt{photo\_scraper/spiders}.
\subsection{\textit{Flickr}}
\subsubsection{Simulated browser technology \textit{Splash}}
As mentioned before, the implementation of the \textit{Flickr} scraper uses
\textit{Splash}, a browser emulator that supports JavaScript execution and
simulated user interaction. This component is essential to let the website load
correctly and to load as many photos as possible in the photo list pages, which
are scraped by emulating a user performing an ``infinite'' scroll down.

Here is the Lua script used by \textit{Splash} to emulate infinite scrolling.
These exact contents can be found in the file
\texttt{infinite\_scroll.lua}.
\begin{minted}[linenos,frame=lines,framesep=2mm]{lua}
function main(splash)
  local num_scrolls = 20
  local scroll_delay = 0.8

  local scroll_to = splash:jsfunc("window.scrollTo")
  local get_body_height = splash:jsfunc(
    "function() {return document.body.scrollHeight;}"
  )

  assert(splash:go(splash.args.url))
  splash:wait(splash.args.wait)
  for _ = 1, num_scrolls do
    scroll_to(0, get_body_height())
    splash:wait(scroll_delay)
  end
  return splash:html()
end
\end{minted}
Line 13 contains the instruction that scrolls down by one page height. This
instruction runs in the loop of lines 12--15, which executes the scroll
\texttt{num\_scrolls} times, waiting \texttt{scroll\_delay} seconds after every
execution.

After this operation is done, the resulting HTML markup is returned and normal
crawling techniques can work on this intermediate result.
\subsubsection{Scraper implementation}
The Python implementation of the \textit{Flickr} scraper can be found under
\texttt{flickr.py}.
Sadly, other than a recently posted gallery of images, \textit{Flickr} offers
no curated list or categorization of its image content that would allow finding
images without querying for them. I therefore had to use the \textit{Flickr}
search engine to query for some common words (including the 100 most common
English verbs). Each search result page is then fed through \textit{Splash} and
the resulting markup is searched for links to photo pages. Each link is opened
to scrape the image link and its metadata.
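To give an idea of how the pieces fit together, the sketch below outlines the
structure of such a spider. The class name, CSS selectors, and word list are
illustrative placeholders and do not reproduce the actual contents of
\texttt{flickr.py}:
\begin{minted}[frame=lines,framesep=2mm]{python}
# Illustrative sketch of a Splash-based Flickr spider (not the real flickr.py).
import scrapy
from scrapy_splash import SplashRequest

# Lua script shown above, shipped with the project.
with open("infinite_scroll.lua") as f:
    INFINITE_SCROLL = f.read()


class FlickrSpider(scrapy.Spider):
    name = "flickr_example"
    queries = ["be", "have", "do", "say", "go"]  # excerpt of the word list

    def start_requests(self):
        for query in self.queries:
            url = f"https://www.flickr.com/search/?text={query}"
            # Render the result page in Splash, scrolling down to load
            # as many photos as possible before parsing.
            yield SplashRequest(url, self.parse_results, endpoint="execute",
                                args={"lua_source": INFINITE_SCROLL, "wait": 2})

    def parse_results(self, response):
        # Follow each photo link found in the rendered markup
        # (selector is a placeholder).
        for href in response.css("a.photo-link::attr(href)").getall():
            yield response.follow(href, self.parse_photo)

    def parse_photo(self, response):
        # Scrape the image link and its metadata (selectors are placeholders).
        yield {
            "t_title": response.css("h1::text").get(),
            "img_url": response.css("img.main-photo::attr(src)").get(),
        }
\end{minted}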
\subsection{Implementation for \textit{123rf} and \textit{Shutterstock}}
The \textit{123rf} and \textit{Shutterstock} websites do not require the use of
\textit{Splash} to be scraped and, as stock image websites, offer several
precompiled catalogs of images that can be easily crawled. The crawler
implementations, which can be found in \texttt{stock123rf.py} and
\texttt{shutterstock.py} respectively, are pretty straightforward: they
navigate from the list of categories, to each category's photo list, and then
to the individual photo page to scrape the image link and metadata.
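The following sketch outlines this category-to-photo navigation pattern with
plain \texttt{scrapy}; the start URL and CSS selectors are illustrative
placeholders, not the ones used by the two actual crawlers:
\begin{minted}[frame=lines,framesep=2mm]{python}
# Illustrative sketch of a stock-photo spider (not the real implementation).
import scrapy


class StockSpider(scrapy.Spider):
    name = "stock_example"
    start_urls = ["https://www.example-stock-site.com/categories/"]

    def parse(self, response):
        # From the list of categories to each category's photo list.
        for href in response.css("a.category::attr(href)").getall():
            yield response.follow(href, self.parse_category)

    def parse_category(self, response):
        # From the photo list to the individual photo pages.
        for href in response.css("a.photo::attr(href)").getall():
            yield response.follow(href, self.parse_photo)

    def parse_photo(self, response):
        # Scrape the image link and its metadata.
        yield {
            "t_title": response.css("h1::text").get(),
            "t_description": response.css("p.description::text").get(),
            "img_url": response.css("img.preview::attr(src)").get(),
        }
\end{minted}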
\section{Indexing and \textit{Solr} configuration}
Solr configuration was probably the trickiest part of this project. I am not an
expert on Solr's XML configuration quirks, and I certainly have not become one
by implementing this project. However, I managed to assemble a configuration
that has both a tailored collection schema defined as XML and a custom Solr
controller to handle result clustering.
Configuration files for Solr can be found under the directory
\texttt{solr\_config}. This directory is symlinked by the
\texttt{solr\_install.sh} installation script so that it appears as a folder
named \texttt{server/solr/photo} inside the \texttt{solr} folder containing the
Solr installation. Therefore, the entire directory corresponds to the
configuration and data storage of the collection \texttt{photo}, the only
collection present in this project.
Please note that the \texttt{solr\_config/data} folder is ignored by Git and
thus not present in a freshly cloned repository: this is done to preserve only
the configuration files, and not the somewhat temporary collection data. The
collection data is uploaded every time \texttt{solr\_install.sh} is run, from
CSV files located in the \texttt{scraped} folder and produced by Scrapy.
The configuration was derived from the \texttt{techproducts} Solr example by
changing the collection schema and removing any unneeded controllers.
\subsection{Solr schema}
As some minor edits were made using Solr's web interface, the relevant XML
schema to analyse is the file \texttt{solr\_config/conf/managed-schema}. This
file also stores the edits done through the UI. An extract of the relevant
lines is shown below:
\begin{minted}[linenos,frame=lines,framesep=2mm]{xml}
<?xml version="1.0" encoding="UTF-8"?>
<!-- Solr managed schema - automatically generated - DO NOT EDIT -->
<schema name="example" version="1.6">
<uniqueKey>id</uniqueKey>
<!-- Omitted field type definitions and default fields added by Solr -->
<field name="id" type="string" multiValued="false" indexed="true"
required="true" stored="true"/>
<field name="date" type="text_general" indexed="true" stored="true"/>
<field name="img_url" type="text_general" indexed="true" stored="true"/>
<field name="t_author" type="text_general" indexed="true" stored="true"/>
<field name="t_description" type="text_general" indexed="true"
stored="true"/>
<field name="t_title" type="text_general" indexed="true" stored="true"/>
<field name="tags" type="text_general" indexed="true" stored="true"/>
<field name="text" type="text_general" uninvertible="true" multiValued="true"
indexed="true" stored="true"/>
<!-- Omitted unused default dynamicField fields added by Solr -->
<copyField source="t_*" dest="text"/>
</schema>
\end{minted}
All fields except \texttt{id} have type \texttt{text\_general}. Fields whose
name starts with ``\texttt{t\_}'' are included in the \texttt{text} copy field,
which is used as the default field for document similarity when searching and
clustering.

The \texttt{id} field is of type \texttt{string}, but in actuality it is always
a positive integer. This field's values do not come from the scraped data, but
are computed as an auto-incremented progressive identifier when uploading the
collection to Solr using \texttt{solr\_install.sh}. Shown below is the
\texttt{awk}-based pipeline included in the installation script that performs
this task and uploads the collection.
\begin{minted}[linenos,frame=lines,framesep=2mm]{bash}
# at this point in the script, `pwd` is the repository root directory

cd scraped

# POST scraped data
tail -q -n +2 photos.csv 123rf.csv shutterstock.csv | \
  awk "{print NR-1 ',' \$0}" | \
  awk 'BEGIN {print "id,t_author,t_title,t_description,date,img_url,tags"}
       {print}' | \
  ../solr/bin/post -c photo -type text/csv -out yes -d
\end{minted}
Line 6 strips the heading line of the listed CSV files and concatenates them;
line 7 adds ``\{id\},'' at the beginning of each line, where \{id\} corresponds
to the line number. Lines 8 and 9 finally add the correct CSV heading,
including the ``id'' field. Line 10 reads the processed data and posts it to
Solr.
\subsection{Clustering configuration}
Clustering configuration was performed by using the \texttt{solrconfig.xml} file
from the \texttt{techproducts} Solr example and adapting it to the ``photo''
collection schema.
Here is the XML configuration relevant to the clustering controller. It can be
found at approximately line 900 of the \texttt{solrconfig.xml} file:
\begin{minted}[linenos,frame=lines,framesep=2mm]{xml}
<requestHandler name="/clustering"
startup="lazy"
enable="true"
class="solr.SearchHandler">
<lst name="defaults">
<bool name="clustering">true</bool>
<bool name="clustering.results">true</bool>
<!-- Field name with the logical "title" of a each document (optional) -->
<str name="carrot.title">t_title</str>
<!-- Field name with the logical "URL" of a each document (optional) -->
<str name="carrot.url">img_url</str>
<!-- Field name with the logical "content" of a each document (optional) -->
<str name="carrot.snippet">t_description</str>
<!-- Apply highlighter to the title/ content and use this for clustering. -->
<bool name="carrot.produceSummary">true</bool>
<!-- the maximum number of labels per cluster -->
<!--<int name="carrot.numDescriptions">5</int>-->
<!-- produce sub clusters -->
<bool name="carrot.outputSubClusters">false</bool>
<!-- Configure the remaining request handler parameters. -->
<str name="defType">edismax</str>
<str name="df">text</str>
<str name="q.alt">*:*</str>
<str name="rows">100</str>
<str name="fl">*,score</str>
</lst>
<arr name="last-components">
<str>clustering</str>
</arr>
</requestHandler>
\end{minted}
This clustering controller uses Carrot2 technology to perform ``shallow''
one-level clustering (line 19 disables sub-clusters). \texttt{t\_title} is used
as the ``title'' field for each document, \texttt{img\_url} as the ``document
location'' field, and \texttt{t\_description} as the ``description'' field (see
lines 9, 11, and 13 of the configuration respectively).
This controller replaces the normal \texttt{/select} controller, and thus one
single request will generate search results and clustering data. Defaults for
search are a 100 results limit and the use of \texttt{t\_*} fields to match
documents (lines 25 and 23 -- remember the definition of the \texttt{text} field).
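For illustration, here is a minimal sketch of how a client can obtain both the
result documents and the computed clusters with a single request to this
handler (Python with the \texttt{requests} library, assuming Solr runs on its
default port 8983; the shape of the \texttt{clusters} section follows the
output of Solr's clustering component):
\begin{minted}[frame=lines,framesep=2mm]{python}
# Minimal example of a single request to the /clustering handler.
import requests

resp = requests.get(
    "http://localhost:8983/solr/photo/clustering",
    params={"q": "lugano", "wt": "json"},
).json()

# Search results (at most 100, per the "rows" default configured above).
for doc in resp["response"]["docs"]:
    print(doc.get("t_title"), doc.get("img_url"))

# Clusters computed by Carrot2: each entry carries human-readable labels and
# the ids of the documents it groups together.
for cluster in resp.get("clusters", []):
    print(cluster["labels"], len(cluster["docs"]))
\end{minted}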
\section{User interface}
\subsection{UI flow}
Figure \ref{fig:ui} illustrates the IR system's UI and its main features.
\begin{figure}[H]
\begin{subfigure}{\textwidth}
\centering
\includegraphics[width=0.8\textwidth]{ui_start.png}
\caption{The UI, when opened, prompts the user to insert a query in the input
field and press Enter. Here the user typed ``Lugano''.}
\vspace{0.5cm}
\end{subfigure}
\begin{subfigure}{0.45\textwidth}
\centering
\includegraphics[width=\textwidth]{ui.png}
\caption{After the user inputs a query and presses Enter, resulting images are shown on the
right. Found clusters are shown on the left using FoamTree.}
\end{subfigure}
\hspace{0.1\textwidth}
\begin{subfigure}{0.45\textwidth}
\centering
\includegraphics[width=\textwidth]{ui_cl.png}
\caption{When a user clicks a cluster, results are filtered depending on the
cluster clicked. If the user clicks again on the cluster, filtering is
removed.}
\end{subfigure}
\caption{The UI and its various states.}
\label{fig:ui}
\end{figure}
\subsection{Technical details}
The UI has been implemented using HTML5, vanilla CSS and vanilla JS, with the
exception of the \textit{FoamTree} library from the Carrot2 project, which is
used to display the clustering bar to the left of the search results.

This is a single page application, i.e.\ all updates to the UI happen without a
page refresh. This was achieved by using AJAX requests to interact with Solr.
All UI files can be found under the \texttt{ui} directory in the repository root
directory. In order to run the UI, a ``CORS Everywhere'' extension must be
installed on the viewing browser. See the installation instructions for details.
\subsection{Clustering Component}
Event handlers offered by \textit{FoamTree} allowed for the implementation of
the result filtering feature that is triggered when a cluster is clicked.
\section{User evaluation}
The user evaluation was conducted remotely using Microsoft Teams by selecting
three of my colleagues and having them install and run my project on their
local systems. The evaluation took approximately 20 minutes for each test
subject, including installation.
\subsection{User evaluation implementation}
The questionnaire was implemented using USI's Qualtrics instance. Data for the
evaluation was collected through a questionnaire with a ``before test'' and an
``after test'' section.

In the ``before test'' section, users expressed their agreement to the test
procedure and stated their level of familiarity with image search text
retrieval systems, stating in particular whether they had ever searched for
user-created images or stock photos. All participants stated they were mostly
familiar with TR image search systems and that they had had the chance to
search for user-created images. Only one participant had never searched for
stock photos.
\begin{figure}[h]
\begin{subfigure}{1\textwidth}
\begin{tabular}{p{4.5cm}|p{2.9cm}|p{3.5cm}|p{3.2cm}}
Question abbreviation & Subject 1 & Subject 2 & Subject 3 \\
\hline
\textsc{metadata:} Start time & 2020-12-06 14:52 & 2020-12-07 13:35 & 2020-12-07 13:54\\
\textsc{metadata:} End time & 2020-12-06 14:58 & 2020-12-07 13:48 & 2020-12-07 14:05\\
Familiarity with image search TR systems & 4 & 4 & 5 \\
Has searched for user images & Yes & Yes & Yes \\
Has searched for stock photos & Yes & No & Yes \\
The UI was easy to use & 6 & 6 & 7 \\
``find a person sneezing'' task & 6 & 7 & 7 \\
``find Varenna'' task & 5 & 7 & 7 \\
Personal task description & ``Churchill Pfeil'' & Eiffel tower from query ``France'' &
``Italian traditional masks'' \\
Personal task & 7 & 7 & 7 \\
Clustering was helpful & 6 & 7 & 7 \\
Irrelevant results were a lot and distracting & 4 & 5 & 2 \\
Suggestions & Missing Search button & Did not understand clustering was a
filter & \textit{Great survey background image}\footnote{The background
image for the Qualtrics survey was this:
\url{https://usi.qualtrics.com/CP/Graphic.php?IM=IM_9Bmxolx0D6iGvUp}} \\
\end{tabular}
\caption{Data collected from the questionnaire.}
\label{fig:qs}
\end{subfigure}
\begin{subfigure}{1\textwidth}
\vspace{0.3cm}
\begin{tabular}{p{4cm} | p{11.3cm}}
Question abbreviation & Actual question presented to test subject \\
\hline
\textsc{initial disclaimer} & By proceeding with this user evaluation you
consent to be recorded and to participate in the user evaluation of the
``Image Search'' text retrieval system. The whole procedure will take at
most 15 minutes, and you are allowed to take a break or forfeit at any
time by first verbally notifying the examiner. \\
Familiarity with image search TR systems & How much are you familiar with
image search text retrieval systems (such as google images)? \\
Has searched for user images & Have you ever searched for an image in
user-created images sites such as Imgur, Pinterest, \ldots? \\
Has searched for stock photos & Have you ever searched for images in stock
image sites such as Shutterstock, 123rf, \ldots ? \\
The UI was easy to use & The UI of the ``Image Search'' TR system was easy to
use. \\
``find a person sneezing'' task & The search results for the
``find pictures of a person sneezing'' task felt accurate. \\
``find Varenna'' task & The search results for the ``find pictures of Varenna,
knowing Varenna is a town in the Lecco area'' task felt accurate. \\
Personal task description & Please describe briefly the task of personal choice you
selected. \\
Personal task & The search results for the task of
personal choice felt accurate. \\
Clustering was helpful & The results clustering feature made finding
relevant images easier. \\
Irrelevant results were a lot and distracting & The presence of irrelevant
results was significant and distracting from my search task. \\
Suggestions & Something to add? \\
\end{tabular}
\caption{Actual questions presented to the test subjects.}
\label{fig:qsa}
\end{subfigure}
\caption{The questionnaire}
\end{figure}
Figure \ref{fig:qs} illustrates the data gathered from the questionnaire.
Numeric values represent a 5-tier Likert scale for the ``Familiarity with image
search TR systems'' question (from 1 to 5: ``Not familiar at all'', ``Slightly
familiar'', ``Moderately familiar'', ``Very familiar'', ``Extremely familiar'')
and a 7-tier Likert scale for all other questions (from 1 to 7: ``Strongly
disagree'', ``Disagree'', ``Slightly disagree'', ``Neither agree nor disagree'',
``Somewhat agree'', ``Agree'', ``Strongly agree''). The start and end times are
expressed in CET, the local time of the participants.
All participants started the questionnaire before they actually began using the
IR system, so these timestamps can also be used to measure the length of each
session.
Figure \ref{fig:qsa} shows the actual questions, in the exact wording the test
subjects read them.
\subsection{Evaluation results}
Results from the user evaluation appear promising. The users are generally
satisfied with the system, the only notable exception being a general feeling
that the precision of the TR system is unsatisfactory (as evidenced by the
answers to the ``Irrelevant results were a lot and distracting'' question).
From the general suggestions, the need for two additional features emerges:
\begin{itemize}
\item The need for a search button next to the search box. At the moment,
  users must press Enter to search; this may be unclear, since it breaks the
  expectations of inexperienced users and this quirk is only described in the
  placeholder text of the search box;
\item The need to label and explain the clustering feature to users. Subject
2 in particular was confused by the current presentation of clusters and did
not understand that this feature could be used to filter results. This issue
can be solved by either adding a textual description of the feature or by
having a ``tutorial'' of sorts when the user first opens the interface.
\end{itemize}
\end{document}