% vim: set ts=2 sw=2 et tw=80:
\documentclass{scrartcl}
\usepackage{hyperref}
\usepackage{parskip}
\usepackage{minted}
\usepackage[utf8]{inputenc}
\usepackage{subcaption}
\usepackage{graphicx}
\setlength{\parindent}{0pt}
\usepackage{float}
\usepackage[margin=2.5cm]{geometry}
\title{\textit{Image Search IR System} \\\vspace{0.3cm}
{\Large WS2020--21 Information Retrieval Project}}

\author{Claudio Maggioni}
\begin{document}

\maketitle

\tableofcontents

\newpage

\section{Introduction}

This report is a summary of the work I have done to create the ``Image Search
IR system'', a proof-of-concept implementation of the ``Image Search Engine''
project (project \#13).

The project is built on a simple
\textit{Scrapy}-\textit{Solr}-\textit{HTML5+CSS+JS} stack. Installation
instructions, an in-depth look at the project components for scraping,
indexing, and displaying the results, and finally the user evaluation report
can all be found in the following sections.
\section{Installation instructions}

\subsection{Project repository}

The project Git repository is located at
\url{https://git.maggioni.xyz/maggicl/IRProject}.

\subsection{Solr installation}

The installation of the project and the population of the test collection
with the scraped documents are automated by a single script. The script
requires a downloaded copy of \textit{Solr} version 8.6.2 as a ZIP file,
i.e.\ the same \textit{Solr} ZIP we had to download during the lab lectures.
Should you need to download a copy of the ZIP file, you can find it here:
\url{https://maggioni.xyz/solr-8.6.2.zip}.

Clone the project's Git repository and position yourself with a shell in the
project's root directory. Then execute this command:
\begin{minted}[frame=lines,framesep=2mm]{bash}
./solr_install.sh {ZIP path}
\end{minted}
where \texttt{\{ZIP path\}} is the path of the ZIP file mentioned earlier. This
will install, start, and update \textit{Solr} with the test collection.
\subsection{UI installation}

To start the UI, open the file \texttt{ui/index.html} with your browser of
choice. To use the UI, it is necessary to bypass Cross-Origin Resource Sharing
(CORS) security checks by downloading and enabling a ``CORS Everywhere''
extension. I suggest
\href{https://addons.mozilla.org/en-US/firefox/addon/cors-everywhere/}{this
one} for Mozilla Firefox and derivatives.
\subsection{Run the website scrapers}

A prerequisite to running the Flickr crawler is a working Scrapy Splash
instance listening on \texttt{localhost:8050}. This can be achieved by
executing this Docker command, should a Docker installation be available:
\begin{minted}[frame=lines,framesep=2mm]{bash}
docker run -p 8050:8050 scrapinghub/splash
\end{minted}
To run all the website scrapers, run the script \texttt{./scrape.sh} with no
arguments.
\section{Scraping}

The three websites chosen to be scraped were \url{flickr.com}, a user-centric
image sharing service aimed at photography amateurs and professionals,
\url{123rf.com}, a stock image website, and \url{shutterstock.com}, another
stock image website.

The stock photo websites were scraped with standard scraping technology using
plain \texttt{scrapy}, while \textit{Flickr} was scraped with browser
emulation technology using \texttt{scrapy-splash} in order to execute
Javascript code and scrape infinite-scroll paginated data.
I would like to point out that, in order to save space, I scraped only image
links, and not the images themselves. Should any content that I scraped be
deleted from the services listed above, some results might be incorrect
because they refer to deleted items.

As a final note, since some websites are not so kind to bots in their
\texttt{robots.txt} file (\textit{Flickr} in particular blocks all bots
except Google), ``robots.txt compliance'' has been turned off for all
scrapers and the user agent has been changed to mimic a normal browser.

All scraper implementations and related files are located in the directory
\texttt{photo\_scraper/spiders}.
\subsection{\textit{Flickr}}

\subsubsection{Simulated browser technology: \textit{Splash}}

As mentioned before, the implementation of the \textit{Flickr} scraper uses
\textit{Splash}, a browser emulator that supports Javascript execution and
simulated user interaction. This component is essential to allow the website
to load correctly and to load as many photos as possible in the scraped photo
list pages, by emulating a user performing an ``infinite'' scroll down.

Here is the Lua script used by \textit{Splash} to emulate infinite scrolling.
These exact contents can be found in the file \texttt{infinite\_scroll.lua}.
\begin{minted}[linenos,frame=lines,framesep=2mm]{lua}
function main(splash)
  local num_scrolls = 20
  local scroll_delay = 0.8

  local scroll_to = splash:jsfunc("window.scrollTo")
  local get_body_height = splash:jsfunc(
    "function() {return document.body.scrollHeight;}"
  )
  assert(splash:go(splash.args.url))
  splash:wait(splash.args.wait)

  for _ = 1, num_scrolls do
    scroll_to(0, get_body_height())
    splash:wait(scroll_delay)
  end
  return splash:html()
end
\end{minted}
Line 13 contains the instruction that scrolls down one page height. This
instruction runs in the loop of lines 12--15, which executes the scroll
\texttt{num\_scrolls} times, waiting \texttt{scroll\_delay} seconds after
every execution.

After this operation is done, the resulting HTML markup is returned, and
normal crawling techniques can work on this intermediate result.
\subsubsection{Scraper implementation}

The Python implementation of the \textit{Flickr} scraper can be found in
\texttt{flickr.py}.

Sadly, \textit{Flickr}, other than a recently posted gallery of images,
offers no curated list or categorization of image content that would allow
finding images other than by querying for them.

I therefore had to use the \textit{Flickr} search engine to query for some
common words (including the list of the 100 most common English verbs). Then,
each search result page is fed through \textit{Splash} and the resulting
markup is searched for image links. Each link is opened to scrape the image
URL and its metadata.
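As an illustration, the query-generation step described above can be sketched
as follows. Note that the word list and the search URL pattern shown here are
assumptions for demonstration purposes and do not necessarily match the
actual contents of \texttt{flickr.py}:

\begin{minted}[frame=lines,framesep=2mm]{python}
# Illustrative sketch of the query-generation step. The word list and the
# Flickr search URL pattern are assumptions for demonstration; the real
# list and URLs live in flickr.py.
from urllib.parse import quote

COMMON_WORDS = ["be", "have", "do", "say", "get"]  # hypothetical excerpt

def search_urls(words):
    """Build one Flickr search URL per query word."""
    return ["https://www.flickr.com/search/?text=" + quote(w) for w in words]

urls = search_urls(COMMON_WORDS)
\end{minted}

Each generated URL would then be handed to \textit{Splash} for rendering
before link extraction.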
\subsection{Implementation for \textit{123rf} and \textit{Shutterstock}}

The \textit{123rf} and \textit{Shutterstock} websites do not require the use
of \textit{Splash} to be scraped and, as stock image websites, offer several
precompiled catalogs of images that can be easily scraped. The crawler
implementations, which can be found in \texttt{stock123rf.py} and
\texttt{shutterstock.py} respectively, are pretty straightforward: they
navigate from the list of categories, to each category's photo list, and then
to each individual photo page to scrape the image link and metadata.
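To illustrate the navigation logic, here is a minimal Python sketch of the
category-link extraction step using only the standard library instead of
Scrapy's selectors. The markup and the \texttt{category} class name are
hypothetical and do not reflect the actual sites' HTML:

\begin{minted}[frame=lines,framesep=2mm]{python}
# Minimal sketch of the category-link extraction step, using only the
# standard library instead of Scrapy's selectors. The markup below and the
# CSS class name are hypothetical; the real spiders parse the live sites.
from html.parser import HTMLParser

class CategoryLinkParser(HTMLParser):
    """Collect href attributes of category <a> tags."""
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            attrs = dict(attrs)
            if attrs.get("class") == "category" and "href" in attrs:
                self.links.append(attrs["href"])

html = ('<ul><li><a class="category" href="/photos/nature">Nature</a></li>'
        '<li><a class="category" href="/photos/city">City</a></li></ul>')
parser = CategoryLinkParser()
parser.feed(html)
# parser.links now holds the category URLs to follow next
\end{minted}

In the real spiders, each collected link becomes a new Scrapy request for the
category's photo list, and from there for each photo detail page.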
\section{Indexing and \textit{Solr} configuration}

Solr configuration was probably the trickiest part of this project. I am not
an expert in Solr's XML configuration quirks, and I certainly have not become
one by implementing this project. However, I managed to assemble a
configuration that has both a tailored collection schema defined as XML and a
custom Solr request handler to handle result clustering.

Configuration files for Solr can be found under the directory
\texttt{solr\_config}. This directory is symlinked by the
\texttt{solr\_install.sh} installation script to appear as a folder named
\texttt{server/solr/photo} inside the \texttt{solr} folder containing the
Solr installation. Therefore, the entire directory corresponds to the
configuration and data storage for the collection \texttt{photo}, the only
collection present in this project.

Please note that the \texttt{solr\_config/data} folder is ignored by Git and
thus not present in a freshly cloned repository: this is done to preserve
only the configuration files, and not the somewhat temporary collection data.
The collection data is uploaded, every time \texttt{solr\_install.sh} is run,
from CSV files located in the \texttt{scraped} folder and produced by Scrapy.

The configuration was derived from the \texttt{techproducts} Solr example by
changing the collection schema and removing any non-needed request handler.
\subsection{Solr schema}

As some minor edits were made using Solr's web interface, the relevant XML
schema to analyse is the file \texttt{solr\_config/conf/managed-schema},
which also stores the edits done through the UI. An extract of the relevant
lines is shown below:
\begin{minted}[linenos,frame=lines,framesep=2mm]{xml}
<?xml version="1.0" encoding="UTF-8"?>
<!-- Solr managed schema - automatically generated - DO NOT EDIT -->
<schema name="example" version="1.6">
  <uniqueKey>id</uniqueKey>

  <!-- Omitted field type definitions and default fields added by Solr -->

  <field name="id" type="string" multiValued="false" indexed="true"
    required="true" stored="true"/>
  <field name="date" type="text_general" indexed="true" stored="true"/>
  <field name="img_url" type="text_general" indexed="true" stored="true"/>
  <field name="t_author" type="text_general" indexed="true" stored="true"/>
  <field name="t_description" type="text_general" indexed="true"
    stored="true"/>
  <field name="t_title" type="text_general" indexed="true" stored="true"/>
  <field name="tags" type="text_general" indexed="true" stored="true"/>

  <field name="text" type="text_general" uninvertible="true" multiValued="true"
    indexed="true" stored="true"/>

  <!-- Omitted unused default dynamicField fields added by Solr -->

  <copyField source="t_*" dest="text"/>
</schema>
\end{minted}
All fields have type \texttt{text\_general}. Fields with a name starting with
``\texttt{t\_}'' are included in the \texttt{text} copy field, which is used
as the default field for document similarity when searching and clustering.

The \texttt{id} field is of type \texttt{string}, but in actuality it is
always a positive integer. This field's values do not come from data scraped
from the websites: each value is computed as an auto-incremented progressive
identifier when uploading the collection to Solr using
\texttt{solr\_install.sh}. Shown below is the \texttt{awk}-based piped
command, included in the installation script, that performs this task and
uploads the collection.
\begin{minted}[linenos,frame=lines,framesep=2mm]{bash}
# at this point in the script, `pwd` is the repository root directory

cd scraped

# POST scraped data
tail -q -n +2 photos.csv 123rf.csv shutterstock.csv | \
  awk "{print NR-1 ',' \$0}" | \
  awk 'BEGIN {print "id,t_author,t_title,t_description,date,img_url,tags"}
       {print}' | \
  ../solr/bin/post -c photo -type text/csv -out yes -d
\end{minted}
Line 6 strips the heading line of each listed CSV file and concatenates them;
line 7 adds ``\{id\},'' at the beginning of each line, where \{id\}
corresponds to the line number. Lines 8 and 9 then add the correct CSV
heading, including the ``id'' field. Line 10 reads the processed data and
posts it to Solr.
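For clarity, the same id-prepending logic can be expressed in pure Python.
This is an illustrative re-implementation of the pipeline above, operating on
in-memory lines rather than files:

\begin{minted}[frame=lines,framesep=2mm]{python}
# A pure-Python sketch equivalent to the awk pipeline: strip each CSV
# header, concatenate, and prepend an auto-incremented id column.
def merge_with_ids(csv_files):
    """csv_files: list of lists of lines (header first). Returns merged CSV."""
    out = ["id,t_author,t_title,t_description,date,img_url,tags"]
    row_id = 0
    for lines in csv_files:
        for line in lines[1:]:              # skip per-file header (tail -n +2)
            out.append(f"{row_id},{line}")  # prepend the id (awk NR-1 step)
            row_id += 1
    return "\n".join(out)

# Hypothetical two-file example with one data row each:
photos = ["author,title,desc,date,url,tags", "a,t,d,2020,u,x"]
stock = ["author,title,desc,date,url,tags", "b,t2,d2,2020,u2,y"]
csv_text = merge_with_ids([photos, stock])
\end{minted}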
\subsection{Clustering configuration}

Clustering configuration was performed by taking the \texttt{solrconfig.xml}
file from the \texttt{techproducts} Solr example and adapting it to the
``photo'' collection schema.

Here is the XML configuration relevant to the clustering request handler. It
can be found at approximately line 900 of the \texttt{solrconfig.xml} file:
\begin{minted}[linenos,frame=lines,framesep=2mm]{xml}
<requestHandler name="/clustering"
                startup="lazy"
                enable="true"
                class="solr.SearchHandler">
  <lst name="defaults">
    <bool name="clustering">true</bool>
    <bool name="clustering.results">true</bool>
    <!-- Field name with the logical "title" of a each document (optional) -->
    <str name="carrot.title">t_title</str>
    <!-- Field name with the logical "URL" of a each document (optional) -->
    <str name="carrot.url">img_url</str>
    <!-- Field name with the logical "content" of a each document (optional) -->
    <str name="carrot.snippet">t_description</str>
    <!-- Apply highlighter to the title/ content and use this for clustering. -->
    <bool name="carrot.produceSummary">true</bool>
    <!-- the maximum number of labels per cluster -->
    <!--<int name="carrot.numDescriptions">5</int>-->
    <!-- produce sub clusters -->
    <bool name="carrot.outputSubClusters">false</bool>

    <!-- Configure the remaining request handler parameters. -->
    <str name="defType">edismax</str>
    <str name="df">text</str>
    <str name="q.alt">*:*</str>
    <str name="rows">100</str>
    <str name="fl">*,score</str>
  </lst>
  <arr name="last-components">
    <str>clustering</str>
  </arr>
</requestHandler>
\end{minted}
This clustering request handler uses Carrot2 technology to perform
``shallow'' one-level clustering (line 19 disables sub-clusters).
\texttt{t\_title} is used as the ``title'' field for each document,
\texttt{img\_url} as the ``document location'' field, and
\texttt{t\_description} as the ``description'' field (see respectively lines
9, 11, and 13 of the configuration).

This request handler replaces the normal \texttt{/select} handler, and thus a
single request generates both search results and clustering data. The
defaults for search are a 100-result limit and the use of the \texttt{t\_*}
fields to match documents (lines 25 and 23 -- remember the definition of the
\texttt{text} copy field).
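As an example of how a client such as the UI interacts with this request
handler, here is a hypothetical Python sketch. It builds the request URL with
the standard library and reads cluster labels from a hand-written response
excerpt shaped like Solr's clustering output; it is not captured from the
live system:

\begin{minted}[frame=lines,framesep=2mm]{python}
# Hedged sketch of a client query against the /clustering handler. The host
# and port are Solr defaults; the response excerpt below is a hand-written
# example of the clustering output shape, not real data.
from urllib.parse import urlencode

def clustering_url(query, base="http://localhost:8983/solr/photo"):
    """Build the request URL a client would send for a user query."""
    return base + "/clustering?" + urlencode({"q": query, "wt": "json"})

# Assumed response excerpt: alongside the usual docs, Solr returns a
# "clusters" array whose entries carry labels and grouped document ids.
response = {
    "clusters": [
        {"labels": ["Lake"], "docs": ["3", "7"]},
        {"labels": ["Mountains"], "docs": ["1"]},
    ]
}
labels = [c["labels"][0] for c in response["clusters"]]
\end{minted}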
\section{User interface}

Figure \ref{fig:ui} illustrates the IR system's UI and its features.
\begin{figure}[H]
  \begin{subfigure}{\textwidth}
    \centering
    \includegraphics[width=0.8\textwidth]{ui_start.png}
    \caption{The UI, when opened, prompts the user to insert a query in the
      input field and press Enter. Here the user typed ``Lugano''.}
    \vspace{0.5cm}
  \end{subfigure}
  \begin{subfigure}{0.45\textwidth}
    \centering
    \includegraphics[width=\textwidth]{ui.png}
    \caption{After the user inputs a query and presses Enter, the resulting
      images are shown on the right. Found clusters are shown on the left
      using FoamTree.}
  \end{subfigure}%
  \hspace{0.1\textwidth}%
  \begin{subfigure}{0.45\textwidth}
    \centering
    \includegraphics[width=\textwidth]{ui_cl.png}
    \caption{When the user clicks a cluster, results are filtered depending
      on the cluster clicked. If the user clicks the cluster again, the
      filter is removed.}
  \end{subfigure}
  \caption{The UI and its various states.}
  \label{fig:ui}
\end{figure}
The UI has been implemented using HTML5, vanilla CSS, and vanilla JS, with
the exception of the \textit{FoamTree} library from the Carrot2 project,
which handles the clustering bar displayed to the left of the search results.

Event handlers offered by \textit{FoamTree} allowed for the implementation of
the result-filtering feature triggered when a cluster is clicked.

This is a single-page application, i.e.\ all updates to the UI happen without
refreshing the page. This was achieved by using AJAX requests to interact
with Solr.

All UI files can be found under the \texttt{ui} directory in the repository
root directory. In order to run the UI, a ``CORS Everywhere'' extension must
be installed in the browser. See the installation instructions for details.

\section{User evaluation}

\end{document}
|