% vim: set ts=2 sw=2 et tw=80:
\documentclass{scrartcl}

\usepackage{hyperref}
\usepackage{parskip}
\usepackage{minted}
\usepackage[utf8]{inputenc}
\usepackage[margin=2.5cm]{geometry}

\setlength{\parindent}{0pt}

\title{\textit{Image Search IR System} \\\vspace{0.3cm}
\Large{WS2020-21 Information Retrieval Project}}
\author{Claudio Maggioni}

\begin{document}
\maketitle
\tableofcontents
\newpage

\section{Introduction}
This report summarizes the work I have done to create the ``Image Search IR
system'', a proof-of-concept IR system implementing the ``Image Search Engine''
project (project \#13).

The project is built on a simple
\textit{Scrapy}-\textit{Solr}-\textit{HTML5+CSS+JS} stack. Installation
instructions, an in-depth look at the project components for scraping,
indexing, and displaying the results, and finally the user evaluation report
can all be found in the following sections.
\section{Installation instructions}
\subsection{Project repository}
The project's Git repository is located at
\url{https://git.maggioni.xyz/maggicl/IRProject}.
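
To obtain a local copy, clone the repository (a minimal sketch; the checkout
directory name \texttt{IRProject} is an assumption derived from the repository
name):

\begin{minted}[frame=lines,framesep=2mm]{bash}
# clone the project repository and enter its root directory
git clone https://git.maggioni.xyz/maggicl/IRProject.git
cd IRProject
\end{minted}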
\subsection{Solr installation}
The installation of the project and the population of the test collection with
the scraped documents are automated by a single script. The script requires
that you have downloaded \textit{Solr} version 8.6.2 as a ZIP file, i.e.\ the
same \textit{Solr} ZIP we had to download during lab lectures. Should you need
to download a copy of the ZIP file, you can find it at
\url{https://maggioni.xyz/solr-8.6.2.zip}.

Clone the project's Git repository and position yourself with a shell in the
project's root directory. Then execute this command:

\begin{minted}[frame=lines,framesep=2mm]{bash}
./solr_install.sh {ZIP path}
\end{minted}

where \texttt{\{ZIP path\}} is the path of the ZIP file mentioned earlier. This
will install \textit{Solr}, start it, and populate it with the test collection.
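
For example, assuming the ZIP file was saved in your downloads folder (the path
below is purely illustrative):

\begin{minted}[frame=lines,framesep=2mm]{bash}
# install, start, and populate Solr from the downloaded ZIP
./solr_install.sh ~/Downloads/solr-8.6.2.zip
\end{minted}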
\subsection{UI installation}
To start the UI, open the file \texttt{ui/index.html} with your browser of
choice. To use the UI, it is necessary to bypass Cross-Origin Resource Sharing
(CORS) security checks by downloading and enabling a ``CORS Everywhere''
extension. I suggest
\href{https://addons.mozilla.org/en-US/firefox/addon/cors-everywhere/}{this
one} for Mozilla Firefox and derivatives.
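
The workaround is needed because the static UI page queries the local
\textit{Solr} instance directly over HTTP. As a sketch, its requests are of the
same kind as this \texttt{curl} query against \textit{Solr}'s standard
\texttt{select} endpoint (the collection name \texttt{images} and the query
term are illustrative assumptions, not necessarily the names the project uses):

\begin{minted}[frame=lines,framesep=2mm]{bash}
# query the local Solr instance as the browser-based UI does;
# "images" stands in for the actual collection name
curl 'http://localhost:8983/solr/images/select?q=sunset&wt=json'
\end{minted}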
\subsection{Run the website scrapers}
A prerequisite for running the Flickr crawler is a working Scrapy Splash
instance listening on \texttt{localhost:8050}. Should a Docker installation be
available, this can be achieved by executing this command:

\begin{minted}[frame=lines,framesep=2mm]{bash}
docker run -p 8050:8050 scrapinghub/splash
\end{minted}
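
To verify that Splash is up before starting a crawl, you can ask it to render a
page through its HTTP API (a quick sanity check, not part of the project's
scripts):

\begin{minted}[frame=lines,framesep=2mm]{bash}
# an HTML response confirms Splash is listening on port 8050
curl 'http://localhost:8050/render.html?url=https://www.flickr.com/'
\end{minted}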

To run all the website scrapers, execute the script \texttt{./scrape.sh} with
no arguments.
\section{Scraping}
The three websites chosen for scraping were \url{flickr.com}, a user-centric
image sharing service aimed at photography amateurs and professionals,
\url{123rf.com}, a stock image website, and \url{shutterstock.com}, another
stock image website.

The stock photo websites were scraped with plain \texttt{scrapy}, while
\textit{Flickr} was scraped with \texttt{scrapy-splash}, which adds browser
emulation in order to execute Javascript code and scrape infinite-scroll
paginated data.
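
For reference, \texttt{scrapy} projects expose each scraper through the
\texttt{scrapy} CLI (a sketch; the spider name \texttt{flickr} is an
illustrative assumption, and the project's \texttt{./scrape.sh} wraps the
actual invocations):

\begin{minted}[frame=lines,framesep=2mm]{bash}
# run a single spider and export the scraped items as JSON
scrapy crawl flickr -o flickr_items.json
\end{minted}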
\subsection{\textit{Flickr} and the simulated browser technology
\textit{Splash}}
As mentioned before, the implementation of the \textit{Flickr} scraper uses
|
|
|
|
\textit{Splash}, a browser emulation that supports Javascript execution and
|
|
|
|
simulated user interaction. This component is essential to allow for the website
|
|
|
|
to load correctly and to load as many photos as possible in the photo list
|
|
|
|
pagest scraped through emulation of the user performing an ``infinite'' scroll
|
|
|
|
down.
|
|
|
|
|
|
|
|
Here is the Lua script used by splash to emulate infinite scrolling. These exact
|
|
|
|
contents can be found in file
|
|
|
|
\texttt{photo\_scraper/spiders/infinite\_scroll.lua}.
|
|
|
|
|
|
|
|
\begin{minted}[linenos,frame=lines,framesep=2mm]{lua}
function main(splash)
  local num_scrolls = 20
  local scroll_delay = 0.8

  local scroll_to = splash:jsfunc("window.scrollTo")
  local get_body_height = splash:jsfunc(
    "function() {return document.body.scrollHeight;}"
  )
  assert(splash:go(splash.args.url))
  splash:wait(splash.args.wait)

  for _ = 1, num_scrolls do
    scroll_to(0, get_body_height())
    splash:wait(scroll_delay)
  end
  return splash:html()
end
\end{minted}

Line 13 contains the instruction that scrolls the page down to the current
bottom of the document body. This instruction runs in the loop of lines 12--15,
which executes the scroll \texttt{num\_scrolls} times, waiting
\texttt{scroll\_delay} seconds after every execution.

After this operation is done, the resulting HTML markup is returned, and normal
crawling techniques can work on this intermediate result.
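
Should you want to experiment with the script on its own, \textit{Splash}'s
\texttt{/execute} HTTP endpoint can run it directly (a sketch for manual
testing, not part of the project's scripts; it assumes the Splash instance from
the installation section is running and uses an illustrative Flickr search
URL):

\begin{minted}[frame=lines,framesep=2mm]{bash}
# run the infinite-scroll script through Splash's /execute endpoint;
# extra form fields ("url", "wait") are exposed as splash.args
curl 'http://localhost:8050/execute' \
  --data-urlencode 'lua_source@photo_scraper/spiders/infinite_scroll.lua' \
  --data-urlencode 'url=https://www.flickr.com/search/?text=landscape' \
  --data-urlencode 'wait=2'
\end{minted}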
\section{Indexing and \textit{Solr} configuration}
\section{User interface}
\section{User evaluation}
\end{document}