% vim: set ts=2 sw=2 et tw=80:

\documentclass{scrartcl}

\usepackage{hyperref}
\usepackage{parskip}
\usepackage{minted}
\usepackage[utf8]{inputenc}
\usepackage[margin=2.5cm]{geometry}

\setlength{\parindent}{0pt}

\title{\textit{Image Search IR System} \\\vspace{0.3cm}
\Large{WS2020-21 Information Retrieval Project}}
\author{Claudio Maggioni}

\begin{document}

\maketitle
\tableofcontents
\newpage
\section{Introduction}

This report is a summary of the work I have done to create the ``Image Search
IR system'', a proof-of-concept implementation of the ``Image Search Engine''
project (project \#13).

The project is built on a simple
\textit{Scrapy}--\textit{Solr}--\textit{HTML5+CSS+JS} stack. Installation
instructions, an in-depth look at the project components for scraping,
indexing, and displaying the results, and finally the user evaluation report
can all be found in the following sections.
\section{Installation instructions}

\subsection{Project repository}

The project Git repository is located here:
\url{https://git.maggioni.xyz/maggicl/IRProject}.
\subsection{Solr installation}

The installation of the project and the population of the test collection with
the scraped documents are automated by a single script. The script requires you
to have downloaded \textit{Solr} version 8.6.2 as a ZIP file, i.e.\ the same
\textit{Solr} ZIP we had to download during the lab lectures. Should you need
to download a copy of the ZIP file, you can find it here:
\url{https://maggioni.xyz/solr-8.6.2.zip}.

Clone the project's Git repository and position yourself with a shell in the
project's root directory. Then execute this command:

\begin{minted}[frame=lines,framesep=2mm]{bash}
./solr_install.sh {ZIP path}
\end{minted}

where \texttt{\{ZIP path\}} is the path of the ZIP file mentioned earlier. This
will install and start \textit{Solr} and populate it with the test collection.
\subsection{UI installation}

In order to start the UI, open the file \texttt{ui/index.html} with your
browser of choice. In order to use the UI, it is necessary to bypass
Cross-Origin Resource Sharing (CORS) security checks by downloading and
enabling a ``CORS everywhere'' extension. I suggest
\href{https://addons.mozilla.org/en-US/firefox/addon/cors-everywhere/}{this
one} for Mozilla Firefox and derivatives.
\subsection{Run the website scrapers}

A prerequisite to run the Flickr crawler is to have a working \textit{Splash}
instance listening on \texttt{localhost:8050}. This can be achieved by
executing the following Docker command, should a Docker installation be
available:

\begin{minted}[frame=lines,framesep=2mm]{bash}
docker run -p 8050:8050 scrapinghub/splash
\end{minted}

In order to run all the website scrapers, run the script \texttt{./scrape.sh}
with no arguments.
\section{Scraping}

The three websites chosen to be scraped were \url{flickr.com}, a user-centric
image sharing service aimed at photography amateurs and professionals,
\url{123rf.com}, a stock image website, and \url{shutterstock.com}, another
stock image website.

The stock photo websites were scraped with standard scraping technology using
plain \texttt{scrapy}, while \textit{Flickr} was scraped with browser emulation
through \texttt{scrapy-splash} in order to execute JavaScript code and scrape
infinite-scroll paginated data.

I would like to point out that, in order to save space, I scraped only image
links and not the images themselves. Should any of the scraped content be
deleted from the services listed above, some results may no longer point to an
existing image.

As a final note, since some websites are not so kind to bots in their
\texttt{robots.txt} file (\textit{Flickr} in particular blocks all bots except
Google's), \texttt{robots.txt} compliance has been turned off for all scrapers
and the user agent has been changed to mimic a normal browser.
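
In Scrapy these two tweaks amount to two settings. The fragment below is a
minimal sketch of what such a configuration can look like; the exact values
used in the project's \texttt{settings.py} may differ, and the user-agent
string shown here is only an example.

\begin{minted}[frame=lines,framesep=2mm]{python}
# Sketch of the relevant Scrapy settings (actual project values may differ).

# Ignore robots.txt, since some target sites disallow all non-Google bots.
ROBOTSTXT_OBEY = False

# Present the crawler as a regular desktop browser instead of "Scrapy/x.y".
USER_AGENT = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/87.0 Safari/537.36"
)
\end{minted}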

All scraper implementations and related files are located in the directory
\texttt{photo\_scraper/spiders}.
\subsection{\textit{Flickr}}

\subsubsection{Simulated browser technology: \textit{Splash}}

As mentioned before, the implementation of the \textit{Flickr} scraper uses
\textit{Splash}, a browser emulator that supports JavaScript execution and
simulated user interaction. This component is essential to allow the website to
load correctly and to load as many photos as possible in the scraped photo list
pages, by emulating a user performing an ``infinite'' scroll down.

Here is the Lua script used by \textit{Splash} to emulate infinite scrolling.
Its exact contents can be found in the file \texttt{infinite\_scroll.lua}.

\begin{minted}[linenos,frame=lines,framesep=2mm]{lua}
function main(splash)
  local num_scrolls = 20
  local scroll_delay = 0.8

  local scroll_to = splash:jsfunc("window.scrollTo")
  local get_body_height = splash:jsfunc(
    "function() {return document.body.scrollHeight;}"
  )
  assert(splash:go(splash.args.url))
  splash:wait(splash.args.wait)

  for _ = 1, num_scrolls do
    scroll_to(0, get_body_height())
    splash:wait(scroll_delay)
  end
  return splash:html()
end
\end{minted}

Line 13 contains the instruction that scrolls down one page height. This
instruction runs in the loop of lines 12--15, which executes the scroll
\texttt{num\_scrolls} times, waiting \texttt{scroll\_delay} seconds after every
execution.

After this operation is done, the resulting HTML markup is returned and normal
crawling techniques can work on this intermediate result.
\subsubsection{Scraper implementation}

The Python implementation of the \textit{Flickr} scraper can be found in
\texttt{flickr.py}.

Sadly \textit{Flickr}, other than a recently posted gallery of images, offers
no curated list or categorization of image content that would allow images to
be found other than by querying for them.

I therefore had to use the \textit{Flickr} search engine to query for some
common words (including the list of the 100 most common English verbs). Each
search result page is then fed through \textit{Splash} and the resulting markup
is searched for image links. Each link is opened to scrape the image URL and
its metadata.
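
To give an idea of how these pieces fit together, here is a heavily simplified
sketch of such a spider. The class name, CSS selectors, item fields, and query
list shown here are illustrative only and do not necessarily match the actual
code in \texttt{flickr.py}.

\begin{minted}[frame=lines,framesep=2mm]{python}
import scrapy
from scrapy_splash import SplashRequest

# Lua source of the infinite-scroll script shown earlier (the path is assumed
# to be relative to the working directory).
with open("infinite_scroll.lua") as f:
    INFINITE_SCROLL = f.read()

# Illustrative query list; the real spider uses roughly the 100 most common
# English verbs.
QUERIES = ["run", "make", "take"]


class FlickrSketchSpider(scrapy.Spider):
    """Illustrative sketch, not the actual flickr.py implementation."""

    name = "flickr_sketch"

    def start_requests(self):
        for query in QUERIES:
            url = f"https://www.flickr.com/search/?text={query}"
            # Render the search page (and its infinite scroll) through Splash.
            yield SplashRequest(
                url,
                self.parse_search,
                endpoint="execute",
                args={"lua_source": INFINITE_SCROLL, "wait": 2},
            )

    def parse_search(self, response):
        # Hypothetical selector: follow every photo link in the rendered page.
        for href in response.css("a.photo-link::attr(href)").getall():
            yield response.follow(href, self.parse_photo)

    def parse_photo(self, response):
        # Hypothetical selectors for the image URL and its metadata.
        yield {
            "img_url": response.css("img.main-photo::attr(src)").get(),
            "title": response.css("h1::text").get(),
            "author": response.css(".owner-name::text").get(),
        }
\end{minted}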
\subsection{Implementation for \textit{123rf} and \textit{Shutterstock}}

The \textit{123rf} and \textit{Shutterstock} websites do not require the use of
\textit{Splash} to be scraped and, as stock image websites, offer several
precompiled catalogs of images that can be easily scraped. The crawler
implementations, which can be found in \texttt{stock123rf.py} and
\texttt{shutterstock.py} respectively, are pretty straightforward: they
navigate from the list of categories to each category's photo list, and then to
each individual photo page to scrape the image link and metadata. A sketch of
this navigation pattern is shown below.
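
The following is a minimal sketch of that category, photo list, and photo page
navigation with plain Scrapy. The start URL, class name, and CSS selectors are
hypothetical and do not correspond to the actual code in \texttt{stock123rf.py}
or \texttt{shutterstock.py}.

\begin{minted}[frame=lines,framesep=2mm]{python}
import scrapy


class StockSketchSpider(scrapy.Spider):
    """Illustrative sketch of the category -> photo list -> photo navigation."""

    name = "stock_sketch"
    # Hypothetical entry point listing all image categories.
    start_urls = ["https://www.example-stock-site.com/categories/"]

    def parse(self, response):
        # Follow each category to its photo list page (selector is hypothetical).
        for href in response.css("a.category::attr(href)").getall():
            yield response.follow(href, self.parse_category)

    def parse_category(self, response):
        # Follow each photo in the listing to its detail page.
        for href in response.css("a.photo::attr(href)").getall():
            yield response.follow(href, self.parse_photo)
        # Follow classic "next page" pagination, when present.
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, self.parse_category)

    def parse_photo(self, response):
        # Scrape the image link and its metadata (fields are illustrative).
        yield {
            "img_url": response.css("img.detail::attr(src)").get(),
            "title": response.css("h1::text").get(),
            "keywords": response.css(".keywords a::text").getall(),
        }
\end{minted}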
\section{Indexing and \textit{Solr} configuration}

Solr configuration was probably the trickiest part of this project. I am not an
expert of Solr XML configuration quirks, and I certainly have not become one by
implementing this project. However, I managed to assemble a configuration that
has both a tailored collection schema defined as XML and a custom Solr
controller to handle result clustering.

Configuration files for Solr can be found under the directory
\texttt{solr\_config}; this directory is symlinked by the
\texttt{solr\_install.sh} installation script to appear as a folder named
\texttt{server/solr/photo} inside the \texttt{solr} folder containing the Solr
installation. Therefore, the entire directory corresponds to the configuration
and data storage for the collection \texttt{photo}, the only collection present
in this project.

Please note that the \texttt{solr\_config/data} folder is ignored by Git and
thus not present in a freshly cloned repository: this is done to preserve only
the configuration files, and not the somewhat temporary collection data. The
collection data is uploaded, every time \texttt{solr\_install.sh} is run, from
the CSV files located in the \texttt{scraped} folder and produced by Scrapy.
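
As a rough illustration of what this upload amounts to, the snippet below posts
one scraped CSV file to Solr's CSV update handler. It assumes that Solr listens
on the default port 8983, that the collection is named \texttt{photo}, and that
a file such as \texttt{scraped/flickr.csv} exists; \texttt{solr\_install.sh}
itself may use a different mechanism (e.g.\ the \texttt{bin/post} tool).

\begin{minted}[frame=lines,framesep=2mm]{python}
import requests

# Assumed values: default Solr port, collection "photo", one scraped CSV file.
SOLR_UPDATE_URL = "http://localhost:8983/solr/photo/update"
CSV_FILE = "scraped/flickr.csv"  # hypothetical file name

with open(CSV_FILE, "rb") as f:
    # The CSV update handler parses the header row as field names;
    # commit=true makes the new documents immediately searchable.
    response = requests.post(
        SOLR_UPDATE_URL,
        params={"commit": "true"},
        data=f,
        headers={"Content-Type": "text/csv"},
    )

response.raise_for_status()
print(response.json())
\end{minted}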
\section{User interface}
\section{User evaluation}
\end{document}