% vim: set ts=2 sw=2 et tw=80:
\documentclass{scrartcl}
\usepackage{hyperref}
\usepackage{parskip}
\usepackage{minted}
\usepackage[utf8]{inputenc}

\setlength{\parindent}{0pt}

\usepackage[margin=2.5cm]{geometry}

\title{\textit{Image Search IR System} \\\vspace{0.3cm}
\Large{WS2020-21 Information Retrieval Project}}
\author{Claudio Maggioni}
\begin{document}
\maketitle
\tableofcontents
\newpage
\section{Introduction}
This report summarizes the work done to create the ``Image Search IR
System'', a proof-of-concept implementation of the ``Image Search Engine''
project (project \#13).

The project is built on a simple
\textit{Scrapy}--\textit{Solr}--\textit{HTML5+CSS+JS} stack. The following
sections cover the installation instructions, an in-depth look at the project
components for scraping, indexing, and displaying the results, and finally
the user evaluation report.
\section{Installation instructions}
\subsection{Project repository}
The project Git repository is located here:
\url{https://git.maggioni.xyz/maggicl/IRProject}.
\subsection{Solr installation}
The installation of the project and the population of the test collection
with the scraped documents are automated by a single script. The script
requires you to have downloaded \textit{Solr} version 8.6.2 as a ZIP file,
i.e.\ the same \textit{Solr} ZIP we had to download during the lab lectures.
Should you need to download a copy of the ZIP file, you can find it here (on
USI's OneDrive hosting): \url{http://to-do.com/file}.

Clone the project's Git repository and position yourself with a shell in the
project's root directory. Then execute this command:

\begin{minted}[frame=lines,framesep=2mm]{bash}
./solr_install.sh <ZIP path>
\end{minted}

where \texttt{<ZIP path>} is the path of the ZIP file mentioned earlier. This
will install, start, and update \textit{Solr} with the test collection.
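As a small optional sanity check before invoking the script, one can verify
that the supplied archive really looks like a \textit{Solr} 8.6.2
distribution. The sketch below is illustrative and not part of the project;
the expected top-level \texttt{solr-8.6.2/} folder is an assumption based on
the layout of standard \textit{Solr} release ZIPs.

\begin{minted}[frame=lines,framesep=2mm]{python}
# Hedged sketch (not part of the project): check that a ZIP file looks
# like a Solr 8.6.2 distribution before passing it to solr_install.sh.
# The top-level "solr-8.6.2/" folder is an assumption based on the
# layout of standard Solr release archives.
import zipfile


def looks_like_solr_zip(path_or_file, version="8.6.2"):
    """Return True if every archive entry sits under solr-<version>/."""
    with zipfile.ZipFile(path_or_file) as zf:
        names = zf.namelist()
    return bool(names) and all(
        n.startswith(f"solr-{version}/") for n in names
    )
\end{minted}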
\subsection{UI installation}
In order to start the UI, open the file \texttt{ui/index.html} with your
browser of choice. In order to use the UI, it is necessary to bypass
Cross-Origin Resource Sharing (CORS) security checks by downloading and
enabling a ``CORS everywhere'' extension. I suggest
\href{https://addons.mozilla.org/en-US/firefox/addon/cors-everywhere/}{this
one} for Mozilla Firefox and derivatives.
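CORS comes into play because the static page queries \textit{Solr}'s HTTP
API directly from the browser. The following sketch shows the kind of select
URL such a page would request; the port (8983 is \textit{Solr}'s default),
the core name \texttt{images}, and the parameters are illustrative
assumptions, not taken from the project.

\begin{minted}[frame=lines,framesep=2mm]{python}
# Hedged sketch (illustrative only): build the kind of URL a static page
# would send to Solr's select handler. Port 8983 is Solr's default; the
# core name "images" and the parameters are assumptions for illustration.
from urllib.parse import urlencode


def solr_select_url(query, core="images", rows=10):
    """Build a Solr /select URL returning JSON results for `query`."""
    params = urlencode({"q": query, "rows": rows, "wt": "json"})
    return f"http://localhost:8983/solr/{core}/select?{params}"
\end{minted}

Because the page is served from a \texttt{file://} (or otherwise different)
origin, the browser blocks such cross-origin responses unless CORS headers
are present, hence the need for the extension.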
\subsection{Run the website scrapers}
A prerequisite for running the Flickr crawler is a working Scrapy Splash
instance listening on \texttt{localhost:8050}. Should a Docker installation
be available, this can be achieved by executing the following command:

\begin{minted}[frame=lines,framesep=2mm]{bash}
docker run -p 8050:8050 scrapinghub/splash
\end{minted}

In order to run all the website scrapers, run the script
\texttt{./scrape.sh} with no arguments.
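Before launching the scrapers, one may want to confirm that the Splash
instance is actually reachable. The sketch below is not part of the project:
it simply checks whether any HTTP server answers on the given address, under
the assumption (stated above) that Splash listens on
\texttt{localhost:8050}.

\begin{minted}[frame=lines,framesep=2mm]{python}
# Hedged sketch (not part of the project): check whether an HTTP server
# answers at the Splash address before starting the scrapers. Any HTTP
# response, even an error status, proves something is listening; a
# connection failure does not.
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def splash_is_up(base_url="http://localhost:8050", timeout=2):
    """Return True if an HTTP server answers at base_url."""
    try:
        with urlopen(base_url, timeout=timeout):
            return True
    except HTTPError:
        return True  # the server answered, just with an error status
    except (URLError, OSError):
        return False
\end{minted}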
\section{Scraping}

\section{Indexing and \textit{Solr} configuration}

\section{User interface}

\section{User evaluation}

\end{document}