This repository has been archived on 2020-12-10. You can view files and clone it, but cannot push or open issues or pull requests.
IRProject/report/report.tex
2020-12-07 18:45:46 +01:00

85 lines
2.9 KiB
TeX

% vim: set ts=2 sw=2 et tw=80:
\documentclass{scrartcl}
\usepackage{hyperref}
\usepackage{parskip}
\usepackage{minted}
\usepackage[utf8]{inputenc}
\setlength{\parindent}{0pt}
\usepackage[margin=2.5cm]{geometry}
\title{\textit{Image Search IR System} \\\vspace{0.3cm}
\Large{WS2020-21 Information Retrieval Project}}
\author{Claudio Maggioni}
\begin{document}
\maketitle
\tableofcontents
\newpage
\section{Introduction}
This report is a summary of the work I have done to create the ``Image Search IR
system'', a proof-of-concept IR system implementation implementing the ``Image
Search Engine'' project (project \#13).
The project is built on a simple
\textit{Scrapy}-\textit{Solr}-\textit{HTML5+CSS+JS} stack. Installation
instructions, an in-depth look to the project components for scraping, indexing,
and displaying the results, and finally the user evaluation report, can all be
found in the following sections.
\section{Installation instructions}
\subsection{Project repository}
The project Git repository is located here:
\url{https://git.maggioni.xyz/maggicl/IRProject}.
\subsection{Solr installation}
The installation of the project and population of the test collection with the
scraped documents is automated by a single script. The script requires you have
downloaded \textit{Solr} version 8.6.2. as a ZIP file, i.e.\ the same
\textit{Solr} ZIP we had to download during lab lectures. Should you need to
download a copy of the ZIP file, you can find it here (on USI's onedrive
hosting): \url{http://to-do.com/file}.
Clone the project's git repository and position yourself with a shell on the
project's root directory. Then execute this command:
% linenos
\begin{minted}[frame=lines,framesep=2mm]{bash}
./solr_install.sh {ZIP path}
\end{minted}
where \texttt{<ZIP path>} is the path of the ZIP file mentioned earlier. This
will install, start, and update \textit{Solr} with the test collection.
\subsection{UI installation}
In order to start the UI, open with your browser of choice the file
\texttt{ui/index.html}. In order to use the UI, it is necessary to bypass
\texttt{Cross Origin Resource Sharing} security checks by downloading and
enabling a ``CORS everywhere'' extension. I suggest
\href{https://addons.mozilla.org/en-US/firefox/addon/cors-everywhere/}{this one} for
Mozilla Firefox and derivatives.
\subsection{Run the website scrapers}
A prerequisite to run the Flickr crawler is to have a working Scrapy Splash
instance listening on port \texttt{localhost:8050}. This can be achieved by
executing this Docker command, should a Docker installation be available:
\begin{minted}[frame=lines,framesep=2mm]{bash}
docker run -p 8050:8050 scrapinghub/scrapy
\end{minted}
In order to all the website scrapers, run the script \texttt{./scrape.sh} with
no arguments.
\section{Scraping}
\section{Indexing and \textit{Solr} configuration}
\section{User interface}
\section{User evaluation}
\end{document}