% vim: set ts=2 sw=2 et tw=80:
\documentclass{scrartcl}

\usepackage{hyperref}
\usepackage{parskip}
\usepackage{minted}
\usepackage[utf8]{inputenc}
\usepackage[margin=2.5cm]{geometry}

\setlength{\parindent}{0pt}

\title{\textit{Image Search IR System} \\\vspace{0.3cm}
\Large{WS2020-21 Information Retrieval Project}}
\author{Claudio Maggioni}

\begin{document}
\maketitle
\tableofcontents
\newpage

\section{Introduction}
This report summarizes the work I have done to create the ``Image Search IR
system'', a proof-of-concept IR system implementing the ``Image Search Engine''
project (project \#13).

The project is built on a simple
\textit{Scrapy}-\textit{Solr}-\textit{HTML5+CSS+JS} stack. Installation
instructions, an in-depth look at the project components for scraping,
indexing, and displaying the results, and finally the user evaluation report
can all be found in the following sections.
\section{Installation instructions}
\subsection{Project repository}
The project's Git repository is located at
\url{https://git.maggioni.xyz/maggicl/IRProject}.
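
To obtain a local copy, clone the repository (a minimal sketch; the checkout
directory name \texttt{IRProject} is an assumption derived from the repository
name):

\begin{minted}[frame=lines,framesep=2mm]{bash}
# clone the project repository and enter its root directory
git clone https://git.maggioni.xyz/maggicl/IRProject.git
cd IRProject
\end{minted}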
\subsection{Solr installation}
The installation of the project and the population of the test collection with
the scraped documents are automated by a single script. The script requires
that you have downloaded \textit{Solr} version 8.6.2 as a ZIP file, i.e.\ the
same \textit{Solr} ZIP we had to download during lab lectures. Should you need
to download a copy of the ZIP file, you can find it at
\url{https://maggioni.xyz/solr-8.6.2.zip}.

Clone the project's Git repository and position yourself with a shell in the
project's root directory. Then execute this command:

\begin{minted}[frame=lines,framesep=2mm]{bash}
./solr_install.sh {ZIP path}
\end{minted}

where \texttt{\{ZIP path\}} is the path of the ZIP file mentioned earlier. This
will install \textit{Solr}, start it, and populate it with the test collection.
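
For example, assuming the ZIP file was saved in your downloads folder (the path
below is purely illustrative):

\begin{minted}[frame=lines,framesep=2mm]{bash}
# install, start, and populate Solr from the downloaded ZIP
./solr_install.sh ~/Downloads/solr-8.6.2.zip
\end{minted}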
\subsection{UI installation}
To start the UI, open the file \texttt{ui/index.html} with your browser of
choice. To use the UI, it is necessary to bypass Cross-Origin Resource Sharing
(CORS) security checks by downloading and enabling a ``CORS Everywhere''
extension. I suggest
\href{https://addons.mozilla.org/en-US/firefox/addon/cors-everywhere/}{this
one} for Mozilla Firefox and derivatives.
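
The workaround is needed because the static UI page queries the local
\textit{Solr} instance directly over HTTP. As a sketch, its requests are of the
same kind as this \texttt{curl} query against \textit{Solr}'s standard
\texttt{select} endpoint (the collection name \texttt{images} and the query
term are illustrative assumptions, not necessarily the names the project uses):

\begin{minted}[frame=lines,framesep=2mm]{bash}
# query the local Solr instance as the browser-based UI does;
# "images" stands in for the actual collection name
curl 'http://localhost:8983/solr/images/select?q=sunset&wt=json'
\end{minted}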
\subsection{Run the website scrapers}
A prerequisite for running the Flickr crawler is a working Scrapy Splash
instance listening on \texttt{localhost:8050}. Should a Docker installation be
available, this can be achieved by executing this command:

\begin{minted}[frame=lines,framesep=2mm]{bash}
docker run -p 8050:8050 scrapinghub/splash
\end{minted}
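
To verify that Splash is up before starting a crawl, you can ask it to render a
page through its HTTP API (a quick sanity check, not part of the project's
scripts):

\begin{minted}[frame=lines,framesep=2mm]{bash}
# an HTML response confirms Splash is listening on port 8050
curl 'http://localhost:8050/render.html?url=https://www.flickr.com/'
\end{minted}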

To run all the website scrapers, execute the script \texttt{./scrape.sh} with
no arguments.
\section{Scraping}
The three websites chosen for scraping were \url{flickr.com}, a user-centric
image sharing service aimed at photography amateurs and professionals,
\url{123rf.com}, a stock image website, and \url{shutterstock.com}, another
stock image website.

The stock photo websites were scraped with plain \texttt{scrapy}, while
\textit{Flickr} was scraped with \texttt{scrapy-splash}, which adds browser
emulation in order to execute Javascript code and scrape infinite-scroll
paginated data.
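
For reference, \texttt{scrapy} projects expose each scraper through the
\texttt{scrapy} CLI (a sketch; the spider name \texttt{flickr} is an
illustrative assumption, and the project's \texttt{./scrape.sh} wraps the
actual invocations):

\begin{minted}[frame=lines,framesep=2mm]{bash}
# run a single spider and export the scraped items as JSON
scrapy crawl flickr -o flickr_items.json
\end{minted}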
\subsection{\textit{Flickr} and the simulated browser technology
\textit{Splash}}
As mentioned before, the implementation of the \textit{Flickr} scraper uses
|
|
|
|
\textit{Splash}, a browser emulation that supports Javascript execution and
|
|
|
|
simulated user interaction. This component is essential to allow for the website
|
|
|
|
to load correctly and to load as many photos as possible in the photo list
|
|
|
|
pagest scraped through emulation of the user performing an ``infinite'' scroll
|
|
|
|
down.
|
|
|
|
|
|
|
|
Here is the Lua script used by splash to emulate infinite scrolling. These exact
|
|
|
|
contents can be found in file
|
|
|
|
\texttt{photo\_scraper/spiders/infinite\_scroll.lua}.
|
|
|
|
|
|
|
|
\begin{minted}[linenos,frame=lines,framesep=2mm]{lua}
function main(splash)
  local num_scrolls = 20
  local scroll_delay = 0.8

  local scroll_to = splash:jsfunc("window.scrollTo")
  local get_body_height = splash:jsfunc(
    "function() {return document.body.scrollHeight;}"
  )
  assert(splash:go(splash.args.url))
  splash:wait(splash.args.wait)

  for _ = 1, num_scrolls do
    scroll_to(0, get_body_height())
    splash:wait(scroll_delay)
  end
  return splash:html()
end
\end{minted}

Line 13 contains the instruction that scrolls the page down to the current
bottom of the document body. This instruction runs in the loop of lines 12--15,
which executes the scroll \texttt{num\_scrolls} times, waiting
\texttt{scroll\_delay} seconds after every execution.

After this operation is done, the resulting HTML markup is returned, and normal
crawling techniques can work on this intermediate result.
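
Should you want to experiment with the script on its own, \textit{Splash}'s
\texttt{/execute} HTTP endpoint can run it directly (a sketch for manual
testing, not part of the project's scripts; it assumes the Splash instance from
the installation section is running and uses an illustrative Flickr search
URL):

\begin{minted}[frame=lines,framesep=2mm]{bash}
# run the infinite-scroll script through Splash's /execute endpoint;
# extra form fields ("url", "wait") are exposed as splash.args
curl 'http://localhost:8050/execute' \
  --data-urlencode 'lua_source@photo_scraper/spiders/infinite_scroll.lua' \
  --data-urlencode 'url=https://www.flickr.com/search/?text=landscape' \
  --data-urlencode 'wait=2'
\end{minted}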
\section{Indexing and \textit{Solr} configuration}
\section{User interface}
\section{User evaluation}
\end{document}