% vim: set ts=2 sw=2 et tw=80:
\documentclass{scrartcl}
\usepackage{hyperref}
\usepackage{parskip}
\usepackage{minted}
\usepackage[utf8]{inputenc}
\setlength{\parindent}{0pt}
\usepackage[margin=2.5cm]{geometry}

\title{\textit{Image Search IR System} \\\vspace{0.3cm}
  \Large{WS2020-21 Information Retrieval Project}}
\author{Claudio Maggioni}

\begin{document}
\maketitle
\tableofcontents
\newpage

\section{Introduction}

This report is a summary of the work I have done to create the ``Image Search
IR system'', a proof-of-concept IR system implementing the ``Image Search
Engine'' project (project \#13). The project is built on a simple
\textit{Scrapy}-\textit{Solr}-\textit{HTML5+CSS+JS} stack. Installation
instructions, an in-depth look at the project components for scraping,
indexing, and displaying the results, and finally the user evaluation report
can all be found in the following sections.

\section{Installation instructions}

\subsection{Project repository}

The project Git repository is located at
\url{https://git.maggioni.xyz/maggicl/IRProject}.

\subsection{Solr installation}

The installation of the project and the population of the test collection with
the scraped documents are automated by a single script. The script requires
that you have downloaded \textit{Solr} version 8.6.2 as a ZIP file, i.e.\ the
same \textit{Solr} ZIP we had to download during the lab lectures. Should you
need to download a copy of the ZIP file, you can find it here:
\url{https://maggioni.xyz/solr-8.6.2.zip}.

Clone the project's Git repository and position yourself with a shell in the
project's root directory. Then execute this command:

\begin{minted}[frame=lines,framesep=2mm]{bash}
./solr_install.sh {ZIP path}
\end{minted}

where \texttt{\{ZIP path\}} is the path of the ZIP file mentioned earlier. This
will install, start, and update \textit{Solr} with the test collection.

\subsection{UI installation}

To start the UI, open the file \texttt{ui/index.html} with your browser of
choice. To use the UI, it is necessary to bypass \textit{Cross-Origin Resource
Sharing} (CORS) security checks by downloading and enabling a ``CORS
Everywhere'' extension. I suggest
\href{https://addons.mozilla.org/en-US/firefox/addon/cors-everywhere/}{this
one} for Mozilla Firefox and derivatives.

\subsection{Run the website scrapers}

A prerequisite for running the Flickr crawler is a working \textit{Splash}
instance listening on \texttt{localhost:8050}. Should a Docker installation be
available, this can be achieved by executing this command:

\begin{minted}[frame=lines,framesep=2mm]{bash}
docker run -p 8050:8050 scrapinghub/splash
\end{minted}

To run all the website scrapers, run the script \texttt{./scrape.sh} with no
arguments.

\section{Scraping}

The three websites chosen to be scraped were \url{flickr.com}, a user-centric
image sharing service aimed at photography amateurs and professionals,
\url{123rf.com}, a stock image website, and \url{shutterstock.com}, another
stock image website.

The stock photo websites were scraped with standard scraping technology using
plain \texttt{scrapy}, while \textit{Flickr} was scraped using browser
emulation technology through \texttt{scrapy-splash} in order to execute
Javascript code and scrape infinite-scroll paginated data. I would like to
point out that, in order to save space, I scraped only image links and not the
images themselves.
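To give an idea of what the plain \texttt{scrapy} spiders look like, the sketch
below shows a minimal spider that collects only the image link and its textual
metadata, never the image file itself. This is a simplified illustration rather
than the actual project code: the class name, start URL, and CSS selectors are
placeholders, while the item fields match the CSV columns used later for
indexing.

\begin{minted}[frame=lines,framesep=2mm]{python}
import scrapy


class StockPhotoSpider(scrapy.Spider):
    """Simplified sketch of a stock-photo spider (placeholder names)."""
    name = "stock_photo_example"
    # Placeholder URL: the real spiders start from each website's
    # category listing instead.
    start_urls = ["https://example.com/categories"]

    def parse(self, response):
        # Follow each photo page found in the listing (placeholder selector).
        for href in response.css("a.photo-link::attr(href)").getall():
            yield response.follow(href, callback=self.parse_photo)

    def parse_photo(self, response):
        # Yield only the image URL and its metadata, not the image itself.
        yield {
            "t_author": response.css(".author::text").get(),
            "t_title": response.css("h1::text").get(),
            "t_description": response.css(".description::text").get(),
            "date": response.css(".date::text").get(),
            "img_url": response.css("img.main::attr(src)").get(),
            "tags": ",".join(response.css(".tag::text").getall()),
        }
\end{minted}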
Should any content that I scraped be deleted from the services listed above,
some results might not be correct, as the linked images could have been
removed. As a final note, since some websites are not so kind to bots in their
\texttt{robots.txt} file (\textit{Flickr} in particular blocks all bots except
Google), ``robots.txt compliance'' has been turned off for all scrapers and the
user agent has been changed to mimic a normal browser.

All scraper implementations and related files are located in the directory
\texttt{photo\_scraper/spiders}.

\subsection{\textit{Flickr}}

\subsubsection{Simulated browser technology: \textit{Splash}}

As mentioned before, the implementation of the \textit{Flickr} scraper uses
\textit{Splash}, a browser emulation service that supports Javascript execution
and simulated user interaction. This component is essential to allow the
website to load correctly and to load as many photos as possible in the photo
list pages, which are scraped by emulating a user performing an ``infinite''
scroll down.

Here is the Lua script used by \textit{Splash} to emulate infinite scrolling.
These exact contents can be found in the file \texttt{infinite\_scroll.lua}.

\begin{minted}[linenos,frame=lines,framesep=2mm]{lua}
function main(splash)
  local num_scrolls = 20
  local scroll_delay = 0.8

  local scroll_to = splash:jsfunc("window.scrollTo")
  local get_body_height = splash:jsfunc(
    "function() {return document.body.scrollHeight;}"
  )

  assert(splash:go(splash.args.url))
  splash:wait(splash.args.wait)
  for _ = 1, num_scrolls do
    scroll_to(0, get_body_height())
    splash:wait(scroll_delay)
  end
  return splash:html()
end
\end{minted}

Line 13 contains the instruction that scrolls down one page height. This
instruction runs in the loop of lines 12-15, which executes the scroll
instruction \texttt{num\_scrolls} times, waiting \texttt{scroll\_delay} seconds
after every execution. After this operation is done, the resulting HTML markup
is returned and normal crawling techniques can work on this intermediate
result.

\subsubsection{Scraper implementation}

The Python implementation of the \textit{Flickr} scraper can be found in
\texttt{flickr.py}. Sadly, \textit{Flickr}, other than a recently posted
gallery of images, offers no curated list of image content or categorization
that would allow finding images other than by querying for them. I therefore
had to use the \textit{Flickr} search engine to query for some common words
(including the list of the 100 most common English verbs). Then, each search
result page is fed through \textit{Splash} and the resulting markup is searched
for image links. Each link is opened to scrape the image link and its metadata.

\subsection{Implementation for \textit{123rf} and \textit{Shutterstock}}

The \textit{123rf} and \textit{Shutterstock} websites do not require the use of
\textit{Splash} to be scraped and, as stock image websites, offer several
precompiled catalogs of images that can be easily scraped. The crawler
implementations, which can be found in \texttt{stock123rf.py} and
\texttt{shutterstock.py} respectively, are pretty straightforward: they
navigate from the list of categories, to each category's photo list, and then
to the individual photo pages to scrape the image link and metadata.

\section{Indexing and \textit{Solr} configuration}

Solr configuration was probably the trickiest part of this project. I am not an
expert on Solr's XML configuration quirks, and I certainly have not become one
by implementing this project.
However, I managed to assemble a configuration that has both a tailored
collection schema defined in XML and a custom Solr controller to handle result
clustering.

Configuration files for Solr can be found under the directory
\texttt{solr\_config}. This directory is symlinked by the
\texttt{solr\_install.sh} installation script to appear as a folder named
\texttt{server/solr/photo} inside the \texttt{solr} folder containing the Solr
installation. Therefore, the entire directory corresponds to the configuration
and data storage for the collection \texttt{photo}, the only collection present
in this project.

Please note that the \texttt{solr\_config/data} folder is ignored by Git and
thus not present in a freshly cloned repository: this is done to preserve only
the configuration files, and not the somewhat temporary collection data. The
collection data is uploaded from the CSV files located in the \texttt{scraped}
folder (produced by Scrapy) every time \texttt{solr\_install.sh} is run.

The configuration was derived from the \texttt{techproducts} Solr example by
changing the collection schema and removing any non-needed controller.

\subsection{Solr schema}

As some minor edits were made using Solr's web interface, the relevant XML
schema to analyse is the file \texttt{solr\_config/conf/managed-schema}; this
file also stores the edits done through the UI. An extract of the relevant
lines is shown below:

\begin{minted}[linenos,frame=lines,framesep=2mm]{xml}
<field name="id" type="string" indexed="true" stored="true"
       required="true" multiValued="false"/>
<field name="t_author" type="text_general" indexed="true" stored="true"/>
<field name="t_title" type="text_general" indexed="true" stored="true"/>
<field name="t_description" type="text_general" indexed="true" stored="true"/>
<field name="date" type="text_general" indexed="true" stored="true"/>
<field name="img_url" type="text_general" indexed="true" stored="true"/>
<field name="tags" type="text_general" indexed="true" stored="true"/>
<field name="text" type="text_general" indexed="true" stored="false"
       multiValued="true"/>

<uniqueKey>id</uniqueKey>

<copyField source="t_*" dest="text"/>
\end{minted}

All fields have type \texttt{text\_general}. Fields with a name starting with
``\texttt{t\_}'' are included in the \texttt{text} copy field, which is used as
the default field for document similarity when searching and clustering. The
\texttt{id} field is of type \texttt{string}, but in actuality it is always a
positive integer. This field's values do not come from data scraped from the
websites: the value is computed as an auto-incremented progressive identifier
when uploading the collection to Solr using \texttt{solr\_install.sh}. Shown
below is the \texttt{awk}-based piped command included in the installation
script that performs this task and uploads the collection.

\begin{minted}[linenos,frame=lines,framesep=2mm]{bash}
# at this point in the script, `pwd` is
# the repository root directory
cd scraped

# POST scraped data
tail -q -n +2 photos.csv 123rf.csv shutterstock.csv | \
  awk '{print NR-1 "," $0}' | \
  awk 'BEGIN {print "id,t_author,t_title,t_description,date,img_url,tags"}
    {print}' | \
  ../solr/bin/post -c photo -type text/csv -out yes -d
\end{minted}

Line 6 strips the heading line of each of the listed CSV files and concatenates
them; line 7 adds ``\texttt{\{id\},}'' at the beginning of each line, where
\texttt{\{id\}} corresponds to the line number; lines 8 and 9 add the correct
CSV heading, including the ``id'' field; line 10 finally reads the processed
data and posts it to Solr.

\subsection{Clustering configuration}

Clustering configuration was performed by using the \texttt{solrconfig.xml}
file from the \texttt{techproducts} Solr example and adapting it to the
``photo'' collection schema. Here is the XML configuration relevant to the
clustering controller; it can be found at approximately line 900 of the
\texttt{solrconfig.xml} file:

\begin{minted}[linenos,frame=lines,framesep=2mm]{xml}
<requestHandler name="/select"
                startup="lazy"
                enable="${solr.clustering.enabled:true}"
                class="solr.SearchHandler">
  <lst name="defaults">
    <bool name="clustering">true</bool>
    <bool name="clustering.results">true</bool>
    <!-- Field with the logical "title" of each document -->
    <str name="carrot.title">t_title</str>
    <!-- Field with the logical "URL" of each document -->
    <str name="carrot.url">img_url</str>
    <!-- Field with the logical "content" of each document -->
    <str name="carrot.snippet">t_description</str>

    <!-- Apply the highlighter to title and content and
         use the result for clustering -->
    <bool name="carrot.produceSummary">true</bool>
    <!-- Do not produce sub-clusters -->
    <bool name="carrot.outputSubClusters">false</bool>

    <!-- Standard search defaults -->
    <str name="defType">edismax</str>
    <str name="df">text</str>
    <str name="q.alt">*:*</str>
    <str name="rows">100</str>
    <str name="fl">*,score</str>
  </lst>

  <arr name="last-components">
    <str>clustering</str>
  </arr>
</requestHandler>
\end{minted}

This clustering controller uses Carrot2 technology to perform ``shallow''
one-level clustering (line 19 disables sub-clusters).
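As an illustration of how this controller is queried, the sketch below issues a
single request and reads back both the search results and the clusters from the
JSON response. It is not part of the project code: the Solr host and port (the
default \texttt{localhost:8983}) and the query term are assumptions, while the
collection name and field names come from the configuration above.

\begin{minted}[frame=lines,framesep=2mm]{python}
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Hypothetical example: default Solr port and a placeholder query term.
params = urlencode({"q": "sunset", "wt": "json"})
url = "http://localhost:8983/solr/photo/select?" + params

with urlopen(url) as response:
    data = json.load(response)

# Regular search results (at most 100, per the "rows" default above).
for doc in data["response"]["docs"]:
    print(doc.get("t_title"), doc.get("img_url"))

# Carrot2 clusters computed on the same request.
for cluster in data.get("clusters", []):
    print(cluster["labels"], len(cluster["docs"]))
\end{minted}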
In the handler configuration, \texttt{t\_title} is used as the ``title'' field
for each document, \texttt{img\_url} as the ``document location'' field, and
\texttt{t\_description} as the ``description'' field (see respectively lines 9,
11, and 13 of the configuration). This controller replaces the normal
\texttt{/select} controller, and thus a single request generates both search
results and clustering data. The search defaults are a limit of 100 results and
the use of the \texttt{t\_*} fields to match documents (lines 25 and 23 --
remember the definition of the \texttt{text} copy field).

\section{User interface}

\section{User evaluation}

\end{document}