whip

2020-12-08 12:27:03 +01:00 · 2020-12-08 12:27:03 +01:00 · 45691fa806
commit 45691fa806
parent ba603bcfd2
2 changed files with 63 additions and 4 deletions
--- a/report/report.pdf
+++ b/report/report.pdf
--- a/report/report.tex
+++ b/report/report.tex
@@ -81,13 +81,26 @@ image sharing service service aimed at photography amatures and professionals,
 \url{123rf.com}, a stock image website, and \url{shutterstock.com}, another
 stock image website.

-The stock photo website were scraped with standard scraping technology using
+The stock photo websites were scraped with standard scraping technology using
 plain \texttt{scrapy}, while \textit{Flickr} was scraped using browser emulation
 technology using \texttt{scrapy-splash} in order to execute Javascript code and
 scrape infinite-scroll paginated data.

-\subsection{\textit{Flickr} and the simulated browser technology
-\textit{Splash}}
+I would like to point out that, in order to save space, I scraped only image
+links, and not the images themselves. Should any scraped content be deleted from
+the services listed above, the corresponding results would point to images that
+no longer exist.
+
+As a final note, since some websites are rather restrictive towards bots in
+their \texttt{robots.txt} file (\textit{Flickr} in particular blocks all
+bots except Google), ``robots.txt compliance'' has been turned off for all
+scrapers and the user agent has been changed to mimic a normal browser.
+
+All scraper implementations and related files are located in the directory
+\texttt{photo\_scraper/spiders}.
+
+\subsection{\textit{Flickr}}
+\subsubsection{Simulated browser technology \textit{Splash}}
 As mentioned before, the implementation of the \textit{Flickr} scraper uses
 \textit{Splash}, a browser emulation that supports Javascript execution and
 simulated user interaction. This component is essential to allow for the website
@@ -97,7 +110,7 @@ down.

 Here is the Lua script used by splash to emulate infinite scrolling. These exact
 contents can be found in file
-\texttt{photo\_scraper/spiders/infinite\_scroll.lua}.
+\texttt{infinite\_scroll.lua}.

 \begin{minted}[linenos,frame=lines,framesep=2mm]{lua}
 function main(splash)
@@ -127,8 +140,54 @@ execution.
 After this operation is done, the resulting HTML markup is returned and normal
 crawling techniques can work on this intermediate result.

+\subsubsection{Scraper implementation}
+The Python implementation of the \textit{Flickr} scraper can be found under
+\texttt{flickr.py}.
+
+Sadly, apart from a gallery of recently posted images, \textit{Flickr} offers no
+curated list or categorization of image content, so the only way to find
+images is to query for them.
+
+I therefore had to use the \textit{Flickr}
+search engine to query for some common words (including the list of the 100 most
+common English verbs). Then, each search result page is fed through
+\textit{Splash} and the resulting markup is searched for image links. Each link
+is opened to scrape the image link and its metadata.
+
+\subsection{Implementation for \textit{123rf} and \textit{Shutterstock}}
+The \textit{123rf} and \textit{Shutterstock} websites do not require the use of
+\textit{Splash} to be scraped and, as stock image websites, offer several
+precompiled catalogs of images that can be easily scraped. The crawler
+implementations, which can be found in \texttt{stock123rf.py} and
+\texttt{shutterstock.py} respectively, are fairly straightforward: they
+navigate from the list of categories to each category's photo list, and then
+to the individual photo page to scrape the image link and metadata.
+
 \section{Indexing and \textit{Solr} configuration}

+Solr configuration was probably the trickiest part of this project. I am not an
+expert in Solr XML configuration quirks, and I have certainly not become one
+by implementing this project. However, I managed to assemble a configuration that
+has both a tailored collection schema defined as XML and a custom Solr
+controller to handle result clustering.
+
+Configuration files for Solr can be found under the directory
+\texttt{solr\_config}. This directory is symlinked by the
+\texttt{solr\_install.sh} installation script to appear as a folder named
+\texttt{server/solr/photo} in the \texttt{solr} folder containing the Solr
+installation. Therefore, the entire directory corresponds to the configuration
+and data storage for the collection \texttt{photo}, the only collection present
+in this project.
+
+Please note that the \texttt{solr\_config/data} folder is
+ignored by Git and thus not present in a freshly cloned repository: this is done
+to preserve only the configuration files, and not the somewhat temporary
+collection data. Every time \texttt{solr\_install.sh} is run, the collection
+data is uploaded from the CSV files produced by Scrapy and located in the
+\texttt{scraped} folder.
+
+
+
 \section{User interface}

 \section{User evaluation}
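
The scraper configuration described in the report text of this commit (robots.txt compliance turned off, browser-like user agent) could look roughly like the Scrapy settings fragment below. This is only a sketch: the project's actual settings file is not part of this diff, and the user agent string is an illustrative assumption. `ROBOTSTXT_OBEY` and `USER_AGENT` are standard Scrapy settings.

```python
# Hypothetical Scrapy settings fragment; the values are illustrative
# assumptions, not taken from the project's real configuration.

# Do not download or honour robots.txt rules (standard Scrapy setting).
ROBOTSTXT_OBEY = False

# Present a browser-like user agent instead of Scrapy's default one.
# The exact string used by the project is not shown in the report.
USER_AGENT = (
    "Mozilla/5.0 (X11; Linux x86_64; rv:83.0) Gecko/20100101 Firefox/83.0"
)
```

In a Scrapy project these names would normally live in the project's `settings.py`, or in a spider's `custom_settings` dictionary if they should apply to a single scraper only.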