image sharing service aimed at photography amateurs and professionals,
\url{123rf.com}, a stock image website, and \url{shutterstock.com}, another
stock image website.
The stock photo websites were scraped with standard scraping technology using
plain \texttt{scrapy}, while \textit{Flickr} was scraped with the browser
emulation technology \texttt{scrapy-splash} in order to execute JavaScript code
and scrape infinite-scroll paginated data.
I would like to point out that, in order to save space, I scraped only image
links and not the images themselves. Should any content I scraped be deleted
from the services listed above, the corresponding results may point to images
that no longer exist.
As a final note, since some websites are not very welcoming to bots in their
\texttt{robots.txt} file (\textit{Flickr} in particular blocks all bots except
Google), ``robots.txt compliance'' has been turned off for all scrapers and the
user agent has been changed to mimic a normal browser.
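
In practice this corresponds to two standard Scrapy settings. The following is
a minimal sketch of the relevant part of a Scrapy \texttt{settings.py},
assuming the default project layout; the exact user agent string below is an
arbitrary example and not necessarily the one the scrapers use.

\begin{minted}[linenos,frame=lines,framesep=2mm]{python}
# Hedged sketch of the relevant Scrapy settings (settings.py).

# Disable robots.txt compliance for all spiders.
ROBOTSTXT_OBEY = False

# Mimic a normal desktop browser; this exact string is an example.
USER_AGENT = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36"
)
\end{minted}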
All scraper implementations and related files are located in the directory
\texttt{photo\_scraper/spiders}.
\subsection{\textit{Flickr}}
\subsubsection{Simulated browser technology \textit{Splash}}
As mentioned before, the implementation of the \textit{Flickr} scraper uses
\textit{Splash}, a browser emulation tool that supports JavaScript execution
and simulated user interaction. This component is essential to allow the
website's infinite-scroll pagination to load new results as the page is
scrolled down.
Here is the Lua script used by \textit{Splash} to emulate infinite scrolling;
the script can be found in the file \texttt{infinite\_scroll.lua}.
\begin{minted}[linenos,frame=lines,framesep=2mm]{lua}
function main(splash)
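  -- NOTE: only the function header above appears in this excerpt; the body
  -- below is a sketch following the canonical Splash infinite-scroll
  -- pattern. The scroll count and delay are assumptions, not necessarily
  -- the original values.
  local num_scrolls = 10
  local scroll_delay = 1.0

  -- JavaScript helpers to scroll the page and measure its height.
  local scroll_to = splash:jsfunc("window.scrollTo")
  local get_body_height = splash:jsfunc(
    "function() { return document.body.scrollHeight; }")

  assert(splash:go(splash.args.url))
  splash:wait(scroll_delay)

  -- Repeatedly scroll to the bottom, waiting for new results to load.
  for _ = 1, num_scrolls do
    scroll_to(0, get_body_height())
    splash:wait(scroll_delay)
  end

  return splash:html()
end
\end{minted}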
After this operation is done, the resulting HTML markup is returned and normal
crawling techniques can work on this intermediate result.
\subsubsection{Scraper implementation}
The Python implementation of the \textit{Flickr} scraper can be found in
\texttt{flickr.py}.
Sadly, other than a gallery of recently posted images, \textit{Flickr} offers
no curated list or categorization of its image content that would allow images
to be found without querying for them.
I therefore had to use the \textit{Flickr} search engine and query it for some
common words (including the 100 most common English verbs). Each search result
page is then fed through \textit{Splash}, and the resulting markup is searched
for links to photo pages. Each such link is opened to scrape the image link and
its metadata.
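
As an illustration of this pipeline, here is a minimal sketch of such a
spider, assuming hypothetical CSS selectors and a shortened word list; it is
not the literal contents of \texttt{flickr.py}.

\begin{minted}[linenos,frame=lines,framesep=2mm]{python}
import scrapy
from scrapy_splash import SplashRequest

# Stand-in for the 100 most common English verbs (assumed list).
COMMON_WORDS = ["be", "have", "do", "say", "get"]

class FlickrSpider(scrapy.Spider):
    name = "flickr"

    def start_requests(self):
        # Load the infinite-scroll Lua script shown earlier.
        with open("photo_scraper/spiders/infinite_scroll.lua") as f:
            lua_script = f.read()
        # Query the Flickr search engine for each common word.
        for word in COMMON_WORDS:
            yield SplashRequest(
                f"https://www.flickr.com/search/?text={word}",
                callback=self.parse_results,
                endpoint="execute",
                args={"lua_source": lua_script},
            )

    def parse_results(self, response):
        # Follow every photo page linked from the scrolled result markup
        # (the CSS selector is an assumption).
        for href in response.css("a.photo-link::attr(href)").getall():
            yield response.follow(href, callback=self.parse_photo)

    def parse_photo(self, response):
        # Scrape the image link and its metadata (assumed selectors).
        yield {
            "img_url": response.css("img.main-photo::attr(src)").get(),
            "title": response.css("h1::text").get(),
        }
\end{minted}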
\subsection{Implementation for \textit{123rf} and \textit{Shutterstock}}
The \textit{123rf} and \textit{Shutterstock} websites do not require the use of
\textit{Splash} to be scraped and, as stock image websites, offer several
precompiled catalogs of images that can be easily scraped. The crawler
implementations, which can be found in \texttt{stock123rf.py} and
\texttt{shutterstock.py} respectively, are pretty straightforward: they
navigate from the list of categories to each category's photo list, and then to
the individual photo pages to scrape the image link and metadata.
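
The following sketch illustrates this three-step navigation for
\textit{Shutterstock}; the catalog URL and CSS selectors are assumptions, not
the actual ones used in \texttt{shutterstock.py}.

\begin{minted}[linenos,frame=lines,framesep=2mm]{python}
import scrapy

class ShutterstockSpider(scrapy.Spider):
    name = "shutterstock"
    # Assumed entry point: a page listing the precompiled photo categories.
    start_urls = ["https://www.shutterstock.com/explore/stock-photo-categories"]

    def parse(self, response):
        # Step 1: follow each category to its photo list (assumed selector).
        for href in response.css("a.category::attr(href)").getall():
            yield response.follow(href, callback=self.parse_category)

    def parse_category(self, response):
        # Step 2: follow each photo in the list to its detail page
        # (assumed selector).
        for href in response.css("a.photo::attr(href)").getall():
            yield response.follow(href, callback=self.parse_photo)

    def parse_photo(self, response):
        # Step 3: scrape the image link and metadata (assumed selectors).
        yield {
            "img_url": response.css("img#preview::attr(src)").get(),
            "title": response.css("h1::text").get(),
        }
\end{minted}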
\section{Indexing and \textit{Solr} configuration}
Solr configuration was probably the trickiest part of this project. I am not an
expert on Solr's XML configuration quirks, and I certainly have not become one
by implementing this project. However, I managed to assemble a configuration
that includes both a tailored collection schema defined in XML and a custom
Solr controller to handle result clustering.
Configuration files for Solr can be found under the directory
\texttt{solr\_config}. This directory is symlinked by the
\texttt{solr\_install.sh} installation script to appear as a folder named
\texttt{server/solr/photo} inside the \texttt{solr} folder containing the Solr
installation. Therefore, the entire directory corresponds to the configuration
and data storage of the collection \texttt{photo}, the only collection present
in this project.
Please note that the \texttt{solr\_config/data} folder is ignored by Git and is
thus not present in a freshly cloned repository: this is done to preserve only
the configuration files, and not the somewhat temporary collection data. The
collection data is re-uploaded every time \texttt{solr\_install.sh} is run,
from CSV files located in the \texttt{scraped} folder and produced by Scrapy.
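
For illustration, this upload step is equivalent to posting each CSV file to
Solr's standard \texttt{/update} handler, as in the following sketch; the file
name and Solr URL are assumptions, and the actual upload is performed by
\texttt{solr\_install.sh}.

\begin{minted}[linenos,frame=lines,framesep=2mm]{python}
# Hedged sketch: post a Scrapy-produced CSV file to the "photo" collection
# via Solr's standard /update handler. File name and host are assumptions.
import requests

CSV_FILE = "scraped/flickr.csv"  # assumed file name
SOLR_UPDATE = "http://localhost:8983/solr/photo/update?commit=true"

with open(CSV_FILE, "rb") as f:
    response = requests.post(
        SOLR_UPDATE,
        data=f,
        headers={"Content-Type": "application/csv"},
    )
response.raise_for_status()  # fail loudly if Solr rejects the upload
\end{minted}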
\section{User interface}

\section{User evaluation}