image sharing service aimed at photography amateurs and professionals,
\url{123rf.com}, a stock image website, and \url{shutterstock.com}, another
stock image website.
The stock photo websites were scraped with standard scraping technology using
plain \texttt{scrapy}, while \textit{Flickr} was scraped with the browser
emulation technology \texttt{scrapy-splash} in order to execute JavaScript code
and scrape infinite-scroll paginated data.
I would like to point out that, in order to save space, I scraped only image
links and not the images themselves. Should any content I scraped be deleted
from the services listed above, the corresponding results may point to images
that no longer exist.
As a final note, since some websites are not very welcoming to bots in their
\texttt{robots.txt} file (\textit{Flickr} in particular blocks all bots except
Google), ``robots.txt compliance'' has been turned off for all scrapers and the
user agent has been changed to mimic a normal browser.
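
In practice this corresponds to two standard Scrapy settings. The following is
a minimal sketch of the relevant part of a Scrapy \texttt{settings.py},
assuming the default project layout; the exact user agent string below is an
arbitrary example and not necessarily the one the scrapers use.

\begin{minted}[linenos,frame=lines,framesep=2mm]{python}
# Hedged sketch of the relevant Scrapy settings (settings.py).

# Disable robots.txt compliance for all spiders.
ROBOTSTXT_OBEY = False

# Mimic a normal desktop browser; this exact string is an example.
USER_AGENT = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36"
)
\end{minted}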
All scraper implementations and related files are located in the directory
\texttt{photo\_scraper/spiders}.
\subsection{\textit{Flickr}}
\subsubsection{Simulated browser technology \textit{Splash}}
As mentioned before, the implementation of the \textit{Flickr} scraper uses
\textit{Splash}, a browser emulation tool that supports JavaScript execution
and simulated user interaction. This component is essential to allow the
website's infinite-scroll pagination to load new results as the page is
scrolled down.
Here is the Lua script used by \textit{Splash} to emulate infinite scrolling;
the script can be found in the file \texttt{infinite\_scroll.lua}.
\begin{minted}[linenos,frame=lines,framesep=2mm]{lua}
function main(splash)
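  -- NOTE: only the function header above appears in this excerpt; the body
  -- below is a sketch following the canonical Splash infinite-scroll
  -- pattern. The scroll count and delay are assumptions, not necessarily
  -- the original values.
  local num_scrolls = 10
  local scroll_delay = 1.0

  -- JavaScript helpers to scroll the page and measure its height.
  local scroll_to = splash:jsfunc("window.scrollTo")
  local get_body_height = splash:jsfunc(
    "function() { return document.body.scrollHeight; }")

  assert(splash:go(splash.args.url))
  splash:wait(scroll_delay)

  -- Repeatedly scroll to the bottom, waiting for new results to load.
  for _ = 1, num_scrolls do
    scroll_to(0, get_body_height())
    splash:wait(scroll_delay)
  end

  return splash:html()
end
\end{minted}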
After this operation is done, the resulting HTML markup is returned and normal
crawling techniques can work on this intermediate result.
\subsubsection{Scraper implementation}
The Python implementation of the \textit{Flickr} scraper can be found in
\texttt{flickr.py}.
Sadly, other than a gallery of recently posted images, \textit{Flickr} offers
no curated list or categorization of its image content that would allow images
to be found without querying for them.
I therefore had to use the \textit{Flickr} search engine and query it for some
common words (including the 100 most common English verbs). Each search result
page is then fed through \textit{Splash}, and the resulting markup is searched
for links to photo pages. Each such link is opened to scrape the image link and
its metadata.
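
As an illustration of this pipeline, here is a minimal sketch of such a
spider, assuming hypothetical CSS selectors and a shortened word list; it is
not the literal contents of \texttt{flickr.py}.

\begin{minted}[linenos,frame=lines,framesep=2mm]{python}
import scrapy
from scrapy_splash import SplashRequest

# Stand-in for the 100 most common English verbs (assumed list).
COMMON_WORDS = ["be", "have", "do", "say", "get"]

class FlickrSpider(scrapy.Spider):
    name = "flickr"

    def start_requests(self):
        # Load the infinite-scroll Lua script shown earlier.
        with open("photo_scraper/spiders/infinite_scroll.lua") as f:
            lua_script = f.read()
        # Query the Flickr search engine for each common word.
        for word in COMMON_WORDS:
            yield SplashRequest(
                f"https://www.flickr.com/search/?text={word}",
                callback=self.parse_results,
                endpoint="execute",
                args={"lua_source": lua_script},
            )

    def parse_results(self, response):
        # Follow every photo page linked from the scrolled result markup
        # (the CSS selector is an assumption).
        for href in response.css("a.photo-link::attr(href)").getall():
            yield response.follow(href, callback=self.parse_photo)

    def parse_photo(self, response):
        # Scrape the image link and its metadata (assumed selectors).
        yield {
            "img_url": response.css("img.main-photo::attr(src)").get(),
            "title": response.css("h1::text").get(),
        }
\end{minted}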
\subsection{Implementation for \textit{123rf} and \textit{Shutterstock}}
The \textit{123rf} and \textit{Shutterstock} websites do not require the use of
\textit{Splash} to be scraped and, as stock image websites, offer several
precompiled catalogs of images that can be easily scraped. The crawler
implementations, which can be found in \texttt{stock123rf.py} and
\texttt{shutterstock.py} respectively, are pretty straightforward: they
navigate from the list of categories to each category's photo list, and then to
the individual photo pages to scrape the image link and metadata.
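
The following sketch illustrates this three-step navigation for
\textit{Shutterstock}; the catalog URL and CSS selectors are assumptions, not
the actual ones used in \texttt{shutterstock.py}.

\begin{minted}[linenos,frame=lines,framesep=2mm]{python}
import scrapy

class ShutterstockSpider(scrapy.Spider):
    name = "shutterstock"
    # Assumed entry point: a page listing the precompiled photo categories.
    start_urls = ["https://www.shutterstock.com/explore/stock-photo-categories"]

    def parse(self, response):
        # Step 1: follow each category to its photo list (assumed selector).
        for href in response.css("a.category::attr(href)").getall():
            yield response.follow(href, callback=self.parse_category)

    def parse_category(self, response):
        # Step 2: follow each photo in the list to its detail page
        # (assumed selector).
        for href in response.css("a.photo::attr(href)").getall():
            yield response.follow(href, callback=self.parse_photo)

    def parse_photo(self, response):
        # Step 3: scrape the image link and metadata (assumed selectors).
        yield {
            "img_url": response.css("img#preview::attr(src)").get(),
            "title": response.css("h1::text").get(),
        }
\end{minted}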
\section{Indexing and \textit{Solr} configuration}
Solr configuration was probably the trickiest part of this project. I am not an
expert on Solr's XML configuration quirks, and I certainly have not become one
by implementing this project. However, I managed to assemble a configuration
that includes both a tailored collection schema defined in XML and a custom
Solr controller to handle result clustering.
Configuration files for Solr can be found under the directory
\texttt{solr\_config}. This directory is symlinked by the
\texttt{solr\_install.sh} installation script to appear as a folder named
\texttt{server/solr/photo} inside the \texttt{solr} folder containing the Solr
installation. Therefore, the entire directory corresponds to the configuration
and data storage of the collection \texttt{photo}, the only collection present
in this project.
Please note that the \texttt{solr\_config/data} folder is ignored by Git and is
thus not present in a freshly cloned repository: this is done to preserve only
the configuration files, and not the somewhat temporary collection data. The
collection data is re-uploaded every time \texttt{solr\_install.sh} is run,
from CSV files located in the \texttt{scraped} folder and produced by Scrapy.
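
For illustration, this upload step is equivalent to posting each CSV file to
Solr's standard \texttt{/update} handler, as in the following sketch; the file
name and Solr URL are assumptions, and the actual upload is performed by
\texttt{solr\_install.sh}.

\begin{minted}[linenos,frame=lines,framesep=2mm]{python}
# Hedged sketch: post a Scrapy-produced CSV file to the "photo" collection
# via Solr's standard /update handler. File name and host are assumptions.
import requests

CSV_FILE = "scraped/flickr.csv"  # assumed file name
SOLR_UPDATE = "http://localhost:8983/solr/photo/update?commit=true"

with open(CSV_FILE, "rb") as f:
    response = requests.post(
        SOLR_UPDATE,
        data=f,
        headers={"Content-Type": "application/csv"},
    )
response.raise_for_status()  # fail loudly if Solr rejects the upload
\end{minted}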
\section{User interface}

\section{User evaluation}