This commit is contained in:
Claudio Maggioni 2020-12-08 12:27:03 +01:00
parent ba603bcfd2
commit 45691fa806
2 changed files with 63 additions and 4 deletions

Binary file not shown.

@ -81,13 +81,26 @@ image sharing service service aimed at photography amatures and professionals,
\url{123rf.com}, a stock image website, and \url{shutterstock.com}, another
stock image website.
The stock photo website were scraped with standard scraping technology using
The stock photo websites were scraped with standard scraping technology using
plain \texttt{scrapy}, while \textit{Flickr} was scraped using browser emulation
technology using \texttt{scrapy-splash} in order to execute JavaScript code and
scrape infinite-scroll paginated data.
\subsection{\textit{Flickr} and the simulated browser technology
\textit{Splash}}
I would like to point out that, in order to save space, I scraped only image
links and not the images themselves. If any scraped content is later deleted
from the services listed above, some results may therefore point to images that
no longer exist.
As a final note, since the \texttt{robots.txt} files of some websites are
restrictive towards bots (\textit{Flickr} in particular blocks all
bots except Google), ``robots.txt compliance'' has been turned off for all
scrapers and the user agent has been changed to mimic a normal browser.
All scraper implementations and related files are located in the directory
\texttt{photo\_scraper/spiders}.
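As a rough illustration, the two changes described above could be expressed
with the standard Scrapy settings \texttt{ROBOTSTXT\_OBEY} and
\texttt{USER\_AGENT}. The excerpt below is only a sketch: the user agent string
is a placeholder I chose for the example, not necessarily the value used by the
scrapers.

```python
# Hypothetical excerpt of a Scrapy settings module (a sketch, not the
# project's actual file).

# Do not download or honor robots.txt before crawling a site.
ROBOTSTXT_OBEY = False

# Placeholder desktop-browser user agent (assumed value for illustration).
USER_AGENT = (
    "Mozilla/5.0 (X11; Linux x86_64; rv:83.0) Gecko/20100101 Firefox/83.0"
)
```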
\subsection{\textit{Flickr}}
\subsubsection{Simulated browser technology \textit{Splash}}
As mentioned before, the implementation of the \textit{Flickr} scraper uses
\textit{Splash}, a browser emulator that supports JavaScript execution and
simulated user interaction. This component is essential to allow for the website
@ -97,7 +110,7 @@ down.
Here is the Lua script used by \textit{Splash} to emulate infinite scrolling. These exact
contents can be found in file
\texttt{photo\_scraper/spiders/infinite\_scroll.lua}.
\texttt{infinite\_scroll.lua}.
\begin{minted}[linenos,frame=lines,framesep=2mm]{lua}
function main(splash)
@ -127,8 +140,54 @@ execution.
After this operation is done, the resulting HTML markup is returned and normal
crawling techniques can work on this intermediate result.
\subsubsection{Scraper implementation}
The Python implementation of the \textit{Flickr} scraper can be found under
\texttt{flickr.py}.
Sadly, apart from a gallery of recently posted images, \textit{Flickr} offers no
curated list of image content or categorization that would allow finding
images other than by querying for them.
I therefore had to use the \textit{Flickr}
search engine to query for some common words (including the list of the 100 most
common English verbs). Each search result page is then fed through
\textit{Splash} and the resulting markup is searched for links to photo pages.
Each linked page is opened to scrape the image link and its metadata.
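The query step can be sketched as follows. This is a minimal example and not
the code in \texttt{flickr.py}: the search URL pattern, the function name
\texttt{search\_urls} and the three sample words are assumptions made for
illustration. Each generated URL would then be rendered through
\textit{Splash} before its markup is searched for links.

```python
from urllib.parse import urlencode

# Assumed search endpoint; the real spider may build its URLs differently.
FLICKR_SEARCH = "https://www.flickr.com/search/"


def search_urls(words):
    """Return one search URL per query word, with the word URL-encoded."""
    return [f"{FLICKR_SEARCH}?{urlencode({'text': word})}" for word in words]


# Hypothetical subset of the query words, used only as a demonstration.
for url in search_urls(["be", "have", "make"]):
    print(url)
```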
\subsection{Implementation for \textit{123rf} and \textit{Shutterstock}}
The \textit{123rf} and \textit{Shutterstock} websites do not require the use of
\textit{Splash} to be scraped and, as stock image websites, offer several
precompiled catalogs of images that can be easily scraped. The crawler
implementations, which can be found in \texttt{stock123rf.py} and
\texttt{shutterstock.py} respectively, are fairly straightforward: they
navigate from the list of categories to each category's photo list, and then
to the individual photo page to scrape the image link and metadata.
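The three-level navigation described above can be sketched with the Python
standard library alone. This is not the code of the actual spiders, which rely
on Scrapy callbacks and selectors: the CSS class names \texttt{category},
\texttt{photo} and \texttt{main}, the helper names and the in-memory
\texttt{site} are all hypothetical and serve only to show the traversal order.

```python
from html.parser import HTMLParser


class AttrCollector(HTMLParser):
    """Collect one attribute from every <tag> that carries a given CSS class."""

    def __init__(self, tag, css_class, attr):
        super().__init__()
        self.tag, self.css_class, self.attr = tag, css_class, attr
        self.values = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == self.tag and self.css_class in classes and attrs.get(self.attr):
            self.values.append(attrs[self.attr])


def select(html, tag, css_class, attr):
    """Return the requested attribute of all matching elements in the markup."""
    collector = AttrCollector(tag, css_class, attr)
    collector.feed(html)
    return collector.values


def crawl(fetch, start_url):
    """Walk category list -> photo list -> photo page, yielding image links."""
    for category_url in select(fetch(start_url), "a", "category", "href"):
        for page_url in select(fetch(category_url), "a", "photo", "href"):
            for image_url in select(fetch(page_url), "img", "main", "src"):
                yield {"page": page_url, "image": image_url}


# Tiny in-memory stand-in for a website, so the sketch runs without a network.
site = {
    "/": '<a class="category" href="/nature">Nature</a>',
    "/nature": '<a class="photo" href="/p/1">1</a><a class="ad" href="/x">x</a>',
    "/p/1": '<img class="main" src="http://img.example/1.jpg">',
}
results = list(crawl(site.__getitem__, "/"))
```

Passing the page-fetching function as a parameter keeps the traversal logic
separate from the network layer, which is roughly the role Scrapy plays in the
real crawlers.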
\section{Indexing and \textit{Solr} configuration}
Solr configuration was probably the trickiest part of this project. I am not an
expert in Solr XML configuration quirks, and I certainly have not become one
by implementing this project. However, I managed to assemble a configuration that
has both a tailored collection schema defined as XML and a custom Solr
controller to handle result clustering.
Configuration files for Solr can be found under the directory
\texttt{solr\_config}. This directory is symlinked by the
\texttt{solr\_install.sh} installation script to appear as a folder named
\texttt{server/solr/photo} in the \texttt{solr} folder containing the Solr
installation. Therefore, the entire directory corresponds to the configuration
and data storage for the collection \texttt{photo}, the only collection present
in this project.
Please note that the \texttt{solr\_config/data} folder is
ignored by Git and thus not present in a freshly cloned repository: this is done
to preserve only the configuration files, and not the somewhat temporary
collection data. The collection data is uploaded every time
\texttt{solr\_install.sh} is run, using the CSV files produced by Scrapy and
located in the \texttt{scraped} folder.
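For reference, Solr accepts CSV documents through a POST to the
\texttt{update} handler of a collection. The sketch below shows how such a
request could be built for the \texttt{photo} collection. It is an
illustration only: \texttt{solr\_install.sh} may upload the data through a
different mechanism, and the host, the port and the function name
\texttt{csv\_upload\_request} are assumptions.

```python
from urllib.request import Request

# Assumed address of a local Solr instance; "photo" is the collection name.
SOLR_UPDATE = "http://localhost:8983/solr/photo/update?commit=true"


def csv_upload_request(csv_bytes):
    """Build (without sending) a POST request uploading CSV rows to Solr."""
    return Request(
        SOLR_UPDATE,
        data=csv_bytes,
        headers={"Content-Type": "application/csv"},
        method="POST",
    )


# Hypothetical two-column CSV payload, used only as a demonstration.
request = csv_upload_request(b"id,title\n1,Sunset\n")
```

Sending the request with \texttt{urllib.request.urlopen} for every CSV file in
the \texttt{scraped} folder would then populate the collection, assuming Solr
is running at the address above.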
\section{User interface}
\section{User evaluation}