image sharing service aimed at photography amateurs and professionals,
\url{123rf.com}, a stock image website, and \url{shutterstock.com}, another
stock image website.

The stock photo websites were scraped with plain \texttt{scrapy}, while
\textit{Flickr} was scraped with the browser emulation technology
\texttt{scrapy-splash} in order to execute Javascript code and scrape
infinite-scroll paginated data.
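
To make the distinction concrete, the following sketch shows the two kinds of
requests side by side; the spider, the URLs and the \texttt{wait} value are
hypothetical and not taken from the actual implementation.

\begin{minted}[linenos,frame=lines,framesep=2mm]{python}
import scrapy
from scrapy_splash import SplashRequest


class RequestKindsSketch(scrapy.Spider):
    # Hypothetical spider contrasting the two request types.
    name = "request_kinds_sketch"

    def start_requests(self):
        # Plain scrapy: the HTML is parsed exactly as the server returns it.
        yield scrapy.Request(
            "https://www.example-stock-site.com/photos",
            callback=self.parse,
        )
        # scrapy-splash: the page is first rendered by the Splash browser
        # emulator, so Javascript runs and dynamically loaded content is
        # present in the markup handed to the callback.
        yield SplashRequest(
            "https://www.flickr.com/search/?text=example",
            callback=self.parse,
            args={"wait": 2.0},
        )

    def parse(self, response):
        # Both responses expose the same selector API.
        yield {"links": response.css("a::attr(href)").getall()}
\end{minted}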

I would like to point out that, in order to save space, I scraped only image
links and not the images themselves. Should any content that I scraped be
deleted from the services listed above, some results might not be correct, as
the linked images could have been removed in the meantime.

As a final note, since some websites are not so kind to bots in their
\texttt{robots.txt} file (\textit{Flickr} in particular blocks all bots except
Google), ``robots.txt compliance'' has been turned off for all scrapers and the
user agent has been changed to mimic a normal browser.
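
In \texttt{scrapy} both tweaks are a matter of two project settings; a sketch
of the relevant entries is shown below (the user agent string is only an
example, not necessarily the exact one used).

\begin{minted}[linenos,frame=lines,framesep=2mm]{python}
# settings.py (excerpt)

# Do not fetch or honour robots.txt.
ROBOTSTXT_OBEY = False

# Present the scraper as a regular desktop browser.
USER_AGENT = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0 Safari/537.36"
)
\end{minted}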

All scraper implementations and related files are located in the directory
\texttt{photo\_scraper/spiders}.
\subsection{\textit{Flickr}}
\subsubsection{Simulated browser technology \textit{Splash}}
As mentioned before, the implementation of the \textit{Flickr} scraper uses
\textit{Splash}, a browser emulator that supports Javascript execution and
simulated user interaction. This component is essential to allow the website's
dynamically loaded content to appear as the page is scrolled down.

Here is the Lua script used by \textit{Splash} to emulate infinite scrolling.
These exact contents can be found in the file \texttt{infinite\_scroll.lua}.
\begin{minted}[linenos,frame=lines,framesep=2mm]{lua}
function main(splash)
  -- ...
end
\end{minted}

After this operation is done, the resulting HTML markup is returned and normal
crawling techniques can work on this intermediate result.
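
For context, this is roughly how such a script can be attached to a request
with \texttt{scrapy-splash}: the Lua source is sent to the \texttt{execute}
endpoint, which runs it and hands its return value back to the spider. The
spider below is only a sketch, not the actual implementation.

\begin{minted}[linenos,frame=lines,framesep=2mm]{python}
from pathlib import Path

import scrapy
from scrapy_splash import SplashRequest

# Load the Lua script that performs the infinite scrolling.
LUA_SCRIPT = Path("infinite_scroll.lua").read_text()


class SplashWiringSketch(scrapy.Spider):
    # Hypothetical spider, shown only to illustrate the Splash wiring.
    name = "splash_wiring_sketch"

    def start_requests(self):
        yield SplashRequest(
            "https://www.flickr.com/search/?text=example",
            callback=self.parse,
            # The 'execute' endpoint runs the Lua script inside Splash.
            endpoint="execute",
            args={"lua_source": LUA_SCRIPT},
        )

    def parse(self, response):
        # The response body is the post-scroll HTML markup.
        for href in response.css("a::attr(href)").getall():
            self.logger.debug("found link: %s", href)
\end{minted}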
\subsubsection{Scraper implementation}
The Python implementation of the \textit{Flickr} scraper can be found in
\texttt{flickr.py}.

Sadly, \textit{Flickr}, other than a gallery of recently posted images, offers
no curated list or categorization of image content that would allow finding
images other than by querying for them.

I therefore had to use the \textit{Flickr} search engine to query for some
common words (including the list of the 100 most common English verbs). Then,
each search result page is fed through \textit{Splash} and the resulting markup
is searched for image links. Each link is opened to scrape the image link and
its metadata.
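
The overall flow can be sketched as follows; the word list is truncated, and
the URL format and selectors are illustrative assumptions rather than the
actual code.

\begin{minted}[linenos,frame=lines,framesep=2mm]{python}
import scrapy
from scrapy_splash import SplashRequest

# Illustrative subset of the query words (the real list also includes the
# 100 most common English verbs).
QUERY_WORDS = ["be", "have", "do", "say", "get", "make"]


class FlickrSearchSketch(scrapy.Spider):
    # Hypothetical sketch, not the actual flickr.py implementation.
    name = "flickr_search_sketch"

    def start_requests(self):
        # One Splash-rendered request per query word.
        for word in QUERY_WORDS:
            yield SplashRequest(
                f"https://www.flickr.com/search/?text={word}",
                callback=self.parse_results,
            )

    def parse_results(self, response):
        # Follow every photo link found in the rendered markup.
        for href in response.css("a[href*='/photos/']::attr(href)").getall():
            yield response.follow(href, callback=self.parse_photo)

    def parse_photo(self, response):
        # Scrape the image link and its metadata from the photo page.
        yield {
            "image_url": response.css("img::attr(src)").get(),
            "title": response.css("title::text").get(),
        }
\end{minted}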
\subsection{Implementation for \textit{123rf} and \textit{Shutterstock}}
The \textit{123rf} and \textit{Shutterstock} websites do not require the use of
\textit{Splash} to be scraped and, as stock image websites, offer several
precompiled catalogs of images that can be easily scraped. The crawler
implementations, which can be found in \texttt{stock123rf.py} and
\texttt{shutterstock.py} respectively, are pretty straightforward: they
navigate from the list of categories, to each category's photo list, and then
to the individual photo page to scrape the image link and metadata.
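
The three navigation levels map naturally onto three spider callbacks, roughly
as sketched below; the start URL and selectors are placeholders, not the ones
used in the real spiders.

\begin{minted}[linenos,frame=lines,framesep=2mm]{python}
import scrapy


class StockCatalogSketch(scrapy.Spider):
    # Hypothetical sketch of the category -> photo list -> photo page flow.
    name = "stock_catalog_sketch"
    start_urls = ["https://www.example-stock-site.com/categories"]

    def parse(self, response):
        # Level 1: the list of categories.
        for href in response.css("ul.categories a::attr(href)").getall():
            yield response.follow(href, callback=self.parse_category)

    def parse_category(self, response):
        # Level 2: the photo list of a single category.
        for href in response.css("div.photo a::attr(href)").getall():
            yield response.follow(href, callback=self.parse_photo)

    def parse_photo(self, response):
        # Level 3: the individual photo page with the image link and metadata.
        yield {
            "image_url": response.css("img.main::attr(src)").get(),
            "title": response.css("h1::text").get(),
        }
\end{minted}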
\section{Indexing and \textit{Solr} configuration}
Solr configuration was probably the trickiest part of this project. I am not an
expert on Solr's XML configuration quirks, and I certainly have not become one
by implementing this project. However, I managed to assemble a configuration
that has both a tailored collection schema defined in XML and a custom Solr
controller to handle result clustering.

Configuration files for Solr can be found under the directory
\texttt{solr\_config}. This directory is symlinked by the
\texttt{solr\_install.sh} installation script so that it appears as a folder
named \texttt{server/solr/photo} inside the \texttt{solr} folder containing the
Solr installation. Therefore, the entire directory corresponds to the
configuration and data storage for the collection \texttt{photo}, the only
collection present in this project.

Please note that the \texttt{solr\_config/data} folder is ignored by Git and
thus not present in a freshly cloned repository: this is done to preserve only
the configuration files, and not the somewhat temporary collection data. The
collection data is uploaded every time \texttt{solr\_install.sh} is run, from
CSV files located in the \texttt{scraped} folder and produced by Scrapy.
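
Solr can ingest CSV directly through its update handler; as an illustration,
the following Python sketch shows one way such an upload could be performed
(the actual \texttt{solr\_install.sh} script may well use a different
mechanism, such as \texttt{curl} or Solr's \texttt{bin/post} tool), assuming
Solr is listening on its default port and the CSV headers match the collection
schema.

\begin{minted}[linenos,frame=lines,framesep=2mm]{python}
from pathlib import Path

import requests

# Default local Solr URL and the collection used in this project.
SOLR_UPDATE_URL = "http://localhost:8983/solr/photo/update"


def upload_csv(csv_path: Path) -> None:
    """Send one scraped CSV file to Solr's CSV update handler."""
    response = requests.post(
        SOLR_UPDATE_URL,
        params={"commit": "true"},
        data=csv_path.read_bytes(),
        headers={"Content-Type": "application/csv"},
    )
    response.raise_for_status()


if __name__ == "__main__":
    for csv_file in Path("scraped").glob("*.csv"):
        upload_csv(csv_file)
\end{minted}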
\section{User interface}
\section{User evaluation}