whip
commit 45691fa806, parent ba603bcfd2
2 changed files with 63 additions and 4 deletions (binary file not shown)
@@ -81,13 +81,26 @@ image sharing service aimed at photography amateurs and professionals,
\url{123rf.com}, a stock image website, and \url{shutterstock.com}, another
stock image website.

The stock photo websites were scraped with standard scraping technology using
plain \texttt{scrapy}, while \textit{Flickr} was scraped using browser emulation
technology via \texttt{scrapy-splash} in order to execute JavaScript code and
scrape infinite-scroll paginated data.

I would like to point out that, in order to save space, I scraped only image
links, and not the images themselves. Should any content that I scraped be
deleted from the services listed above, some results might not be correct, as
the corresponding images could have been removed.

As a final note, since some websites are not so kind to bots in their
\texttt{robots.txt} file (\textit{Flickr} in particular blocks all
bots except Google), ``robots.txt compliance'' has been turned off for all
scrapers and the user agent has been changed to mimic a normal browser.
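
For reference, the sketch below shows how these two tweaks are typically
expressed in a Scrapy \texttt{settings.py}; the user agent string is only an
example, not necessarily the exact one used in this project.

\begin{minted}[linenos,frame=lines,framesep=2mm]{python}
# settings.py -- illustrative excerpt, not necessarily identical to the
# project's actual settings.

# Do not fetch or honour robots.txt before crawling.
ROBOTSTXT_OBEY = False

# Present the crawler as a regular desktop browser (example string).
USER_AGENT = (
    "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/96.0 Safari/537.36"
)
\end{minted}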

All scraper implementations and related files are located in the directory
\texttt{photo\_scraper/spiders}.

\subsection{\textit{Flickr}}
\subsubsection{Simulated browser technology \textit{Splash}}
As mentioned before, the implementation of the \textit{Flickr} scraper uses
\textit{Splash}, a browser emulator that supports JavaScript execution and
simulated user interaction. This component is essential to allow for the website
@@ -97,7 +110,7 @@ down.

Here is the Lua script used by \textit{Splash} to emulate infinite scrolling. These exact
contents can be found in the file
\texttt{infinite\_scroll.lua}.

\begin{minted}[linenos,frame=lines,framesep=2mm]{lua}
function main(splash)
@@ -127,8 +140,54 @@ execution.
After this operation is done, the resulting HTML markup is returned and normal
crawling techniques can work on this intermediate result.
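
To make the hand-off between the Lua script and \texttt{scrapy} concrete, here
is a minimal sketch of how such a script can be attached to a request with
\texttt{scrapy-splash} (assuming \texttt{scrapy-splash} is enabled in
\texttt{settings.py}); the helper name and URL are illustrative assumptions,
not code taken from \texttt{flickr.py}.

\begin{minted}[linenos,frame=lines,framesep=2mm]{python}
# How a spider can hand the Lua script to Splash and get back the fully
# scrolled HTML as an ordinary response. URL and helper name are placeholders.
from scrapy_splash import SplashRequest

with open("infinite_scroll.lua") as f:
    INFINITE_SCROLL_LUA = f.read()

def build_scroll_request(url, callback):
    """Build a request that runs infinite_scroll.lua on `url` via Splash."""
    return SplashRequest(
        url,
        callback=callback,                      # receives the scrolled markup
        endpoint="execute",                     # run the Lua script
        args={"lua_source": INFINITE_SCROLL_LUA},
    )
\end{minted}

The callback then works on the returned markup with the usual \texttt{scrapy}
selectors, exactly as it would on a statically served page.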

\subsubsection{Scraper implementation}
The Python implementation of the \textit{Flickr} scraper can be found under
\texttt{flickr.py}.

Sadly, \textit{Flickr}, apart from a recently posted gallery of images, offers no
curated list or categorization of image content that would allow finding
images other than by querying for them.

I therefore had to use the \textit{Flickr}
search engine to query for some common words (including the list of the 100 most
common English verbs). Then, each search result page is fed through
\textit{Splash} and the resulting markup is searched for image links. Each link
is opened to scrape the image link and its metadata.
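
The outline below is a simplified sketch of this query-then-visit flow; the
word list, URL pattern, and CSS selectors are assumptions made for
illustration and do not mirror \texttt{flickr.py} exactly.

\begin{minted}[linenos,frame=lines,framesep=2mm]{python}
# Sketch of the search-driven crawl: one Splash-rendered search page per query
# word, then one request per photo page to collect the link and metadata.
# Query words, URLs and selectors are illustrative only.
import scrapy
from scrapy_splash import SplashRequest

QUERY_WORDS = ["be", "have", "do", "say", "get"]  # e.g. the 100 most common verbs

with open("infinite_scroll.lua") as f:
    INFINITE_SCROLL_LUA = f.read()

class FlickrSketchSpider(scrapy.Spider):
    name = "flickr_sketch"

    def start_requests(self):
        # One Splash-rendered (infinite-scrolled) search page per query word.
        for word in QUERY_WORDS:
            yield SplashRequest(
                f"https://www.flickr.com/search/?text={word}",
                callback=self.parse_search,
                endpoint="execute",
                args={"lua_source": INFINITE_SCROLL_LUA},
            )

    def parse_search(self, response):
        # Visit every photo page found in the scrolled search results.
        for href in response.css("a.photo-link::attr(href)").getall():
            yield response.follow(href, callback=self.parse_photo)

    def parse_photo(self, response):
        # Scrape the image link and some metadata from the photo page.
        yield {
            "image_url": response.css("img.main-photo::attr(src)").get(),
            "title": response.css("h1::text").get(),
        }
\end{minted}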

\subsection{Implementation for \textit{123rf} and \textit{Shutterstock}}
The \textit{123rf} and \textit{Shutterstock} websites do not require the use of
\textit{Splash} to be scraped and, as stock image websites, offer several
precompiled catalogs of images that can be easily scraped. The crawler
implementations, which can be found in \texttt{stock123rf.py} and
\texttt{shutterstock.py} respectively, are pretty straightforward: they
navigate from the list of categories, to each category's photo list, and then
to the individual photo pages to scrape the image link and metadata.
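
A condensed sketch of this category-to-photo navigation is shown below; the
start URL and selectors are invented for illustration and differ from the real
\texttt{stock123rf.py} and \texttt{shutterstock.py}.

\begin{minted}[linenos,frame=lines,framesep=2mm]{python}
# Sketch of the three-level navigation used for the stock sites:
# category list -> category photo list -> individual photo page.
# Start URL and selectors are placeholders, not the real site markup.
import scrapy

class StockSketchSpider(scrapy.Spider):
    name = "stock_sketch"
    start_urls = ["https://www.example-stock-site.com/categories/"]

    def parse(self, response):
        # Level 1: follow every category.
        for href in response.css("a.category::attr(href)").getall():
            yield response.follow(href, callback=self.parse_category)

    def parse_category(self, response):
        # Level 2: follow every photo in the category listing.
        for href in response.css("a.photo::attr(href)").getall():
            yield response.follow(href, callback=self.parse_photo)
        # Follow regular pagination links, if any.
        next_page = response.css("a.next::attr(href)").get()
        if next_page:
            yield response.follow(next_page, callback=self.parse_category)

    def parse_photo(self, response):
        # Level 3: emit the image link and its metadata.
        yield {
            "image_url": response.css("img#photo::attr(src)").get(),
            "title": response.css("h1::text").get(),
            "keywords": response.css(".keywords a::text").getall(),
        }
\end{minted}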

\section{Indexing and \textit{Solr} configuration}

Solr configuration was probably the trickiest part of this project. I am not an
expert on Solr XML configuration quirks, and I certainly have not become one
by implementing this project. However, I managed to assemble a configuration that
has both a tailored collection schema defined as XML and a custom Solr
controller to handle result clustering.
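
As a rough illustration of what result clustering means here, the snippet
below queries a clustering-enabled Solr endpoint; the handler path
\texttt{/clustering} and the parameters are assumptions about a typical Solr
ClusteringComponent setup, not a description of this project's exact
configuration.

\begin{minted}[linenos,frame=lines,framesep=2mm]{python}
# Hypothetical query against a clustering-enabled Solr handler.
# Core name, handler path and query are assumptions for illustration.
import requests

params = {
    "q": "sunset beach",
    "rows": 50,
    "wt": "json",
    "clustering": "true",           # ask the clustering component to run
    "clustering.results": "true",   # cluster the documents of this result page
}
resp = requests.get(
    "http://localhost:8983/solr/photo/clustering", params=params, timeout=10
)
data = resp.json()

# Alongside the usual result list, the response carries a "clusters" section
# with labels and the ids of the documents grouped under each label.
for cluster in data.get("clusters", []):
    print(cluster.get("labels"), len(cluster.get("docs", [])))
\end{minted}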

Configuration files for Solr can be found under the directory
\texttt{solr\_config}; this directory is symlinked by the
\texttt{solr\_install.sh} installation script so that it appears as a folder named
\texttt{server/solr/photo} inside the \texttt{solr} folder containing the Solr
installation. Therefore, the entire directory corresponds to the configuration
and data storage for the collection \texttt{photo}, the only collection present
in this project.

Please note that the \texttt{solr\_config/data} folder is
ignored by Git and thus not present in a freshly cloned repository: this is done
to preserve only the configuration files, and not the somewhat temporary
collection data. The collection data is uploaded every time
\texttt{solr\_install.sh} is run, from CSV files located in the \texttt{scraped}
folder and produced by Scrapy.
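
For completeness, the following sketch shows one way such a CSV upload can be
performed against Solr's CSV update handler; the file name is an example, and
the real \texttt{solr\_install.sh} may use a different mechanism (for instance
Solr's \texttt{bin/post} tool).

\begin{minted}[linenos,frame=lines,framesep=2mm]{python}
# Hypothetical upload of one scraped CSV file into the "photo" collection
# through Solr's CSV update handler; the file name is an example.
import requests

with open("scraped/flickr.csv", "rb") as f:
    resp = requests.post(
        "http://localhost:8983/solr/photo/update",
        params={"commit": "true"},
        headers={"Content-Type": "text/csv"},
        data=f,
        timeout=60,
    )
resp.raise_for_status()
\end{minted}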

\section{User interface}

\section{User evaluation}