\section{Scraping}
The three websites chosen to be scraped were \url{flickr.com}, a user-centric
image sharing service aimed at photography amateurs and professionals,
\url{123rf.com}, a stock image website, and \url{shutterstock.com}, another
stock image website.

The stock photo websites were scraped with standard scraping technology using
plain \texttt{scrapy}, while \textit{Flickr} was scraped using the browser
emulation technology \texttt{scrapy-splash} in order to execute JavaScript code
and scrape infinite-scroll paginated data.
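To illustrate the plain \texttt{scrapy} approach used for the two stock image
websites, the following is a minimal spider sketch; the class name, start URL
and CSS selectors are placeholders for illustration only and do not reflect the
actual project code.

\begin{minted}[linenos,frame=lines,framesep=2mm]{python}
import scrapy


class StockImageSpider(scrapy.Spider):
    # Placeholder spider: name, URL and selectors are illustrative only.
    name = "stock_images"
    start_urls = ["https://www.123rf.com/"]  # placeholder entry point

    def parse(self, response):
        # Extract image metadata from the listing page.
        for img in response.css("img.thumbnail"):  # placeholder selector
            yield {
                "title": img.attrib.get("alt"),
                "url": img.attrib.get("src"),
            }
        # Follow classic link-based pagination (no JavaScript required).
        next_page = response.css("a.next::attr(href)").get()  # placeholder
        if next_page:
            yield response.follow(next_page, callback=self.parse)
\end{minted}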
\subsection{\textit{Flickr} and the simulated browser technology
\textit{Splash}}

As mentioned before, the implementation of the \textit{Flickr} scraper uses
\textit{Splash}, a browser emulator that supports JavaScript execution and
simulated user interaction. This component is essential to allow the website
to load correctly and to load as many photos as possible in the photo list
pages, which are scraped by emulating a user performing an ``infinite'' scroll
down.

Here is the Lua script used by \textit{Splash} to emulate infinite scrolling.
These exact contents can be found in the file
\texttt{photo\_scraper/spiders/infinite\_scroll.lua}.
\begin{minted}[linenos,frame=lines,framesep=2mm]{lua}
function main(splash)
  local num_scrolls = 20
  local scroll_delay = 0.8

  local scroll_to = splash:jsfunc("window.scrollTo")
  local get_body_height = splash:jsfunc(
    "function() {return document.body.scrollHeight;}"
  )
  assert(splash:go(splash.args.url))
  splash:wait(splash.args.wait)

  for _ = 1, num_scrolls do
    scroll_to(0, get_body_height())
    splash:wait(scroll_delay)
  end
  return splash:html()
end
\end{minted}
Line 13 contains the instruction that scrolls down one page height. This
instruction runs in the loop of lines 12-15, which executes the scroll
instruction \texttt{num\_scrolls} times, waiting \texttt{scroll\_delay} seconds
after each execution.

After this operation is done, the resulting HTML markup is returned and normal
crawling techniques can work on this intermediate result.
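As a sketch of how such a script can be plugged into the crawler, the following
shows how a \texttt{scrapy-splash} spider might submit it through the
\texttt{execute} endpoint of \textit{Splash}; the spider name, start page,
\texttt{wait} value and selector are placeholder assumptions, and the usual
\texttt{scrapy-splash} middleware settings (e.g. \texttt{SPLASH\_URL}) are
presumed to be configured in the project settings.

\begin{minted}[linenos,frame=lines,framesep=2mm]{python}
import scrapy
from scrapy_splash import SplashRequest


class FlickrSpider(scrapy.Spider):
    # Placeholder spider: name, URL and selector are illustrative only.
    name = "flickr_sketch"

    def start_requests(self):
        # Load the Lua script shown above from the project tree.
        with open("photo_scraper/spiders/infinite_scroll.lua") as f:
            lua_script = f.read()
        # The 'execute' endpoint runs the script; the args dictionary is
        # exposed to Lua as splash.args (hence splash.args.url and
        # splash.args.wait in the script above).
        yield SplashRequest(
            "https://www.flickr.com/explore",  # placeholder start page
            callback=self.parse,
            endpoint="execute",
            args={"lua_source": lua_script, "wait": 2.0},
        )

    def parse(self, response):
        # The response body is the markup returned by splash:html(), so
        # ordinary selector-based extraction applies from here on.
        for src in response.css("img::attr(src)").getall():
            yield {"image_url": src}
\end{minted}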
\section{Indexing and \textit{Solr} configuration}
\section{User interface}