\section{Scraping}
The three websites chosen to be scraped were \url{flickr.com}, a user-centric
image sharing service aimed at photography amateurs and professionals,
\url{123rf.com}, a stock image website, and \url{shutterstock.com}, another
stock image website.
The stock photo websites were scraped with standard scraping techniques using
plain \texttt{scrapy}, while \textit{Flickr} was scraped through browser
emulation with \texttt{scrapy-splash} in order to execute JavaScript code and
scrape infinite-scroll paginated data.
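For illustration, a minimal \texttt{scrapy} spider for one of the stock image
websites would follow the general structure sketched below; the spider name,
start URL, and CSS selectors are hypothetical placeholders, not the exact ones
used in the project.
\begin{minted}[frame=lines,framesep=2mm]{python}
import scrapy


class StockImageSpider(scrapy.Spider):
    # Hypothetical spider: name, URL and selectors are illustrative only.
    name = "stock_image"
    start_urls = ["https://www.123rf.com/stock-photo/nature.html"]

    def parse(self, response):
        # Extract image metadata from each result in the photo list.
        for photo in response.css("div.photo-result"):
            yield {
                "title": photo.css("img::attr(alt)").get(),
                "url": photo.css("img::attr(src)").get(),
            }

        # Follow standard "next page" pagination links.
        next_page = response.css("a.next-page::attr(href)").get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)
\end{minted}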
\subsection{\textit{Flickr} and the simulated browser technology
\textit{Splash}}
As mentioned before, the implementation of the \textit{Flickr} scraper uses
\textit{Splash}, a browser emulator that supports JavaScript execution and
simulated user interaction. This component is essential both to let the website
load correctly and to load as many photos as possible in the scraped photo list
pages, by emulating a user performing an ``infinite'' scroll down.
Here is the Lua script used by \textit{Splash} to emulate infinite scrolling.
These exact contents can be found in the file
\texttt{photo\_scraper/spiders/infinite\_scroll.lua}.
\begin{minted}[linenos,frame=lines,framesep=2mm]{lua}
function main(splash)
    local num_scrolls = 20
    local scroll_delay = 0.8

    local scroll_to = splash:jsfunc("window.scrollTo")
    local get_body_height = splash:jsfunc(
        "function() {return document.body.scrollHeight;}"
    )

    assert(splash:go(splash.args.url))
    splash:wait(splash.args.wait)
    for _ = 1, num_scrolls do
        scroll_to(0, get_body_height())
        splash:wait(scroll_delay)
    end
    return splash:html()
end
\end{minted}
Line 13 contains the instruction that scrolls down one page height. This
instruction runs in the loop of lines 12-15, which repeats the scroll
\texttt{num\_scrolls} times, waiting \texttt{scroll\_delay} seconds after each
execution.
After this operation is done, the resulting HTML markup is returned and normal
crawling techniques can work on this intermediate result.
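For illustration, a \texttt{scrapy-splash} spider can run this script through
the \texttt{execute} endpoint roughly as follows. The spider name, start URL,
and selectors are hypothetical placeholders; the \texttt{SplashRequest} call
with its \texttt{lua\_source} and \texttt{wait} arguments mirrors the script
above.
\begin{minted}[frame=lines,framesep=2mm]{python}
import scrapy
from scrapy_splash import SplashRequest

# Load the infinite-scroll script shown above.
with open("photo_scraper/spiders/infinite_scroll.lua") as f:
    INFINITE_SCROLL = f.read()


class FlickrSpider(scrapy.Spider):
    # Hypothetical spider: name, URL and selectors are illustrative only.
    name = "flickr"

    def start_requests(self):
        yield SplashRequest(
            "https://www.flickr.com/search/?text=nature",
            callback=self.parse,
            endpoint="execute",  # run the Lua script instead of a plain render
            args={"lua_source": INFINITE_SCROLL, "wait": 2},
        )

    def parse(self, response):
        # The response body is the HTML returned by splash:html(), so
        # normal selector-based crawling works on this intermediate result.
        for photo in response.css("div.photo-list-photo-view img"):
            yield {
                "title": photo.attrib.get("alt"),
                "url": photo.attrib.get("src"),
            }
\end{minted}
Note that the \texttt{wait} value passed in \texttt{args} is the one the Lua
script reads as \texttt{splash.args.wait} before starting to scroll.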
\section{Indexing and \textit{Solr} configuration}
\section{User interface}