diff --git a/report/report.pdf b/report/report.pdf
index 188943b..a2125c9 100644
Binary files a/report/report.pdf and b/report/report.pdf differ
diff --git a/report/report.tex b/report/report.tex
index 426cea8..a6ad496 100644
--- a/report/report.tex
+++ b/report/report.tex
@@ -76,6 +76,57 @@ no arguments.
 
 \section{Scraping}
 
+The three websites chosen to be scraped were \url{flickr.com}, a user-centric
+image sharing service aimed at amateur and professional photographers,
+\url{123rf.com}, a stock image website, and \url{shutterstock.com}, another
+stock image website.
+
+The stock photo websites were scraped with standard techniques using plain
+\texttt{scrapy}, while \textit{Flickr} was scraped through browser emulation
+with \texttt{scrapy-splash} in order to execute Javascript code and scrape
+infinite-scroll paginated data.
+
+\subsection{\textit{Flickr} and the simulated browser technology
+\textit{Splash}}
+As mentioned before, the \textit{Flickr} scraper uses \textit{Splash}, a
+browser emulator that supports Javascript execution and simulated user
+interaction. This component is essential to let the website load correctly
+and to load as many photos as possible in the photo list pages, which are
+scraped by emulating a user performing an ``infinite'' scroll down the
+page.
+
+Here is the Lua script used by \textit{Splash} to emulate infinite scrolling.
+Its exact contents can be found in the file
+\texttt{photo\_scraper/spiders/infinite\_scroll.lua}.
+
+\begin{minted}[linenos,frame=lines,framesep=2mm]{lua}
+function main(splash)
+    local num_scrolls = 20
+    local scroll_delay = 0.8
+
+    local scroll_to = splash:jsfunc("window.scrollTo")
+    local get_body_height = splash:jsfunc(
+        "function() {return document.body.scrollHeight;}"
+    )
+    assert(splash:go(splash.args.url))
+    splash:wait(splash.args.wait)
+
+    for _ = 1, num_scrolls do
+        scroll_to(0, get_body_height())
+        splash:wait(scroll_delay)
+    end
+    return splash:html()
+end
+\end{minted}
+
+Line 13 contains the instruction that scrolls down to the current bottom of
+the page. It runs inside the loop of lines 12--15, which repeats the scroll
+\texttt{num\_scrolls} times, waiting \texttt{scroll\_delay} seconds after
+every execution so that newly loaded photos have time to appear.
+
+Once this operation is done, the resulting HTML markup is returned and normal
+crawling techniques can work on this intermediate result.
+
 \section{Indexing and \textit{Solr} configuration}
 
 \section{User interface}
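For reference, the following is a minimal sketch of how a Lua script like the
one above is typically handed to \textit{Splash} from a \texttt{scrapy-splash}
spider, using the library's \texttt{SplashRequest} and its \texttt{execute}
endpoint. It is not the project's actual spider: the spider name, start URL,
CSS selector and example settings values are illustrative assumptions.

\begin{minted}[frame=lines,framesep=2mm]{python}
# Minimal sketch of a scrapy-splash spider driving the infinite-scroll Lua
# script. Spider name, start URL, selector and settings values shown here
# are illustrative assumptions, not the project's actual code.
from pathlib import Path

import scrapy
from scrapy_splash import SplashRequest

# The Lua script from the report, assumed to live next to this spider module.
INFINITE_SCROLL_LUA = (Path(__file__).parent / "infinite_scroll.lua").read_text()


class FlickrSpider(scrapy.Spider):
    name = "flickr"  # hypothetical spider name

    # scrapy-splash must be enabled in settings.py, for example:
    #   SPLASH_URL = "http://localhost:8050"
    #   DOWNLOADER_MIDDLEWARES = {
    #       "scrapy_splash.SplashCookiesMiddleware": 723,
    #       "scrapy_splash.SplashMiddleware": 725,
    #   }
    #   DUPEFILTER_CLASS = "scrapy_splash.SplashAwareDupeFilter"

    def start_requests(self):
        # The "execute" endpoint runs the Lua script; everything in `args`
        # is exposed to it as splash.args (the url is added automatically).
        yield SplashRequest(
            "https://www.flickr.com/search/?text=landscape",  # example query
            callback=self.parse,
            endpoint="execute",
            args={"lua_source": INFINITE_SCROLL_LUA, "wait": 2},
        )

    def parse(self, response):
        # The response body is the HTML returned by splash:html(), i.e. the
        # DOM after the emulated scrolls, so ordinary selectors work on it.
        for href in response.css("a.photo-link::attr(href)").getall():
            yield {"url": response.urljoin(href)}
\end{minted}

The \texttt{wait} entry of \texttt{args} becomes \texttt{splash.args.wait}
inside the script, and the string returned by \texttt{splash:html()} becomes
the body of the response handed to the \texttt{parse} callback, which is what
allows normal crawling techniques to work on the fully scrolled page.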