\section{Scraping}
The three websites chosen to be scraped were \url{flickr.com}, a user-centric
image sharing service aimed at photography amateurs and professionals,
\url{123rf.com}, a stock image website, and \url{shutterstock.com}, another
stock image website.
The stock photo websites were scraped with standard scraping techniques using
plain \texttt{scrapy}, while \textit{Flickr} was scraped through browser
emulation with \texttt{scrapy-splash} in order to execute JavaScript code and
scrape infinite-scroll paginated data.
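For illustration, a minimal \texttt{scrapy} spider for one of the stock image
websites would follow the general structure sketched below; the spider name,
start URL, and CSS selectors are hypothetical placeholders, not the exact ones
used in the project.
\begin{minted}[frame=lines,framesep=2mm]{python}
import scrapy


class StockImageSpider(scrapy.Spider):
    # Hypothetical spider: name, URL and selectors are illustrative only.
    name = "stock_image"
    start_urls = ["https://www.123rf.com/stock-photo/nature.html"]

    def parse(self, response):
        # Extract image metadata from each result in the photo list.
        for photo in response.css("div.photo-result"):
            yield {
                "title": photo.css("img::attr(alt)").get(),
                "url": photo.css("img::attr(src)").get(),
            }

        # Follow standard "next page" pagination links.
        next_page = response.css("a.next-page::attr(href)").get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)
\end{minted}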
\subsection{\textit{Flickr} and the simulated browser technology
\textit{Splash}}
As mentioned before, the implementation of the \textit{Flickr} scraper uses
\textit{Splash}, a browser emulator that supports JavaScript execution and
simulated user interaction. This component is essential both to let the website
load correctly and to load as many photos as possible in the scraped photo list
pages, by emulating a user performing an ``infinite'' scroll down.
Here is the Lua script used by \textit{Splash} to emulate infinite scrolling.
These exact contents can be found in the file
\texttt{photo\_scraper/spiders/infinite\_scroll.lua}.
\begin{minted}[linenos,frame=lines,framesep=2mm]{lua}
function main(splash)
    local num_scrolls = 20
    local scroll_delay = 0.8

    local scroll_to = splash:jsfunc("window.scrollTo")
    local get_body_height = splash:jsfunc(
        "function() {return document.body.scrollHeight;}"
    )

    assert(splash:go(splash.args.url))
    splash:wait(splash.args.wait)
    for _ = 1, num_scrolls do
        scroll_to(0, get_body_height())
        splash:wait(scroll_delay)
    end
    return splash:html()
end
\end{minted}
Line 13 contains the instruction that scrolls down one page height. This
instruction runs in the loop of lines 12-15, which repeats the scroll
\texttt{num\_scrolls} times, waiting \texttt{scroll\_delay} seconds after each
execution.
After this operation is done, the resulting HTML markup is returned and normal
crawling techniques can work on this intermediate result.
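For illustration, a \texttt{scrapy-splash} spider can run this script through
the \texttt{execute} endpoint roughly as follows. The spider name, start URL,
and selectors are hypothetical placeholders; the \texttt{SplashRequest} call
with its \texttt{lua\_source} and \texttt{wait} arguments mirrors the script
above.
\begin{minted}[frame=lines,framesep=2mm]{python}
import scrapy
from scrapy_splash import SplashRequest

# Load the infinite-scroll script shown above.
with open("photo_scraper/spiders/infinite_scroll.lua") as f:
    INFINITE_SCROLL = f.read()


class FlickrSpider(scrapy.Spider):
    # Hypothetical spider: name, URL and selectors are illustrative only.
    name = "flickr"

    def start_requests(self):
        yield SplashRequest(
            "https://www.flickr.com/search/?text=nature",
            callback=self.parse,
            endpoint="execute",  # run the Lua script instead of a plain render
            args={"lua_source": INFINITE_SCROLL, "wait": 2},
        )

    def parse(self, response):
        # The response body is the HTML returned by splash:html(), so
        # normal selector-based crawling works on this intermediate result.
        for photo in response.css("div.photo-list-photo-view img"):
            yield {
                "title": photo.attrib.get("alt"),
                "url": photo.attrib.get("src"),
            }
\end{minted}
Note that the \texttt{wait} value passed in \texttt{args} is the one the Lua
script reads as \texttt{splash.args.wait} before starting to scroll.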
\section{Indexing and \textit{Solr} configuration}
\section{User interface}