\section{Scraping}
The three websites chosen to be scraped were \url{flickr.com}, a user-centric
image sharing service aimed at photography amateurs and professionals,
\url{123rf.com}, a stock image website, and \url{shutterstock.com}, another
stock image website.

The stock photo websites were scraped with standard scraping technology using
plain \texttt{scrapy}, while \textit{Flickr} was scraped using the browser
emulation technology \texttt{scrapy-splash} in order to execute JavaScript code
and scrape infinite-scroll paginated data.
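To illustrate the plain \texttt{scrapy} approach used for the two stock image
websites, the following is a minimal spider sketch; the class name, start URL
and CSS selectors are placeholders for illustration only and do not reflect the
actual project code.

\begin{minted}[linenos,frame=lines,framesep=2mm]{python}
import scrapy


class StockImageSpider(scrapy.Spider):
    # Placeholder spider: name, URL and selectors are illustrative only.
    name = "stock_images"
    start_urls = ["https://www.123rf.com/"]  # placeholder entry point

    def parse(self, response):
        # Extract image metadata from the listing page.
        for img in response.css("img.thumbnail"):  # placeholder selector
            yield {
                "title": img.attrib.get("alt"),
                "url": img.attrib.get("src"),
            }
        # Follow classic link-based pagination (no JavaScript required).
        next_page = response.css("a.next::attr(href)").get()  # placeholder
        if next_page:
            yield response.follow(next_page, callback=self.parse)
\end{minted}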
\subsection{\textit{Flickr} and the simulated browser technology
\textit{Splash}}

As mentioned before, the implementation of the \textit{Flickr} scraper uses
\textit{Splash}, a browser emulator that supports JavaScript execution and
simulated user interaction. This component is essential to allow the website
to load correctly and to load as many photos as possible in the photo list
pages, which are scraped by emulating a user performing an ``infinite'' scroll
down.

Here is the Lua script used by \textit{Splash} to emulate infinite scrolling.
These exact contents can be found in the file
\texttt{photo\_scraper/spiders/infinite\_scroll.lua}.
\begin{minted}[linenos,frame=lines,framesep=2mm]{lua}
function main(splash)
  local num_scrolls = 20
  local scroll_delay = 0.8

  local scroll_to = splash:jsfunc("window.scrollTo")
  local get_body_height = splash:jsfunc(
    "function() {return document.body.scrollHeight;}"
  )
  assert(splash:go(splash.args.url))
  splash:wait(splash.args.wait)

  for _ = 1, num_scrolls do
    scroll_to(0, get_body_height())
    splash:wait(scroll_delay)
  end
  return splash:html()
end
\end{minted}
Line 13 contains the instruction that scrolls down one page height. This
instruction runs in the loop of lines 12-15, which executes the scroll
instruction \texttt{num\_scrolls} times, waiting \texttt{scroll\_delay} seconds
after each execution.

After this operation is done, the resulting HTML markup is returned and normal
crawling techniques can work on this intermediate result.
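As a sketch of how such a script can be plugged into the crawler, the following
shows how a \texttt{scrapy-splash} spider might submit it through the
\texttt{execute} endpoint of \textit{Splash}; the spider name, start page,
\texttt{wait} value and selector are placeholder assumptions, and the usual
\texttt{scrapy-splash} middleware settings (e.g. \texttt{SPLASH\_URL}) are
presumed to be configured in the project settings.

\begin{minted}[linenos,frame=lines,framesep=2mm]{python}
import scrapy
from scrapy_splash import SplashRequest


class FlickrSpider(scrapy.Spider):
    # Placeholder spider: name, URL and selector are illustrative only.
    name = "flickr_sketch"

    def start_requests(self):
        # Load the Lua script shown above from the project tree.
        with open("photo_scraper/spiders/infinite_scroll.lua") as f:
            lua_script = f.read()
        # The 'execute' endpoint runs the script; the args dictionary is
        # exposed to Lua as splash.args (hence splash.args.url and
        # splash.args.wait in the script above).
        yield SplashRequest(
            "https://www.flickr.com/explore",  # placeholder start page
            callback=self.parse,
            endpoint="execute",
            args={"lua_source": lua_script, "wait": 2.0},
        )

    def parse(self, response):
        # The response body is the markup returned by splash:html(), so
        # ordinary selector-based extraction applies from here on.
        for src in response.css("img::attr(src)").getall():
            yield {"image_url": src}
\end{minted}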
\section{Indexing and \textit{Solr} configuration}
\section{User interface}