Wip
parent
6c15791271
commit
ba603bcfd2
2 changed files with 51 additions and 0 deletions
Binary file not shown.
|
@ -76,6 +76,57 @@ no arguments.
|
|||
|
||||
\section{Scraping}
|
||||
|
||||
The three websites chosen for scraping were \url{flickr.com}, a user-centric
image sharing service aimed at amateur and professional photographers,
\url{123rf.com}, a stock image website, and \url{shutterstock.com}, another
stock image website.
|
||||
|
||||
The stock photo websites were scraped with plain \texttt{scrapy}, while
\textit{Flickr} was scraped through browser emulation with
\texttt{scrapy-splash}, in order to execute JavaScript code and scrape
infinite-scroll paginated data.
|
||||
|
||||
\subsection{\textit{Flickr} and the simulated browser technology
\textit{Splash}}
|
||||
As mentioned before, the \textit{Flickr} scraper uses \textit{Splash}, a
browser emulator that supports JavaScript execution and simulated user
interaction. This component is essential for the website to load correctly and
to load as many photos as possible in the photo list pages, which are scraped
by emulating a user performing an ``infinite'' scroll down.
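
As a rough illustration of how a rendering job can be handed to
\textit{Splash}, the sketch below builds the JSON body for its HTTP
\texttt{execute} endpoint. That endpoint takes the script in a
\texttt{lua\_source} field and exposes the remaining fields to the script
through \texttt{splash.args}. The helper name
\texttt{build\_execute\_payload}, the placeholder script and the example URL
are assumptions made for this sketch and are not taken from the project, which
goes through \texttt{scrapy-splash} rather than building the payload by hand.

```python
import json

# Hypothetical placeholder script; the project keeps its real script in a
# separate Lua file.
PLACEHOLDER_LUA = "function main(splash) return splash:html() end"


def build_execute_payload(url, lua_source, wait=1.0):
    """Return the JSON-serialisable body for a Splash `execute` request.

    Every key other than `lua_source` becomes available to the Lua script
    as `splash.args.<key>`.
    """
    if wait < 0:
        raise ValueError("wait must be non-negative")
    return {"lua_source": lua_source, "url": url, "wait": wait}


if __name__ == "__main__":
    # The URL is only an example value, not a page the project is known to use.
    payload = build_execute_payload(
        "https://www.flickr.com/", PLACEHOLDER_LUA, wait=2.0
    )
    print(json.dumps(payload, indent=2))
```

With \texttt{scrapy-splash}, the same fields would typically be passed as the
\texttt{args} of a request aimed at the \texttt{execute} endpoint.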
|
||||
|
||||
The following Lua script is used by \textit{Splash} to emulate infinite
scrolling. Its exact contents can be found in the file
\texttt{photo\_scraper/spiders/infinite\_scroll.lua}.
|
||||
|
||||
\begin{minted}[linenos,frame=lines,framesep=2mm]{lua}
function main(splash)
    local num_scrolls = 20
    local scroll_delay = 0.8

    local scroll_to = splash:jsfunc("window.scrollTo")
    local get_body_height = splash:jsfunc(
        "function() {return document.body.scrollHeight;}"
    )
    assert(splash:go(splash.args.url))
    splash:wait(splash.args.wait)

    for _ = 1, num_scrolls do
        scroll_to(0, get_body_height())
        splash:wait(scroll_delay)
    end
    return splash:html()
end
\end{minted}
|
||||
|
||||
Line 13 contains the instruction that scrolls down to the current bottom of the
page. This instruction runs in the loop on lines 12--15, which executes the
scroll \texttt{num\_scrolls} times, waiting \texttt{scroll\_delay} seconds
after every execution.
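
With the values in the script, the loop alone accounts for $20 \times 0.8 = 16$
seconds of waiting per page, on top of the initial wait taken from
\texttt{splash.args.wait}. The small sketch below makes this arithmetic
explicit. The function name \texttt{scroll\_wait\_seconds} is an assumption
made for illustration, and the figure is only a lower bound, since page loading
and rendering time are not included.

```python
def scroll_wait_seconds(num_scrolls, scroll_delay, initial_wait=0.0):
    """Lower bound on the time spent in splash:wait calls for one page.

    Counts the initial wait plus one delay after each scroll; network and
    rendering time are deliberately ignored.
    """
    if num_scrolls < 0 or scroll_delay < 0 or initial_wait < 0:
        raise ValueError("arguments must be non-negative")
    return initial_wait + num_scrolls * scroll_delay


if __name__ == "__main__":
    # Values taken from the Lua script: 20 scrolls, 0.8 s after each one.
    print(round(scroll_wait_seconds(20, 0.8), 2))
```

Raising \texttt{num\_scrolls} loads more photos per page but increases this
waiting time linearly.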
|
||||
|
||||
After the loop completes, the resulting HTML markup is returned and normal
crawling techniques can be applied to this intermediate result.
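
To illustrate what working on the returned markup can look like, the sketch
below collects the \texttt{src} attribute of every \texttt{img} element using
only the Python standard library. It is not the project's extraction code,
which would more likely rely on \texttt{scrapy} selectors. The names
\texttt{ImageSrcCollector} and \texttt{extract\_image\_sources} and the sample
markup are hypothetical and do not reflect the real structure of a
\textit{Flickr} page.

```python
from html.parser import HTMLParser


class ImageSrcCollector(HTMLParser):
    """Collect the `src` attribute of every <img> tag fed to the parser."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; tag names are lower-cased.
        if tag == "img":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def extract_image_sources(html):
    """Return the image URLs found in `html`, in document order."""
    collector = ImageSrcCollector()
    collector.feed(html)
    collector.close()
    return collector.sources


if __name__ == "__main__":
    # Hypothetical markup standing in for the HTML returned by Splash.
    sample = '<div><img src="a.jpg"><img alt="no source"><img src="b.jpg"></div>'
    print(extract_image_sources(sample))
```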
|
||||
|
||||
\section{Indexing and \textit{Solr} configuration}
|
||||
|
||||
\section{User interface}
|
||||
|
|