diff --git a/report/report.pdf b/report/report.pdf
index 4a4e6d9..e48a339 100644
Binary files a/report/report.pdf and b/report/report.pdf differ
diff --git a/report/report.tex b/report/report.tex
index ee42e0e..40b76fc 100644
--- a/report/report.tex
+++ b/report/report.tex
@@ -190,9 +190,119 @@ The configuration was derived from the \texttt{techproducts} Solr example by
 changing the collection schema and removing any non-needed controller.
 \subsection{Solr schema}
+As some minor edits were made using Solr's web interface, the relevant XML
+schema to analyse is the file \texttt{solr\_config/conf/managed-schema}. This
+file also stores the edits made through the UI. An extract of the relevant
+lines is shown below:
-\subsection{The cluster controller}
+\begin{minted}[linenos,frame=lines,framesep=2mm]{xml}
+
+
+
+ id
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+
+\end{minted}
+
+Apart from \texttt{id}, all fields have type \texttt{text\_general}. Fields
+whose name starts with ``\texttt{t\_}'' are included in the \texttt{text} copy
+field, which is used as the default field for document similarity when
+searching and clustering.
+
+The \texttt{id} field is of type \texttt{string}, but in practice it always holds
+a non-negative integer. Its values do not come from the data scraped from the
+website; they are computed as an auto-incremented progressive identifier when
+uploading the collection to Solr using \texttt{solr\_install.sh}. Shown below is
+the \texttt{awk}-based piped command included in the installation script
+that performs this task and uploads the collection.
+
+\begin{minted}[linenos,frame=lines,framesep=2mm]{bash}
+# at this point in the script, `pwd` is the repository root directory
+
+cd scraped
+
+# POST scraped data
+tail -q -n +2 photos.csv 123rf.csv shutterstock.csv | \
+  awk '{print NR-1 "," $0}' | \
+  awk 'BEGIN {print "id,t_author,t_title,t_description,date,img_url,tags"}
+       {print}' | \
+  ../solr/bin/post -c photo -type text/csv -out yes -d
+\end{minted}
+
+Line 6 strips the header line of each listed CSV file and concatenates them.
+Line 7 prepends ``\{id\},'' to each line, where \{id\} is the line number
+minus one. Lines 8 and 9 then add the correct CSV header, including
+the ``id'' field. Line 10 reads the processed data and posts it to Solr.
+
+\subsection{Clustering configuration}
+The clustering configuration was obtained by taking the \texttt{solrconfig.xml}
+file from the \texttt{techproducts} Solr example and adapting it to the
+``photo'' collection schema.
+
+Here is the XML configuration relevant to the clustering controller. It can be
+found at approximately line 900 of the \texttt{solrconfig.xml} file:
+
+\begin{minted}[linenos,frame=lines,framesep=2mm]{xml}
+
+
+ true
+ true
+
+ t_title
+
+ img_url
+
+ t_description
+
+ true
+
+
+
+ false
+
+
+ edismax
+ text
+ *:*
+ 100
+ *,score
+
+
+ clustering
+
+
+\end{minted}
+
+This clustering controller uses Carrot2 to perform ``shallow'' one-level
+clustering (line 19 disables sub-clusters). \texttt{t\_title} is used as
+the ``title'' field for each document, \texttt{img\_url} as the ``document
+location'' field and \texttt{t\_description} as the ``description'' field (see
+lines 9, 11, and 13 of the configuration, respectively).
+
+This controller replaces the normal \texttt{/select} controller, so a
+single request returns both search results and clustering data.
+Defaults for search are a limit of 100 results and the use of the
+\texttt{t\_*} fields to match documents (lines 25 and 23 respectively; recall
+the definition of the \texttt{text} copy field).
 \section{User interface}