report has installation instr., scraping, and solr

2020-12-08 13:37:53 +01:00 · 2020-12-08 13:37:53 +01:00 · d08027718f
commit d08027718f
parent f66d4a3acc
2 changed files with 111 additions and 1 deletions
--- a/report/report.pdf
+++ b/report/report.pdf
--- a/report/report.tex
+++ b/report/report.tex
@@ -190,9 +190,119 @@ The configuration was derived from the \texttt{techproducts} Solr example by
 changing the collection schema and removing any non-needed controller.

 \subsection{Solr schema}
+As some minor edits were made using Solr's web interface, the relevant XML
+schema to analyse is the file \texttt{solr\_config/conf/managed-schema}. This
+file also stores the edits made through the UI. An extract of the relevant
+lines is shown below:

-\subsection{The cluster controller}
+\begin{minted}[linenos,frame=lines,framesep=2mm]{xml}
+<?xml version="1.0" encoding="UTF-8"?>
+<!-- Solr managed schema - automatically generated - DO NOT EDIT -->
+<schema name="example" version="1.6">
+  <uniqueKey>id</uniqueKey>

+  <!-- Omitted field type definitions and default fields added by Solr -->
+
+  <field name="id" type="string" multiValued="false" indexed="true"
+         required="true" stored="true"/>
+  <field name="date" type="text_general" indexed="true" stored="true"/>
+  <field name="img_url" type="text_general" indexed="true" stored="true"/>
+  <field name="t_author" type="text_general" indexed="true" stored="true"/>
+  <field name="t_description" type="text_general" indexed="true"
+         stored="true"/>
+  <field name="t_title" type="text_general" indexed="true" stored="true"/>
+  <field name="tags" type="text_general" indexed="true" stored="true"/>
+
+  <field name="text" type="text_general" uninvertible="true" multiValued="true"
+         indexed="true" stored="true"/>
+
+  <!-- Omitted unused default dynamicField fields added by Solr -->
+
+  <copyField source="t_*" dest="text"/>
+</schema>
+\end{minted}
+
+All fields except \texttt{id} have type \texttt{text\_general}. Fields with a
+name starting with ``\texttt{t\_}'' are copied into the \texttt{text} copy
+field, which is used as the default field for document similarity when
+searching and clustering.
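+
+As a rough illustration of which fields the \texttt{copyField} rule selects,
+the following Python sketch matches the field names from the extract against
+the \texttt{t\_*} pattern. It approximates Solr's glob matching with Python's
+\texttt{fnmatch} module; the helper names are hypothetical and are not part of
+Solr or of this project.
+
```python
from fnmatch import fnmatchcase

# Field names taken from the schema extract above.
SCHEMA_FIELDS = ["id", "date", "img_url", "t_author", "t_description",
                 "t_title", "tags"]


def copy_field_sources(pattern, fields):
    """Return the field names matched by a copyField source glob.

    Assumption: fnmatch-style globbing is close enough to Solr's matching
    for a simple prefix pattern such as "t_*".
    """
    return [name for name in fields if fnmatchcase(name, pattern)]


def build_text_field(document, pattern="t_*"):
    """Collect the values that would be copied into the destination field."""
    return [document[name]
            for name in copy_field_sources(pattern, sorted(document))]


if __name__ == "__main__":
    print(copy_field_sources("t_*", SCHEMA_FIELDS))
```
+
+Under these assumptions only \texttt{t\_author}, \texttt{t\_description} and
+\texttt{t\_title} match, so \texttt{date}, \texttt{img\_url} and \texttt{tags}
+do not contribute to the \texttt{text} field.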
+
+The \texttt{id} field is of type \texttt{string}, but in practice it always
+holds a non-negative integer. Its values do not come from the data scraped
+from the website; they are computed as an auto-incremented progressive
+identifier when uploading the collection to Solr using
+\texttt{solr\_install.sh}. Shown below is the \texttt{awk}-based piped command
+included in the installation script that performs this task and uploads the
+collection.
+
+\begin{minted}[linenos,frame=lines,framesep=2mm]{bash}
+# at this point in the script, `pwd` is the repository root directory
+
+cd scraped
+
+# POST scraped data
+tail -q -n +2 photos.csv 123rf.csv shutterstock.csv | \
+	awk '{print NR-1 "," $0}' | \
+	awk 'BEGIN {print "id,t_author,t_title,t_description,date,img_url,tags"}
+             {print}' | \
+	../solr/bin/post -c photo -type text/csv -out yes -d
+\end{minted}
+
+Line 6 strips the heading line of each listed CSV file and concatenates them;
+line 7 adds ``\{id\},'' at the beginning of each line, where \{id\} is the
+line number minus one. Lines 8 and 9 then add the correct CSV heading,
+including the ``id'' field. Line 10 reads the processed data and posts it to
+Solr.
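+
+The same transformation can be sketched in Python, which may make the
+numbering easier to follow. This is only an illustrative equivalent of the
+shell pipeline, not code from the installation script; the function name
+\texttt{merge\_with\_ids} and the sample rows are hypothetical. Like the
+\texttt{awk} version, it treats every line as an opaque string and therefore
+assumes that no CSV field contains an embedded newline.
+
```python
# Heading emitted by the installation script (lines 8 and 9 of the listing).
HEADER = "id,t_author,t_title,t_description,date,img_url,tags"


def merge_with_ids(csv_texts):
    """Drop each file's heading line, number the rows from 0, add the heading.

    csv_texts is a list of strings, one per scraped CSV file.
    """
    rows = []
    for text in csv_texts:
        rows.extend(text.splitlines()[1:])  # same effect as `tail -n +2`
    # enumerate() starts at 0, mirroring awk's NR-1.
    numbered = [f"{index},{row}" for index, row in enumerate(rows)]
    return "\n".join([HEADER, *numbered]) + "\n"


if __name__ == "__main__":
    # Hypothetical sample data with a made-up heading line.
    first = "author,title\nann,Sunset\nbob,Lake\n"
    second = "author,title\ncat,Hill\n"
    print(merge_with_ids([first, second]), end="")
```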
+
+\subsection{Clustering configuration}
+Clustering configuration was performed by using the \texttt{solrconfig.xml} file
+from the \texttt{techproducts} Solr example and adapting it to the ``photo''
+collection schema.
+
+Here is the XML configuration relevant to the clustering controller. It can be
+found at approximately line 900 of the \texttt{solrconfig.xml} file:
+
+\begin{minted}[linenos,frame=lines,framesep=2mm]{xml}
+<requestHandler name="/clustering"
+                startup="lazy"
+                enable="true"
+                class="solr.SearchHandler">
+  <lst name="defaults">
+    <bool name="clustering">true</bool>
+    <bool name="clustering.results">true</bool>
+    <!-- Field name with the logical "title" of each document (optional) -->
+    <str name="carrot.title">t_title</str>
+    <!-- Field name with the logical "URL" of each document (optional) -->
+    <str name="carrot.url">img_url</str>
+    <!-- Field name with the logical "content" of each document (optional) -->
+    <str name="carrot.snippet">t_description</str>
+    <!-- Apply highlighter to the title/content and use this for clustering. -->
+    <bool name="carrot.produceSummary">true</bool>
+    <!-- the maximum number of labels per cluster -->
+    <!--<int name="carrot.numDescriptions">5</int>-->
+    <!-- produce sub clusters -->
+    <bool name="carrot.outputSubClusters">false</bool>
+
+    <!-- Configure the remaining request handler parameters. -->
+    <str name="defType">edismax</str>
+    <str name="df">text</str>
+    <str name="q.alt">*:*</str>
+    <str name="rows">100</str>
+    <str name="fl">*,score</str>
+  </lst>
+  <arr name="last-components">
+    <str>clustering</str>
+  </arr>
+</requestHandler>
+\end{minted}
+
+This clustering controller uses Carrot2 to perform ``shallow'' one-level
+clustering (line 19 disables sub-clusters). \texttt{t\_title} is used as
+the ``title'' field for each document, \texttt{img\_url} as the ``document
+location'' field and \texttt{t\_description} as the ``description'' field (see
+lines 9, 11, and 13 of the configuration, respectively).
+
+This controller replaces the normal \texttt{/select} controller, so a single
+request returns both search results and clustering data. The search defaults
+are a limit of 100 results and the use of the \texttt{t\_*} fields to match
+documents (lines 25 and 23 respectively; recall the definition of the
+\texttt{text} field).
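+
+To make the single-request behaviour concrete, here is a minimal Python
+sketch of how a client might build a request for this handler and summarise
+the clusters in the reply. Several details are assumptions rather than facts
+from this project: the base URL uses Solr's default port on
+\texttt{localhost}, the helper names are hypothetical, and the reply is
+assumed to carry a top-level \texttt{clusters} list whose entries have
+\texttt{labels} and \texttt{docs} keys.
+
```python
from urllib.parse import urlencode

# Assumed base URL: Solr's default port and the "photo" collection.
BASE_URL = "http://localhost:8983/solr/photo/clustering"


def clustering_url(query, rows=None):
    """Build a request URL; omitted parameters use the handler defaults."""
    params = {"q": query, "wt": "json"}
    if rows is not None:
        params["rows"] = rows  # overrides the configured default of 100
    return f"{BASE_URL}?{urlencode(params)}"


def cluster_sizes(response):
    """Map each cluster's first label to the number of documents in it.

    Assumes the parsed JSON reply has a top-level "clusters" list.
    """
    return {cluster["labels"][0]: len(cluster["docs"])
            for cluster in response.get("clusters", [])}


if __name__ == "__main__":
    print(clustering_url("red sunset"))
    # Hypothetical reply, trimmed to the keys used above.
    sample = {"clusters": [{"labels": ["Beach"], "docs": ["0", "3"]}]}
    print(cluster_sizes(sample))
```
+
+Because \texttt{df} and \texttt{rows} are set in the handler defaults, the
+client only needs to send the query string unless it wants to override them.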

 \section{User interface}