report has installation instr., scraping, and solr

2020-12-08 13:37:53 +01:00 · 2020-12-08 13:37:53 +01:00 · d08027718f
commit d08027718f
parent f66d4a3acc
2 changed files with 111 additions and 1 deletions
--- a/report/report.pdf
+++ b/report/report.pdf
--- a/report/report.tex
+++ b/report/report.tex
@@ -190,9 +190,119 @@ The configuration was derived from the \texttt{techproducts} Solr example by
 changing the collection schema and removing any unneeded controller.
 \subsection{Solr schema}
 As some minor edits were made using Solr's web interface, the relevant XML
 schema to analyse is the file \texttt{solr\_config/conf/managed-schema}. This
 file also stores the edits made through the UI. An extract of the relevant
 lines is shown below:
-\subsection{The cluster controller}
+\begin{minted}[linenos,frame=lines,framesep=2mm]{xml}
 <?xml version="1.0" encoding="UTF-8"?>
 <!-- Solr managed schema - automatically generated - DO NOT EDIT -->
 <schema name="example" version="1.6">
  <uniqueKey>id</uniqueKey>
  <!-- Omitted field type definitions and default fields added by Solr -->
  <field name="id" type="string" multiValued="false" indexed="true"
         required="true" stored="true"/>
  <field name="date" type="text_general" indexed="true" stored="true"/>
  <field name="img_url" type="text_general" indexed="true" stored="true"/>
  <field name="t_author" type="text_general" indexed="true" stored="true"/>
  <field name="t_description" type="text_general" indexed="true"
         stored="true"/>
  <field name="t_title" type="text_general" indexed="true" stored="true"/>
  <field name="tags" type="text_general" indexed="true" stored="true"/>
  <field name="text" type="text_general" uninvertible="true" multiValued="true"
         indexed="true" stored="true"/>
  <!-- Omitted unused default dynamicField fields added by Solr -->
  <copyField source="t_*" dest="text"/>
 </schema>
 \end{minted}
 All fields except \texttt{id} have type \texttt{text\_general}. Fields whose
 name starts with ``\texttt{t\_}'' are copied into the \texttt{text} copy field,
 which is used as the default field for document similarity when searching and
 clustering.
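 The effect of the \texttt{copyField} glob can be illustrated with a short
 Python sketch. It is not Solr code: the helper name \texttt{copied\_fields} is
 hypothetical, and shell-style matching through \texttt{fnmatch} is assumed to
 behave like Solr's glob for this simple pattern.
 \begin{minted}[linenos,frame=lines,framesep=2mm]{python}
 from fnmatch import fnmatchcase

 # Field names declared in the schema extract above
 SCHEMA_FIELDS = ["id", "date", "img_url", "t_author", "t_description",
                  "t_title", "tags", "text"]

 def copied_fields(fields, source_glob="t_*"):
     """Return the fields whose name matches the copyField source glob."""
     return [name for name in fields if fnmatchcase(name, source_glob)]

 # Lists the fields that feed the `text` copy field
 print(copied_fields(SCHEMA_FIELDS))
 \end{minted}
 Under these assumptions only \texttt{t\_author}, \texttt{t\_description} and
 \texttt{t\_title} are selected. Fields such as \texttt{tags} and \texttt{date}
 do not match the pattern, so they do not contribute to the \texttt{text} field.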
 The \texttt{id} field is of type \texttt{string}, but in practice it is always
 a non-negative integer. Its values do not come from the data scraped from the
 website; they are computed as an auto-incremented progressive identifier when
 uploading the collection to Solr using \texttt{solr\_install.sh}. Shown below
 is the \texttt{awk}-based piped command included in the installation script
 that performs this task and uploads the collection.
 \begin{minted}[linenos,frame=lines,framesep=2mm]{bash}
 # at this point in the script, `pwd` is the repository root directory
 cd scraped
 # POST scraped data
 tail -q -n +2 photos.csv 123rf.csv shutterstock.csv | \
 	awk '{print NR-1 "," $0}' | \
 	awk 'BEGIN {print "id,t_author,t_title,t_description,date,img_url,tags"}
             {print}' | \
 	../solr/bin/post -c photo -type text/csv -out yes -d
 \end{minted}
 Line 4 strips the heading line of the listed CSV files and concatenates them.
 Line 5 adds ``\{id\},'' at the beginning of each line, where \{id\} is the
 zero-based line number. Lines 6 and 7 then add the correct CSV heading,
 including the ``id'' field. Line 8 reads the processed data and posts it to
 Solr.
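 The same transformation can be sketched in Python, which may be easier to
 follow than the \texttt{awk} pipeline. This is only an illustration and not
 part of the installation script: the names \texttt{add\_ids} and
 \texttt{HEADER} are hypothetical, and like the \texttt{awk} version the sketch
 works line by line, so it assumes that no CSV field contains an embedded
 newline.
 \begin{minted}[linenos,frame=lines,framesep=2mm]{python}
 HEADER = "id,t_author,t_title,t_description,date,img_url,tags"

 def add_ids(csv_texts):
     """Merge CSV contents, dropping each header and prepending a running id."""
     rows = []
     for text in csv_texts:
         # Equivalent of `tail -n +2`: skip the heading line of each file
         rows.extend(text.splitlines()[1:])
     # Equivalent of printing NR-1 before each record: ids start at 0
     numbered = [f"{i},{row}" for i, row in enumerate(rows)]
     # Equivalent of the BEGIN block: emit the new heading first
     return "\n".join([HEADER] + numbered) + "\n"
 \end{minted}
 The function takes the contents of the scraped files, in the same order as in
 the script, and returns the text that would be piped to the
 \texttt{post} tool.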
 \subsection{Clustering configuration}
 Clustering configuration was performed by using the \texttt{solrconfig.xml} file
 from the \texttt{techproducts} Solr example and adapting it to the ``photo''
 collection schema.
 Here is the XML configuration relevant to the clustering controller. It can be
 found at approximately line 900 of the \texttt{solrconfig.xml} file:
 \begin{minted}[linenos,frame=lines,framesep=2mm]{xml}
 <requestHandler name="/clustering"
                startup="lazy"
                enable="true"
                class="solr.SearchHandler">
  <lst name="defaults">
    <bool name="clustering">true</bool>
    <bool name="clustering.results">true</bool>
    <!-- Field name with the logical "title" of each document (optional) -->
    <str name="carrot.title">t_title</str>
    <!-- Field name with the logical "URL" of each document (optional) -->
    <str name="carrot.url">img_url</str>
    <!-- Field name with the logical "content" of each document (optional) -->
    <str name="carrot.snippet">t_description</str>
    <!-- Apply highlighter to the title/ content and use this for clustering. -->
    <bool name="carrot.produceSummary">true</bool>
    <!-- the maximum number of labels per cluster -->
    <!--<int name="carrot.numDescriptions">5</int>-->
    <!-- produce sub clusters -->
    <bool name="carrot.outputSubClusters">false</bool>
    <!-- Configure the remaining request handler parameters. -->
    <str name="defType">edismax</str>
    <str name="df">text</str>
    <str name="q.alt">*:*</str>
    <str name="rows">100</str>
    <str name="fl">*,score</str>
  </lst>
  <arr name="last-components">
    <str>clustering</str>
  </arr>
 </requestHandler>
 \end{minted}
 This clustering controller uses Carrot2 to perform ``shallow'' one-level
 clustering (line 19 disables sub-clusters). \texttt{t\_title} is used as
 the ``title'' field for each document, \texttt{img\_url} as the ``document
 location'' field and \texttt{t\_description} as the ``description'' field (see
 lines 9, 11, and 13 of the configuration, respectively).
 This controller replaces the normal \texttt{/select} controller, and thus a
 single request generates both search results and clustering data. Defaults for
 search are a limit of 100 results and the use of the \texttt{t\_*} fields to
 match documents (lines 24 and 22; recall the definition of the \texttt{text}
 field).
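 A client could exercise this handler roughly as sketched below. The sketch
 rests on several assumptions: Solr listening at its default address
 \texttt{http://localhost:8983}, the collection being named \texttt{photo} as in
 the upload command, and a JSON response carrying a top-level \texttt{clusters}
 list whose entries have \texttt{labels} and \texttt{docs} keys. The helper names
 are hypothetical.
 \begin{minted}[linenos,frame=lines,framesep=2mm]{python}
 from urllib.parse import urlencode

 def clustering_url(query, base="http://localhost:8983/solr/photo"):
     """Build a request URL for the /clustering handler (assumed base URL)."""
     params = {"q": query, "wt": "json"}
     return f"{base}/clustering?{urlencode(params)}"

 def cluster_labels(response):
     """Map each cluster's joined labels to the ids of its documents."""
     return {
         ", ".join(cluster["labels"]): cluster["docs"]
         for cluster in response.get("clusters", [])
     }
 \end{minted}
 Only the query needs to be supplied, since \texttt{rows}, \texttt{df} and the
 clustering switches come from the handler defaults. The parsed response of
 such a request can then be passed to \texttt{cluster\_labels} to group document
 ids by cluster label.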
 \section{User interface}