report has installation instr., scraping, and solr

Claudio Maggioni 2020-12-08 13:37:53 +01:00
parent f66d4a3acc
commit d08027718f
2 changed files with 111 additions and 1 deletions

Binary file not shown.


@ -190,9 +190,119 @@ The configuration was derived from the \texttt{techproducts} Solr example by
changing the collection schema and removing any unneeded controller.
\subsection{Solr schema}
Since some minor edits were made through Solr's web interface, the relevant XML
schema to analyse is the file \texttt{solr\_config/conf/managed-schema}. This
file also stores the edits made through the UI. An extract of the relevant
lines is shown below:
\begin{minted}[linenos,frame=lines,framesep=2mm]{xml}
<?xml version="1.0" encoding="UTF-8"?>
<!-- Solr managed schema - automatically generated - DO NOT EDIT -->
<schema name="example" version="1.6">
<uniqueKey>id</uniqueKey>
<!-- Omitted field type definitions and default fields added by Solr -->
<field name="id" type="string" multiValued="false" indexed="true"
required="true" stored="true"/>
<field name="date" type="text_general" indexed="true" stored="true"/>
<field name="img_url" type="text_general" indexed="true" stored="true"/>
<field name="t_author" type="text_general" indexed="true" stored="true"/>
<field name="t_description" type="text_general" indexed="true"
stored="true"/>
<field name="t_title" type="text_general" indexed="true" stored="true"/>
<field name="tags" type="text_general" indexed="true" stored="true"/>
<field name="text" type="text_general" uninvertible="true" multiValued="true"
indexed="true" stored="true"/>
<!-- Omitted unused default dynamicField fields added by Solr -->
<copyField source="t_*" dest="text"/>
</schema>
\end{minted}
All fields except \texttt{id} have type \texttt{text\_general}. Fields whose
name starts with ``\texttt{t\_}'' are copied into the \texttt{text} copy field,
which is used as the default field for document similarity when searching and
clustering.
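As a rough illustration of what the \texttt{copyField} directive does, the
following Python sketch gathers the values of every field matching
\texttt{t\_*} into a multi-valued \texttt{text} field. The function name
\texttt{apply\_copy\_field} and the sample document are hypothetical, and this
is a simplified model of the behaviour, not Solr's implementation.

```python
from fnmatch import fnmatchcase


def apply_copy_field(doc, source_glob="t_*", dest="text"):
    """Return a copy of `doc` where the values of every field whose name
    matches `source_glob` are appended to the multi-valued field `dest`."""
    result = dict(doc)
    copied = [value for name, value in doc.items()
              if fnmatchcase(name, source_glob)]
    result[dest] = list(result.get(dest, [])) + copied
    return result


# Made-up sample document using the field names of the schema above.
photo = {
    "id": "0",
    "t_author": "Jane Doe",
    "t_title": "Sunset over a lake",
    "t_description": "Orange sky reflected in still water",
    "tags": "sunset,lake",
}

indexed = apply_copy_field(photo)
print(indexed["text"])
```

Note that \texttt{tags}, \texttt{date} and \texttt{img\_url} do not match the
\texttt{t\_*} pattern, so in this model they do not contribute to
\texttt{text}.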
The \texttt{id} field is of type \texttt{string}, but in practice it is always
a non-negative integer. Its values do not come from the data scraped from the
websites: they are computed as an auto-incremented progressive identifier when
uploading the collection to Solr using \texttt{solr\_install.sh}. Shown below is
the \texttt{awk}-based piped command included in the installation script
that performs this task and uploads the collection.
\begin{minted}[linenos,frame=lines,framesep=2mm]{bash}
# at this point in the script, `pwd` is the repository root directory
cd scraped
# POST scraped data
tail -q -n +2 photos.csv 123rf.csv shutterstock.csv | \
awk '{print NR-1 "," $0}' | \
awk 'BEGIN {print "id,t_author,t_title,t_description,date,img_url,tags"}
{print}' | \
../solr/bin/post -c photo -type text/csv -out yes -d
\end{minted}
Line 6 strips the heading line of each listed CSV file and concatenates them.
Line 7 adds ``\{id\},'' at the beginning of each line, where \{id\} is the line
number minus one, so that identifiers start at 0. Lines 8 and 9 then add the
correct CSV heading, including the ``id'' field. Line 10 reads the processed
data and posts it to Solr.
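For readers less familiar with \texttt{awk}, the same transformation can be
sketched in Python. This is only an illustrative equivalent and is not part of
the installation script: the function name \texttt{add\_ids} and the sample
rows are made up. Like the shell pipeline, it works line by line and therefore
assumes that no CSV field contains an embedded newline.

```python
def add_ids(csv_texts,
            header="id,t_author,t_title,t_description,date,img_url,tags"):
    """Drop the heading line of each CSV text, concatenate the remaining
    rows, prefix each row with a zero-based id, and prepend a new heading."""
    rows = []
    for text in csv_texts:
        rows.extend(text.splitlines()[1:])   # like `tail -q -n +2`
    numbered = [f"{i},{row}" for i, row in enumerate(rows)]   # like NR-1
    return "\n".join([header] + numbered) + "\n"


# Made-up sample data standing in for two of the scraped CSV files.
SOURCE_HEADER = "t_author,t_title,t_description,date,img_url,tags"
photos = SOURCE_HEADER + "\nann,A,desc a,2020-01-01,http://example.com/a.jpg,x\n"
stock = SOURCE_HEADER + "\nbob,B,desc b,2020-01-02,http://example.com/b.jpg,y\n"

merged = add_ids([photos, stock])
print(merged, end="")
```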
\subsection{Clustering configuration}
The clustering configuration was obtained by taking the \texttt{solrconfig.xml}
file from the \texttt{techproducts} Solr example and adapting it to the
``photo'' collection schema.
The XML configuration relevant to the clustering controller is shown below. It
can be found at approximately line 900 of the \texttt{solrconfig.xml} file:
\begin{minted}[linenos,frame=lines,framesep=2mm]{xml}
<requestHandler name="/clustering"
startup="lazy"
enable="true"
class="solr.SearchHandler">
<lst name="defaults">
<bool name="clustering">true</bool>
<bool name="clustering.results">true</bool>
<!-- Field name with the logical "title" of each document (optional) -->
<str name="carrot.title">t_title</str>
<!-- Field name with the logical "URL" of each document (optional) -->
<str name="carrot.url">img_url</str>
<!-- Field name with the logical "content" of each document (optional) -->
<str name="carrot.snippet">t_description</str>
<!-- Apply highlighter to the title/content and use this for clustering. -->
<bool name="carrot.produceSummary">true</bool>
<!-- the maximum number of labels per cluster -->
<!--<int name="carrot.numDescriptions">5</int>-->
<!-- produce sub clusters -->
<bool name="carrot.outputSubClusters">false</bool>
<!-- Configure the remaining request handler parameters. -->
<str name="defType">edismax</str>
<str name="df">text</str>
<str name="q.alt">*:*</str>
<str name="rows">100</str>
<str name="fl">*,score</str>
</lst>
<arr name="last-components">
<str>clustering</str>
</arr>
</requestHandler>
\end{minted}
This clustering controller uses Carrot2 technology to perform ``shallow''
one-level clustering (line 19 disables sub-clusters). \texttt{t\_title} is used
as the ``title'' field for each document, \texttt{img\_url} as the ``document
location'' field and \texttt{t\_description} as the ``description'' field (see
lines 9, 11, and 13 of the configuration, respectively).
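To give an idea of how a client could consume the clustering output, the
following Python sketch groups document titles by cluster label. The response
layout (a \texttt{clusters} list whose entries carry \texttt{labels},
\texttt{score} and \texttt{docs}) is assumed from the Solr clustering component
documentation; the sample response and the function name
\texttt{titles\_by\_cluster} are invented for illustration.

```python
import json

# Hand-written sample response; the values are made up.
sample_response = json.loads("""
{
  "response": {"numFound": 3, "start": 0, "docs": [
    {"id": "0", "t_title": "Sunset over a lake"},
    {"id": "1", "t_title": "Lake at dawn"},
    {"id": "2", "t_title": "City skyline"}
  ]},
  "clusters": [
    {"labels": ["Lake"], "score": 1.5, "docs": ["0", "1"]},
    {"labels": ["Other Topics"], "score": 0.0, "docs": ["2"]}
  ]
}
""")


def titles_by_cluster(response):
    """Map each cluster label to the titles of the documents it contains."""
    titles = {doc["id"]: doc.get("t_title", "")
              for doc in response["response"]["docs"]}
    return {
        ", ".join(cluster["labels"]): [titles[doc_id]
                                       for doc_id in cluster["docs"]
                                       if doc_id in titles]
        for cluster in response.get("clusters", [])
    }


for label, cluster_titles in titles_by_cluster(sample_response).items():
    print(label, "->", cluster_titles)
```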
This controller replaces the normal \texttt{/select} controller, so a single
request generates both search results and clustering data. The search defaults
are a limit of 100 results and the use of the \texttt{t\_*} fields to match
documents (lines 25 and 23; recall the definition of the \texttt{text} field).
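As a minimal sketch of how a client might address this handler, the Python
snippet below builds a request URL without contacting any server. The base URL
\texttt{http://localhost:8983/solr} (Solr's default port) and the helper name
\texttt{clustering\_url} are assumptions; the collection name \texttt{photo} is
the one used by the installation script.

```python
from urllib.parse import urlencode


def clustering_url(query, base="http://localhost:8983/solr",
                   collection="photo", rows=None):
    """Build a URL for the /clustering handler of the given collection.
    `rows` is sent only when given, otherwise the handler default applies."""
    params = {"q": query, "wt": "json"}
    if rows is not None:
        params["rows"] = rows
    return f"{base}/{collection}/clustering?{urlencode(params)}"


print(clustering_url("sunset lake"))
print(clustering_url("sunset lake", rows=10))
```

Leaving \texttt{rows} unset relies on the handler default of 100 results, while
passing it explicitly overrides that default for a single request.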
\section{User interface}