report has installation instr., scraping, and solr
parent
f66d4a3acc
commit
d08027718f
2 changed files with 111 additions and 1 deletions
Binary file not shown.
@ -190,9 +190,119 @@ The configuration was derived from the \texttt{techproducts} Solr example by
changing the collection schema and removing any unneeded controller.

\subsection{Solr schema}
As some minor edits were made using Solr's web interface, the relevant XML
schema to analyse is the file \texttt{solr\_config/conf/managed-schema}. This
file also stores the edits made through the UI. An extract of the relevant
lines is shown below:
\begin{minted}[linenos,frame=lines,framesep=2mm]{xml}
<?xml version="1.0" encoding="UTF-8"?>
<!-- Solr managed schema - automatically generated - DO NOT EDIT -->
<schema name="example" version="1.6">
  <uniqueKey>id</uniqueKey>

  <!-- Omitted field type definitions and default fields added by Solr -->

  <field name="id" type="string" multiValued="false" indexed="true"
         required="true" stored="true"/>
  <field name="date" type="text_general" indexed="true" stored="true"/>
  <field name="img_url" type="text_general" indexed="true" stored="true"/>
  <field name="t_author" type="text_general" indexed="true" stored="true"/>
  <field name="t_description" type="text_general" indexed="true"
         stored="true"/>
  <field name="t_title" type="text_general" indexed="true" stored="true"/>
  <field name="tags" type="text_general" indexed="true" stored="true"/>

  <field name="text" type="text_general" uninvertible="true" multiValued="true"
         indexed="true" stored="true"/>

  <!-- Omitted unused default dynamicField fields added by Solr -->

  <copyField source="t_*" dest="text"/>
</schema>
\end{minted}

All fields except \texttt{id} have type \texttt{text\_general}. Fields whose
name starts with ``\texttt{t\_}'' are copied into the \texttt{text} copy field,
which is used as the default field for document similarity when searching and
clustering.
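
As a rough illustration of the \texttt{t\_*} pattern, the following shell
sketch lists which of the fields in the extract above would end up in
\texttt{text}. It only mimics the match with a shell glob, under the assumption
that the copy-field wildcard behaves like a plain prefix match here; it does
not query Solr, and the function name \texttt{copied\_fields} is made up for
this example.

```shell
#!/usr/bin/env bash
# Field names taken from the schema extract above.
fields="id date img_url t_author t_description t_title tags"

# Print the fields matched by the copyField source pattern "t_*".
# Assumption: the wildcard is treated as a plain prefix match.
copied_fields() {
    local f
    for f in $fields; do
        case "$f" in
            t_*) echo "$f" ;;
        esac
    done
}

copied_fields
```

Under that assumption only \texttt{t\_author}, \texttt{t\_description} and
\texttt{t\_title} are selected, so \texttt{date}, \texttt{img\_url} and
\texttt{tags} do not contribute to the default search field.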

The \texttt{id} field is of type \texttt{string}, but in practice it always
holds a non-negative integer. Its values do not come from data scraped from the
website, but are computed as an auto-incremented progressive identifier when
uploading the collection to Solr using \texttt{solr\_install.sh}. Shown below is
the \texttt{awk}-based piped command included in the installation script
that performs this task and uploads the collection.

\begin{minted}[linenos,frame=lines,framesep=2mm]{bash}
# at this point in the script, `pwd` is the repository root directory

cd scraped

# POST scraped data
tail -q -n +2 photos.csv 123rf.csv shutterstock.csv | \
  awk '{print NR-1 "," $0}' | \
  awk 'BEGIN {print "id,t_author,t_title,t_description,date,img_url,tags"}
       {print}' | \
  ../solr/bin/post -c photo -type text/csv -out yes -d
\end{minted}

Line 6 strips the heading line of the listed CSV files and concatenates them;
line 7 adds ``\{id\},'' at the beginning of each line, where \{id\} is the
zero-based line number. Lines 8 and 9 then add the correct CSV heading,
including the ``id'' field. Line 10 reads the processed data and posts it to
Solr.
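
To make the effect of these steps concrete, here is a minimal sketch that runs
the same numbering pipeline on two tiny CSV files and prints the result instead
of posting it to Solr. The file names, column headings and rows are
hypothetical stand-ins for the scraped data, and the \texttt{number\_rows}
helper exists only in this example.

```shell
#!/usr/bin/env bash
# Hypothetical inputs: one heading line and one data row per file.
tmp="$(mktemp -d)"
trap 'rm -rf "$tmp"' EXIT
printf '%s\n' 'author,title,description,date,img_url,tags' \
    'ann,Sunset,Red sky,2020-01-01,http://example.com/a.jpg,sky' > "$tmp/a.csv"
printf '%s\n' 'author,title,description,date,img_url,tags' \
    'bob,Beach,Blue sea,2020-01-02,http://example.com/b.jpg,sea' > "$tmp/b.csv"

# Same steps as in the installation script, minus the final POST:
# drop each heading, prefix a zero-based id, prepend the new heading.
number_rows() {
    tail -q -n +2 "$@" | \
      awk '{print NR-1 "," $0}' | \
      awk 'BEGIN {print "id,t_author,t_title,t_description,date,img_url,tags"}
           {print}'
}

number_rows "$tmp/a.csv" "$tmp/b.csv"
```

The output starts with the new heading, followed by the two data rows prefixed
with the ids 0 and 1. Because the numbering is applied after concatenation, the
ids stay unique across all input files.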

\subsection{Clustering configuration}
Clustering configuration was performed by using the \texttt{solrconfig.xml} file
from the \texttt{techproducts} Solr example and adapting it to the ``photo''
collection schema.

Here is the XML configuration relevant to the clustering controller. It can be
found at approximately line 900 of the \texttt{solrconfig.xml} file:

\begin{minted}[linenos,frame=lines,framesep=2mm]{xml}
<requestHandler name="/clustering"
                startup="lazy"
                enable="true"
                class="solr.SearchHandler">
  <lst name="defaults">
    <bool name="clustering">true</bool>
    <bool name="clustering.results">true</bool>
    <!-- Field name with the logical "title" of a each document (optional) -->
    <str name="carrot.title">t_title</str>
    <!-- Field name with the logical "URL" of a each document (optional) -->
    <str name="carrot.url">img_url</str>
    <!-- Field name with the logical "content" of a each document (optional) -->
    <str name="carrot.snippet">t_description</str>
    <!-- Apply highlighter to the title/ content and use this for clustering. -->
    <bool name="carrot.produceSummary">true</bool>
    <!-- the maximum number of labels per cluster -->
    <!--<int name="carrot.numDescriptions">5</int>-->
    <!-- produce sub clusters -->
    <bool name="carrot.outputSubClusters">false</bool>

    <!-- Configure the remaining request handler parameters. -->
    <str name="defType">edismax</str>
    <str name="df">text</str>
    <str name="q.alt">*:*</str>
    <str name="rows">100</str>
    <str name="fl">*,score</str>
  </lst>
  <arr name="last-components">
    <str>clustering</str>
  </arr>
</requestHandler>
\end{minted}

This clustering controller uses Carrot2 to perform ``shallow'', one-level
clustering (line 19 disables sub-clusters). \texttt{t\_title} is used as
the ``title'' field for each document, \texttt{img\_url} as the ``document
location'' field and \texttt{t\_description} as the ``description'' field (see
lines 9, 11, and 13 of the configuration, respectively).

This controller replaces the normal \texttt{/select} controller, and thus a
single request generates both search results and clustering data. The search
defaults are a limit of 100 results and the use of the \texttt{t\_*} fields to
match documents (lines 25 and 23, respectively; recall that the \texttt{text}
field is filled from the \texttt{t\_*} fields by the copy field in the schema).
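
A request to this controller could look roughly like the sketch below, which
only builds and prints the request URL. The host and port are assumptions
(Solr's default port 8983 on the local machine), the collection name
``photo'' is taken from the upload command, and \texttt{clustering\_url} is a
helper invented for this example.

```shell
#!/usr/bin/env bash
# Assumptions: Solr listens on its default port 8983 on localhost and the
# collection is named "photo", as in the upload command.
SOLR_BASE="http://localhost:8983/solr/photo"

# Build the URL of a /clustering request for a query string that is
# already URL-safe (no encoding is performed here).
clustering_url() {
    local query="$1"
    printf '%s/clustering?q=%s&wt=json\n' "$SOLR_BASE" "$query"
}

url="$(clustering_url "sunset")"
echo "$url"
# With a running Solr instance, the request could then be sent with:
#   curl -s "$url"
```

Since the handler defaults already enable clustering and set \texttt{df} and
\texttt{rows}, the query string is the only parameter the client has to supply.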

\section{User interface}