ddm/hw01/report.tex

\documentclass[11pt,a4paper]{scrartcl}
\usepackage[utf8]{inputenc}
\usepackage[margin=2.25cm]{geometry}
\usepackage{hyperref}
\usepackage{listings}
\usepackage{xcolor}
\usepackage{lmodern}
\usepackage{booktabs}
\usepackage{graphicx}
\usepackage{float}
\usepackage{tikz}
\usepackage{listings}
\usepackage{pgfplots}
\usepackage{subcaption}
\setlength{\parindent}{0cm}
\setlength{\parskip}{0.3em}
\hypersetup{pdfborder={0 0 0}}
\usepackage[nomessages]{fp}

\renewcommand*{\titlepagestyle}{empty}

\lstset{
  basicstyle=\small\ttfamily,
  %frame=shadowbox,
  rulesepcolor=\color{black},
  columns=fullflexible,
  commentstyle=\color{gray},
  keywordstyle=\color{blue},
  mathescape=true,
  aboveskip=1em,
  captionpos=b,
  numbers=left,
  abovecaptionskip=1em,
  frame=shadowbox,
  belowcaptionskip=1em,
  escapeinside={\%*}{*)},
}

\lstdefinelanguage{cypher}
{
	morekeywords={
		MATCH, OPTIONAL, WHERE, NOT, AND, OR, XOR, RETURN, DISTINCT, ORDER, BY, ASC, ASCENDING, DESC, DESCENDING, UNWIND, AS, UNION, WITH, ALL, CREATE, DELETE, DETACH, REMOVE, SET, MERGE, SET, SKIP, LIMIT, IN, CASE, WHEN, THEN, ELSE, END,
		INDEX, DROP, UNIQUE, CONSTRAINT, EXPLAIN, PROFILE, START, STARTS,
	}
}


\newcommand{\mycdots}{\cdot\!\cdot\!\cdot}
\lstset{language=cypher,
	literate=*
	{...}{$\mycdots$}{1}
	{theta}{$\theta$}{1}
}


\title{Project Work Part 1 -- Data Design and Modelling}

\author{David Alarcórn \and Sepehr Beheshti \and Claudio Maggioni \and Lorenzo Martingnetti}

\begin{document}
\maketitle
\tableofcontents

\section{Introduction}

This project is about storing the data from the \textit{DBLP} publication repository\footnote{\url{https://dblp.org}} in a graph database implemented using Neo4J and the \textit{Cypher} query language. To achieve this result, we have to understand the business domain of a publication repository, translate that knowledge in an Entity-Relationship model, and finally implement a Neo4J database. This consists of downloading the data in \textit{DBLP}, design a way to import this data, define a Graph Model based on the ER Model we define, and finally to define some useful \textit{Cypher} commands to both query and update the data.

\section{The DBLP Domain}

As mentioned, DBLP is a publication repository, i.e.\ an index of metadata regarding various
scientific publication mediums, such as journals and conference articles. DBLP stores titles,
authors, references and where to find each publication, but not the publication document itself.
Given the broad diversity in the kinds of documents indexed by DBLP, we decided to focus our
attention on articles, books, thesis at both the Masters' and PhD level, and on author websites.
Each publication record has several attributes representing the metadata of each publication, such as
a ``metadata version'', an record version number internal to DBLP, a list of authors or the journal
in which it was published. Different publications have different information attached to it, though
all the publications we considered had a year of publication and a title attribute.

\section{ER Model}
\label{sec:er}

\begin{figure}[H]
\includegraphics[width=\textwidth]{ER.pdf}
\caption{The ER diagram for the DBLP database. Cardinality annotation to be interpreted as look-across, e.g.\ a \textit{Publication} is related to at most one \textit{Editor}.}
\label{fig:er}
\end{figure}

The ER model highlights the central concept of the DBLP knowledge: the publication. Each publication registered into the DBLP database is described by its title, year of publication, and metadata version (the version of the metadata stored into the DBLP database). Additionally, each publication can be one of the following:
\begin{itemize}
    \item \textbf{Article}: it may contain \textit{references} to other publications, it is published on a specific range of \textit{pages in a journal}, which in turn may be part of a \textit{volume}.
    \item \textbf{Book}: it is also identified by the \textit{ISBN} code and it has a specific \textit{number of pages}.
    \item \textbf{Master thesis}: it may contain \textit{references} to other publications.
    \item \textbf{PhD thesis}: it may contain \textit{references} to other publications, it may be published on a specific range of \textit{pages in a journal}, which in turn may be part of a \textit{volume}.
    \item \textbf{Website}: it may contain \textit{references} to other publications, it may be published on a specific range of \textit{pages in a journal} (an online journal), and it has a unique \textit{url}.
\end{itemize}
The other entities, namely \textbf{Publisher}, \textbf{Editor}, \textbf{Author}, \textbf{Journal}, \textbf{School} are secondary to the DBLP problem specification, which is storing metadata related to publications. They are simply identified by a unique \textit{id} and have a \textit{name} associated. They interact with the publication entity through the relationships defined in the ER above, with the cardinalities specified. Note that no secondary entity has a one to one relationship with \textit{Publication}: this justifies them to be separate entities and not simply attributes of \textit{Publication}.

\section{Neo4J Database Creation}

\subsection{The DBLP XML Data Format}

DBLP periodically publishes copies of its entire database as \textit{gzip}-compressed XML files. We
decide to download one of such dumps as a data source for our database.

We then decide to use an
existing parser for DBLP data, namely the \textit{ThomHurks/dblp-to-csv} project on GitHub\footnote{\url{https://github.com/ThomHurks/dblp-to-csv}}. This parser is designed to convert the DBLP xml file into several pairs of CSV files, each of them containing column headers and data for one DBLP article type. A
unique integer ID is assigned to each record, which is a feature we exploit when building our Neo4J
model. Finally, through the \texttt{--relations} command line option, the parser is able to extract some of the publication attributes as separate ``entities''. Each chosen attribute has its data stored in a separate CSV file, and an additional CSV file is generated containing an ID to ID mapping between the source publication and the extrapolated record.

Listing \ref{lst:importsh} shows how we downloaded the XML database dump and how we parsed it using
the \textit{dblp-to-csv}.

\begin{minipage}{\linewidth}
\begin{lstlisting}[caption=Commands used to generated the CSV files necessary for import in Neo4J.,language=bash,label=lst:importsh]
# download and decompress the DBLP XML dump
curl -o dblp.xml.gz https://dblp.org/xml/dblp.xml.gz
gzip -d dblp.xml.gz

# download the DTD specification of the DBLP XML format
curl -o dblp.dtd https://dblp.org/xml/dblp.dtd

# invoke the dblp-to-csv utility with the relations specified in the ER diagram
python3 ./XMLToCSV.py --annotate --neo4j dblp.xml dblp.dtd \
    output.csv --relations author:authored_by journal:published_in \
    publisher:published_by school:submitted_at editor:edited_by

# for each entity specified in the ER diagram
for t in article book mastersthesis phdthesis www; do
    # merge the column headers and data CSV files, and perform some substitutions
    # required by the Neo4J CSV parser
    tr ';' '\n' <output_%*\$*){t}_header.csv | sed 's/:.*//g' | \
        tr '\n' ';' | awk 1 | cat - output_%*\$*){t}.csv | \
        sed -E 's/\{?\\""\}?/""/g' > %*\$*){t}.csv;
done
\end{lstlisting}
\end{minipage}

\subsection{Graph Model}
\label{sec:gra}

Given the generated CSV files and the ER model we define in Section \ref{sec:er}, we are able to define a suitable graph diagram. Given the graph database techology we use is Neo4J, the notation we use in our graph diagram reflects the Neo4J notation rules for node and edge labels.

We decide to implement the total disjoint ISA relationship regarding publications by assigning to
each publication node both a \textit{:Publication} label and a label representing the specific
publication type. This strategy is analogous to the use of a discriminant field in relational databases to define the schema of leaf entities in an ISA relationship, however using labels instead of attributes.

We then define a series of nodes that may be connected to a \textit{:Publication}. These node types map one-to-one with the additional CSV files we extract with the \textit{dblp-to-csv} tool with the \texttt{--relations} flag. We then define the edges that connect publications with these auxiliary nodes. Additionally, we define some de-facto cardinalities for each edge kind, that are however informal and not enforceable with a Neo4J constraint definition.

Figure \ref{fig:graph} shows the resulting diagram.

\begin{figure}[H]
\includegraphics[width=\textwidth]{graph_diagram.pdf}
\caption{The graph diagram for the DBLP database. Neo4J node labels are highlighted in blue. Below each node label, attributes and their data types are listed. A \textit{:Publication} label is always paired with one and only one of the \textit{:Article}, \textit{:Book}, \textit{:MastersThesis}, \textit{:PhDThesis}, \textit{:WWW} labels. Below each edge label, the number of allowed edges with that label is specified (e.g.\ a \textit{:Publication} is the source of at least one \textit{:AUTHORED\_BY} edge).}
\label{fig:graph}
\end{figure}

\subsection{Import Queries}

Given the defined Graph model and the generated CSV files, we can write the necessary Cypher queries
to import the data in Neo4J.

Cypher is the language that the Neo4J DBMS supports for creating, manipulating and querying data. Cypher is to Neo4J what ANSI SQL is to a relational DBMS like PostgreSQL or MariaDB.

An important constraint we consider when implementing this project is collaboration. Since we have
to work in a team while populating the Neo4J database and designing queries we require a way to
share a single database instance. To achieve this, we use Neo4J \textit{AuraDB} SaaS database
service\footnote{\url{https://neo4j.com/cloud/platform/aura-graph-database/}} to create a cloud hosted Neo4J instance and then share its credentials. However, since users cannot access the
filesystem of an \textit{AuraDB} instance, we upload the generated CSV files to the
\href{https://maggioni.xyz}{\textit{maggioni.xyz}} website (owned by Claudio Maggioni) and them instruct
Neo4J to query the website and fetch the data. \marginpar[right text]{\color{white}\url{https://youtu.be/sNks5yZ6dfI}}

Listing \ref{lst:cypher-import} shows the resulting Cypher import script. When executing the script
through the Neo4J web interface, all statements starting with \texttt{LOAD CSV} must be prefixed with
\texttt{:auto} to enable transaction support.

Lines 1-75 of the script contain the commands necessary to create all the publication nodes.
As specified in Section \ref{sec:gra}, all publications are labeled with the \textit{:Publication}
label and another specific label matching the publication type. All \texttt{CREATE} statements are
executed in transactions with 1000 record chunks to improve performance.

Lines 77-130 are instead dedicated to importing the ``auxiliary'' nodes, such as \textit{:Author}
and \textit{:Journal} nodes. This lines use the \texttt{line['column']} notation for accessing CSV
cells instead of the \texttt{line.column} notation used in the previous lines due to special characters added by \textit{dblp-to-csv} in the header names.

Lines 139-187 create the edges between the publication and ``auxiliary'' entities by using the
CSV association tables generated by \textit{dblp-to-csv}. All statements are composed by two
\texttt{MATCH} clauses, which match respectively the publication node and the ``auxiliary'' node, and a \texttt{CREATE} clause, which creates the actual edge and labels it accordingly.

Lines 132-137 define indexes for each publication type based on the \texttt{id} attribute, which
is guaranteed to have unique values for each record by \textit{dblp-to-csv}. These indexes improve
significantly the performance of the edge creation statements, as the \texttt{MATCH} clauses are
defined just over the \texttt{id} attribute and thus Neo4J would set forth an execution plan using
the index. This improves the lookup algorithmic complexity from linear to logarithmic, as indexes
are by default implemented using BTrees.

This section shows 2 kinds of data creation commands (namely the node creations at lines 1-130 and
edge creations at lines 139-187), leaving us to implement 3 data manipulation statements.

\begin{lstlisting}[language=cypher, label=lst:cypher-import, caption=The Cypher instructions necessary to define and import the extracted DBLP database dump.]
LOAD CSV WITH HEADERS FROM
'https://maggioni.xyz/article.csv'
AS line FIELDTERMINATOR ';'
CALL {
    WITH line
    CREATE (:Publication:Article {
        id: line.article,
        references: line.ee,
        metadataVersion: line.mdate,
        pagesInJournal: line.pages,
        title: line.title,
        volume: line.volume,
        year: line.year
    })
} IN TRANSACTIONS OF 1000 ROWS;

LOAD CSV WITH HEADERS FROM
'https://maggioni.xyz/book.csv'
AS line FIELDTERMINATOR ';'
CALL {
    WITH line
    CREATE (:Publication:Book {
        id: line.book,
        title: line.title,
        isbn: line.isbn,
        metadataVersion: line.date,
        pages: line.pages,
        year: line.year
    })
} IN TRANSACTIONS OF 1000 ROWS;

LOAD CSV WITH HEADERS FROM
'https://maggioni.xyz/mastersthesis.csv'
AS line FIELDTERMINATOR ';'
CALL {
    WITH line
    CREATE (:Publication:MastersThesis {
        id: line.mastersthesis,
        title: line.title,
        references: line.ee,
        metadataVersion: line.mdate,
        year: line.year
    })
} IN TRANSACTIONS OF 1000 ROWS;

LOAD CSV WITH HEADERS FROM
'https://maggioni.xyz/phdthesis.csv'
AS line FIELDTERMINATOR ';'
CALL {
    WITH line
    CREATE (:Publication:PhDThesis {
        id: line.phdthesis,
        references: line.ee,
        metadataVersion: line.mdate,
        pagesInJournal: line.pages,
        title: line.title,
        volume: line.volume,
        year: line.year
    })
} IN TRANSACTIONS OF 1000 ROWS;

LOAD CSV WITH HEADERS FROM
'https://maggioni.xyz/www.csv'
AS line FIELDTERMINATOR ';'
CALL {
    WITH line
    CREATE (:Publication:WWW {
        id: line.www,
        references: line.ee,
        metadataVersion: line.mdate,
        title: line.title,
        year: line.year,
        url: line.url
    })
} IN TRANSACTIONS OF 1000 ROWS;

LOAD CSV WITH HEADERS FROM
'https://maggioni.xyz/output_author.csv'
AS line FIELDTERMINATOR ';'
CALL {
    WITH line
    CREATE (:Author {
        id: line[':ID'],
        name: line['author:string']
    })
} IN TRANSACTIONS OF 1000 ROWS;

LOAD CSV WITH HEADERS FROM
'https://maggioni.xyz/output_editor.csv'
AS line FIELDTERMINATOR ';'
CALL {
    WITH line
    CREATE (:Editor {
        id: line[':ID'],
        name: line['editor:string']
    })
} IN TRANSACTIONS OF 1000 ROWS;

LOAD CSV WITH HEADERS FROM
'https://maggioni.xyz/output_journal.csv'
AS line FIELDTERMINATOR ';'
CALL {
    WITH line
    CREATE (:Journal {
        id: line[':ID'],
        name: line['journal:string']
    })
} IN TRANSACTIONS OF 1000 ROWS;

LOAD CSV WITH HEADERS FROM
'https://maggioni.xyz/output_publisher.csv'
AS line FIELDTERMINATOR ';'
CALL {
    WITH line
    CREATE (:Publisher {
        id: line[':ID'],
        name: line['publisher:string']
    })
} IN TRANSACTIONS OF 1000 ROWS;

LOAD CSV WITH HEADERS FROM
'https://maggioni.xyz/output_school.csv'
AS line FIELDTERMINATOR ';'
CALL {
    WITH line
    CREATE (:School {
        id: line[':ID'],
        name: line['school:string']
    })
} IN TRANSACTIONS OF 1000 ROWS;

CREATE INDEX FOR (p:Publication) ON (p.id);
CREATE INDEX FOR (p:Author) ON (p.id);
CREATE INDEX FOR (p:Editor) ON (p.id);
CREATE INDEX FOR (p:Journal) ON (p.id);
CREATE INDEX FOR (p:Publisher) ON (p.id);
CREATE INDEX FOR (p:School) ON (p.id);

LOAD CSV WITH HEADERS FROM
'https://maggioni.xyz/output_author_authored_by.csv'
AS line FIELDTERMINATOR ';'
CALL {
    WITH line
    MATCH (p:Publication { id: line[':START_ID'] })
    MATCH (a:Author { id: line[':END_ID'] })
    CREATE (p)-[:AUTHORED_BY]->(a)
} IN TRANSACTIONS OF 1000 ROWS;

LOAD CSV WITH HEADERS FROM
'https://maggioni.xyz/output_editor_edited_by.csv'
AS line FIELDTERMINATOR ';'
CALL {
    WITH line
    MATCH (p:Publication { id: line[':START_ID'] })
    MATCH (a:Editor { id: line[':END_ID'] })
    CREATE (p)-[:EDITED_BY]->(a)
} IN TRANSACTIONS OF 1000 ROWS;

LOAD CSV WITH HEADERS FROM
'https://maggioni.xyz/output_journal_published_in.csv'
AS line FIELDTERMINATOR ';'
CALL {
    WITH line
    MATCH (p:Publication { id: line[':START_ID'] })
    MATCH (a:Journal { id: line[':END_ID'] })
    CREATE (p)-[:PUBLISHED_IN]->(a)
} IN TRANSACTIONS OF 1000 ROWS;

LOAD CSV WITH HEADERS FROM
'https://maggioni.xyz/output_publisher_published_by.csv'
AS line FIELDTERMINATOR ';'
CALL {
    WITH line
    MATCH (p:Publication { id: line[':START_ID'] })
    MATCH (a:Publisher { id: line[':END_ID'] })
    CREATE (p)-[:PUBLISHED_BY]->(a)
} IN TRANSACTIONS OF 1000 ROWS;

LOAD CSV WITH HEADERS FROM
'https://maggioni.xyz/output_school_submitted_at.csv'
AS line FIELDTERMINATOR ';'
CALL {
    WITH line
    MATCH (p:Publication { id: line[':START_ID'] })
    MATCH (a:School { id: line[':END_ID'] })
    CREATE (p)-[:SUBMITTED_AT]->(a)
} IN TRANSACTIONS OF 1000 ROWS;
\end{lstlisting}
\pagebreak
\section {Queries}
In this section we show the queries we have implemented. For performance reference, all queries are run
on a free tier Neo4J AuraDB instance over a subset of the DBLP data composed of 12.150 nodes and 6.216 edges.

\subsection{Publications that May Be Hosted on Secure Websites}
\textit{Query complexity: 3 nodes, conditions, aggregations}

This query returns the total number of publications of type \textit{Article} whose author
has a HTTPS website.

\begin{minipage}{\linewidth}
\begin{lstlisting}[language=cypher, label=lst:cypher-q01, caption=The Cypher code for the Morgan \& Claypool's Journal Relationships query.]
MATCH (p:Publication:Article)-[:AUTHORED_BY]->(a:Author)<-[:AUTHORED_BY]-(www:WWW)
WHERE www.url STARTS WITH 'https://'
RETURN COUNT(DISTINCT p) as publisherCount
\end{lstlisting}
\end{minipage}

By invoking the query with the \texttt{PROFILE} clause, Neo4J reports the following execution plan:
\begin{itemize}
    \item All possible matches for \texttt{(www:WWW) WHERE www.url STARTS WITH 'https://'} are found;
    \item The \texttt{:AUTHORED\_BY} edge expression finding all candidates for \texttt{a};
    \item \textit{a} candidates are filtered to match the \texttt{:Author} label;
    \item The \texttt{:AUTHORED\_BY} edge expression finding all candidates for \texttt{p};
    \item \textit{p} candidates are filtered to match the \texttt{:Publication} and \texttt{:Article} labels;
    \item Publishers are counted;
\end{itemize}

For the execution of this query, Neo4J reports 2840 total db hits in 89 ms.

\subsection{Author Count in Publications}
\textit{Query complexity: 3 nodes, conditions, aggregations}

This query considers the journal ``AI Mag.'' it queries the authors of its publications.
The query then returns the author count of author whose name does not contain the letter `M'. Results are ordered by author count.

\begin{minipage}{\linewidth}
\begin{lstlisting}[language=cypher, label=lst:cypher-q01, caption=The Cypher code for the Author Count in Publications query.]
MATCH (j:Journal {name: 'AI Mag.'})<-[:PUBLISHED_IN]-(p:Publication),
       (p)-[:AUTHORED_BY]->(a:Author)
WHERE NOT a.name =~ '.*[mM].*'
RETURN p.title, count(a) AS authorCount
ORDER BY authorCount DESC
\end{lstlisting}
\end{minipage}

By invoking the query with the \texttt{PROFILE} clause, Neo4J reports the following execution plan:
\begin{itemize}
    \item All possible matches for \texttt{(j:Journal {name: 'AI Mag.'})} are found;
    \item The \texttt{:PUBLISHED\_IN} edge expression finding all candidates for
    \texttt{p};
    \item \textit{p} candidates are filtered to match the \texttt{:Publication} label;
    \item The \texttt{:AUTHORED\_BY} edge expression finding all candidates for \texttt{a};
    \item \textit{a} candidates are filtered to match \texttt{(a:Author) WHERE NOT a.name =~ '.*[mM].*'}, i.e.\ to find authors whose name does not contain an uppercase or lowercase m;
    \item Authors are counted for each publication;
    \item Results are sorted by decreasing author count and then returned.
\end{itemize}

For the execution of this query, Neo4J reports 1866 total db hits in 84 ms.

\subsection{Morgan \& Claypool's Journal Relationships}
\textit{Query complexity: 5 nodes, conditions, aggregations, limits}

This query considers the publisher whose name starts with ``Morgan \& Claypool'' (only
one publisher matches this criteria in our database instance). We then query all the authors of publications published with this publisher. We then want to find all the  publications \textbf{not} published with ``Morgan \& Claypool'', and of these publications
we want to find journals where they were published in. The top 10 results by author count (i.e.\ the number of distinct authors involved with both the publisher and the journal) are returned, and each result yields the publisher name, the journal name, and the author count.

\begin{minipage}{\linewidth}
\begin{lstlisting}[language=cypher, label=lst:cypher-q01, caption=The Cypher code for the Morgan \& Claypool's Journal Relationships query.]
MATCH (p:Publisher)<-[:PUBLISHED_BY]-(pub1:Publication)-[:AUTHORED_BY]->(a:Author),
      (a)<-[:AUTHORED_BY]-(pub2:Publication)-[:PUBLISHED_IN]->(j:Journal)
WHERE p.name STARTS WITH 'Morgan & Claypool'
RETURN p.name, j.name, count(DISTINCT a) as authorCount
ORDER BY authorCount DESC
LIMIT 10
\end{lstlisting}
\end{minipage}

By invoking the query with the \texttt{PROFILE} clause, Neo4J reports the following execution plan:

\begin{itemize}
    \item All possible matches for \texttt{(p:Publisher) WHERE STARTS WITH 'Morgan \& Claypool'} are found;
    \item The \texttt{:PUBLISHED\_BY} edge expression finding all candidates for \texttt{pub1};
    \item \textit{pub1} candidates are filtered to match the \texttt{:Publication} label;
    \item The \texttt{:AUTHORED\_BY} edge expression finding all candidates for \texttt{a};
    \item \textit{a} candidates are filtered to match the \texttt{:Author} label;
    \item The \texttt{:PUBLISHED\_BY} edge expression finding all candidates for \texttt{pub2};
    \item \textit{pub2} candidates are filtered to match the \texttt{:Publication} label and \texttt{pub1 != pub2};
    \item The \texttt{:PUBLISHED\_IN} edge expression finding all candidates for \texttt{j};
    \item \textit{j} candidates are filtered to match the \texttt{:Journal} label;
    \item The authors are counted;
    \item Results other than the top 10 by count are discarded, and the remaining matches are sorted by descending author count.
\end{itemize}

For the execution of this query, Neo4J reports 68 total db hits in 110 ms.


\subsection{Author Chains}
\textit{Query complexity: 5 nodes, conditions, aggregations, limits}

This query finds the longest chains between authors whose name contain ``Jones'' and all other authors. The chain edges must all be of type \textit{AUTHORED\_BY} except one, whose type must be \textit{PUBLISHED\_IN}. Results are sorted in descending order by the length of the respective chains (more specifically by the total number of \textit{AUTHORED\_BY} hops), and the top 10 results are picked. The query returns for each result the names of the two authors, the number of \textit{AUTHORED\_BY} hops starting from the ``Jones'' authors (\texttt{startHopCount}) and the number of \textit{AUTHORED\_BY} hops ending on the other author (\textit{endHopCount}).

\begin{minipage}{\linewidth}
\begin{lstlisting}[language=cypher, label=lst:cypher-q01, caption=The Cypher code for the Author Chains query.]
MATCH (a:Author)-[r:AUTHORED_BY*..20]-(p1:Publication)-[:PUBLISHED_IN]->(j:Journal),
      (j)<-[:PUBLISHED_IN]-(p2:Publication)-[s:AUTHORED_BY*..20]-(t:Author)
WHERE a.id <> t.id AND t.name CONTAINS 'Jones'
RETURN a.name, t.name, COUNT(r) as endHopCount, COUNT(s) as startHopCount
ORDER BY startHopCount + endHopCount DESC, a.name ASC
LIMIT 10
\end{lstlisting}
\end{minipage}

By invoking the query with the \texttt{PROFILE} clause, Neo4J reports the following execution plan:

\begin{itemize}
    \item All possible matches for \texttt{(p:Author) WHERE t.name CONTAINS 'Jones'} are found;
    \item The variable length \texttt{:AUTHORED\_BY} edge expression is expanded to find all candidates for \texttt{p2};
    \item The \texttt{p2} candidates are filtered for the \texttt{:Publication} label;
    \item \texttt{:PUBLISHED\_BY} edges are traversed to find all candidates for \texttt{j};
    \item The \texttt{j} candidates are filtered for the \texttt{:Journal} label;
    \item \texttt{:PUBLISHED\_BY} edges are traversed to find all candidates for \texttt{p1};
    \item The \texttt{p1} candidates are filtered for the \texttt{:Publication} label and for \texttt{p1 <> p2};
    \item The variable length \texttt{:AUTHORED\_BY} edge expression is expanded to find all candidates for \texttt{a};
    \item \texttt{a} is checked to match  \texttt{(a:Author) WHERE a.id != t.id};
    \item \texttt{COUNT(r)} and \texttt{COUNT(s)} are computed;
    \item \texttt{COUNT(r) + COUNT(s)} is computed, and then the results are sorted and truncated according to the provided clauses.
\end{itemize}

For the execution of this query, Neo4J reports 34367 total db hits in 112 ms.

\subsection{Shortest Author-to-Author Chain}
\textit{Query complexity: functions}

This query finds the shortest path between each author whose name contain ``Jones'' and each other author.
Unlike the Author Chains query, the query does not constrain the edges in the path to be of any particular type. The query returns the top 10 results by path length, yielding for each result the
names of both authors, the length of the path, and a list of nodes in the path (each element being a pair of the node's label and the node's name/title).

\begin{minipage}{\linewidth}
\begin{lstlisting}[language=cypher, label=lst:cypher-q01, caption=The Cypher code for the Shortest Author-to-Author Chain query.]
MATCH (t:Author) WHERE t.name CONTAINS 'Jones'
MATCH (a:Author) WHERE a.id <> t.id
MATCH path = shortestPath((a)-[*..20]-(t))
RETURN t.name, a.name, length(path), [
    x IN nodes(path) |
    [reduce(s = '', x IN labels(x) | s + ':' + x), COALESCE(x.name, x.title)]
] AS nodes
ORDER BY length(path) DESC
LIMIT 10
\end{lstlisting}
\end{minipage}

By invoking the query with the \texttt{PROFILE} clause, Neo4J reports the following execution plan:

\begin{itemize}
    \item All \texttt{t}s matching \texttt{(t:Author) WHERE t.name CONTAINS 'Jones'} are found;
    \item In parallel with the previous step, all \textit{a}s matching \textit{(a:Author)} are found;
    \item The Cartesian product of \texttt{a}s and \texttt{t}s is computed;
    \item \texttt{a}s and \texttt{t}s are filtered by \texttt{a.id != t.id};
    \item The shortest path between each \texttt{a} and \texttt{t} is computed;
    \item The length of the path is computed;
    \item All but the top 10 results by path length are discarded;
    \item The expression yielding the \texttt{nodes} property is computed;
    \item The resulted are ordered by descending path length.
\end{itemize}

For the execution of this query, Neo4J reports 100681 total db hits in 1247 ms.

\subsection{Number of Authors Who Recently Published With a Publisher}
\textit{Query complexity: 3 nodes, conditions, aggregations}

This query finds the number of authors who recently published with a publisher, and we define recently to mean a publication year greater than 2010.

\begin{minipage}{\linewidth}
\begin{lstlisting}[language=cypher, label=lst:cypher-q01, caption=The Cypher code for the Number of Authors Who Recently Published With a Publisher query.]
MATCH (p:Publisher)<-[pb:PUBLISHED_BY]-(p2:Publication:Book),
      (a:Author)<-[ab:AUTHORED_BY]-(p2)
WHERE p.name = 'Springer' AND toInteger(p2.year) >= 2010
RETURN COUNT(a)
\end{lstlisting}
\end{minipage}

By invoking the query with the \texttt{PROFILE} clause, Neo4J reports the following execution plan:

\begin{itemize}
    \item All \texttt{p2}s matching \texttt{(p2:Author) WHERE toInteger(p2.year) >= 2010} are found;
    \item \texttt{:PUBLISHED\_BY} edges are traversed to find all candidates for \texttt{p};
    \item The \texttt{p} candidates are filtered for the \texttt{:Publication} label and for \texttt{p.name = 'Springer'};
    \item The \texttt{:AUTHORED\_BY} edge expression finding all candidates for \texttt{a};
    \item \textit{a} candidates are filtered to match the \texttt{:Author} label;
    \item Authors are counted.
\end{itemize}

For the execution of this query, Neo4J reports 86 total db hits in 38 ms.

\subsection{Authors of a Journal Whose Publication Metadata Are Not Up to Date}
\textit{Query complexity: 3 nodes, conditions}

\begin{minipage}{\linewidth}
\begin{lstlisting}
MATCH (a:Author)<-[ab:AUTHORED_BY]-(p:Publication),
      (j:Journal)<-[pi:PUBLISHED_IN]-(p:Publication)
WHERE toInteger(p.year) < date(p.metadataVersion).year AND j.name = 'Theor. Comput. Sci.'
RETURN a
\end{lstlisting}
\end{minipage}

By invoking the query with the \texttt{PROFILE} clause, Neo4J reports the following execution plan:

\begin{itemize}
    \item All possible matches for \texttt{(j:Journal) WHERE j.name = 'Theor. Comput. Sci.'} are found;
    \item The \texttt{:PUBLISHED\_IN} edge expression is expanded to find all candidates for \texttt{p};
    \item The \texttt{p} candidates are filtered for the \texttt{:Publication} label, and also filtered selecting only those with year lower than that of the metadata version;
    \item The \texttt{:AUTHORED\_BY} edge expression is expanded to find all candidates for \texttt{a};
    \item The \texttt{a} candidates are filtered for the \texttt{:Author} label;
    \item The result is ordered by \texttt{a.name}.
\end{itemize}

For the execution of this query, Neo4J reports 1813 total db hits in 98 ms.

\subsection{Find a book about Image processing published by Springer}
\textit{Query complexity: 3 nodes, 3 conditions}

This query finds a book about image processing published by Springer after 2000 finds also the author.

\begin{minipage}{\linewidth}
\begin{lstlisting}[language=cypher, label=lst:cypher-q01, caption=The Cypher code for the finding a book about image processing and the author.]
MATCH (p:Publisher)-[r]-(b:Book)-[r2]-(a:Author)
WHERE b.year > '2000' and p.name = 'Springer' and b.title contains 'image processing'
RETURN p,r,b,r2,a, b.title, a.name
\end{lstlisting}
\end{minipage}

By invoking the query with the \texttt{PROFILE} clause, Neo4J reports the following execution plan:

\begin{itemize}
    \item \textbf{Cypher version: , planner: COST, runtime: PIPELINED. 93 total db hits in 93 ms};
    \item Define the initial candidates by considering all publishers;
    \item Filter the publishers by considering "Springer" as only interested publisher;
    \item Expanding the rows by adding the details of the books linked to the publisher;
    \item Filter the book which are published in 2000 or later and the title are related to image processing;
    \item Adding the authors data to the books;
    \item checking that the relationships of publisher and authors are different (relationship with books);
    \item Prepare and project nodes and links to be visualized;
    \item Produce results which will be the list of book/books with authors name;

\end{itemize}


\subsection{Find a list and details of publication/thesis related to algorithms }
\textit{Query complexity: 3 nodes, conditions}

This query finds list of publications/thesis related to algorithms that are submited at school before 2009 .

\begin{minipage}{\linewidth}
\begin{lstlisting}[language=cypher, label=lst:cypher-q01, caption=The Cypher code for the
finding of thesis related to algorithms published before 2009.]
MATCH (s:School)-[r:SUBMITTED_AT]-(t)-[r2]-(a:Author)
WHERE t.title contains 'algorithm' and t.year <='2009'
RETURN t.title,a.name,t.year,s,r,t,r2,a
\end{lstlisting}
\end{minipage}


By invoking the query with the \texttt{PROFILE} clause, Neo4J reports the following execution plan:

\begin{itemize}
    \item \textbf{Cypher version: , planner: COST, runtime: PIPELINED. 304 total db hits in 36 ms};
    \item Find the initial candidates by considering all the thesis which are submitted at Schools;
    \item Filter the thesis that are submitted in 2009 or after and are related to "algorithms";
    \item Adding the details of authors to the thesis;
    \item Checking that the relationships of authors be different from relationships of schools (with thesis);
    \item Project all nodes and attributes which we are interested to visualize;
    \item Produce final results which is the list of articles, authors and the publication year;

\end{itemize}


\subsection{Find publications about network with higher number of authors }
\textit{Query complexity: 3 nodes, conditions, aggregation}

This query Find number of publications related to network which are published after 2005 and each one has more than 6 authors .

\begin{minipage}{\linewidth}
\begin{lstlisting}[language=cypher, label=lst:cypher-q01, caption=The Cypher code for finding publications about network with at least 6 authors.]
MATCH (a:Author)-[r0]-(p:Publication)-[r:PUBLISHED_IN]->(j:Journal)
WHERE toLower(p.title) contains 'network' and p.year >= '2005'
WITH p, count(a) as cnt
WHERE cnt > 6
RETURN count(p.title) as Nr_publications
\end{lstlisting}
\end{minipage}

By invoking the query with the \texttt{PROFILE} clause, Neo4J reports the following execution plan:

\begin{itemize}
    \item \textbf{Cypher version: , planner: COST, runtime: PIPELINED. 8234 total db hits in 38 ms};
    \item Considering the connection of publication to Journals through Published\_IN define all possible matches;
    \item Apply the filter on year of publication and considering that the articles title must include "network";
    \item Expanding the candidates by adding the nodes of Journals and authors;
    \item Control that the relationships publications with authors and journal be different;
    \item Count the number of authors for each publication;
    \item Filter the publications based on their authors which must bu higher than 6;
    \item Aggregation: counting total number of articles which have at least 6 authors;

\end{itemize}


\subsection{Find number of authors who have published a book and also an article on journals }
\textit{Query complexity: 5 nodes, conditions, aggregation}

This query Find number of authors who has written a book about user modeling published on ."Morgan \& Claypool Publishers" and published also an article on a Journal in the field of artificial intelligence.

\begin{minipage}{\linewidth}
\begin{lstlisting}[language=cypher, label=lst:cypher-q01, caption=The Cypher code for finding nr of authors who has published a book and also an article.]
MATCH (a:Author)-[r0]-(b:Book),(b)-[r]-(pb:Publisher)
,(a)-[r2]-(ar:Article)-[r3]-(j:Journal)
WHERE toLower(ar.title) contains 'user modeling'
and pb.name = 'Morgan & Claypool Publishers' and toLower(j.name) contains 'artif'
RETURN count(distinct a.name) as Nr_Author
\end{lstlisting}
\end{minipage}

By invoking the query with the \texttt{PROFILE} clause, Neo4J reports the following execution plan:

\begin{itemize}
    \item \textbf{Cypher version: , planner: COST, runtime: PIPELINED. 77 total db hits in 122 ms};
    \item All possible matches for pb:Publisher where pb.name = "Morgan \& Claypool Publishers" are found;
    \item Expand the path by linking books to publications;
    \item Filter books linked to the specific publisher;
    \item The candidates are filtered that the relationships r and r0 be different;
    \item The title of article is checked which must contain "user modeling";
    \item Expand the path by adding Journal node to Article;
    \item Apply filter of relationship and Article to the Journals;
    \item Filter the journals by considering toLower(j.name) contains 'artif';
    \item Count the unique authors who matches all the criteria;
\end{itemize}

\subsection{Update the MetadataVersion of a Publication}
\textit{Query complexity: data update}

This query updates the version of the metadata of a specific publication, for which the title is given.

\begin{minipage}{\linewidth}
\begin{lstlisting}[language=cypher, label=lst:cypher-q01, caption=The Cypher code to update the version of a publication metadata.]
MATCH (p:Publication)
WHERE p.title = 'On the Combinatorics of Finite Words.'
SET p.metadataVersion = '2022-11-07'
\end{lstlisting}
\end{minipage}

By invoking the query with the \texttt{PROFILE} clause, Neo4J reports the following execution plan:

\begin{itemize}
    \item All possible matches for \texttt{(p:Publication) WHERE p.title = 'On the Combinatorics of Finite Words.'} are found;
    \item The \texttt{metadataVersion} property is set to '2022-11-07';
    \item an empty result is displayed, since values are only set;
\end{itemize}

For the execution of this query, Neo4J reports 7199 total db hits in 19 ms.

\subsection{Publishing an Article on a Journal and removing it}
\textit{Query complexity: data creation/removal}

The first query publishes an article on a journal. The second one removes an article from a journal. The inputs of both queries are the name of the journal and the title of the article. Note that only the relationship between the nodes is modified in both cases: this means that the article must be already present in the database upon publication, while it is not removed from the database upon removal of it from the journal.

\begin{minipage}{\linewidth}
\begin{lstlisting}[language=cypher, label=lst:cypher-q01, caption=The Cypher code to publish an article on a journal.]
MATCH (j:Journal)
WHERE j.name = 'Theor. Comput. Sci.'
MATCH (p:Publication:Article)
WHERE p.title = 'On the least number of palindromes in two-dimensional words.'
MERGE (j)<-[pi:PUBLISHED_IN]-(p)
\end{lstlisting}
\end{minipage}

By invoking the query with the \texttt{PROFILE} clause, Neo4J reports the following execution plan:

\begin{itemize}
    \item All possible matches for \texttt{(p:Publication) WHERE p.title = 'On the least number of palindromes in two-dimensional words.'} are found;
    \item All possible matches for \texttt{(j:Journal) WHERE j.name = 'Theor. Comput. Sci.'} are found;
    \item It is performed a cartesian product \textit{(p,j)} with the results of the two queries;
    \item In parallel, the pattern \textit{(j)<-[pi:PUBLISHED\_IN]-(p)} is created and then applied to the cartesian product to create the new relationship, if this one exists (this is ensured by the LOCK(j,p) statement, where j and p are the previouslyy selected nodes);
    \item An empty result is created and displayed.
\end{itemize}

For the execution of this query, Neo4J reports 7487 total db hits in 47 ms.

\begin{minipage}{\linewidth}
\begin{lstlisting}[language=cypher, label=lst:cypher-q01, caption=The Cypher code to remove an article from a journal.]
MATCH (j:Journal)<-[pi:PUBLISHED_IN]-(p:Publication:Article)
WHERE j.name = 'Theor. Comput. Sci.'
    AND p.title = 'On the least number of palindromes in two-dimensional words.'
DELETE pi
\end{lstlisting}
\end{minipage}

By invoking the query with the \texttt{PROFILE} clause, Neo4J reports the following execution plan:

\begin{itemize}
    \item All possible matches for \texttt{(j:Journal) WHERE j.name = 'Theor. Comput. Sci.'} are found;
    \item The \texttt{:AUTHORED\_BY} edge expression is expanded to find all candidates for \texttt{p};
    \item The \texttt{p} candidates are filtered for the \texttt{:Publication} and \texttt{:Article} labels, and also filtered for \texttt{p.title = 'On the least number of palindromes in two-dimensional words.'};
    \item The found set of relationship \texttt{pi} is deleted (only one in our case);
    \item An empty result is created and displayed.
\end{itemize}

For the execution of this query, Neo4J reports 1675 total db hits in 3 ms.

\end{document}