\usepackage{subcaption}
\usepackage{amssymb}
\usepackage{amsmath}
\usepackage{changepage}
\usepackage{hyperref}

\title{Knowledge Management and Analysis \\ Project 01: Code Search}
\begin{document}

\maketitle

\begin{adjustwidth}{-4cm}{-4cm}
\centering
\begin{tabular}{cc}
\toprule
Repository URL & \url{https://github.com/kamclassroom2022/project-01-multi-search-maggicl} \\
Commit ID & \texttt{b8e0a2c3c41249e45b233b55607e0b04ebe1bad0} \\ \bottomrule
\end{tabular}
\end{adjustwidth}
\vspace{1cm}

\subsection*{Section 1: Data Extraction}

The data extraction process (implemented in the script \texttt{extract-data.py}) scans the files in the
TensorFlow project and extracts Python docstrings and symbol names for functions, classes, and methods. A summary of the
number of features extracted can be found in table~\ref{tab:count1}. The collected figures show that the number of
classes is more than half the number of files, while the number of functions is about twice the number of files.
Additionally, the data shows that a class contains slightly more than 2 methods on average.

\begin{table}[H]
\centering
\begin{tabular}{cc}
\toprule
Type & Number \\
\midrule
Python files & 2817 \\
Classes & 1882 \\
Functions & 4565 \\
Methods & 5817 \\
\bottomrule
\end{tabular}
\caption{Count of extracted Python files, classes, functions, and methods.}
\label{tab:count1}
\end{table}
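The extraction logic itself is not reproduced in the report. As a rough sketch (the sample source and the helper \texttt{extract\_features} below are illustrative assumptions; the real \texttt{extract-data.py} may differ), Python's \texttt{ast} module can collect the symbol names and docstrings described above:

```python
import ast

# Hypothetical sample source standing in for one TensorFlow file;
# the real script walks every .py file in the repository.
SOURCE = '''
class GpuInfo:
    """Holds gpu device info."""

    def gather(self):
        """Gather gpu device info."""
        return []

def gather_gpu_devices():
    """Gather gpu device info."""
    return []
'''

def extract_features(source):
    """Collect (kind, name, docstring) triples for top-level classes,
    their methods, and top-level functions."""
    rows = []
    for node in ast.parse(source).body:
        if isinstance(node, ast.ClassDef):
            rows.append(("class", node.name, ast.get_docstring(node)))
            for sub in node.body:
                if isinstance(sub, ast.FunctionDef):
                    rows.append(("method", sub.name, ast.get_docstring(sub)))
        elif isinstance(node, ast.FunctionDef):
            rows.append(("function", node.name, ast.get_docstring(node)))
    return rows

rows = extract_features(SOURCE)
```

Running this over the whole repository and summing the triples per kind would yield counts like those in the table above.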

\subsection*{Section 2: Training of search engines}

The training and model execution of the search engines is implemented in the Python script \texttt{search-data.py}.
The training step loads the data extracted by \texttt{extract-data.py} and uses as classification features the
identifier name and only the first line of the docstring; all other comment lines are filtered out, as this
significantly improves performance when evaluating the models.
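The first-line filtering can be sketched as follows; the helper name and the tokenization rules here are assumptions for illustration, not necessarily the exact ones used by \texttt{search-data.py}:

```python
def document_features(name, docstring):
    """Feature tokens for one symbol: the identifier split on
    underscores plus the first line of its docstring (all remaining
    docstring lines are discarded, as described above)."""
    first_line = (docstring or "").strip().split("\n", 1)[0]
    tokens = name.lower().split("_") + first_line.lower().rstrip(".").split()
    return [t for t in tokens if t]

features = document_features(
    "gather_gpu_devices",
    "Gather gpu device info.\n\nReturns:\n  A list of test_log_pb2.GPUInfo.")
```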

The script is able to search a given natural language query in the extracted TensorFlow corpus using four
techniques, namely: Word Frequency Similarity, Term Frequency--Inverse Document Frequency (TF-IDF) Similarity,
Latent Semantic Indexing (LSI), and Doc2Vec.
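As an illustration of the first two techniques (the actual script presumably relies on an information-retrieval library; the toy corpus and tokenization below are assumptions), word-frequency and TF-IDF cosine similarity can be computed as:

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two sparse term->weight mappings."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Toy corpus: symbol name -> feature tokens (illustrative only).
corpus = {
    "gather_gpu_devices": "gather gpu device info".split(),
    "gather_cpu_devices": "gather list of cpu devices".split(),
    "tpu_embedding":      "embedding lookup on tpu hosts".split(),
}

# Word-frequency vectors: raw term counts per document.
freq = {doc: Counter(toks) for doc, toks in corpus.items()}

# TF-IDF vectors: term count scaled by log inverse document frequency.
n_docs = len(corpus)
df = Counter(t for toks in corpus.values() for t in set(toks))
tfidf = {doc: {t: c * math.log(n_docs / df[t]) for t, c in cnt.items()}
         for doc, cnt in freq.items()}

q = Counter("gather gpu device info".split())
q_tfidf = {t: c * math.log(n_docs / df[t]) for t, c in q.items() if t in df}
best_freq = max(freq, key=lambda d: cosine(q, freq[d]))
best_tfidf = max(tfidf, key=lambda d: cosine(q_tfidf, tfidf[d]))
```

Both rankings put \texttt{gather\_gpu\_devices} first for this query; LSI and Doc2Vec replace these sparse vectors with dense learned embeddings.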

An example output of the results generated from the query ``Gather gpu device info'' for the word frequency, TF-IDF,
LSI, and Doc2Vec models is shown in
figures~\ref{fig:search-freq},~\ref{fig:search-tfidf},~\ref{fig:search-lsi}~and~\ref{fig:search-doc2vec} respectively.
All four models correctly report the ground truth required by the file \texttt{ground-truth-unique.txt} as
the first result with $>90\%$ similarity, with the exception of the Doc2Vec model, which reports $71.63\%$ similarity.

\begin{figure}[b]
\small
\begin{verbatim}
Similarity: 90.45%
Python function: gather_gpu_devices
Description: Gather gpu device info. Returns: A list of test_log_pb2.GPUInf...
File: tensorflow/tensorflow/tools/test/system_info_lib.py
Line: 126
\end{verbatim}
\caption{Search result output for the query ``Gather gpu device info'' using the word frequency similarity model.}
\label{fig:search-freq}
\end{figure}

\begin{figure}[b]
\small
\begin{verbatim}
Similarity: 90.95%
Python function: gather_gpu_devices
Description: Gather gpu device info. Returns: A list of test_log_pb2.GPUInf...
File: tensorflow/tensorflow/python/platform/tf_logging.py
Line: 167
\end{verbatim}
\caption{Search result output for the query ``Gather gpu device info'' using the TF-IDF model.}
\label{fig:search-tfidf}
\end{figure}

\begin{figure}[b]
\small
\begin{verbatim}
Similarity: 98.38%
Python function: gather_gpu_devices
Description: Gather gpu device info. Returns: A list of test_log_pb2.GPUInf...
File: tensorflow/tensorflow/python/distribute/packed_distributed_variable.py
Line: 90
\end{verbatim}
\caption{Search result output for the query ``Gather gpu device info'' using the LSI model.}
\label{fig:search-lsi}
\end{figure}

\begin{figure}[b]
\small
\begin{verbatim}
Similarity: 71.63%
Python function: gather_gpu_devices
Description: Gather gpu device info. Returns: A list of test_log_pb2.GPUInf...
File: tensorflow/tensorflow/python/tpu/tpu_embedding.py
Line: 1011
\end{verbatim}
\caption{Search result output for the query ``Gather gpu device info'' using the Doc2Vec model.}
\label{fig:search-doc2vec}
\end{figure}

\subsection*{Section 3: Evaluation of search engines}

The evaluation over the given ground truth, which computes precision, recall, and the T-SNE plots, is performed by the
script \texttt{prec-recall.py}. The calculated average precision and recall values are reported in table~\ref{tab:tab2}.

Precision and recall are quite high for all models.
The word frequency model has the highest precision and recall ($93.33\%$ and $100.00\%$ respectively), while the Doc2Vec
model has the lowest precision ($73.33\%$) and lowest recall ($80.00\%$).

\begin{table}[H]
\centering
\begin{tabular}{ccc}
\toprule
Engine & Avg Precision & Recall \\
\midrule
Frequencies & 93.33\% & 100.00\% \\
TF-IDF & 90.00\% & 90.00\% \\
LSI & 90.00\% & 90.00\% \\
Doc2Vec & 73.33\% & 80.00\% \\
\bottomrule
\end{tabular}
\caption{Evaluation of search engines.}
\label{tab:tab2}
\end{table}
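The report does not spell out the exact formulas used by \texttt{prec-recall.py}; one plausible reading, sketched here on hypothetical data, takes a query's precision as the reciprocal rank of its ground-truth item within the returned results and its recall as whether the item is returned at all, averaging both over all queries:

```python
def avg_precision_recall(results, ground_truth, k=5):
    """Average precision and recall over queries: precision for one
    query is 1/rank of its ground-truth item in the top-k results
    (0 if absent); recall is 1 if the item appears in the top-k."""
    precisions, recalls = [], []
    for query, ranked in results.items():
        truth, topk = ground_truth[query], ranked[:k]
        if truth in topk:
            precisions.append(1.0 / (topk.index(truth) + 1))
            recalls.append(1.0)
        else:
            precisions.append(0.0)
            recalls.append(0.0)
    n = len(precisions)
    return sum(precisions) / n, sum(recalls) / n

# Hypothetical ranked outputs and ground truth for two queries.
results = {"gather gpu device info": ["gather_gpu_devices", "gather_cpu_info"],
           "log an info message":    ["warn", "info"]}
truth = {"gather gpu device info": "gather_gpu_devices",
         "log an info message":    "info"}
precision, recall = avg_precision_recall(results, truth)
```

Under this reading, a ground truth found at rank 2 contributes $50\%$ precision but full recall, which is consistent with precision trailing recall in the table above.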

\subsection*{Section 4: Visualisation of query results}

The two-dimensional T-SNE plots (computed with perplexity $= 2$) for the LSI and Doc2Vec models are respectively in
figures~\ref{fig:tsne-lsi}~and~\ref{fig:tsne-doc2vec}.

The T-SNE plot for the LSI model clearly shows the presence of outliers in the search results. The Doc2Vec plot shows
fewer outliers and more distinct clusters for the results of each query and the query vector itself. However, even
considering the good performance of both models, it is hard to identify in the plots distinct ``regions''
where the results and their respective query are located.
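The projection step behind these plots can be sketched with scikit-learn; the random vectors below are stand-ins for the LSI or Doc2Vec document vectors actually used by \texttt{prec-recall.py}, and the resulting 2-D coordinates would then be drawn with a scatter plot:

```python
import numpy as np
from sklearn.manifold import TSNE

# Stand-ins for one query vector plus its results
# (assumed 50-dimensional here for illustration).
rng = np.random.default_rng(0)
vectors = rng.normal(size=(8, 50))

# Project to 2-D with the same perplexity used in the report (2).
embedding = TSNE(n_components=2, perplexity=2, random_state=0,
                 init="random").fit_transform(vectors)
```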

\begin{figure}
\begin{center}
\includegraphics[width=\textwidth]{../out/lsi_plot}
\caption{T-SNE plot for the LSI model over the queries and ground truths given in \texttt{ground-truth-unique.txt}.}
\label{fig:tsne-lsi}
\end{center}
\end{figure}

\begin{figure}
\begin{center}
\includegraphics[width=\textwidth]{../out/doc2vec_plot}
\caption{T-SNE plot for the Doc2Vec model over the queries and ground truths given in \texttt{ground-truth-unique.txt}.}
\label{fig:tsne-doc2vec}
\end{center}
\end{figure}

\end{document}