Added commit id in report
parent 07885943d7
commit e2d6151c34
2 changed files with 113 additions and 101 deletions
BIN    report/main.pdf    (binary file not shown)
182    report/main.tex
@@ -34,6 +34,7 @@
\usepackage{subcaption}
\usepackage{amssymb}
\usepackage{amsmath}
+\usepackage{changepage}
\usepackage{hyperref}

\title{Knowledge Management and Analysis \\ Project 01: Code Search}
@@ -42,50 +43,61 @@

\begin{document}

\maketitle

+\begin{adjustwidth}{-4cm}{-4cm}
+\centering
+\begin{tabular}{cc}
+\toprule
+Repository URL & \url{https://github.com/kamclassroom2022/project-01-multi-search-maggicl} \\
+Commit ID & \texttt{b8e0a2c3c41249e45b233b55607e0b04ebe1bad0} \\ \bottomrule
+\end{tabular}
+\end{adjustwidth}
+\vspace{1cm}

\subsection*{Section 1 - Data Extraction}

The data extraction process (implemented in the script \texttt{extract-data.py}) scans the files of the
TensorFlow project and extracts Python docstrings and symbol names for functions, classes and methods. A summary of
the number of features extracted can be found in table~\ref{tab:count1}. The collected figures show that the number
of classes is more than half the number of files, while the number of functions is about twice the number of files.
Additionally, the data shows that a class contains, on average, slightly more than two methods.
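As a rough illustration of the extraction step described above, a minimal sketch using Python's ast module is shown below. It is not the project's actual extract-data.py; the TensorFlow checkout path ("tensorflow") and the data.csv output layout are assumptions.

import ast
import csv
from pathlib import Path

def docstring(node) -> str:
    return ast.get_docstring(node) or ""

def extract(root: str):
    """Yield (kind, name, docstring, file) records for classes, methods and functions."""
    for path in Path(root).rglob("*.py"):
        try:
            tree = ast.parse(path.read_text(encoding="utf-8"), filename=str(path))
        except (SyntaxError, UnicodeDecodeError):
            continue  # skip files that cannot be parsed
        methods = set()
        for node in ast.walk(tree):
            if isinstance(node, ast.ClassDef):
                yield "class", node.name, docstring(node), str(path)
                for child in node.body:  # defs directly inside a class body are methods
                    if isinstance(child, (ast.FunctionDef, ast.AsyncFunctionDef)):
                        methods.add(id(child))
                        yield "method", child.name, docstring(child), str(path)
        for node in ast.walk(tree):
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)) and id(node) not in methods:
                yield "function", node.name, docstring(node), str(path)

if __name__ == "__main__":
    rows = list(extract("tensorflow"))  # assumed path of the TensorFlow checkout
    with open("data.csv", "w", newline="") as fh:  # assumed output file name
        csv.writer(fh).writerows(rows)
    print(len(rows), "symbols extracted")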
\begin{table}[H]
\centering
\begin{tabular}{cc}
-\hline
+\toprule
Type & Number \\
-\hline
+\midrule
Python files & 2817 \\
Classes & 1882 \\
Functions & 4565 \\
Methods & 5817 \\
-\hline
+\bottomrule
\end{tabular}
\caption{Count of extracted Python files, classes, functions, and methods.}
\label{tab:count1}
\end{table}

\subsection*{Section 2: Training of search engines}

The training and model execution of the search engines is implemented in the Python script \texttt{search-data.py}.
The training step loads the data extracted by \texttt{extract-data.py} and uses as features the identifier name and
only the first line of the docstring. All other docstring lines are filtered out, as this significantly increases
performance when evaluating the models.
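A small sketch of this feature-building step is shown below, continuing from the hypothetical data.csv layout of the previous sketch and assuming a simple letters-only tokenizer and the gensim library; the real search-data.py may preprocess the text differently.

import csv
import re
from gensim.corpora import Dictionary

def features(name: str, doc: str):
    """Feature text: the identifier name plus only the first line of its docstring."""
    first_line = doc.splitlines()[0] if doc else ""
    return [w.lower() for w in re.findall(r"[A-Za-z]+", f"{name} {first_line}")]

with open("data.csv", newline="") as fh:  # records written by the extraction sketch above
    records = list(csv.reader(fh))        # columns: kind, name, docstring, file

docs = [features(name, doc) for _kind, name, doc, _file in records]
dictionary = Dictionary(docs)                   # token -> integer id
corpus = [dictionary.doc2bow(d) for d in docs]  # sparse bag-of-words vectors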
The script is able to search a given natural-language query over the extracted TensorFlow corpus using four
techniques, namely: Word Frequency Similarity, Term Frequency-Inverse Document Frequency (TF-IDF) Similarity,
Latent Semantic Indexing (LSI), and Doc2Vec.

Example outputs for the query ``Gather gpu device info'' with the word frequency, TF-IDF, LSI and Doc2Vec models are
shown in figures~\ref{fig:search-freq},~\ref{fig:search-tfidf},~\ref{fig:search-lsi}~and~\ref{fig:search-doc2vec}
respectively. All four models correctly report the ground truth required by the file \texttt{ground-truth-unique.txt}
as the first result with $>90\%$ similarity, with the exception of the Doc2Vec model, which reports $71.63\%$
similarity.
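The gensim-based sketch below illustrates how the four techniques could be set up over the dictionary, corpus and token lists from the previous sketch and then queried with the example text. It is a sketch under stated assumptions (gensim as the library, 300 LSI topics, the Doc2Vec hyperparameters), not the project's search-data.py.

import re
from gensim import models, similarities
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

# Models built over `dictionary`, `corpus` and `docs` from the previous sketch.
tfidf = models.TfidfModel(corpus)
lsi = models.LsiModel(tfidf[corpus], id2word=dictionary, num_topics=300)  # topic count assumed
index_freq = similarities.SparseMatrixSimilarity(corpus, num_features=len(dictionary))
index_tfidf = similarities.SparseMatrixSimilarity(tfidf[corpus], num_features=len(dictionary))
index_lsi = similarities.MatrixSimilarity(lsi[tfidf[corpus]], num_features=lsi.num_topics)
d2v = Doc2Vec([TaggedDocument(d, [i]) for i, d in enumerate(docs)],
              vector_size=300, min_count=1, epochs=40)  # hyperparameters assumed

def query(text: str, topn: int = 5):
    q = [w.lower() for w in re.findall(r"[A-Za-z]+", text)]
    bow = dictionary.doc2bow(q)
    return {
        # cosine similarity over raw term counts, one reading of "word frequency similarity"
        "freq": index_freq[bow],
        "tfidf": index_tfidf[tfidf[bow]],
        "lsi": index_lsi[lsi[tfidf[bow]]],
        # Doc2Vec compares an inferred query vector against the trained document vectors
        "doc2vec": d2v.dv.most_similar([d2v.infer_vector(q)], topn=topn),
    }

results = query("Gather gpu device info")  # the query used for the figures in the report

The freq, tfidf and lsi entries are per-document similarity arrays that would still need to be sorted to obtain the top-5 hits shown in the figures; the doc2vec entry already comes back as (document tag, similarity) pairs.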
\begin{figure}[b]
\small
\begin{verbatim}
Similarity: 90.45%
@@ -120,9 +132,9 @@ Line: 126
\end{verbatim}
\caption{Search result output for the query ``Gather gpu device info'' using the word frequency similarity model.}
\label{fig:search-freq}
\end{figure}

\begin{figure}[b]
\small
\begin{verbatim}
Similarity: 90.95%
@@ -156,9 +168,9 @@ Line: 167
\end{verbatim}
\caption{Search result output for the query ``Gather gpu device info'' using the TF-IDF model.}
\label{fig:search-tfidf}
\end{figure}

\begin{figure}[b]
\small
\begin{verbatim}
Similarity: 98.38%
@@ -192,9 +204,9 @@ Line: 90
\end{verbatim}
\caption{Search result output for the query ``Gather gpu device info'' using the LSI model.}
\label{fig:search-lsi}
\end{figure}

\begin{figure}[b]
\small
\begin{verbatim}
Similarity: 71.63%
@@ -229,57 +241,57 @@ Line: 1011
\end{verbatim}
\caption{Search result output for the query ``Gather gpu device info'' using the Doc2Vec model.}
\label{fig:search-doc2vec}
\end{figure}

\subsection*{Section 3: Evaluation of search engines}

The evaluation against the given ground truth, i.e.\ the computation of precision, recall, and the T-SNE plots, is
performed by the script \texttt{prec-recall.py}. The calculated average precision and recall values are reported in
table~\ref{tab:tab2}.

Precision and recall are quite high for all models. The word frequency model has the highest precision and recall
($93.33\%$ and $100.00\%$ respectively), while the Doc2Vec model has the lowest precision ($73.33\%$) and the lowest
recall ($80.00\%$).
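The exact precision and recall definitions used by prec-recall.py are not visible in this diff; the sketch below uses one common convention that is consistent with the reported figures: recall as the fraction of queries whose ground-truth entity appears in the top-5 results, and average precision as the mean of 1/rank of that entity (0 when it is missing).

def evaluate(results_by_query: dict, truth: dict, top_k: int = 5):
    """results_by_query: query -> ranked entity names returned by one engine.
    truth: query -> expected entity name from ground-truth-unique.txt."""
    precisions, hits = [], 0
    for query, expected in truth.items():
        ranked = results_by_query[query][:top_k]
        if expected in ranked:
            hits += 1
            precisions.append(1.0 / (ranked.index(expected) + 1))  # 1/rank of the hit
        else:
            precisions.append(0.0)                                 # a miss contributes zero
    return sum(precisions) / len(precisions), hits / len(truth)    # (avg precision, recall)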
\begin{table}[H]
\centering
\begin{tabular}{ccc}
-\hline
+\toprule
Engine & Avg Precision & Recall \\
-\hline
+\midrule
Frequencies & 93.33\% & 100.00\% \\
TF-IDF & 90.00\% & 90.00\% \\
LSI & 90.00\% & 90.00\% \\
Doc2Vec & 73.33\% & 80.00\% \\
-\hline
+\bottomrule
\end{tabular}
\caption{Evaluation of search engines.}
\label{tab:tab2}
\end{table}
\subsection*{Section 4: Visualisation of query results}

The two-dimensional T-SNE plots (computed with perplexity $= 2$) for the LSI and Doc2Vec models are shown in
figures~\ref{fig:tsne-lsi}~and~\ref{fig:tsne-doc2vec} respectively.

The T-SNE plot for the LSI model clearly shows the presence of outliers in the search results. The Doc2Vec plot shows
fewer outliers and more distinct clusters for the results of each query and the query vector itself. However, even
considering the good performance of both models, it is hard to identify in the plots distinct ``regions'' where the
results and their respective query are located.
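A minimal sketch of how such a plot could be produced with scikit-learn and matplotlib is shown below, assuming the embedding vectors (LSI topic space or Doc2Vec space) and a per-query colour label are already available; the actual prec-recall.py may differ.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

def tsne_plot(vectors: np.ndarray, query_ids, outfile: str):
    """vectors: one embedding per query and per returned result.
    query_ids: integer label of the query each row belongs to, used only for colouring."""
    coords = TSNE(n_components=2, perplexity=2, random_state=0).fit_transform(vectors)
    plt.figure(figsize=(8, 6))
    plt.scatter(coords[:, 0], coords[:, 1], c=query_ids, cmap="tab10", s=25)
    plt.tight_layout()
    plt.savefig(outfile)

# e.g. tsne_plot(lsi_vectors, query_ids, "out/lsi_plot.png")  # hypothetical inputs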
\begin{figure}
\begin{center}
\includegraphics[width=\textwidth]{../out/lsi_plot}
\caption{T-SNE plot for the LSI model over the queries and ground truths given in \texttt{ground-truth-unique.txt}.}
\label{fig:tsne-lsi}
\end{center}
\end{figure}

\begin{figure}
\begin{center}
\includegraphics[width=\textwidth]{../out/doc2vec_plot}
\caption{T-SNE plot for the Doc2Vec model over the queries and ground truths given in \texttt{ground-truth-unique.txt}.}
\label{fig:tsne-doc2vec}
\end{center}
\end{figure}

\end{document}