
Commit

Update on Overleaf.
AlexGustafsson authored and overleaf committed Jun 5, 2021
1 parent e8fdb21 commit 8cb4684
Showing 5 changed files with 19 additions and 12 deletions.
acknowledgements.tex (2 changes: 1 addition & 1 deletion)
@@ -1,6 +1,6 @@
\acknowledgments

- \noindent We would like to thank our university advisor \textbf{Prof. Håkan Grahn} for his commitment to keeping us on the right path, focusing on the goal of the thesis.
+ \noindent We would like to thank our university advisor \textbf{Prof. Håkan Grahn} for his commitment to keeping us on the right path, focusing on the goal of the thesis. His valuable feedback, inspiring words, and sense of humor kept us going through stressful times.
\hfill\par\hfill\par
\noindent We extend our gratitude to \textbf{Robert Nyqvist} for his engaging courses in mathematics and cryptology over the years. As Robert provided us with numerous challenges during that time, we would like to return the favor. At the bottom of this page is a small puzzle of sorts.
\hfill\par\hfill\par
chapters/discussion/main.tex (6 changes: 3 additions & 3 deletions)
@@ -55,15 +55,15 @@ \section{The Performance of Post-Quantum Key Encapsulation Mechanisms}
When performing our tests on cloud hardware, we anticipated less consistent results than on dedicated consumer hardware. We believed that, due to the virtualized and shared nature of the resources, the cloud environments would yield varied results over time as other users of the system utilized the hardware. We found that the Cloud Provider 2 environment had several performance discrepancies over time when running \glspl{kem} in sequential iterations. We also found, however, that Cloud Provider 1 largely behaved like the dedicated consumer hardware we tested. Although it is difficult to draw conclusions from the small sample of cloud providers in our tests, we argue that there is a non-zero chance that virtualized cloud hardware performs less consistently than dedicated hardware, given that Cloud Provider 2 had performance discrepancies in all of our sequential benchmarks.

% Modern Laptop - cache misses, irregular memory accesses
- Another phenomenon found in our data is that the Modern Laptop environment consistently yields the largest number of cache misses. Despite having a considerably newer CPU and more available cache than the Old Mid-Range Laptop and the Old Low-Range Laptop, the Modern Laptop environment performed much worse, as seen in Tables \ref{table:results:micro:cache-misses-mceliece-8192128f-enc} and \ref{table:results:micro:cache-misses-ntru-hrss701-enc}. We believe that this is due to the aggressive prefetch mechanisms found in newer CPUs. These mechanisms could mispredict what memory is necessary for future computation and as such evict memory that is used by the algorithms we benchmarked. The older machines could have less aggressive mechanisms, or lack them altogether, leading to fewer misses. We believe that this cache prefetching did not constitute an issue for the Modern Workstation as it had double the amount of cache, resulting in virtually zero cache misses across the board.
+ Another phenomenon found in our data is that the Modern Laptop environment consistently yields the largest number of cache misses. Despite having a considerably newer CPU and more available cache than the Old Mid-Range Laptop and the Old Low-Range Laptop, the Modern Laptop environment performed much worse, as seen in Tables \ref{table:results:micro:cache-misses-mceliece-8192128f-enc} and \ref{table:results:micro:cache-misses-ntru-hrss701-enc}. We believe that this is due to the aggressive prefetch mechanisms found in newer CPUs. These mechanisms could mispredict what memory is necessary for future computation and as such evict memory that is used by the algorithms we benchmarked. The older machines could have less aggressive mechanisms, or lack them altogether, leading to fewer misses. We believe that this cache prefetching did not constitute an issue for the Modern Workstation as it had double the amount of cache, resulting in virtually zero cache misses.

%%% === TODO LARGE CHANGE IN TOPIC HERE === %%%
% -- mceliece uses significantly more memory than the other algorithms - not suitable for IoT etc.?
When considering the memory footprint of the proposed \glspl{kem}, one must take into account the stack usage, heap usage, and parameter sizes of the algorithms. We found that the heap usage of the algorithms is negligible. The parameter sizes, however, vary considerably between \gls{mceliece} and \gls{ntru}. \gls{mceliece} requires between 1 and 1.3 megabytes of memory to store a public key, whilst \gls{ntru} requires roughly a thousandth of that - about 1000 bytes. Although \gls{mceliece}'s private key is considerably smaller than its public key, at roughly 13 KiB it is still about ten times as large as \gls{ntru}'s private key. We therefore believe that \gls{mceliece} is impractical for use in low-memory environments such as embedded devices.
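As a rough sanity check of these figures, implementations that follow the NIST KEM API expose the parameter sizes as compile-time constants in their api.h. A minimal sketch that prints them, assuming such a header is on the include path (the path below is hypothetical):

    #include <stdio.h>

    /* Hypothetical include path; the CRYPTO_*BYTES macros follow the NIST
     * KEM API convention used by the reference implementations. */
    #include "ntruhrss701/api.h"

    int main(void) {
        printf("public key: %d bytes\n", CRYPTO_PUBLICKEYBYTES);
        printf("secret key: %d bytes\n", CRYPTO_SECRETKEYBYTES);
        printf("ciphertext: %d bytes\n", CRYPTO_CIPHERTEXTBYTES);
        return 0;
    }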

%%% === TODO LARGE CHANGE IN TOPIC HERE === %%%
% -- ntru scales much better than mceliece in terms of threads
- The data we collected for throughput and scaling of the algorithms showed that none of the classical algorithms scaled well when increasing the number of threads that concurrently executed the algorithms. In fact, the worst scaling and throughput were found in \gls{dhe}, followed by \gls{mceliece}. The algorithms saw virtually no increase in throughput once the number of threads surpassed the number of cores of the system. The performance of \gls{ecdhe} varied depending on the environment it was run in, but in general, the scaling was better than that of \gls{dhe}, with improvements made beyond the system's core count. The best scaling by far was found in the \gls{avx2} optimized \gls{ntru} HRSS 701 implementations, which saw a near-linear increase in performance with regard to the number of threads - even when passing the core count of the machine itself. Furthermore, the scaling of \gls{ntru} HRSS 701 was found to be the best in the Modern Laptop and Modern Workstation environments, which have the newest CPUs in the test. We therefore strongly believe that \gls{ntru} HRSS 701 is a top candidate for \gls{post-quantum} \glspl{kem}, when only considering performance. Although not strictly comparable, we believe that one may expect similar performance of \gls{ntru} and the classical \gls{ecdhe}, judging from our measurements of throughput and scaling.
+ The data we collected for throughput and scaling of the algorithms showed that none of the classical algorithms scaled well when increasing the number of threads that concurrently executed the algorithms. In fact, the worst scaling and throughput were found in \gls{dhe}, followed by \gls{mceliece}. The algorithms saw virtually no increase in throughput once the number of threads surpassed the number of cores of the system. The performance of \gls{ecdhe} varied depending on the environment it was run in, but in general, the scaling was better than that of \gls{dhe}, with improvements made beyond the system's core count. The best scaling by far was found in the \gls{avx2}-optimized \gls{ntru} HRSS 701 implementations, which saw a near-linear increase in performance with regard to the number of threads - even when passing the core count of the machine itself. Furthermore, the scaling of \gls{ntru} HRSS 701 was found to be the best in the Modern Laptop and Modern Workstation environments, which have the newest CPUs in the test. We therefore strongly believe that \gls{ntru} HRSS 701 is a top candidate for \gls{post-quantum} \glspl{kem}, when only considering performance.
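The thread-scaling measurements described above amount to running the same operation concurrently in a varying number of threads and counting completed operations per unit of time. A minimal sketch of such a loop, assuming a NIST-style KEM API (crypto_kem_keypair, crypto_kem_enc, and the CRYPTO_*BYTES macros); thread pinning, warm-up, and error handling - details of the actual harness - are omitted:

    /* Minimal thread-scaling sketch: run crypto_kem_enc in N threads for a
     * fixed duration and report the aggregate throughput. Not the thesis
     * harness; api.h is assumed to follow the NIST KEM convention. */
    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdbool.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include "api.h"

    #define DURATION_SECONDS 5
    #define MAX_THREADS 128

    static unsigned char public_key[CRYPTO_PUBLICKEYBYTES];
    static atomic_bool running = true;
    static atomic_ulong operations;

    static void *worker(void *arg) {
        (void)arg;
        unsigned char ciphertext[CRYPTO_CIPHERTEXTBYTES];
        unsigned char shared_secret[CRYPTO_BYTES];
        while (atomic_load(&running)) {
            crypto_kem_enc(ciphertext, shared_secret, public_key);
            atomic_fetch_add(&operations, 1);
        }
        return NULL;
    }

    int main(int argc, char **argv) {
        int thread_count = argc > 1 ? atoi(argv[1]) : 1;
        if (thread_count < 1 || thread_count > MAX_THREADS)
            return 1;

        unsigned char secret_key[CRYPTO_SECRETKEYBYTES];
        crypto_kem_keypair(public_key, secret_key);

        pthread_t threads[MAX_THREADS];
        for (int i = 0; i < thread_count; i++)
            pthread_create(&threads[i], NULL, worker, NULL);

        sleep(DURATION_SECONDS);
        atomic_store(&running, false);
        for (int i = 0; i < thread_count; i++)
            pthread_join(threads[i], NULL);

        printf("%d threads: %.0f enc/s\n", thread_count,
               (double)atomic_load(&operations) / DURATION_SECONDS);
        return 0;
    }

Near-linear scaling then shows up as the enc/s figure growing in proportion to the thread count, as observed for \gls{ntru} HRSS 701 even beyond the machine's core count.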

\todo[inline]{
-- argue that non-openssl aes is better due to less memory use?
@@ -73,7 +73,7 @@ \section{The Security of Post-Quantum Key Encapsulation Mechanisms}
\label{section:discussion:post-quantum-security}

% -- ntru much faster than mceliece, but the security category is lower. HRSS 701 faster than HPS4096821, but the security level is lower.
- Our study has disregarded the security of the \gls{nist} submissions, letting us focus on the performance of the algorithms. It goes without saying that the security of the algorithms is of utmost importance. As we have not performed a study of the security of the algorithms of our own, we rely on the information presented by the \gls{nist} submissions themselves. As presented in Table \ref{table:background:submissions-security-level}, all of the \gls{mceliece} variants we tested are security level 5. The security level of \gls{ntru} is either 3 or 5 for the HPS 4096821 variant and either 1 or 3 for the HRSS 701 variant, depending on the locality model used. We found that the HRSS 701 variant of \gls{ntru} overall performed the best out of all \gls{kem} algorithms tested. We further found that the \gls{mceliece} variants performed the worst. We therefore believe, given our results, that there may be a correlation between the performance and security level of \gls{post-quantum} algorithms.
+ Our study has disregarded the security of the \gls{nist} submissions, letting us focus on the performance of the algorithms. It goes without saying that the security of the algorithms is of utmost importance. As we have not performed a study of the security of the algorithms of our own, we rely on the information presented by the \gls{nist} submissions themselves. As presented in Table \ref{table:background:submissions-security-level}, all of the \gls{mceliece} variants we tested are security level 5. The security level of \gls{ntru} is either 3 or 5 for the HPS 4096821 variant and either 1 or 3 for the HRSS 701 variant, depending on the locality model used. We found that the HRSS 701 variant of \gls{ntru} overall performed the best out of all \gls{kem} algorithms tested. We further found that the \gls{mceliece} variants performed the worst. Given our results, there may be a correlation between the performance and security level of \gls{post-quantum} algorithms.

Although one may believe our sample set is small, we argue that one has to consider the broader picture. As mentioned in~\cite{ntru2020}, HPS 4096821 and HRSS 701 originated from two different \gls{nist} submissions. We therefore argue that these algorithms are, in part, distinct from one another - effectively increasing the size of the sample set. We do believe, however, that a further study is required to definitively state whether there is a correlation between the security level of an algorithm and its performance.

chapters/discussion/validity.tex (2 changes: 1 addition & 1 deletion)
@@ -42,7 +42,7 @@ \subsection{Construct validity}

% Construct validity evaluates whether a measurement tool really represents the thing we are interested in measuring. It’s central to establishing the overall validity of a method.

- As previously mentioned in section \ref{section:method:experiment:phase1:variables}, we were interested in measuring throughput-related values such as CPU cycles, instruction count, and wall-clock time, as well as memory-related measurements such as heap and stack usage. For some of these measurements, we relied on the standard Linux kernel-based API named perf (perf\_event\_open). The API was introduced in Linux 2.6.31, which was released in 2009~\cite{linux:perf-released}. The API has since grown and, as is tradition in Linux development, each iteration has been reviewed extensively by multiple people throughout the years. We are confident that the API provides data as accurate as the kernel is able to collect. To make the API usable, we used a lightweight instrumentation tool\footnote{\url{https://github.com/profiling-pqc-kem-thesis/perforator}} which allowed us to use the perf API to measure events for specific regions of code. As with other third-party tools, we validated its function by comparing its results to those of other tools. By using Linux trace APIs to monitor the target binary, we were able to insert measurements around a function call by having the measurement tool interrupt the target program. As the target program was frozen during the handling of these measurements, we strongly believe that no overhead added by the measurement tool was included in the end result. By running the instrumented benchmark separately from the benchmarks measuring wall-clock time or memory allocation, we are certain that we achieved accurate values for all of our measurements.
+ As previously mentioned in section \ref{section:method:experiment:phase1:variables}, we were interested in measuring throughput-related values such as CPU cycles, instruction count, and wall-clock time, as well as memory-related measurements such as heap and stack usage. For some of these measurements, we relied on the standard Linux kernel-based API named perf. The API was introduced in Linux 2.6.31, which was released in 2009~\cite{linux:perf-released}. The API has since grown and, as is tradition in Linux development, each iteration has been reviewed extensively by multiple people throughout the years. We are confident that the API provides data as accurate as the kernel is able to collect. To make the API usable, we used a lightweight instrumentation tool\footnote{\url{https://github.com/profiling-pqc-kem-thesis/perforator}} which allowed us to use the perf API to measure events for specific regions of code. As with other third-party tools, we validated its function by comparing its results to those of other tools. By using Linux trace APIs to monitor the target binary, we were able to insert measurements around a function call by having the measurement tool interrupt the target program. As the target program was frozen during the handling of these measurements, we strongly believe that no overhead added by the measurement tool was included in the end result. By running the instrumented benchmark separately from the benchmarks measuring wall-clock time or memory allocation, we are certain that we achieved accurate values for all of our measurements.
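For reference, the counter mechanics of the API look roughly as follows: open a hardware counter with perf_event_open, then reset, enable, and read it around the region of interest. This follows the pattern in the perf_event_open(2) man page and is only an inline illustration - perforator instead attaches to an uninstrumented binary through the trace APIs:

    /* Count retired instructions for a region of code via perf_event_open.
     * Follows the perf_event_open(2) man page example; measures the current
     * process, on any CPU. */
    #define _GNU_SOURCE
    #include <linux/perf_event.h>
    #include <stdio.h>
    #include <string.h>
    #include <sys/ioctl.h>
    #include <sys/syscall.h>
    #include <unistd.h>

    int main(void) {
        struct perf_event_attr attr;
        memset(&attr, 0, sizeof(attr));
        attr.type = PERF_TYPE_HARDWARE;
        attr.size = sizeof(attr);
        attr.config = PERF_COUNT_HW_INSTRUCTIONS;
        attr.disabled = 1;
        attr.exclude_kernel = 1;
        attr.exclude_hv = 1;

        /* The syscall has no glibc wrapper. */
        int fd = syscall(SYS_perf_event_open, &attr, 0, -1, -1, 0);
        if (fd == -1) {
            perror("perf_event_open");
            return 1;
        }

        ioctl(fd, PERF_EVENT_IOC_RESET, 0);
        ioctl(fd, PERF_EVENT_IOC_ENABLE, 0);

        /* Region of interest - stands in for a call such as crypto_kem_enc. */
        volatile unsigned long sum = 0;
        for (unsigned long i = 0; i < 1000000; i++)
            sum += i;

        ioctl(fd, PERF_EVENT_IOC_DISABLE, 0);
        long long count;
        read(fd, &count, sizeof(count));
        printf("instructions: %lld\n", count);
        close(fd);
        return 0;
    }

Enabling the counter only around the region keeps setup costs out of the measurement, mirroring the region-based measurements described above.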

When studying the data amassed after applying our toolset for micro-benchmarks, we found that the value 9223372036854775808 occurred a considerable number of times. As it was considerably larger than other values, and since we were not expecting similar values for completely different events, we analyzed the fault. Given the size of the problem space, we were unable to identify the root cause. We found that 0.7\% of the recorded values were affected by this issue and that it likely originates in incorrect handling of unsigned 64-bit integers, as the value is one greater than the maximum value a signed 64-bit integer can store. To make the error explicit, we marked the affected values and excluded them from the data presented in this thesis. Given the low number of affected measurements, we feel confident in our handling of these errors. One region that did show a considerable number of errors, however, is syndrome\_asm, where erroneous values made up 33\% of the measurements. Other regions contained roughly 2\% errors. All of our data is published alongside this work for further verification efforts by third parties.
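For clarity, 9223372036854775808 is exactly 2^63: one greater than INT64_MAX, and the bit pattern of INT64_MIN when reinterpreted as signed, which is why a signed/unsigned mix-up produces precisely this value. A minimal demonstration:

    #include <inttypes.h>
    #include <stdint.h>
    #include <stdio.h>

    int main(void) {
        /* The suspect value: a lone sign bit, i.e. 2^63. */
        uint64_t suspect = UINT64_C(1) << 63;
        printf("2^63          = %" PRIu64 "\n", suspect);
        printf("INT64_MAX + 1 = %" PRIu64 "\n", (uint64_t)INT64_MAX + 1);
        /* Reinterpreted as signed, the same bit pattern is INT64_MIN. */
        printf("as int64_t    = %" PRId64 "\n", (int64_t)suspect);
        return 0;
    }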
