diff --git a/acknowledgements.tex b/acknowledgements.tex index 6adaa4e..8e90df6 100644 --- a/acknowledgements.tex +++ b/acknowledgements.tex @@ -1,6 +1,6 @@ \acknowledgments -\noindent We would like to thank our university advisor \textbf{Prof. Håkan Grahn} for his commitment to keep us on the right path, focusing on the goal of the thesis. +\noindent We would like to thank our university advisor \textbf{Prof. Håkan Grahn} for his commitment to keep us on the right path, focusing on the goal of the thesis. His valuable feedback, inspiring words and sense of humor kept us going through stressful times. \hfill\par\hfill\par \noindent We extend our gratitude to \textbf{Robert Nyqvist} for his engaging courses in mathematics and cryptology over the years. As Robert provided us with numerous challenges throughout the years, we would like to return the favor. At the bottom of this page is a small puzzle of sorts. \hfill\par\hfill\par diff --git a/chapters/discussion/main.tex b/chapters/discussion/main.tex index 7ee9b44..d968d41 100644 --- a/chapters/discussion/main.tex +++ b/chapters/discussion/main.tex @@ -55,7 +55,7 @@ \section{The Performance of Post-Quantum Key Encapsulation Mechanisms} When performing our tests on cloud hardware, we anticipated a less consistent result than on dedicated consumer hardware. We believed that, due to the virtualized and shared nature of the resources, the cloud environments would yield varied results over time as other users of the system utilized the hardware. We found that the Cloud Provider 2 environment had several performance discrepancies over time when running \glspl{kem} in sequential iterations. We also found, however, that Cloud Provider 1 largely functioned as the dedicated consumer hardware we tested. 
Although it is difficult to conclude from the small sample of cloud providers in our tests, we argue that there is in fact a non-zero chance that virtualized cloud hardware performs less consistently than dedicated hardware, given that Cloud Provider 2 had performance discrepancies in all of our sequential benchmarks. % Modern Laptop - Cache misses, irregular memory accesses -Another phenomena found in our data is how the Modern Laptop environment consistently yields the largest number of cache misses. Despite having a considerably newer CPU and more available cache than the Old Mid-Range Laptop and the Old Low-Range Laptop, the Modern Laptop environment performed much worse, as seen in Tables \ref{table:results:micro:cache-misses-mceliece-8192128f-enc} and \ref{table:results:micro:cache-misses-ntru-hrss701-enc}. We believe that this is due to the aggressive prefetch mechanisms found in newer CPUs. These mechanism could badly predict what memory is necessary for future computation and as such evict memory that is used by the algorithms we benchmarked. The older machines could have less aggressive mechanisms, or lack them all together, leading to fewer faults. We believe that this prefetching of cache did not constitute an issue for the Modern Workstation as it had double the amount of cache, resulting in virtually zero cache misses across the board. +Another phenomenon found in our data is how the Modern Laptop environment consistently yields the largest number of cache misses. Despite having a considerably newer CPU and more available cache than the Old Mid-Range Laptop and the Old Low-Range Laptop, the Modern Laptop environment performed much worse, as seen in Tables \ref{table:results:micro:cache-misses-mceliece-8192128f-enc} and \ref{table:results:micro:cache-misses-ntru-hrss701-enc}. We believe that this is due to the aggressive prefetch mechanisms found in newer CPUs. 
These mechanisms could mispredict what memory is necessary for future computation and as such evict memory that is used by the algorithms we benchmarked. The older machines could have less aggressive mechanisms, or lack them altogether, leading to fewer cache misses. We believe that this prefetching of cache did not constitute an issue for the Modern Workstation as it had double the amount of cache, resulting in virtually zero cache misses. %%% === TODO LARGE CHANGE IN TOPIC HERE === %%% % -- mceliece uses considerably more memory than the other algorithms - not suitable for IoT etc.? @@ -63,7 +63,7 @@ \section{The Performance of Post-Quantum Key Encapsulation Mechanisms} %%% === TODO LARGE CHANGE IN TOPIC HERE === %%% % -- ntru scales much better than mceliece with respect to threads -The data we collected for throughput and scaling of the algorithms identified that none of the classical algorithms scaled well when increasing the number of threads that concurrently executed the algorithms. In fact, it seemed as if the worst scaling and throughput was found in \gls{dhe}, followed by \gls{mceliece}. The algorithms saw virtually no increase in throughput once the number of threads surpassed the number of cores of the system. The performance of \gls{ecdhe} varied depending on the environment it was run in, but in general, the scaling was better than that of \gls{dhe}, with improvements made beyond the system's core count. The best scaling by far was found in the \gls{avx2} optimized \gls{ntru} HRSS 701 implementations that saw a near-linear increase in performance with regards to the number of threads - even when passing the core count of the machine itself. Furthermore, the scaling of \gls{ntru} HRSS 701 was found to be the best in Modern Laptop and Modern Workstation, with the latest CPUs included in the test. We therefore strongly believe that \gls{ntru} HRSS 701 is a top candidate for \gls{post-quantum} \glspl{kem}, when only considering performance. 
Although not strictly comparable, we believe that one may expect similar performance of \gls{ntru} and the classical \gls{ecdhe}, judging from our measurements of throughput and scaling. +The data we collected for throughput and scaling of the algorithms identified that none of the classical algorithms scaled well when increasing the number of threads that concurrently executed the algorithms. In fact, it seemed as if the worst scaling and throughput were found in \gls{dhe}, followed by \gls{mceliece}. The algorithms saw virtually no increase in throughput once the number of threads surpassed the number of cores of the system. The performance of \gls{ecdhe} varied depending on the environment it was run in, but in general, the scaling was better than that of \gls{dhe}, with improvements made beyond the system's core count. The best scaling by far was found in the \gls{avx2}-optimized \gls{ntru} HRSS 701 implementations that saw a near-linear increase in performance with regard to the number of threads - even when passing the core count of the machine itself. Furthermore, the scaling of \gls{ntru} HRSS 701 was found to be the best in Modern Laptop and Modern Workstation, with the latest CPUs included in the test. We therefore strongly believe that \gls{ntru} HRSS 701 is a top candidate for \gls{post-quantum} \glspl{kem}, when only considering performance. \todo[inline]{ -- argue that non-openssl aes is better due to less memory use? @@ -73,7 +73,7 @@ \section{The Security of Post-Quantum Key Encapsulation Mechanisms} \label{section:discussion:post-quantum-security} % -- ntru much faster than mceliece, but the security category is lower. HRSS 701 faster than HPS4096821, but the security level is lower. -Our study has disregarded the security of the \gls{nist} submissions, letting us focus on the performance of the algorithms. It goes without saying that the security of the algorithms is of upmost importance. 
As we have not performed a study of the security of the algorithms on our own, we rely on the information presented by the \gls{nist} submissions themselves. As presented in \ref{table:background:submissions-security-level}, all of the \gls{mceliece} variants we tested are security level 5. The security level of \gls{ntru} is either 3 or 5 for the HPS 4096821 variant and either 1 or 3 for the HRSS 701 variant, depending on the locality model used. We found that the HRSS 701 variant of \gls{ntru} overall performed the best out of all \gls{kem} algorithms tested. We further found that \gls{mceliece} variants performed the worst. We therefore believe, given our results, that there may be a correlation between the performance and security level of \gls{post-quantum} algorithms. +Our study has disregarded the security of the \gls{nist} submissions, letting us focus on the performance of the algorithms. It goes without saying that the security of the algorithms is of utmost importance. As we have not performed a study of the security of the algorithms on our own, we rely on the information presented by the \gls{nist} submissions themselves. As presented in Table \ref{table:background:submissions-security-level}, all of the \gls{mceliece} variants we tested are security level 5. The security level of \gls{ntru} is either 3 or 5 for the HPS 4096821 variant and either 1 or 3 for the HRSS 701 variant, depending on the locality model used. We found that the HRSS 701 variant of \gls{ntru} overall performed the best out of all \gls{kem} algorithms tested. We further found that \gls{mceliece} variants performed the worst. Given our results, there may be a correlation between the performance and security level of \gls{post-quantum} algorithms. Although one may believe our sample set is small, we argue that one has to consider the broader picture. As mentioned in~\cite{ntru2020}, HPS 4096821 and HRSS 701 originated from two different \gls{nist} submissions. 
We therefore believe that, in part, these algorithms are different from one another - further increasing the size of the sample set. We do believe, however, that a further study of the correlation is required to definitively state whether or not there is a correlation between the security level of an algorithm and its performance. diff --git a/chapters/discussion/validity.tex b/chapters/discussion/validity.tex index 026b6d0..c5ac04b 100644 --- a/chapters/discussion/validity.tex +++ b/chapters/discussion/validity.tex @@ -42,7 +42,7 @@ \subsection{Construct validity} % Construct validity evaluates whether a measurement tool really represents the thing we are interested in measuring. It’s central to establishing the overall validity of a method. -As previously mentioned in section \ref{section:method:experiment:phase1:variables}, we were interested in measuring throughput-related values such as CPU cycles, instruction count, wall-clock time as well as memory-related measurements such as heap and stack usage. For some of these measurements, we relied on the standard Linux kernel-based API named perf (perf\_event\_open). The API was introduced in Linux 2.6.31 which was released in 2009~\cite{linux:perf-released}. The API has grown and as is tradition with the Linux development, each iteration of the API has been reviewed extensively by multiple people throughout the years. We are confident that the API provides as accurate data as the kernel is able to collect. To make the API usable, we used a lightweight instrumentation tool\footnote{\url{https://github.com/profiling-pqc-kem-thesis/perforator}} which allowed us to use the perf API to measure events for specific regions of code. As with other third-party tools, we validated its function by comparing the results to other tools. By using Linux trace APIs to monitor the target binary, we were able to insert measurements around a function call by interrupting the program of the measurement tool. 
As the target program was frozen during the handling of these measurements, we strongly believe that no overhead added by the measurement tool was included in the end result. By running the instrumented benchmark separately from the benchmarks measuring wall-clock time or memory allocation, we are certain that we achieved accurate values for all of our measurements. +As previously mentioned in section \ref{section:method:experiment:phase1:variables}, we were interested in measuring throughput-related values such as CPU cycles, instruction count, and wall-clock time, as well as memory-related measurements such as heap and stack usage. For some of these measurements, we relied on the standard Linux kernel-based API named perf. The API was introduced in Linux 2.6.31 which was released in 2009~\cite{linux:perf-released}. The API has grown and, as is tradition in Linux development, each iteration of the API has been reviewed extensively by multiple people throughout the years. We are confident that the API provides as accurate data as the kernel is able to collect. To make the API usable, we used a lightweight instrumentation tool\footnote{\url{https://github.com/profiling-pqc-kem-thesis/perforator}} which allowed us to use the perf API to measure events for specific regions of code. As with other third-party tools, we validated its function by comparing the results to other tools. By using Linux trace APIs to monitor the target binary, we were able to insert measurements around a function call by interrupting the program of the measurement tool. As the target program was frozen during the handling of these measurements, we strongly believe that no overhead added by the measurement tool was included in the end result. By running the instrumented benchmark separately from the benchmarks measuring wall-clock time or memory allocation, we are certain that we achieved accurate values for all of our measurements. 
When studying the data amassed after applying our toolset for micro-benchmarks, we found that the value 9223372036854775808 occurred a considerable number of times. As it was considerably larger than other values and since we were not expecting similar values for completely different events, we analyzed the fault. Given the size of the problem space, we were unable to identify the root cause. We found that 0.7\% of the values recorded were affected by this issue and that it likely originates in an incorrect handling of unsigned 64-bit integers as the value is one higher than the maximum number a signed 64-bit integer may store. In order to clarify the error, we marked the affected data and ignored them in the data presented in this thesis. Given the low number of affected measurements, we feel confident in our handling of these errors. One set of measurements that did show a considerable number of errors, however, is that for the region syndrome\_asm. For this region, 33\% of the measurements were erroneous. In other regions, about 2\% of the measurements were erroneous. All of our data is published alongside this work for further verification efforts from third parties. diff --git a/chapters/results/main.tex b/chapters/results/main.tex index 103a383..33299af 100644 --- a/chapters/results/main.tex +++ b/chapters/results/main.tex @@ -415,9 +415,11 @@ \subsection{Throughput Performance} \input{chapters/results/throughput/ecdh_25519_keypair} -When looking at the throughput of \gls{mceliece} in environments which do not support \gls{avx2}, such as the IBM Community Cloud presented in figure \ref{figure:results:throughput:mceliece-ibm-community-cloud}, we found that the keypair and decrypt throughput was significantly lower than that of the encrypt stage. The performance of the decrypt stage was roughly $0.1\%$ of the encrypt stage. 
Furthermore, the parameter set with the largest parameters - \gls{mceliece} 8192128f achieved a higher encryption throughput than the \gls{mceliece} 6960119f variant. +When looking at the throughput of \gls{mceliece} in environments which do not support \gls{avx2}, such as the IBM Community Cloud presented in figure \ref{figure:results:throughput:mceliece-ibm-community-cloud}, we found that the keypair and decrypt throughput was significantly lower than that of the encrypt stage. The performance of the decrypt stage was roughly a tenth of that of the encrypt stage. Furthermore, the parameter set with the largest parameters, \gls{mceliece} 8192128f, achieved a higher encryption throughput than the \gls{mceliece} 6960119f variant. -Furthermore, there was a large difference between the encrypt stage of the subjects compiled and optimized using GCC and those using Clang. Lastly, it seemed as if the throughput leveled off after the number of used threads surpassed the number of available threads. As mentioned, the same overall behavior was found in all of the environments using the optimized reference implementation for benchmarking the parallel throughput, such as Old Mid-Range Laptop in figure \ref{figure:results:throughput:mceliece-old-mid-range-laptop}. +Furthermore, there was a large difference between the encrypt stage of the subjects compiled and optimized using GCC and those using Clang. When compiled with GCC, the 8192128f parameter set of \gls{mceliece} had a throughput of about 1.5x that of the Clang implementation. In the case of the 6960119f parameter set, GCC saw a much higher increase in performance than the Clang implementation when the resulting binary was run on multiple threads. + +\noindent Lastly, it seemed as if the throughput leveled off after the number of used threads surpassed the number of available threads. 
As mentioned, the same behavior was found in all of the environments using the optimized reference implementation for benchmarking the parallel throughput, such as Old Mid-Range Laptop in figure \ref{figure:results:throughput:mceliece-old-mid-range-laptop}. \begin{figure} \centering @@ -426,7 +428,7 @@ \subsection{Throughput Performance} \label{figure:results:throughput:mceliece-ibm-community-cloud} \end{figure} -\noindent When \gls{mceliece} ran in environments with support for \gls{avx2}, such as the Modern Workstation shown in figure \ref{figure:results:throughput:mceliece:modern-workstation}, the results were more stable. Not only were the compilers much more consistent with one another, but the difference in throughput between the two variants seemed to become smaller. It is still clear to see, however, that the smallest parameter size had an overall lower throughput for encryption than the largest parameter size tested. The keypair and decrypt results seem to show that the smallest parameter size performed the best. +When \gls{mceliece} ran in environments with support for \gls{avx2}, such as the Modern Workstation shown in figure \ref{figure:results:throughput:mceliece:modern-workstation}, the results were more stable. Not only were the compilers much more consistent with one another, but the difference in throughput between the two variants seemed to become smaller. It is still clear to see, however, that the smallest parameter size had an overall lower throughput for encryption than the largest parameter size tested. The keypair and decrypt results seem to show that the smallest parameter size performed the best. \begin{figure} \centering @@ -439,6 +441,11 @@ \subsection{Throughput Performance} In Table \ref{table:results:throughput:ntru-hrss701-decrypt}, the measurements for various compilers and thread counts are shown for all of the tested environments. 
The top value for each row is the throughput as decryptions per second and the bottom value the relative throughput compared to the GCC implementation using one thread. The best scaling was found in the Modern Workstation environment, followed by the Modern Laptop. Although the tested cloud environments supported \gls{avx2}, they did not see the same performance increase when more threads were used. %For all environments except IBM Community Cloud and Cloud Provider 2, GCC seems to have produced the highest increase in throughput over the reference implementation compiled with GCC, without performance optimizations. +% Please don't ask, the LaTeX gods weren't with us in this case :( +\newpage +\input{chapters/results/throughput/ntru_hrss701_decrypt} +\newpage + \begin{figure} \centering \includegraphics[scale=0.75]{chapters/results/throughput/Modern Workstation_ntru.pdf} @@ -446,8 +453,6 @@ \subsection{Throughput Performance} \label{figure:results:throughput:ntru-modern-workstation} \end{figure} -\input{chapters/results/throughput/ntru_hrss701_decrypt} - \subsection{Micro-benchmarks} \begin{table}[t] @@ -495,8 +500,9 @@ \subsection{Micro-benchmarks} With an average of 69116705 cache misses when the GCC reference implementation was run in the Cloud Provider 2 environment, keypair generation in \gls{mceliece} 8192128 had the largest number of cache misses recorded during our testing. For the same algorithm and compiler configuration, the Modern Workstation had an average of 20403 cache misses. As the 8192128 parameter set of \gls{mceliece} is the systematic variant, it may run further iterations than the semi-systematic variant of \gls{mceliece}. As this is the only difference between the 8192128 and the 8192128f parameter set, it is likely to be the cause of the higher number of cache misses. In the lower end of the spectrum, the Modern Workstation saw an average of a single cache miss for all configurations of \gls{ntru} HPS 4096821 keypair generation. 
The environment had the largest amount of cache out of all the dedicated hardware tested. Using the same configurations, the Cloud Provider 2 environment saw an average of between 1440 and 3468 cache misses. As dedicated hardware had similar numbers of cache misses as the tested cloud environments, it may be difficult to directly determine if the type of environment affects the number of cache misses. -That the \gls{mceliece} implementations had a drastically larger amount of cache misses than the \gls{ntru} implementations was found in all of the tests of the experiment. +That the \gls{mceliece} implementations had a drastically larger number of cache misses than the \gls{ntru} implementations was found in all of the tests. +\newpage \noindent Another value we measured was the number of page-faults that occurred during each stage of the algorithms. As with other micro-benchmarks, this data was collected for the \gls{post-quantum} \glspl{kem} in environments with support for measuring them. \gls{ntru} had zero page-faults in all of the runs, across all implementations and environments. Like \gls{ntru}, \gls{mceliece} had zero page-faults, except during key-pair generation. In Table \ref{table:results:micro:page-faults-mceliece} the mean of each parameter set is presented for the implementations which had page faults. All page faults occurred during the keypair stage and for the GCC compiler. diff --git a/chapters/results/throughput/ntru_hrss701_decrypt.tex b/chapters/results/throughput/ntru_hrss701_decrypt.tex index 682d096..7785416 100644 --- a/chapters/results/throughput/ntru_hrss701_decrypt.tex +++ b/chapters/results/throughput/ntru_hrss701_decrypt.tex @@ -1,4 +1,5 @@ - \begin{table} + \begin{table}[H] + \vspace{6.5em} \centering \footnotesize \caption{Parallel throughput runs for NTRU HRSS701 decryption}