The following features are implemented in our final submission to the OS challenge:
- Multithreading with a thread pool (6 threads)
- A priority-based cost function for scheduling tasks
- Multiprocessing
- Hash tables for caching
All tests have been run on the same computer (using Vagrant). The specifications of the computer are listed below:
Specification | Value |
---|---|
CPU | Intel i7-9750H |
CPU clock speed | 2.6 GHz |
No. of CPU cores | 4 |
RAM amount | 32 GB |
RAM type | 2666 MHz DDR4 |
OS | Windows 11 22H2 |
Although all tests have been conducted on the same machine, there is still the possibility of errors caused by background OS tasks. The tests were run back-to-back.
This overall experiment seeks to figure out how much (if at all) we can benefit from multithreading, i.e. handling multiple client requests simultaneously with threads. The reason we bother with these experiments at all is that the computer running the tests has a multi-core CPU with 4 cores; splitting the workload between these 4 cores should result in a speed boost.
It was clear that we needed some form of parallelization to drastically improve the performance of our server. To start, we decided to implement a multi-threaded version of the server.
In this experiment, we want to compare an implementation using only one process (sequential) with an implementation using multiple threads. The motivation for this experiment is to address the weakness described above. The thread model is somewhat similar to the process model; however, threads are more lightweight than processes, so switching between threads is much faster than switching between processes, and creating and terminating threads is also much faster. Furthermore, threads share the same address space, as opposed to processes.
The implementation of multithreading was done in several steps to improve its performance and make it safe:
First of all, we had to limit the number of threads created to bound CPU usage; beyond a certain point, additional threads no longer help. This is why a thread pool was necessary: it ensures we do not create an unbounded number of threads. We used first-in, first-out (FIFO) scheduling to start and wait for threads, i.e. once we had started the maximum number of threads, we called join() on the thread that was started first, and started the new thread when join() returned.
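Below is a minimal sketch of this FIFO start/join scheme, assuming POSIX threads; `MAX_THREADS`, `handle_request` and the connection argument are illustrative names, not the exact ones from our code.

```c
#include <pthread.h>

#define MAX_THREADS 6                /* pool size (illustrative) */

static pthread_t pool[MAX_THREADS];
static int started = 0;              /* total threads started so far */

/* Placeholder worker: in our server this reverse-hashes the request. */
static void *handle_request(void *conn)
{
    (void)conn;
    return NULL;
}

/* Start a worker for a new connection. Once the pool is full, join the
 * oldest thread (FIFO) before starting a new one in its slot. */
void start_worker(void *conn)
{
    int slot = started % MAX_THREADS;
    if (started >= MAX_THREADS)
        pthread_join(pool[slot], NULL);   /* wait for the first-started thread */
    pthread_create(&pool[slot], NULL, handle_request, conn);
    started++;
}
```

The weakness of this scheme is visible in the sketch: if the first-started thread happens to be handling a long request, newer threads may already be finished while we still wait on it.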
But even though we achieved comfortable performance (see results below), we got lucky: there was still a major bug in the code. The shared data structure we created to queue incoming connections was not thread safe, so we were still subject to race conditions. Hence, we needed to protect the enqueue() and dequeue() calls with a mutex lock.
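A sketch of the guarded queue follows, assuming a simple singly linked list of connections; the payload and names are illustrative.

```c
#include <pthread.h>
#include <stdlib.h>

typedef struct node {
    int conn_fd;                 /* queued connection (illustrative payload) */
    struct node *next;
} node_t;

static node_t *head = NULL, *tail = NULL;
static pthread_mutex_t queue_lock = PTHREAD_MUTEX_INITIALIZER;

/* Append a connection; the lock serialises concurrent enqueuers. */
void enqueue(int conn_fd)
{
    node_t *n = malloc(sizeof *n);
    n->conn_fd = conn_fd;
    n->next = NULL;
    pthread_mutex_lock(&queue_lock);
    if (tail) tail->next = n; else head = n;
    tail = n;
    pthread_mutex_unlock(&queue_lock);
}

/* Pop the oldest connection, or -1 if the queue is empty. */
int dequeue(void)
{
    pthread_mutex_lock(&queue_lock);
    node_t *n = head;
    int fd = -1;
    if (n) {
        head = n->next;
        if (!head) tail = NULL;
        fd = n->conn_fd;
    }
    pthread_mutex_unlock(&queue_lock);
    free(n);                     /* free(NULL) is a no-op */
    return fd;
}
```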
At the beginning of the experiment we had (wrongly) added condition variables to keep threads from busy-waiting on the queue while waiting for it to fill up. In our situation, however, this was not really useful: when a thread released the CPU, there was no one to hand it to, since all the other threads were in the same waiting situation. This would have paid off if several threads had actually been sharing a common piece of work.
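For reference, the condition-variable variant we tried would have looked roughly like this, reusing the queue and lock from the sketch above:

```c
static pthread_cond_t queue_nonempty = PTHREAD_COND_INITIALIZER;

/* Blocking dequeue: sleep instead of spinning while the queue is empty.
 * enqueue() would additionally call pthread_cond_signal(&queue_nonempty). */
int dequeue_blocking(void)
{
    pthread_mutex_lock(&queue_lock);
    while (head == NULL)                  /* re-check predicate on wake-up */
        pthread_cond_wait(&queue_nonempty, &queue_lock);
    node_t *n = head;
    head = n->next;
    if (!head)
        tail = NULL;
    int fd = n->conn_fd;
    pthread_mutex_unlock(&queue_lock);
    free(n);
    return fd;
}
```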
All tests regarding this experiment have been executed on the same computer. We have tried to keep the workload constant during the test session; however, this is nearly impossible, and background activity could introduce errors. The configuration parameters of the client were the following:
Setting | Value |
---|---|
SERVER | 192.168.101.10 |
PORT | 5003 |
SEED | 3435245 |
TOTAL | 100 |
START | 0 |
DIFFICULTY | 30000000 |
REP_PROB_PERCENT | 20 |
DELAY_US | 600000 |
PRIO_LAMBDA | 1.50 |
Below are the results of the tests:
Run | Sequential | Threads |
---|---|---|
First run | 95.220.820 | 24.671.300 |
Second run | 97.668.541 | 25.518.803 |
Third run | 92.979.926 | 24.280.959 |
Average | 95.289.762 | 24.823.687 |
Looking at the results, there is a noticeable difference between the averages: the multi-threaded model is clearly faster than the sequential model. However, there is still a problem with the thread model: it starts more threads than we have CPU cores, which results in context switches. We will now try to fix the problem of having more threads than CPU cores.
In this experiment we want to address the problem of having more threads than CPU cores. The reason this is a problem is that if we start more threads than we have cores, the scheduler has to perform context (thread) switches, which take time even though thread switching is very efficient and much cheaper than process switching. The motivation is to avoid these context switches. This should be as simple as setting the number of threads equal to the number of cores; however, we will try different numbers of threads to see which one actually gives the best performance.
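As a side note, hard-coding the pool size can be avoided: a portable way to query the number of online cores under POSIX is `sysconf`, as in this small sketch:

```c
#include <stdio.h>
#include <unistd.h>

int main(void)
{
    /* Number of processors currently online; returns -1 on failure. */
    long cores = sysconf(_SC_NPROCESSORS_ONLN);
    printf("online cores: %ld\n", cores);
    return 0;
}
```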
Setting | Value |
---|---|
SERVER | 192.168.101.10 |
PORT | 5003 |
SEED | 3435245 |
TOTAL | 100 |
START | 0 |
DIFFICULTY | 30000000 |
REP_PROB_PERCENT | 20 |
DELAY_US | 600000 |
PRIO_LAMBDA | 1.50 |
We ran the experiment with the maximum number of threads set to 3, 4, 5, 6, 7, 8 and 20. The results can be found in the table below. The listed number of threads counts only the threads actively computing client requests; in addition, there is the main thread, which manages the workers and sets up the server initially.
Run | 3 Threads | 4 Threads | 5 Threads | 6 Threads | 7 Threads | 8 Threads | 20 Threads |
---|---|---|---|---|---|---|---|
First Run | 29.801.572 | 24.671.300 | 20.899.400 | 19.063.160 | 20.425.256 | 21.532.380 | 23.463.315 |
Second Run | 29.603.170 | 25.518.803 | 21.005.840 | 23.730.163 | 20.681.164 | 21.664.842 | 22.048.472 |
Third Run | 28.995.181 | 24.280.959 | 20.995.528 | 20.129.427 | 22.083.564 | 21.868.455 | 22.498.278 |
Average | 29.464.635 | 24.818.339 | 20.966.867 | 20.882.172 | 21.050.894 | 21.688.119 | 22.662.390 |
Below is a graphical representation of the results:
Looking at the graph, the best number of threads seems to be 6. We initially expected the best number to be 3 (4 including the main thread), because this would match the number of cores and thereby avoid expensive context switches. However, a valid point is that 5 (6 including the main thread) could give better performance, because the main thread is idle most of the time, so switching between the main thread and a thread handling client requests can be worthwhile. Why 6 is the best is harder to answer; a guess is that starting more threads mitigates the FIFO problem described in the "Process Model vs Thread Model" experiment (idle threads if the first one takes a lot of time).
In this experiment, the goal is to optimize the score by choosing tasks based on their priority. First, we sort the tasks by the priority value alone, and then by the cost function priority / (end - start).
Elements are put into a priority queue, implemented as a linked list. Worker threads then draw from this queue, taking the highest-value computations first, and run them to completion. For these tests, we set the number of threads in the worker pool to 6.
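Below is a sketch of the sorted insert, assuming each task carries the request's priority, start and end values (with end > start); the names are illustrative, and in the real server these calls are additionally protected by the queue mutex described earlier.

```c
#include <stdlib.h>

typedef struct task {
    double priority;             /* p from the request */
    unsigned long start, end;    /* search interval of the request */
    struct task *next;
} task_t;

static task_t *pq_head = NULL;

/* Ordering cost: priority divided by the interval length, so short,
 * high-priority requests come first. Assumes end > start. */
static double cost(const task_t *t)
{
    return t->priority / (double)(t->end - t->start);
}

/* Insert in descending cost order: O(n) per insert (insertion sort). */
void pq_insert(task_t *t)
{
    task_t **p = &pq_head;
    while (*p && cost(*p) >= cost(t))
        p = &(*p)->next;
    t->next = *p;
    *p = t;
}

/* Workers pop the highest-cost task from the front. */
task_t *pq_pop(void)
{
    task_t *t = pq_head;
    if (t)
        pq_head = t->next;
    return t;
}
```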
Setting | Value |
---|---|
SERVER | 192.168.101.10 |
PORT | 5003 |
SEED | 3435245 |
TOTAL | 100 |
START | 0 |
DIFFICULTY | 30000000 |
REP_PROB_PERCENT | 20 |
DELAY_US | 600000 |
PRIO_LAMBDA | 1.50 |
Run | FIFO | Priority queue sorted by priority p | Priority queue sorted by p / (end - start) |
---|---|---|---|
First Run | 19.063.160 | 18.648.245 | 17.955.248 |
Second Run | 23.730.163 | 19.413.889 | 17.968.412 |
Third Run | 20.129.427 | 18.515.116 | 18.045.548 |
Average | 20.882.172 | 18.854.961 | 17.989.692 |
Below is a graphical representation of the results:
Using the 6-thread pool, we achieve a noticeably better score than with FIFO. Moreover, taking the distance between start and end into account gives a further slight gain in performance.
Due to the structure of the priority queue - a linked list - this code is most efficient when requests arrive at roughly the speed at which they can be solved, or slower. If requests come in too fast, each one must be inserted into the linked list individually, leading to the O(n^2) behaviour of insertion sort. This overhead bogs down the main thread and might interfere with network receive operations.
The result of this experiment is as expected: a more intelligent scheduling algorithm outperforms a simple FIFO.
The aim of this experiment is to test another type of parallelization: we want to compare concurrent implementations using multiple threads and multiple processes. Even though we expect the multi-process version to be slower than the multi-threaded one, we want to give it a try. Of course, in this experiment we keep exactly the same configuration as the last kept version (priority queue, etc.); we only replace multithreading with multiprocessing.
The process model works as follows: every time the server receives a client request, a new process is created using the fork system call. We need to choose wisely how many processes may live simultaneously; in our case it is four, since the computer running the tests has a multi-core CPU with 4 cores. This way there is no shared memory (no critical sections) and thus no need for synchronisation; if we want to increase the number of processes, we will need to protect the critical sections. We will see later what the optimal choice is.
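A sketch of this fork-per-request scheme with a bounded number of live children follows, assuming POSIX; `MAX_PROCS` and the worker are illustrative, and error handling is omitted for brevity.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#define MAX_PROCS 4                  /* matches the 4 CPU cores */

static int live = 0;                 /* children currently running */

/* Placeholder: in our server this reverse-hashes and answers the request. */
static void handle_request_in_child(int conn_fd)
{
    (void)conn_fd;
}

/* Fork one child per request; once MAX_PROCS children are alive,
 * reap one before forking again, so at most MAX_PROCS run at a time. */
void serve(int conn_fd)
{
    if (live >= MAX_PROCS) {
        wait(NULL);                  /* block until some child finishes */
        live--;
    }
    pid_t pid = fork();
    if (pid == 0) {                  /* child: do the work, then exit */
        handle_request_in_child(conn_fd);
        _exit(0);
    }
    live++;                          /* parent keeps accepting connections */
}
```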
In this section we seek to find the optimal number of processes.
All tests regarding this experiment have been executed on the same computer. We have tried to keep the workload constant during the test session; however, this is nearly impossible, and background activity could introduce errors. The configuration parameters of the client were the following:
Setting | Value |
---|---|
SERVER | 192.168.101.10 |
PORT | 5003 |
SEED | 3435245 |
TOTAL | 100 |
START | 0 |
DIFFICULTY | 30000000 |
REP_PROB_PERCENT | 20 |
DELAY_US | 600000 |
PRIO_LAMBDA | 1.50 |
Below are the results of the tests. Note that when the number of processes exceeds 4, protection of the critical regions has been implemented.
Run | 4 Processes | 5 Processes | 10 Processes | 20 Processes | 50 Processes | 100 Processes |
---|---|---|---|---|---|---|
First Run | 43.490.995 | 56.853.092 | 28.273.468 | 25.301.062 | 32.465.741 | 38.502.073 |
Second Run | 28.869.131 | 32.835.505 | 33.762.657 | 29.910.688 | 39.938.347 | 33.823.337 |
Third Run | 32.406.480 | 33.714.794 | 26.862.934 | 37.785.497 | 34.472.936 | 33.482.823 |
Average | 34.922.202 | 41.134.463 | 29.633.019 | 30.999.082 | 35.625.674 | 35.269.411 |
Below is a graphical representation of the results:
Looking at the graph, the best number of processes seems to be between 10 and 20. We initially expected the best number to be 3 (4 including the parent process), because this would match the number of cores. As in the thread experiment, a plausible explanation is that extra processes keep the cores busy while others are blocked on long-running requests.
In this section we compare the two parallelization techniques: we run both models three times with their respective optimal numbers of threads/processes.
All tests regarding this experiment have been executed on the same computer. We have tried to keep the workload constant during the test session; however, this is nearly impossible, and background activity could introduce errors. The configuration parameters of the client were the following:
Setting | Value |
---|---|
SERVER | 192.168.101.10 |
PORT | 5003 |
SEED | 3435245 |
TOTAL | 100 |
START | 0 |
DIFFICULTY | 30000000 |
REP_PROB_PERCENT | 20 |
DELAY_US | 600000 |
PRIO_LAMBDA | 1.50 |
Below are the results of the tests. Note that we used 6 threads and 10 processes.
Run | Processes | Threads |
---|---|---|
First run | 28.273.468 | 18.689.768 |
Second run | 33.762.657 | 23.543.790 |
Third run | 26.862.934 | 20.635.433 |
Average | 29.633.019 | 20.956.330 |
Below is a graphical representation of the results:
Looking at the results, there is a noticeable difference between the averages: the thread model is approximately 1.4 times faster than the process model. That was expected, since threads are more lightweight than processes: switching between threads is much faster than switching between processes, and creating and terminating threads is also much faster. Furthermore, threads share the same address space, as opposed to processes. Therefore, we will not keep the multi-process option in the final configuration.
Repetition of tasks or events is common in our daily lives, for example accessing the DTU web page multiple times during the same day. Web caching was invented for this purpose: it is the activity of storing data for reuse, such as a copy of a web page served by a web server. The page is cached the first time a user visits it, and the next time the same page is requested, the cache serves the copy. This keeps the origin server from getting overloaded, speeds up page delivery significantly, and reduces the work the server has to do.
The analogy between this example and our OS-Challenge server, which can receive a duplicate of a previously sent request, is clear. One way to avoid computing the same reverse hash twice is to cache the previous hashes together with their original values: whenever a request is repeated, there is no need to waste time on decoding, and we can return the right answer much faster.
In practice, several data structures can be used for caching, but in our project we implemented a hash table, which is known to be more efficient than the alternatives since the insert and search operations have time complexity O(1). The idea was to implement an array of structs containing both the hash and its corresponding value. We also used closed hashing (open addressing), a collision-resolution method that searches through alternative locations in the array until either the target record is found or an unused slot is reached, which indicates that the key is not in the table. (The key here is the hash.)
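Below is a sketch of such a table with linear probing, assuming (as in the OS challenge protocol) a SHA-256 digest as the key and a 64-bit answer as the value; the table size and names are illustrative, and the table is assumed to be sized generously enough that it never fills up.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define TABLE_SIZE 65536                  /* illustrative; power of two */

typedef struct {
    uint8_t  hash[32];                    /* SHA-256 digest = key */
    uint64_t answer;                      /* the reversed value */
    bool     used;
} entry_t;

static entry_t table[TABLE_SIZE];

/* Derive a slot from the first bytes of the digest (already uniform). */
static size_t slot_of(const uint8_t hash[32])
{
    uint64_t h;
    memcpy(&h, hash, sizeof h);
    return (size_t)(h & (TABLE_SIZE - 1));
}

/* Linear probing: step to the next slot until we hit the key or a hole. */
bool cache_lookup(const uint8_t hash[32], uint64_t *answer)
{
    for (size_t i = slot_of(hash); table[i].used; i = (i + 1) & (TABLE_SIZE - 1)) {
        if (memcmp(table[i].hash, hash, 32) == 0) {
            *answer = table[i].answer;
            return true;
        }
    }
    return false;
}

void cache_insert(const uint8_t hash[32], uint64_t answer)
{
    size_t i = slot_of(hash);
    while (table[i].used && memcmp(table[i].hash, hash, 32) != 0)
        i = (i + 1) & (TABLE_SIZE - 1);
    memcpy(table[i].hash, hash, 32);
    table[i].answer = answer;
    table[i].used = true;
}
```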
All tests regarding this experiment have been executed on the same machine.
Setting | Value |
---|---|
SERVER | 192.168.101.10 |
PORT | 5003 |
SEED | 3435245 |
TOTAL | 100 |
START | 0 |
DIFFICULTY | 30000000 |
REP_PROB_PERCENT | 20 |
DELAY_US | 600000 |
PRIO_LAMBDA | 1.50 |
Below are the results of the tests:
Run | Sequential without hash table | Sequential with hash table |
---|---|---|
First run | 95.220.820 | 71.792.487 |
Second run | 97.668.541 | 72.894.615 |
Third run | 92.979.926 | 70.524.722 |
Average | 95.289.762 | 71.737.275 |
Below is a graphical representation of the results:
We can see that using a hash table is faster than reverse-hashing every request anew. The difference is not very large, since the repetition probability is only 20%, but it is still better than the plain sequential model. NB: below we increased the repetition probability; as expected, the results improve.
Run | repetition probability 20% | repetition probability 50% |
---|---|---|
First run | 71.792.487 | 41.191.500 |
Second run | 72.894.615 | 43.275.962 |
Third run | 70.524.722 | 40.760.749 |
Average | 71.737.275 | 41.742.737 |
Below is a graphical representation of the results:
As we saw previously, both multithreading and caching have a positive impact on the server's performance: each makes the server faster at answering client requests. That is why we decided to combine these two features in our final version for the OS-Challenge.
Since the hash table is used by multiple threads at the same time, it requires some kind of synchronization, which must guarantee consistency of the table under parallel inserts, updates and queries. To do so, we use locks.
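A sketch of the guarded accessors, reusing the cache from the previous section with a single global lock (finer-grained, per-bucket locking would also be possible but was not needed here):

```c
#include <pthread.h>

static pthread_mutex_t cache_lock = PTHREAD_MUTEX_INITIALIZER;

/* Thread-safe wrappers: one lock serialises all cache accesses. */
bool cache_lookup_locked(const uint8_t hash[32], uint64_t *answer)
{
    pthread_mutex_lock(&cache_lock);
    bool hit = cache_lookup(hash, answer);
    pthread_mutex_unlock(&cache_lock);
    return hit;
}

void cache_insert_locked(const uint8_t hash[32], uint64_t answer)
{
    pthread_mutex_lock(&cache_lock);
    cache_insert(hash, answer);
    pthread_mutex_unlock(&cache_lock);
}
```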
Setting | Value |
---|---|
SERVER | 192.168.101.10 |
PORT | 5003 |
SEED | 3435245 |
TOTAL | 100 |
START | 0 |
DIFFICULTY | 30000000 |
REP_PROB_PERCENT | 20 |
DELAY_US | 600000 |
PRIO_LAMBDA | 1.50 |
Below are the results of the tests:
Run | Sequential | Multi-threading without hash table | Multi-threading with hash table |
---|---|---|---|
First run | 95.220.820 | 18.689.768 | 18.958.127 |
Second run | 97.668.541 | 23.543.790 | 17.198.736 |
Third run | 92.979.926 | 20.635.433 | 17.763.928 |
Average | 95.289.762 | 20.956.330 | 17.973.597 |
Below is a graphical representation of the results:
The results show that a performance boost can be achieved by combining multithreading and caching. The improvement seems slight, since the repetition probability is not high, but we still cannot ignore it. We also tested with a repetition probability of 50%; these are the results:
Run | repetition probability 20% | repetition probability 50% |
---|---|---|
First run | 18.958.127 | 4.855.266 |
Second run | 17.198.736 | 4.974.597 |
Third run | 17.763.928 | 4.873.498 |
Average | 17.973.597 | 4.901.121 |
And the corresponding graphical representation:
Even though there are other experiments that could have a good impact on the server, multithreading with an optimized number of threads is the one that gives the best results.