Merge main with riccardo (#1)
* basic implementation

* need to pass in the port and add host

* Caching + removing extensions

* Dump

* dump

* created scripts

* no changes.

* working on scripts

* Add deploy and run scripts. Add README description.

* added the command line parsing

* work in progress

* fixed command line args parsing

* work in progress

* DNS stuff

* Dump TODO

* SCRIPTS ARE RUNNING!

* Added grading beacon endpoint

* edited the echo messages

* added heap library for reference

* implemented the deploy caching stage

* Lay the groundwork for CDN cache

* Load deploy-time cache into RAM

* Handle cache miss (need to fetch from origin)

* Implement disk cache hit and promotion logic

* Dump

* Enforce memory limits

* New caching strategy

Treat memory and disk as the same tier. Promotion is only attempted when adding an un-cached article that was fetched from the origin. No promotion between disk and memory.
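
A minimal sketch of that strategy, using hypothetical tier dicts and byte budgets (the real RepliCache also handles eviction):

```python
def add_new_article(memory: dict, disk: dict, budget: dict, path: str, article: bytes) -> None:
    """Place a newly fetched article in memory if it fits, else on disk;
    entries already cached are never moved between tiers."""
    if budget["memory"] >= len(article):
        memory[path] = article
        budget["memory"] -= len(article)
    elif budget["disk"] >= len(article):
        disk[path] = article
        budget["disk"] -= len(article)
    # otherwise the article is served without being cached
```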

* Refactor logic for attempting to evict and add to disk
and memory into their own functions

* Rename BUFFER_OFFSET_AUTO to APPEND

* Wire up the cache to the HTTP server, respond with 404 for
media requests

* RepliCache.get now returns a boolean along with the data to
indicate whether the key was found. Also fixed a bug where the
uncompressed article was cached after fetching from the origin

* added replicas and cache.py to deploy script

* saving clients' IP addresses in a global set

* Make ON_DISK=-1 and NOT_CACHED=-2. Makes more semantic sense.

* Refactor Resolver to include replicas and client ips.

* Add server locations

* changed scope of replicas and client_ips

* Serve a random replica from dns server.

* fixed bash list

* implemented measure function that returns a dictionary containing rtt and destination IP address

* GeoIP CSV hacking DUMP

* Deploy script now concurrently builds cache on all replicas

* Implement haversine

* database implementation

* db location implementation done

* Add /measure endpoint

* fixed geo.py bug

* Process each request on a separate thread + Run measurements on a separate thread

* Implement best replica map.

* Add threads to dns get measurements functionality.

* Testing measurements in dns server.

* added db to deploy script for dns server

* Testing measure endpoint functionality.

* merge

* Change REPLICAS to be a dict

* merge

* latest updates

* edit run script.

* fixing bug in run script

* edit run script.

* edit stop script.

* Edit stop script.

* dump

* downloaded cache and modified .gitignore

* latest commit

* Fix routing table key bug in dnsserver

* Store respective log files on individual servers

* Ridiculously fast full-scan script

* Add a few scripts for housekeeping

* Add /debug_cache endpoint to HTTP server to quickly debug the cache

* Be smart with URL encodings

First check if the path is already URL-encoded. If not,
encode it before checking the cache. If yes, nothing to do.
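
A hypothetical sketch of that check (normalize_path is not a name from this diff):

```python
from urllib.parse import quote, unquote

def normalize_path(path: str) -> str:
    """If decoding changes the path, it was already URL-encoded: use it
    as-is. Otherwise, encode it before the cache lookup."""
    return path if unquote(path) != path else quote(path)
```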

* Fix the scamper "avg" thing

* Delete unnecessary vendor stuff

* Return the right A record for DNS

* added pydocs and comments

* modified return type of get_local_ip

* Sanity check test script

* Add stage latency and pause info to sanity_check

* Bump time to sleep to 20

* updates

* Exclude maxsize RTT from map.

Co-authored-by: Rohit Awate <[email protected]>
Co-authored-by: Diego <[email protected]>
3 people authored Dec 12, 2022
1 parent 1ea642b commit b0bc85c
Showing 256 changed files with 3,972 additions and 1 deletion.
7 changes: 7 additions & 0 deletions .gitignore
@@ -1,4 +1,5 @@
.idea
scamper-cvs-20211212d
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
@@ -128,3 +129,9 @@ dmypy.json

# Pyre type checker
.pyre/


# Custom directories
.vscode/
.idea/
logs/
Binary file added GeoLite2-City.mmdb
4 changes: 4 additions & 0 deletions Makefile
@@ -0,0 +1,4 @@
permissions:
chmod u+x ./deployCDN
chmod u+x ./runCDN
chmod u+x ./stopCDN
28 changes: 27 additions & 1 deletion README.md
@@ -1 +1,27 @@
# CS5700-Project-4

## High Level

### DNS Server

The DNS resolver takes a name from the client, checks that the name is within its jurisdiction, and returns a replica IP address. The auxiliary library dnslib is used to implement the core resolver functionality.
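
A minimal sketch of that flow with dnslib; the CDN name, port, and replica IPs below are placeholders, and picking a random replica mirrors only the early "Serve a random replica" commit above:

```python
import random
import socket

from dnslib import A, DNSRecord, QTYPE, RR

CDN_NAME = "cs5700cdn.example.com."          # placeholder name we answer for
REPLICAS = ["198.51.100.1", "198.51.100.2"]  # placeholder replica IPs

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.bind(("0.0.0.0", 20000))                # placeholder port

while True:
    data, addr = sock.recvfrom(512)
    request = DNSRecord.parse(data)
    qname = str(request.q.qname)
    if qname == CDN_NAME and request.q.qtype == QTYPE.A:  # within jurisdiction
        reply = request.reply()
        reply.add_answer(RR(qname, QTYPE.A, rdata=A(random.choice(REPLICAS)), ttl=30))
        sock.sendto(reply.pack(), addr)
```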

### HTTP Replica Server

Each replica runs an HTTP server to process client requests. It uses a page-views CSV file to determine caching priorities. When serving content to a client, if the content has not yet been cached in memory, the server fetches it directly from the origin.
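
A minimal sketch of that lookup order; the real cache's memory and disk tiers and size limits are simplified to a plain dict here:

```python
from urllib.request import urlopen

ORIGIN = "http://cs5700cdnorigin.ccs.neu.edu:8080"  # origin host and port, as in build_cache.py below
CACHE = {}  # path -> article bytes; stand-in for the memory/disk cache

def serve(path: str) -> bytes:
    """Check the cache first; fall back to the origin on a miss."""
    data = CACHE.get(path)
    if data is None:            # cache miss: fetch directly from the origin
        with urlopen(f"{ORIGIN}/{path}") as response:
            data = response.read()
        CACHE[path] = data      # the real server also enforces memory limits
    return data
```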

The server also has a measurements endpoint, which the DNS resolver uses to get active measurements from all replicas.
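
The commit history above indicates scamper performs the actual measurements; the simplified stand-in below times a TCP connect instead (the function name and port are assumptions):

```python
import socket
import time

def measure(client_ip: str, port: int = 80) -> dict:
    """Estimate the RTT to an address by timing a TCP connect."""
    start = time.monotonic()
    try:
        with socket.create_connection((client_ip, port), timeout=2):
            rtt = (time.monotonic() - start) * 1000  # milliseconds
    except OSError:
        rtt = float("inf")  # unreachable; such RTTs are excluded from the replica map
    return {"rtt": rtt, "ip": client_ip}
```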

## Challenges

The following are key challenges we faced, and how they were addressed:

1. Serving the best replica to a client: we want to deliver the best replica to a client without slowing down resolution. If the resolver pinged each replica to determine the best RTT the moment a client first connects, that first lookup would be slow. Instead, the resolver initially serves the geographically closest replica (see the sketch below), then pings the replicas in the background to determine the best replica to serve to that client on subsequent requests, once the initial TTL expires.
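
A sketch of the closest-replica fallback; the replica coordinate map is hypothetical, with locations presumably resolved via the bundled GeoLite2 database:

```python
from math import asin, cos, radians, sin, sqrt

def haversine(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometers between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))

def closest_replica(client: tuple, replicas: dict) -> str:
    """Pick the replica IP whose (lat, lon) is nearest the client's."""
    return min(replicas, key=lambda ip: haversine(*client, *replicas[ip]))
```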

## Work Distribution

Together, we all strategized and contributed to all parts, with each person spending a bit more time developing the following:

- Diego: worked on the deploy/run/stop scripts
- Ricardo: worked on the HTTP server
- Rohit: worked on the DNS server
28 changes: 28 additions & 0 deletions TODO.md
@@ -0,0 +1,28 @@
- add an endpoint to the HTTP server which reports the current CPU usage
- save log file on the server instead of the local machine
- heapq to

maxmind username: [email protected]
maxmind password: cs5700p4123


DONE:
- ask professor how to do active measurements without having access to clients
- WHAT to ping from the HTTP servers?

TO ASK:
- Can we assume the client is capable of accepting a gzip-compressed file using the Content-Encoding: gzip header?
- How long can the deploy stage take? We want to pre-cache top pages beforehand on disk and load them at runtime.
- This should take a minute or two during the deploy stage. Is that acceptable?

Add another layer of indirection by not using article names anywhere internally within the cache. Have a dictionary
at the entry which maps a string to an int or something which is lighter to duplicate within a LookupInfo object

How to test this out?
- easy to test: uncached and in-memory cache
- moderately hard to test: on disk cache
- first request will lead to uncached, which gets cached to disk
- subsequent requests should be served from disk
- hard to test: promotion
- setup a test example
-
66 changes: 66 additions & 0 deletions build_cache.py
@@ -0,0 +1,66 @@
import csv
import os
import socket
from argparse import Namespace, ArgumentParser
from urllib.error import HTTPError
from urllib.parse import quote
from urllib.request import urlopen

import utils

ORIGIN_SERVER = "cs5700cdnorigin.ccs.neu.edu"
ORIGIN_PORT = "8080"


def fetch_from_origin(path: str) -> bytes:
"""
Fetches the articles objects from the origin server.
:param path: the URL path to the article object
:return: the article object
"""

with urlopen(
"http://" + ORIGIN_SERVER + ":" + ORIGIN_PORT + "/" + path
) as response:
return response.read()


def parse_args() -> Namespace:
"""
Parses the command line arguments.
:return: the command line arguments
"""

parser = ArgumentParser()
parser.add_argument("-o", type=str, help="the name of the origin server")
args = parser.parse_args()
return args


def main():
if not os.path.exists("cache"): # if cache directory doesn't exist already
os.mkdir("cache") # create it
global ORIGIN_SERVER
args = parse_args()
ORIGIN_SERVER = args.o
with open("pageviews.csv") as article_file: # cache the most viewed articles
reader = csv.DictReader(article_file)
for row in reader:
try:
path = quote(row["article"].replace(" ", "_")) # replace spaces with underscores
article = fetch_from_origin(path) # fetch article from origin
compressed_article = utils.compress_article(article) # compress article
with open(f"cache/{path}", "wb") as cache_file: # save the compressed article as binary
cache_file.write(compressed_article)
except HTTPError as e: # error in fetching the article
print(e)
print(f"Error for this article: {row['article']}")
except IOError: # disk quota reached
print(f"Caching complete: {socket.gethostname()}")
break


if __name__ == "__main__":
main()
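
The utils.compress_article helper imported above is not shown in this section of the diff; a minimal sketch, assuming plain gzip (per the Content-Encoding: gzip question in TODO.md):

```python
import gzip

def compress_article(article: bytes) -> bytes:
    """Assumed helper: gzip-compress raw article bytes before they are
    written to the disk cache."""
    return gzip.compress(article)
```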
