Merge main with riccardo (#1)
* basic implementation

* need to pass in the port and add host

* Caching + removing extensions

* Dump

* dump

* created scripts

* no changes.

* working on scripts

* Add deploy and run scripts. Add README description.

* added the command line parsing

* work in progress

* fixed command line args parsing

* work in progress

* DNS stuff

* Dump TODO

* SCRIPTS ARE RUNNING!

* Added grading beacon endpoint

* edited the echo messages

* added heap library for reference

* implemented the deploy caching stage

* Lay the groundwork for CDN cache

* Load deploy-time cache into RAM

* Handle cache miss (need to fetch from origin)

* Implement disk cache hit and promotion logic

* Dump

* Enforce memory limits

* New caching strategy

Treat memory and disk as the same tier. Promotion is only attempted when adding an un-cached article that was fetched from the origin. No promotion between disk and memory.
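
A minimal sketch of that strategy, using hypothetical tier dicts and byte budgets (the real RepliCache also handles eviction):

```python
def add_new_article(memory: dict, disk: dict, budget: dict, path: str, article: bytes) -> None:
    """Place a newly fetched article in memory if it fits, else on disk;
    entries already cached are never moved between tiers."""
    if budget["memory"] >= len(article):
        memory[path] = article
        budget["memory"] -= len(article)
    elif budget["disk"] >= len(article):
        disk[path] = article
        budget["disk"] -= len(article)
    # otherwise the article is served without being cached
```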

* Refactor logic for attempting to evict and add to disk
and memory into their own functions

* Rename BUFFER_OFFSET_AUTO to APPEND

* Wire up the cache to the HTTP server, respond with 404 for
media requests

* RepliCache.get now returns a boolean along with the data to
indicate whether the key was found. Also fixed a bug where the
uncompressed article was cached after fetching from the origin

* added replicas and cache.py to deploy script

* saving clients' IP addresses in a global set

* Make ON_DISK=-1 and NOT_CACHED=-2. Makes more semantic sense.

* Refactor Resolver to include replicas and client ips.

* Add server locations

* changed scope of replicas and client_ips

* Serve a random replica from dns server.

* fixed bash list

* implemented measure function that returns a dictionary containing rtt and destination IP address

* GeoIP CSV hacking DUMP

* Deploy script now concurrently builds cache on all replicas

* Implement haversine

* database implementation

* db location implementation done

* Add /measure endpoint

* fixed geo.py bug

* Process each request on a separate thread + Run measurements on a separate thread

* Implement best replica map.

* Add threads to dns get measurements functionality.

* Testing measurements in dns server.

* added db to deploy script for dns server

* Testing measure endpoint functionality.

* merge

* Change REPLICAS to be a dict

* merge

* latest updates

* edit run script.

* fixing bug in run script

* edit run script.

* edit stop script.

* Edit stop script.

* dump

* downloaded cache and modified .gitignore

* latest commit

* Fix routing table key bug in dnsserver

* Store respective log files on individual servers

* Ridiculously fast full-scan script

* Add a few scripts for housekeeping

* Add /debug_cache endpoint to HTTP server to quickly debug the cache

* Be smart with URL encodings

First check if the path is already URL-encoded. If not,
encode it before checking the cache. If yes, nothing to do.
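
A hypothetical sketch of that check (normalize_path is not a name from this diff):

```python
from urllib.parse import quote, unquote

def normalize_path(path: str) -> str:
    """If decoding changes the path, it was already URL-encoded: use it
    as-is. Otherwise, encode it before the cache lookup."""
    return path if unquote(path) != path else quote(path)
```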

* Fix the scamper "avg" thing

* Delete unnecessary vendor stuff

* Return the right A record for DNS

* added pydocs and comments

* modified return type of get_local_ip

* Sanity check test script

* Add stage latency and pause info to sanity_check

* Bump time to sleep to 20

* updates

* Exclude maxsize RTT from map.

Co-authored-by: Rohit Awate <[email protected]>
Co-authored-by: Diego <[email protected]>
3 people authored Dec 12, 2022
1 parent 1ea642b commit b0bc85c
Showing 256 changed files with 3,972 additions and 1 deletion.
7 changes: 7 additions & 0 deletions .gitignore
@@ -1,4 +1,5 @@
.idea
scamper-cvs-20211212d
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
@@ -128,3 +129,9 @@ dmypy.json

# Pyre type checker
.pyre/


# Custom directories
.vscode/
.idea/
logs/
Binary file added GeoLite2-City.mmdb
4 changes: 4 additions & 0 deletions Makefile
@@ -0,0 +1,4 @@
permissions:
chmod u+x ./deployCDN
chmod u+x ./runCDN
chmod u+x ./stopCDN
28 changes: 27 additions & 1 deletion README.md
@@ -1 +1,27 @@
# CS5700-Project-4

## High Level

### DNS Server

The DNS resolver takes a name from the client, checks that the name is within its jurisdiction, and returns a replica IP address. The auxiliary library dnslib is used to implement the core resolver functionality.
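
A minimal sketch of that flow with dnslib; the CDN name, port, and replica IPs below are placeholders, and picking a random replica mirrors only the early "Serve a random replica" commit above:

```python
import random
import socket

from dnslib import A, DNSRecord, QTYPE, RR

CDN_NAME = "cs5700cdn.example.com."          # placeholder name we answer for
REPLICAS = ["198.51.100.1", "198.51.100.2"]  # placeholder replica IPs

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
sock.bind(("0.0.0.0", 20000))                # placeholder port

while True:
    data, addr = sock.recvfrom(512)
    request = DNSRecord.parse(data)
    qname = str(request.q.qname)
    if qname == CDN_NAME and request.q.qtype == QTYPE.A:  # within jurisdiction
        reply = request.reply()
        reply.add_answer(RR(qname, QTYPE.A, rdata=A(random.choice(REPLICAS)), ttl=30))
        sock.sendto(reply.pack(), addr)
```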

### HTTP Replica Server

Each replica runs an HTTP server to process client requests. It uses a page-views CSV file to determine caching priorities. When serving content to a client, if the content has not yet been cached in memory, the server fetches it directly from the origin.
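
A minimal sketch of that lookup order; the real cache's memory and disk tiers and size limits are simplified to a plain dict here:

```python
from urllib.request import urlopen

ORIGIN = "http://cs5700cdnorigin.ccs.neu.edu:8080"  # origin host and port, as in build_cache.py below
CACHE = {}  # path -> article bytes; stand-in for the memory/disk cache

def serve(path: str) -> bytes:
    """Check the cache first; fall back to the origin on a miss."""
    data = CACHE.get(path)
    if data is None:            # cache miss: fetch directly from the origin
        with urlopen(f"{ORIGIN}/{path}") as response:
            data = response.read()
        CACHE[path] = data      # the real server also enforces memory limits
    return data
```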

The server also has a measurements endpoint, which the DNS resolver uses to get active measurements from all replicas.
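
The commit history above indicates scamper performs the actual measurements; the simplified stand-in below times a TCP connect instead (the function name and port are assumptions):

```python
import socket
import time

def measure(client_ip: str, port: int = 80) -> dict:
    """Estimate the RTT to an address by timing a TCP connect."""
    start = time.monotonic()
    try:
        with socket.create_connection((client_ip, port), timeout=2):
            rtt = (time.monotonic() - start) * 1000  # milliseconds
    except OSError:
        rtt = float("inf")  # unreachable; such RTTs are excluded from the replica map
    return {"rtt": rtt, "ip": client_ip}
```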

## Challenges

The following are key challenges we faced, and how they were addressed:

1. Serving the best replica to a client: we want to deliver the best replica to a client without slowing down resolution. If the resolver pinged each replica to determine the best RTT the moment a client first connects, that first lookup would be slow. Instead, the resolver initially serves the geographically closest replica (see the sketch below), then pings the replicas in the background to determine the best replica to serve to that client on subsequent requests, once the initial TTL expires.
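
A sketch of the closest-replica fallback; the replica coordinate map is hypothetical, with locations presumably resolved via the bundled GeoLite2 database:

```python
from math import asin, cos, radians, sin, sqrt

def haversine(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometers between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371 * asin(sqrt(a))

def closest_replica(client: tuple, replicas: dict) -> str:
    """Pick the replica IP whose (lat, lon) is nearest the client's."""
    return min(replicas, key=lambda ip: haversine(*client, *replicas[ip]))
```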

## Work Distribution

Together, we all strategized and contributed to all parts, with each person spending a bit more time developing the following:

- Diego: worked on the deploy/run/stop scripts
- Ricardo: worked on the HTTP server
- Rohit: worked on the DNS server
28 changes: 28 additions & 0 deletions TODO.md
@@ -0,0 +1,28 @@
- add an endpoint to the HTTP server which reports the current CPU usage
- save log file on the server instead of the local machine
- heapq to

maxmind username: [email protected]
maxmind password: cs5700p4123


DONE:
- ask professor how to do active measurements without having access to clients
- WHAT to ping from the HTTP servers?

TO ASK:
- Can we assume the client is capable of accepting a gzip-compressed file using the Content-Encoding: gzip header?
- How long can the deploy stage take? We want to pre-cache top pages beforehand on disk and load them at runtime.
- This should take a minute or two during the deploy stage. Is that acceptable?

Add another layer of indirection by not using article names anywhere internally within the cache. Have a dictionary
at the entry which maps a string to an int or something which is lighter to duplicate within a LookupInfo object

How to test this out?
- easy to test: uncached and in-memory cache
- moderately hard to test: on disk cache
- first request will lead to uncached, which gets cached to disk
- subsequent requests should be served from disk
- hard to test: promotion
- setup a test example
-
66 changes: 66 additions & 0 deletions build_cache.py
@@ -0,0 +1,66 @@
import csv
import os
import socket
from argparse import Namespace, ArgumentParser
from urllib.error import HTTPError
from urllib.parse import quote
from urllib.request import urlopen

import utils

ORIGIN_SERVER = "cs5700cdnorigin.ccs.neu.edu"
ORIGIN_PORT = "8080"


def fetch_from_origin(path: str) -> bytes:
"""
Fetches the articles objects from the origin server.
:param path: the URL path to the article object
:return: the article object
"""

with urlopen(
"http://" + ORIGIN_SERVER + ":" + ORIGIN_PORT + "/" + path
) as response:
return response.read()


def parse_args() -> Namespace:
"""
Parses the command line arguments.
:return: the command line arguments
"""

parser = ArgumentParser()
parser.add_argument("-o", type=str, help="the name of the origin server")
args = parser.parse_args()
return args


def main():
if not os.path.exists("cache"): # if cache directory doesn't exist already
os.mkdir("cache") # create it
global ORIGIN_SERVER
args = parse_args()
ORIGIN_SERVER = args.o
with open("pageviews.csv") as article_file: # cache the most viewed articles
reader = csv.DictReader(article_file)
for row in reader:
try:
path = quote(row["article"].replace(" ", "_")) # replace spaces with underscores
article = fetch_from_origin(path) # fetch article from origin
compressed_article = utils.compress_article(article) # compress article
with open(f"cache/{path}", "wb") as cache_file: # save the compressed article as binary
cache_file.write(compressed_article)
except HTTPError as e: # error in fetching the article
print(e)
print(f"Error for this article: {row['article']}")
except IOError: # disk quota reached
print(f"Caching complete: {socket.gethostname()}")
break


if __name__ == "__main__":
main()
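
The utils.compress_article helper imported above is not shown in this section of the diff; a minimal sketch, assuming plain gzip (per the Content-Encoding: gzip question in TODO.md):

```python
import gzip

def compress_article(article: bytes) -> bytes:
    """Assumed helper: gzip-compress raw article bytes before they are
    written to the disk cache."""
    return gzip.compress(article)
```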
