Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Redis Support for FedScale #170

Draft
wants to merge 13 commits into
base: master
Choose a base branch
from
Draft

Redis Support for FedScale #170

wants to merge 13 commits into from

Conversation

xuyehe
Copy link

@xuyehe xuyehe commented Aug 23, 2022

Why are these changes needed?

To integrate Redis into aggregator for saving aggregation data.

Related issue number

N/A

Checks

  • I've included any doc changes needed for https://fedscale.readthedocs.io/en/latest/
  • I've made sure the following tests are passing.
  • Testing Configurations
    • Dry Run (20 training rounds & 1 evaluation round)
    • Cifar 10 (20 training rounds & 1 evaluation round)
    • Femnist (20 training rounds & 1 evaluation round)

Note:

  1. All tests are run on cpu.
  2. Met with the following KeyError bug on the same line in Femnist once with Redis, and also once without Redis (i.e. original code). Sample error output:
File "/users/xuyehe/FedScale-rd/fedscale/core/aggregation/aggregator.py", line 386, in client_completion_handler
    duration=self.virtual_client_clock[results['clientId']]['computation'] +
KeyError: 665

@fanlai0990 fanlai0990 changed the title Redis Redis Support for FedScale Aug 24, 2022
docker/driver.py Outdated
@@ -143,6 +145,12 @@ def terminate(job_name):
print_help: bool = False
if len(sys.argv) > 1:
if sys.argv[1] == 'submit' or sys.argv[1] == 'start':
redis_exec = '/usr/bin/redis-server'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we remove the hardcoded path? Instead, use some like os.environ["REDIS_PATH"]

@@ -6,6 +6,7 @@ dependencies:
- tensorboard
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I remember you mentioned that some tf version is not compatible. Can you please update tf version here too?

@@ -32,6 +34,13 @@ def __init__(self, args):
self.device = args.cuda_device if args.use_cuda else torch.device(
'cpu')

# ======== redis client ========
# <host, port, ...> should be specified in args
host = "127.0.0.1"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please remove any hard code in the future, using args.xxx instead.

@fanlai0990
Copy link
Member

Thanks a lot for your contribution! Let's try to run an e2e evaluation to validate the learning curve and Redis overhead first.

@fanlai0990 fanlai0990 mentioned this pull request Aug 29, 2022
5 tasks
@fanlai0990
Copy link
Member

fanlai0990 commented Oct 12, 2022

Thanks for your contribution. Can you please

  • remove all prints or replace them with logging.info
  • summarize the results of the overhead analysis

We really appreciate your help.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants