[FedScale Core] fail to run if it's not in simulation mode. #247

Open
whr819987540 opened this issue Dec 27, 2023 · 0 comments
Labels
bug Something isn't working

What happened + What you expected to happen

If I set experiment_mode to anything other than "simulation" (for example, "standalone"), FedScale fails to run. My femnist_cluster.yml is:

# Configuration file of FAR training experiment

# ========== Cluster configuration ========== 
# ip address of the parameter server (need 1 GPU process)
ps_ip: 192.168.124.102

# ip address of each worker:# of available gpus process on each gpu in this node
# Note that if we collocate ps and worker on same GPU, then we need to decrease this number of available processes on that GPU by 1
# E.g., master node has 4 available processes, then 1 for the ps, and worker should be set to: worker:3
worker_ips:
    - 192.168.124.104:[1]
    - 192.168.124.105:[1]
    - 192.168.124.106:[1]

exp_path: $FEDSCALE_HOME/fedscale/cloud

# Entry function of executor and aggregator under $exp_path
executor_entry: execution/executor.py

aggregator_entry: aggregation/aggregator.py

auth:
    ssh_user: "whr"
    ssh_private_key: ~/.ssh/id_rsa

# cmd to run before we can indeed run FAR (in order)
setup_commands:
    - source /usr/local/miniconda3/bin/activate fedscale

# ========== Additional job configuration ========== 
# Default parameters are specified in config_parser.py, wherein more description of the parameter can be found

job_conf: 
    - job_name: femnist_cluster                   # Generate logs under this folder: log_path/job_name/time_stamp
    - log_path: $FEDSCALE_HOME/benchmark # Path of log files
    - num_participants: 2                 # Number of participants per round, we use K=100 in our paper, large K will be much slower
    - data_set: femnist                     # Dataset: openImg, google_speech, stackoverflow
    - data_dir: $FEDSCALE_HOME/benchmark/dataset/data/femnist    # Path of the dataset
    - data_map_file: $FEDSCALE_HOME/benchmark/dataset/data/femnist/client_data_mapping/train.csv              # Allocation of data to each client, turn to iid setting if not provided
    - device_conf_file: $FEDSCALE_HOME/benchmark/dataset/data/device_info/client_device_capacity     # Path of the client trace
    - device_avail_file: $FEDSCALE_HOME/benchmark/dataset/data/device_info/client_behave_trace
    - model: resnet18             # NOTE: Please refer to our model zoo README and use models for these small image (e.g., 32x32x3) inputs
#    - model_zoo: fedscale-torch-zoo
    - eval_interval: 10                     # How many rounds to run a testing on the testing set
    - rounds: 1000                          # Number of rounds to run this training. We use 1000 in our paper, while it may converge w/ ~400 rounds
    - filter_less: 21                       # Remove clients w/ less than 21 samples
    - num_loaders: 2
    - local_steps: 5
    - learning_rate: 0.05
    - batch_size: 20
    - test_bsz: 20
    - use_cuda: True
    - save_checkpoint: False
    
    - experiment_mode: standalone
    - overcommitment: 1.0
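
The job_conf section above is a list of single-key mappings, so the setting that matters here (experiment_mode) is easier to inspect once the list is flattened. The following is a minimal sketch, assuming the YAML has already been parsed into plain Python objects; `flatten_job_conf` and `is_simulation` are hypothetical helper names for illustration and are not part of FedScale.

```python
from typing import Any, Dict, List


def flatten_job_conf(job_conf: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Merge a list of single-key mappings into one dict (later entries win)."""
    merged: Dict[str, Any] = {}
    for entry in job_conf:
        merged.update(entry)
    return merged


def is_simulation(conf: Dict[str, Any]) -> bool:
    """Return True only if experiment_mode is explicitly set to 'simulation'."""
    return conf.get("experiment_mode") == "simulation"


# A few entries copied from the job_conf section of the config above.
job_conf = [
    {"job_name": "femnist_cluster"},
    {"num_participants": 2},
    {"experiment_mode": "standalone"},
    {"overcommitment": 1.0},
]

conf = flatten_job_conf(job_conf)
print(conf["experiment_mode"], is_simulation(conf))
```

With the values from this report, the check returns False, which is the case that triggers the failure described here.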

The log is:

2023-12-27 14:39:19.056225: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-12-27 14:39:19.152964: I tensorflow/core/util/port.cc:104] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2023-12-27 14:39:19.480238: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory
2023-12-27 14:39:19.480270: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory
2023-12-27 14:39:19.480272: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
(12-27) 14:39:19 INFO     [aggregator.py:44] Job args Namespace(adam_epsilon=1e-08, backbone='./resnet50.pth', backend='gloo', batch_size=20, bidirectional=True, blacklist_max_len=0.3, blacklist_rounds=-1, block_size=64, cfg_file='./utils/rcnn/cfgs/res101.yml', clf_block_size=32, clip_bound=0.9, clip_threshold=3.0, clock_factor=2.4368231046931412, conf_path='~/dataset/', connection_timeout=60, cuda_device=None, cut_off_util=0.05, data_cache='', data_dir='/home/whr/code/FedScale/benchmark/dataset/data/femnist', data_map_file='/home/whr/code/FedScale/benchmark/dataset/data/femnist/client_data_mapping/train.csv', data_set='femnist', decay_factor=0.98, decay_round=10, device_avail_file='/home/whr/code/FedScale/benchmark/dataset/data/device_info/client_behave_trace', device_conf_file='/home/whr/code/FedScale/benchmark/dataset/data/device_info/client_device_capacity', dump_epoch=10000000000.0, embedding_file='glove.840B.300d.txt', engine='pytorch', epsilon=0.9, eval_interval=10, executor_configs='192.168.124.104:[1]=192.168.124.105:[1]=192.168.124.106:[1]', experiment_mode='standalone', exploration_alpha=0.3, exploration_decay=0.98, exploration_factor=0.9, exploration_min=0.3, filter_less=21, filter_more=1000000000000000.0, finetune=False, gamma=0.9, gradient_policy=None, hidden_layers=7, hidden_size=256, input_dim=0, input_shape=[1, 3, 28, 28], job_name='femnist_cluster', labels_path='labels.json', learning_rate=0.05, line_by_line=False, local_steps=5, log_path='/home/whr/code/FedScale/benchmark', loss_decay=0.2, malicious_factor=1000000000000000.0, max_concurrency=10, max_staleness=5, memory_capacity=2000, min_learning_rate=5e-05, mlm=False, mlm_probability=0.15, model='resnet18', model_size=65536, model_zoo='torchcv', n_actions=2, n_states=4, noise_dir=None, noise_factor=0.1, noise_max=0.5, noise_min=0.0, noise_prob=0.4, num_class=62, num_classes=35, num_executors=3, num_loaders=2, num_participants=3, output_dim=0, overcommitment=1.0, overwrite_cache=False, 
pacer_delta=5, pacer_step=20, proxy_mu=0.1, ps_ip='192.168.124.102', ps_port='29500', qfed_q=1.0, rnn_type='lstm', round_penalty=2.0, round_threshold=30, rounds=1000, sample_mode='random', sample_rate=16000, sample_seed=233, sample_window=5.0, save_checkpoint=True, spec_augment=False, speed_volume_perturb=False, target_delta=0.0001, target_replace_iter=15, task='cv', test_bsz=20, test_manifest='data/test_manifest.csv', test_output_dir='./logs/server', test_ratio=1.0, test_size_file='', this_rank=0, time_stamp='1227_143917', train_manifest='data/train_manifest.csv', train_size_file='', train_uniform=False, use_cuda=True, vocab_tag_size=500, vocab_token_size=10000, wandb_token='', weight_decay=0, window='hamming', window_size=0.02, window_stride=0.01, yogi_beta=0.9, yogi_beta2=0.99, yogi_eta=0.003, yogi_tau=1e-08)
(12-27) 14:39:20 INFO     [aggregator.py:164] Initiating control plane communication ...
(12-27) 14:39:20 INFO     [aggregator.py:188] %%%%%%%%%% Opening aggregator server using port [::]:29500 %%%%%%%%%%
(12-27) 14:39:20 INFO     [fllibs.py:97] Initializing the model ...
(12-27) 14:39:20 INFO     [aggregator.py:967] Start monitoring events ...
2023-12-27 14:39:31.090474: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-12-27 14:39:31.169358: I tensorflow/core/util/port.cc:104] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2023-12-27 14:39:31.478808: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory
2023-12-27 14:39:31.478836: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory
2023-12-27 14:39:31.478838: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
(12-27) 14:39:31 INFO     [fllibs.py:97] Initializing the model ...
(12-27) 14:39:31 INFO     [executor.py:77] (EXECUTOR:1) is setting up environ ...
(12-27) 14:39:32 INFO     [executor.py:123] Data partitioner starts ...
(12-27) 14:39:32 INFO     [divide_data.py:62] Partitioning data by profile /home/whr/code/FedScale/benchmark/dataset/data/femnist/client_data_mapping/train.csv...
(12-27) 14:39:32 INFO     [divide_data.py:74] Trace names are client_id, sample_path, label_name, label_id
(12-27) 14:39:32 INFO     [divide_data.py:105] Randomly partitioning data, 81674 samples...
(12-27) 14:39:32 INFO     [executor.py:141] Data partitioner completes ...
(12-27) 14:39:32 INFO     [channel_context.py:21] %%%%%%%%%% Opening grpc connection to 192.168.124.102 %%%%%%%%%%
(12-27) 14:39:32 INFO     [executor.py:404] Start monitoring events ...
(12-27) 14:39:32 INFO     [aggregator.py:318] Received executor 1 information, 1/3
(12-27) 14:39:32 INFO     [aggregator.py:274] Loading 2800 client traces ...
(12-27) 14:39:32 INFO     [aggregator.py:304] Info of all feasible clients {'total_feasible_clients': 2799, 'total_num_samples': 637858}
2023-12-27 14:39:33.925569: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-12-27 14:39:34.012208: I tensorflow/core/util/port.cc:104] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2023-12-27 14:39:34.334770: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory
2023-12-27 14:39:34.334812: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory
2023-12-27 14:39:34.334815: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
(12-27) 14:39:34 INFO     [fllibs.py:97] Initializing the model ...
(12-27) 14:39:34 INFO     [executor.py:77] (EXECUTOR:2) is setting up environ ...
2023-12-27 14:39:35.087146: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations:  AVX2 AVX_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-12-27 14:39:35.167337: I tensorflow/core/util/port.cc:104] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
(12-27) 14:39:35 INFO     [executor.py:123] Data partitioner starts ...
(12-27) 14:39:35 INFO     [divide_data.py:62] Partitioning data by profile /home/whr/code/FedScale/benchmark/dataset/data/femnist/client_data_mapping/train.csv...
(12-27) 14:39:35 INFO     [divide_data.py:74] Trace names are client_id, sample_path, label_name, label_id
2023-12-27 14:39:35.479452: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory
2023-12-27 14:39:35.479481: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory
2023-12-27 14:39:35.479484: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
(12-27) 14:39:35 INFO     [divide_data.py:105] Randomly partitioning data, 81674 samples...
(12-27) 14:39:35 INFO     [executor.py:141] Data partitioner completes ...
(12-27) 14:39:35 INFO     [channel_context.py:21] %%%%%%%%%% Opening grpc connection to 192.168.124.102 %%%%%%%%%%
(12-27) 14:39:35 INFO     [executor.py:404] Start monitoring events ...
(12-27) 14:39:35 INFO     [aggregator.py:318] Received executor 2 information, 2/3
(12-27) 14:39:35 INFO     [aggregator.py:274] Loading 2800 client traces ...
(12-27) 14:39:35 INFO     [aggregator.py:304] Info of all feasible clients {'total_feasible_clients': 5598, 'total_num_samples': 1275716}
(12-27) 14:39:35 INFO     [fllibs.py:97] Initializing the model ...
(12-27) 14:39:35 INFO     [executor.py:77] (EXECUTOR:3) is setting up environ ...
(12-27) 14:39:36 INFO     [executor.py:123] Data partitioner starts ...
(12-27) 14:39:36 INFO     [divide_data.py:62] Partitioning data by profile /home/whr/code/FedScale/benchmark/dataset/data/femnist/client_data_mapping/train.csv...
(12-27) 14:39:36 INFO     [divide_data.py:74] Trace names are client_id, sample_path, label_name, label_id
(12-27) 14:39:36 INFO     [divide_data.py:105] Randomly partitioning data, 81674 samples...
(12-27) 14:39:36 INFO     [executor.py:141] Data partitioner completes ...
(12-27) 14:39:36 INFO     [channel_context.py:21] %%%%%%%%%% Opening grpc connection to 192.168.124.102 %%%%%%%%%%
(12-27) 14:39:36 INFO     [executor.py:404] Start monitoring events ...
(12-27) 14:39:36 INFO     [aggregator.py:318] Received executor 3 information, 3/3
(12-27) 14:39:36 INFO     [aggregator.py:274] Loading 2800 client traces ...
(12-27) 14:39:36 INFO     [aggregator.py:304] Info of all feasible clients {'total_feasible_clients': 8397, 'total_num_samples': 1913574}
(12-27) 14:39:36 INFO     [aggregator.py:583] Wall clock: 0 s, round: 1, Planned participants: 0, Succeed participants: 0, Training loss: 0.0
(12-27) 14:39:36 INFO     [client_manager.py:195] Wall clock time: 0, 0 clients online, 8397 clients offline
(12-27) 14:39:36 INFO     [aggregator.py:605] Selected participants to run: []

As the last lines of the log show, the aggregator reports 0 clients online and 8397 clients offline, selects no participants to run, and the program hangs at this point.
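
The symptom can also be detected mechanically from the aggregator log. The sketch below relies only on the two message formats visible in the log above ("N clients online, M clients offline" and "Selected participants to run: [...]"); the function name `diagnose` and the returned keys are assumptions made for this example, not a FedScale API.

```python
import re
from typing import Any, Dict, Iterable, List, Optional

ONLINE_RE = re.compile(r"(\d+) clients online, (\d+) clients offline")
SELECTED_RE = re.compile(r"Selected participants to run: \[(.*)\]")


def diagnose(log_lines: Iterable[str]) -> Dict[str, Any]:
    """Report the most recent online/offline counts and selected participants."""
    online: Optional[int] = None
    offline: Optional[int] = None
    selected: Optional[List[str]] = None
    for line in log_lines:
        match = ONLINE_RE.search(line)
        if match:
            online, offline = int(match.group(1)), int(match.group(2))
        match = SELECTED_RE.search(line)
        if match:
            body = match.group(1).strip()
            selected = [item.strip() for item in body.split(",")] if body else []
    # "stalled" means a selection was logged and it was empty.
    stalled = selected is not None and len(selected) == 0
    return {"online": online, "offline": offline,
            "selected": selected, "stalled": stalled}


# The last two lines of the log in this report.
sample = [
    "(12-27) 14:39:36 INFO     [client_manager.py:195] Wall clock time: 0, "
    "0 clients online, 8397 clients offline",
    "(12-27) 14:39:36 INFO     [aggregator.py:605] Selected participants to run: []",
]
print(diagnose(sample))
# → {'online': 0, 'offline': 8397, 'selected': [], 'stalled': True}
```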

Versions / Dependencies

FedScale: 7ec441c
Python: 3.7.16
OS: Ubuntu 20.04

Reproduction script

I put the YAML file above under $WORKDIR, so the command to start the job is: python $WORKDIR/docker/driver.py submit $WORKDIR/femnist_cluster.yml

Issue Severity

None

@whr819987540 whr819987540 added the bug Something isn't working label Dec 27, 2023