What happened + What you expected to happen
If I set experiment_mode to a value other than "simulation", for example "standalone", FedScale fails to run. The femnist_cluster.yml is:
# Configuration file of FAR training experiment

# ========== Cluster configuration ==========
# ip address of the parameter server (need 1 GPU process)
ps_ip: 192.168.124.102

# ip address of each worker: # of available gpu processes on each gpu in this node
# Note that if we collocate ps and worker on the same GPU, then we need to decrease the number of available processes on that GPU by 1
# E.g., if the master node has 4 available processes, then 1 is for the ps, and the worker should be set to worker:3
worker_ips:
- 192.168.124.104:[1]
- 192.168.124.105:[1]
- 192.168.124.106:[1]

exp_path: $FEDSCALE_HOME/fedscale/cloud

# Entry function of executor and aggregator under $exp_path
executor_entry: execution/executor.py
aggregator_entry: aggregation/aggregator.py

auth:
  ssh_user: "whr"
  ssh_private_key: ~/.ssh/id_rsa

# cmd to run before we can indeed run FAR (in order)
setup_commands:
- source /usr/local/miniconda3/bin/activate fedscale

# ========== Additional job configuration ==========
# Default parameters are specified in config_parser.py, wherein more description of the parameters can be found
job_conf:
- job_name: femnist_cluster # Generate logs under this folder: log_path/job_name/time_stamp
- log_path: $FEDSCALE_HOME/benchmark # Path of log files
- num_participants: 2 # Number of participants per round, we use K=100 in our paper, large K will be much slower
- data_set: femnist # Dataset: openImg, google_speech, stackoverflow
- data_dir: $FEDSCALE_HOME/benchmark/dataset/data/femnist # Path of the dataset
- data_map_file: $FEDSCALE_HOME/benchmark/dataset/data/femnist/client_data_mapping/train.csv # Allocation of data to each client, turn to iid setting if not provided
- device_conf_file: $FEDSCALE_HOME/benchmark/dataset/data/device_info/client_device_capacity # Path of the client trace
- device_avail_file: $FEDSCALE_HOME/benchmark/dataset/data/device_info/client_behave_trace
- model: resnet18 # NOTE: Please refer to our model zoo README and use models for these small image (e.g., 32x32x3) inputs
# - model_zoo: fedscale-torch-zoo
- eval_interval: 10 # How many rounds to run a testing on the testing set
- rounds: 1000 # Number of rounds to run this training. We use 1000 in our paper, while it may converge w/ ~400 rounds
- filter_less: 21 # Remove clients w/ less than 21 samples
- num_loaders: 2
- local_steps: 5
- learning_rate: 0.05
- batch_size: 20
- test_bsz: 20
- use_cuda: True
- save_checkpoint: False
- experiment_mode: standalone
- overcommitment: 1.0
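For context, the job_conf list above appears to be flattened by the driver into command-line flags; the Namespace printed near the top of the log below mirrors these keys one-to-one. A rough sketch of that flattening, under that assumption only (my own illustration, not FedScale's actual driver code):

import yaml

# Illustrative only: flatten the job_conf list of single-key mappings into
# "--key=value" flags, which would explain why the aggregator's Namespace in
# the log mirrors the YAML keys one-to-one.
with open("femnist_cluster.yml") as f:
    cfg = yaml.safe_load(f)

job_conf = {k: v for entry in cfg["job_conf"] for k, v in entry.items()}
flags = [f"--{k}={v}" for k, v in job_conf.items()]
print(flags)  # e.g. ['--job_name=femnist_cluster', ..., '--experiment_mode=standalone']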
The log is:
2023-12-27 14:39:19.056225: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-12-27 14:39:19.152964: I tensorflow/core/util/port.cc:104] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2023-12-27 14:39:19.480238: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory
2023-12-27 14:39:19.480270: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory
2023-12-27 14:39:19.480272: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
(12-27) 14:39:19 INFO [aggregator.py:44] Job args Namespace(adam_epsilon=1e-08, backbone='./resnet50.pth', backend='gloo', batch_size=20, bidirectional=True, blacklist_max_len=0.3, blacklist_rounds=-1, block_size=64, cfg_file='./utils/rcnn/cfgs/res101.yml', clf_block_size=32, clip_bound=0.9, clip_threshold=3.0, clock_factor=2.4368231046931412, conf_path='~/dataset/', connection_timeout=60, cuda_device=None, cut_off_util=0.05, data_cache='', data_dir='/home/whr/code/FedScale/benchmark/dataset/data/femnist', data_map_file='/home/whr/code/FedScale/benchmark/dataset/data/femnist/client_data_mapping/train.csv', data_set='femnist', decay_factor=0.98, decay_round=10, device_avail_file='/home/whr/code/FedScale/benchmark/dataset/data/device_info/client_behave_trace', device_conf_file='/home/whr/code/FedScale/benchmark/dataset/data/device_info/client_device_capacity', dump_epoch=10000000000.0, embedding_file='glove.840B.300d.txt', engine='pytorch', epsilon=0.9, eval_interval=10, executor_configs='192.168.124.104:[1]=192.168.124.105:[1]=192.168.124.106:[1]', experiment_mode='standalone', exploration_alpha=0.3, exploration_decay=0.98, exploration_factor=0.9, exploration_min=0.3, filter_less=21, filter_more=1000000000000000.0, finetune=False, gamma=0.9, gradient_policy=None, hidden_layers=7, hidden_size=256, input_dim=0, input_shape=[1, 3, 28, 28], job_name='femnist_cluster', labels_path='labels.json', learning_rate=0.05, line_by_line=False, local_steps=5, log_path='/home/whr/code/FedScale/benchmark', loss_decay=0.2, malicious_factor=1000000000000000.0, max_concurrency=10, max_staleness=5, memory_capacity=2000, min_learning_rate=5e-05, mlm=False, mlm_probability=0.15, model='resnet18', model_size=65536, model_zoo='torchcv', n_actions=2, n_states=4, noise_dir=None, noise_factor=0.1, noise_max=0.5, noise_min=0.0, noise_prob=0.4, num_class=62, num_classes=35, num_executors=3, num_loaders=2, num_participants=3, output_dim=0, overcommitment=1.0, overwrite_cache=False, pacer_delta=5, pacer_step=20, proxy_mu=0.1, ps_ip='192.168.124.102', ps_port='29500', qfed_q=1.0, rnn_type='lstm', round_penalty=2.0, round_threshold=30, rounds=1000, sample_mode='random', sample_rate=16000, sample_seed=233, sample_window=5.0, save_checkpoint=True, spec_augment=False, speed_volume_perturb=False, target_delta=0.0001, target_replace_iter=15, task='cv', test_bsz=20, test_manifest='data/test_manifest.csv', test_output_dir='./logs/server', test_ratio=1.0, test_size_file='', this_rank=0, time_stamp='1227_143917', train_manifest='data/train_manifest.csv', train_size_file='', train_uniform=False, use_cuda=True, vocab_tag_size=500, vocab_token_size=10000, wandb_token='', weight_decay=0, window='hamming', window_size=0.02, window_stride=0.01, yogi_beta=0.9, yogi_beta2=0.99, yogi_eta=0.003, yogi_tau=1e-08)
(12-27) 14:39:20 INFO [aggregator.py:164] Initiating control plane communication ...
(12-27) 14:39:20 INFO [aggregator.py:188] %%%%%%%%%% Opening aggregator server using port [::]:29500 %%%%%%%%%%
(12-27) 14:39:20 INFO [fllibs.py:97] Initializing the model ...
(12-27) 14:39:20 INFO [aggregator.py:967] Start monitoring events ...
2023-12-27 14:39:31.090474: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-12-27 14:39:31.169358: I tensorflow/core/util/port.cc:104] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2023-12-27 14:39:31.478808: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory
2023-12-27 14:39:31.478836: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory
2023-12-27 14:39:31.478838: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
(12-27) 14:39:31 INFO [fllibs.py:97] Initializing the model ...
(12-27) 14:39:31 INFO [executor.py:77] (EXECUTOR:1) is setting up environ ...
(12-27) 14:39:32 INFO [executor.py:123] Data partitioner starts ...
(12-27) 14:39:32 INFO [divide_data.py:62] Partitioning data by profile /home/whr/code/FedScale/benchmark/dataset/data/femnist/client_data_mapping/train.csv...
(12-27) 14:39:32 INFO [divide_data.py:74] Trace names are client_id, sample_path, label_name, label_id
(12-27) 14:39:32 INFO [divide_data.py:105] Randomly partitioning data, 81674 samples...
(12-27) 14:39:32 INFO [executor.py:141] Data partitioner completes ...
(12-27) 14:39:32 INFO [channel_context.py:21] %%%%%%%%%% Opening grpc connection to 192.168.124.102 %%%%%%%%%%
(12-27) 14:39:32 INFO [executor.py:404] Start monitoring events ...
(12-27) 14:39:32 INFO [aggregator.py:318] Received executor 1 information, 1/3
(12-27) 14:39:32 INFO [aggregator.py:274] Loading 2800 client traces ...
(12-27) 14:39:32 INFO [aggregator.py:304] Info of all feasible clients {'total_feasible_clients': 2799, 'total_num_samples': 637858}
2023-12-27 14:39:33.925569: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-12-27 14:39:34.012208: I tensorflow/core/util/port.cc:104] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
2023-12-27 14:39:34.334770: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory
2023-12-27 14:39:34.334812: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory
2023-12-27 14:39:34.334815: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
(12-27) 14:39:34 INFO [fllibs.py:97] Initializing the model ...
(12-27) 14:39:34 INFO [executor.py:77] (EXECUTOR:2) is setting up environ ...
2023-12-27 14:39:35.087146: I tensorflow/core/platform/cpu_feature_guard.cc:193] This TensorFlow binary is optimized with oneAPI Deep Neural Network Library (oneDNN) to use the following CPU instructions in performance-critical operations: AVX2 AVX_VNNI FMA
To enable them in other operations, rebuild TensorFlow with the appropriate compiler flags.
2023-12-27 14:39:35.167337: I tensorflow/core/util/port.cc:104] oneDNN custom operations are on. You may see slightly different numerical results due to floating-point round-off errors from different computation orders. To turn them off, set the environment variable `TF_ENABLE_ONEDNN_OPTS=0`.
(12-27) 14:39:35 INFO [executor.py:123] Data partitioner starts ...
(12-27) 14:39:35 INFO [divide_data.py:62] Partitioning data by profile /home/whr/code/FedScale/benchmark/dataset/data/femnist/client_data_mapping/train.csv...
(12-27) 14:39:35 INFO [divide_data.py:74] Trace names are client_id, sample_path, label_name, label_id
2023-12-27 14:39:35.479452: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer.so.7'; dlerror: libnvinfer.so.7: cannot open shared object file: No such file or directory
2023-12-27 14:39:35.479481: W tensorflow/compiler/xla/stream_executor/platform/default/dso_loader.cc:64] Could not load dynamic library 'libnvinfer_plugin.so.7'; dlerror: libnvinfer_plugin.so.7: cannot open shared object file: No such file or directory
2023-12-27 14:39:35.479484: W tensorflow/compiler/tf2tensorrt/utils/py_utils.cc:38] TF-TRT Warning: Cannot dlopen some TensorRT libraries. If you would like to use Nvidia GPU with TensorRT, please make sure the missing libraries mentioned above are installed properly.
(12-27) 14:39:35 INFO [divide_data.py:105] Randomly partitioning data, 81674 samples...
(12-27) 14:39:35 INFO [executor.py:141] Data partitioner completes ...
(12-27) 14:39:35 INFO [channel_context.py:21] %%%%%%%%%% Opening grpc connection to 192.168.124.102 %%%%%%%%%%
(12-27) 14:39:35 INFO [executor.py:404] Start monitoring events ...
(12-27) 14:39:35 INFO [aggregator.py:318] Received executor 2 information, 2/3
(12-27) 14:39:35 INFO [aggregator.py:274] Loading 2800 client traces ...
(12-27) 14:39:35 INFO [aggregator.py:304] Info of all feasible clients {'total_feasible_clients': 5598, 'total_num_samples': 1275716}
(12-27) 14:39:35 INFO [fllibs.py:97] Initializing the model ...
(12-27) 14:39:35 INFO [executor.py:77] (EXECUTOR:3) is setting up environ ...
(12-27) 14:39:36 INFO [executor.py:123] Data partitioner starts ...
(12-27) 14:39:36 INFO [divide_data.py:62] Partitioning data by profile /home/whr/code/FedScale/benchmark/dataset/data/femnist/client_data_mapping/train.csv...
(12-27) 14:39:36 INFO [divide_data.py:74] Trace names are client_id, sample_path, label_name, label_id
(12-27) 14:39:36 INFO [divide_data.py:105] Randomly partitioning data, 81674 samples...
(12-27) 14:39:36 INFO [executor.py:141] Data partitioner completes ...
(12-27) 14:39:36 INFO [channel_context.py:21] %%%%%%%%%% Opening grpc connection to 192.168.124.102 %%%%%%%%%%
(12-27) 14:39:36 INFO [executor.py:404] Start monitoring events ...
(12-27) 14:39:36 INFO [aggregator.py:318] Received executor 3 information, 3/3
(12-27) 14:39:36 INFO [aggregator.py:274] Loading 2800 client traces ...
(12-27) 14:39:36 INFO [aggregator.py:304] Info of all feasible clients {'total_feasible_clients': 8397, 'total_num_samples': 1913574}
(12-27) 14:39:36 INFO [aggregator.py:583] Wall clock: 0 s, round: 1, Planned participants: 0, Succeed participants: 0, Training loss: 0.0
(12-27) 14:39:36 INFO [client_manager.py:195] Wall clock time: 0, 0 clients online, 8397 clients offline
(12-27) 14:39:36 INFO [aggregator.py:605] Selected participants to run: []
Apparently, it selects no participants to run and the program is stuck here.
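From the last two log lines, the client manager reports 0 of the 8397 loaded clients as online, and the participant selection then comes back empty. A minimal sketch of why an empty online pool necessarily yields an empty selection (select_participants is my own illustrative name here, not the actual FedScale client_manager API):

import random

def select_participants(online_clients, num_participants, seed=233):
    # Illustrative only: sample up to num_participants clients from the
    # currently online pool; with an empty pool the result is empty.
    random.seed(seed)
    return random.sample(online_clients, min(num_participants, len(online_clients)))

# As in the log above: 0 clients online, so nothing can be sampled and the
# round never gets any work scheduled.
print(select_participants(online_clients=[], num_participants=2))  # -> []

So the question seems to be why, outside of simulation mode, none of the loaded client traces are ever marked online.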
Versions / Dependencies
FedScale: 7ec441c
Python: 3.7.16
OS: Ubuntu 20.04
Reproduction script
I put the aforementioned yml under $WORKDIR, so the starting command is:
python $WORKDIR/docker/driver.py submit $WORKDIR/femnist_cluster.yml
Issue Severity
None