Add analytics logging to MosaicMLLogger #3106

Open

angel-ruiz7 wants to merge 45 commits into main from angel/add-data-to-metadata-for-analytics

Commits (45)
f2e5537
add `log_analytics` function to `MosaicMLLogger`
angel-ruiz7 Mar 9, 2024
87f3d09
Merge branch 'dev' of github.com:mosaicml/composer into angel/add-dat…
angel-ruiz7 Mar 9, 2024
55e738f
add `optimizers`, `loggers`, `algorithms`, `device_mesh`, and `save_i…
angel-ruiz7 Mar 12, 2024
da1a179
fix pyright tests + formatting
angel-ruiz7 Mar 12, 2024
e0b559d
Merge branch 'dev' of github.com:mosaicml/composer into angel/add-dat…
angel-ruiz7 Mar 12, 2024
44cb283
log cloud providers from `load_path` / `save_folder`
angel-ruiz7 Mar 12, 2024
681e166
run formatter
angel-ruiz7 Mar 12, 2024
362f9ba
get rid of circular imports
angel-ruiz7 Mar 12, 2024
21cce59
access mosiacml_logger in a differnent way that doesn't affect tests
angel-ruiz7 Mar 12, 2024
be30004
smol improvements to style
angel-ruiz7 Mar 13, 2024
77d2c25
oops get rid of more circular imports 0_0
angel-ruiz7 Mar 13, 2024
82d771a
log `train_loader_workers` and `eval_loaders` to analytics
angel-ruiz7 Mar 13, 2024
b9aa219
fix type checks / access for `torch.utils.data.DataLoader`
angel-ruiz7 Mar 13, 2024
250cfff
Merge branch 'dev' into angel/add-data-to-metadata-for-analytics
angel-ruiz7 Mar 13, 2024
d841895
remove unnecessary comment
angel-ruiz7 Mar 14, 2024
67fd0d8
merge + resolve conflicts
angel-ruiz7 Mar 15, 2024
4a9da31
log analytics on `EVENT.INIT`
angel-ruiz7 Mar 15, 2024
8a5d9df
comment adjustment
angel-ruiz7 Mar 15, 2024
b8032c5
make sure `Logger.destinations` is iterable
angel-ruiz7 Mar 15, 2024
bca595b
Merge branch 'dev' into angel/add-data-to-metadata-for-analytics
angel-ruiz7 Mar 18, 2024
52db068
update default for `backward_prefetch` and move analytics logging to …
angel-ruiz7 Mar 19, 2024
f44ac72
Merge branch 'angel/add-data-to-metadata-for-analytics' of github.com…
angel-ruiz7 Mar 19, 2024
b499ab4
more formatting 🤡
angel-ruiz7 Mar 20, 2024
45b1f8b
get rid of unnecessary diff
angel-ruiz7 Mar 20, 2024
06cd615
make tests for `get_logger_type`
angel-ruiz7 Mar 20, 2024
53ec76c
Merge branch 'dev' of github.com:mosaicml/composer into angel/add-dat…
angel-ruiz7 Mar 20, 2024
a08bc51
add analytics metadata test, log `optimizer` and `algorithms` using `…
angel-ruiz7 Mar 21, 2024
192fb4b
run formatters
angel-ruiz7 Mar 21, 2024
0c6439e
adjust type hint for `get_logger_type` and delete test `Exception`
angel-ruiz7 Mar 21, 2024
6abd957
fix formatting on docstring
angel-ruiz7 Mar 21, 2024
f10e8a9
remove indent in comment
angel-ruiz7 Mar 21, 2024
cee3876
Merge branch 'dev' into angel/add-data-to-metadata-for-analytics
angel-ruiz7 Mar 21, 2024
11eb853
remove underscored fields and `param_groups` from `composer/optimizer…
angel-ruiz7 Mar 21, 2024
51a0dd0
Merge branch 'angel/add-data-to-metadata-for-analytics' of github.com…
angel-ruiz7 Mar 21, 2024
59dea8f
display name and data in one field for `optimizers` and `algorithms`
angel-ruiz7 Mar 21, 2024
5037ffb
fix docstring
angel-ruiz7 Mar 22, 2024
ea2eabc
Merge branch 'dev' into angel/add-data-to-metadata-for-analytics
angel-ruiz7 Mar 22, 2024
8c867bc
Make `MosaicAnalyticsData` class, change cloud path names, and log `f…
angel-ruiz7 Mar 26, 2024
5753020
just log algorithm names for analytics
angel-ruiz7 Mar 26, 2024
a51d5d9
Merge branch 'dev' into angel/add-data-to-metadata-for-analytics
angel-ruiz7 Mar 26, 2024
1083ae6
just pass `evaluator.label` for `eval_loaders`
angel-ruiz7 Mar 26, 2024
25e3ceb
Merge branch 'angel/add-data-to-metadata-for-analytics' of github.com…
angel-ruiz7 Mar 26, 2024
bf02105
fix `fsdp_config`, `eval_loaders`, and `loggers`. also `warn` when an…
angel-ruiz7 Mar 26, 2024
3be10ed
update docstring
angel-ruiz7 Mar 26, 2024
ba907b7
Merge branch 'dev' into angel/add-data-to-metadata-for-analytics
angel-ruiz7 Apr 1, 2024
78 changes: 76 additions & 2 deletions composer/loggers/mosaicml_logger.py
@@ -14,17 +14,21 @@
import time
import warnings
from concurrent.futures import wait
from dataclasses import dataclass
from functools import reduce
from typing import TYPE_CHECKING, Any, Dict, List, Optional
from typing import TYPE_CHECKING, Any, Callable, Dict, List, Optional, Tuple, Union

import mcli
import torch
import torch.utils.data

from composer.core.time import TimeUnit
from composer.core.event import Event
from composer.core.time import Time, TimeUnit
from composer.loggers import Logger
from composer.loggers.logger_destination import LoggerDestination
from composer.loggers.wandb_logger import WandBLogger
from composer.utils import dist
from composer.utils.file_helpers import parse_uri

if TYPE_CHECKING:
    from composer.core import State
@@ -46,6 +50,18 @@ class MosaicMLLogger(LoggerDestination):
    Logs metrics to the MosaicML platform. Logging only happens on rank 0 every ``log_interval``
    seconds to avoid performance issues.

    Additionally, the following metrics are logged upon ``INIT``:
    - ``composer/autoresume``: Whether or not the run can be stopped / resumed during training.
    - ``composer/precision``: The precision to use for training.
    - ``composer/eval_loaders``: A list containing the labels of each evaluation dataloader.
    - ``composer/optimizers``: A list of dictionaries containing information about each optimizer.
    - ``composer/algorithms``: A list containing the names of the algorithms used for training.
    - ``composer/loggers``: A list containing the loggers used in the ``Trainer``.
    - ``composer/cloud_provided_load_path``: The cloud provider for the load path.
    - ``composer/cloud_provided_save_folder``: The cloud provider for the save folder.
    - ``composer/save_interval``: The save interval for the run.
    - ``composer/fsdp_config``: The FSDP config used for training.

    When running on the MosaicML platform, the logger is automatically enabled by Trainer. To disable,
    the environment variable 'MOSAICML_PLATFORM' can be set to False.

@@ -62,17 +78,20 @@ class MosaicMLLogger:
            (default: ``None``)

        ignore_exceptions: Flag to disable logging exceptions. Defaults to False.
        analytics_data (MosaicAnalyticsData, optional): Run attributes (autoresume, save interval, load/save paths) used to log analytics. Defaults to ``None``.
    """

    def __init__(
        self,
        log_interval: int = 60,
        ignore_keys: Optional[List[str]] = None,
        ignore_exceptions: bool = False,
        analytics_data: Optional[MosaicAnalyticsData] = None,
    ) -> None:
        self.log_interval = log_interval
        self.ignore_keys = ignore_keys
        self.ignore_exceptions = ignore_exceptions
        self.analytics_data = analytics_data
        self._enabled = dist.get_global_rank() == 0
        if self._enabled:
            self.time_last_logged = 0
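
For reference, a minimal sketch (hypothetical values) of wiring analytics data in by hand; the Trainer normally constructs the MosaicAnalyticsData object itself, as the trainer.py changes below show:

from composer.loggers.mosaicml_logger import MosaicAnalyticsData, MosaicMLLogger

# Hypothetical run attributes; all four fields mirror Trainer arguments.
analytics_data = MosaicAnalyticsData(
    autoresume=False,
    save_interval='1ep',
    load_path='s3://my-bucket/checkpoints/ep0.pt',
    save_folder='s3://my-bucket/checkpoints',
)
logger = MosaicMLLogger(analytics_data=analytics_data)
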
@@ -96,10 +115,57 @@ def log_hyperparameters(self, hyperparameters: Dict[str, Any]):
    def log_metrics(self, metrics: Dict[str, Any], step: Optional[int] = None) -> None:
        self._log_metadata(metrics)

    def log_analytics(self, state: State, loggers: Tuple[LoggerDestination, ...]) -> None:
Contributor: Should this be a callback instead? It seems like all loggers and experiment-tracking tools would benefit from having this extra information for navigation or reproducibility purposes.

        if self.analytics_data is None:
            return

        metrics: Dict[str, Any] = {
            'composer/autoresume': self.analytics_data.autoresume,
            'composer/precision': state.precision,
        }
        metrics['composer/eval_loaders'] = [evaluator.label for evaluator in state.evaluators]
        metrics['composer/optimizers'] = [{
            optimizer.__class__.__name__: optimizer.defaults,
        } for optimizer in state.optimizers]
        metrics['composer/algorithms'] = [algorithm.__class__.__name__ for algorithm in state.algorithms]
        metrics['composer/loggers'] = [logger.__class__.__name__ for logger in loggers]

        # Take the service provider out of the URI and log it to metadata. If no service provider
        # is found (i.e. backend = ''), then we assume 'local' for the cloud provider.
        if self.analytics_data.load_path is not None:
            backend, _, _ = parse_uri(self.analytics_data.load_path)
            metrics['composer/cloud_provided_load_path'] = backend if backend else 'local'
        if self.analytics_data.save_folder is not None:
            backend, _, _ = parse_uri(self.analytics_data.save_folder)
            metrics['composer/cloud_provided_save_folder'] = backend if backend else 'local'

        # Save interval can be passed in w/ multiple types. If the type is a function, then
        # we log 'callable' as the save_interval value for analytics.
        if isinstance(self.analytics_data.save_interval, (str, int)):
            save_interval_str = str(self.analytics_data.save_interval)
        elif isinstance(self.analytics_data.save_interval, Time):
            save_interval_str = f'{self.analytics_data.save_interval._value}{self.analytics_data.save_interval._unit}'
        else:
            save_interval_str = 'callable'
        metrics['composer/save_interval'] = save_interval_str

Contributor: Is there some idea for the utility of analytics on save interval? Nothing is coming to mind for me.

Author (angel-ruiz7, Mar 26, 2024): This was requested in a Slack thread a while back. It included some less-helpful metrics (e.g. num_workers), so if it isn't useful I'll take it out. @dakinggg

Contributor: Yeah, I think just drop this one.

        if state.fsdp_config:
            # Keys need to be sorted so they can be parsed consistently in SQL queries
            metrics['composer/fsdp_config'] = json.dumps(state.fsdp_config, sort_keys=True)

        self.log_metrics(metrics)
        self._flush_metadata(force_flush=True)
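
To make the cloud-provider mapping above concrete, a small sketch of parse_uri's behavior (the paths are hypothetical):

from composer.utils.file_helpers import parse_uri

# A scheme-prefixed URI yields its scheme as the backend; a bare local path
# yields an empty backend, which log_analytics records as 'local'.
for path in ('s3://my-bucket/checkpoints/ep1.pt', '/tmp/checkpoints/ep1.pt'):
    backend, _, _ = parse_uri(path)
    print(path, '->', backend if backend else 'local')  # 's3', then 'local'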

    def log_exception(self, exception: Exception):
        self._log_metadata({'exception': exception_to_json_serializable_dict(exception)})
        self._flush_metadata(force_flush=True)
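
On the sorted-keys comment in log_analytics above, a quick standard-library illustration of why sort_keys=True keeps the serialized fsdp_config stable (the keys shown are illustrative):

import json

# Same contents, different insertion order: sorting the keys makes the
# serialized strings identical, so string-based SQL parsing stays stable.
a = {'sharding_strategy': 'FULL_SHARD', 'backward_prefetch': 'BACKWARD_POST'}
b = {'backward_prefetch': 'BACKWARD_POST', 'sharding_strategy': 'FULL_SHARD'}
assert json.dumps(a, sort_keys=True) == json.dumps(b, sort_keys=True)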

    def init(self, state: State, logger: Logger) -> None:
        try:
            self.log_analytics(state, logger.destinations)
        except:
            warnings.warn('Failed to log analytics data to MosaicML. Continuing without logging analytics data.')

Contributor: Can we log the exception here? We should also probably only log once. If it's failing consistently, we don't want to spam the run with warnings.

    def after_load(self, state: State, logger: Logger) -> None:
        # Log model data downloaded and initialized for run events
        log.debug('Logging model initialized time to metadata')
@@ -229,6 +295,14 @@ def _get_training_progress_metrics(self, state: State) -> Dict[str, Any]:
        return training_progress_metrics


@dataclass(frozen=True)
class MosaicAnalyticsData:
    autoresume: bool
    save_interval: Union[str, int, Time, Callable[[State, Event], bool]]
    load_path: Union[str, None]
    save_folder: Union[str, None]
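
A short sketch (hypothetical values) of the save_interval variants this dataclass accepts; log_analytics above normalizes the callable case to the literal string 'callable':

from composer.core.time import Time, TimeUnit
from composer.loggers.mosaicml_logger import MosaicAnalyticsData

# str/int pass through via str(); Time is rendered from its value and unit;
# anything else (e.g. a lambda) is logged as 'callable'.
as_string = MosaicAnalyticsData(autoresume=False, save_interval='1ep', load_path=None, save_folder=None)
as_time = MosaicAnalyticsData(autoresume=False, save_interval=Time(500, TimeUnit.BATCH), load_path=None, save_folder=None)
as_callable = MosaicAnalyticsData(autoresume=False, save_interval=lambda state, event: True, load_path=None, save_folder=None)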


def format_data_to_json_serializable(data: Any):
"""Recursively formats data to be JSON serializable.

24 changes: 22 additions & 2 deletions composer/trainer/trainer.py
@@ -83,7 +83,11 @@
    RemoteUploaderDownloader,
    WandBLogger,
)
from composer.loggers.mosaicml_logger import MOSAICML_ACCESS_TOKEN_ENV_VAR, MOSAICML_PLATFORM_ENV_VAR
from composer.loggers.mosaicml_logger import (
    MOSAICML_ACCESS_TOKEN_ENV_VAR,
    MOSAICML_PLATFORM_ENV_VAR,
    MosaicAnalyticsData,
)
from composer.models import ComposerModel
from composer.optim import ComposerScheduler, DecoupledSGDW, compile_composer_scheduler
from composer.profiler import Profiler
@@ -1284,8 +1288,24 @@ def __init__(
            MOSAICML_ACCESS_TOKEN_ENV_VAR,
        ) is not None and not any(isinstance(x, MosaicMLLogger) for x in loggers):
            log.info('Detected run on MosaicML platform. Adding MosaicMLLogger to loggers.')

            analytics_data = MosaicAnalyticsData(
                autoresume=autoresume,
                save_interval=save_interval,
                load_path=load_path,
                save_folder=save_folder,
            )
            mosaicml_logger = MosaicMLLogger(analytics_data=analytics_data)
            loggers.append(mosaicml_logger)
        elif any(isinstance(x, MosaicMLLogger) for x in loggers):
            # If a MosaicMLLogger is already present (i.e. passed into the Trainer), update the analytics data
            mosaicml_logger = next(logger for logger in loggers if isinstance(logger, MosaicMLLogger))
            mosaicml_logger.analytics_data = MosaicAnalyticsData(
                autoresume=autoresume,
                save_interval=save_interval,
                load_path=load_path,
                save_folder=save_folder,
            )

        # Remote Uploader Downloader
        # Keep the ``RemoteUploaderDownloader`` below client-provided loggers so the loggers init callbacks run before
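
Putting the trainer.py change in context, a sketch (reusing the test helpers from the test file below) of a Trainer call whose autoresume, save_interval, load_path, and save_folder end up in analytics metadata on a MosaicML platform run:

from torch.utils.data import DataLoader

from composer import Trainer
from composer.loggers import MosaicMLLogger
from tests.common import RandomClassificationDataset, SimpleModel

# On the MosaicML platform, these four Trainer arguments are packed into
# MosaicAnalyticsData and logged at Event.INIT; save_interval defaults to '1ep'.
trainer = Trainer(
    model=SimpleModel(),
    train_dataloader=DataLoader(RandomClassificationDataset()),
    max_duration='4ba',
    autoresume=False,
    save_folder=None,  # no save folder -> no cloud_provided_save_folder entry
    loggers=[MosaicMLLogger()],
)
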
30 changes: 30 additions & 0 deletions tests/loggers/test_mosaicml_logger.py
@@ -384,3 +384,33 @@ def test_epoch_zero_no_dataloader_progress_metrics():
    assert training_progress['training_progress'] == '[epoch=1/3]'
    assert 'training_sub_progress' in training_progress
    assert training_progress['training_sub_progress'] == '[batch=1]'


def test_logged_metrics(monkeypatch):
    mock_mapi = MockMAPI()
    monkeypatch.setenv('MOSAICML_PLATFORM', 'True')
    monkeypatch.setattr(mcli, 'update_run_metadata', mock_mapi.update_run_metadata)
    run_name = 'test-run-name'
    monkeypatch.setenv('RUN_NAME', run_name)
    trainer = Trainer(
        model=SimpleModel(),
        train_dataloader=DataLoader(RandomClassificationDataset()),
        train_subset_num_batches=1,
        max_duration='4ba',
        loggers=[MosaicMLLogger()],
    )
    trainer.fit()

    # Check that analytics metrics were logged
    metadata = mock_mapi.run_metadata[run_name]
    analytics = {k: v for k, v in metadata.items() if k.startswith('mosaicml/composer/')}
    assert len(analytics) > 0

    key_name = lambda x: f'mosaicml/composer/{x}'
    assert key_name('autoresume') in analytics and analytics[key_name('autoresume')] == False
    assert key_name('precision') in analytics and analytics[key_name('precision')] == 'Precision.FP32'
    assert key_name('eval_loaders') in analytics and analytics[key_name('eval_loaders')] == []
    assert key_name('algorithms') in analytics and analytics[key_name('algorithms')] == []
    assert key_name('loggers') in analytics and analytics[key_name('loggers')] == ['MosaicMLLogger', 'ProgressBarLogger']
    assert key_name('save_interval') in analytics and analytics[key_name('save_interval')] == '1ep'