Releases: Netflix/metaflow
2.8.3
Features
Introduce support for tmpfs for executions on AWS Batch
It is typical for the user code in a Metaflow step to download assets from an object store, e.g. S3. Examples include serialized models and raw input data, such as unstructured media or structured Parquet files. The amount of data loaded in a task is typically 10-100GB, allowing even terabytes to be handled in a foreach.
To reduce IO bottlenecks in such tasks, we provide an optimized client for S3, metaflow.S3, that makes it possible to download data using all available network bandwidth. Notably, in a modern instance the available network bandwidth can be higher than the local disk bandwidth. Consider: SATA 3.0 provides 6Gbit/s whereas a large instance can have 20Gbit/s network throughput. Even Gen3 NVMe provides just 16Gbit/s. To benefit from the full network bandwidth, local disk IO must be bypassed. The metaflow.S3 client accomplishes this by relying on the page cache: nominally, files are downloaded to a temporary directory on disk, but practically all data stays in the page cache. This assumes that the downloaded data fits in memory, which can be ensured by having a high enough @resources(memory=) setting.
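For reference, a download inside a step typically looks like the minimal sketch below; the S3 prefix is hypothetical, and the objects are fetched in parallel into tmproot, which stays in the page cache as long as the data fits in memory:

from metaflow import S3

# hypothetical prefix; get_all() downloads every object under it in parallel
with S3(s3root='s3://my-bucket/training-data/') as s3:
    objs = s3.get_all()
    # the downloaded files exist under tmproot only while the context is open
    total_bytes = sum(obj.size for obj in objs)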
The above setup, which can provide excellent IO performance in general, has a small gotcha: the instance needs to have enough local disk space to back all the data, even though no data actually hits the disk. Increasingly, instances have more memory than local disk space available, so this superfluous requirement becomes a problem. The issue is further amplified by the fact that, as of today, it is impossible to add ephemeral volumes on the fly on AWS Batch. This puts users in a strange situation: the instance has enough RAM to hold all the data in memory, and there are ways to download it quickly from S3, but the lack of local disk space (which is not even needed) makes it impossible to access the data.
AWS Batch supports mounting a tmpfs filesystem on the fly. Using this feature, the user can create a memory-backed filesystem which can be used as a temporary space for downloaded data. This removes the need to deal with local disks altogether: one can simply use a minimal root filesystem, which greatly simplifies the infrastructure setup.
With this release, we introduce a new config option, METAFLOW_TEMPDIR, which, if defined, is used as the default metaflow.S3(tmproot=). If METAFLOW_TEMPDIR is not defined, tmproot='.' as before. In addition, a few new attributes are introduced for the @batch decorator:
Attribute (default) | Default behavior | Override semantics
---|---|---
use_tmpfs=False | tmpfs disabled | use_tmpfs=True enables tmpfs |
tmpfs_tempdir=True | sets METAFLOW_TEMPDIR=tmpfs_path | tmpfs_tempdir=False doesn't set METAFLOW_TEMPDIR |
tmpfs_size=None | sets tmpfs size to 50% of @resources(memory) | tmpfs size in megabytes |
tmpfs_path=None | use /metaflow_temp as tmpfs_path | custom mount point |
Examples
Handle large amounts of data in-memory with Batch:
@batch(memory=100000, use_tmpfs=True)
In this case, at most 50GB is available for tmpfs and metaflow.S3 uses it by default. Note that tmpfs only consumes the amount of memory corresponding to the data actually stored, so there is no downside in setting a large size by default.
Increase tmpfs size:
@batch(memory=100000, tmpfs_size=100000)
Let tmpfs use all available memory. Note that use_tmpfs=True doesn’t have to be specified redundantly.
Custom tmpfs use case:
@batch(memory=100000, tmpfs_size=10000, tmpfs_path='/data', tmpfs_tempdir=False)
Full control over settings - metaflow.S3 doesn’t use the tmpfs volume in this case.
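If you disable tmpfs_tempdir but still want the S3 client to use the volume for a particular download, you can point tmproot at the mount explicitly. A minimal sketch, assuming the custom mount point above and a hypothetical S3 prefix:

from metaflow import S3

# tmpfs_tempdir=False leaves METAFLOW_TEMPDIR unset, so pass tmproot explicitly
with S3(s3root='s3://my-bucket/raw-data/', tmproot='/data') as s3:
    objs = s3.get_all()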
Besides metaflow.S3, the user may want to use the tmpfs volume for their own use cases. In particular, many modern ML libraries require a local cache. To support these use cases, tmpfs_path is exposed through the current object, as current.tempdir.
This allows the user to leverage the volume straightforwardly:
from metaflow import current
from transformers import AutoModelForSeq2SeqLM

# cache the downloaded model weights on the memory-backed tmpfs volume
model = AutoModelForSeq2SeqLM.from_pretrained(
    model_path,
    cache_dir=current.tempdir,
    device_map='auto',
    load_in_8bit=True,
)
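Putting the pieces together, a step can combine @batch(use_tmpfs=True) with metaflow.S3 so that both downloads and caches stay in memory. This is only a sketch; the flow name and S3 prefix are hypothetical:

from metaflow import FlowSpec, S3, batch, current, step

class TmpfsFlow(FlowSpec):

    @batch(memory=100000, use_tmpfs=True)
    @step
    def start(self):
        # METAFLOW_TEMPDIR points at the tmpfs mount, so these downloads never touch local disk
        with S3(s3root='s3://my-bucket/raw-data/') as s3:
            self.num_files = len(s3.get_all())
        print('downloaded', self.num_files, 'files via', current.tempdir)
        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == '__main__':
    TmpfsFlow()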
Introduce auto-completion support for metaflow client in ipython notebooks
With this release, Metaflow client objects support autocomplete in IPython notebooks.
from metaflow import Flow, Metaflow
Metaflow().flows
>>> [Flow('HelloFlow'), Flow('MovieStatsFlow')]
flow = Flow('HelloFlow') # No autocomplete here
flow._ipython_key_completions_()
>>>
['1680815181013681',
'1680815178214737',
'1680432265121345',
'1680430310127401']
run = flow["1680815178214737"]
run._ipython_key_completions_()
>>> ['end', 'hello', 'start']
step = run["hello"]
step._ipython_key_completions_()
>>> ['2']
task = step["2"]
task._ipython_key_completions_()
>>> ['name']
Improvements
Reduce metadata service network calls for faster execution of flows
With this release, Metaflow flows should execute a bit faster since a few network calls to Metaflow's metadata service are now cached. Expect further improvements in flow execution times over the next few releases.
Handle unsupported data types in pandas.DataFrame gracefully for Metaflow's default card
With this release, Metaflow card creation will handle non-JSON-parseable types gracefully by replacing the column values with UnsupportedType : <TYPENAME>.
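As an illustration, a step like the hypothetical sketch below stores a DataFrame with a column that is not JSON-parseable; the default card now renders that column as UnsupportedType : <TYPENAME> instead of failing:

import pandas as pd
from metaflow import FlowSpec, card, step

class CardFlow(FlowSpec):

    @card  # the default card renders self.df, including the unsupported column
    @step
    def start(self):
        self.df = pd.DataFrame({
            'id': [1, 2],
            'payload': [{1, 2}, {3, 4}],  # Python sets are not JSON-parseable
        })
        self.next(self.end)

    @step
    def end(self):
        pass

if __name__ == '__main__':
    CardFlow()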
In case you need any assistance or have feedback for us, ping us at chat.metaflow.org or open a GitHub issue.
What's Changed
- Introduce codeql by @savingoyal in #1272
- fix: GitHub Workflow security recommendations by @saikonen in #1334
- Add docstring style to contribution code style guide by @jimbudarz in #1328
- remove METAFLOW_DATATOOLS_SYSROOT_S3 from configuration command by @tfurmston in #1312
- Fix #1326 and strips ext_info from blobs passed to schedulers by @romain-intel in #1329
- Namespace check skip feature from #1271 by @romain-intel in #1341
- Introduce tmpfs config options for @Batch by @savingoyal in #1287
- fix: kubernetes ec2 instance metadata timeout by @saikonen in #1335
- Make the contact information displayed by the Metaflow command configurable by @romain-intel in #1340
- Safely parse pandas.DataFrame for default card by @valayDave in #1344
- Reduce multiple metadata service rtts using cached version by @shrinandj in #1347
- Kubernetes running job cancellation to fallback to patching parallelism by @jackie-ob in #1353
- Remove encoding for JSON.loads by @wangchy27 in #1352
- Prep for 2.8.3 release by @savingoyal in #1354
New Contributors
- @wangchy27 made their first contribution in #1352
Full Changelog: 2.8.2...2.8.3
2.8.2
Features
Introduce support for Metaflow sandboxes for Metaflow tutorials
With this release, the Metaflow tutorials can now be executed within the Metaflow sandboxes, making it trivial to evaluate whether Metaflow is a good fit for your organization without committing to deploying the necessary cloud infrastructure upfront.
Display Metaflow UI URL on the terminal when a flow is executed via step-functions trigger or argo-workflows trigger
With this release, if the Metaflow config (in ~/.metaflowconfig) includes a reference to the deployed Metaflow UI (assigned to METAFLOW_UI_URL), the user-facing logs in the terminal will indicate the direct link to the relevant run view in the Metaflow UI.
In case you need any assistance or have feedback for us, ping us at chat.metaflow.org or open a GitHub issue.
What's Changed
- Add a way to create aliases to other parts of metaflow by @romain-intel in #1304
- feature: emit UI url for argo workflows and step-functions by @saikonen in #1311
- fix: update cards dependencies by @saikonen in #1314
- Sync tutorials for Outerbounds sandbox by @emattia in #1299
- Fix the logs command in cases where the step/task hasn't finished by @romain-intel in #1315
- Update version to 2.8.2 by @savingoyal in #1325
New Contributors
Full Changelog: 2.8.1...2.8.2
2.8.1
Features
Add ec2 instance metadata in task.metadata_dict when a task executes on AWS Batch
With this release, task.metadata_dict will include the fields ec2-instance-id, ec2-instance-type, ec2-region, and ec2-availability-zone whenever the Metaflow task is executed on AWS Batch and the task container has access to the ec2 metadata magic URL.
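For example, once a run has finished, the new fields can be read through the client API; the pathspec below is hypothetical:

from metaflow import Task

# hypothetical pathspec of the form flow/run/step/task-id
task = Task('HelloFlow/1680815178214737/start/1')
meta = task.metadata_dict
print(meta.get('ec2-instance-type'), meta.get('ec2-availability-zone'))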
Display Metaflow UI URL on the terminal when a flow is executed either via run or resume
With this release, if the Metaflow config (in ~/.metaflowconfig) includes a reference to the deployed Metaflow UI (assigned to METAFLOW_UI_URL), the user-facing logs in the terminal will indicate the direct link to the relevant run view in the Metaflow UI.
In case you need any assistance or have feedback for us, ping us at chat.metaflow.org or open a GitHub issue.
2.8.0
Features
Introduce capability to schedule Metaflow flows with Apache Airflow
With this release, we are introducing an integration with Apache Airflow similar to our integrations with AWS Step Functions and Argo Workflows where Metaflow users can easily deploy & schedule their DAGs by simply executing
python myflow.py airflow create mydag.py
which will create an Airflow DAG for them. With this feature, Metaflow users can now enjoy all the features of Metaflow on top of Apache Airflow - including a more user-friendly and productive development API for data scientists and data engineers - without needing to change anything in their existing pipelines or operational playbooks, as described in the announcement blog post. To learn how to deploy and operate the integration, see Using Airflow with Metaflow.
When running on Airflow, Metaflow code works exactly as it does locally: No changes are required in the code. With this integration, Metaflow users can inspect their flows deployed on Apache Airflow as before and debug and reproduce results from Apache Airflow on their local laptop or within a notebook. All tasks are run on Kubernetes respecting the @resources decorator as if the @kubernetes decorator was added to all steps, as explained in Executing Tasks Remotely.
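As a rough sketch, any plain Metaflow flow can be deployed this way; the flow below is illustrative, and its @resources request is honored by the Kubernetes pods that Airflow launches:

from metaflow import FlowSpec, resources, step

class AirflowDemoFlow(FlowSpec):

    @resources(cpu=2, memory=8000)
    @step
    def start(self):
        self.message = 'scheduled by Airflow, executed on Kubernetes'
        self.next(self.end)

    @step
    def end(self):
        print(self.message)

if __name__ == '__main__':
    AirflowDemoFlow()

Deploying it then mirrors the command above, e.g. python airflowdemoflow.py airflow create airflow_demo_dag.py (file names here are illustrative).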
The main benefits of using Metaflow with Airflow are:
- You get to use the human-friendly API of Metaflow to define and test workflows. Almost all features of Metaflow work with Airflow out of the box, except nested foreaches, which are not yet supported by Airflow, and @batch, since the current integration only supports @kubernetes at the moment.
- You can deploy Metaflow flows to your existing Airflow server without having to change anything operationally. From Airflow's point of view, Metaflow flows look like any other Airflow DAG.
- If you want to consider moving to another orchestrator supported by Metaflow, you can test them easily just by changing one command to deploy to Argo Workflows or AWS Step Functions.
In case you need any assistance or have feedback for us, ping us at chat.metaflow.org or open a GitHub issue.
Metaflow 2.7.23
What's Changed
- New MF configs for Argo Workflows by @jackie-ob in #1267
- Added typing information for all public APIs by @romain-intel in #1158
- When packaging metaflow_extensions, add an empty __init__.py file by @romain-intel in #1276
- Replace instances of "INFO" with a constant by @romain-intel in #1275
- Fix an issue with threading and the escape hatch. by @romain-intel in #1274
Full Changelog: 2.7.22...2.7.23
Metaflow 2.7.22
What's Changed
- Test metaflow.s3 on multiple Python versions across Linux and MacOS by @savingoyal in #1246
- fix metaflow.s3 tests by @savingoyal in #1248
- support timezone when scheduling with Argo workflows by @amerberg in #1250
- Airflow V2 PR (Foreach + Sensors + GCP + MWAA Support) by @valayDave in #1256
- Expose Kubernetes Node IP in task metadata by @savingoyal in #1254
- Fix timezone code in Argo Workflows by @jackie-ob in #1258
- Implement @Secrets, with AWS support by @jackie-ob in #1251
- Expose AWS instance metadata for @kubernetes by @savingoyal in #1263
- Fix an issue with configurations for env escape extensions by @romain-intel in #1264
- Bump version to 2.7.22 by @savingoyal in #1247
Full Changelog: 2.7.21...2.7.22
Metaflow 2.7.21
What's Changed
- Fix extension support on Python 3.5 by @romain-intel in #1245
Full Changelog: 2.7.20...2.7.21
Metaflow 2.7.20
What's Changed
- Bug/long card by @obgibson in #1233
- Restrict token permissions for test jobs by @romain-intel in #1238
- Since we're now providing type annotations, add typed marker as per PEP-561 by @oavdeev in #1239
- Allow configuration of plugins/cmds using METAFLOW_ENABLED_* variable by @romain-intel in #1212
- Bump setup.py to 2.7.20 by @romain-intel in #1244
Incompatible change
If you are using the unsupported Metaflow Extensions mechanism, you may have to change them slightly. Please see https://github.com/Netflix/metaflow-extensions-template/blob/master/CHANGES.md for more details.
Full Changelog: 2.7.19...2.7.20
Metaflow 2.7.19
What's Changed
- Reduce @Environment arg length for step-functions create by @savingoyal in #1215
- Add support for Kubernetes tolerations by @odracci in #1207
- Support for AWS inferentia instances on Batch by @DanCorvesor in #1205
- Fix CVE-2007-4559 (tar.extractall) by @romain-intel in #1213
- Support .conda packages by @savingoyal in #1221
New Contributors
- @odracci made their first contribution in #1207
- @DanCorvesor made their first contribution in #1205
Full Changelog: 2.7.18...2.7.19
Metaflow 2.7.18
What's Changed
- Adds check for tutorials dir and flattens if necessary by @ashrielbrian in #1211
- Fix bug with datastore backend instantiation by @savingoyal in #1210
New Contributors
- @ashrielbrian made their first contribution in #1211
Full Changelog: 2.7.17...2.7.18