Features
- Introduce support for tmpfs for executions on AWS Batch
- Introduce auto-completion support for metaflow client in ipython notebooks
Improvements
- Reduce metadata service network calls for faster execution of flows
- Handle unsupported data types for pandas.DataFrame gracefully for Metaflow's default card

Features

Introduce support for tmpfs for executions on AWS Batch

It is typical for the user code in a Metaflow step to download assets from an object store, e.g. S3. Examples include serialized models and raw input data, such as unstructured media or structured Parquet files. The amount of data loaded in a task is typically 10-100GB, allowing even terabytes to be handled in a foreach.

To reduce IO bottlenecks in such tasks, we provide an optimized client for S3, metaflow.S3, that makes it possible to download data using all available network bandwidth. Notably, in a modern instance the available network bandwidth can be higher than the local disk bandwidth. Consider: SATA 3.0 provides 6Gbit/s whereas a large instance can have 20Gbit/s network throughput. Even Gen3 NVMe provides just 16Gbit/s. To benefit from the full network bandwidth, local disk IO must be bypassed. The metaflow.S3 client accomplishes this by relying on the page cache: nominally, files are downloaded to a temporary directory on disk, but in practice all data stays in the page cache. This assumes that the downloaded data fits in memory, which can be ensured with a high enough @resources(memory=) setting.
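To make the bandwidth comparison concrete, here is a back-of-the-envelope sketch that turns the figures quoted above into transfer times. The 100 GB payload is an illustrative assumption taken from the upper end of the typical range, and the helper name `transfer_seconds` is hypothetical. Real throughput depends on many other factors.

```python
# Rough transfer times for the bandwidth figures quoted in the text.
# The 100 GB payload is an illustrative assumption, not a measurement.

def transfer_seconds(size_gb: float, bandwidth_gbit_per_s: float) -> float:
    """Time to move size_gb gigabytes over a link of the given bandwidth."""
    size_gbit = size_gb * 8  # 1 byte = 8 bits
    return size_gbit / bandwidth_gbit_per_s

links = {
    "SATA 3.0 (6 Gbit/s)": 6,
    "Gen3 NVMe (16 Gbit/s, as quoted)": 16,
    "network (20 Gbit/s)": 20,
}

for name, gbit_per_s in links.items():
    print(f"{name}: {transfer_seconds(100, gbit_per_s):.0f} s")
```

Under these assumptions, the network link finishes in 40 seconds while a SATA disk would need more than two minutes, which is why writing through the local disk would become the bottleneck.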

The above setup, which can provide excellent IO performance in general, has a small gotcha: the instance needs to have enough local disk space to back all the data, although no data actually hits the disk. Increasingly, instances may have more memory than local disk space available, so this superfluous requirement becomes a problem. The issue is further amplified by the fact that, as of today, it is impossible to add ephemeral volumes on the fly on AWS Batch. This puts users in a strange situation: the instance has enough RAM to hold all the data in memory, and there are ways to download it quickly from S3, but the lack of local disk space (which is not even needed) makes it impossible to access the data.

AWS Batch supports mounting a tmpfs filesystem on the fly. Using this feature, the user can create a memory-backed file system which can be used as temporary space for downloaded data. This removes the need to deal with local disks. One can simply use a minimal root filesystem, which greatly simplifies the infrastructure setup.

With this release, we introduce a new config option, METAFLOW_TEMPDIR, which, if defined, is used as the default metaflow.S3(tmproot). If METAFLOW_TEMPDIR is not defined, tmproot='.' as before. In addition, a few new attributes are introduced for the @batch decorator:

Attribute (default)	Default behavior	Override semantics
use_tmpfs=False	tmpfs disabled	use_tmpfs=True enables tmpfs
tmpfs_tempdir=True	sets METAFLOW_TEMPDIR=tmpfs_path	tmpfs_tempdir=False doesn't set METAFLOW_TEMPDIR
tmpfs_size=None	sets tmpfs size to 50% of @resources(memory)	tmpfs size in megabytes
tmpfs_path=None	use /metaflow_temp as tmpfs_path	custom mount point
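The table above can be read as a small piece of resolution logic. The following is a minimal sketch of that logic, not Metaflow's implementation: `resolve_tmpfs` and its return shape are hypothetical, memory and sizes are taken to be in megabytes, and it assumes that passing an explicit tmpfs_size enables tmpfs even when use_tmpfs is left at its default.

```python
from typing import Optional

DEFAULT_TMPFS_PATH = "/metaflow_temp"  # default mount point from the table


def resolve_tmpfs(
    memory: int,
    use_tmpfs: bool = False,
    tmpfs_tempdir: bool = True,
    tmpfs_size: Optional[int] = None,
    tmpfs_path: Optional[str] = None,
) -> dict:
    """Hypothetical helper: effective tmpfs settings for a set of attributes."""
    # Assumption: an explicit tmpfs_size implies that tmpfs is enabled.
    enabled = use_tmpfs or tmpfs_size is not None
    if not enabled:
        return {"enabled": False, "env": {}}
    # Default size is 50% of the requested memory (in megabytes).
    size_mb = tmpfs_size if tmpfs_size is not None else memory // 2
    path = tmpfs_path or DEFAULT_TMPFS_PATH
    # METAFLOW_TEMPDIR is only set when tmpfs_tempdir is left enabled.
    env = {"METAFLOW_TEMPDIR": path} if tmpfs_tempdir else {}
    return {"enabled": True, "size_mb": size_mb, "path": path, "env": env}


print(resolve_tmpfs(memory=100000, use_tmpfs=True))
```

With memory=100000 and use_tmpfs=True, this sketch yields a 50000 MB volume mounted at /metaflow_temp with METAFLOW_TEMPDIR pointing at it.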

Examples

Handle large amounts of data in-memory with Batch:

@batch(memory=100000, use_tmpfs=True)

In this case, at most 50GB is available for tmpfs, and metaflow.S3 uses it by default. Note that tmpfs only consumes the amount of memory corresponding to the data stored, so there is no downside in setting a large size by default.

Increase tmpfs size:

@batch(memory=100000, tmpfs_size=100000)

This lets tmpfs use all available memory. Note that use_tmpfs=True doesn't have to be specified redundantly when tmpfs_size is set.

Custom tmpfs use case:

@batch(memory=100000, tmpfs_size=10000, tmpfs_path='/data', tmpfs_tempdir=False)

This gives full control over the settings. Because tmpfs_tempdir=False, metaflow.S3 doesn't use the tmpfs volume in this case.

Besides metaflow.S3, the user may want to use the tmpfs volume for their own use cases. In particular, many modern ML libraries require a local cache. To support these use cases, tmpfs_path is exposed through the current object, as current.tempdir. This allows the user to leverage the volume straightforwardly:

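from metaflow import current
from transformers import AutoModelForSeq2SeqLM

# model_path is assumed to be defined earlier in the step
model = AutoModelForSeq2SeqLM.from_pretrained(
    model_path,
    cache_dir=current.tempdir,  # keep the library's cache on the tmpfs volume
    device_map='auto',
    load_in_8bit=True,
)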

Introduce auto-completion support for metaflow client in ipython notebooks

With this release, Metaflow client objects support key auto-completion in IPython notebooks:

from metaflow import Flow, Metaflow

Metaflow().flows
>>> [Flow('HelloFlow'), Flow('MovieStatsFlow')]

flow = Flow('HelloFlow') # No autocomplete here
flow._ipython_key_completions_()
>>> 
['1680815181013681',
 '1680815178214737',
 '1680432265121345',
 '1680430310127401']

run = flow["1680815178214737"]
run._ipython_key_completions_()
>>> ['end', 'hello', 'start']

step = run["hello"]
step._ipython_key_completions_()
>>> ['2']

task = step["2"]
task._ipython_key_completions_()
>>> ['name']
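The `_ipython_key_completions_` method used above is the hook IPython looks for when completing keys inside square brackets. As a minimal sketch of how any object can opt in, the toy class below (hypothetical, unrelated to Metaflow's own classes) exposes its keys the same way:

```python
class Registry:
    """Toy container (hypothetical) whose keys can be tab-completed in IPython."""

    def __init__(self, items):
        self._items = dict(items)

    def __getitem__(self, key):
        return self._items[key]

    def _ipython_key_completions_(self):
        # IPython calls this hook to offer completions after typing obj["<TAB>
        return list(self._items)


steps = Registry({"start": 1, "hello": 2, "end": 3})
print(steps._ipython_key_completions_())
```

In a notebook, typing `steps["` followed by the Tab key would then offer the keys returned by the hook.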

Improvements

Reduce metadata service network calls for faster execution of flows

With this release, flows should execute slightly faster, since a few network calls to Metaflow's metadata service are now cached. Expect further improvements in flow execution times over the next few releases.
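As a general illustration of why caching saves round trips, the sketch below memoizes a stand-in lookup function. It does not reflect how Metaflow caches its metadata calls: `fetch_from_service` and `cached_fetch` are hypothetical names, and no real service is contacted.

```python
from functools import lru_cache

calls = {"count": 0}


def fetch_from_service(key: str) -> str:
    """Stand-in for a network round trip (hypothetical, nothing is contacted)."""
    calls["count"] += 1
    return f"value-for-{key}"


@lru_cache(maxsize=None)
def cached_fetch(key: str) -> str:
    # The first call per key reaches the "service"; repeats come from the cache.
    return fetch_from_service(key)


for _ in range(3):
    cached_fetch("flow-metadata")

print(calls["count"])
```

Three lookups of the same key trigger only one call to the stand-in service.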

Handle unsupported data types for pandas.DataFrame gracefully for Metaflow's default card

With this release, Metaflow card creation gracefully handles types that cannot be serialized to JSON by replacing the column values with UnsupportedType : <TYPENAME>.
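The idea can be sketched with the standard library alone. This is an illustration under assumptions, not the code used by the default card: `json_safe` is a hypothetical helper, a plain list stands in for a DataFrame column, and using the Python type name as <TYPENAME> is an assumption.

```python
import json


def json_safe(value):
    """Return value unchanged if json can serialize it, else a placeholder."""
    try:
        json.dumps(value)
        return value
    except (TypeError, ValueError):
        # Assumption: the placeholder carries the Python type name.
        return f"UnsupportedType : {type(value).__name__}"


column = [1, "a", {1, 2}, complex(1, 2), None]
print([json_safe(v) for v in column])
```

Values such as sets and complex numbers are replaced by a placeholder, while serializable values pass through untouched.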

In case you need any assistance or have feedback for us, ping us at chat.metaflow.org or open a GitHub issue.

What's Changed

Introduce codeql by @savingoyal in #1272
fix: GitHub Workflow security recommendations by @saikonen in #1334
Add docstring style to contribution code style guide by @jimbudarz in #1328
remove METAFLOW_DATATOOLS_SYSROOT_S3 from configuration command by @tfurmston in #1312
Fix #1326 and strips ext_info from blobs passed to schedulers by @romain-intel in #1329
Namespace check skip feature from #1271 by @romain-intel in #1341
Introduce tmpfs config options for @Batch by @savingoyal in #1287
fix: kubernetes ec2 instance metadata timeout by @saikonen in #1335
Make the contact information displayed by the Metaflow command configurable by @romain-intel in #1340
Safely parse pandas.DataFrame for default card by @valayDave in #1344
Reduce multiple metadata service rtts using cached version. by @shrinandj in #1347
Kubernetes running job cancellation to fallback to patching parallelism by @jackie-ob in #1353
Remove encoding for JSON.loads by @wangchy27 in #1352
Prep for 2.8.3 release by @savingoyal in #1354

New Contributors

@wangchy27 made their first contribution in #1352

Full Changelog: 2.8.2...2.8.3

2.8.3