From d54b74fd9666cba6f217bf333e1b72f301ff597b Mon Sep 17 00:00:00 2001
From: Ryan Despain <2043940+ryaminal@users.noreply.github.com>
Date: Wed, 10 Apr 2024 08:03:13 -0600
Subject: [PATCH] Update and rename README.rst to README.md

Rename and convert rst to markdown... because github is "dumb" and
apparently occasionally just stops syntax-highlighting on rst code-blocks...
---
 README.md  | 551 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 README.rst | 511 -------------------------------------------------
 2 files changed, 551 insertions(+), 511 deletions(-)
 create mode 100644 README.md
 delete mode 100644 README.rst

diff --git a/README.md b/README.md
new file mode 100644
index 00000000..4fce8eca
--- /dev/null
+++ b/README.md
@@ -0,0 +1,551 @@
# smart_open — utils for streaming large files in Python

[![License](https://img.shields.io/pypi/l/smart_open.svg)](https://github.com/RaRe-Technologies/smart_open/blob/master/LICENSE)
[![GHA](https://github.com/RaRe-Technologies/smart_open/workflows/Test/badge.svg)](https://github.com/RaRe-Technologies/smart_open/actions?query=workflow%3ATest)
[![Coveralls](https://coveralls.io/repos/github/RaRe-Technologies/smart_open/badge.svg?branch=develop)](https://coveralls.io/github/RaRe-Technologies/smart_open?branch=HEAD)
[![Downloads](https://pepy.tech/badge/smart-open/month)](https://pypi.org/project/smart-open/)

## What?

`smart_open` is a Python 3 library for **efficient streaming of very
large files** from/to storages such as S3, GCS, Azure Blob Storage,
HDFS, WebHDFS, HTTP, HTTPS, SFTP, or local filesystem. It supports
transparent, on-the-fly (de-)compression for a variety of different
formats.

`smart_open` is a drop-in replacement for Python's built-in `open()`: it
can do anything `open` can (100% compatible, falls back to native `open`
wherever possible), plus lots of nifty extra stuff on top.

**Python 2.7 is no longer supported. If you need Python 2.7, please
use** [smart_open
1.10.1](https://github.com/RaRe-Technologies/smart_open/releases/tag/1.10.0),
**the last version to support Python 2.**

## Why?

Working with large remote files, for example using Amazon's
[boto3](https://boto3.amazonaws.com/v1/documentation/api/latest/index.html)
Python library, is a pain. `boto3`'s `Object.upload_fileobj()` and
`Object.download_fileobj()` methods require gotcha-prone boilerplate to
use successfully, such as constructing file-like object wrappers.
`smart_open` shields you from that. It builds on boto3 and other remote
storage libraries, but offers a **clean unified Pythonic API**. The
result is less code for you to write and fewer bugs to make.

## How?

`smart_open` is well-tested, well-documented, and has a simple Pythonic
API:
+ +``` python +>>> from smart_open import open +>>> +>>> # stream lines from an S3 object +>>> for line in open('s3://commoncrawl/robots.txt'): +... print(repr(line)) +... break +'User-Agent: *\n' + +>>> # stream from/to compressed files, with transparent (de)compression: +>>> for line in open('smart_open/tests/test_data/1984.txt.gz', encoding='utf-8'): +... print(repr(line)) +'It was a bright cold day in April, and the clocks were striking thirteen.\n' +'Winston Smith, his chin nuzzled into his breast in an effort to escape the vile\n' +'wind, slipped quickly through the glass doors of Victory Mansions, though not\n' +'quickly enough to prevent a swirl of gritty dust from entering along with him.\n' + +>>> # can use context managers too: +>>> with open('smart_open/tests/test_data/1984.txt.gz') as fin: +... with open('smart_open/tests/test_data/1984.txt.bz2', 'w') as fout: +... for line in fin: +... fout.write(line) +74 +80 +78 +79 + +>>> # can use any IOBase operations, like seek +>>> with open('s3://commoncrawl/robots.txt', 'rb') as fin: +... for line in fin: +... print(repr(line.decode('utf-8'))) +... break +... offset = fin.seek(0) # seek to the beginning +... print(fin.read(4)) +'User-Agent: *\n' +b'User' + +>>> # stream from HTTP +>>> for line in open('http://example.com/index.html'): +... print(repr(line)) +... break +'\n' +``` + +
+ +
+ +Other examples of URLs that `smart_open` accepts: + + s3://my_bucket/my_key + s3://my_key:my_secret@my_bucket/my_key + s3://my_key:my_secret@my_server:my_port@my_bucket/my_key + gs://my_bucket/my_blob + azure://my_bucket/my_blob + hdfs:///path/file + hdfs://path/file + webhdfs://host:port/path/file + ./local/path/file + ~/local/path/file + local/path/file + ./local/path/file.gz + file:///home/user/file + file:///home/user/file.bz2 + [ssh|scp|sftp]://username@host//path/file + [ssh|scp|sftp]://username@host/path/file + [ssh|scp|sftp]://username:password@host/path/file + +
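The scheme prefix alone selects the transport; the same `open()` call works for each of these forms. A short, hedged sketch (the hosts, credentials and paths are placeholders, and the SSH/SCP/SFTP forms assume the SSH dependencies are installed):

``` python
from smart_open import open

# SFTP: credentials travel in the URL; the transfer happens over SSH.
with open('sftp://username:password@host/path/file', 'rb') as fin:
    header = fin.read(16)

# file:// URLs and bare local paths use the local transport, with
# transparent decompression inferred from the .bz2 extension.
for line in open('file:///home/user/file.bz2', encoding='utf-8'):
    print(line.rstrip())
```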
+ +## Documentation + +### Installation + +`smart_open` supports a wide range of storage solutions, including AWS +S3, Google Cloud and Azure. Each individual solution has its own +dependencies. By default, `smart_open` does not install any +dependencies, in order to keep the installation size small. You can +install these dependencies explicitly using: + + pip install smart_open[azure] # Install Azure deps + pip install smart_open[gcs] # Install GCS deps + pip install smart_open[s3] # Install S3 deps + +Or, if you don't mind installing a large number of third party +libraries, you can install all dependencies using: + + pip install smart_open[all] + +Be warned that this option increases the installation size +significantly, e.g. over 100MB. + +If you're upgrading from `smart_open` versions 2.x and below, please +check out the [Migration Guide](MIGRATING_FROM_OLDER_VERSIONS.rst). + +### Built-in help + +For detailed API info, see the online help: + +``` python +help('smart_open') +``` + +or click +[here](https://github.com/RaRe-Technologies/smart_open/blob/master/help.txt) +to view the help in your browser. + +### More examples + +For the sake of simplicity, the examples below assume you have all the +dependencies installed, i.e. you have done: + + pip install smart_open[all] + +``` python +>>> import os, boto3 +>>> from smart_open import open +>>> +>>> # stream content *into* S3 (write mode) using a custom session +>>> session = boto3.Session( +... aws_access_key_id=os.environ['AWS_ACCESS_KEY_ID'], +... aws_secret_access_key=os.environ['AWS_SECRET_ACCESS_KEY'], +... ) +>>> url = 's3://smart-open-py37-benchmark-results/test.txt' +>>> with open(url, 'wb', transport_params={'client': session.client('s3')}) as fout: +... bytes_written = fout.write(b'hello world!') +... 
print(bytes_written) +12 +``` + +``` python +# stream from HDFS +for line in open('hdfs://user/hadoop/my_file.txt', encoding='utf8'): + print(line) + +# stream from WebHDFS +for line in open('webhdfs://host:port/user/hadoop/my_file.txt'): + print(line) + +# stream content *into* HDFS (write mode): +with open('hdfs://host:port/user/hadoop/my_file.txt', 'wb') as fout: + fout.write(b'hello world') + +# stream content *into* WebHDFS (write mode): +with open('webhdfs://host:port/user/hadoop/my_file.txt', 'wb') as fout: + fout.write(b'hello world') + +# stream from a completely custom s3 server, like s3proxy: +for line in open('s3u://user:secret@host:port@mybucket/mykey.txt'): + print(line) + +# Stream to Digital Ocean Spaces bucket providing credentials from boto3 profile +session = boto3.Session(profile_name='digitalocean') +client = session.client('s3', endpoint_url='https://ams3.digitaloceanspaces.com') +transport_params = {'client': client} +with open('s3://bucket/key.txt', 'wb', transport_params=transport_params) as fout: + fout.write(b'here we stand') + +# stream from GCS +for line in open('gs://my_bucket/my_file.txt'): + print(line) + +# stream content *into* GCS (write mode): +with open('gs://my_bucket/my_file.txt', 'wb') as fout: + fout.write(b'hello world') + +# stream from Azure Blob Storage +connect_str = os.environ['AZURE_STORAGE_CONNECTION_STRING'] +transport_params = { + 'client': azure.storage.blob.BlobServiceClient.from_connection_string(connect_str), +} +for line in open('azure://mycontainer/myfile.txt', transport_params=transport_params): + print(line) + +# stream content *into* Azure Blob Storage (write mode): +connect_str = os.environ['AZURE_STORAGE_CONNECTION_STRING'] +transport_params = { + 'client': azure.storage.blob.BlobServiceClient.from_connection_string(connect_str), +} +with open('azure://mycontainer/my_file.txt', 'wb', transport_params=transport_params) as fout: + fout.write(b'hello world') +``` + +### Compression Handling + +The top-level compression parameter +controls compression/decompression behavior when reading and writing. +The supported values for this parameter are: + +- `infer_from_extension` (default behavior) +- `disable` +- `.gz` +- `.bz2` + +By default, `smart_open` determines the compression algorithm to use +based on the file extension. + +``` python +>>> from smart_open import open, register_compressor +>>> with open('smart_open/tests/test_data/1984.txt.gz') as fin: +... print(fin.read(32)) +It was a bright cold day in Apri +``` + +You can override this behavior to either disable compression, or +explicitly specify the algorithm to use. To disable compression: + +``` python +>>> from smart_open import open, register_compressor +>>> with open('smart_open/tests/test_data/1984.txt.gz', 'rb', compression='disable') as fin: +... print(fin.read(32)) +b'\x1f\x8b\x08\x08\x85F\x94\\\x00\x031984.txt\x005\x8f=r\xc3@\x08\x85{\x9d\xe2\x1d@' +``` + +To specify the algorithm explicitly (e.g. for non-standard file +extensions): + +``` python +>>> from smart_open import open, register_compressor +>>> with open('smart_open/tests/test_data/1984.txt.gzip', compression='.gz') as fin: +... print(fin.read(32)) +It was a bright cold day in Apri +``` + +You can also easily add support for other file extensions and +compression formats. For example, to open xz-compressed files: + +``` python +>>> import lzma, os +>>> from smart_open import open, register_compressor + +>>> def _handle_xz(file_obj, mode): +... 
return lzma.LZMAFile(filename=file_obj, mode=mode, format=lzma.FORMAT_XZ) + +>>> register_compressor('.xz', _handle_xz) + +>>> with open('smart_open/tests/test_data/1984.txt.xz') as fin: +... print(fin.read(32)) +It was a bright cold day in Apri +``` + +`lzma` is in the standard library in Python 3.3 and greater. For 2.7, +use [backports.lzma](https://pypi.org/project/backports.lzma/). + +### Transport-specific Options + +`smart_open` supports a wide range of transport options out of the box, +including: + +- S3 +- HTTP, HTTPS (read-only) +- SSH, SCP and SFTP +- WebHDFS +- GCS +- Azure Blob Storage + +Each option involves setting up its own set of parameters. For example, +for accessing S3, you often need to set up authentication, like API keys +or a profile name. `smart_open`'s `open` function accepts a keyword +argument `transport_params` which accepts additional parameters for the +transport layer. Here are some examples of using this parameter: + +``` python +>>> import boto3 +>>> fin = open('s3://commoncrawl/robots.txt', transport_params=dict(client=boto3.client('s3'))) +>>> fin = open('s3://commoncrawl/robots.txt', transport_params=dict(buffer_size=1024)) +``` + +For the full list of keyword arguments supported by each transport +option, see the documentation: + +``` python +help('smart_open.open') +``` + +### S3 Credentials + +`smart_open` uses the `boto3` library to talk to S3. `boto3` has several +[mechanisms](https://boto3.amazonaws.com/v1/documentation/api/latest/guide/configuration.html) +for determining the credentials to use. By default, `smart_open` will +defer to `boto3` and let the latter take care of the credentials. There +are several ways to override this behavior. + +The first is to pass a `boto3.Client` object as a transport parameter to +the `open` function. You can customize the credentials when constructing +the session for the client. `smart_open` will then use the session when +talking to S3. + +``` python +session = boto3.Session( + aws_access_key_id=ACCESS_KEY, + aws_secret_access_key=SECRET_KEY, + aws_session_token=SESSION_TOKEN, +) +client = session.client('s3', endpoint_url=..., config=...) +fin = open('s3://bucket/key', transport_params={'client': client}) +``` + +Your second option is to specify the credentials within the S3 URL +itself: + +``` python +fin = open('s3://aws_access_key_id:aws_secret_access_key@bucket/key', ...) +``` + +*Important*: The two methods above are **mutually exclusive**. If you +pass an AWS client *and* the URL contains credentials, `smart_open` will +ignore the latter. + +*Important*: `smart_open` ignores configuration files from the older +`boto` library. Port your old `boto` settings to `boto3` in order to use +them with `smart_open`. + +### S3 Advanced Usage + +Additional keyword arguments can be propagated to the boto3 methods that +are used by `smart_open` under the hood using the `client_kwargs` +transport parameter. + +For instance, to upload a blob with Metadata, ACL, StorageClass, these +keyword arguments can be passed to `create_multipart_upload` +([docs](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3.html#S3.Client.create_multipart_upload)). 
+ +``` python +kwargs = {'Metadata': {'version': 2}, 'ACL': 'authenticated-read', 'StorageClass': 'STANDARD_IA'} +fout = open('s3://bucket/key', 'wb', transport_params={'client_kwargs': {'S3.Client.create_multipart_upload': kwargs}}) +``` + +### Iterating Over an S3 Bucket's Contents + +Since going over all (or select) keys in an S3 bucket is a very common +operation, there's also an extra function `smart_open.s3.iter_bucket()` +that does this efficiently, **processing the bucket keys in parallel** +(using multiprocessing): + +``` python +>>> from smart_open import s3 +>>> # we use workers=1 for reproducibility; you should use as many workers as you have cores +>>> bucket = 'silo-open-data' +>>> prefix = 'Official/annual/monthly_rain/' +>>> for key, content in s3.iter_bucket(bucket, prefix=prefix, accept_key=lambda key: '/201' in key, workers=1, key_limit=3): +... print(key, round(len(content) / 2**20)) +Official/annual/monthly_rain/2010.monthly_rain.nc 13 +Official/annual/monthly_rain/2011.monthly_rain.nc 13 +Official/annual/monthly_rain/2012.monthly_rain.nc 13 +``` + +### GCS Credentials + +`smart_open` uses the `google-cloud-storage` library to talk to GCS. +`google-cloud-storage` uses the `google-cloud` package under the hood to +handle authentication. There are several +[options](https://googleapis.dev/python/google-api-core/latest/auth.html) +to provide credentials. By default, `smart_open` will defer to +`google-cloud-storage` and let it take care of the credentials. + +To override this behavior, pass a `google.cloud.storage.Client` object +as a transport parameter to the `open` function. You can [customize the +credentials](https://googleapis.dev/python/storage/latest/client.html) +when constructing the client. `smart_open` will then use the client when +talking to GCS. To follow allow with the example below, [refer to +Google's +guide](https://cloud.google.com/storage/docs/reference/libraries#setting_up_authentication) +to setting up GCS authentication with a service account. + +``` python +import os +from google.cloud.storage import Client +service_account_path = os.environ['GOOGLE_APPLICATION_CREDENTIALS'] +client = Client.from_service_account_json(service_account_path) +fin = open('gs://gcp-public-data-landsat/index.csv.gz', transport_params=dict(client=client)) +``` + +If you need more credential options, you can create an explicit +`google.auth.credentials.Credentials` object and pass it to the Client. +To create an API token for use in the example below, refer to the [GCS +authentication +guide](https://cloud.google.com/storage/docs/authentication#apiauth). + +``` python +import os +from google.auth.credentials import Credentials +from google.cloud.storage import Client +token = os.environ['GOOGLE_API_TOKEN'] +credentials = Credentials(token=token) +client = Client(credentials=credentials) +fin = open('gs://gcp-public-data-landsat/index.csv.gz', transport_params={'client': client}) +``` + +### GCS Advanced Usage + +Additional keyword arguments can be propagated to the GCS open method +([docs](https://cloud.google.com/python/docs/reference/storage/latest/google.cloud.storage.blob.Blob#google_cloud_storage_blob_Blob_open)), +which is used by `smart_open` under the hood, using the +`blob_open_kwargs` transport parameter. 

Keyword arguments can also be propagated to the GCS `get_blob` method
([docs](https://cloud.google.com/python/docs/reference/storage/latest/google.cloud.storage.bucket.Bucket#google_cloud_storage_bucket_Bucket_get_blob))
when opening in read mode, using the `get_blob_kwargs` transport parameter.

Additional blob properties
([docs](https://cloud.google.com/python/docs/reference/storage/latest/google.cloud.storage.blob.Blob#properties))
can be set before an upload, as long as they are not read-only, using
the `blob_properties` transport parameter.

``` python
open_kwargs = {'predefined_acl': 'authenticated-read'}
properties = {'metadata': {'version': 2}, 'storage_class': 'COLDLINE'}
fout = open('gs://bucket/key', 'wb', transport_params={'blob_open_kwargs': open_kwargs, 'blob_properties': properties})
```

### Azure Credentials

`smart_open` uses the `azure-storage-blob` library to talk to Azure Blob
Storage. By default, `smart_open` will defer to `azure-storage-blob` and
let it take care of the credentials.

Azure Blob Storage does not have any way of inferring credentials;
therefore, passing an `azure.storage.blob.BlobServiceClient` object as a
transport parameter to the `open` function is required. You can
[customize the
credentials](https://docs.microsoft.com/en-us/azure/storage/common/storage-samples-python#authentication)
when constructing the client. `smart_open` will then use the client when
talking to Azure Blob Storage. To follow along with the example below, [refer to Azure's
guide](https://docs.microsoft.com/en-us/azure/storage/blobs/storage-quickstart-blobs-python#copy-your-credentials-from-the-azure-portal)
to setting up authentication.

``` python
import os
from azure.storage.blob import BlobServiceClient
azure_storage_connection_string = os.environ['AZURE_STORAGE_CONNECTION_STRING']
client = BlobServiceClient.from_connection_string(azure_storage_connection_string)
fin = open('azure://my_container/my_blob.txt', transport_params={'client': client})
```

If you need more credential options, refer to the [Azure Storage
authentication
guide](https://docs.microsoft.com/en-us/azure/storage/common/storage-samples-python#authentication).

### Azure Advanced Usage

Additional keyword arguments can be propagated to the
`commit_block_list` method
([docs](https://azuresdkdocs.blob.core.windows.net/$web/python/azure-storage-blob/12.14.1/azure.storage.blob.html#azure.storage.blob.BlobClient.commit_block_list)),
which is used by `smart_open` under the hood for uploads, using the
`blob_kwargs` transport parameter.

``` python
kwargs = {'metadata': {'version': 2}}
fout = open('azure://container/key', 'wb', transport_params={'blob_kwargs': kwargs})
```

### Drop-in replacement of `pathlib.Path.open`

`smart_open.open` can also be used with `Path` objects. The built-in
`Path.open()` is not able to read text from compressed files, so use
`patch_pathlib` to replace it with `smart_open.open()` instead. This can
be helpful when, e.g., working with compressed files.

``` python
>>> from pathlib import Path
>>> from smart_open.smart_open_lib import patch_pathlib
>>>
>>> _ = patch_pathlib()  # replace `Path.open` with `smart_open.open`
>>>
>>> path = Path("smart_open/tests/test_data/crime-and-punishment.txt.gz")
>>>
>>> with path.open("r") as infile:
... print(infile.readline()[:41])
В начале июля, в чрезвычайно жаркое время
```

## How do I ...?

See [this document](howto.md).

## Extending `smart_open`

See [this document](extending.md).
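As a quick taste of those extension points, the `register_compressor` hook shown earlier can add further compression formats. Below is a hedged sketch that registers Zstandard support, assuming the third-party `zstandard` package is installed (custom transport schemes are covered in [extending.md](extending.md)):

``` python
import zstandard  # third-party package; an assumption of this sketch

from smart_open import open, register_compressor

def _handle_zst(file_obj, mode):
    # smart_open hands the compressor hook the already-open binary stream
    # and the effective mode; wrap it in a Zstandard (de)compressing stream.
    if 'r' in mode:
        return zstandard.ZstdDecompressor().stream_reader(file_obj)
    return zstandard.ZstdCompressor().stream_writer(file_obj)

register_compressor('.zst', _handle_zst)

# After registration, *.zst files stream like any other (binary mode shown;
# the path is a placeholder):
# with open('data/example.zst', 'rb') as fin:
#     payload = fin.read()
```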
+ +## Testing `smart_open` + +`smart_open` comes with a comprehensive suite of unit tests. Before you +can run the test suite, install the test dependencies: + + pip install -e .[test] + +Now, you can run the unit tests: + + pytest smart_open + +The tests are also run automatically with [Travis +CI](https://travis-ci.org/RaRe-Technologies/smart_open) on every commit +push & pull request. + +## Comments, bug reports + +`smart_open` lives on +[Github](https://github.com/RaRe-Technologies/smart_open). You can file +issues or pull requests there. Suggestions, pull requests and +improvements welcome! + +------------------------------------------------------------------------ + +`smart_open` is open source software released under the [MIT +license](https://github.com/piskvorky/smart_open/blob/master/LICENSE). +Copyright (c) 2015-now [Radim Řehůřek](https://radimrehurek.com). diff --git a/README.rst b/README.rst deleted file mode 100644 index c7060131..00000000 --- a/README.rst +++ /dev/null @@ -1,511 +0,0 @@ -====================================================== -smart_open — utils for streaming large files in Python -====================================================== - - -|License|_ |GHA|_ |Coveralls|_ |Downloads|_ - -.. |License| image:: https://img.shields.io/pypi/l/smart_open.svg -.. |GHA| image:: https://github.com/RaRe-Technologies/smart_open/workflows/Test/badge.svg -.. |Coveralls| image:: https://coveralls.io/repos/github/RaRe-Technologies/smart_open/badge.svg?branch=develop -.. |Downloads| image:: https://pepy.tech/badge/smart-open/month -.. _License: https://github.com/RaRe-Technologies/smart_open/blob/master/LICENSE -.. _GHA: https://github.com/RaRe-Technologies/smart_open/actions?query=workflow%3ATest -.. _Coveralls: https://coveralls.io/github/RaRe-Technologies/smart_open?branch=HEAD -.. _Downloads: https://pypi.org/project/smart-open/ - - -What? -===== - -``smart_open`` is a Python 3 library for **efficient streaming of very large files** from/to storages such as S3, GCS, Azure Blob Storage, HDFS, WebHDFS, HTTP, HTTPS, SFTP, or local filesystem. It supports transparent, on-the-fly (de-)compression for a variety of different formats. - -``smart_open`` is a drop-in replacement for Python's built-in ``open()``: it can do anything ``open`` can (100% compatible, falls back to native ``open`` wherever possible), plus lots of nifty extra stuff on top. - -**Python 2.7 is no longer supported. If you need Python 2.7, please use** `smart_open 1.10.1 `_, **the last version to support Python 2.** - -Why? -==== - -Working with large remote files, for example using Amazon's `boto3 `_ Python library, is a pain. -``boto3``'s ``Object.upload_fileobj()`` and ``Object.download_fileobj()`` methods require gotcha-prone boilerplate to use successfully, such as constructing file-like object wrappers. -``smart_open`` shields you from that. It builds on boto3 and other remote storage libraries, but offers a **clean unified Pythonic API**. The result is less code for you to write and fewer bugs to make. - - -How? -===== - -``smart_open`` is well-tested, well-documented, and has a simple Pythonic API: - - -.. _doctools_before_examples: - -.. code-block:: python - - >>> from smart_open import open - >>> - >>> # stream lines from an S3 object - >>> for line in open('s3://commoncrawl/robots.txt'): - ... print(repr(line)) - ... 
break - 'User-Agent: *\n' - - >>> # stream from/to compressed files, with transparent (de)compression: - >>> for line in open('smart_open/tests/test_data/1984.txt.gz', encoding='utf-8'): - ... print(repr(line)) - 'It was a bright cold day in April, and the clocks were striking thirteen.\n' - 'Winston Smith, his chin nuzzled into his breast in an effort to escape the vile\n' - 'wind, slipped quickly through the glass doors of Victory Mansions, though not\n' - 'quickly enough to prevent a swirl of gritty dust from entering along with him.\n' - - >>> # can use context managers too: - >>> with open('smart_open/tests/test_data/1984.txt.gz') as fin: - ... with open('smart_open/tests/test_data/1984.txt.bz2', 'w') as fout: - ... for line in fin: - ... fout.write(line) - 74 - 80 - 78 - 79 - - >>> # can use any IOBase operations, like seek - >>> with open('s3://commoncrawl/robots.txt', 'rb') as fin: - ... for line in fin: - ... print(repr(line.decode('utf-8'))) - ... break - ... offset = fin.seek(0) # seek to the beginning - ... print(fin.read(4)) - 'User-Agent: *\n' - b'User' - - >>> # stream from HTTP - >>> for line in open('http://example.com/index.html'): - ... print(repr(line)) - ... break - '\n' - -.. _doctools_after_examples: - -Other examples of URLs that ``smart_open`` accepts:: - - s3://my_bucket/my_key - s3://my_key:my_secret@my_bucket/my_key - s3://my_key:my_secret@my_server:my_port@my_bucket/my_key - gs://my_bucket/my_blob - azure://my_bucket/my_blob - hdfs:///path/file - hdfs://path/file - webhdfs://host:port/path/file - ./local/path/file - ~/local/path/file - local/path/file - ./local/path/file.gz - file:///home/user/file - file:///home/user/file.bz2 - [ssh|scp|sftp]://username@host//path/file - [ssh|scp|sftp]://username@host/path/file - [ssh|scp|sftp]://username:password@host/path/file - - -Documentation -============= - -Installation ------------- - -``smart_open`` supports a wide range of storage solutions, including AWS S3, Google Cloud and Azure. -Each individual solution has its own dependencies. -By default, ``smart_open`` does not install any dependencies, in order to keep the installation size small. -You can install these dependencies explicitly using:: - - pip install smart_open[azure] # Install Azure deps - pip install smart_open[gcs] # Install GCS deps - pip install smart_open[s3] # Install S3 deps - -Or, if you don't mind installing a large number of third party libraries, you can install all dependencies using:: - - pip install smart_open[all] - -Be warned that this option increases the installation size significantly, e.g. over 100MB. - -If you're upgrading from ``smart_open`` versions 2.x and below, please check out the `Migration Guide `_. - -Built-in help -------------- - -For detailed API info, see the online help: - -.. code-block:: python - - help('smart_open') - -or click `here `__ to view the help in your browser. - -More examples -------------- - -For the sake of simplicity, the examples below assume you have all the dependencies installed, i.e. you have done:: - - pip install smart_open[all] - -.. code-block:: python - - >>> import os, boto3 - >>> from smart_open import open - >>> - >>> # stream content *into* S3 (write mode) using a custom session - >>> session = boto3.Session( - ... aws_access_key_id=os.environ['AWS_ACCESS_KEY_ID'], - ... aws_secret_access_key=os.environ['AWS_SECRET_ACCESS_KEY'], - ... ) - >>> url = 's3://smart-open-py37-benchmark-results/test.txt' - >>> with open(url, 'wb', transport_params={'client': session.client('s3')}) as fout: - ... 
bytes_written = fout.write(b'hello world!') - ... print(bytes_written) - 12 - -.. code-block:: python - - # stream from HDFS - for line in open('hdfs://user/hadoop/my_file.txt', encoding='utf8'): - print(line) - - # stream from WebHDFS - for line in open('webhdfs://host:port/user/hadoop/my_file.txt'): - print(line) - - # stream content *into* HDFS (write mode): - with open('hdfs://host:port/user/hadoop/my_file.txt', 'wb') as fout: - fout.write(b'hello world') - - # stream content *into* WebHDFS (write mode): - with open('webhdfs://host:port/user/hadoop/my_file.txt', 'wb') as fout: - fout.write(b'hello world') - - # stream from a completely custom s3 server, like s3proxy: - for line in open('s3u://user:secret@host:port@mybucket/mykey.txt'): - print(line) - - # Stream to Digital Ocean Spaces bucket providing credentials from boto3 profile - session = boto3.Session(profile_name='digitalocean') - client = session.client('s3', endpoint_url='https://ams3.digitaloceanspaces.com') - transport_params = {'client': client} - with open('s3://bucket/key.txt', 'wb', transport_params=transport_params) as fout: - fout.write(b'here we stand') - - # stream from GCS - for line in open('gs://my_bucket/my_file.txt'): - print(line) - - # stream content *into* GCS (write mode): - with open('gs://my_bucket/my_file.txt', 'wb') as fout: - fout.write(b'hello world') - - # stream from Azure Blob Storage - connect_str = os.environ['AZURE_STORAGE_CONNECTION_STRING'] - transport_params = { - 'client': azure.storage.blob.BlobServiceClient.from_connection_string(connect_str), - } - for line in open('azure://mycontainer/myfile.txt', transport_params=transport_params): - print(line) - - # stream content *into* Azure Blob Storage (write mode): - connect_str = os.environ['AZURE_STORAGE_CONNECTION_STRING'] - transport_params = { - 'client': azure.storage.blob.BlobServiceClient.from_connection_string(connect_str), - } - with open('azure://mycontainer/my_file.txt', 'wb', transport_params=transport_params) as fout: - fout.write(b'hello world') - -Compression Handling --------------------- - -The top-level `compression` parameter controls compression/decompression behavior when reading and writing. -The supported values for this parameter are: - -- ``infer_from_extension`` (default behavior) -- ``disable`` -- ``.gz`` -- ``.bz2`` - -By default, ``smart_open`` determines the compression algorithm to use based on the file extension. - -.. code-block:: python - - >>> from smart_open import open, register_compressor - >>> with open('smart_open/tests/test_data/1984.txt.gz') as fin: - ... print(fin.read(32)) - It was a bright cold day in Apri - -You can override this behavior to either disable compression, or explicitly specify the algorithm to use. -To disable compression: - -.. code-block:: python - - >>> from smart_open import open, register_compressor - >>> with open('smart_open/tests/test_data/1984.txt.gz', 'rb', compression='disable') as fin: - ... print(fin.read(32)) - b'\x1f\x8b\x08\x08\x85F\x94\\\x00\x031984.txt\x005\x8f=r\xc3@\x08\x85{\x9d\xe2\x1d@' - - -To specify the algorithm explicitly (e.g. for non-standard file extensions): - -.. code-block:: python - - >>> from smart_open import open, register_compressor - >>> with open('smart_open/tests/test_data/1984.txt.gzip', compression='.gz') as fin: - ... print(fin.read(32)) - It was a bright cold day in Apri - -You can also easily add support for other file extensions and compression formats. -For example, to open xz-compressed files: - -.. 
code-block:: python - - >>> import lzma, os - >>> from smart_open import open, register_compressor - - >>> def _handle_xz(file_obj, mode): - ... return lzma.LZMAFile(filename=file_obj, mode=mode, format=lzma.FORMAT_XZ) - - >>> register_compressor('.xz', _handle_xz) - - >>> with open('smart_open/tests/test_data/1984.txt.xz') as fin: - ... print(fin.read(32)) - It was a bright cold day in Apri - -``lzma`` is in the standard library in Python 3.3 and greater. -For 2.7, use `backports.lzma`_. - -.. _backports.lzma: https://pypi.org/project/backports.lzma/ - -Transport-specific Options --------------------------- - -``smart_open`` supports a wide range of transport options out of the box, including: - -- S3 -- HTTP, HTTPS (read-only) -- SSH, SCP and SFTP -- WebHDFS -- GCS -- Azure Blob Storage - -Each option involves setting up its own set of parameters. -For example, for accessing S3, you often need to set up authentication, like API keys or a profile name. -``smart_open``'s ``open`` function accepts a keyword argument ``transport_params`` which accepts additional parameters for the transport layer. -Here are some examples of using this parameter: - -.. code-block:: python - - >>> import boto3 - >>> fin = open('s3://commoncrawl/robots.txt', transport_params=dict(client=boto3.client('s3'))) - >>> fin = open('s3://commoncrawl/robots.txt', transport_params=dict(buffer_size=1024)) - -For the full list of keyword arguments supported by each transport option, see the documentation: - -.. code-block:: python - - help('smart_open.open') - -S3 Credentials --------------- - -``smart_open`` uses the ``boto3`` library to talk to S3. -``boto3`` has several `mechanisms `__ for determining the credentials to use. -By default, ``smart_open`` will defer to ``boto3`` and let the latter take care of the credentials. -There are several ways to override this behavior. - -The first is to pass a ``boto3.Client`` object as a transport parameter to the ``open`` function. -You can customize the credentials when constructing the session for the client. -``smart_open`` will then use the session when talking to S3. - -.. code-block:: python - - session = boto3.Session( - aws_access_key_id=ACCESS_KEY, - aws_secret_access_key=SECRET_KEY, - aws_session_token=SESSION_TOKEN, - ) - client = session.client('s3', endpoint_url=..., config=...) - fin = open('s3://bucket/key', transport_params={'client': client}) - -Your second option is to specify the credentials within the S3 URL itself: - -.. code-block:: python - - fin = open('s3://aws_access_key_id:aws_secret_access_key@bucket/key', ...) - -*Important*: The two methods above are **mutually exclusive**. If you pass an AWS client *and* the URL contains credentials, ``smart_open`` will ignore the latter. - -*Important*: ``smart_open`` ignores configuration files from the older ``boto`` library. -Port your old ``boto`` settings to ``boto3`` in order to use them with ``smart_open``. - -S3 Advanced Usage ------------------ - -Additional keyword arguments can be propagated to the boto3 methods that are used by ``smart_open`` under the hood using the ``client_kwargs`` transport parameter. - -For instance, to upload a blob with Metadata, ACL, StorageClass, these keyword arguments can be passed to ``create_multipart_upload`` (`docs `__). - -.. 
code-block:: python - - kwargs = {'Metadata': {'version': 2}, 'ACL': 'authenticated-read', 'StorageClass': 'STANDARD_IA'} - fout = open('s3://bucket/key', 'wb', transport_params={'client_kwargs': {'S3.Client.create_multipart_upload': kwargs}}) - -Iterating Over an S3 Bucket's Contents --------------------------------------- - -Since going over all (or select) keys in an S3 bucket is a very common operation, there's also an extra function ``smart_open.s3.iter_bucket()`` that does this efficiently, **processing the bucket keys in parallel** (using multiprocessing): - -.. code-block:: python - - >>> from smart_open import s3 - >>> # we use workers=1 for reproducibility; you should use as many workers as you have cores - >>> bucket = 'silo-open-data' - >>> prefix = 'Official/annual/monthly_rain/' - >>> for key, content in s3.iter_bucket(bucket, prefix=prefix, accept_key=lambda key: '/201' in key, workers=1, key_limit=3): - ... print(key, round(len(content) / 2**20)) - Official/annual/monthly_rain/2010.monthly_rain.nc 13 - Official/annual/monthly_rain/2011.monthly_rain.nc 13 - Official/annual/monthly_rain/2012.monthly_rain.nc 13 - -GCS Credentials ---------------- -``smart_open`` uses the ``google-cloud-storage`` library to talk to GCS. -``google-cloud-storage`` uses the ``google-cloud`` package under the hood to handle authentication. -There are several `options `__ to provide -credentials. -By default, ``smart_open`` will defer to ``google-cloud-storage`` and let it take care of the credentials. - -To override this behavior, pass a ``google.cloud.storage.Client`` object as a transport parameter to the ``open`` function. -You can `customize the credentials `__ -when constructing the client. ``smart_open`` will then use the client when talking to GCS. To follow allow with -the example below, `refer to Google's guide `__ -to setting up GCS authentication with a service account. - -.. code-block:: python - - import os - from google.cloud.storage import Client - service_account_path = os.environ['GOOGLE_APPLICATION_CREDENTIALS'] - client = Client.from_service_account_json(service_account_path) - fin = open('gs://gcp-public-data-landsat/index.csv.gz', transport_params=dict(client=client)) - -If you need more credential options, you can create an explicit ``google.auth.credentials.Credentials`` object -and pass it to the Client. To create an API token for use in the example below, refer to the -`GCS authentication guide `__. - -.. code-block:: python - - import os - from google.auth.credentials import Credentials - from google.cloud.storage import Client - token = os.environ['GOOGLE_API_TOKEN'] - credentials = Credentials(token=token) - client = Client(credentials=credentials) - fin = open('gs://gcp-public-data-landsat/index.csv.gz', transport_params={'client': client}) - -GCS Advanced Usage ------------------- - -Additional keyword arguments can be propagated to the GCS open method (`docs `__), which is used by ``smart_open`` under the hood, using the ``blob_open_kwargs`` transport parameter. - -Additionally keyword arguments can be propagated to the GCS ``get_blob`` method (`docs `__) when in a read-mode, using the ``get_blob_kwargs`` transport parameter. - -Additional blob properties (`docs `__) can be set before an upload, as long as they are not read-only, using the ``blob_properties`` transport parameter. - -.. 
code-block:: python - - open_kwargs = {'predefined_acl': 'authenticated-read'} - properties = {'metadata': {'version': 2}, 'storage_class': 'COLDLINE'} - fout = open('gs://bucket/key', 'wb', transport_params={'blob_open_kwargs': open_kwargs, 'blob_properties': properties}) - -Azure Credentials ------------------ - -``smart_open`` uses the ``azure-storage-blob`` library to talk to Azure Blob Storage. -By default, ``smart_open`` will defer to ``azure-storage-blob`` and let it take care of the credentials. - -Azure Blob Storage does not have any ways of inferring credentials therefore, passing a ``azure.storage.blob.BlobServiceClient`` -object as a transport parameter to the ``open`` function is required. -You can `customize the credentials `__ -when constructing the client. ``smart_open`` will then use the client when talking to. To follow allow with -the example below, `refer to Azure's guide `__ -to setting up authentication. - -.. code-block:: python - - import os - from azure.storage.blob import BlobServiceClient - azure_storage_connection_string = os.environ['AZURE_STORAGE_CONNECTION_STRING'] - client = BlobServiceClient.from_connection_string(azure_storage_connection_string) - fin = open('azure://my_container/my_blob.txt', transport_params={'client': client}) - -If you need more credential options, refer to the -`Azure Storage authentication guide `__. - -Azure Advanced Usage --------------------- - -Additional keyword arguments can be propagated to the ``commit_block_list`` method (`docs `__), which is used by ``smart_open`` under the hood for uploads, using the ``blob_kwargs`` transport parameter. - -.. code-block:: python - - kwargs = {'metadata': {'version': 2}} - fout = open('azure://container/key', 'wb', transport_params={'blob_kwargs': kwargs}) - -Drop-in replacement of ``pathlib.Path.open`` --------------------------------------------- - -``smart_open.open`` can also be used with ``Path`` objects. -The built-in `Path.open()` is not able to read text from compressed files, so use ``patch_pathlib`` to replace it with `smart_open.open()` instead. -This can be helpful when e.g. working with compressed files. - -.. code-block:: python - - >>> from pathlib import Path - >>> from smart_open.smart_open_lib import patch_pathlib - >>> - >>> _ = patch_pathlib() # replace `Path.open` with `smart_open.open` - >>> - >>> path = Path("smart_open/tests/test_data/crime-and-punishment.txt.gz") - >>> - >>> with path.open("r") as infile: - ... print(infile.readline()[:41]) - В начале июля, в чрезвычайно жаркое время - -How do I ...? -============= - -See `this document `__. - -Extending ``smart_open`` -======================== - -See `this document `__. - -Testing ``smart_open`` -====================== - -``smart_open`` comes with a comprehensive suite of unit tests. -Before you can run the test suite, install the test dependencies:: - - pip install -e .[test] - -Now, you can run the unit tests:: - - pytest smart_open - -The tests are also run automatically with `Travis CI `_ on every commit push & pull request. - -Comments, bug reports -===================== - -``smart_open`` lives on `Github `_. You can file -issues or pull requests there. Suggestions, pull requests and improvements welcome! - ----------------- - -``smart_open`` is open source software released under the `MIT license `_. -Copyright (c) 2015-now `Radim Řehůřek `_.