四、使用图像、音频和其他资源

在本章中，我们将介绍：

在网上下载媒体内容
使用 urllib 解析 URL 以获取文件名
确定 URL 的内容类型
从内容类型确定文件扩展名
下载图像并将其保存到本地文件系统
下载图像并将其保存到 S3
为图像生成缩略图
使用 Selenium 拍摄网站截图
使用外部服务拍摄网站截图
使用 pytessaract 对图像执行 OCR
创建视频缩略图
将 MP4 视频翻录成 MP3

介绍

刮片的常见做法是下载、存储和进一步处理媒体内容（非网页或数据文件）。这种媒体可以包括图像、音频和视频。要将内容存储在本地（或存储在像 S3 这样的服务中）并正确执行，我们需要知道媒体的类型，仅仅信任 URL 中的文件扩展名是不够的。我们将学习如何根据 web 服务器上的信息下载并正确表示媒体类型。

另一项常见任务是生成图像、视频甚至网站页面的缩略图。我们将研究几种如何生成缩略图和制作网站页面截图的技术。很多时候，这些技术在新网站上用作指向当前存储在本地的已刮媒体的缩略图链接。

最后，通常需要能够对媒体进行转码，例如将非 MP4 视频转换为 MP4，或更改视频的比特率或分辨率。另一种情况是仅从视频文件中提取音频。我们不看视频转码，但我们将使用ffmpeg从 MP4 文件中提取 MP3 音频。这是一个简单的步骤，从那里也转码视频与ffmpeg。

从 web 下载媒体内容

从 web 下载媒体内容是一个简单的过程：使用请求或其他库并像下载 HTML 内容一样下载它。

准备

在解决方案的util文件夹中的urls.py mdoule中有一个名为URLUtility的类。此类通过下载和解析 URL 处理本章中的几个场景。我们将在这个菜谱和其他一些菜谱中使用这个类。确保modules文件夹位于 Python 路径中。此外，此配方的示例位于04/01_download_image.py文件中

怎么做

以下是我们如何继续制作配方：

URLUtility类可以从 URL 下载内容。配方文件中的代码如下所示：

import const
from util.urls import URLUtility

util = URLUtility(const.ApodEclipseImage())
print(len(util.data))

运行此命令时，您将看到以下输出：

Reading URL: https://apod.nasa.gov/apod/image/1709/BT5643s.jpg
Read 171014 bytes
171014

该示例读取171014字节的数据。

工作原理

URL 在const模块中定义为常量const.ApodEclipseImage()：

def ApodEclipseImage():
    return "https://apod.nasa.gov/apod/image/1709/BT5643s.jpg"

URLUtility类的构造函数有以下实现：

def __init__(self, url, readNow=True):
    """ Construct the object, parse the URL, and download now if specified"""
    self._url = url
    self._response = None
    self._parsed = urlparse(url)
    if readNow:
        self.read()

构造函数存储 URL，对其进行解析，并使用read()方法下载文件。以下是read()方法的代码：

def read(self):
    self._response = urllib.request.urlopen(self._url)
    self._data = self._response.read()

此函数使用urlopen获取响应对象，然后读取流并将其存储为对象的属性。然后可以使用数据属性检索该数据：

@property
def data(self):
    self.ensure_response()
    return self._data

然后，代码简单地报告该数据的长度，值为171014。

还有更多。。。

此类将用于其他任务，例如确定这些文件的内容类型、文件名和扩展名。接下来，我们将检查文件名的 URL 解析。

使用 urllib 解析 URL 以获取文件名

从 URL 下载内容时，我们通常希望将其保存在文件中。通常，将文件保存在 URL 中有名称的文件中就足够了。但是 URL 由许多片段组成，因此我们如何从 URL 中找到实际的文件名，尤其是在文件名后面通常有许多参数的情况下？

准备

我们将再次使用URLUtility类完成此任务。配方的代码文件为04/02_parse_url.py。

怎么做

使用 python 解释器执行配方文件。它将运行以下代码：

util = URLUtility(const.ApodEclipseImage())
print(util.filename_without_ext)

这将产生以下输出：

Reading URL: https://apod.nasa.gov/apod/image/1709/BT5643s.jpg
Read 171014 bytes
The filename is: BT5643s

工作原理

在URLUtility的构造函数中，有一个对urlib.parse.urlparse的调用。下面演示如何以交互方式使用该函数：

>>> parsed = urlparse(const.ApodEclipseImage())
>>> parsed
ParseResult(scheme='https', netloc='apod.nasa.gov', path='/apod/image/1709/BT5643s.jpg', params='', query='', fragment='')

ParseResult对象包含 URL 的各种组件。path 元素包含路径和文件名。对.filename_without_ext属性的调用只返回文件名，不返回扩展名：

@property
def filename_without_ext(self):
    filename = os.path.splitext(os.path.basename(self._parsed.path))[0]
    return filename

对os.path.basename的调用只返回路径的文件名部分（包括扩展名）。os.path.splittext()然后分离文件名和扩展名，函数返回该元组/列表的第一个元素（文件名）。

还有更多。。。

这可能看起来很奇怪，因为它没有将扩展名作为文件名的一部分返回。这是因为我们不能假设我们收到的内容实际上与扩展中的隐含类型匹配。使用 web 服务器返回的头来确定这一点更为准确。这是我们的下一个食谱。

确定 URL 的内容类型

当从 web 服务器执行GET内容请求时，web 服务器将返回多个标题，其中一个标题从 web 服务器的角度标识内容的类型。在本配方中，我们学习使用该标题确定 web 服务器认为内容的类型。

准备

我们再次使用URLUtility类。配方的代码在04/03_determine_content_type_from_response.py中。

怎么做

我们的工作如下：

执行配方的脚本。它包含以下代码：

util = URLUtility(const.ApodEclipseImage())
print("The content type is: " + util.contenttype)

结果如下：

Reading URL: https://apod.nasa.gov/apod/image/1709/BT5643s.jpg
Read 171014 bytes
The content type is: image/jpeg

工作原理

.contentype属性实现如下：

@property
def contenttype(self):
    self.ensure_response()
    return self._response.headers['content-type']

_response对象的.headers属性是一个类似字典的头类。content-type键将检索服务器指定的content-type。对ensure_response()方法的调用只是确保.read()函数已经执行。

还有更多。。。

响应中的标题包含大量信息。如果我们更仔细地查看响应的headers属性，我们可以看到返回以下标头：

>>> response = urllib.request.urlopen(const.ApodEclipseImage())
>>> for header in response.headers: print(header)
Date
Server
Last-Modified
ETag
Accept-Ranges
Content-Length
Connection
Content-Type
Strict-Transport-Security

我们可以看到每个标题的值。

>>> for header in response.headers: print(header + " ==> " + response.headers[header])
Date ==> Tue, 26 Sep 2017 19:31:41 GMT
Server ==> WebServer/1.0
Last-Modified ==> Thu, 31 Aug 2017 20:26:32 GMT
ETag ==> "547bb44-29c06-5581275ce2b86"
Accept-Ranges ==> bytes
Content-Length ==> 171014
Connection ==> close
Content-Type ==> image/jpeg
Strict-Transport-Security ==> max-age=31536000; includeSubDomains

在本书中，我们将不探讨其中的许多问题，但对于不熟悉的人来说，知道它们的存在是件好事。

从内容类型确定文件扩展名

最好使用content-type头来确定内容的类型，并确定将内容存储为文件的扩展名。

准备

我们再次使用我们创建的URLUtility对象。食谱的脚本是04/04_determine_file_extension_from_contenttype.py):。

怎么做

继续运行配方脚本。

可以使用.extension属性找到媒体类型的扩展：

util = URLUtility(const.ApodEclipseImage())
print("Filename from content-type: " + util.extension_from_contenttype)
print("Filename from url: " + util.extension_from_url)

这将产生以下输出：

Reading URL: https://apod.nasa.gov/apod/image/1709/BT5643s.jpg
Read 171014 bytes
Filename from content-type: .jpg
Filename from url: .jpg

这将报告根据文件类型和 URL 确定的扩展名。这些可能不同，但在这种情况下，它们是相同的。

工作原理

以下是.extension_from_contenttype属性的实现：

@property
def extension_from_contenttype(self):
    self.ensure_response()

    map = const.ContentTypeToExtensions()
    if self.contenttype in map:
        return map[self.contenttype]
    return None

第一行确保我们已从 URL 读取响应。然后，该函数使用在const模块中定义的 python 字典，其中包含要扩展的内容类型字典：

def ContentTypeToExtensions():
    return {
        "image/jpeg": ".jpg",
        "image/jpg": ".jpg",
        "image/png": ".png"
    }

如果内容类型在字典中，则将返回相应的值。否则返回None。

注意对应的属性.extension_from_url：

@property
def extension_from_url(self):
    ext = os.path.splitext(os.path.basename(self._parsed.path))[1]
    return ext

它使用与.filename属性相同的技术来解析 URL，但返回[1]元素，该元素表示扩展名而不是基本文件名。

还有更多。。。

如上所述，最好使用content-type头来确定本地存储文件的扩展名。除了这里提供的技术之外，还有其他技术，但这是最简单的。

下载图像并将其保存到本地文件系统

有时候，在抓取数据时，我们只是下载并解析数据，比如 HTML，以提取一些数据，然后扔掉我们读到的内容。其他时候，我们希望通过将下载的内容存储为文件来保存它。

怎么做

此配方的代码示例位于04/05_save_image_as_file.py文件中。文件的重要部分是：

# download the image
item = URLUtility(const.ApodEclipseImage())

# create a file writer to write the data
FileBlobWriter(expanduser("~")).write(item.filename, item.data)

使用 Python 解释器运行脚本，您将获得以下输出：

Reading URL: https://apod.nasa.gov/apod/image/1709/BT5643s.jpg
Read 171014 bytes
Attempting to write 171014 bytes to BT5643s.jpg:
The write was successful

工作原理

该示例仅使用标准 Python 文件访问函数将数据写入文件。它采用面向对象的方式，使用标准接口写入数据，并在FileBlobWriter类中使用基于文件的实现：

""" Implements the IBlobWriter interface to write the blob to a file """

from interface import implements
from core.i_blob_writer import IBlobWriter

class FileBlobWriter(implements(IBlobWriter)):
    def __init__(self, location):
        self._location = location

    def write(self, filename, contents):
        full_filename = self._location + "/" + filename
        print ("Attempting to write {0} bytes to {1}:".format(len(contents), filename))

        with open(full_filename, 'wb') as outfile:
            outfile.write(contents)

        print("The write was successful")

向类传递一个字符串，该字符串表示文件应放置的目录。数据实际上是在稍后调用.write()方法时写入的。此方法合并文件名和directory (_location)，然后打开/创建文件并写入字节。with语句确保文件已关闭。

还有更多。。。

这个写操作可以简单地使用一个包装代码的函数来处理。这个对象将在本章中重复使用。我们可以使用 python 的 duck 类型，或者只是一个函数，但是接口的清晰性更容易。说到这里，这个接口的定义如下：

""" Defines the interface for writing a blob of data to storage """

from interface import Interface

class IBlobWriter(Interface):
   def write(self, filename, contents):
      pass

我们还将看到这个接口的另一个实现，它允许我们在 S3 中存储文件。通过这种类型的实现，通过接口继承，我们可以轻松地替换实现。

下载图像并将其保存到 S3

我们已经在第 3 章处理数据中看到了如何将内容写入 S3。在这里，我们将把这个过程扩展到 IBlobWriter 的接口实现中，以写入 S3。

准备

此配方的代码示例在04/06_save_image_in_s3.py文件中。同时确保您已将 AWS 密钥设置为环境变量，以便 Boto 可以验证脚本。

怎么做

我们的工作如下：

运行配方脚本。它将执行以下操作：

# download the image
item = URLUtility(const.ApodEclipseImage())

# store it in S3
S3BlobWriter(bucket_name="scraping-apod").write(item.filename, item.data)

在 S3 中检查时，我们可以看到 bucket 已创建，并且图像已放置在 bucket 中：

The Image in S3

工作原理

以下是S3BlobWriter的执行情况：

class S3BlobWriter(implements(IBlobWriter)):
    def __init__(self, bucket_name, boto_client=None):
        self._bucket_name = bucket_name

        if self._bucket_name is None:
            self.bucket_name = "/"

        # caller can specify a boto client (can reuse and save auth times)
        self._boto_client = boto_client

        # or create a boto client if user did not, use secrets from environment variables
        if self._boto_client is None:
            self._boto_client = boto3.client('s3')

    def write(self, filename, contents):
        # create bucket, and put the object
        self._boto_client.create_bucket(Bucket=self._bucket_name, ACL='public-read')
        self._boto_client.put_object(Bucket=self._bucket_name,
                                     Key=filename,
                                     Body=contents,
                                     ACL="public-read")

我们以前在编写 S3 的方法中见过这段代码。这个类将其巧妙地包装到一个可重用的接口实现中。创建实例时，指定 bucket 名称，然后每次调用.write()都会保存在同一个 bucket 中。

还有更多。。。

S3 在 Bucket 上提供了一种称为启用网站的功能。本质上，如果您设置了这个选项，那么 bucket 中的内容将通过 HTTP 提供服务。我们可以将许多图像写入这个目录，然后直接从 S3 提供它们，而无需实现 web 服务器！

为图像生成缩略图

很多时候，在下载图像时，您不希望保存完整图像，而只希望保存缩略图。或者您也可以同时保存全尺寸图像和缩略图。可以使用枕头库在 python 中轻松创建缩略图。Pillow 是 Python 图像库的一个分支，包含许多用于处理图像的有用函数。有关枕头的更多信息，请访问https://python-pillow.org 。在此配方中，我们使用枕头创建图像缩略图。

准备

这个食谱的脚本是04/07_create_image_thumbnail.py。它使用枕头库，因此请确保您已使用 pip 或其他软件包管理工具将枕头安装到您的环境中：

pip install pillow

怎么做

以下是如何继续制作配方：

运行配方的脚本。它将执行以下代码：

from os.path import expanduser
import const
from core.file_blob_writer import FileBlobWriter
from core.image_thumbnail_generator import ImageThumbnailGenerator
from util.urls import URLUtility

# download the image and get the bytes
img_data = URLUtility(const.ApodEclipseImage()).data

# we will store this in our home folder
fw = FileBlobWriter(expanduser("~"))

# Create a thumbnail generator and scale the image
tg = ImageThumbnailGenerator(img_data).scale(200, 200)

# write the image to a file
fw.write("eclipse_thumbnail.png", tg.bytes)

由此产生的结果将是一个名为eclipse_thumbnail.png的文件写入您的主目录。

The Thumbnail we Created

枕头保持宽度和高度的比例一致。

工作原理

ImageThumbnailGenerator类包装了对 Pillow 的调用，以提供一个非常简单的 API 来创建图像的缩略图：

import io
from PIL import Image

class ImageThumbnailGenerator():
    def __init__(self, bytes):
        # Create a pillow image with the data provided
        self._image = Image.open(io.BytesIO(bytes))

    def scale(self, width, height):
        # call the thumbnail method to create the thumbnail
        self._image.thumbnail((width, height))
        return self

    @property
    def bytes(self):
        # returns the bytes of the pillow image

        # save the image to an in memory objects
        bytesio = io.BytesIO()
        self._image.save(bytesio, format="png")

        # set the position on the stream to 0 and return the underlying data
        bytesio.seek(0)
        return bytesio.getvalue()

构造函数被传递图像的数据，并从该数据创建一个枕头图像对象。通过使用表示所需缩略图大小的元组调用.thumbnail()来创建缩略图。这将调整现有图像的大小，并保留纵横比。它将确定图像的长边，并将其缩放到表示该轴的元组中的值。此图像的高度大于宽度，因此缩略图的高度为 200 像素，宽度也相应地进行了缩放（在本例中为 160 像素）。

拍摄网站截图

一个常见的抓取任务是创建网站的屏幕截图。在 Python 中，我们可以使用 selenium 和 webdriver 创建缩略图。

准备

这个食谱的脚本是04/08_create_website_screenshot.py。另外，请确保路径中有 selenium，并且已安装 Python 库。

怎么做

运行配方的脚本。脚本中的代码如下所示：

from core.website_screenshot_generator import  WebsiteScreenshotGenerator
from core.file_blob_writer import FileBlobWriter
from os.path import expanduser

# get the screenshot
image_bytes = WebsiteScreenshotGenerator().capture("http://espn.go.com", 500, 500).image_bytes

# save it to a file
FileBlobWriter(expanduser("~")).write("website_screenshot.png", image_bytes)

创建一个WebsiteScreenshotGenerator对象，然后调用它的捕获方法，传递要捕获的网站 URL 和图像所需的像素宽度。

这将创建一个枕头图像，可以使用.image属性访问该图像，并且可以使用.image_bytes直接访问该图像的字节。此脚本获取这些字节并将它们写入主目录中的website_screenshot.png文件。

您将看到此脚本的以下输出：

Connected to pydev debugger (build 162.1967.10)
Capturing website screenshot of: http://espn.go.com
Got a screenshot with the following dimensions: (500, 7416)
Cropped the image to: 500 500
Attempting to write 217054 bytes to website_screenshot.png:
The write was successful

我们得到的图像如下（图像的内容会有所不同）：

The Screenshot of the Web Page

工作原理

以下是WebsiteScreenshotGenerator类的代码：

class WebsiteScreenshotGenerator():
    def __init__(self):
        self._screenshot = None

    def capture(self, url, width, height, crop=True):
        print ("Capturing website screenshot of: " + url)
        driver = webdriver.PhantomJS()

        if width and height:
            driver.set_window_size(width, height)

        # go and get the content at the url
        driver.get(url)

        # get the screenshot and make it into a Pillow Image
        self._screenshot = Image.open(io.BytesIO(driver.get_screenshot_as_png()))
        print("Got a screenshot with the following dimensions: {0}".format(self._screenshot.size))

        if crop:
            # crop the image
            self._screenshot = self._screenshot.crop((0,0, width, height))
            print("Cropped the image to: {0} {1}".format(width, height))

        return self

    @property
    def image(self):
        return self._screenshot

    @property
    def image_bytes(self):
        bytesio = io.BytesIO()
        self._screenshot.save(bytesio, "PNG")
        bytesio.seek(0)
        return bytesio.getvalue()

对driver.get_screenshot_as_png()的呼叫完成了繁重的工作。它将页面呈现为 PNG 格式的图像，并返回图像的字节。然后将该数据转换为枕头图像对象。

请注意，在输出中，webdriver 返回的图像高度是 7416 像素，而不是我们指定的 500 像素。PhantomJS 渲染器将尝试处理无限滚动的网站，并且通常不会将屏幕截图限制在给定窗口的高度。

要使屏幕截图实际达到指定高度，请将裁剪参数设置为True（默认值）。然后，该代码将使用枕头图像的裁剪方法来设置所需的高度。如果使用crop=False运行此代码，则结果将是高度为 7416 像素的图像。

拍摄带有外部服务的网站的屏幕截图

前面的配方使用 selenium、webdriver 和 PhantomJS 创建屏幕截图。这显然需要安装这些软件包。如果您不想安装这些软件，但仍然想制作网站截图，那么您可以使用多种可以截图的 web 服务之一。在此配方中，我们将使用www.screenshotapi.io上的服务创建一个屏幕截图。

准备

首先，前往www.screenshotapi.io并注册一个免费帐户：

Screenshot of the free account sign up

创建帐户后，继续获取 API 密钥。这将需要对其服务进行身份验证：

The API Key

怎么做

本例的脚本为04/09_screenshotapi.py。运行一下，它将生成一个屏幕截图。代码如下，在结构上与前面的配方非常相似：

from core.website_screenshot_with_screenshotapi import WebsiteScreenshotGenerator
from core.file_blob_writer import FileBlobWriter
from os.path import expanduser

# get the screenshot
image_bytes = WebsiteScreenshotGenerator("bd17a1e1-db43-4686-9f9b-b72b67a5535e")\
    .capture("http://espn.go.com", 500, 500).image_bytes

# save it to a file
FileBlobWriter(expanduser("~")).write("website_screenshot.png", image_bytes)

与前一个配方的功能区别在于我们使用了不同的WebsiteScreenshotGenerator实现。这个来自core.website_screenshot_with_screenshotapi模块。

运行时，以下内容将输出到控制台：

Sending request: http://espn.go.com
{"status":"ready","key":"2e9a40b86c95f50ad3f70613798828a8","apiCreditsCost":1}
The image key is: 2e9a40b86c95f50ad3f70613798828a8
Trying to retrieve: https://api.screenshotapi.io/retrieve
Downloading image: https://screenshotapi.s3.amazonaws.com/captures/2e9a40b86c95f50ad3f70613798828a8.png
Saving screenshot to: downloaded_screenshot.png2e9a40b86c95f50ad3f70613798828a8
Cropped the image to: 500 500
Attempting to write 209197 bytes to website_screenshot.png:
The write was successful

并给出以下图像：

The Website Screenshot from screenshotapi.io

工作原理

以下是本WebsiteScreenshotGenerator的代码：

class WebsiteScreenshotGenerator:
    def __init__(self, apikey):
        self._screenshot = None
        self._apikey = apikey

    def capture(self, url, width, height, crop=True):
        key = self.beginCapture(url, "{0}x{1}".format(width, height), "true", "firefox", "true")

        print("The image key is: " + key)

        timeout = 30
        tCounter = 0
        tCountIncr = 3

        while True:
            result = self.tryRetrieve(key)
            if result["success"]:
                print("Saving screenshot to: downloaded_screenshot.png" + key)

                bytes=result["bytes"]
                self._screenshot = Image.open(io.BytesIO(bytes))

                if crop:
                    # crop the image
                    self._screenshot = self._screenshot.crop((0, 0, width, height))
                    print("Cropped the image to: {0} {1}".format(width, height))
                break

            tCounter += tCountIncr
            print("Screenshot not yet ready.. waiting for: " + str(tCountIncr) + " seconds.")
            time.sleep(tCountIncr)
            if tCounter > timeout:
                print("Timed out while downloading: " + key)
                break
        return self

    def beginCapture(self, url, viewport, fullpage, webdriver, javascript):
        serverUrl = "https://api.screenshotapi.io/capture"
        print('Sending request: ' + url)
        headers = {'apikey': self._apikey}
        params = {'url': urllib.parse.unquote(url).encode('utf8'), 'viewport': viewport, 'fullpage': fullpage,
                  'webdriver': webdriver, 'javascript': javascript}
        result = requests.post(serverUrl, data=params, headers=headers)
        print(result.text)
        json_results = json.loads(result.text)
        return json_results['key']

    def tryRetrieve(self, key):
        url = 'https://api.screenshotapi.io/retrieve'
        headers = {'apikey': self._apikey}
        params = {'key': key}
        print('Trying to retrieve: ' + url)
        result = requests.get(url, params=params, headers=headers)

        json_results = json.loads(result.text)
        if json_results["status"] == "ready":
            print('Downloading image: ' + json_results["imageUrl"])
            image_result = requests.get(json_results["imageUrl"])
            return {'success': True, 'bytes': image_result.content}
        else:
            return {'success': False}

    @property
    def image(self):
        return self._screenshot

    @property
    def image_bytes(self):
        bytesio = io.BytesIO()
        self._screenshot.save(bytesio, "PNG")
        bytesio.seek(0)
        return bytesio.getvalue()

screenshotapi.ioAPI 是一个 REST API。有两个不同的端点：

https://api.screenshotapi.io/capture
https://api.screenshotapi.io/retrieve

第一个端点被调用，并将 URL 和其他参数传递给它们的服务。成功执行后，此 API 将返回一个密钥，可在另一个端点上使用该密钥检索图像。屏幕截图是异步执行的，我们需要使用从捕获端点返回的键不断调用retrieveAPI。当屏幕截图完成时，该端点将返回状态值ready。代码只是循环，直到设置完毕、发生错误或代码超时为止。

当快照可用时，API 在retrieve响应中返回图像的 URL。然后，代码检索该图像，并根据接收到的数据构造一个枕头图像对象。

还有更多。。。

screenshotapi.ioAPI 有许多有用的参数。其中一些允许您调整要使用的浏览器引擎（Firefox、Chrome 或 PhantomJS）、设备仿真以及是否在网页中执行 JavaScript。有关这些选项和 API 的更多详细信息，请访问http://docs.screenshotapi.io/rest-api/ 。

使用 pytesseract 对图像执行 OCR

可以使用 pytesseract 库从图像中提取文本。在此配方中，我们将使用 pytesseract 从图像中提取文本。Tesseract 是由谷歌赞助的开源 OCR 库。来源可在此处找到：https://github.com/tesseract-ocr/tesseract ，您也可以在那里找到更多关于图书馆的信息。0;PyteSeract 是一个精简的 python 包装器，为可执行文件提供 python API。

准备

确保已安装 PyteSeract：

pip install pytesseract

您还需要安装 tesseract ocr。在 Windows 上，有一个可执行的安装程序，您可以在这里获得：https://github.com/tesseract-ocr/tesseract/wiki/4.0-with-LSTM#400-alpha-for-windows。在 Linux 系统上，您可以使用apt-get：

sudo apt-get tesseract-ocr

Mac 上最简单的安装方法是使用 brew：

brew install tesseract

此配方的代码在04/10_perform_ocr.py中。

怎么做

执行配方的脚本。脚本非常简单：

import pytesseract as pt
from PIL import Image

img = Image.open("textinimage.png")
text = pt.image_to_string(img)
print(text)

将要处理的图像如下所示：

The Image we will OCR

脚本将提供以下输出：

This is an image containing text.
And some numbers 123456789

And also special characters: !@#$%"&*(_+

工作原理

图像首先作为枕头图像对象加载。我们可以直接将这个对象传递给 pytesseractimage_to_string()函数。该函数在图像上运行 tesseract 并返回找到的文本。

还有更多。。。

在抓取应用程序中使用 OCR 的主要目的之一是解决基于文本的验证码。我们不会讨论验证码解决方案，因为它们可能很麻烦，并且也会记录在其他 Packt 标题中。

创建视频缩略图

您可能希望为从网站下载的视频创建缩略图。可以在显示大量视频缩略图的页面上使用这些缩略图，并允许您单击它们以观看特定视频

准备

此示例将使用名为 ffmpeg 的工具。ffmpeg 可从 www.ffmpeg.org 获得。请按照操作系统的说明下载并安装。

怎么做

示例脚本位于04/11_create_video_thumbnail.py中，由以下代码组成：

import subprocess
video_file = 'BigBuckBunny.mp4'
thumbnail_file = 'thumbnail.jpg'
subprocess.call(['ffmpeg', '-i', video_file, '-ss', '00:01:03.000', '-vframes', '1', thumbnail_file, "-y"])

运行时，您将看到 ffmpeg 的输出：

 built with Apple LLVM version 8.1.0 (clang-802.0.42)
 configuration: --prefix=/usr/local/Cellar/ffmpeg/3.3.4 --enable-shared --enable-pthreads --enable-gpl --enable-version3 --enable-hardcoded-tables --enable-avresample --cc=clang --host-cflags= --host-ldflags= --enable-libmp3lame --enable-libx264 --enable-libxvid --enable-opencl --enable-videotoolbox --disable-lzma --enable-vda
 libavutil 55\. 58.100 / 55\. 58.100
 libavcodec 57\. 89.100 / 57\. 89.100
 libavformat 57\. 71.100 / 57\. 71.100
 libavdevice 57\. 6.100 / 57\. 6.100
 libavfilter 6\. 82.100 / 6\. 82.100
 libavresample 3\. 5\. 0 / 3\. 5\. 0
 libswscale 4\. 6.100 / 4\. 6.100
 libswresample 2\. 7.100 / 2\. 7.100
 libpostproc 54\. 5.100 / 54\. 5.100
Input #0, mov,mp4,m4a,3gp,3g2,mj2, from 'BigBuckBunny.mp4':
 Metadata:
 major_brand : isom
 minor_version : 512
 compatible_brands: mp41
 creation_time : 1970-01-01T00:00:00.000000Z
 title : Big Buck Bunny
 artist : Blender Foundation
 composer : Blender Foundation
 date : 2008
 encoder : Lavf52.14.0
 Duration: 00:09:56.46, start: 0.000000, bitrate: 867 kb/s
 Stream #0:0(und): Video: h264 (Constrained Baseline) (avc1 / 0x31637661), yuv420p, 320x180 [SAR 1:1 DAR 16:9], 702 kb/s, 24 fps, 24 tbr, 24 tbn, 48 tbc (default)
 Metadata:
 creation_time : 1970-01-01T00:00:00.000000Z
 handler_name : VideoHandler
 Stream #0:1(und): Audio: aac (LC) (mp4a / 0x6134706D), 48000 Hz, stereo, fltp, 159 kb/s (default)
 Metadata:
 creation_time : 1970-01-01T00:00:00.000000Z
 handler_name : SoundHandler
Stream mapping:
 Stream #0:0 -> #0:0 (h264 (native) -> mjpeg (native))
Press [q] to stop, [?] for help
[swscaler @ 0x7fb50b103000] deprecated pixel format used, make sure you did set range correctly
Output #0, image2, to 'thumbnail.jpg':
 Metadata:
 major_brand : isom
 minor_version : 512
 compatible_brands: mp41
 date : 2008
 title : Big Buck Bunny
 artist : Blender Foundation
 composer : Blender Foundation
 encoder : Lavf57.71.100
 Stream #0:0(und): Video: mjpeg, yuvj420p(pc), 320x180 [SAR 1:1 DAR 16:9], q=2-31, 200 kb/s, 24 fps, 24 tbn, 24 tbc (default)
 Metadata:
 creation_time : 1970-01-01T00:00:00.000000Z
 handler_name : VideoHandler
 encoder : Lavc57.89.100 mjpeg
 Side data:
 cpb: bitrate max/min/avg: 0/0/200000 buffer size: 0 vbv_delay: -1
frame= 1 fps=0.0 q=4.0 Lsize=N/A time=00:00:00.04 bitrate=N/A speed=0.151x 
video:8kB audio:0kB subtitle:0kB other streams:0kB global headers:0kB muxing overhead: unknown

输出的 JPG 文件将是以下 JPG 图像：

The Thumbnail Created from the Video

工作原理

.ffmpeg文件实际上是一个可执行文件。该代码作为子进程执行以下 ffmpeg 命令：

ffmpeg -i BigBuckBunny.mp4 -ss 00:01:03.000 -frames:v 1 thumbnail.jpg -y

输入文件是BigBuckBunny.mp4。-ss选项通知我们要检查视频的位置。-frames:v表示我们要提取一帧。最后我们告诉ffmpeg将该帧写入thumbnail.jpg。-y确认覆盖现有文件。

还有。。

ffmpeg 是一个功能强大的工具。我曾经创建过一个刮板，可以抓取和查找媒体（实际上是在网站上播放的商业广告），然后，scraper 将通过一个消息队列发送一条消息，该消息队列将由服务器群接收，服务器群的唯一任务是运行 ffmpeg 将视频转换为多种不同的格式、比特率，并创建缩略图。从那时起，更多的信息将被发送到 auditor，以使用前端应用程序检查内容是否符合广告合同条款。了解 ffmeg，它是一个很棒的工具。

将 MP4 视频翻录成 MP3

现在，让我们来研究如何将 MP4 视频中的音频撕成 MP3 文件。您这样做的原因可能包括希望随身携带视频的音频（可能是音乐视频），或者您正在构建一个刮片/媒体收集系统，该系统还要求音频与视频分开。

此任务可以使用moviepy库完成。moviepy是一个整洁的库，可以让您对视频进行各种有趣的处理。其中一个功能是将音频提取为 MP3。

准备

确保您的环境中安装了 moviepy：

pip install moviepy

我们还需要安装 ffmpeg，这是我们在上一个配方中使用的，因此您应该能够很好地满足这一要求。

怎么做

演示翻录成 MP3 的代码在04/12_rip_mp3_from_mp4.py中。moviepy使这个过程非常简单。

以下是前一配方中下载的 MP4：

import moviepy.editor as mp
clip = mp.VideoFileClip("BigBuckBunny.mp4")
clip.audio.write_audiofile("movie_audio.mp3")

当运行此命令时，您将看到文件被撕开时的输出，如下面所示。这只需要几秒钟：

[MoviePy] Writing audio in movie_audio.mp3
100%|██████████| 17820/17820 [00:16<00:00, 1081.67it/s]
[MoviePy] Done.

完成后，您将拥有一个 MP3 文件：

# ls -l *.mp3
-rw-r--r--@ 1 michaelheydt  staff  12931074 Sep 27 21:44 movie_audio.mp3

还有更多。。。

有关 moviepy 的更多信息，请访问项目网站http://zulko.github.io/moviepy/ 。

Files

04.md

Latest commit

History

04.md

File metadata and controls

四、使用图像、音频和其他资源

介绍

从 web 下载媒体内容

准备

怎么做

工作原理

还有更多。。。

使用 urllib 解析 URL 以获取文件名

准备

怎么做

工作原理

还有更多。。。

确定 URL 的内容类型

准备

怎么做

工作原理

还有更多。。。

从内容类型确定文件扩展名

准备

怎么做

工作原理

还有更多。。。

下载图像并将其保存到本地文件系统

怎么做

工作原理

还有更多。。。

下载图像并将其保存到 S3

准备

怎么做

工作原理

还有更多。。。

为图像生成缩略图

准备

怎么做

工作原理

拍摄网站截图

准备

怎么做

工作原理

拍摄带有外部服务的网站的屏幕截图

准备

怎么做

工作原理

还有更多。。。

使用 pytesseract 对图像执行 OCR

准备

怎么做

工作原理

还有更多。。。

创建视频缩略图

准备

怎么做

工作原理

还有。。

将 MP4 视频翻录成 MP3

准备

怎么做

还有更多。。。