Add Gmail takeout mbox import (v2) #8

Open
wants to merge 21 commits into base: master

Commits on Feb 22, 2021

  1. 8008357
  2. Add some tests

    UtahDave committed Feb 22, 2021
    50e7e8d

Commits on Feb 24, 2021

  1. Format with Black

    UtahDave committed Feb 24, 2021
    a3de045

Commits on Jul 22, 2021

  1. Manually parse mbox format

    Parsing the mbox file manually instead of using Python's built-in
    parser allows us to process large files without loading them into
    memory all at once.
    maxhawkins committed Jul 22, 2021
    72802a8
  2. Fix import for messages that don't have a Date

    This fixes a regression introduced by the previous commit, where the date was no longer read from the mbox 'From ' line. For messages without a Date header, this meant losing information about the delivery date.
    maxhawkins committed Jul 22, 2021
    4bc7010
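
The two commits above describe streaming the mbox file and falling back to the 'From ' separator line for the delivery date. The following is only a minimal sketch of that idea, not the code from this pull request. The function names are hypothetical, and the separator layout (`From <id> <date>`) is an assumption about the export format. It also assumes body lines that begin with "From " are escaped in the file.

```python
import io


def iter_mbox_messages(fileobj):
    """Yield (from_line, raw_message_bytes) one message at a time.

    Only the lines of the current message are held in memory.
    """
    from_line = None
    lines = []
    for line in fileobj:
        if line.startswith(b"From "):
            # A new separator ends the previous message, if there was one.
            if from_line is not None:
                yield from_line, b"".join(lines)
            from_line = line.rstrip(b"\r\n")
            lines = []
        elif from_line is not None:
            lines.append(line)
    if from_line is not None:
        yield from_line, b"".join(lines)


def delivery_date_from_from_line(from_line):
    """Return the date portion of a separator line as a str, or None.

    Assumes the layout b"From <id> <date text>" (hypothetical).
    """
    parts = from_line.decode("ascii", "replace").split(" ", 2)
    return parts[2] if len(parts) == 3 else None
```

A caller would open the file in binary mode and loop over `iter_mbox_messages(f)`, using `delivery_date_from_from_line` only when the parsed message has no Date header.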

Commits on Jul 28, 2021

  1. Use thread id as pkey if missing message id.

    Some messages (like gchat logs) don't have message ids and therefore don't save properly. This commit uses the Gmail X-GM-THRID header if the Message-Id is missing.
    maxhawkins committed Jul 28, 2021
    8ee555c
  2. Fix parse exception: convert delivery_date to a str

    The function email.utils.parsedate_tz expects a str, but we were passing bytes. Casting to str fixes an exception in messages where the Date header is missing and the delivery time must be inferred from the mbox header.
    maxhawkins committed Jul 28, 2021
    e1fdef7
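
The bytes-versus-str fix described above can be illustrated with a small wrapper around `email.utils.parsedate_tz`. This is a sketch under assumptions rather than the commit's actual change: `parse_delivery_date` is a hypothetical helper, and decoding as ASCII with replacement is one possible choice.

```python
from email.utils import mktime_tz, parsedate_tz


def parse_delivery_date(raw):
    """Return a POSIX timestamp for a date string, or None if unparseable.

    parsedate_tz expects a str, so bytes read from the mbox file are
    decoded first.
    """
    if isinstance(raw, bytes):
        raw = raw.decode("ascii", errors="replace")
    parsed = parsedate_tz(raw)
    if parsed is None:
        return None
    return mktime_tz(parsed)
```

Returning None for unparseable input lets the caller decide whether to store a null date or skip the message.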

Commits on Aug 6, 2021

  1. Use message id from mbox header if none exists in MIME.

    Some messages (like chats) don't have a Message-Id MIME header, so the message is saved without a primary key.

    A previous commit used the thread id in this situation, but the same thread id can be used for multiple messages. The id from the mbox header, which is the message id used by the Gmail API, should be unique across all messages.
    maxhawkins committed Aug 6, 2021
    8939f5b
  2. Explicitly parse email with compat32 policy.

    The docs note: "The policy keyword should always be specified; The default will change to email.policy.default in a future version of Python."
    maxhawkins committed Aug 6, 2021
    953e7eb
  3. Simplify handling of headers with binary data.

    This shouldn't happen in RFC-abiding messages, but raw Unicode or other non-ASCII content will cause the header parser to return a Header object rather than a str. Improve handling of this case and add a simple unit test.
    maxhawkins committed Aug 6, 2021
    50cc883
  4. 770bc0e
  5. Deal with invalid rfc 2047 strings.

    If the string is invalid, the undecoded string is returned instead.
    maxhawkins committed Aug 6, 2021
    4f50ff4
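
Several of the Aug 6 commits concern header handling: an explicit compat32 policy, Header objects for non-ASCII content, a fallback for invalid RFC 2047 strings, and a primary key taken from the mbox header. The sketch below shows one way these pieces could fit together. All function names are hypothetical, and the position of the id in the 'From ' line (second whitespace-separated token) is an assumption.

```python
import email
from email import policy
from email.errors import HeaderParseError
from email.header import Header, decode_header, make_header


def parse_message(raw_bytes):
    """Parse raw bytes with the policy named explicitly."""
    return email.message_from_bytes(raw_bytes, policy=policy.compat32)


def header_to_str(value):
    """Turn a header value (str, Header, or None) into a plain str."""
    if value is None:
        return None
    if isinstance(value, Header):
        # Returned by compat32 when the raw header held non-ASCII bytes.
        return str(value)
    try:
        return str(make_header(decode_header(value)))
    except (HeaderParseError, LookupError, UnicodeError):
        # Invalid RFC 2047 encoding: keep the undecoded string.
        return value


def primary_key(msg, from_line):
    """Prefer the Message-Id header, else the id in the 'From ' line."""
    message_id = msg["Message-Id"]
    if message_id:
        return str(message_id).strip()
    parts = from_line.split()
    return parts[1].decode("ascii", "replace") if len(parts) > 1 else None
```

Keeping the fallback in one helper means every header goes through the same decoding path, so a malformed Subject cannot abort the import.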

Commits on Aug 7, 2021

  1. abb4dfd

Commits on Aug 8, 2021

  1. 2a31dd4
  2. Format with black

    maxhawkins committed Aug 8, 2021
    d3cf088
  3. 6a3832c
  4. Create table before inserting, ensuring proper column types.

    In some instances tables would be created with the wrong column types if the initial records had unexpected types. This fixes the issue by explicitly creating the table and specifying types.
    maxhawkins committed Aug 8, 2021
    25ee0a2
  5. 0bfe031
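
The "create table before inserting" commit above can be illustrated with the standard-library `sqlite3` module. This is a sketch only: the pull request's real schema and database layer are not shown on this page, so the table name `mbox_emails`, its columns, and both helper functions are hypothetical.

```python
import sqlite3


def ensure_messages_table(conn):
    """Create the table up front so column types never depend on the
    first records that happen to be inserted."""
    conn.execute(
        """
        CREATE TABLE IF NOT EXISTS mbox_emails (
            id TEXT PRIMARY KEY,
            date TEXT,
            subject TEXT,
            body TEXT
        )
        """
    )


def insert_message(conn, record):
    """Insert one message dict, replacing any row with the same id."""
    conn.execute(
        "INSERT OR REPLACE INTO mbox_emails (id, date, subject, body) "
        "VALUES (:id, :date, :subject, :body)",
        {
            "id": record["id"],
            "date": record.get("date"),
            "subject": record.get("subject"),
            "body": record.get("body"),
        },
    )
```

With the schema declared first, a record whose subject is missing still lands in a TEXT column instead of influencing the column type.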

Commits on Aug 10, 2021

  1. Use Python 3.6 email parser.

    Using this newer email parsing code enables parsing of attachments and easier parsing of html emails in the future.
    maxhawkins committed Aug 10, 2021
    98d89bf
  2. Use EmailMessage.get_body.

    This may be more robust than the tree-walking method we were using earlier, and will enable parsing of html email contents in a future commit.
    maxhawkins committed Aug 10, 2021
    c081ed3
  3. Parse html emails to plaintext.

    (Only if no text/plain alternative exists)
    maxhawkins committed Aug 10, 2021
    8e6d487
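
The Aug 10 commits describe preferring a text/plain body via `EmailMessage.get_body` and converting HTML to plaintext only when no plain alternative exists. The following is a rough sketch of that flow, not the pull request's implementation. `extract_body_text`, `html_to_text`, and the tag-stripping approach are assumptions, and the naive extractor would also keep the contents of script and style elements.

```python
import email
from email import policy
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collect text nodes, adding breaks at a few block-level tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("br", "p", "div"):
            self.chunks.append("\n")

    def handle_data(self, data):
        self.chunks.append(data)


def html_to_text(html):
    """Strip tags and collapse runs of whitespace."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join("".join(parser.chunks).split())


def extract_body_text(raw_bytes):
    """Return the plain-text body, falling back to converted HTML."""
    # policy.default yields EmailMessage objects, which provide get_body.
    msg = email.message_from_bytes(raw_bytes, policy=policy.default)
    part = msg.get_body(preferencelist=("plain",))
    if part is not None:
        return part.get_content()
    part = msg.get_body(preferencelist=("html",))
    if part is not None:
        return html_to_text(part.get_content())
    return None
```

Asking for "plain" first and "html" second keeps the HTML conversion as a last resort, matching the "(Only if no text/plain alternative exists)" note.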