don't "fix" encoding of raw message/rfc822 parts

In the code that handles message/rfc822 MIME parts (message.rb line 498), we were calling #normalize_whitespace on the body string before it was decoded. I'm not sure that altering whitespace is the right thing to do there, but that aside, #normalize_whitespace also called #fix_encoding!, which forcibly transcoded the raw body to UTF-8. Instead, we want to keep the body as ASCII-8BIT at that point and let it be decoded by the normal message decoding mechanisms.

The only other calls to #normalize_whitespace are in the UI and in the code path that handles the body text of messages (message.rb line 592), where the body text has already been decoded. So it seems safe to make #normalize_whitespace only touch whitespace and leave the string encoding alone.

Fixes #205.
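
The ordering problem described above can be illustrated with a minimal sketch. It is not Sup's code: `lossy_utf8` is a hypothetical stand-in for an early encoding "fix" (the real #fix_encoding! may behave differently), the sample bytes are made up, and ISO-8859-1 is assumed as the declared charset. Only `normalize_whitespace` mirrors the method body left in place by this commit.

```ruby
# Hypothetical stand-in for an early "fix": treat the raw bytes as UTF-8 and
# replace anything invalid. This is NOT Sup's actual #fix_encoding!.
def lossy_utf8(raw)
  raw.dup.force_encoding(Encoding::UTF_8).scrub("\uFFFD")
end

# Whitespace-only normalization, as #normalize_whitespace reads after this
# commit (written as a plain function here instead of a String method).
def normalize_whitespace(str)
  str.gsub(/\t/, "    ").gsub(/\r/, "")
end

# Made-up raw body: ISO-8859-1 bytes tagged as ASCII-8BIT, the way an
# undecoded MIME part might arrive.
raw = "caf\xE9\tau lait\r\n".b

# Fixing the encoding before decoding: byte 0xE9 is not valid UTF-8, so it is
# replaced and the original character can no longer be recovered.
early = lossy_utf8(raw)

# Leaving the bytes alone: whitespace handling keeps the string ASCII-8BIT,
# and a later decoding step can still apply the declared charset.
normalized = normalize_whitespace(raw)
late = normalized.dup.force_encoding(Encoding::ISO_8859_1).encode(Encoding::UTF_8)

puts early.inspect
puts normalized.encoding
puts late.inspect
```

In this sketch the early path ends up with a replacement character where the accented letter was, while the late path still holds the original byte when the charset is finally applied.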
sup-heliotrope · Jul 12, 2020
1 parent 4204170
commit d3fbac1

Showing 3 changed files with 59 additions and 1 deletion.
diff --git a/lib/sup/util.rb b/lib/sup/util.rb
@@ -376,7 +376,6 @@ def transcode to_encoding, from_encoding
   end
 
   def normalize_whitespace
-    fix_encoding!
     gsub(/\t/, "    ").gsub(/\r/, "")
   end
 

diff --git a/test/fixtures/non-ascii-header-in-nested-message.eml b/test/fixtures/non-ascii-header-in-nested-message.eml
@@ -0,0 +1,36 @@
+Return-Path: <[email protected]>
+From: SPAM ® <[email protected]>
+To: <[email protected]>
+Subject: spam ® spam
+MIME-Version: 1.0
+Content-Type: multipart/mixed; boundary="----------=_4F506AC2.EE281DC4"
+Message-Id: <[email protected]>
+Date: Fri,  2 Mar 2012 07:37:55 +0100 (CET)
+
+This is a multi-part message in MIME format.
+
+------------=_4F506AC2.EE281DC4
+Content-Type: text/plain; charset=iso-8859-1
+Content-Disposition: inline
+Content-Transfer-Encoding: 8bit
+
+Spam detection software, running on the system "a.a.a.a.a.", has
+identified this incoming email as possible spam.  The original message
+has been attached to this so you can view it (if it isn't spam) or label
+similar future email.
+
+
+------------=_4F506AC2.EE281DC4
+Content-Type: message/rfc822; x-spam-type=original
+Content-Description: original message before SpamAssassin
+Content-Disposition: attachment
+Content-Transfer-Encoding: 8bit
+
+From: SPAM ® <[email protected]>
+To: <[email protected]>
+Subject: spam ® spam
+
+This is a spam.
+
+------------=_4F506AC2.EE281DC4--
+
diff --git a/test/test_message.rb b/test/test_message.rb
@@ -248,6 +248,29 @@ def test_nonascii_header
     assert_equal("spam \ufffd spam", sup_message.subj)
   end
 
+  def test_nonascii_header_in_nested_message
+    source = DummySource.new("sup-test://test_nonascii_header_in_nested_message")
+    source.messages = [ fixture_path("non-ascii-header-in-nested-message.eml") ]
+    source_info = 0
+
+    sup_message = Message.build_from_source(source, source_info)
+    chunks = sup_message.load_from_source!
+
+    assert_equal(3, chunks.length)
+
+    assert(chunks[0].is_a? Redwood::Chunk::Text)
+
+    assert(chunks[1].is_a? Redwood::Chunk::EnclosedMessage)
+    ## TODO need to fix EnclosedMessage#lines
+    #assert_equal(4, chunks[1].lines.length)
+    #assert_equal("From: SPAM \ufffd <[email protected]>", chunks[1].lines[0])
+    #assert_equal("spam \ufffd spam", chunks[1].lines[3])
+
+    assert(chunks[2].is_a? Redwood::Chunk::Text)
+    assert_equal(1, chunks[2].lines.length)
+    assert_equal("This is a spam.", chunks[2].lines[0])
+  end
+
   def test_malicious_attachment_names
     source = DummySource.new("sup-test://test_blank_header_lines")
     source.messages = [ fixture_path('malicious-attachment-names.eml') ]