Header value (for user-defined header) in UTF-8 causes hang on later message send (SMTP transport) #56

Open
dd-b opened this issue Dec 26, 2017 · 8 comments

Comments

dd-b commented Dec 26, 2017

Took me forever to find this since it violated some of my assumptions, like that if any header values handled UTF-8 correctly they all would :-).

UTF-8 in header string values works fine for Subject and To. However, if I pass a header string value for the header "X-Minicon-DB" that Encode::is_utf8() believes is UTF-8 (even if none of the characters are non-ASCII), an attempt to send an email with that header times out (with SMTP transport).

I finally demonstrated this by using Encode::encode() and Encode::decode() to force the two states; ASCII works consistently, UTF-8 hangs consistently.

The following code fragment creates a header array that is later passed to Email::MIME->create(). When the second line is a decode() call, sending the resulting email hangs and eventually fails with "Sending failed: error at after DATA: 4.4.2 homiemail-a25.g.dreamhost.com Error: timeout exceeded". When it is an encode() call, sending the email succeeds. (This is descended from production code but rather simplified at this point.)

use Encode qw(decode encode is_utf8);   # $mcnum, $topguy and $destemail are set elsewhere in the script

my $xmdb = '53 pr1 350141';
$xmdb = decode('UTF-8', $xmdb);      # encode() sends okay, decode() hangs in SMTP transport later
print "xmdb \"$xmdb\" utf-8 ", is_utf8($xmdb) ? 'y' : 'n', "\n";
my $header_str = [
    'From'         => "Minicon <empubs\@minicon$mcnum.mnstf.org>",
    'To'           => '"' . $topguy->{'name'} . '" <' . $destemail . '>',
    'X-Minicon-DB' => $xmdb,
    'Subject'      => "Minicon $mcnum Progress Report 1",
];

Once I determined what was causing the hang, it was of course easy to work around the problem by using encode(). (Finding it probably took me longer than it should have: the fact that some headers with UTF-8 values worked led me to put off testing that as the cause until late, so I wasted time checking lots of other things that weren't the cause.) Still, this probably isn't the intended behavior, and if it is, it should at least be documented clearly (or maybe I missed finding it).
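For anyone who wants to look at the two states without an SMTP account, here is a minimal sketch using only the core Encode module. It does not send anything; it only compares the two values. The variable names are illustrative and not taken from the production code.

```perl
use strict;
use warnings;
use Encode qw(decode encode is_utf8);

my $raw     = '53 pr1 350141';          # pure ASCII, same value as in the report
my $decoded = decode('UTF-8', $raw);    # the variant that reportedly hangs later
my $encoded = encode('UTF-8', $raw);    # the variant that reportedly sends fine

# is_utf8() reports Perl's internal storage flag, not the content.
printf "decoded flag: %s\n", is_utf8($decoded) ? 'on' : 'off';
printf "encoded flag: %s\n", is_utf8($encoded) ? 'on' : 'off';

# As far as pure-Perl code is concerned, the two values are the same string.
printf "same string:  %s\n", $decoded eq $encoded ? 'yes' : 'no';
```

If the two values compare equal here but behave differently when sent, the difference has to come from code that looks at the internal flag.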

Contributor

pali commented Dec 26, 2017

Hi! The first thing I spotted in your message: you should never use Encode::is_utf8() or utf8::is_utf8(). They do NOT check whether a string is UTF-8 encoded.

Author

dd-b commented Dec 26, 2017

Well, they return opposite values in the two cases, and one fails and one succeeds. I know they're not about encoding, they're about Perl internal representation; but that seems to be what makes the difference in this case.

Contributor

pali commented Dec 26, 2017

Err, I have not understood what the real problem is. What do you mean by "they return opposite values"?

decode('UTF-8', $xmdb) expects UTF-8 octets in $xmdb and returns a sequence of Unicode code points. encode does the reverse operation (it expects Unicode code points and returns octets). So you cannot swap one call for the other; you always need to know what you have and what the other side or API expects.
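A short sketch of that round trip, using only the core Encode module and an illustrative sample value (the word "cafe" with an e-acute):

```perl
use strict;
use warnings;
use Encode qw(decode encode);

my $octets = "caf\xC3\xA9";               # 5 bytes: the UTF-8 encoding of the sample word
my $chars  = decode('UTF-8', $octets);    # 4 characters, the last one is U+00E9
my $back   = encode('UTF-8', $chars);     # 5 bytes again

printf "octets: %d, characters: %d, round trip equal: %s\n",
    length($octets), length($chars), $back eq $octets ? 'yes' : 'no';
```

The lengths differ because one value counts bytes and the other counts characters, which is why the two calls are not interchangeable.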

Author

dd-b commented Dec 26, 2017

I meant is_utf8() did. If I decode() it returns one, if I encode() it returns the other. I think I roughly understand what encode() and decode() do and have a slight understanding of what is_utf8() does, having read the Encode documentation fairly carefully (and been programming for a hair under 50 years now).

My point is that the code with encode() in it works: the message eventually generated ($header_str goes into the Email::MIME->create() call untouched), when sent with Transport::SMTP, "always" arrives (I know email is unreliable, but over dozens of trials it has never failed). If I put decode() there instead, it always times out.

And, secondarily, that is_utf8() returns 1 on the output of decode() and undef on the output of encode().

I was led to that solution by applying is_utf8() to the actual values being used in the production code (which was timing out on 11 records out of a file of 1500) to see what was there, and noticing that in the particular cases that were hanging, is_utf8() returned true even though there were no non-ASCII characters in the string. (The real data comes from reading a file that I use binmode to set as UTF-8, some lines of which contain characters beyond ASCII, though none of them in the field being extracted and returned by PphHdr(). However, other fields in the same line had non-ASCII characters in the cases that failed.) I suspect that the process of cutting the fields out of the line preserves their UTF-8 character in some internal Perl sense, and that encode() is fixing it by producing a byte-for-byte identical result which is not considered UTF-8 internally. But I don't know the tools to dig into it that deeply, and don't actually understand the fine points of Perl internal string representations precisely enough to apply the tools sensibly if I had them.
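The suspicion about fields cut out of a decoded line can be checked with core Perl alone. The sketch below uses a hypothetical tab-separated record (not the real file format) in which only the first field contains a non-ASCII character, and assumes that decode() stands in for reading through a UTF-8 layer.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical record: a name with an e-acute, a tab, then an ASCII-only field.
my $line_octets = "Ren\xC3\xA9e\t53 pr1 350141";
my $line        = decode('UTF-8', $line_octets);   # stand-in for a UTF-8 read layer

my ($name, $field) = split /\t/, $line;

# The field has no non-ASCII characters, but it was cut from a decoded line.
printf "field is ASCII-only: %s\n", $field =~ /\A[\x00-\x7F]*\z/ ? 'yes' : 'no';
printf "field flag:          %s\n", utf8::is_utf8($field) ? 'on' : 'off';
```

For a raw dump of a scalar's internals, the core Devel::Peek module and its Dump() function are the usual tool.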

Reducing this to simplest terms -- I have exhibited two code fragments (well, one, with one call to change), none of which return errors or throw an exception. Both arrays of header strings go into Email::MIME->create() without error and return an email message object (I've dumped them, they differ only in the header string carried forward). But when you send one of them using Transport::SMTP it times out, whereas when you send the other the message is delivered. My speculation on causes may be completely bogus, but I have an easily-demonstrated problem.

Contributor

pali commented Dec 26, 2017

Have you looked at generated message with encode and with decode? Maybe this could help...

API for Email::MIME->create is that header_str expects Unicode strings and header expects bytes.

So if you already have a Unicode string, you should pass it to header_str without any encode or decode. If you have a UTF-8 encoded byte string, then you can either:

  • decode it from UTF-8 and pass the result to header_str

  • or, pass it as is to header
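A sketch of the two options, assuming Email::MIME is installed (if it is not, the script only says so). The addresses are placeholders and the subject text is an invented sample; the X-Minicon-DB value is the one from the report.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Assumption: Email::MIME is available; otherwise nothing is built.
my $have_email_mime = eval { require Email::MIME; 1 };

my $subject_octets = "Caf\xC3\xA9 report";               # UTF-8 encoded bytes
my $subject_chars  = decode('UTF-8', $subject_octets);   # Unicode string

if ($have_email_mime) {
    my $email = Email::MIME->create(
        header_str => [                       # character strings go here
            From    => 'sender@example.com',
            To      => 'recipient@example.com',
            Subject => $subject_chars,
        ],
        header => [                           # already-encoded bytes go here
            'X-Minicon-DB' => '53 pr1 350141',
        ],
        attributes => {
            content_type => 'text/plain',
            charset      => 'UTF-8',
            encoding     => 'quoted-printable',
        },
        body_str => "Hello\n",
    );
    print $email->as_string;
}
else {
    print "Email::MIME not available; skipping\n";
}
```

Printing the message with as_string is also a cheap way to compare what the encode() and decode() variants actually produce before anything reaches the SMTP transport.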

> is_utf8() returned true even though there were no non-ASCII characters in the string.

As you already pointed out, is_utf8 returns something about the internal representation, and Perl can change the internal representation at any time. So there is absolutely nothing suspicious about an ASCII string having an internal UTF-8 representation (moreover, UTF-8 is an extension of ASCII). A byte buffer can also be stored in the internal UTF-8 representation. And when you have a Unicode string that contains only code points U+0000 to U+00FF, it can, but does not have to, use the internal UTF-8 representation. So it is really something internal which can change at any time and has nothing to do with what is stored in the Perl scalar.

It is known that some Perl modules written in C/XS are buggy in these cases, but in pure Perl code this internal representation is fully invisible.
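A minimal sketch of that invisibility, using only the built-in utf8:: functions; the value is the one from the report and the variable names are illustrative.

```perl
use strict;
use warnings;

my $plain    = '53 pr1 350141';
my $upgraded = $plain;
utf8::upgrade($upgraded);        # changes internal storage only; the characters stay the same

printf "flags: plain=%s upgraded=%s\n",
    utf8::is_utf8($plain)    ? 'on' : 'off',
    utf8::is_utf8($upgraded) ? 'on' : 'off';
printf "eq: %s, same length: %s\n",
    $plain eq $upgraded                 ? 'yes' : 'no',
    length($plain) == length($upgraded) ? 'yes' : 'no';

utf8::downgrade($upgraded);      # and back again, still the same string
```

Pure-Perl code cannot tell the two scalars apart except by asking for the flag explicitly; code that behaves differently for them is relying on the internal representation.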

Author

dd-b commented Dec 26, 2017

header_str interface doesn't actually expect Unicode as such, it expects "strings", whereas header expects bytes / octet streams. I'm passing strings to header_str. (That's why the array ref is named what it is.)

Some of those strings are utf-8 encoded. Some of those contain non-ASCII characters, others do not. Most work. One particular case does not work. I think I've identified that case well enough for people to reproduce it. I've also found a workaround for my own immediate problem (which would work, I believe, for anybody else hitting the same issue); so a fix isn't particularly urgent to me.

However, is part of your issue that you think my workaround works too much by luck rather than using a tool properly? I do think I'm depending on a slightly weird behavior of encode() (that when encoding a string containing only ASCII to utf-8, it actually produces an ASCII result, not flagged as utf-8; I saw this described somewhere, and I'm pretty sure it's why my workaround works). I played around a bit more, and I find that using $xmdb = encode('ascii', $xmdb) on the real data also works, and this seems to have the advantage of doing exactly what it says -- taking a string, encoding it as ASCII. I've changed the production code to use this form of the workaround, and updated the comments there.
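One thing to keep in mind with the encode('ascii', ...) form of the workaround: it is only lossless while the value really is ASCII. The sketch below, with an invented non-ASCII sample value, shows the default behavior and a stricter variant using Encode's FB_CROAK check.

```perl
use strict;
use warnings;
use Encode qw(encode);

my $ascii_only = '53 pr1 350141';
my $accented   = "caf\x{E9}";              # contains U+00E9, which ASCII cannot represent

my $ok    = encode('ascii', $ascii_only);  # content unchanged
my $lossy = encode('ascii', $accented);    # default mode substitutes '?' for U+00E9

# Asking Encode to croak instead makes unexpected input visible.
my $strict = eval {
    encode('ascii', $accented, Encode::FB_CROAK() | Encode::LEAVE_SRC());
};
my $croaked = defined $strict ? 'no' : 'yes';

print "ok=$ok lossy=$lossy croaked=$croaked\n";
```

For a field that is guaranteed to be ASCII this makes no difference; for anything else, the strict form fails loudly instead of silently altering the header value.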

Contributor

pali commented Dec 26, 2017

> header_str interface doesn't actually expect Unicode as such, it expects "strings"

All strings in Perl are just a sequence of Unicode code points. So if an interface expects strings, it means Unicode.

> whereas header expects bytes / octet streams.

That is true.

> Some of those strings are utf-8 encoded.

If it is already UTF-8 encoded, then it is a sequence of UTF-8 octets and so it is a byte stream. I would suggest not calling it a "string" in the Perl context. (UTF-8 != Unicode)

> However, is part of your issue that you think my workaround works too much by luck rather than using a tool properly?

Basically, I do not know whether your reported issue is a bug in Email::MIME, a question, or a suggestion. For bug reports against Email::MIME, at least a snippet of code that returns a wrong value is expected.

Author

dd-b commented Dec 27, 2017

Sorry, thought I'd made it pretty clear in the original post (and multiple times since) that this was a bug report. That's why I'm entering it as an issue! Also, that I have a workable workaround. I'm not looking for help, I got my code working, I'm trying to get things fixed (in code, documentation, or wherever the root problem is) so other people don't have to go through the same process I did.

The complete code that produces the original problem requires external files, and can't be run without reconfiguring it to some other email service than the one I use (since I won't be giving out my password!). The short version that produces it standalone depends on what may be abuse of encode() and decode() (but no errors are produced at any time, until the final timeout when the attempt to actually send the email is made). Plus it still would need to be configured to some other email service to actually run it. And I don't have the time right now to spend yet more time on this problem that's already solved for me.
