<!doctype html public "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<meta http-equiv="Content-Type" content="text/html;charset=UTF-8">
<head>
<title>char8_t: A type for UTF-8 characters and strings (Revision 3)</title>
<link rel="stylesheet"
href="https://cdnjs.cloudflare.com/ajax/libs/highlight.js/9.12.0/styles/default.min.css"/>
<script src="https://cdnjs.cloudflare.com/ajax/libs/highlight.js/9.12.0/highlight.min.js"></script>
<script>hljs.initHighlightingOnLoad();</script>
<style type="text/css">
pre {
display: inline;
}
table#header th,
table#header td
{
text-align: left;
}
table#references th,
table#references td
{
vertical-align: top;
}
ins, ins * { text-decoration:none; font-weight:bold; background-color:#A0FFA0 }
del, del * { text-decoration:line-through; background-color:#FFA0A0 }
#hidedel:checked ~ * del, #hidedel:checked ~ * del * { display:none; visibility:hidden }
blockquote
{
color: #000000;
background-color: #F1F1F1;
border: 1px solid #D1D1D1;
padding-left: 0.5em;
padding-right: 0.5em;
}
blockquote.stdins
{
text-decoration: underline;
color: #000000;
background-color: #C8FFC8;
border: 1px solid #B3EBB3;
padding: 0.5em;
}
blockquote.stddel
{
text-decoration: line-through;
color: #000000;
background-color: #FFEBFF;
border: 1px solid #ECD7EC;
padding-left: 0.5em;
padding-right: 0.5em;
}
</style>
</head>
<body>
<table id="header">
<tr>
<th>Document Number:</th>
<td>P0482R3</td>
</tr>
<tr>
<th>Date:</th>
<td>2018-05-07</td>
</tr>
<tr>
<th>Audience:</th>
<td>SG16<br/>
Evolution Working Group<br/>
Library Evolution Working Group</td>
</tr>
<tr>
<th>Reply-to:</th>
<td>Tom Honermann <[email protected]></td>
</tr>
</table>
<h1>char8_t: A type for UTF-8 characters and strings (Revision 3)</h1>
<ul>
<li><a href="#changes_since_P0482R2">
Changes since P0482R2</a></li>
<li><a href="#introduction">
Introduction</a></li>
<li><a href="#motivation">
Motivation</a></li>
<li><a href="#proposal">
Proposal</a></li>
<li><a href="#design">
Design Considerations</a>
<ul>
<li><a href="#design_compat">
Backward compatibility
</a>
<ul>
<li><a href="#design_compat_core">
Core language backward compatibility
</a>
<ul>
<li><a href="#design_compat_core_init">
Initialization
</a></li>
<li><a href="#design_compat_core_implicit_conversion">
Implicit conversions
</a></li>
<li><a href="#design_compat_core_type_deduction">
Type deduction
</a></li>
<li><a href="#design_compat_core_overload_resolution">
Overload resolution
</a></li>
<li><a href="#design_compat_core_template_specialization">
Template specialization
</a></li>
</ul>
</li>
<li><a href="#design_compat_library">
Library backward compatibility
</a>
<ul>
<li><a href="#design_compat_library_u8string">
Return type of <tt>path::u8string</tt> and <tt>path::generic_u8string</tt>
</a></li>
<li><a href="#design_compat_library_literal_operators">
Return type of <tt>operator ""s</tt> and <tt>operator ""sv</tt>
</a></li>
</ul>
</li>
</ul>
</li>
<li><a href="#design_narrow_utf8">
Should UTF-8 literals continue to be referred to as narrow literals?
</a></li>
<li><a href="#design_char8_t_underlying_type">
What should be the underlying type of char8_t?
</a></li>
</ul>
</li>
<li><a href="#implementation_exp">
Implementation Experience</a></li>
<li><a href="#wording">
Formal Wording</a>
<ul>
<li><a href="#core_wording">
Core wording</a></li>
<li><a href="#library_wording">
Library wording</a></li>
<li><a href="#annex_a_wording">
Annex A Grammar summary wording</a></li>
<li><a href="#annex_c_wording">
Annex C Compatibility wording</a></li>
<li><a href="#annex_d_wording">
Annex D Compatibility features wording</a></li>
<li><a href="#feature_testing">
Wording for P0096: Feature-testing recommendations for C++</a></li>
</ul>
</li>
<li><a href="#acknowledgements">
Acknowledgements</a></li>
<li><a href="#references">
References</a></li>
</ul>
<h1 id="changes_since_P0482R2">Changes since P0482R2</h1>
<ul>
<li>Added a wording section for annex D for deprecated features.</li>
<li>Updated the implementation experience section to note Richard Smith's
contribution of an implementation of the proposed core language changes
to Clang.</li>
</ul>
<h1 id="introduction">Introduction</h1>
<p>C++11 introduced support for UTF-8, UTF-16, and UTF-32 encoded string
literals via
<a title="N2249: New Character Types in C++"
href="http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2007/n2249.html">
N2249</a>
<sup><a title="N2249: New Character Types in C++"
href="#ref_n2249">
[N2249]</a></sup>.
New <tt>char16_t</tt> and <tt>char32_t</tt> types were added to hold values of
code units for the UTF-16 and UTF-32 variants, but a new type was not added for
the UTF-8 variants. Instead, UTF-8 character literals (added in C++17 via
<a title="N4197: Adding u8 character literals"
href="http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2014/n4197.html">
N4197</a>
<sup><a title="N4197: Adding u8 character literals"
href="#ref_n4197">
[N4197]</a></sup>)
and UTF-8 string literals were defined in terms of the <tt>char</tt> type used
for the code unit type of ordinary character and string literals. UTF-8 is the
only text encoding mandated to be supported by the C++ standard for which there
is no distinct code unit type. Lack of a distinct type for UTF-8 encoded
character and string literals prevents the use of overloading and template
specialization in interfaces designed for interoperability with encoded text.
The inability to infer an encoding for narrow characters and strings limits
design possibilities and hinders the production of elegant interfaces that work
seamlessly in generic code. Library authors must choose to limit encoding
support, design interfaces that require users to explicitly specify encodings,
or provide distinct interfaces for, at least, the implementation defined
execution and UTF-8 encodings.</p>
<p>Whether <tt>char</tt> is a signed or unsigned type is implementation
defined. Implementations that use an 8-bit signed <tt>char</tt> are at a
disadvantage when working with UTF-8 encoded text because conversions to
unsigned types are required in order to correctly process the leading and
continuation code units of multi-byte encoded code points.</p>
<p>The lack of a distinct type and the use of a code unit type with a range that
does not portably include the full unsigned range of UTF-8 code units presents
challenges for working with UTF-8 encoded text that are not present when working
with UTF-16 or UTF-32 encoded text. Enclosed is a proposal for a new
<tt>char8_t</tt> fundamental type and related library enhancements intended to
remove barriers to working with UTF-8 encoded text and to enable generic
interfaces that work with all five of the standard mandated text encodings in a
consistent manner.</p>
<h1 id="motivation">Motivation</h1>
<p>Consider the following string literal expressions, all of which encode
<tt>U+0123</tt>, <tt>LATIN SMALL LETTER G WITH CEDILLA</tt>:
<fieldset>
<pre><code class="c++">u8"\u0123" // UTF-8: const char[]: 0xC4 0xA3 0x00
u"\u0123" // UTF-16: const char16_t[]: 0x0123 0x0000
U"\u0123" // UTF-32: const char32_t[]: 0x00000123 0x00000000
"\u0123" // ???: const char[]: ???
L"\u0123" // ???: const wchar_t[]: ???
</code></pre>
</fieldset>
</p>
<p>The UTF-8, UTF-16, and UTF-32 string literals have well-defined and portable
sequences of code unit values. The ordinary and wide string literal code unit
sequences depend on the implementation defined execution and execution wide
encodings respectively. Code that is designed to work with text encodings must
be able to differentiate these strings. This is straightforward for wide,
UTF-16, and UTF-32 string literals since they each have a distinct code unit
type suitable for differentiation via function overloading or template
specialization. But for ordinary and UTF-8 string literals, differentiating
between them requires additional information since they have the same code unit
type. That additional information might be provided implicitly via differently
named functions, or explicitly via additional function or template
arguments. For example:
<fieldset>
<pre><code class="c++">// Differentiation by function name:
void do_x(const char *);
void do_x_utf8(const char *);
void do_x(const wchar_t *);
void do_x(const char16_t *);
void do_x(const char32_t *);
// Differentiation by suffix for user-defined literals:
int operator ""_udl(const char *s, std::size_t);
int operator ""_udl_utf8(const char *s, std::size_t);
int operator ""_udl(const wchar_t *s, std::size_t);
int operator ""_udl(const char16_t *s, std::size_t);
int operator ""_udl(const char32_t *s, std::size_t);
// Differentiation by function parameter:
void do_x2(const char *, bool is_utf8);
void do_x2(const wchar_t *);
void do_x2(const char16_t *);
void do_x2(const char32_t *);
// Differentiation by template parameter:
template<bool IsUTF8>
void do_x3(const char *);
</code></pre>
</fieldset>
</p>
<p>The requirement to specify the text encoding by some means other than the
type of the string limits the ability to provide elegant, encoding-sensitive
interfaces. Consider the following invocations of the
<tt>make_text_view</tt> function proposed in
<a title="P0244R2: Text_view: A C++ concepts and range based character encoding
and code point enumeration library"
href="http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2017/p0244r2.html">
P0244R2</a>
<sup><a title="P0244R2: Text_view: A C++ concepts and range based character
encoding and code point enumeration library"
href="#ref_p0244r2">
[P0244R2]</a></sup>:
<fieldset>
<pre><code class="c++">make_text_view<execution_character_encoding>("text")
make_text_view<execution_wide_character_encoding>(L"text")
make_text_view<utf8_encoding>(u8"text")
make_text_view<utf16_encoding>(u"text")
make_text_view<utf32_encoding>(U"text")
</code></pre>
</fieldset>
</p>
<p>For each invocation, the encoding of the string literal is known at compile
time, so having to explicitly specify the encoding tag is redundant. If
UTF-8 string literals had a distinct type, then the encoding type could be
inferred, while still allowing an overriding tag to be supplied:
<fieldset>
<pre><code class="c++">make_text_view("text") // defaults to execution_character_encoding.
make_text_view(L"text") // defaults to execution_wide_character_encoding.
make_text_view(u8"text") // defaults to utf8_encoding.
make_text_view(u"text") // defaults to utf16_encoding.
make_text_view(U"text") // defaults to utf32_encoding.
make_text_view<utf16be_encoding>("\0t\0e\0x\0t\0") // Default overridden to select UTF-16BE.
</code></pre>
</fieldset>
</p>
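<p>The inference described above can be sketched with a trait that maps a code
unit type to a default encoding. This is a minimal illustration and not the
<tt>text_view</tt> implementation: the <tt>encoding</tt> enumeration,
<tt>default_encoding</tt>, <tt>infer_encoding</tt>, <tt>has_char8_t</tt>, and
<tt>u8_literal_encoding</tt> are hypothetical names, and the sketch assumes
that an implementation providing <tt>char8_t</tt> defines the
<tt>__cpp_char8_t</tt> feature-test macro.</p>

```cpp
#include <cstddef>

// Hypothetical encoding tags, loosely named after the text_view examples.
enum class encoding { execution, execution_wide, utf8, utf16, utf32 };

// The primary template is left undefined so that unsupported code unit
// types are rejected at compile time.
template<typename CodeUnit> struct default_encoding;

template<> struct default_encoding<char> {
    static constexpr encoding value = encoding::execution;
};
template<> struct default_encoding<wchar_t> {
    static constexpr encoding value = encoding::execution_wide;
};
template<> struct default_encoding<char16_t> {
    static constexpr encoding value = encoding::utf16;
};
template<> struct default_encoding<char32_t> {
    static constexpr encoding value = encoding::utf32;
};

#if defined(__cpp_char8_t)
// Only possible when UTF-8 has its own code unit type.
template<> struct default_encoding<char8_t> {
    static constexpr encoding value = encoding::utf8;
};
constexpr bool has_char8_t = true;
#else
constexpr bool has_char8_t = false;
#endif

// Infers the default encoding from the element type of a string literal.
template<typename CodeUnit, std::size_t N>
constexpr encoding infer_encoding(const CodeUnit (&)[N]) {
    return default_encoding<CodeUnit>::value;
}

// Without a distinct UTF-8 code unit type this is encoding::execution;
// with char8_t it is encoding::utf8.
constexpr encoding u8_literal_encoding = infer_encoding(u8"text");
```

<p>Without <tt>char8_t</tt>, the UTF-8 literal is indistinguishable from an
ordinary string literal and the trait has no choice but to report the execution
encoding for it.</p>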
<p>The inability to infer an encoding for narrow strings doesn't just limit the
interfaces of new features under consideration. Compromised interfaces are
already present in the standard library.</p>
<p>Consider the design of the <tt>codecvt</tt> class template. The standard
specifies the following specializations of <tt>codecvt</tt> be provided to
enable transcoding text from one encoding to another.
<fieldset>
<pre><code class="c++">codecvt<char, char, mbstate_t> <em>// #1</em>
codecvt<wchar_t, char, mbstate_t> <em>// #2</em>
codecvt<char16_t, char, mbstate_t> <em>// #3</em>
codecvt<char32_t, char, mbstate_t> <em>// #4</em>
</code></pre>
</fieldset>
</p>
<p>#1 performs no conversions. #2 converts between strings encoded in the
implementation defined wide and narrow encodings. #3 and #4 convert between
either the UTF-16 or UTF-32 encoding and the UTF-8 encoding. Specializations
are not currently specified for conversion between the implementation defined
narrow and wide encodings and any of the UTF-8, UTF-16, or UTF-32 encodings.
However, if support for such conversions were to be added, the desired
interfaces are already taken by #1, #3 and #4.</p>
<p>The file system interface adopted for C++17 via
<a title="P0218R1: Adopt the File System TS for C++17"
href="http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2016/p0372r0.html">
P0218R1</a>
<sup><a title="P0218R1: Adopt the File System TS for C++17"
href="#ref_p0218r1">
[P0218R1]</a></sup>
provides an example of a feature that supports all five of the standard mandated
encodings, but does so with an asymmetric interface due to the inability to
overload functions for UTF-8 encoded strings. Class
<tt>std::filesystem::path</tt> provides the following constructors to initialize
a <tt>path</tt> object based on a range of code unit values where the encoding
is inferred based on the value type of the range.
<fieldset>
<pre><code class="c++">template <class Source>
path(const Source& source);
template <class InputIterator>
path(InputIterator first, InputIterator last);
</code></pre>
</fieldset>
<p>§ 30.11.7.2.2 [fs.path.type.cvt] describes how the source encoding is
determined based on whether the source range value type is <tt>char</tt>,
<tt>wchar_t</tt>, <tt>char16_t</tt>, or <tt>char32_t</tt>. A range with value
type <tt>char</tt> is interpreted using the implementation defined execution
encoding. It is not possible to construct a path object from UTF-8
encoded text using these constructors.
<p>To accommodate UTF-8 encoded text, the file system library specifies the
following factory functions. Matching factory functions are not provided for
other encodings.
<fieldset>
<pre><code class="c++">template <class Source>
path u8path(const Source& source);
template <class InputIterator>
path u8path(InputIterator first, InputIterator last);
</code></pre>
</fieldset>
<p>The requirement to construct <tt>path</tt> objects using one interface for
UTF-8 strings vs another interface for all other supported encodings creates
unnecessary difficulties for portable code. Consider an application that uses
UTF-8 as its internal encoding on POSIX systems, but uses UTF-16 on Windows.
Conditional compilation or other abstractions must be implemented and used
in otherwise platform neutral code to construct <tt>path</tt> objects.</p>
<p>The inability to infer an encoding based on string type is not the only
challenge posed by use of <tt>char</tt> as the UTF-8 code unit type. The
following code exhibits implementation defined behavior.
<fieldset>
<pre><code class="c++">bool is_utf8_multibyte_code_unit(char c) {
return c >= 0x80;
}
</code></pre>
</fieldset>
</p>
<p>UTF-8 leading and continuation code units have values in the range 128
(0x80) to 255 (0xFF). In the common case where <tt>char</tt> is implemented
as a signed 8-bit type with a two's complement representation and a range of
-128 (-0x80) to 127 (0x7F), these values exceed the range of the
<tt>char</tt> type. Such implementations typically encode such code units as
unsigned values which are then reinterpreted as signed values when read. In
the code above, integral promotion rules result in <tt>c</tt> being promoted to
type <tt>int</tt> for comparison to the <tt>0x80</tt> operand. If <tt>c</tt>
holds a value corresponding to a leading or continuation code unit value, then
its value will be interpreted as negative and the promoted value of type
<tt>int</tt> will likewise be negative. The result is that the comparison
is always false for these implementations.</p>
<p>To correct the code above, explicit conversions are required. For example:
<fieldset>
<pre><code class="c++">bool is_utf8_multibyte_code_unit(char c) {
return static_cast<unsigned char>(c) >= 0x80;
}
</code></pre>
</fieldset>
</p>
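<p>The difference between the two forms can be observed on any implementation
by fixing the parameter type to <tt>signed char</tt> rather than relying on the
signedness of plain <tt>char</tt>. The following is a minimal sketch; the
function names are hypothetical, and it assumes an 8-bit <tt>char</tt> and that
converting <tt>0xC4</tt> to <tt>signed char</tt> yields -60, as on two's
complement implementations.</p>

```cpp
#include <cassert>

// Hypothetical stand-in for the uncorrected function on an implementation
// where plain char is signed.
bool naive_is_multibyte(signed char c) {
    // c is promoted to int; its value is at most 127, so this is never true.
    return c >= 0x80;
}

// Hypothetical stand-in for the corrected function.
bool corrected_is_multibyte(signed char c) {
    // Converting to unsigned char first recovers the values 0x80 to 0xFF.
    return static_cast<unsigned char>(c) >= 0x80;
}

void check_multibyte_detection() {
    // 0xC4 is the leading code unit of U+0123 in UTF-8.
    const signed char lead = static_cast<signed char>(0xC4);
    assert(!naive_is_multibyte(lead));       // misses the code unit
    assert(corrected_is_multibyte(lead));    // detects the code unit
    assert(!corrected_is_multibyte('a'));    // single-byte code units are unaffected
}
```

<p>A compiler may warn that the comparison in the first function is always
false, which is exactly the defect being illustrated.</p>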
<p>Finally, processing of UTF-8 strings currently suffers from a pessimization:
glvalue expressions of type <tt>char</tt> may alias objects of other types,
which constrains the optimizations a compiler can perform. Use of a distinct
type that does not share this aliasing behavior may allow for further compiler
optimizations.</p>
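<p>The aliasing concern can be illustrated with a loop that stores through a
<tt>char</tt> pointer held in a structure. This is a sketch only: the
<tt>byte_buffer</tt> type and function names are hypothetical, and whether a
given compiler actually reloads the members depends on the compiler and its
optimization settings.</p>

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical buffer type used only for illustration.
struct byte_buffer {
    char*       data;
    std::size_t size;
};

// A store through a char lvalue may legally modify an object of any type,
// including b.data and b.size themselves. Unless the compiler can prove
// otherwise, it may have to reload both members on every iteration.
void fill(byte_buffer& b, char value) {
    for (std::size_t i = 0; i < b.size; ++i)
        b.data[i] = value;
}

// A common workaround today: copy the members into locals whose addresses
// are never taken, so stores through p cannot modify them.
void fill_with_locals(byte_buffer& b, char value) {
    char* const p = b.data;
    const std::size_t n = b.size;
    for (std::size_t i = 0; i < n; ++i)
        p[i] = value;
}

void check_fill() {
    char storage[4] = {};
    byte_buffer b{storage, sizeof storage};
    fill(b, 'x');
    assert(storage[0] == 'x' && storage[3] == 'x');
    fill_with_locals(b, 'y');
    assert(storage[0] == 'y' && storage[3] == 'y');
}
```

<p>Both functions produce the same result. With a non-aliasing code unit type,
a compiler would be free to treat the first form like the second.</p>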
<p>As of November 2017,
<a title="Usage of UTF-8 for websites"
href="https://w3techs.com/technologies/details/en-utf8/all/all">
UTF-8 is now used by more than 90% of all websites</a>
<sup><a title="Usage of UTF-8 for websites"
href="#ref_w3techs">
[W3Techs]</a></sup>.
The C++ standard must improve support for UTF-8 by removing the existing
barriers that result in redundant tagging of character encodings, non-generic
UTF-8 specific workarounds like <tt>u8path</tt>, and the need for static
casts to examine UTF-8 code unit values.
</p>
<h1 id="proposal">Proposal</h1>
<p>The proposed changes are intended to bring the standard to the state the
author believes it would likely be in had <tt>char8_t</tt> been added at the
same time that <tt>char16_t</tt> and <tt>char32_t</tt> were added. This
includes the ability to differentiate ordinary and UTF-8 literals in function
overloading, template specializations, and user-defined literal operator
signatures. The following core language changes are proposed in order to
facilitate these capabilities:
<ul>
<li>A new fundamental type named <tt>char8_t</tt>. This integral type has
the same signedness, size, alignment, and integer conversion rank as
<tt>unsigned char</tt>, but does not alias with any other type
(e.g., this proposal does not add <tt>char8_t</tt> to the list of
aliasing types in § 8.2.1 [basic.lval] paragraph 11 (11.8)).</li>
<li>The type of UTF-8 string literals is changed from array of
<tt>const char</tt> to array of <tt>const char8_t</tt>.</li>
<li>The type of UTF-8 character literals is changed from <tt>char</tt>
to <tt>char8_t</tt>.</li>
<li>New <tt>char8_t</tt> based signatures for user-defined literal
operators.</li>
</ul></p>
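<p>The properties listed above can be expressed as compile-time checks. The
following sketch assumes that an implementation providing the proposed core
changes defines the <tt>__cpp_char8_t</tt> feature-test macro; the names
<tt>u8_code_unit</tt> and <tt>has_char8_t</tt> are hypothetical and used only
for illustration.</p>

```cpp
#include <type_traits>

// The code unit type of a UTF-8 string literal on this implementation.
using u8_code_unit =
    std::remove_cv_t<std::remove_reference_t<decltype(u8"x"[0])>>;

#if defined(__cpp_char8_t)
// With the proposed changes: UTF-8 literals use the new type.
static_assert(std::is_same<u8_code_unit, char8_t>::value,
              "UTF-8 string literals are arrays of const char8_t");
// char8_t is a distinct type, not an alias for an existing character type.
static_assert(!std::is_same<char8_t, char>::value, "distinct from char");
static_assert(!std::is_same<char8_t, unsigned char>::value,
              "distinct from unsigned char");
// Same signedness, size, and alignment as unsigned char.
static_assert(std::is_unsigned<char8_t>::value, "unsigned");
static_assert(sizeof(char8_t) == sizeof(unsigned char), "same size");
static_assert(alignof(char8_t) == alignof(unsigned char), "same alignment");
constexpr bool has_char8_t = true;
#else
// Without the proposed changes: UTF-8 literals use plain char.
static_assert(std::is_same<u8_code_unit, char>::value,
              "UTF-8 string literals are arrays of const char");
constexpr bool has_char8_t = false;
#endif
```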
<p>The following library changes are proposed to address concerns like those
raised in the motivation section above, and to take advantage of the new
core features:
<ul>
<li>New <tt>char8_t</tt> based specializations of <tt>atomic</tt>,
<tt>numeric_limits</tt>, <tt>hash</tt>, <tt>char_traits</tt>,
<tt>basic_string</tt>, and <tt>basic_string_view</tt>.</li>
<li>New <tt>u8streampos</tt>, <tt>u8string</tt>, <tt>u8string_view</tt>
type aliases.</li>
<li>New <tt>operator ""s</tt> and <tt>operator ""sv</tt> <tt>char8_t</tt>
based overloads for UTF-8 literals.</li>
<li>New <tt>basic_ostream<char>::operator<<()</tt> and
<tt>basic_istream<char>::operator>>()</tt> stream insertion
and extraction overloads for <tt>char8_t</tt>. These are added for
consistency with the current <tt>signed char</tt> and
<tt>unsigned char</tt> overloads.</li>
<li>New <tt>char8_t</tt> based specializations of <tt>codecvt</tt> and
<tt>codecvt_byname</tt> for converting between UTF-16, UTF-32, and
UTF-8. The existing <tt>char</tt> based specializations are deprecated.
The new specializations are functionally identical to the deprecated
ones.</li>
<li>The return type of the <tt>u8string</tt> and <tt>generic_u8string</tt>
member functions of the filesystem <tt>path</tt> class is changed
from <tt>string</tt> to <tt>u8string</tt>.</li>
<li>Filesystem <tt>path</tt> objects may now be constructed with UTF-8
strings using the existing <tt>path</tt> constructors used for
construction with other encodings. The existing <tt>u8path</tt>
factory functions are deprecated.</li>
</ul></p>
<p>These changes necessarily impact backward compatibility as described in
the <a href="#design_compat">Backward compatibility</a> section.</p>
<h1 id="design">Design Considerations</h1>
<h2 id="design_compat">Backward compatibility</h2>
<p>This proposal does not specify any backward compatibility features other than
to retain interfaces that it deprecates. The author believes such features are
necessary, but that a single set of such features would unnecessarily compromise
the goals of this proposal. Rather, the expectation is that implementations
will provide options to enable more fine grained compatibility features.</p>
<p>The following sections discuss backward compatibility impact.</p>
<h3 id="design_compat_core">Core language backward compatibility</h3>
<h4 id="design_compat_core_init">Initialization</h4>
<p>Declarations of arrays of <tt>char</tt> may currently be initialized with
UTF-8 string literals. Under this proposal, such initializations would
become ill-formed. This is intended to maintain consistency with
initialization of arrays of <tt>wchar_t</tt>, <tt>char16_t</tt>, and
<tt>char32_t</tt>, all of which require the initializing string literal to
have a matching element type as specified in § 11.6.2 [dcl.init.string].
<fieldset>
<pre><code class="c++">char ca[] = u8"text"; // C++17: Ok.
// This proposal: Ill-formed.
char8_t c8a[] = "text"; // C++17: N/A (char8_t is not a type specifier).
// This proposal: Ill-formed.
</code></pre>
</fieldset>
</p>
<p>Implementations are encouraged to add options to allow the above
initializations (with a warning) to assist users in migrating their code.</p>
<p>Declarations of variables of type <tt>char</tt> initialized with a UTF-8
character literal remain well-formed and are initialized following the
standard conversion rules.
<fieldset>
<pre><code class="c++">char c = u8'c'; // C++17: Ok.
// This proposal: Ok (no change from C++17).
char8_t c8 = 'c'; // C++17: N/A (char8_t is not a type specifier).
// This proposal: Ok; c8 is assigned the value of the 'c'
// character in the execution character set.
</code></pre>
</fieldset>
</p>
<h4 id="design_compat_core_implicit_conversion">Implicit conversions</h4>
<p>Under this proposal, UTF-8 string literals no longer bind to references
to array of type <tt>const char</tt> nor do they implicitly convert to pointer
to <tt>const char</tt>. The following code is currently well-formed, but would
become ill-formed under this proposal:
<fieldset>
<pre><code class="c++">const char (&u8r)[] = u8"text"; // C++17: Ok.
// This proposal: Ill-formed.
const char *u8p = u8"text"; // C++17: Ok.
// This proposal: Ill-formed.
</code></pre>
</fieldset>
</p>
<p>Implementations are encouraged to add options to allow the above
conversions (with a warning) to assist users in migrating their code.
Such options would require allowing aliasing of <tt>char</tt> and
<tt>char8_t</tt>. Note that it may be useful to permit these conversions
only for UTF-8 string literals and not for general expressions of array
of <tt>char8_t</tt> type.</p>
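<p>Where no such option is available, code that must pass UTF-8 literals to an
interface declared in terms of <tt>const char*</tt> can convert explicitly.
The following is a sketch of one possible migration aid, not something this
proposal specifies; <tt>as_char_ptr</tt> and <tt>legacy_length</tt> are
hypothetical names. It is written to compile whether UTF-8 literals have
<tt>char</tt> or <tt>char8_t</tt> element type.</p>

```cpp
#include <cassert>
#include <cstring>

// Hypothetical helper. Accepts a pointer to char (C++17 UTF-8 literals) or
// to char8_t (UTF-8 literals under this proposal).
template<typename CodeUnit>
const char* as_char_ptr(const CodeUnit* s) {
    static_assert(sizeof(CodeUnit) == 1, "expected a one-byte code unit type");
    // Reading the bytes through char is permitted because char may alias
    // objects of any type.
    return reinterpret_cast<const char*>(s);
}

// Stand-in for a legacy interface that only accepts narrow strings.
std::size_t legacy_length(const char* s) {
    return std::strlen(s);
}

void check_as_char_ptr() {
    // U+0123 is encoded in UTF-8 as the two code units 0xC4 0xA3.
    const char* p = as_char_ptr(u8"\u0123");
    assert(legacy_length(p) == 2);
    assert(static_cast<unsigned char>(p[0]) == 0xC4);
    assert(static_cast<unsigned char>(p[1]) == 0xA3);
}
```

<p>The explicit cast documents at each call site that UTF-8 data is being
handed to an interface that does not know its encoding.</p>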
<h4 id="design_compat_core_type_deduction">Type deduction</h4>
<p>Under this proposal, UTF-8 string and character literals have type array of
<tt>const char8_t</tt> and <tt>char8_t</tt> respectively. This affects the
types deduced for placeholder types and template parameter types.
<fieldset>
<pre><code class="c++">template<typename T1, typename T2>
void ft(T1, T2);
ft(u8"text", u8'c'); // C++17: T1 deduced to const char*, T2 deduced to char.
// This proposal: T1 deduced to const char8_t*, T2 deduced to char8_t.
auto u8p = u8"text"; // C++17: Type deduced to const char*.
// This proposal: Type deduced to const char8_t*.
auto u8c = u8'c'; // C++17: Type deduced to char.
// This proposal: Type deduced to char8_t.
</code></pre>
</fieldset>
</p>
<p>This change in behavior is a primary objective of this proposal.
Implementations are encouraged to add options to disable <tt>char8_t</tt>
support entirely when necessary to preserve compatibility with C++17.</p>
<h4 id="design_compat_core_overload_resolution">Overload resolution</h4>
<p>The following code is currently well-formed, and would remain well-formed
under this proposal, but would behave differently:
<fieldset>
<pre><code class="c++">template<typename T> void f(const T*);
void f(const char*);
f(u8"text"); // C++17: Calls f(const char*).
// This proposal: Calls f<char8_t>(const char8_t*).
</code></pre>
</fieldset>
</p>
<p>The following code is currently well-formed, but would become ill-formed
under this proposal:
<fieldset>
<pre><code class="c++">void f(const char*);
f(u8"text"); // C++17: Ok.
// This proposal: Ill-formed; no matching function found.
int operator ""_udl(const char*, size_t);
auto x = u8"text"_udl; // C++17: Ok.
// This proposal: Ill-formed; no matching literal operator found.
</code></pre>
</fieldset>
</p>
<p>These changes in behavior are a primary objective of this proposal.
Implementations are encouraged to add options to disable <tt>char8_t</tt>
support entirely when necessary to preserve compatibility with C++17.</p>
<h4 id="design_compat_core_template_specialization">Template specialization</h4>
<p>The following code is currently well-formed, and would remain well-formed
under this proposal, but would behave differently:
<fieldset>
<pre><code class="c++">template<typename T> struct ct { static constexpr bool value = false; };
template<> struct ct<char> { static constexpr bool value = true; };
template<typename T> bool ft(const T*) { return ct<T>::value; }
ft(u8"text"); // C++17: returns true.
// This proposal: returns false.
</code></pre>
</fieldset>
</p>
<p>This change in behavior is a primary objective of this proposal.
Implementations are encouraged to add options to disable <tt>char8_t</tt>
support entirely when necessary to preserve compatibility with C++17.</p>
<h3 id="design_compat_library">Library backward compatibility</h3>
<h4 id="design_compat_library_u8string">
Return type of <tt>path::u8string</tt> and <tt>path::generic_u8string</tt></h4>
<p>This proposal includes a new specialization of <tt>std::basic_string</tt>
for the new <tt>char8_t</tt> type, a new <tt>std::u8string</tt> type alias,
and changes to the <tt>u8string</tt> and <tt>generic_u8string</tt> member
functions of <tt>filesystem::path</tt> to return <tt>std::u8string</tt>
instead of <tt>std::string</tt>. This change renders ill-formed the following
code that is currently well-formed.
<fieldset>
<pre><code class="c++">void f(std::filesystem::path p) {
std::string s;
s = p.u8string(); // C++17: Ok.
// This proposal: ill-formed.
}
</code></pre>
</fieldset>
</p>
<p>Implementations are encouraged to add an option that allows implicit
conversion of <tt>std::u8string</tt> to <tt>std::string</tt> to assist in
a gradual migration of code that calls these functions.</p>
<h4 id="design_compat_library_literal_operators">
Return type of <tt>operator ""s</tt> and <tt>operator ""sv</tt></h4>
<p>This proposal includes new overloads of <tt>operator ""s</tt> and
<tt>operator ""sv</tt> that return <tt>char8_t</tt> specializations of
<tt>std::basic_string</tt> and <tt>std::basic_string_view</tt> respectively.
This change renders ill-formed the following code that is currently well-formed.
<fieldset>
<pre><code class="c++">std::string s;
s = u8"text"s; // C++17: Ok.
// This proposal: ill-formed.
s = u8"text"sv; // C++17: Ok.
// This proposal: ill-formed.
</code></pre>
</fieldset>
</p>
<p>Implementations are encouraged to add an option that allows implicit
conversion of <tt>std::u8string</tt> to <tt>std::string</tt> to assist in
a gradual migration of code that calls these functions.</p>
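<p>Absent such an option, affected code can copy the code units explicitly.
The following sketch shows one way to do so; <tt>to_narrow_string</tt> is a
hypothetical helper, not part of this proposal, and it compiles whether the
argument is a <tt>std::string</tt> (as in C++17) or a <tt>char8_t</tt> based
string (as under this proposal).</p>

```cpp
#include <cassert>
#include <string>

// Hypothetical migration helper: copies one-byte code units into a
// std::string without changing their values.
template<typename CodeUnit, typename Traits, typename Allocator>
std::string to_narrow_string(
        const std::basic_string<CodeUnit, Traits, Allocator>& s) {
    static_assert(sizeof(CodeUnit) == 1, "expected a one-byte code unit type");
    std::string result;
    result.reserve(s.size());
    for (CodeUnit cu : s)
        result.push_back(static_cast<char>(cu));
    return result;
}

void check_to_narrow_string() {
    using namespace std::string_literals;
    // U+0123 is encoded in UTF-8 as the two code units 0xC4 0xA3.
    const std::string s = to_narrow_string(u8"\u0123"s);
    assert(s.size() == 2);
    assert(static_cast<unsigned char>(s[0]) == 0xC4);
    assert(static_cast<unsigned char>(s[1]) == 0xA3);
}
```

<p>The same helper applies to the result of <tt>path::u8string</tt>, for
example <tt>to_narrow_string(p.u8string())</tt>. The resulting
<tt>std::string</tt> still holds UTF-8 code units; no transcoding to the
execution encoding is performed.</p>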
<h2 id="design_narrow_utf8">
Should UTF-8 literals continue to be referred to as narrow literals?</h2>
<p>UTF-8 literals are maintained as narrow literals in this proposal.</p>
<h2 id="design_char8_t_underlying_type">
What should be the underlying type of char8_t?</h2>
<p>There are several choices for the underlying type of <tt>char8_t</tt>.
Use of <tt>unsigned char</tt> closely aligns with historical use. Use of
<tt>uint_least8_t</tt> would maintain consistency with how the underlying
types of <tt>char16_t</tt> and <tt>char32_t</tt> are specified.</p>
<p>This proposal specifies <tt>unsigned char</tt> as the underlying type as
noted in the changes to § 6.7.1 <tt>[basic.fundamental]</tt> paragraph 5.</p>
<h1 id="implementation_exp">Implementation Experience</h1>
<p>An implementation is available in the <tt>char8_t</tt> branch of a gcc
fork hosted on GitHub at
<a href="https://github.com/tahonermann/gcc/tree/char8_t">
https://github.com/tahonermann/gcc/tree/char8_t</a>. This implementation is
believed to be complete for both the proposed core language and library
features. New <tt>-fchar8_t</tt> and <tt>-fno-char8_t</tt> compiler options
support enabling and disabling the new features. No backward compatibility
features are currently implemented.</p>
<p>Richard Smith implemented support for the proposed core wording changes
for the next release of Clang. The changes are guarded by new
<tt>-fchar8_t</tt> and <tt>-fno-char8_t</tt> options matching the gcc implementation.
No backward compatibility features are currently implemented. Support for
the proposed library features has not yet been implemented in libc++.
Richard's changes can be found at
<a href="http://llvm.org/viewvc/llvm-project?view=revision&revision=331244">
http://llvm.org/viewvc/llvm-project?view=revision&revision=331244</a>
</p>
<h1 id="wording">Formal Wording</h1>
<input type="checkbox" id="hidedel">Hide deleted text</input>
<p>These changes are relative to
<a title="Working Draft, Standard for Programming Language C++"
href="http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2017/n4713.pdf">
N4713</a>
<sup><a title="Working Draft, Standard for Programming Language C++"
href="#ref_n4713">
[N4713]</a></sup></p>
<p>Where noted, these changes presume the adoption of proposal
<a title="char8_t: A type for UTF-8 characters and strings"
href="http://www.open-std.org/jtc1/sc22/wg14/www/docs/n2231.htm">
WG14 N2231</a>
<sup><a title="char8_t: A type for UTF-8 characters and strings"
href="#ref_wg14_n2231">
[WG14 N2231]</a></sup>
into the ISO/IEC 9899 standard for C, and that the next revision of the C++
standard will update dependencies on ISO/IEC 9899 accordingly.</p>
<h2 id="core_wording">Core wording</h2>
<p>Add <tt>char8_t</tt> to the list of keywords in table 5 in 5.11 [lex.key]
paragraph 1. </p>
<p>Change in 5.13.3 [lex.ccon] paragraph 3:
<blockquote>
A character literal that begins with <tt>u8</tt>, such as <tt>u8'w'</tt>, is a
character literal of type <del><tt>char</tt></del><ins><tt>char8_t</tt></ins>,
known as a <em>UTF-8 character literal</em>. […]
</blockquote>
</p>
<p>Change in 5.13.5 [lex.string] paragraph 6:
<blockquote>
After translation phase 6, a <em>string-literal</em> that does not begin with
an <em>encoding-prefix</em> is an <em>ordinary string literal</em><ins>. An
ordinary string literal has type "<em>array</em> of <em>n</em>
<tt>const char</tt>" where <em>n</em> is the size of the string as defined
below, has static storage duration (6.6.4)</ins>, and is initialized with the
given characters.
</blockquote>
</p>
<p>Change in 5.13.5 [lex.string] paragraph 7:
<blockquote>
A <em>string-literal</em> that begins with <tt>u8</tt>, such as
<tt>u8"asdf"</tt>, is a <em>UTF-8 string literal</em><ins>, also referred to as
a <tt>char8_t</tt> string literal. A <tt>char8_t</tt> string literal has type
"<em>array</em> of <em>n</em> <tt>const char8_t</tt>", where <em>n</em> is
the size of the string as defined below; each successive element of the object
representation (6.7) has the value of the corresponding code unit of the UTF-8
encoding of the string.</ins>
</blockquote>
</p>
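<p><em>Illustration (not proposed wording): the effect of the change to
paragraph 7 on the type of a UTF-8 string literal can be sketched as follows.
<tt>code_unit</tt> and <tt>holds_utf8_units</tt> are hypothetical names, and
the <tt>__cpp_char8_t</tt> check is assumed to select between the proposed and
the C++17 element type.</em></p>

```cpp
#include <type_traits>

#if defined(__cpp_char8_t)
using code_unit = char8_t;  // element type under this proposal
#else
using code_unit = char;     // element type in C++17
#endif

// A string literal is an lvalue, so decltype yields a reference to the array.
static_assert(std::is_same<decltype(u8"asdf"), const code_unit (&)[5]>::value,
              "u8\"asdf\" is an array of 5 const code units");

// True if the elements hold the UTF-8 code units of "asdf" followed by the
// terminating null.
constexpr bool holds_utf8_units() {
    return u8"asdf"[0] == 0x61 && u8"asdf"[1] == 0x73 &&
           u8"asdf"[2] == 0x64 && u8"asdf"[3] == 0x66 && u8"asdf"[4] == 0;
}
```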
<p>Change in 5.13.5 [lex.string] paragraph 8:
<blockquote>
Ordinary string literals and UTF-8 string literals are also referred to as
narrow string literals. <del>A narrow string literal has type "<em>array</em>
of <em>n</em> <tt>const char</tt>", where <em>n</em> is the size of the string
as defined below, and has static storage duration (6.6.4).</del>
</blockquote>
</p>
<p><em>Drafting note: The deleted paragraph 8 content was incorporated in the
changes to paragraphs 6 and 7.</em></p>
<p>Remove 5.13.5 [lex.string] paragraph 9:
<blockquote class=stddel>
For a UTF-8 string literal, each successive element of the object
representation (6.7) has the value of the corresponding code unit of the UTF-8
encoding of the string.
</blockquote>
</p>
<p><em>Drafting note: The paragraph 9 content was incorporated in the changes
to paragraph 7.</em></p>
<p>Change in 5.13.5 [lex.string] paragraph 15:
<blockquote>
[…] In a narrow string literal, a <em>universal-character-name</em>
may map to more than one <tt>char</tt> <ins>or <tt>char8_t</tt></ins> element
due to <em>multibyte encoding</em>. […]
</blockquote>
</p>
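<p><em>Illustration (not proposed wording): a single
universal-character-name outside the ASCII range occupies more than one
element of a UTF-8 string literal. In this sketch, <tt>unit_at</tt> is a
hypothetical helper; the cast to <tt>unsigned char</tt> keeps the result the
same whether the element type is <tt>char</tt> or <tt>char8_t</tt>.</em></p>

```cpp
#include <cstddef>

// U+00E9 is encoded in UTF-8 as the two code units C3 A9, so the literal has
// three elements including the terminating null.
static_assert(sizeof(u8"\u00E9") == 3, "two code units plus terminator");

// Hypothetical helper: returns element i of the literal as an unsigned value.
// The caller must pass i < 3.
constexpr unsigned char unit_at(std::size_t i) {
    return static_cast<unsigned char>(u8"\u00E9"[i]);
}
```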
<p>Change in 6.7.1 [basic.fundamental] paragraph 1:
<blockquote>
Objects declared <del>as characters</del><ins>with type </ins>
<del>(</del><tt>char</tt><del>)</del> shall be large enough to store any member
of the implementation’s basic character set. If a character from this set is
stored in a character object, the integral value of that character object is
equal to the value of the single character literal form of that character. It is
implementation-defined whether a <tt>char</tt> object can hold negative values.
Characters <ins>declared with type <tt>char</tt> </ins>can be explicitly
declared <tt>unsigned</tt> or <tt>signed</tt>. Plain <tt>char</tt>,
<tt>signed char</tt>, and <tt>unsigned char</tt> are three distinct types,
collectively called <em><del>narrow</del><ins>ordinary</ins> character
types</em>. <ins>The ordinary character types and <tt>char8_t</tt> are
collectively called <em>narrow character types</em>.</ins> A <tt>char</tt>, a
<tt>signed char</tt>, <del>and </del>an <tt>unsigned char</tt><ins>, and a
<tt>char8_t</tt></ins> occupy the same amount of storage and have the same
alignment requirements (6.6.5); that is, they have the same object
representation. For narrow character types, all bits of the object
representation participate in the value representation. [ <em>Note</em>: A
bit-field of narrow character type whose length is larger than the number of
bits in the object representation of that type has padding bits; see 6.7.
— <em>end note</em> ] For unsigned narrow character types,
each possible bit pattern of the value representation
represents a distinct number. These requirements do not hold for other types.
In any particular implementation, a plain <tt>char</tt> object can
take on either the same values as a <tt>signed char</tt> or an
<tt>unsigned char</tt>; which one is implementation-defined. For each value
<em>i</em> of type <tt>unsigned char</tt><ins> or <tt>char8_t</tt></ins> in the
range 0 to 255 inclusive, there exists a value <em>j</em> of type <tt>char</tt>
such that the result of an integral conversion (7.8) from <em>i</em> to
<tt>char</tt> is <em>j</em>, and the result of an integral conversion from
<em>j</em> to <tt>unsigned char</tt><ins> or <tt>char8_t</tt></ins> is
<em>i</em>.
</blockquote>
</p>
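<p><em>Illustration (not proposed wording): the round-trip guarantee in the
last sentence can be exercised as follows. <tt>code_unit</tt> and
<tt>round_trips_through_char</tt> are hypothetical names; the alias falls back
to <tt>unsigned char</tt> when <tt>__cpp_char8_t</tt> is not defined.</em></p>

```cpp
#if defined(__cpp_char8_t)
using code_unit = char8_t;
#else
using code_unit = unsigned char;
#endif

// Returns true if every value in the range 0 to 255 survives the conversion
// code_unit -> char -> code_unit unchanged.
inline bool round_trips_through_char() {
    for (int v = 0; v <= 255; ++v) {
        const code_unit i = static_cast<code_unit>(v);
        const char j = static_cast<char>(i);
        if (static_cast<code_unit>(j) != i) {
            return false;
        }
    }
    return true;
}
```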
<p>Change in 6.7.1 [basic.fundamental] paragraph 5:
<blockquote>
[…] Type <tt>wchar_t</tt> shall have the same size, signedness, and
alignment requirements (6.6.5) as one of the other integral types, called its
underlying type. <ins>Type <tt>char8_t</tt> denotes a distinct type with the
same size, signedness, and alignment as <tt>unsigned char</tt>, called its
underlying type.</ins> Types <tt>char16_t</tt> and <tt>char32_t</tt> denote
distinct types with the same size, signedness, and alignment as
<tt>uint_least16_t</tt> and <tt>uint_least32_t</tt>, respectively, in
<tt>&lt;cstdint&gt;</tt>, called the underlying types. […]
</blockquote>
</p>
<p>Change in 6.7.1 [basic.fundamental] paragraph 7:
<blockquote>
Types <tt>bool</tt>, <tt>char</tt>, <ins><tt>char8_t</tt>, </ins>
<tt>char16_t</tt>, <tt>char32_t</tt>, <tt>wchar_t</tt>, and the signed and
unsigned integer types are collectively called integral types. […]
</blockquote>
</p>
<p>Change in 6.7.4 [conv.rank] paragraph 1:
<blockquote>
[…]<br/>
(1.8) — The ranks of <ins><tt>char8_t</tt>, </ins><tt>char16_t</tt>,
<tt>char32_t</tt>, and <tt>wchar_t</tt> shall equal the ranks of their
underlying types (6.7.1).
<br/>[…]
</blockquote>
</p>
<p>Change to footnote 64 associated with 8.3 [expr.arith.conv] paragraph 1 (1.5):
<blockquote>
As a consequence, operands of type <tt>bool</tt>, <ins><tt>char8_t</tt>, </ins>
<tt>char16_t</tt>, <tt>char32_t</tt>, <tt>wchar_t</tt>, or an enumerated type
are converted to some integral type.
</blockquote>
</p>
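<p><em>Illustration (not proposed wording): arithmetic on <tt>char8_t</tt>
operands is performed in a promoted integral type. The sketch assumes an
implementation where <tt>int</tt> can represent every <tt>unsigned char</tt>
value, so the promoted type is <tt>int</tt>; <tt>code_unit</tt> and
<tt>add_units</tt> are hypothetical names.</em></p>

```cpp
#include <type_traits>

#if defined(__cpp_char8_t)
using code_unit = char8_t;
#else
using code_unit = unsigned char;
#endif

// Assumption: int can represent every code_unit value on this implementation.
static_assert(std::is_same<decltype(code_unit{} + code_unit{}), int>::value,
              "operands are promoted to int");

// Adds two code units. The return type is the type of the promoted sum, so
// the result is not truncated to 8 bits.
inline auto add_units(code_unit a, code_unit b) -> decltype(a + b) {
    return a + b;
}
```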
<p>Change in 8.5.2.3 [expr.sizeof] paragraph 1:
<blockquote>
[…] <del><tt>sizeof(char)</tt>, <tt>sizeof(signed char)</tt>
and <tt>sizeof(unsigned char)</tt> are 1</del><ins>The result of <tt>sizeof</tt>
applied to any of the narrow character types is 1</ins>. The result of
<tt>sizeof</tt> applied to any other fundamental type is implementation-defined.
[…]
</blockquote>
</p>
<p>Change in 10.1.7.2 [dcl.type.simple] paragraph 1:
<blockquote>
The simple type specifiers are<br/>
<div style="margin-left: 1em;">
<em>simple-type-specifier</em>:<br/>
<div style="margin-left: 1em;">
[…]<br/>
<tt>char</tt><br/>
<ins><tt>char8_t</tt></ins><br/>
<tt>char16_t</tt><br/>
<tt>char32_t</tt><br/>
[…]<br/>
</div>
</div>
</blockquote>
</p>
<p>Change in table 11 of 10.1.7.2 [dcl.type.simple] paragraph 2:
<blockquote>
[…]<br/>
<div style="margin-left: 1em;">
<table>
<tr>
<td align="center">
Table 11 — <em>simple-type-specifiers</em> and the types they specify
</td>
</tr>
<tr>
<td align="center">
<table border="1">
<tr>
<th>Specifier(s)</th>
<th>Type</th>
</tr>
<tr>
<td>[…]</td>
<td>[…]</td>
</tr>
<tr>
<td><tt>char</tt></td>
<td><tt>“char”</tt></td>
</tr>
<tr>
<td><tt>unsigned char</tt></td>
<td><tt>“unsigned char”</tt></td>
</tr>
<tr>
<td><tt>signed char</tt></td>
<td><tt>“signed char”</tt></td>
</tr>
<tr>
<td><ins><tt>char8_t</tt></ins></td>
<td><ins><tt>“char8_t”</tt></ins></td>
</tr>
<tr>
<td><tt>char16_t</tt></td>
<td><tt>“char16_t”</tt></td>
</tr>
<tr>
<td><tt>char32_t</tt></td>
<td><tt>“char32_t”</tt></td>
</tr>
<tr>
<td>[…]</td>
<td>[…]</td>
</tr>
</table>
</td>
</tr>
</table>
</div>
<br/>[…]
</blockquote>
</p>
<p>Change in 11.6 [dcl.init] paragraph 17:
<blockquote>
[…]<br/>
(17.3) — If the destination type is an array of characters, <ins>an
array of <tt>char8_t</tt>, </ins>an array of <tt>char16_t</tt>, an array of
<tt>char32_t</tt>, or an array of <tt>wchar_t</tt>, and the initializer is a
string literal, see 11.6.2.