=comment Oh, this is in Pod6 format. Just so you know.
=begin pod
=TITLE Synopsis 15: Unicode [DRAFT]
=VERSION
=for table
Created 2 December 2013
Last Modified 10 October 2014
Version 8
This document describes how Unicode and Perl 6 work together. Needless to say,
it would be good for your chosen reader to support various Unicode characters.
=head1 String Base Units
A Unicode string can be looked at in any of four ways. It could be seen in terms
of its graphemes, its codepoints, its encoding's code units, or the bytes that
make up the encoding.
For example, consider a string containing the Devanagari syllable नि, which is
composed of the codepoints
U+0928 DEVANAGARI LETTER NA
U+093F DEVANAGARI VOWEL SIGN I
There are a variety of ways in which to perceive the length of this string. For
reference, here is how the syllable gets encoded in each of UTF-8, UTF-16BE, and
UTF-32BE.
=for table
UTF-8 E0 A4 A8 E0 A4 BF
UTF-16BE 0928 093F
UTF-32BE 00000928 0000093F
Depending on whether you count by graphemes, codepoints, code units, or
bytes, the perceived length of the string differs:
=for table
|------------+-------+--------+--------|
| count by | UTF-8 | UTF-16 | UTF-32 |
|============+=======+========+========|
| bytes | 6 | 4 | 8 |
| code units | 6 | 2 | 2 |
| codepoints | 2 | 2 | 2 |
| graphemes | 1 | 1 | 1 |
|------------+-------+--------+--------|
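The byte, code unit, and codepoint rows of this table can be reproduced outside Perl 6. The sketch below uses Python purely as a neutral stand-in; it is not the Perl 6 API this document specifies, and the C<UNIT_SIZE> table and C<base_unit_counts> helper are names invented for the illustration. Graphemes are left out because Python's standard library has no grapheme segmentation.

```python
# Illustration only: Python stand-in, not the Perl 6 API specified here.
SYLLABLE = "\u0928\u093F"  # DEVANAGARI LETTER NA + DEVANAGARI VOWEL SIGN I

# Size in bytes of one code unit for each encoding (hypothetical helper table).
UNIT_SIZE = {"utf-8": 1, "utf-16-be": 2, "utf-32-be": 4}

def base_unit_counts(text, encoding):
    """Return (bytes, code units, codepoints) for text in the given encoding."""
    raw = text.encode(encoding)
    # Python's len() on a str counts codepoints, not graphemes.
    return len(raw), len(raw) // UNIT_SIZE[encoding], len(text)

for encoding in UNIT_SIZE:
    print(encoding, base_unit_counts(SYLLABLE, encoding))
```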
Perl 6 offers various mechanisms to count by each of these "base units" of a
string.
Perl 6 by default operates on graphemes, so counting by graphemes involves:
"string".chars
To count by codepoints, first convert the string to one of NFC, NFD, NFKC, or
NFKD:
"string".NFC.codes
"string".NFKD.codes
To count by code units, you can convert to the appropriate buffer type.
"string".encode("UTF-32LE").elems
"string".encode("utf-8").elems
And finally, counting by bytes simply involves converting that buffer to a
C<buf8> object:
"string".encode("UTF-16BE").buf8.elems
Note that C<utf8> already stores by bytes, so the count for bytes and code units
is always the same.
=head1 Normalization Forms
=head2 NFG
Perl 6, by default, stores all strings given to it in NFG form, Normalization
Form Grapheme. It's a Perl 6–invented character representation, designed to deal
with un-precomposed graphemes properly.
Formally, Perl 6 graphemes are defined exactly according to Unicode Grapheme Cluster Boundaries
at level "extended" (in contrast to "tailored" or "legacy");
see Unicode Standard Annex #29, UNICODE TEXT SEGMENTATION,
L<3 Grapheme Cluster Boundaries|http://www.unicode.org/reports/tr29/tr29-25.html#Grapheme_Cluster_Boundaries>.
This is the same as the Perl 5 character class
\X Match Unicode "eXtended grapheme cluster"
With NFG, strings start by being run through the normal NFC process, compressing
any given character sequences into precomposed characters.
Any graphemes remaining without precomposed characters, such as ậ or नि, are
given their own internal designation to refer to them, at least 32 bits in
length, in such a way that they avoid clashing with any potential future
changes to Unicode. The mapping between these internal designations and
graphemes in this form is not guaranteed constant, even between strings in
the same process.
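As a toy model of the idea (not of any real implementation), the following Python sketch normalizes to NFC and then gives every multi-codepoint cluster a made-up internal number. Two things here are assumptions of the sketch and not statements of this document: clusters are approximated as "one character plus any following marks", which is much cruder than UAX #29, and the internal numbers are negative integers, chosen only because they can never collide with a codepoint. C<simple_clusters> and C<SyntheticTable> are hypothetical names.

```python
# Illustration only: a toy model of the NFG idea, not an implementation of it.
import unicodedata

def simple_clusters(text):
    """Split NFC text into base + trailing marks (a rough stand-in for UAX #29)."""
    clusters = []
    for ch in unicodedata.normalize("NFC", text):
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch      # attach the mark to the previous cluster
        else:
            clusters.append(ch)     # start a new cluster
    return clusters

class SyntheticTable:
    """Hands out made-up numbers for clusters lacking a precomposed form."""

    def __init__(self):
        self.ids = {}

    def encode(self, text):
        units = []
        for cluster in simple_clusters(text):
            if len(cluster) == 1:
                units.append(ord(cluster))   # ordinary codepoint
            else:
                # Assumption: negative values, since they can never be codepoints.
                units.append(self.ids.setdefault(cluster, -(len(self.ids) + 1)))
        return units

table = SyntheticTable()
# DEVANAGARI NA + VOWEL SIGN I, then a decomposed e-acute, then NA + I again
units = table.encode("\u0928\u093Fe\u0301\u0928\u093F")
print(units, len(units))
```

In this model the length of C<units> plays the role of C<.chars>: one entry per grapheme, whether or not a precomposed character exists for it.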
The Perl 6 C<Str> type, and more generally the C<Stringy> role, deals
exclusively in NFG form.
=head2 NFC and NFD
The NFC and NFD normalization forms are a defined part of the Unicode
standard. NFD takes precomposed characters and separates them into their
constituent parts, with a specific ordering of those pieces. NFC tries to
replace character sequences with single precomposed characters whenever
possible, after first running them through the NFD process.
These two Normalization Forms are similar to NFG, except that graphemes without
precomposed versions exist as multiple codepoints.
NFC is the form Perl 6 uses whenever NFG is not viable, such as printing the
string to stdout or passing it to a C<{ use v5; }> section.
=head2 NFKC and NFKD
These forms are known as compatibility forms (denoted by a K to avoid confusion
with C for Composition). They are similar to their canonical counterparts, but
may transform various characters (such as ﬁ or ſ) into plainer equivalents (here
C<fi> and C<s>) that software tends to handle better.
All four of NFC, NFD, NFKC, and NFKD can be considered valid "codepoint views",
though each differs in its exact formulation of the contents of a string:
say "ẛ̣".codes; # OUTPUT: 2 (NFG, ẛ̣)
say "ẛ̣".NFC.codes; # OUTPUT: 2 (NFC, ẛ + ̣)
say "ẛ̣".NFD.codes; # OUTPUT: 3 (NFD, ſ + ̣+ ̇)
say "ẛ̣".NFKC.codes; # OUTPUT: 1 (NFKC, ṩ)
say "ẛ̣".NFKD.codes; # OUTPUT: 3 (NFKD, s + ̣+ ̇)
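The four normalization-form counts can be cross-checked with any Unicode-aware toolkit. Here is a minimal Python sketch using the standard C<unicodedata> module; Python is only a stand-in and the variable names are invented for the example.

```python
# Illustration only: Python stand-in, not the Perl 6 API specified here.
import unicodedata

# LATIN SMALL LETTER LONG S WITH DOT ABOVE + COMBINING DOT BELOW
TEXT = "\u1E9B\u0323"

# Number of codepoints the same text occupies in each normalization form.
codepoint_counts = {
    form: len(unicodedata.normalize(form, TEXT))
    for form in ("NFC", "NFD", "NFKC", "NFKD")
}
print(codepoint_counts)
```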
Those who wish to operate with strings on the codepoint level may wish to use
NFC, as it is the least different from NFG, as well as Perl 6's default form for
NFG-less contexts.
All of C<Uni>, C<NFC>, C<NFD>, C<NFKC>, and C<NFKD>, and more generally the
C<Unicodey> role, deal with the various codepoint-based compositions.
=head1 The C<Str> Type
Presented are the variety of methods of C<Str> which are related to
Unicode. C<Str> deals exclusively in the NFG form of Unicode strings.
=head2 String to Numeral Conversion
Str.ord
Str.ords
ord(Str $string)
ords(Str $string)
These give you the numeric values of the B<base character> of graphemes in a
string. C<ord> only works on the first grapheme, while C<ords> works on every
grapheme.
=head2 Length Methods
Str.chars
This method counts the number of graphemes in your string.
[Should there be methods that implicitly convert to the other string types, or
would .NFKD.chars be necessary?]
=head2 Buf conversion
Str.encode($enc = "utf-8")
Encodes the contents of the string by the specified encoding (by default UTF-8)
and generates the appropriate C<blob>.
Note that if you convert to one of the UTFs, you'll get a UTF-aware version of
the C<blob>. (Non-Unicode encodings will go for the most appropriate C<blob>
type.)
UTF-16 and UTF-32 default to big endian if you don't specify endianness.
Str.encode --> utf8
Str.encode("UTF-16") --> utf16 (big endian)
Str.encode("UTF-32LE") --> utf32 (little endian)
Str.encode("ASCII") --> blob8
=head1 The C<NF*> Types
Perl 6 has four types, each corresponding to a specific Unicode Standard Normalization
Form: C<NFC>, C<NFD>, C<NFKC>, and C<NFKD>.
Each of these types performs normalization on the strings stored in it.
The C<NF*> types do the C<Unicodey> role.
=head1 The C<Uni> Type
The C<Uni> type is like the various C<NF*> types, but allows a mixed collection
of normalization forms to make up the string.
The C<Uni> type does the C<Unicodey> role.
=head1 The C<Unicodey> Role
The C<Unicodey> role deals in various Unicode-aware functions.
=head2 Length Methods
Unicodey.chars
Unicodey.codes
The two are synonymous: each counts the number of codepoints in a C<Unicodey> type.
[Maybe C<Unicodey does Stringy> ?]
=head1 The C<Stringy> Role
The C<Stringy> role deals with a more general, not necessarily Unicode-based
view of strings. C<Str> uses this because it doesn't always play by the Unicode
Standard's rules (most notably the use of NFG).
=head1 C<Buf> methods
=head2 Decoding buffers
Buf.decode($dec = "utf-8");
Transforms the buffer into a C<Str>. Defaults to assuming a "utf-8"
encoding. Encoding-aware buffers have a different default decoding, for
instance:
utf8.decode($dec = "utf-8");
utf16.decode($dec = "utf-16be");
utf32.decode($dec = "utf-32be");
[It would be best if utf16 and utf32 chose their default between BE and LE at
creation, either from what Str.encode said or, if the utf16/32 was
manually created, by analyzing the BOM, if any. Just know that Unicode itself
defaults to BE if nothing else.]
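The BOM-based choice conjectured in the note could look roughly like the Python sketch below. It is only an illustration of the rule "use the BOM if there is one, otherwise assume big endian"; C<decode_utf16> is a hypothetical helper, not part of any Perl 6 buffer type.

```python
# Illustration only: Python stand-in for the BOM rule, not a Perl 6 API.
import codecs

def decode_utf16(raw):
    """Decode UTF-16 bytes, using a BOM if present and big endian otherwise."""
    if raw.startswith(codecs.BOM_UTF16_LE):
        return raw[2:].decode("utf-16-le")
    if raw.startswith(codecs.BOM_UTF16_BE):
        return raw[2:].decode("utf-16-be")
    return raw.decode("utf-16-be")   # no BOM: fall back to big endian

no_bom = decode_utf16(b"\x09\x28\x09\x3F")
le_bom = decode_utf16(b"\xFF\xFE\x28\x09\x3F\x09")
print(no_bom == le_bom)
```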
=head1 String Type Conversions
If you desire to have a string in one of the other Normalization Forms, there
are various conversion methods to do this.
Cool.Str
Cool.NFG [ XXX this is purely a synonym to .Str. Necessary? ]
Cool.NFC
Cool.NFD
Cool.NFKC
Cool.NFKD
Cool.Uni
Notably, conversion to the C<Uni> type will assume NFC for either NFG strings or
non-strings being converted to this string-like type. Otherwise it's a
transposition of the string without changes in normalization.
=head1 Unicode Information
There's plenty of information each Unicode codepoint possesses, and Perl 6
provides various ways of accessing that information.
Unless plural forms of these functions are provided, each function operates only
on the first codepoint of the string. Various array-based operations would be
needed to gain information on every character in the string.
[Note: The properties of graphemes are not defined by Unicode. Inheriting
the properties of the first NFC codepoint in the grapheme should make sense
in most cases, but not for e.g. charnames. Or are they not supported at NFG?]
[Note: If adding additional methods to access Unicode information, priority
should be placed on info that can't be accessed as a Unicode property.]
=head2 Property Lookup
uniprop(Int $codepoint, Stringy $property)
Int.uniprop(Str $property)
uniprop(Unicodey $char, Stringy $property)
Unicodey.uniprop(Stringy $property)
uniprops(Unicodey $str, Stringy $property)
Unicodey.uniprops(Stringy $property)
This function returns the value of C<$property> for the given C<$codepoint> or
C<$char>, or an array of values of the property of each character in C<$str> .
All official spellings of a property name are supported.
uniprops("a", "ASCII_Hex_Digit") # is this character an ASCII hex digit?
uniprops("a", "AHex") # ditto
Numeric property values are returned as the narrowest possible numeric type (at
widest a C<Rat>); other values are returned as C<Str> objects.
Boolean properties are returned as a C<Bool>, either C<True> or C<False>.
Note there is no version of C<uniprops> for integers, while there is one for
strings. To achieve the same thing, use normal array operations:
my @isws = (32,42,43)».uniprop("White_Space");
Note that the integer-based lookup is the fundamental version; the string-based
versions are convenience functions. These two are nearly equivalent:
uniprop("0".ord, "Numeric_Value"); # integer lookup
uniprop("0", "Numeric_Value"); # stringy lookup
However, the string-based version will convert NFG strings to NFC before sending
either the first or all characters through the lookup. This is because Unicode
property lookup is considered an NFG-less environment (see
L<NFC and NFD|#NFC and NFD>).
Integer-based lookup should die on negative integers, or integers greater than
C<0x10_FFFF>.
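The layering described here (an integer lookup that rejects out-of-range values, with a string wrapper that normalizes to NFC first) can be sketched as follows. Python and its C<unicodedata.category> call stand in for the property database; only General_Category is looked up, and both function names are hypothetical.

```python
# Illustration only: Python stand-in, not the uniprop API specified here.
import unicodedata

MAX_CODEPOINT = 0x10FFFF

def category_of_codepoint(codepoint):
    """Integer lookup, the fundamental version; rejects out-of-range values."""
    if not 0 <= codepoint <= MAX_CODEPOINT:
        raise ValueError(f"not a Unicode codepoint: {codepoint}")
    return unicodedata.category(chr(codepoint))

def category_of_string(text):
    """String lookup: normalize to NFC, then defer to the integer version."""
    first = unicodedata.normalize("NFC", text)[0]
    return category_of_codepoint(ord(first))

print(category_of_codepoint(0x30), category_of_string("e\u0301"))
```

The string wrapper in this sketch assumes a non-empty string; an empty one would need its own error handling.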
[Conjecture: would versions of uniprop with a slurpy instead of a single string
property be useful? Or is C<uniprop(0x20, $_) for @props> good enough?]
=head3 Binary Property Lookup
unibool(Int $codepoint, Stringy $property)
Int.unibool(Str $property)
unibool(Unicodey $char, Stringy $property)
Unicodey.unibool(Stringy $property)
unibools(Unicodey $str, Stringy $property)
Unicodey.unibools(Stringy $property)
Looks up a boolean Unicode property (such as C<Case_Ignorable>) and returns a
boolean. Throws an error on non-boolean properties.
unibool(0x41, "Case_Ignorable"); # OK
unibool(0x41, "General_Category"); # dies
As with C<uniprop>, the string version converts NFG strings to NFC, but
otherwise is equivalent to feeding the result of C<.ord> through the base
integer version.
=head3 Binary Category Check
unimatch(Int $codepoint, Stringy $category)
Int.unimatch(Str $category)
unimatch(Unicodey $char, Stringy $category)
Unicodey.unimatch(Stringy $category)
unimatches(Unicodey $str, Stringy $category)
Unicodey.unimatches(Stringy $category)
Checks to see if the character(s) given are in the given C<$category>. The
string-based versions are conveniences that convert any NFG input to NFC, and
then pass it along to the integer version.
unimatch("A", "Lu"); # True
unimatch("A", "L"); # True
unimatch("A", "Sc"); # False
An error may be issued if the given category name is not valid.
=head2 Numeric Codepoint
ord(Stringy $char) --> Int
ords(Stringy $string) --> Array[Int]
Stringy.ord() --> Int
Stringy.ords() --> Array[Int]
The C<&ord> function (and corresponding C<Stringy.ord> method) returns the
codepoint number of the base character of the first grapheme of the string.
The C<&ords> function and method return an C<Array> of codepoint numbers
of the base character for every grapheme in the string.
This works on any type that does the C<Stringy> role.
=head2 Character Representation
chr(Int $codepoint) --> Uni
chrs(Array[Int] @codepoints) --> Uni
Cool.chr() --> Uni
Cool.chrs() --> Uni
Converts one or more numbers into a series of characters, treating those numbers
as Unicode codepoints. The C<chrs> version generates a multi-character string
from the given array.
Note that this operates on encoding-independent codepoints (use C<Buf> types for
encoded codepoints).
An error will occur if the C<Uni> generated by these functions contains an
invalid character or sequence of characters. This includes, but is not limited
to, codepoint values greater than C<0x10FFFF> and parts of surrogate code pairs.
To obtain a more definitive string type, the normal ways of type conversion may
be used.
=head2 Character Name
uniname(Str $char, :$one = False, :$either = False) --> Str
uninames(Str $char, :$one = False, :$either = False) --> Array[Str]
Str.uniname(:$one = False, :$either = False) --> Str
Str.uninames(:$one = False, :$either = False) --> Array[Str]
The C<&uniname> function returns the Unicode name associated with the first
codepoint of the string. C<&uninames> returns an array of names, one per
codepoint.
By default, C<uniname> tries to find the Unicode name associated with that
character, falling back to a code point label when there is none (see
L<UAX#44|http://www.unicode.org/reports/tr44/tr44-12.html#Code_Point_Labels> and
section 4.8 of the Standard). This is nearly identical to accessing the C<Name>
property from the C<uniprops> list, except that the list holds an empty string
for undefined names.
uninames("A\x[00]¶\x[2028,80]")
# results in:
"LATIN CAPITAL LETTER A",
"<control-0000>",
"PILCROW SIGN",
"LINE SEPARATOR",
"<control-0080>"
The C<:one> adverb instead tries to find the Unicode 1.0 name associated with
the character (this would most often be useful with getting a proper name for
control codes). If there is no Unicode 1.0 name associated with the character, a
code point label is returned. This is similar to the C<Unicode_1_Name> property
of the C<uniprops> list, except that the list holds an empty string for
undefined Unicode 1.0 names.
uninames("A\x[00]¶\x[2028,80]", :one)
# results in:
"<graphic-0041>",
"NULL",
"PARAGRAPH SIGN",
"<format-2028>",
"<control-0080>"
The C<:either> adverb will try to first obtain a Unicode name for the
character. Failing that, it will try to instead obtain the Unicode 1.0 name. If
the character has neither name property defined, a code point label is returned.
uninames("A\x[00]¶\x[2028,80]", :either)
# results in:
"LATIN CAPITAL LETTER A",
"NULL",
"PILCROW SIGN",
"LINE SEPARATOR",
"<control-0080>"
The use of C<:either> and C<:one> together will prefer Unicode 1.0 names over
newer Unicode names, but otherwise function identically to C<:either>.
uninames("A\x[00]¶\x[2028,80]", :either :one)
# results in:
"LATIN CAPITAL LETTER A",
"NULL",
"PARAGRAPH SIGN",
"LINE SEPARATOR",
"<control-0080>"
In the case of graphical or formatting characters without a Unicode 1.0 name,
the use of the C<:one> adverb by itself will return a I<non-standard> codepoint
label of either of the following:
<graphic-XXXX>
<format-XXXX>
Note that the use of C<:either> and C<:one> together will not use these
non-standard labels, as every graphic and format character has a current Unicode
name.
The definition of "graphic" and "format" characters is covered in Section 2.4,
Table 2-3 of the current Unicode Standard.
This command does not deal with name aliases; get the C<Name_Alias> property
from C<uniprop>.
If a strict adherence to the values in those properties is desired (i.e. return
null strings instead of code-point labels), the C<Name> and C<Unicode_1_Name>
properties may be used.
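The default behaviour of this section (the current Unicode name, else a code point label) can be approximated in Python as a rough sketch. Python's C<unicodedata> has no Unicode 1.0 names, so the C<:one> and C<:either> variants are not covered; C<name_or_label> is a hypothetical helper, and the label kinds follow the code point label types of the Unicode Standard.

```python
# Illustration only: Python stand-in for the default (no adverb) behaviour.
import unicodedata

def name_or_label(ch):
    """Return the Unicode name of ch, or a code point label if it has none."""
    name = unicodedata.name(ch, "")
    if name:
        return name
    cp = ord(ch)
    if 0xFDD0 <= cp <= 0xFDEF or (cp & 0xFFFE) == 0xFFFE:
        kind = "noncharacter"
    else:
        kinds = {"Cc": "control", "Co": "private-use", "Cs": "surrogate"}
        kind = kinds.get(unicodedata.category(ch), "reserved")
    return f"<{kind}-{cp:04X}>"

# Same five characters as the uninames examples: A, U+0000, U+00B6, U+2028, U+0080
names = [name_or_label(ch) for ch in "A\x00\u00B6\u2028\x80"]
print(names)
```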
=head2 Numeric Value
unival(Int $codepoint)
Int.unival
unival(Unicodey $char)
Unicodey.unival
univals(Unicodey $str)
Unicodey.univals
Returns a C<Rat> (or C<Int> if the denominator is 1) of the given character's
numeric value. Returns C<NaN> if the character is not a number.
say unival("0"); # output: 0
say unival("½"); # output: .5
say unival("."); # output: NaN
say univals("½¾"); # output: .5 .75 (array of Rats and/or Ints)
Note that this will not convert a multi-digit string into one numeral; use the
normal string-to-numeral coercers for that.
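A rough Python analogue of this behaviour is sketched below. C<unicodedata.numeric> returns a float, so the sketch recovers a fraction with C<limit_denominator(1000)>; that bound is an assumption of the example, not something taken from the Unicode data files. The function names simply mirror the ones in this section and are not the Perl 6 implementations.

```python
# Illustration only: Python stand-in, not the unival API specified here.
import math
import unicodedata
from fractions import Fraction

def unival(ch):
    """Numeric value of one character: int, Fraction, or NaN if not numeric."""
    try:
        value = unicodedata.numeric(ch)
    except ValueError:
        return math.nan
    # Assumption: denominators of Unicode numeric values stay well below 1000.
    frac = Fraction(value).limit_denominator(1000)
    return frac.numerator if frac.denominator == 1 else frac

def univals(text):
    """Numeric value of every character in text."""
    return [unival(ch) for ch in text]

print(unival("0"), unival("\u00BD"), univals("\u00BD\u00BE"))
```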
[Conjecture: should C<val()> use C<unival> on one-character strings as part of
its allomorphic type process? E.g. K<./fractionmagic.p6 ¾> takes the one
positional argument as a C<RatStr>.]
=head1 Regexes
By default regexes operate on the grapheme (NFG) level, regardless of how the
string itself is stored.
The following is a list of adverbs that change how regexes view strings:
:i Ignore case (a ~~ A)
:m Ignore marks (ä ~~ a)
:nfg Match against string as NFG (default)
:nfc String as NFC
:nfd String as NFD
:nfkc String as NFKC
:nfkd String as NFKD
There's of course the syntax for accessing Unicode properties inside a regex:
<:Letter>
<:East_Asian_Width<Narrow>>
(For example, if you needed to collect combining mark usage (e.g. for
language-guessing purposes):
$string ~~ /:nfd [<:Letter> (<:Mark>*)]+/
would get that info for you.)
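The same bookkeeping can be sketched without a regex engine: decompose to NFD, then pair each letter with the marks that directly follow it. The Python below is only an illustration; C<marks_per_letter> is a hypothetical helper, and unlike the regex above it scans the whole string instead of a single contiguous run.

```python
# Illustration only: Python stand-in for the :nfd regex shown above.
import unicodedata

def marks_per_letter(text):
    """Pair every letter of the NFD form with the marks that follow it."""
    pairs = []
    attach = False   # True while the previous character was a letter or its mark
    for ch in unicodedata.normalize("NFD", text):
        category = unicodedata.category(ch)
        if category.startswith("L"):
            pairs.append((ch, []))
            attach = True
        elif category.startswith("M") and attach:
            pairs[-1][1].append(ch)
        else:
            attach = False
    return [(base, "".join(marks)) for base, marks in pairs]

# "Vi" + LATIN SMALL LETTER E WITH CIRCUMFLEX AND DOT BELOW + "t"
print(marks_per_letter("Vi\u1EC7t"))
```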
C</./> always matches one "character" in the current view, in other words one
element of C<"string being matched".ords>.
=head2 Grapheme Explosion
To match one specific character under different rules, you may use one of the
C«</ />» rules.
<D/ /> Work on next character in NFD mode
<C/ /> NFC mode
<KD/ /> NFKD mode
<KC/ /> NFKC mode
<G/ /> NFG mode
</ /> NFD mode
This construct was primarily invented to allow you to deal with combining
characters (matches C«<:Mark>») on single graphemes. This is why C«</ />» is
used as a synonym to C«<D/ />».
The forms with letters may use any brackets, similar to how C<m/ /> and C</ />
relate to each other.
</ /> Explodes grapheme
<⦃ ⦄> Doesn't explode grapheme
<D/ /> Explodes grapheme
<D⦃ ⦄> Explodes grapheme
So to collect base characters and combining marks in one section of what you're
parsing, you could define such a regex as:
$string ~~ / </ $<base>=<:Letter> $<marks>=[<:Mark>+] /> /
Note that each of these exploders becomes useless when its counterpart adverb
is used beforehand.
And yes, some of these forms do the opposite of exploding; the imagery of
radically changing things in a localized area still applies C<:)>.
=head1 Quoting Constructs
By default, all quoting forms create C<Str> objects:
"interpolating $string"
'non-interpolating'
Q[Base form]
Various adverbs may be used to generate non-NFG literals:
Q:nfd/NFD literal/
qq:nfc:to/heredocIsNFC/
qx:nfkd/useful for commands on less capable terminals perhaps/
The typical C<:nf> adverbs are in use here.
:nfg Str literal (default)
:nfd NFD literal
:nfc NFC literal
:nfkd NFKD literal
:nfkc NFKC literal
=head1 Unicode Literals
=head2 Identifiers
Identifiers in Perl 6 can start with any alphabetic character (those characters
in the C<L> category, as well as underscore), followed by any number of
alphanumeric characters (those characters in the C<L> or C<N> category, as well
as underscore). Dashes (C<->) and apostrophes (C<'>) may also appear, provided
that they are followed by an alphabetic character.
my $foo; # OK
my $ｆｏｏ; # also OK
my $১০kinds; # not ok (১০ are digits)
Combining marks (characters in the C<M> category) may not be the first character
in an identifier, but they may appear at any time afterwards.
Perl 6 internally stores all identifiers in NFG form, so these two lines create
the same variable (and throw a redeclaration warning if used like this in code):
my $ä; # precomposed
my $ä; # not precomposed
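The identifier rules of this section are small enough to write down as a checker. The Python sketch below encodes only what is stated above (first character, following characters, dash and apostrophe, combining marks); it is not the Perl 6 grammar, and all function names are hypothetical. The last line shows the normalization point: both spellings of the a-umlaut identifier become equal once normalized, with NFC standing in for NFG here.

```python
# Illustration only: Python sketch of the stated rules, not the Perl 6 grammar.
import unicodedata

def is_alpha(ch):
    return ch == "_" or unicodedata.category(ch).startswith("L")

def is_alnum(ch):
    return ch == "_" or unicodedata.category(ch)[0] in "LN"

def is_mark(ch):
    return unicodedata.category(ch).startswith("M")

def is_identifier(name):
    """Check a name (without sigil) against the rules described above."""
    if not name or not is_alpha(name[0]):
        return False
    for i in range(1, len(name)):
        ch = name[i]
        if ch in "-'":
            # A dash or apostrophe must be followed by an alphabetic character.
            if i + 1 >= len(name) or not is_alpha(name[i + 1]):
                return False
        elif not (is_alnum(ch) or is_mark(ch)):
            return False
    return True

same_variable = unicodedata.normalize("NFC", "a\u0308") == "\u00E4"
print(is_identifier("foo"), is_identifier("\u09E7\u09E6kinds"), same_variable)
```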
=head2 Numbers
Similar to its support for any kind of Unicode in identifiers, Perl 6 allows any
kind of character within the category C<Nd> (Decimal Number) for decimal
numbers:
say 42 + ٤٢; # ascii digits + arabic indic
say ᱄2; # lepcha & ascii
say ⑨; # not ok (category No, not Nd)
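The "any C<Nd> digit" rule for decimal literals can be illustrated with Python's C<unicodedata.decimal>, which yields a value only for decimal digits and raises C<ValueError> for anything else, including C<No> characters such as ⑨. This is a sketch of the rule and not of the Perl 6 parser; C<parse_decimal> is a hypothetical name.

```python
# Illustration only: Python stand-in for the decimal-literal rule.
import unicodedata

def parse_decimal(literal):
    """Read a decimal literal made of any mix of Unicode decimal digits."""
    value = 0
    for ch in literal:
        # unicodedata.decimal raises ValueError for non-decimal characters.
        value = value * 10 + unicodedata.decimal(ch)
    return value

# ASCII 42 plus ARABIC-INDIC DIGIT FOUR, ARABIC-INDIC DIGIT TWO
print(parse_decimal("42") + parse_decimal("\u0664\u0662"))
# LEPCHA DIGIT FOUR followed by ASCII 2
print(parse_decimal("\u1C442"))
```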
For hexadecimal digits, any character with C<Hex_Digit = yes> is allowed:
say 0xCAFE; # OK
say 0xｃａｆｅ; # OK too
The C<:radix[]> form of specifying numbers can accept strings following this
same rule. The following sets of characters specify the digits C<10..35>, as
each set contains the characters whose C<Hex_Digit> property is true:
a b c d e f g h i j k l m n o p q r s t u v w x y z
A B C D E F G H I J K L M N O P Q R S T U V W X Y Z
ａ ｂ ｃ ｄ ｅ ｆ ｇ ｈ ｉ ｊ ｋ ｌ ｍ ｎ ｏ ｐ ｑ ｒ ｓ ｔ ｕ ｖ ｗ ｘ ｙ ｚ
Ａ Ｂ Ｃ Ｄ Ｅ Ｆ Ｇ Ｈ Ｉ Ｊ Ｋ Ｌ Ｍ Ｎ Ｏ Ｐ Ｑ Ｒ Ｓ Ｔ Ｕ Ｖ Ｗ Ｘ Ｙ Ｚ
For radices greater than 36, you must use literal numbers (see L<doc:S02/General
radices> for details).
=head1 Pragmas
[ the NF* pragmas have been removed, as they no longer are attributes of a Str
object, and there's no sane way to set a default string-like type in a clean
fashion. ]
use encoding :utf8;
use encoding :utf16<be> or :utf16<le>;
use encoding :utf32<be> or :utf32<le>;
The C<encoding> pragma changes the default encoding for situations where that's
necessary, e.g. for the default C<Str.encode> or C<Buf.decode>. C<Str>s
themselves don't store encoding information.
=for code :allow<R>
use unicode :v(R<version>);
Specifies what version of Unicode you want to use. The C<$*UNICODE> variable
will tell you what version of Unicode is currently in use. This is useful if you
need to work on data created for much older Unicode versions, or if you're doing
work with properties known to be highly volatile between versions.
Pragmas are of course localize-able:
my $first = "hello"; # NFG string
{
use unicode :v(3.2);
⋮
}
my $buffer = "foobar".encode; # object of type utf8 in $buffer
{
use encoding :utf32<le>;
$buffer = "foobar".encode; # object of type utf32 in $buffer
}
say $buffer.WHAT # output: (utf32)
=head1 Final Considerations
The C<Stringy> and C<Unicodey> roles need some expansion, definitely. Keep in
mind that the C<Uni> type is supposed to accept any of the C<NFC>, C<NFD>,
C<NFKC>, and C<NFKD> contents without normalizing.
The inclusion of ropey types will most directly impact C<Uni>.
Operators between various string types need defining. The general rule should be
"most specialized type wins" for the return value.
NFD ~ NFD --> NFD
NFC ~ NFKD --> Uni
(UAX#15 says concat of mismatched NFs results in a non-NF string, which
is our Uni type.)
[Note: concatenation and similar operations forming one string from parts can lead to non-NF,
even between NFC ~ NFC or NFG ~ NFG, if the second string begins with one (or more) "orphan"
combining characters.]
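The point of the note is easy to demonstrate: two strings that are each in NFC can concatenate into one that is not. A minimal Python sketch, with C<is_nfc> as a hypothetical helper:

```python
# Illustration only: Python stand-in showing that NFC is not closed under concatenation.
import unicodedata

def is_nfc(text):
    """True if text is unchanged by NFC normalization."""
    return unicodedata.normalize("NFC", text) == text

left = "e"
right = "\u0301"          # an "orphan" COMBINING ACUTE ACCENT
joined = left + right     # plain concatenation, no renormalization

print(is_nfc(left), is_nfc(right), is_nfc(joined))   # True True False
```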
Regexes likely need more work, though I don't see anything immediate.
Some easy way to change how Perl 6 handles language specific weirdness, possibly
through another type (C<Rope>? C<Twine>? C<Yarn>?). A very small selection of
those weirdnesses:
=item Turkish dotted and dotless i's (I ı and İ i), which follow non-standard
casing.
=item Those who realize the superiority of a capital ẞ and would rather ß not be
capitalized to SS C<:D>.
How would a hypothetical C<EBCDIC> string type be implemented by some module
writer?
Other areas to consider, surely.
(This spec should not be moved to a status more official than DRAFT status until
this Final Considerations section disappears.)
=AUTHOR
=for table
Matthew N. "lue"
Helmut Wollmersdorfer
=ACKNOWLEDGEMENT
Thanks to TimToady and the rest here:
L<http://irclog.perlgeek.de/perl6/2013-12-02#i_7942599> for answering my
questions and inadvertently steering this document in a far different direction
than I would've taken it otherwise.
=comment vim: expandtab sw=4
=end pod