|
=head1 NAME
|
|||
|
|
|||
|
perlebcdic - Considerations for running Perl on EBCDIC platforms
|
|||
|
|
|||
|
=head1 DESCRIPTION
|
|||
|
|
|||
|
An exploration of some of the issues facing Perl programmers
on EBCDIC-based computers.  We do not cover localization,
internationalization, or multi-byte character set issues (yet).
|
|||
|
|
|||
|
Portions that are still incomplete are marked with XXX.
|
|||
|
|
|||
|
=head1 COMMON CHARACTER CODE SETS
|
|||
|
|
|||
|
=head2 ASCII
|
|||
|
|
|||
|
The American Standard Code for Information Interchange is a set of
integers running from 0 to 127 (decimal) that computers, displays, and
other systems interpret as characters.  The range 0..127 can be covered
by a 7-bit binary number, hence the set is sometimes referred to as
"7-bit ASCII".  ASCII was described by the American National Standards
Institute document ANSI X3.4-1986.  It was also described by ISO 646:1991
(with localization for currency symbols).  The full ASCII set is
given in the table below as the first 128 elements.  Languages that
can be written adequately with the characters in ASCII include
English, Hawaiian, Indonesian, Swahili and some Native American
languages.
|
|||
|
|
|||
|
There are many character sets that extend the range of integers
from 0..2**7-1 up to 0..2**8-1, so that each character fills one 8-bit
byte (octet if you prefer).  One common one is the ISO 8859-1
character set.
|
|||
|
|
|||
|
=head2 ISO 8859
|
|||
|
|
|||
|
The ISO 8859-$n are a collection of character code sets from the
International Organization for Standardization (ISO), each of which
adds characters to the ASCII set that are typically found in European
languages, many of which are based on the Roman, or Latin, alphabet.
|
|||
|
|
|||
|
=head2 Latin 1 (ISO 8859-1)
|
|||
|
|
|||
|
A particular 8-bit extension to ASCII that includes grave and acute
|
|||
|
accented Latin characters. Languages that can employ ISO 8859-1
|
|||
|
include all the languages covered by ASCII as well as Afrikaans,
|
|||
|
Albanian, Basque, Catalan, Danish, Faroese, Finnish, Norwegian,
|
|||
|
Portuguese, Spanish, and Swedish. Dutch is covered albeit without
|
|||
|
the ij ligature. French is covered too but without the oe ligature.
|
|||
|
German can use ISO 8859-1 but must do so without German-style
|
|||
|
quotation marks. This set is based on Western European extensions
|
|||
|
to ASCII and is commonly encountered in world wide web work.
|
|||
|
In IBM character code set identification terminology ISO 8859-1 is
|
|||
|
also known as CCSID 819 (or sometimes 0819 or even 00819).
|
|||
|
|
|||
|
=head2 EBCDIC
|
|||
|
|
|||
|
The Extended Binary Coded Decimal Interchange Code refers to a
large collection of slightly different single- and multi-byte
coded character sets that are different from ASCII or ISO 8859-1
and are typically used on host computers.  The EBCDIC encodings derive
from 8-bit byte extensions of Hollerith punched card encodings.
The layout on the cards was such that high bits were set for the
upper and lower case alphabet characters [a-z] and [A-Z], but there
were gaps within each Latin alphabet range.
|
|||
|
|
|||
|
Some IBM EBCDIC character sets may be known by character code set
|
|||
|
identification numbers (CCSID numbers) or code page numbers. Leading
|
|||
|
zero digits in CCSID numbers within this document are insignificant.
|
|||
|
E.g. CCSID 0037 may be referred to as 37 in places.
|
|||
|
|
|||
|
=head2 13 variant characters
|
|||
|
|
|||
|
Among IBM EBCDIC character code sets there are 13 characters that
|
|||
|
are often mapped to different integer values. Those characters
|
|||
|
are known as the 13 "variant" characters and are:
|
|||
|
|
|||
|
\ [ ] { } ^ ~ ! # | $ @ `
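
As a minimal sketch, the code points of the 13 variant characters can be collected from the single octet table in this document and compared across the three EBCDIC sets it covers.  The names C<%variant> and C<differing_variants> are illustrative only and do not come from any module:

```perl
use strict;
use warnings;

# Code points (decimal) in CCSID 0037, CCSID 1047 and POSIX-BC order,
# copied from the single octet table in this document.
my %variant = (
    '\\' => [224, 224, 188],
    '['  => [186, 173, 187],
    ']'  => [187, 189, 189],
    '{'  => [192, 192, 251],
    '}'  => [208, 208, 253],
    '^'  => [176,  95, 106],
    '~'  => [161, 161, 255],
    '!'  => [ 90,  90,  90],
    '#'  => [123, 123, 123],
    '|'  => [ 79,  79,  79],
    '$'  => [ 91,  91,  91],
    '@'  => [124, 124, 124],
    '`'  => [121, 121,  74],
);

# Return the variant characters whose code point is not the same
# in all three of the sets above.
sub differing_variants {
    my @differing = grep {
        my ($c37, $c1047, $bc) = @{ $variant{$_} };
        $c37 != $c1047 || $c1047 != $bc;
    } keys %variant;
    return sort @differing;
}

print join(' ', differing_variants()), "\n";
```

Within these three sets only some of the 13 characters actually move; the others happen to coincide here.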
|
|||
|
|
|||
|
=head2 0037
|
|||
|
|
|||
|
Character code set ID 0037 is a mapping of the ASCII plus Latin-1
|
|||
|
characters (i.e. ISO 8859-1) to an EBCDIC set. 0037 is used
|
|||
|
in North American English locales on the OS/400 operating system
|
|||
|
that runs on AS/400 computers.  CCSID 0037 differs from ISO 8859-1
in 236 places; in other words they agree on only 20 code point values.
|
|||
|
|
|||
|
=head2 1047
|
|||
|
|
|||
|
Character code set ID 1047 is also a mapping of the ASCII plus
|
|||
|
Latin-1 characters (i.e. ISO 8859-1) to an EBCDIC set. 1047 is
|
|||
|
used under Unix System Services for OS/390, and OpenEdition for VM/ESA.
|
|||
|
CCSID 1047 differs from CCSID 0037 in eight places.
|
|||
|
|
|||
|
=head2 POSIX-BC
|
|||
|
|
|||
|
The EBCDIC code page in use on Siemens' BS2000 system is distinct from
|
|||
|
1047 and 0037. It is identified below as the POSIX-BC set.
|
|||
|
|
|||
|
=head1 SINGLE OCTET TABLES
|
|||
|
|
|||
|
The following tables list the ASCII and Latin 1 ordered sets including
the subsets: C0 controls (0..31), ASCII graphics (32..126), delete (127),
C1 controls (128..159), and Latin-1 (a.k.a. ISO 8859-1) (160..255).  In the
|
|||
|
table non-printing control character names as well as the Latin 1
|
|||
|
extensions to ASCII have been labelled with character names roughly
|
|||
|
corresponding to I<The Unicode Standard, Version 2.0> albeit with
|
|||
|
substitutions such as s/LATIN// and s/VULGAR// in all cases,
|
|||
|
s/CAPITAL LETTER// in some cases, and s/SMALL LETTER ([A-Z])/\l$1/
|
|||
|
in some other cases (the C<charnames> pragma names unfortunately do
|
|||
|
not list explicit names for the C0 or C1 control characters). The
|
|||
|
"names" of the C1 control set (128..159 in ISO 8859-1) listed here are
|
|||
|
somewhat arbitrary. The differences between the 0037 and 1047 sets are
|
|||
|
flagged with ***. The differences between the 1047 and POSIX-BC sets
|
|||
|
are flagged with ###. All ord() numbers listed are decimal. If you
|
|||
|
would rather see this table listing octal values then run the table
|
|||
|
(that is, the pod version of this document since this recipe may not
|
|||
|
work with a pod2_other_format translation) through:
|
|||
|
|
|||
|
=over 4
|
|||
|
|
|||
|
=item recipe 0
|
|||
|
|
|||
|
=back
|
|||
|
|
|||
|
    perl -ne 'if(/(.{33})(\d+)\s+(\d+)\s+(\d+)\s+(\d+)/)' \
     -e '{printf("%s%-9o%-9o%-9o%-9o\n",$1,$2,$3,$4,$5)}' perlebcdic.pod
|
|||
|
|
|||
|
If you would rather see this table listing hexadecimal values then
|
|||
|
run the table through:
|
|||
|
|
|||
|
=over 4
|
|||
|
|
|||
|
=item recipe 1
|
|||
|
|
|||
|
=back
|
|||
|
|
|||
|
    perl -ne 'if(/(.{33})(\d+)\s+(\d+)\s+(\d+)\s+(\d+)/)' \
     -e '{printf("%s%-9X%-9X%-9X%-9X\n",$1,$2,$3,$4,$5)}' perlebcdic.pod
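
The recipes above depend on the exact column layout of the table.  As a hedged alternative, the sketch below parses a row by its whitespace-separated fields instead of by fixed offsets.  The sub name C<convert_row> and the 29-column name field are assumptions made for this illustration:

```perl
use strict;
use warnings;

# Reformat one table row, printing its four code points with the given
# printf conversion ('o' for octal, 'X' for hexadecimal).
# Returns nothing for lines that are not table rows.
sub convert_row {
    my ($line, $conv) = @_;
    return unless $line =~ /^\s*(\S.*?)\s+(\d+)\s+(\d+)\s+(\d+)\s+(\d+)\b/;
    my $format = '%-29s' . ("%-9$conv" x 4);
    return sprintf($format, $1, $2, $3, $4, $5);
}

print convert_row('<LINE FEED> 10 37 21 21 ***', 'X'), "\n";
```

Like the one-line recipes, this drops the trailing *** and ### flags from each row.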
|
|||
|
|
|||
|
|
|||
|
8859-1
|
|||
|
chr 0819 0037 1047 POSIX-BC
|
|||
|
----------------------------------------------------------------
|
|||
|
<NULL> 0 0 0 0
|
|||
|
<START OF HEADING> 1 1 1 1
|
|||
|
<START OF TEXT> 2 2 2 2
|
|||
|
<END OF TEXT> 3 3 3 3
|
|||
|
<END OF TRANSMISSION> 4 55 55 55
|
|||
|
<ENQUIRY> 5 45 45 45
|
|||
|
<ACKNOWLEDGE> 6 46 46 46
|
|||
|
<BELL> 7 47 47 47
|
|||
|
<BACKSPACE> 8 22 22 22
|
|||
|
<HORIZONTAL TABULATION> 9 5 5 5
|
|||
|
<LINE FEED> 10 37 21 21 ***
|
|||
|
<VERTICAL TABULATION> 11 11 11 11
|
|||
|
<FORM FEED> 12 12 12 12
|
|||
|
<CARRIAGE RETURN> 13 13 13 13
|
|||
|
<SHIFT OUT> 14 14 14 14
|
|||
|
<SHIFT IN> 15 15 15 15
|
|||
|
<DATA LINK ESCAPE> 16 16 16 16
|
|||
|
<DEVICE CONTROL ONE> 17 17 17 17
|
|||
|
<DEVICE CONTROL TWO> 18 18 18 18
|
|||
|
<DEVICE CONTROL THREE> 19 19 19 19
|
|||
|
<DEVICE CONTROL FOUR> 20 60 60 60
|
|||
|
<NEGATIVE ACKNOWLEDGE> 21 61 61 61
|
|||
|
<SYNCHRONOUS IDLE> 22 50 50 50
|
|||
|
<END OF TRANSMISSION BLOCK> 23 38 38 38
|
|||
|
<CANCEL> 24 24 24 24
|
|||
|
<END OF MEDIUM> 25 25 25 25
|
|||
|
<SUBSTITUTE> 26 63 63 63
|
|||
|
<ESCAPE> 27 39 39 39
|
|||
|
<FILE SEPARATOR> 28 28 28 28
|
|||
|
<GROUP SEPARATOR> 29 29 29 29
|
|||
|
<RECORD SEPARATOR> 30 30 30 30
|
|||
|
<UNIT SEPARATOR> 31 31 31 31
|
|||
|
<SPACE> 32 64 64 64
|
|||
|
! 33 90 90 90
|
|||
|
" 34 127 127 127
|
|||
|
# 35 123 123 123
|
|||
|
$ 36 91 91 91
|
|||
|
% 37 108 108 108
|
|||
|
& 38 80 80 80
|
|||
|
' 39 125 125 125
|
|||
|
( 40 77 77 77
|
|||
|
) 41 93 93 93
|
|||
|
* 42 92 92 92
|
|||
|
+ 43 78 78 78
|
|||
|
, 44 107 107 107
|
|||
|
- 45 96 96 96
|
|||
|
. 46 75 75 75
|
|||
|
/ 47 97 97 97
|
|||
|
0 48 240 240 240
|
|||
|
1 49 241 241 241
|
|||
|
2 50 242 242 242
|
|||
|
3 51 243 243 243
|
|||
|
4 52 244 244 244
|
|||
|
5 53 245 245 245
|
|||
|
6 54 246 246 246
|
|||
|
7 55 247 247 247
|
|||
|
8 56 248 248 248
|
|||
|
9 57 249 249 249
|
|||
|
: 58 122 122 122
|
|||
|
; 59 94 94 94
|
|||
|
< 60 76 76 76
|
|||
|
= 61 126 126 126
|
|||
|
> 62 110 110 110
|
|||
|
? 63 111 111 111
|
|||
|
@ 64 124 124 124
|
|||
|
A 65 193 193 193
|
|||
|
B 66 194 194 194
|
|||
|
C 67 195 195 195
|
|||
|
D 68 196 196 196
|
|||
|
E 69 197 197 197
|
|||
|
F 70 198 198 198
|
|||
|
G 71 199 199 199
|
|||
|
H 72 200 200 200
|
|||
|
I 73 201 201 201
|
|||
|
J 74 209 209 209
|
|||
|
K 75 210 210 210
|
|||
|
L 76 211 211 211
|
|||
|
M 77 212 212 212
|
|||
|
N 78 213 213 213
|
|||
|
O 79 214 214 214
|
|||
|
P 80 215 215 215
|
|||
|
Q 81 216 216 216
|
|||
|
R 82 217 217 217
|
|||
|
S 83 226 226 226
|
|||
|
T 84 227 227 227
|
|||
|
U 85 228 228 228
|
|||
|
V 86 229 229 229
|
|||
|
W 87 230 230 230
|
|||
|
X 88 231 231 231
|
|||
|
Y 89 232 232 232
|
|||
|
Z 90 233 233 233
|
|||
|
[ 91 186 173 187 *** ###
|
|||
|
\ 92 224 224 188 ###
|
|||
|
] 93 187 189 189 ***
|
|||
|
^ 94 176 95 106 *** ###
|
|||
|
_ 95 109 109 109
|
|||
|
` 96 121 121 74 ###
|
|||
|
a 97 129 129 129
|
|||
|
b 98 130 130 130
|
|||
|
c 99 131 131 131
|
|||
|
d 100 132 132 132
|
|||
|
e 101 133 133 133
|
|||
|
f 102 134 134 134
|
|||
|
g 103 135 135 135
|
|||
|
h 104 136 136 136
|
|||
|
i 105 137 137 137
|
|||
|
j 106 145 145 145
|
|||
|
k 107 146 146 146
|
|||
|
l 108 147 147 147
|
|||
|
m 109 148 148 148
|
|||
|
n 110 149 149 149
|
|||
|
o 111 150 150 150
|
|||
|
p 112 151 151 151
|
|||
|
q 113 152 152 152
|
|||
|
r 114 153 153 153
|
|||
|
s 115 162 162 162
|
|||
|
t 116 163 163 163
|
|||
|
u 117 164 164 164
|
|||
|
v 118 165 165 165
|
|||
|
w 119 166 166 166
|
|||
|
x 120 167 167 167
|
|||
|
y 121 168 168 168
|
|||
|
z 122 169 169 169
|
|||
|
{ 123 192 192 251 ###
|
|||
|
| 124 79 79 79
|
|||
|
} 125 208 208 253 ###
|
|||
|
~ 126 161 161 255 ###
|
|||
|
<DELETE> 127 7 7 7
|
|||
|
<C1 0> 128 32 32 32
|
|||
|
<C1 1> 129 33 33 33
|
|||
|
<C1 2> 130 34 34 34
|
|||
|
<C1 3> 131 35 35 35
|
|||
|
<C1 4> 132 36 36 36
|
|||
|
<C1 5> 133 21 37 37 ***
|
|||
|
<C1 6> 134 6 6 6
|
|||
|
<C1 7> 135 23 23 23
|
|||
|
<C1 8> 136 40 40 40
|
|||
|
<C1 9> 137 41 41 41
|
|||
|
<C1 10> 138 42 42 42
|
|||
|
<C1 11> 139 43 43 43
|
|||
|
<C1 12> 140 44 44 44
|
|||
|
<C1 13> 141 9 9 9
|
|||
|
<C1 14> 142 10 10 10
|
|||
|
<C1 15> 143 27 27 27
|
|||
|
<C1 16> 144 48 48 48
|
|||
|
<C1 17> 145 49 49 49
|
|||
|
<C1 18> 146 26 26 26
|
|||
|
<C1 19> 147 51 51 51
|
|||
|
<C1 20> 148 52 52 52
|
|||
|
<C1 21> 149 53 53 53
|
|||
|
<C1 22> 150 54 54 54
|
|||
|
<C1 23> 151 8 8 8
|
|||
|
<C1 24> 152 56 56 56
|
|||
|
<C1 25> 153 57 57 57
|
|||
|
<C1 26> 154 58 58 58
|
|||
|
<C1 27> 155 59 59 59
|
|||
|
<C1 28> 156 4 4 4
|
|||
|
<C1 29> 157 20 20 20
|
|||
|
<C1 30> 158 62 62 62
|
|||
|
<C1 31> 159 255 255 95 ###
|
|||
|
<NON-BREAKING SPACE> 160 65 65 65
|
|||
|
<INVERTED EXCLAMATION MARK> 161 170 170 170
|
|||
|
<CENT SIGN> 162 74 74 176 ###
|
|||
|
<POUND SIGN> 163 177 177 177
|
|||
|
<CURRENCY SIGN> 164 159 159 159
|
|||
|
<YEN SIGN> 165 178 178 178
|
|||
|
<BROKEN BAR> 166 106 106 208 ###
|
|||
|
<SECTION SIGN> 167 181 181 181
|
|||
|
<DIAERESIS> 168 189 187 121 *** ###
|
|||
|
<COPYRIGHT SIGN> 169 180 180 180
|
|||
|
<FEMININE ORDINAL INDICATOR> 170 154 154 154
|
|||
|
<LEFT POINTING GUILLEMET> 171 138 138 138
|
|||
|
<NOT SIGN> 172 95 176 186 *** ###
|
|||
|
<SOFT HYPHEN> 173 202 202 202
|
|||
|
<REGISTERED TRADE MARK SIGN> 174 175 175 175
|
|||
|
<MACRON> 175 188 188 161 ###
|
|||
|
<DEGREE SIGN> 176 144 144 144
|
|||
|
<PLUS-OR-MINUS SIGN> 177 143 143 143
|
|||
|
<SUPERSCRIPT TWO> 178 234 234 234
|
|||
|
<SUPERSCRIPT THREE> 179 250 250 250
|
|||
|
<ACUTE ACCENT> 180 190 190 190
|
|||
|
<MICRO SIGN> 181 160 160 160
|
|||
|
<PARAGRAPH SIGN> 182 182 182 182
|
|||
|
<MIDDLE DOT> 183 179 179 179
|
|||
|
<CEDILLA> 184 157 157 157
|
|||
|
<SUPERSCRIPT ONE> 185 218 218 218
|
|||
|
<MASC. ORDINAL INDICATOR> 186 155 155 155
|
|||
|
<RIGHT POINTING GUILLEMET> 187 139 139 139
|
|||
|
<FRACTION ONE QUARTER> 188 183 183 183
|
|||
|
<FRACTION ONE HALF> 189 184 184 184
|
|||
|
<FRACTION THREE QUARTERS> 190 185 185 185
|
|||
|
<INVERTED QUESTION MARK> 191 171 171 171
|
|||
|
<A WITH GRAVE> 192 100 100 100
|
|||
|
<A WITH ACUTE> 193 101 101 101
|
|||
|
<A WITH CIRCUMFLEX> 194 98 98 98
|
|||
|
<A WITH TILDE> 195 102 102 102
|
|||
|
<A WITH DIAERESIS> 196 99 99 99
|
|||
|
<A WITH RING ABOVE> 197 103 103 103
|
|||
|
<CAPITAL LIGATURE AE> 198 158 158 158
|
|||
|
<C WITH CEDILLA> 199 104 104 104
|
|||
|
<E WITH GRAVE> 200 116 116 116
|
|||
|
<E WITH ACUTE> 201 113 113 113
|
|||
|
<E WITH CIRCUMFLEX> 202 114 114 114
|
|||
|
<E WITH DIAERESIS> 203 115 115 115
|
|||
|
<I WITH GRAVE> 204 120 120 120
|
|||
|
<I WITH ACUTE> 205 117 117 117
|
|||
|
<I WITH CIRCUMFLEX> 206 118 118 118
|
|||
|
<I WITH DIAERESIS> 207 119 119 119
|
|||
|
<CAPITAL LETTER ETH> 208 172 172 172
|
|||
|
<N WITH TILDE> 209 105 105 105
|
|||
|
<O WITH GRAVE> 210 237 237 237
|
|||
|
<O WITH ACUTE> 211 238 238 238
|
|||
|
<O WITH CIRCUMFLEX> 212 235 235 235
|
|||
|
<O WITH TILDE> 213 239 239 239
|
|||
|
<O WITH DIAERESIS> 214 236 236 236
|
|||
|
<MULTIPLICATION SIGN> 215 191 191 191
|
|||
|
<O WITH STROKE> 216 128 128 128
|
|||
|
<U WITH GRAVE> 217 253 253 224 ###
|
|||
|
<U WITH ACUTE> 218 254 254 254
|
|||
|
<U WITH CIRCUMFLEX> 219 251 251 221 ###
|
|||
|
<U WITH DIAERESIS> 220 252 252 252
|
|||
|
<Y WITH ACUTE> 221 173 186 173 *** ###
|
|||
|
<CAPITAL LETTER THORN> 222 174 174 174
|
|||
|
<SMALL LETTER SHARP S> 223 89 89 89
|
|||
|
<a WITH GRAVE> 224 68 68 68
|
|||
|
<a WITH ACUTE> 225 69 69 69
|
|||
|
<a WITH CIRCUMFLEX> 226 66 66 66
|
|||
|
<a WITH TILDE> 227 70 70 70
|
|||
|
<a WITH DIAERESIS> 228 67 67 67
|
|||
|
<a WITH RING ABOVE> 229 71 71 71
|
|||
|
<SMALL LIGATURE ae> 230 156 156 156
|
|||
|
<c WITH CEDILLA> 231 72 72 72
|
|||
|
<e WITH GRAVE> 232 84 84 84
|
|||
|
<e WITH ACUTE> 233 81 81 81
|
|||
|
<e WITH CIRCUMFLEX> 234 82 82 82
|
|||
|
<e WITH DIAERESIS> 235 83 83 83
|
|||
|
<i WITH GRAVE> 236 88 88 88
|
|||
|
<i WITH ACUTE> 237 85 85 85
|
|||
|
<i WITH CIRCUMFLEX> 238 86 86 86
|
|||
|
<i WITH DIAERESIS> 239 87 87 87
|
|||
|
<SMALL LETTER eth> 240 140 140 140
|
|||
|
<n WITH TILDE> 241 73 73 73
|
|||
|
<o WITH GRAVE> 242 205 205 205
|
|||
|
<o WITH ACUTE> 243 206 206 206
|
|||
|
<o WITH CIRCUMFLEX> 244 203 203 203
|
|||
|
<o WITH TILDE> 245 207 207 207
|
|||
|
<o WITH DIAERESIS> 246 204 204 204
|
|||
|
<DIVISION SIGN> 247 225 225 225
|
|||
|
<o WITH STROKE> 248 112 112 112
|
|||
|
<u WITH GRAVE> 249 221 221 192 ###
|
|||
|
<u WITH ACUTE> 250 222 222 222
|
|||
|
<u WITH CIRCUMFLEX> 251 219 219 219
|
|||
|
<u WITH DIAERESIS> 252 220 220 220
|
|||
|
<y WITH ACUTE> 253 141 141 141
|
|||
|
<SMALL LETTER thorn> 254 142 142 142
|
|||
|
<y WITH DIAERESIS> 255 223 223 223
|
|||
|
|
|||
|
If you would rather see the above table in CCSID 0037 order rather than
|
|||
|
ASCII + Latin-1 order then run the table through:
|
|||
|
|
|||
|
=over 4
|
|||
|
|
|||
|
=item recipe 2
|
|||
|
|
|||
|
=back
|
|||
|
|
|||
|
    perl -ne 'if(/.{33}\d{1,3}\s{6,8}\d{1,3}\s{6,8}\d{1,3}\s{6,8}\d{1,3}/)'\
     -e '{push(@l,$_)}' \
     -e 'END{print map{$_->[0]}' \
     -e '          sort{$a->[1] <=> $b->[1]}' \
     -e '          map{[$_,substr($_,42,3)]}@l;}' perlebcdic.pod
|
|||
|
|
|||
|
If you would rather see it in CCSID 1047 order then change the number
42 in the last line to 51, like this:

=over 4

=item recipe 3

=back

    perl -ne 'if(/.{33}\d{1,3}\s{6,8}\d{1,3}\s{6,8}\d{1,3}\s{6,8}\d{1,3}/)'\
     -e '{push(@l,$_)}' \
     -e 'END{print map{$_->[0]}' \
     -e '          sort{$a->[1] <=> $b->[1]}' \
     -e '          map{[$_,substr($_,51,3)]}@l;}' perlebcdic.pod
|
|||
|
|
|||
|
If you would rather see it in POSIX-BC order then change the number
51 in the last line to 60, like this:

=over 4

=item recipe 4

=back

    perl -ne 'if(/.{33}\d{1,3}\s{6,8}\d{1,3}\s{6,8}\d{1,3}\s{6,8}\d{1,3}/)'\
     -e '{push(@l,$_)}' \
     -e 'END{print map{$_->[0]}' \
     -e '          sort{$a->[1] <=> $b->[1]}' \
     -e '          map{[$_,substr($_,60,3)]}@l;}' perlebcdic.pod
|
|||
|
|
|||
|
|
|||
|
=head1 IDENTIFYING CHARACTER CODE SETS
|
|||
|
|
|||
|
To determine the character set you are running under from perl one
could use the return value of ord() or chr() to test one or more
character values.  For example:

    $is_ascii  = "A" eq chr(65);
    $is_ebcdic = "A" eq chr(193);

Also, "\t" is a C<HORIZONTAL TABULATION> character so that:

    $is_ascii  = ord("\t") == 9;
    $is_ebcdic = ord("\t") == 5;

To distinguish EBCDIC code pages try looking at one or more of
the characters that differ between them.  For example:

    $is_ebcdic_37   = "\n" eq chr(37);
    $is_ebcdic_1047 = "\n" eq chr(21);

Or better still choose a character that is encoded differently in each
of the code sets, e.g.:

    $is_ascii           = ord('[') == 91;
    $is_ebcdic_37       = ord('[') == 186;
    $is_ebcdic_1047     = ord('[') == 173;
    $is_ebcdic_POSIX_BC = ord('[') == 187;

However, it would be unwise to write tests such as:

    $is_ascii = "\r" ne chr(13);  #  WRONG
    $is_ascii = "\n" ne chr(10);  #  ILL ADVISED

The first of these cannot distinguish ASCII machines from CCSID 0037,
1047, or POSIX-BC EBCDIC machines, since "\r" eq chr(13) under all of
those coded character sets.  But note too that because "\n" is chr(13)
and "\r" is chr(10) on the Macintosh (which is an ASCII machine) the
second C<$is_ascii> test will lead to trouble there.

To determine whether or not perl was built under an EBCDIC
code page you can use the Config module like so:

    use Config;
    $is_ebcdic = $Config{'ebcdic'} eq 'define';
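
The individual tests on C<'['> shown earlier in this section can be folded into one lookup.  This is only a sketch: the sub name C<code_set_name> and the returned strings are invented for illustration, and the four code points are the ones listed in the table in this document:

```perl
use strict;
use warnings;

# Map the code point of '[' to the name of a coded character set.
# Any value not listed in this document is reported as unknown
# rather than guessed.
sub code_set_name {
    my ($bracket_ord) = @_;
    my %name_for = (
         91 => 'ASCII or ISO 8859-1',
        186 => 'EBCDIC CCSID 0037',
        173 => 'EBCDIC CCSID 1047',
        187 => 'EBCDIC POSIX-BC',
    );
    return exists $name_for{$bracket_ord} ? $name_for{$bracket_ord} : 'unknown';
}

print "This perl appears to use: ", code_set_name(ord('[')), "\n";
```

Passing the code point as an argument keeps the lookup testable on any platform.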
|
|||
|
|
|||
|
=head1 CONVERSIONS
|
|||
|
|
|||
|
=head2 tr///
|
|||
|
|
|||
|
In order to convert a string of characters from one character set to
|
|||
|
another a simple list of numbers, such as in the right columns in the
|
|||
|
above table, along with perl's tr/// operator is all that is needed.
|
|||
|
The data in the table are in ASCII order hence the EBCDIC columns
|
|||
|
provide easy to use ASCII to EBCDIC operations that are also easily
|
|||
|
reversed.
|
|||
|
|
|||
|
For example, the table below lists, for each CCSID 0037 code point from
0 to 255 in order, the ISO 8859-1 code point of the same character in
octal.  It corresponds to the first numeric column of the above table
once the rows have been sorted into 0037 order (recipe 2), converted
to octal, with \ characters added.  To convert ASCII to code page 037
use it as the search list in tr/// like so:

    $cp_037 =
    '\000\001\002\003\234\011\206\177\227\215\216\013\014\015\016\017' .
    '\020\021\022\023\235\205\010\207\030\031\222\217\034\035\036\037' .
    '\200\201\202\203\204\012\027\033\210\211\212\213\214\005\006\007' .
    '\220\221\026\223\224\225\226\004\230\231\232\233\024\025\236\032' .
    '\040\240\342\344\340\341\343\345\347\361\242\056\074\050\053\174' .
    '\046\351\352\353\350\355\356\357\354\337\041\044\052\051\073\254' .
    '\055\057\302\304\300\301\303\305\307\321\246\054\045\137\076\077' .
    '\370\311\312\313\310\315\316\317\314\140\072\043\100\047\075\042' .
    '\330\141\142\143\144\145\146\147\150\151\253\273\360\375\376\261' .
    '\260\152\153\154\155\156\157\160\161\162\252\272\346\270\306\244' .
    '\265\176\163\164\165\166\167\170\171\172\241\277\320\335\336\256' .
    '\136\243\245\267\251\247\266\274\275\276\133\135\257\250\264\327' .
    '\173\101\102\103\104\105\106\107\110\111\255\364\366\362\363\365' .
    '\175\112\113\114\115\116\117\120\121\122\271\373\374\371\372\377' .
    '\134\367\123\124\125\126\127\130\131\132\262\324\326\322\323\325' .
    '\060\061\062\063\064\065\066\067\070\071\263\333\334\331\332\237' ;

    my $ebcdic_string = $ascii_string;
    eval '$ebcdic_string =~ tr/' . $cp_037 . '/\000-\377/';

To convert from EBCDIC 037 to ASCII just reverse the order of the tr///
arguments like so:

    my $ascii_string = $ebcdic_string;
    eval '$ascii_string =~ tr/\000-\377/' . $cp_037 . '/';

Similarly one could sort the table into CCSID 1047 order (recipe 3) to
obtain a C<$cp_1047> table.  Sorting it into POSIX-BC order (recipe 4)
could provide a C<$cp_posix_bc> table suitable for transcoding as well.
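
For a small, easily checked instance of the same tr/// technique, the sketch below converts only the ten digits, which the table places at 48..57 in ISO 8859-1 and at 240..249 in all three EBCDIC sets.  The sub names are hypothetical, and the code is written to be run on an ASCII platform:

```perl
use strict;
use warnings;

# Convert the ten digits from their ASCII code points (48..57, octal
# 060..071) to their EBCDIC code points (240..249, octal 360..371).
sub ascii_digits_to_ebcdic {
    my ($string) = @_;
    (my $converted = $string) =~ tr/\060-\071/\360-\371/;
    return $converted;
}

# The inverse mapping simply swaps the two lists.
sub ebcdic_digits_to_ascii {
    my ($string) = @_;
    (my $converted = $string) =~ tr/\360-\371/\060-\071/;
    return $converted;
}

my $ebcdic = ascii_digits_to_ebcdic("2024");
print join(' ', map { ord($_) } split //, $ebcdic), "\n";
```

Because the two lists are ranges of equal length, no eval is needed here; the eval in the full-table example exists only to interpolate the table into tr///.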
|
|||
|
|
|||
|
=head2 iconv
|
|||
|
|
|||
|
XPG operability often implies the presence of an I<iconv> utility
available from the shell or from the C library.  Consult your system's
documentation for information on iconv.

On OS/390 see the iconv(1) man page.  One way to invoke the iconv
shell utility from within perl would be to:

    # OS/390 example
    $ascii_data = `echo '$ebcdic_data'| iconv -f IBM-1047 -t ISO8859-1`;

or the inverse map:

    # OS/390 example
    $ebcdic_data = `echo '$ascii_data'| iconv -f ISO8859-1 -t IBM-1047`;

For other perl-based conversion options see the Convert::* modules on CPAN.
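
On perls that ship the Encode module (5.8 and later), a sketch along the following lines avoids the shell altogether.  It assumes that the C<cp1047> encoding name supplied by Encode::EBCDIC is available in your installation; the variable names are illustrative:

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# Encode a character string as CCSID 1047 bytes, then decode it again.
my $text         = "Perl";
my $ebcdic_bytes = encode('cp1047', $text);
my $round_trip   = decode('cp1047', $ebcdic_bytes);

# Show the EBCDIC bytes in hexadecimal alongside the round-tripped text.
printf "%s -> %s -> %s\n", $text,
    join(' ', map { sprintf '%02X', ord($_) } split //, $ebcdic_bytes),
    $round_trip;
```

Only letters are converted in this example; control characters such as the line terminator deserve a separate check against the table before relying on any converter.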
|
|||
|
|
|||
|
=head2 C RTL
|
|||
|
|
|||
|
The OS/390 C run time library provides _atoe() and _etoa() functions.
|
|||
|
|
|||
|
=head1 OPERATOR DIFFERENCES
|
|||
|
|
|||
|
The C<..> range operator treats certain character ranges with
care on EBCDIC machines.  For example the following array
will have twenty-six elements on either an EBCDIC machine
or an ASCII machine:

    @alphabet = ('A'..'Z');   #  $#alphabet == 25
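
The care is needed because the letters are not contiguous in EBCDIC.  The following sketch, which uses only ord() and the range operator, shows the difference between counting letters and counting code points; the variable names are illustrative:

```perl
use strict;
use warnings;

# The range operator yields exactly the 26 letters on any platform.
my @alphabet = ('A' .. 'Z');

# The code points need not be contiguous: count the code points that
# lie strictly between 'I' and 'J' (0 on ASCII; 7 per the EBCDIC table,
# where I is 201 and J is 209).
my $gap = ord('J') - ord('I') - 1;

printf "%d letters, %d code points between I and J\n",
    scalar(@alphabet), $gap;
```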
|
|||
|
|
|||
|
The bitwise operators such as & ^ | may return different results
when operating on string or character data in a perl program running
on an EBCDIC machine than when run on an ASCII machine.  Here is
an example adapted from the one in L<perlop>:

    # EBCDIC-based examples
    print "j p \n" ^ " a h";                      # prints "JAPH\n"
    print "JA" | "  ph\n";                        # prints "JAph\n"
    print "JAPH\nJunk" & "\277\277\277\277\277";  # prints "japh\n";
    print 'p N$' ^ " E<H\n";                      # prints "Perl\n";
|
|||
|
|
|||
|
An interesting property of the 32 C0 control characters
in the ASCII table is that they can "literally" be constructed
as control characters in perl, e.g. C<(chr(0) eq "\c@")>,
C<(chr(1) eq "\cA")>, and so on.  Perl on EBCDIC machines has been
ported to take "\c@" to chr(0) and "\cA" to chr(1) as well, but the
thirty-three characters that result (the 32 C0 positions plus "\c?")
depend on which code page you are using.  The table below uses the
character names from the previous table but with substitutions such as
s/START OF/S.O./; s/END OF /E.O./; s/TRANSMISSION/TRANS./;
s/TABULATION/TAB./; s/VERTICAL/VERT./; s/HORIZONTAL/HORIZ./;
s/DEVICE CONTROL/D.C./; s/SEPARATOR/SEP./; and
s/NEGATIVE ACKNOWLEDGE/NEG. ACK./.  The POSIX-BC and 1047 sets are
identical throughout this range and differ from the 0037 set at only
one spot (21 decimal).  Note that the C<LINE FEED> character
may be generated by "\cJ" on ASCII machines but by "\cU" on 1047 or
POSIX-BC machines and cannot be generated as a C<"\c.letter."> control
character on 0037 machines.  Note also that "\c\\" maps to two
characters, not one.
|
|||
|
|
|||
|
    chr    ord  8859-1               0037                 1047 && POSIX-BC
    ---------------------------------------------------------------------------
    "\c?"  127  <DELETE>             "                    "                    ***><
    "\c@"    0  <NULL>               <NULL>               <NULL>               ***><
    "\cA"    1  <S.O. HEADING>       <S.O. HEADING>       <S.O. HEADING>
    "\cB"    2  <S.O. TEXT>          <S.O. TEXT>          <S.O. TEXT>
    "\cC"    3  <E.O. TEXT>          <E.O. TEXT>          <E.O. TEXT>
    "\cD"    4  <E.O. TRANS.>        <C1 28>              <C1 28>
    "\cE"    5  <ENQUIRY>            <HORIZ. TAB.>        <HORIZ. TAB.>
    "\cF"    6  <ACKNOWLEDGE>        <C1 6>               <C1 6>
    "\cG"    7  <BELL>               <DELETE>             <DELETE>
    "\cH"    8  <BACKSPACE>          <C1 23>              <C1 23>
    "\cI"    9  <HORIZ. TAB.>        <C1 13>              <C1 13>
    "\cJ"   10  <LINE FEED>          <C1 14>              <C1 14>
    "\cK"   11  <VERT. TAB.>         <VERT. TAB.>         <VERT. TAB.>
    "\cL"   12  <FORM FEED>          <FORM FEED>          <FORM FEED>
    "\cM"   13  <CARRIAGE RETURN>    <CARRIAGE RETURN>    <CARRIAGE RETURN>
    "\cN"   14  <SHIFT OUT>          <SHIFT OUT>          <SHIFT OUT>
    "\cO"   15  <SHIFT IN>           <SHIFT IN>           <SHIFT IN>
    "\cP"   16  <DATA LINK ESCAPE>   <DATA LINK ESCAPE>   <DATA LINK ESCAPE>
    "\cQ"   17  <D.C. ONE>           <D.C. ONE>           <D.C. ONE>
    "\cR"   18  <D.C. TWO>           <D.C. TWO>           <D.C. TWO>
    "\cS"   19  <D.C. THREE>         <D.C. THREE>         <D.C. THREE>
    "\cT"   20  <D.C. FOUR>          <C1 29>              <C1 29>
    "\cU"   21  <NEG. ACK.>          <C1 5>               <LINE FEED>          ***
    "\cV"   22  <SYNCHRONOUS IDLE>   <BACKSPACE>          <BACKSPACE>
    "\cW"   23  <E.O. TRANS. BLOCK>  <C1 7>               <C1 7>
    "\cX"   24  <CANCEL>             <CANCEL>             <CANCEL>
    "\cY"   25  <E.O. MEDIUM>        <E.O. MEDIUM>        <E.O. MEDIUM>
    "\cZ"   26  <SUBSTITUTE>         <C1 18>              <C1 18>
    "\c["   27  <ESCAPE>             <C1 15>              <C1 15>
    "\c\\"  28  <FILE SEP.>\         <FILE SEP.>\         <FILE SEP.>\
    "\c]"   29  <GROUP SEP.>         <GROUP SEP.>         <GROUP SEP.>
    "\c^"   30  <RECORD SEP.>        <RECORD SEP.>        <RECORD SEP.>        ***><
    "\c_"   31  <UNIT SEP.>          <UNIT SEP.>          <UNIT SEP.>          ***><
|
|||
|
|
|||
|
|
|||
|
=head1 FUNCTION DIFFERENCES
|
|||
|
|
|||
|
=over 8

=item chr()

chr() must be given an EBCDIC code number argument to yield a desired
character return value on an EBCDIC machine.  For example:

    $CAPITAL_LETTER_A = chr(193);

=item ord()

ord() will return EBCDIC code number values on an EBCDIC machine.
For example:

    $the_number_193 = ord("A");

=item pack()

The c and C templates for pack() are dependent upon character set
encoding.  Examples of usage on EBCDIC include:

    $foo = pack("CCCC",193,194,195,196);
    # $foo eq "ABCD"
    $foo = pack("C4",193,194,195,196);
    # same thing

    $foo = pack("ccxxcc",193,194,195,196);
    # $foo eq "AB\0\0CD"

=item print()

One must be careful with scalars and strings that are passed to
print that contain ASCII encodings.  One common place
for this to occur is in the output of the MIME type header for
CGI script writing.  For example, many perl programming guides
recommend something similar to:

    print "Content-type:\ttext/html\015\012\015\012";
    # this may be wrong on EBCDIC

Under the IBM OS/390 USS Web Server for example you should instead
write that as:

    print "Content-type:\ttext/html\r\n\r\n"; # OK for DGW et alia

That is because the translation from EBCDIC to ASCII is done
by the web server in this case (such code will not be appropriate for
the Macintosh however).  Consult your web server's documentation for
further details.

=item printf()

The formats that can convert characters to numbers and vice versa
will be different from their ASCII counterparts when executed
on an EBCDIC machine.  Examples include:

    printf("%c%c%c",193,194,195);  # prints ABC

=item sort()

EBCDIC sort results may differ from ASCII sort results especially for
mixed case strings.  This is discussed in more detail below.

=item sprintf()

See the discussion of printf() above.  An example of the use
of sprintf would be:

    $CAPITAL_LETTER_A = sprintf("%c",193);

=item unpack()

See the discussion of pack() above.

=back
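
Where the numeric value itself does not matter, one way to sidestep these differences is to derive code points from character literals rather than hard-coding 65 or 193.  The following is a sketch of that idea using only chr(), ord(), pack(), unpack() and sprintf(); the variable names are illustrative:

```perl
use strict;
use warnings;

# Derive code points from character literals instead of hard-coding
# 65 (ASCII) or 193 (EBCDIC) for "A".
my @codes  = unpack("C*", "ABCD");      # native code points of A, B, C, D
my $packed = pack("C*", @codes);        # back to the original string
my $via_c  = sprintf("%c", ord("A"));   # the same idea with %c

print "round trip ok\n" if $packed eq "ABCD" && $via_c eq "A";
```

The values in C<@codes> differ between ASCII and EBCDIC machines, but the round trip holds on both.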
|
|||
|
|
|||
|
=head1 REGULAR EXPRESSION DIFFERENCES
|
|||
|
|
|||
|
As of perl 5.005_03 letter range regular expressions such as
[A-Z] and [a-z] have been specially coded so as not to pick up gap
characters.  For example, a character such as E<ocirc>
(C<o WITH CIRCUMFLEX>), which lies between I and J on EBCDIC machines,
would not be matched by the regular expression range C</[H-K]/>.

If you do want to match the alphabet gap characters in a single octet
regular expression try matching the hex or octal code such
as C</\313/> on EBCDIC or C</\364/> on ASCII machines to
have your regular expression match C<o WITH CIRCUMFLEX>.
|
|||
|
|
|||
|
Another construct to be wary of is the inappropriate use of hex or
|
|||
|
octal constants in regular expressions. Consider the following
|
|||
|
set of subs:
|
|||
|
|
|||
|
    sub is_c0 {
        my $char = substr(shift,0,1);
        $char =~ /[\000-\037]/;
    }

    sub is_print_ascii {
        my $char = substr(shift,0,1);
        $char =~ /[\040-\176]/;
    }

    sub is_delete {
        my $char = substr(shift,0,1);
        $char eq "\177";
    }

    sub is_c1 {
        my $char = substr(shift,0,1);
        $char =~ /[\200-\237]/;
    }

    sub is_latin_1 {
        my $char = substr(shift,0,1);
        $char =~ /[\240-\377]/;
    }
|
|||
|
|
|||
|
The above would be adequate if the concern was only with numeric code points.
|
|||
|
However, the concern may be with characters rather than code points
|
|||
|
and on an EBCDIC machine it may be desirable for constructs such as
|
|||
|
C<if (is_print_ascii("A")) {print "A is a printable character\n";}> to print
|
|||
|
out the expected message. One way to represent the above collection
|
|||
|
of character classification subs that is capable of working across the
|
|||
|
four coded character sets discussed in this document is as follows:
|
|||
|
|
|||
|
    sub Is_c0 {
        my $char = substr(shift,0,1);
        if (ord('^')==94)  { # ascii
            return $char =~ /[\000-\037]/;
        }
        if (ord('^')==176) { # 37
            return $char =~ /[\000-\003\067\055-\057\026\005\045\013-\023\074\075\062\046\030\031\077\047\034-\037]/;
        }
        if (ord('^')==95 || ord('^')==106) { # 1047 || posix-bc
            return $char =~ /[\000-\003\067\055-\057\026\005\025\013-\023\074\075\062\046\030\031\077\047\034-\037]/;
        }
    }

    sub Is_print_ascii {
        my $char = substr(shift,0,1);
        $char =~ /[ !"\#\$%&'()*+,\-.\/0-9:;<=>?\@A-Z[\\\]^_`a-z{|}~]/;
    }

    sub Is_delete {
        my $char = substr(shift,0,1);
        if (ord('^')==94)  { # ascii
            return $char eq "\177";
        }
        else  {              # ebcdic
            return $char eq "\007";
        }
    }

    sub Is_c1 {
        my $char = substr(shift,0,1);
        if (ord('^')==94)  { # ascii
            return $char =~ /[\200-\237]/;
        }
        if (ord('^')==176) { # 37
            return $char =~ /[\040-\044\025\006\027\050-\054\011\012\033\060\061\032\063-\066\010\070-\073\004\024\076\377]/;
        }
        if (ord('^')==95)  { # 1047
            return $char =~ /[\040-\045\006\027\050-\054\011\012\033\060\061\032\063-\066\010\070-\073\004\024\076\377]/;
        }
        if (ord('^')==106) { # posix-bc
            return $char =~
              /[\040-\045\006\027\050-\054\011\012\033\060\061\032\063-\066\010\070-\073\004\024\076\137]/;
        }
    }
|
|||
|
|
|||
|
sub Is_latin_1 {
|
|||
|
my $char = substr(shift,0,1);
|
|||
|
if (ord('^')==94) { # ascii
|
|||
|
return $char =~ /[\240-\377]/;
|
|||
|
}
|
|||
|
if (ord('^')==176) { # 37
|
|||
|
return $char =~
|
|||
|
/[\101\252\112\261\237\262\152\265\275\264\232\212\137\312\257\274\220\217\352\372\276\240\266\263\235\332\233\213\267\270\271\253\144\145\142\146\143\147\236\150\164\161-\163\170\165-\167\254\151\355\356\353\357\354\277\200\375\376\373\374\255\256\131\104\105\102\106\103\107\234\110\124\121-\123\130\125-\127\214\111\315\316\313\317\314\341\160\335\336\333\334\215\216\337]/;
|
|||
|
}
|
|||
|
if (ord('^')==95) { # 1047
|
|||
|
return $char =~
|
|||
|
/[\101\252\112\261\237\262\152\265\273\264\232\212\260\312\257\274\220\217\352\372\276\240\266\263\235\332\233\213\267\270\271\253\144\145\142\146\143\147\236\150\164\161-\163\170\165-\167\254\151\355\356\353\357\354\277\200\375\376\373\374\272\256\131\104\105\102\106\103\107\234\110\124\121-\123\130\125-\127\214\111\315\316\313\317\314\341\160\335\336\333\334\215\216\337]/;
|
|||
|
}
|
|||
|
if (ord('^')==106) { # posix-bc
|
|||
|
return $char =~
|
|||
|
/[\101\252\260\261\237\262\320\265\171\264\232\212\272\312\257\241\220\217\352\372\276\240\266\263\235\332\233\213\267\270\271\253\144\145\142\146\143\147\236\150\164\161-\163\170\165-\167\254\151\355\356\353\357\354\277\200\340\376\335\374\255\256\131\104\105\102\106\103\107\234\110\124\121-\123\130\125-\127\214\111\315\316\313\317\314\341\160\300\336\333\334\215\216\337]/;
|
|||
|
}
|
|||
|
}
|
|||
|
|
|||
|
Note however that only the C<Is_print_ascii()> sub is really independent
|
|||
|
of coded character set. Another way to write C<Is_latin_1()> would be
|
|||
|
to use the characters in the range explicitly:
|
|||
|
|
|||
|
    sub Is_latin_1 {
        my $char = substr(shift,0,1);
        $char =~ /[ ¡¢£¤¥¦§¨©ª«¬­®¯°±²³´µ¶·¸¹º»¼½¾¿ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖ×ØÙÚÛÜÝÞßàáâãäåæçèéêëìíîïðñòóôõö÷øùúûüýþÿ]/;
    }
|
|||
|
|
|||
|
That form, however, may run into trouble in network transit (due to the
presence of 8 bit characters) or on non ISO-Latin character sets.
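
The repeated C<ord('^')> tests in the subs above could be gathered into a
single helper. The following is only a minimal sketch: the sub name
C<coded_character_set()> and the labels it returns are assumptions made for
illustration, and it relies on the same four code points for C<^> that the
subs above test.

```perl
use strict;
use warnings;

# Hypothetical helper: identify the native coded character set from the
# code point of '^', using the same values tested in the subs above.
sub coded_character_set {
    my $caret = ord('^');
    return 'ascii'    if $caret == 94;
    return '0037'     if $caret == 176;
    return '1047'     if $caret == 95;
    return 'posix-bc' if $caret == 106;
    return 'unknown';   # none of the four sets discussed here
}

print coded_character_set(), "\n";
```

A classification sub could then dispatch on the returned label rather than
repeating the numeric comparisons.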
|
|||
|
|
|||
|
=head1 SOCKETS
|
|||
|
|
|||
|
Most socket programming assumes ASCII character encodings in network
|
|||
|
byte order. Exceptions can include CGI script writing under a
|
|||
|
host web server where the server may take care of translation for you.
|
|||
|
Most host web servers convert EBCDIC data to ISO-8859-1 or Unicode on
|
|||
|
output.
|
|||
|
|
|||
|
=head1 SORTING
|
|||
|
|
|||
|
One big difference between ASCII based character sets and EBCDIC ones
is the relative position of upper and lower case letters, and of the
letters compared to the digits. If sorted on an ASCII based machine the
two letter abbreviation for a physician comes before the two letter
abbreviation for drive, that is:
|
|||
|
|
|||
|
@sorted = sort(qw(Dr. dr.)); # @sorted holds ('Dr.','dr.') on ASCII,
|
|||
|
# but ('dr.','Dr.') on EBCDIC
|
|||
|
|
|||
|
The property of lower case before uppercase letters in EBCDIC is
|
|||
|
even carried to the Latin 1 EBCDIC pages such as 0037 and 1047.
|
|||
|
An example would be that E<Euml> C<E WITH DIAERESIS> (203) comes
|
|||
|
before E<euml> C<e WITH DIAERESIS> (235) on an ASCII machine, but
|
|||
|
the latter (83) comes before the former (115) on an EBCDIC machine.
|
|||
|
(Astute readers will note that the upper case version of E<szlig>
|
|||
|
C<SMALL LETTER SHARP S> is simply "SS" and that the upper case version of
|
|||
|
E<yuml> C<y WITH DIAERESIS> is not in the 0..255 range but it is
|
|||
|
at U+0178 in Unicode, or C<"\x{178}"> in a Unicode enabled Perl).
|
|||
|
|
|||
|
The sort order will cause differences between results obtained on
|
|||
|
ASCII machines versus EBCDIC machines. What follows are some suggestions
|
|||
|
on how to deal with these differences.
|
|||
|
|
|||
|
=head2 Ignore ASCII vs. EBCDIC sort differences.
|
|||
|
|
|||
|
This is the least computationally expensive strategy. It may require
|
|||
|
some user education.
|
|||
|
|
|||
|
=head2 MONO CASE then sort data.
|
|||
|
|
|||
|
In order to minimize the expense of mono casing mixed text, try to
C<tr///> towards the character set case most employed within the data.
If the data are primarily UPPERCASE non Latin 1 then apply C<tr/a-z/A-Z/>
then sort(). If the data are primarily lowercase non Latin 1 then
apply C<tr/A-Z/a-z/> before sorting. If the data are primarily UPPERCASE
and include Latin-1 characters then apply:
|
|||
|
|
|||
|
    tr/a-z/A-Z/;
    tr/àáâãäåæçèéêëìíîïðñòóôõöøùúûüýþ/ÀÁÂÃÄÅÆÇÈÉÊËÌÍÎÏÐÑÒÓÔÕÖØÙÚÛÜÝÞ/;
    s/ß/SS/g;
|
|||
|
|
|||
|
then sort(). Do note however that such Latin-1 manipulation does not
|
|||
|
address the E<yuml> C<y WITH DIAERESIS> character that will remain at
|
|||
|
code point 255 on ASCII machines, but 223 on most EBCDIC machines
|
|||
|
where it will sort to a place less than the EBCDIC numerals. With a
|
|||
|
Unicode enabled Perl you might try:
|
|||
|
|
|||
|
    tr/ÿ/\x{178}/;
|
|||
|
|
|||
|
The strategy of mono casing data before sorting does not preserve the case
|
|||
|
of the data and may not be acceptable for that reason.
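
Where the original case must survive, one option is to fold case only
inside the comparison and leave the strings themselves untouched. This is
a minimal sketch, assuming data limited to the unaccented letters; the sub
name C<sort_ignoring_case()> is hypothetical.

```perl
use strict;
use warnings;

# Compare lower-cased copies of the strings; the strings that are
# returned are the originals, so their case is preserved.
sub sort_ignoring_case {
    return sort { lc($a) cmp lc($b) } @_;
}

my @sorted = sort_ignoring_case(qw(dr. Apple cherry Banana));
print "@sorted\n";
```

Strings that differ only in case compare as equal here, so this comparison
alone does not decide their relative order.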
|
|||
|
|
|||
|
=head2 Convert, sort data, then reconvert.
|
|||
|
|
|||
|
This is the most expensive proposition that does not employ a network
|
|||
|
connection.
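
A sketch of this strategy follows. It assumes a 256 element native to
ASCII map is available and passed by reference as C<$e2a>; on an ASCII
machine the identity map C<(0 .. 255)> serves. The sub name
C<sort_in_ascii_order()> is hypothetical. Because each original string is
carried along beside its converted key, no separate reconversion pass is
needed.

```perl
use strict;
use warnings;

# Sort strings in ASCII (Latin 1) order regardless of the native
# character set.  $e2a is a reference to a 256 element map from native
# code points to ASCII code points.
sub sort_in_ascii_order {
    my ($e2a, @strings) = @_;
    return map  { $_->[1] }                      # recover the originals
           sort { $a->[0] cmp $b->[0] }          # compare converted keys
           map  { [ join('', map { chr $e2a->[ord $_] } split //, $_), $_ ] }
           @strings;
}

my @sorted = sort_in_ascii_order([0 .. 255], qw(dr. Dr.));
print "@sorted\n";
```

The per-character conversion is where the expense of this approach comes
from; it is paid once per string rather than once per comparison.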
|
|||
|
|
|||
|
=head2 Perform sorting on one type of machine only.
|
|||
|
|
|||
|
This strategy can employ a network connection. As such
|
|||
|
it would be computationally expensive.
|
|||
|
|
|||
|
=head1 TRANSFORMATION FORMATS
|
|||
|
|
|||
|
There are a variety of ways of transforming data with an intra character set
|
|||
|
mapping that serve a variety of purposes. Sorting was discussed in the
|
|||
|
previous section and a few of the other more popular mapping techniques are
|
|||
|
discussed next.
|
|||
|
|
|||
|
=head2 URL decoding and encoding
|
|||
|
|
|||
|
Note that some URLs have hexadecimal ASCII code points in them in an
|
|||
|
attempt to overcome character or protocol limitation issues. For example
|
|||
|
the tilde character is not on every keyboard hence a URL of the form:
|
|||
|
|
|||
|
http://www.pvhp.com/~pvhp/
|
|||
|
|
|||
|
may also be expressed as either of:
|
|||
|
|
|||
|
http://www.pvhp.com/%7Epvhp/
|
|||
|
|
|||
|
http://www.pvhp.com/%7epvhp/
|
|||
|
|
|||
|
where 7E is the hexadecimal ASCII code point for '~'. Here is an example
|
|||
|
of decoding such a URL under CCSID 1047:
|
|||
|
|
|||
|
$url = 'http://www.pvhp.com/%7Epvhp/';
|
|||
|
# this array assumes code page 1047
|
|||
|
my @a2e_1047 = (
|
|||
|
0, 1, 2, 3, 55, 45, 46, 47, 22, 5, 21, 11, 12, 13, 14, 15,
|
|||
|
16, 17, 18, 19, 60, 61, 50, 38, 24, 25, 63, 39, 28, 29, 30, 31,
|
|||
|
64, 90,127,123, 91,108, 80,125, 77, 93, 92, 78,107, 96, 75, 97,
|
|||
|
240,241,242,243,244,245,246,247,248,249,122, 94, 76,126,110,111,
|
|||
|
124,193,194,195,196,197,198,199,200,201,209,210,211,212,213,214,
|
|||
|
215,216,217,226,227,228,229,230,231,232,233,173,224,189, 95,109,
|
|||
|
121,129,130,131,132,133,134,135,136,137,145,146,147,148,149,150,
|
|||
|
151,152,153,162,163,164,165,166,167,168,169,192, 79,208,161, 7,
|
|||
|
32, 33, 34, 35, 36, 37, 6, 23, 40, 41, 42, 43, 44, 9, 10, 27,
|
|||
|
48, 49, 26, 51, 52, 53, 54, 8, 56, 57, 58, 59, 4, 20, 62,255,
|
|||
|
65,170, 74,177,159,178,106,181,187,180,154,138,176,202,175,188,
|
|||
|
144,143,234,250,190,160,182,179,157,218,155,139,183,184,185,171,
|
|||
|
100,101, 98,102, 99,103,158,104,116,113,114,115,120,117,118,119,
|
|||
|
172,105,237,238,235,239,236,191,128,253,254,251,252,186,174, 89,
|
|||
|
68, 69, 66, 70, 67, 71,156, 72, 84, 81, 82, 83, 88, 85, 86, 87,
|
|||
|
140, 73,205,206,203,207,204,225,112,221,222,219,220,141,142,223
|
|||
|
);
|
|||
|
$url =~ s/%([0-9a-fA-F]{2})/pack("C",$a2e_1047[hex($1)])/ge;
|
|||
|
|
|||
|
Conversely, here is a partial solution for the task of encoding such
|
|||
|
a URL under the 1047 code page:
|
|||
|
|
|||
|
$url = 'http://www.pvhp.com/~pvhp/';
|
|||
|
# this array assumes code page 1047
|
|||
|
my @e2a_1047 = (
|
|||
|
0, 1, 2, 3,156, 9,134,127,151,141,142, 11, 12, 13, 14, 15,
|
|||
|
16, 17, 18, 19,157, 10, 8,135, 24, 25,146,143, 28, 29, 30, 31,
|
|||
|
128,129,130,131,132,133, 23, 27,136,137,138,139,140, 5, 6, 7,
|
|||
|
144,145, 22,147,148,149,150, 4,152,153,154,155, 20, 21,158, 26,
|
|||
|
32,160,226,228,224,225,227,229,231,241,162, 46, 60, 40, 43,124,
|
|||
|
38,233,234,235,232,237,238,239,236,223, 33, 36, 42, 41, 59, 94,
|
|||
|
45, 47,194,196,192,193,195,197,199,209,166, 44, 37, 95, 62, 63,
|
|||
|
248,201,202,203,200,205,206,207,204, 96, 58, 35, 64, 39, 61, 34,
|
|||
|
216, 97, 98, 99,100,101,102,103,104,105,171,187,240,253,254,177,
|
|||
|
176,106,107,108,109,110,111,112,113,114,170,186,230,184,198,164,
|
|||
|
181,126,115,116,117,118,119,120,121,122,161,191,208, 91,222,174,
|
|||
|
172,163,165,183,169,167,182,188,189,190,221,168,175, 93,180,215,
|
|||
|
123, 65, 66, 67, 68, 69, 70, 71, 72, 73,173,244,246,242,243,245,
|
|||
|
125, 74, 75, 76, 77, 78, 79, 80, 81, 82,185,251,252,249,250,255,
|
|||
|
92,247, 83, 84, 85, 86, 87, 88, 89, 90,178,212,214,210,211,213,
|
|||
|
48, 49, 50, 51, 52, 53, 54, 55, 56, 57,179,219,220,217,218,159
|
|||
|
);
|
|||
|
# The following regular expression does not address the
|
|||
|
# mappings for: ('.' => '%2E', '/' => '%2F', ':' => '%3A')
|
|||
|
$url =~ s/([\t "#%&\(\),;<=>\?\@\[\\\]^`{|}~])/sprintf("%%%02X",$e2a_1047[ord($1)])/ge;
|
|||
|
|
|||
|
where a more complete solution would split the URL into components
|
|||
|
and apply a full s/// substitution only to the appropriate parts.
|
|||
|
|
|||
|
In the remaining examples a @e2a or @a2e array may be employed
|
|||
|
but the assignment will not be shown explicitly. For code page 1047
|
|||
|
you could use the @a2e_1047 or @e2a_1047 arrays just shown.
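
When only one of the two arrays is at hand, the other can be derived from
it, provided the map is one-to-one over 0..255. The following is a small
sketch; the sub name C<invert_map()> is hypothetical and the identity map
stands in for a real table.

```perl
use strict;
use warnings;

# Build the reverse of a one-to-one map with an array slice: the
# element at index $forward[$i] is set to $i.
sub invert_map {
    my @forward = @_;
    my @reverse;
    @reverse[@forward] = (0 .. $#forward);
    return @reverse;
}

my @a2e = (0 .. 255);           # placeholder; substitute a real table
my @e2a = invert_map(@a2e);
print "round trip ok\n" if $e2a[ $a2e[0x7E] ] == 0x7E;
```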
|
|||
|
|
|||
|
=head2 uu encoding and decoding
|
|||
|
|
|||
|
The C<u> template to pack() or unpack() will render EBCDIC data in EBCDIC
|
|||
|
characters equivalent to their ASCII counterparts. For example, the
|
|||
|
following will print "Yes indeed\n" on either an ASCII or EBCDIC computer:
|
|||
|
|
|||
|
$all_byte_chrs = '';
|
|||
|
for (0..255) { $all_byte_chrs .= chr($_); }
|
|||
|
$uuencode_byte_chrs = pack('u', $all_byte_chrs);
|
|||
|
($uu = <<' ENDOFHEREDOC') =~ s/^\s*//gm;
|
|||
|
M``$"`P0%!@<("0H+#`T.#Q`1$A,4%187&!D:&QP='A\@(2(C)"4F)R@I*BLL
|
|||
|
M+2XO,#$R,S0U-C<X.3H[/#T^/T!!0D-$149'2$E*2TQ-3D]045)35%565UA9
|
|||
|
M6EM<75Y?8&%B8V1E9F=H:6IK;&UN;W!Q<G-T=79W>'EZ>WQ]?G^`@8*#A(6&
|
|||
|
MAXB)BHN,C8Z/D)&2DY25EI>8F9J;G)V>GZ"AHJ.DI::GJ*FJJZRMKJ^PL;*S
|
|||
|
MM+6VM[BYNKN\O;Z_P,'"P\3%QL?(R<K+S,W.S]#1TM/4U=;7V-G:V]S=WM_@
|
|||
|
?X>+CY.7FY^CIZNOL[>[O\/'R\_3U]O?X^?K[_/W^_P``
|
|||
|
ENDOFHEREDOC
|
|||
|
if ($uuencode_byte_chrs eq $uu) {
|
|||
|
print "Yes ";
|
|||
|
}
|
|||
|
$uudecode_byte_chrs = unpack('u', $uuencode_byte_chrs);
|
|||
|
if ($uudecode_byte_chrs eq $all_byte_chrs) {
|
|||
|
print "indeed\n";
|
|||
|
}
|
|||
|
|
|||
|
Here is a very spartan uudecoder that will work on EBCDIC provided
|
|||
|
that the @e2a array is filled in appropriately:
|
|||
|
|
|||
|
#!/usr/local/bin/perl
|
|||
|
@e2a = ( # this must be filled in
|
|||
|
);
|
|||
|
$_ = <> until ($mode,$file) = /^begin\s*(\d*)\s*(\S*)/;
|
|||
|
open(OUT, "> $file") if $file ne "";
|
|||
|
while(<>) {
|
|||
|
last if /^end/;
|
|||
|
next if /[a-z]/;
|
|||
|
next unless int(((($e2a[ord()] - 32 ) & 077) + 2) / 3) ==
|
|||
|
int(length() / 4);
|
|||
|
print OUT unpack("u", $_);
|
|||
|
}
|
|||
|
close(OUT);
|
|||
|
chmod oct($mode), $file;
|
|||
|
|
|||
|
|
|||
|
=head2 Quoted-Printable encoding and decoding
|
|||
|
|
|||
|
On ASCII encoded machines it is possible to encode characters outside of
the printable set using:
|
|||
|
|
|||
|
# This QP encoder works on ASCII only
|
|||
|
$qp_string =~ s/([=\x00-\x1F\x80-\xFF])/sprintf("=%02X",ord($1))/ge;
|
|||
|
|
|||
|
Whereas a QP encoder that works on both ASCII and EBCDIC machines
|
|||
|
would look somewhat like the following (where the EBCDIC branch @e2a
|
|||
|
array is omitted for brevity):
|
|||
|
|
|||
|
if (ord('A') == 65) { # ASCII
|
|||
|
$delete = "\x7F"; # ASCII
|
|||
|
@e2a = (0 .. 255); # ASCII to ASCII identity map
|
|||
|
}
|
|||
|
else { # EBCDIC
|
|||
|
$delete = "\x07"; # EBCDIC
|
|||
|
@e2a = # EBCDIC to ASCII map (as shown above)
|
|||
|
}
|
|||
|
$qp_string =~
|
|||
|
s/([^ !"\#\$%&'()*+,\-.\/0-9:;<>?\@A-Z[\\\]^_`a-z{|}~$delete])/sprintf("=%02X",$e2a[ord($1)])/ge;
|
|||
|
|
|||
|
(although in production code the substitutions might be done
|
|||
|
in the EBCDIC branch with the @e2a array and separately in the
|
|||
|
ASCII branch without the expense of the identity map).
|
|||
|
|
|||
|
Such QP strings can be decoded with:
|
|||
|
|
|||
|
# This QP decoder is limited to ASCII only
|
|||
|
$string =~ s/=([0-9A-Fa-f][0-9A-Fa-f])/chr hex $1/ge;
|
|||
|
$string =~ s/=[\n\r]+$//;
|
|||
|
|
|||
|
Whereas a QP decoder that works on both ASCII and EBCDIC machines
|
|||
|
would look somewhat like the following (where the @a2e array is
|
|||
|
omitted for brevity):
|
|||
|
|
|||
|
$string =~ s/=([0-9A-Fa-f][0-9A-Fa-f])/chr $a2e[hex $1]/ge;
|
|||
|
$string =~ s/=[\n\r]+$//;
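
The encoder and decoder can be paired as subs that take the translation
maps as arguments. This is a sketch only: the sub names C<qp_encode()> and
C<qp_decode()> are hypothetical, the maps are passed as references to 256
element arrays (the identity map on an ASCII machine), and unlike the
encoder shown earlier this one also encodes the delete character. Line
length limits are not handled.

```perl
use strict;
use warnings;

# Encode '=' and every character outside the printable ASCII set.
# $e2a maps native code points to ASCII code points.
sub qp_encode {
    my ($string, $e2a) = @_;
    $string =~ s/([^ !"\#\$%&'()*+,\-.\/0-9:;<>?\@A-Z\[\\\]^_`a-z{|}~])/sprintf("=%02X", $e2a->[ord($1)])/ge;
    return $string;
}

# Remove soft line breaks, then turn each =XX back into a native
# character.  $a2e maps ASCII code points to native code points.
sub qp_decode {
    my ($string, $a2e) = @_;
    $string =~ s/=\r?\n//g;
    $string =~ s/=([0-9A-Fa-f]{2})/chr $a2e->[hex $1]/ge;
    return $string;
}

my @identity = (0 .. 255);
my $encoded  = qp_encode("caf\xE9 = ok", \@identity);
print "$encoded\n";
```

Decoding the result with C<qp_decode($encoded, \@identity)> should give
back the original string.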
|
|||
|
|
|||
|
=head2 Caesarian cyphers
|
|||
|
|
|||
|
The practice of shifting an alphabet one or more characters for encipherment
|
|||
|
dates back thousands of years and was explicitly detailed by Gaius Julius
|
|||
|
Caesar in his B<Gallic Wars> text. A single alphabet shift is sometimes
|
|||
|
referred to as a rotation and the shift amount is given as a number $n after
|
|||
|
the string 'rot' or "rot$n". Rot0 and rot26 would designate identity maps
|
|||
|
on the 26 letter English version of the Latin alphabet. Rot13 has the
|
|||
|
interesting property that alternate subsequent invocations are identity maps
|
|||
|
(thus rot13 is its own non-trivial inverse in the group of 26 alphabet
|
|||
|
rotations). Hence the following is a rot13 encoder and decoder that will
|
|||
|
work on ASCII and EBCDIC machines:
|
|||
|
|
|||
|
#!/usr/local/bin/perl
|
|||
|
|
|||
|
while(<>){
|
|||
|
tr/n-za-mN-ZA-M/a-zA-Z/;
|
|||
|
print;
|
|||
|
}
|
|||
|
|
|||
|
In one-liner form:
|
|||
|
|
|||
|
perl -ne 'tr/n-za-mN-ZA-M/a-zA-Z/;print'
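
The same idea extends to an arbitrary shift. The sketch below (the sub
name C<rot_n()> is hypothetical) spells the alphabet out as a string and
looks each letter up in a hash, so it makes no assumption that the letters
occupy contiguous code points, which they do not in EBCDIC.

```perl
use strict;
use warnings;

# Rotate the 26 unaccented letters by $n places; every other character
# is passed through unchanged.
sub rot_n {
    my ($n, $text) = @_;
    my $lower = 'abcdefghijklmnopqrstuvwxyz';
    my $upper = uc $lower;
    $n %= 26;
    my $from = $lower . $upper;
    my $to   = substr($lower, $n) . substr($lower, 0, $n)
             . substr($upper, $n) . substr($upper, 0, $n);
    my %map;
    @map{ split //, $from } = split //, $to;
    $text =~ s/(.)/exists $map{$1} ? $map{$1} : $1/gse;
    return $text;
}

print rot_n(13, "Hello, World"), "\n";
```

Calling C<rot_n(13, ...)> twice in succession returns the original text,
as described above for rot13.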
|
|||
|
|
|||
|
|
|||
|
=head1 Hashing order and checksums
|
|||
|
|
|||
|
XXX
|
|||
|
|
|||
|
=head1 I18N AND L10N
|
|||
|
|
|||
|
Internationalization (I18N) and localization (L10N) are supported at least
|
|||
|
in principle even on EBCDIC machines. The details are system dependent
|
|||
|
and discussed under the L<perlebcdic/OS ISSUES> section below.
|
|||
|
|
|||
|
=head1 MULTI OCTET CHARACTER SETS
|
|||
|
|
|||
|
Multi byte EBCDIC code pages; Unicode, UTF-8, UTF-EBCDIC, XXX.
|
|||
|
|
|||
|
=head1 OS ISSUES
|
|||
|
|
|||
|
There may be a few system dependent issues
|
|||
|
of concern to EBCDIC Perl programmers.
|
|||
|
|
|||
|
=head2 OS/400
|
|||
|
|
|||
|
The PASE environment.
|
|||
|
|
|||
|
=over 8
|
|||
|
|
|||
|
=item IFS access
|
|||
|
|
|||
|
XXX.
|
|||
|
|
|||
|
=back
|
|||
|
|
|||
|
=head2 OS/390
|
|||
|
|
|||
|
Perl runs under UNIX System Services (USS).
|
|||
|
|
|||
|
=over 8
|
|||
|
|
|||
|
=item chcp
|
|||
|
|
|||
|
B<chcp> is supported as a shell utility for displaying and changing
|
|||
|
one's code page. See also L<chcp>.
|
|||
|
|
|||
|
=item dataset access
|
|||
|
|
|||
|
For sequential data set access try:
|
|||
|
|
|||
|
my @ds_records = `cat //DSNAME`;
|
|||
|
|
|||
|
or:
|
|||
|
|
|||
|
my @ds_records = `cat //'HLQ.DSNAME'`;
|
|||
|
|
|||
|
See also the OS390::Stdio module on CPAN.
|
|||
|
|
|||
|
=item OS/390 iconv
|
|||
|
|
|||
|
B<iconv> is supported as both a shell utility and a C RTL routine.
|
|||
|
See also the iconv(1) and iconv(3) manual pages.
|
|||
|
|
|||
|
=item locales
|
|||
|
|
|||
|
On OS/390 see L<locale> for information on locales. The L10N files
|
|||
|
are in F</usr/nls/locale>. $Config{d_setlocale} is 'define' on OS/390.
|
|||
|
|
|||
|
=back
|
|||
|
|
|||
|
=head2 VM/ESA?
|
|||
|
|
|||
|
XXX.
|
|||
|
|
|||
|
=head2 POSIX-BC?
|
|||
|
|
|||
|
XXX.
|
|||
|
|
|||
|
=head1 BUGS
|
|||
|
|
|||
|
This pod document contains literal Latin 1 characters and may encounter
|
|||
|
translation difficulties. In particular one popular nroff implementation
|
|||
|
was known to strip accented characters to their unaccented counterparts
|
|||
|
while attempting to view this document through the B<pod2man> program
|
|||
|
(for example, you may see a plain C<y> rather than one with a diaeresis
|
|||
|
as in E<yuml>). Another nroff truncated the resultant man page at
|
|||
|
the first occurrence of 8 bit characters.
|
|||
|
|
|||
|
Not all shells will allow multiple C<-e> string arguments to perl to
|
|||
|
be concatenated together properly as recipes 2, 3, and 4 might seem
|
|||
|
to imply.
|
|||
|
|
|||
|
Perl does not yet work with any Unicode features on EBCDIC platforms.
|
|||
|
|
|||
|
=head1 SEE ALSO
|
|||
|
|
|||
|
L<perllocale>, L<perlfunc>.
|
|||
|
|
|||
|
=head1 REFERENCES
|
|||
|
|
|||
|
http://anubis.dkuug.dk/i18n/charmaps
|
|||
|
|
|||
|
http://www.unicode.org/
|
|||
|
|
|||
|
http://www.unicode.org/unicode/reports/tr16/
|
|||
|
|
|||
|
http://www.wps.com/texts/codes/
|
|||
|
B<ASCII: American Standard Code for Information Infiltration> Tom Jennings,
|
|||
|
September 1999.
|
|||
|
|
|||
|
B<The Unicode Standard Version 2.0> The Unicode Consortium,
|
|||
|
ISBN 0-201-48345-9, Addison Wesley Developers Press, July 1996.
|
|||
|
|
|||
|
B<The Unicode Standard Version 3.0> The Unicode Consortium, Lisa Moore ed.,
|
|||
|
ISBN 0-201-61633-5, Addison Wesley Developers Press, February 2000.
|
|||
|
|
|||
|
B<CDRA: IBM - Character Data Representation Architecture -
|
|||
|
Reference and Registry>, IBM SC09-2190-00, December 1996.
|
|||
|
|
|||
|
"Demystifying Character Sets", Andrea Vine, Multilingual Computing
|
|||
|
& Technology, B<#26 Vol. 10 Issue 4>, August/September 1999;
|
|||
|
ISSN 1523-0309; Multilingual Computing Inc. Sandpoint ID, USA.
|
|||
|
|
|||
|
B<Codes, Ciphers, and Other Cryptic and Clandestine Communication>
|
|||
|
Fred B. Wrixon, ISBN 1-57912-040-7, Black Dog & Leventhal Publishers,
|
|||
|
1998.
|
|||
|
|
|||
|
=head1 AUTHOR
|
|||
|
|
|||
|
Peter Prymmer pvhp@best.com wrote this in 1999 and 2000
|
|||
|
with CCSID 0819 and 0037 help from Chris Leach and
|
|||
|
AndrE<eacute> Pirard A.Pirard@ulg.ac.be as well as POSIX-BC
|
|||
|
help from Thomas Dorner Thomas.Dorner@start.de.
|
|||
|
Thanks also to Vickie Cooper, Philip Newton, William Raffloer, and
|
|||
|
Joe Smith. Trademarks, registered trademarks, service marks and
|
|||
|
registered service marks used in this document are the property of
|
|||
|
their respective owners.
|
|||
|
|
|||
|
|