Sourcing Scripts in Different Encodings

The Tcl source command always reads files using the system encoding.
Therefore, Tcl Developer Xchange recommends that, whenever possible,
you author scripts in the native system encoding.

A difficulty arises when distributing scripts internationally, as
you don’t necessarily know what the system encoding will
be. Fortunately, most common character encodings include the standard
7-bit ASCII characters as a subset. Therefore, you are usually
safe if your script contains only 7-bit ASCII characters.

If you need to use an extended character set for your scripts that
you distribute, you can provide a small «bootstrap» script written in
7-bit ASCII. The bootstrap script can then load and execute
scripts in any encoding that you choose.

You can execute a script written in an encoding other than the
system encoding by opening the file, setting the proper encoding using
the fconfigure -encoding command, reading the file into a variable,
and then evaluating the string with the eval command. For example,
the following reads and executes a Tcl script encoded in EUC-JP:

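# Open the script and declare its encoding before reading it
set fd [open "app.tcl" r]
fconfigure $fd -encoding euc-jp
set jpscript [read $fd]
close $fd
# Evaluate the decoded script text
eval $jpscript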

Note: This technique works only if the file
contains actual EUC-JP encoded characters (for example, you created
the file with an EUC-JP text editor). This technique doesn’t work if
you build the EUC-JP encoded characters using the «\x» or octal digit
escape sequences. Tcl 8.1 interprets each «\x» or octal digit escape
sequence as a single Unicode character with the upper bits set to
0. For example, if the script app.tcl above contained the
line:

set ha "\xA4\xCF"

then the variable ha would contain two characters,
«¤Ï» (Unicode characters «CURRENCY SIGN» and «LATIN CAPITAL
LETTER I WITH DIAERESIS»), not the Unicode HA character.
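A bootstrap script could wrap the steps above in a small procedure. The following is only a sketch: the procedure name sourceEncoded is hypothetical, and the usage line assumes that app.tcl sits next to the bootstrap script. The helper itself contains only 7-bit ASCII characters, so it can be sourced safely under any common system encoding.

```tcl
# Hypothetical helper (not part of Tcl): evaluate a script file that is
# stored in the given encoding rather than in the system encoding.
proc sourceEncoded {path encoding} {
    set fd [open $path r]
    # Close the channel even if configuring or reading it fails.
    if {[catch {
        fconfigure $fd -encoding $encoding
        set script [read $fd]
    } err]} {
        close $fd
        error $err
    }
    close $fd
    # Evaluate in the caller's context, much as the source command would.
    uplevel 1 $script
}

# Possible use from a 7-bit ASCII bootstrap script (assumed layout):
#   sourceEncoded [file join [file dirname [info script]] app.tcl] euc-jp
```

Tcl 8.5 and later also accept an -encoding option on the source command itself, so a helper like this is mainly of interest for older releases.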

Summary: Tcl Internationalization Support at a Glance

The following list is a quick summary of the issues you should be
aware of concerning the new internationalization support introduced in
Tcl 8.1:

  • Tcl encodes all strings internally as Unicode characters in UTF-8
    format.
  • The introduction of Unicode/UTF-8 encoding requires no changes to
    legacy Tcl scripts that use only 7-bit ASCII characters, because UTF-8
    characters corresponding to the standard 7-bit ASCII set (up to ASCII
    value 0x7F in hexadecimal) have the same byte values in both UTF-8 and
    ASCII encoding. Furthermore, because the use of Unicode/UTF-8 encoding
    is internal to Tcl, most string handling in legacy Tcl scripts works
    the same in Tcl 8.1 as it did in Tcl 8.0.
  • You can specify a Unicode character by its four-digit, hexadecimal
    Unicode code value with the «\uxxxx» escape sequence.
  • All Tcl string functions properly handle multi-byte UTF-8
    characters as single characters.
  • Tk widgets that display text accept text string arguments in
    standard Unicode/UTF-8 encoding. Tk automatically handles any encoding
    conversion necessary to display the characters in a particular
    font. If the master font that you set for a widget doesn’t contain a
    glyph (a visual representation) for a particular Unicode character, Tk
    attempts to locate a font that does. Where possible, Tk attempts to
    locate a font that matches as many characteristics of the widget’s
    master font as possible (for example, weight, slant, etc.). In some
    cases, Tk is unable to identify a suitable font, even if one is
    actually installed on the system. Therefore, for best results, you
    should try to select as a widget’s master font one that is capable of
    handling the characters you expect to display.
  • The system encoding is the character encoding used by
    the operating system. Tcl automatically handles conversions between
    UTF-8 and the system encoding when interacting with the operating
    system.
  • Tcl usually can determine a reasonable default system encoding
    based on the platform and locale settings, but if for some reason it
    cannot, it uses ISO 8859-1 as the default system encoding. You can
    explicitly set the system encoding used by Tcl with the
    encoding system command.
  • By default, Tcl uses the system encoding when reading from and
    writing to channels, and converts the text to UTF-8 format. You can
    change the character encoding for a channel using the
    fconfigure -encoding command.
  • The source command always reads files using the system
    encoding. Therefore, Scriptics recommends that whenever possible, you
    author scripts in the native system encoding. Furthermore, most common
    character encodings include the standard 7-bit ASCII characters as
    a subset, so you are usually safe writing scripts using only 7-bit
    ASCII characters. You can execute a script written in a different
    encoding by opening the file, setting the proper encoding using the
    fconfigure -encoding command, reading the file into a variable,
    and then evaluating the string with the eval command.
  • You can convert a string to a different encoding using the
    encoding convertfrom and encoding convertto commands.
  • Tcl has built-in knowledge of approximately 30 common character
    encodings. The encoding names command displays a list of all
    known encodings. You can create additional encodings as described in
    the Tcl_GetEncoding.3 reference page.
  • The new msgcat package provides a set of functions for
    managing multilingual user interfaces. It allows you to define strings
    in a message catalog, which is independent from your application and
    which you can edit or localize without modifying the application
    source code. See the msgcat.n reference page for more information.

You should also read the «Internationalization and the Tcl C APIs»
section of this document if you use the Tcl APIs in C programs.

Converting Strings to Different Encodings

You can convert a string to a different encoding using the
encoding convertfrom and encoding convertto commands. The
encoding convertfrom command converts a string from a specified
encoding into UTF-8 Unicode characters; the encoding convertto
command converts a string from UTF-8 Unicode into a specified
encoding. In either case, if you omit the encoding argument, the
command uses the current system encoding.

As an example, the following command converts a string representing
the Hiragana letter HA from EUC-JP encoding into a Unicode
string:

set ha [encoding convertfrom euc-jp "\xA4\xCF"]

(In Tcl 8.1, the «\x» and octal digit escape sequences specify the
lower 8 bits of a Unicode character with the upper 8 bits set to
0. Thus the string «\xA4\xCF» still specifies two
characters in Tcl 8.1, just as it did in Tcl 8.0; however, Tcl
8.1 stores those characters in four bytes, whereas Tcl 8.0
stored them in two bytes.)
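The conversion in both directions can be sketched as follows. The code value \u306f for the Hiragana letter HA is supplied here for illustration and does not appear in the text above; the variable names are arbitrary.

```tcl
# "\xA4\xCF" is two characters whose code values are the EUC-JP bytes for HA.
set eucBytes "\xA4\xCF"

# Interpret those byte values as EUC-JP; the result is one Unicode character.
set ha [encoding convertfrom euc-jp $eucBytes]
puts [string length $eucBytes]    ;# 2
puts [string length $ha]          ;# 1
puts [string equal $ha "\u306f"]  ;# 1

# Convert back; the result again holds the two EUC-JP byte values.
set roundTrip [encoding convertto euc-jp $ha]
binary scan $roundTrip H* hex
puts $hex                         ;# a4cf
```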

Fonts, Encodings, and Tk Widgets

Tk widgets that display text now require text strings in
Unicode/UTF-8 encoding. Tk automatically handles any encoding
conversion necessary to display the characters in a particular
font.

If the master font that you set for a widget doesn’t contain a
glyph for a particular Unicode character that you want to display, Tk
attempts to locate a font that does. Where possible, Tk attempts to
locate a font that matches as many characteristics of the widget’s
master font as possible (for example, weight, slant, etc.). Once Tk
finds a suitable font, it displays the character in that font. In
other words, the widget uses the master font for all characters it is
capable of displaying, and alternative fonts only as needed.

In some cases, Tk is unable to identify a suitable font, in which
case the widget cannot display the characters. (Instead, the widget
displays a system-dependent fallback character such as «?») The
process of identifying suitable fonts is complex, and Tk’s algorithms
don’t always find a font even if one is actually installed on the
system. Therefore, for best results, you should try to select as a
widget’s master font one that is capable of handling the characters
you expect to display. For example, «Times» is likely to be a poor
choice if you know that you need to display Japanese or Arabic
characters in a widget.

If you work with text in a variety of character sets, you may need
to search out fonts to represent them. Markus Kuhn has developed a
free 6×13 font that supports essentially all the Unicode characters
that can be displayed in a 6×13 glyph. This does not include Japanese,
Chinese, and other Asian languages, but it does cover many others. The
font is available at
http://www.cl.cam.ac.uk/~mgk25/ucs-fonts.html .
His site also contains many useful links to other sources of fonts and
font information.

Message Catalogs

The new msgcat package provides a set of functions for
managing multilingual user interfaces. It allows you to define strings
in a message catalog, which is independent from your application or
package and which you can edit or localize without modifying the
application source code. The msgcat package is optional, but
Tcl Developer Xchange recommends using it for all multilingual
applications and packages.

The basic principle of the msgcat package is that you create
a set of message files, one for each supported language,
containing localized versions of all the strings your application or
package can display. Then in your application or package, instead of
using a string directly, you call the ::msgcat::mc command to return
a localized version of the string you want.

This document provides only a brief introduction to message
catalogs. The msgcat package provides additional features such
as namespace support and «best match» handling of sublocales. See the
msgcat.n reference page for more information.

Using Message Catalogs

Using message catalogs from within your application or package
requires the following steps:

  1. Optionally set the locale using the ::msgcat::mclocale
    command. If you don’t call mclocale, the locale defaults to the
    value of the env(LANG) environment variable at the time the
    msgcat package is loaded. If env(LANG) isn’t
    defined, then the locale defaults to «C».
  2. Call ::msgcat::mcload to load the appropriate message
    files. The mcload command requires as an argument a directory
    containing your message files.
  3. Anywhere in your script that you would typically specify a string
    to display, use the ::msgcat::mc command instead. The mc
    command takes as an argument a source string and returns the
    translation of that string in the current locale.

The following code fragment demonstrates how you could use the
msgcat package in a script:

# Use the default locale as specified by env(LANG).
# You could explicitly set the locale with a command such as
# ::msgcat::mclocale "en_UK"

# Load the messages files.  In this example, they are stored
# in a subdirectory named "msgs" which is in the same directory
# as this script.

::msgcat::mcload [file join [file dirname [info script]] msgs]

# Display a welcome message

puts [::msgcat::mc "Welcome to Tcl!"]

In this example, instead of directly displaying the message
«Welcome to Tcl!», the application calls mc to retrieve a
localized version of the string. The string returned by mc
depends on the current locale. For example, in the «es» locale
mc could return the Spanish-language greeting «¡Bienvenido a
Tcl!».

If no translation of the string is defined for the current locale
(for example, because no message file exists for that locale),
mc executes the procedure ::msgcat::mcunknown. The default
behavior of mcunknown is to return the original string
(«Welcome to Tcl!» in this case), but you can redefine it to
perform any action you want.

Creating Localized Message Files

To use the msgcat package, you need to prepare a set of
message files for your package or application, all contained within
the same directory. The name of each message file is a locale
specifier followed by the extension «.msg» (for example,
es.msg for a Spanish message file or en_UK.msg for a UK
English message file).

Each message file contains a series of calls to ::msgcat::mcset
to set the translation strings for that language. The format of
the mcset command is:

::msgcat::mcset locale src-string ?translation-string?

The mcset command defines a locale-specific translation for
the given src-string. If no translation-string
argument is present, then the value of src-string is also
used as the locale-specific translation string.

So, if American English is the «source language» for your
application, an en_UK.msg file might contain commands such
as:

::msgcat::mcset en_UK "Welcome to Tcl!"
::msgcat::mcset en_UK "Select a color:" "Select a colour:"

Note that no translation string is provided for the first line, so
the resulting «translation» for the en_UK locale is the same as the
American source string, «Welcome to Tcl!» If you omitted this entry in
the message file, then calling mc with the source string
«Welcome to Tcl!» in the en_UK locale would result in mcunknown
being called. Although the default behavior of mcunknown would
produce the desired results (returning «Welcome to Tcl!»), you could
run into problems if you override the behavior of mcunknown.
Therefore, it is always safest to include an mcset mapping for every
source string in your application, even if a particular locale
doesn’t require a «translation» for that string.

An equivalent Spanish-language message file, es.msg, would
contain:

::msgcat::mcset es "Welcome to Tcl!" "¡Bienvenido a Tcl!"
::msgcat::mcset es "Select a color:" "Elige un color:"
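To try the lookup behavior without creating message files, the same mcset calls can be issued directly in a script. This is a minimal sketch that assumes the msgcat package is installed and that the code runs in the global namespace. The string «Quit» is an invented example with no translation, and the inverted exclamation mark is written as the escape \u00a1 so that the script itself stays 7-bit ASCII.

```tcl
package require msgcat

# Register translations directly; an es.msg file would hold the same calls.
::msgcat::mcset es "Welcome to Tcl!" "\u00a1Bienvenido a Tcl!"
::msgcat::mcset es "Select a color:" "Elige un color:"

# Select the Spanish locale explicitly instead of relying on env(LANG).
::msgcat::mclocale es
puts [::msgcat::mc "Select a color:"]   ;# Elige un color:

# "Quit" has no mcset entry, so the lookup falls through to
# ::msgcat::mcunknown, which by default returns the source string.
puts [::msgcat::mc "Quit"]              ;# Quit
```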

Channel Input/Output

When reading and writing data on a channel, you need to ensure that
Tcl uses the proper character encoding for that channel. The default
encoding for newly opened channels (both files and sockets) is the
same as the platform- and locale-dependent system encoding used for
interfacing with the operating system. (See the «Character Encodings
and the Operating System» section of this document for more
information.) In most cases, you don’t
need to do anything special to read or write data because most text
files are created in the system encoding. You need to take special
steps only when accessing files in an encoding other than the system
encoding (for example, reading a file encoded in Shift-JIS format when
your system encoding is ISO 8859-1).

The fconfigure -encoding option allows you to specify the
encoding for a channel. Thus, to read from a file encoded in Shift-JIS
format, you should execute the following commands:

# $file is assumed to hold the name of the Shift-JIS encoded file
set fd [open $file r]
fconfigure $fd -encoding shiftjis

Tcl then automatically converts any text you read from the file
into standard UTF-8 format.

Similarly, if you are writing to a channel, you can use
fconfigure -encoding to specify the target character encoding,
and Tcl automatically converts strings from UTF-8 to that encoding on
output.
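Reading and writing with an explicit channel encoding might look like the sketch below. The procedure names writeFileAs and readFileAs and the scratch file name are illustrative assumptions, and the current directory is assumed to be writable.

```tcl
# Hypothetical helper: write text to a file in the given encoding.
proc writeFileAs {path encoding text} {
    set fd [open $path w]
    fconfigure $fd -encoding $encoding
    puts -nonewline $fd $text
    close $fd
}

# Hypothetical helper: read a whole file that is stored in the given encoding.
proc readFileAs {path encoding} {
    set fd [open $path r]
    fconfigure $fd -encoding $encoding
    set text [read $fd]
    close $fd
    return $text
}

set path [file join [pwd] sjis-demo.txt]   ;# assumed scratch file
writeFileAs $path shiftjis "\u306f"        ;# one Hiragana character
puts [file size $path]                     ;# size on disk, in bytes
puts [string length [readFileAs $path shiftjis]]  ;# characters read back
file delete $path
```

The size on disk reflects the Shift-JIS byte sequence, while the string read back is counted in characters, which is the distinction that the channel encoding setting manages.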

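Note: The Tcl source command always reads files using the system
encoding. See the «Sourcing Scripts in Different Encodings» section
of this document for more information.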

Character Encodings and the Operating System

The system encoding is the character encoding used by
the operating system for items such as file names and environment
variables. Text files used by text editors and other applications are
usually encoded in the system encoding as well, unless the application
that produced them explicitly saves them in another format (for
example, if you use a Shift-JIS text editor on an ISO 8859-1
system).

Tcl automatically converts strings from UTF-8 format to the system
encoding and vice versa whenever it communicates with the operating
system. For example, Tcl automatically handles any encoding conversion
needed if you execute commands such as:

% glob *

or

% set fd [open $fileName r]

The Tcl source command also reads files using the system
encoding, and strings passed to and from the Tcl exec command
are converted to and from the system encoding.

Tcl attempts to determine the system encoding during initialization
based on the platform and locale settings. Tcl usually can determine a
reasonable default system encoding based on these settings, but if for
some reason it cannot, it uses ISO 8859-1 as the default system
encoding.

You can override the default system encoding with the
encoding system command. Tcl Developer Xchange recommends that you
avoid using this command if at all possible. If you set the default
system encoding to anything other than the actual encoding used by
your operating system, Tcl will likely find it impossible to
communicate properly with your operating system.

Note: For reading and writing files in an encoding
other than the system encoding, you need to use the
fconfigure -encoding command (not the encoding system command)
as described in the «Channel Input/Output» section of this document.
Also see the «Sourcing Scripts in Different Encodings» section of
this document for special instructions for sourcing files in formats
other than the system encoding.
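Querying the system encoding, rather than setting it, is harmless and can help when diagnosing conversion problems. The following sketch uses only the encoding system and encoding names commands; the helper name encodingKnown is hypothetical.

```tcl
# Report the encoding Tcl chose for talking to the operating system.
puts "System encoding: [encoding system]"

# Hypothetical helper: check whether Tcl knows an encoding by name.
proc encodingKnown {name} {
    expr {[lsearch -exact [encoding names] $name] >= 0}
}

puts [encodingKnown euc-jp]             ;# 1 if the EUC-JP tables are installed
puts [encodingKnown [encoding system]]  ;# whether the system encoding is listed
```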

Internationalization and the Tcl C APIs

Tcl 8.1 introduces new C APIs to support all new
internationalization features. Tcl 8.1 also introduces new convenience
functions for manipulating Unicode/UTF-8 strings. By using the new
APIs in your applications, you can easily add full Unicode support to
your application. Coupled with Tk’s powerful font and layout support,
you can quickly create fully internationalized applications.

When programming with the Tcl C APIs, you should be aware of the
following issues, in addition to the Tcl scripting language
internationalization features:

  • The Tcl C APIs now require all strings to be passed to functions
    as Unicode characters in UTF-8 format. You must convert strings in
    native system encodings to UTF-8 before passing them to Tcl C
    functions. Similarly, you must convert Tcl UTF-8 strings to the native
    system encoding before passing them to system functions. Tcl provides
    functions for handling encodings and converting strings from one
    encoding to another. See the GetEncoding.3 reference page for details.
  • Because 7-bit ASCII characters have the same encoding in UTF-8
    format, legacy code that uses only 7-bit ASCII characters functions
    the same in Tcl 8.1 as it did in Tcl 8.0. Therefore, if you’re
    certain that your strings contain only 7-bit ASCII
    characters, no conversion is required.
  • Because strings in Tcl are now stored as Unicode characters in
    UTF-8 format, the number of characters in a string is not necessarily
    equal to the number of bytes in a string. In particular, you should no
    longer use the standard C string functions such as strlen to
    count characters in a string. Similarly, other standard C string
    functions such as toupper don’t work with Unicode
    characters. Tcl provides a set of equivalent Unicode string functions,
    such as Tcl_NumUtfChars and Tcl_UtfToUpper, as well as
    other convenience functions for manipulating Unicode strings. See the
    Utf.3 and UtfToUpper.3 reference pages for details.

Character Encoding Overview

A character encoding is simply a mapping of characters
and symbols used in written language into a binary format used by
computers. For example, in the standard ASCII encoding, the upper-case
«A» character from the Latin character set is represented by the byte
value 0x41 in hexadecimal. Other widely used character encodings
include ISO 8859-1, used by many European languages, Shift-JIS and
EUC-JP for Japanese characters, and Big5 for Chinese characters.

The Unicode Standard is a uniform encoding scheme for
virtually all characters used in the world’s major written
languages. As supported by Tcl 8.1, Unicode uses a fixed-width 16-bit
encoding for all text elements (later versions of the standard extend
the code space beyond 16 bits). These
text elements include letters such as «w» or «M», characters such as
those used in Japanese Hiragana to represent syllables, or ideographs
such as those used in Chinese to represent full words or concepts. The
Unicode Standard does not specify the visual representation of a
character, which is known as a glyph. For more
information on the Unicode Standard, visit the Unicode web site at
http://www.unicode.org .

UTF-8 is a standard transformation format for Unicode
characters. It is a method of transforming all Unicode characters into
a variable-length encoding of bytes; a single 16-bit Unicode character
is represented by one, two, or three bytes. The advantage of the UTF-8
standard is that it and the Unicode standard were designed so that
Unicode characters corresponding to the standard ASCII set (up to
ASCII value 0x7F in hexadecimal) have the same byte values in both
UTF-8 and ASCII encoding. In other words, an upper-case «A» character
is represented by the single-byte value 0x41 in both UTF-8 and ASCII
encoding.
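The one-, two-, and three-byte cases can be inspected from a script. This is a small sketch in which the helper name utf8Hex and the sample characters are illustrative choices, not taken from the text above.

```tcl
# Hypothetical helper: return the UTF-8 bytes of a string as hex digits.
proc utf8Hex {str} {
    binary scan [encoding convertto utf-8 $str] H* hex
    return $hex
}

puts [utf8Hex "A"]        ;# 41      one byte, same value as in ASCII
puts [utf8Hex "\u00e9"]   ;# c3a9    two bytes (LATIN SMALL LETTER E WITH ACUTE)
puts [utf8Hex "\u306f"]   ;# e381af  three bytes (HIRAGANA LETTER HA)
```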

Beginning in Tcl 8.1, Tcl represents all strings internally as
Unicode characters in UTF-8 format. Tcl 8.1 also ships with built-in
support for approximately 30 common character encoding standards, and
can convert strings from one encoding to another. The
encoding names command displays a list of all known encodings. You
can create additional encodings as described in the
Tcl_GetEncoding.3 reference page.

Tip: Because 7-bit ASCII characters have the same
encoding in UTF-8 format, legacy Tcl scripts that use only 7-bit ASCII
characters function the same in Tcl 8.1 as they did in Tcl
8.0. Furthermore, because the use of Unicode/UTF-8 encoding is
internal to Tcl, most string handling in legacy Tcl scripts works the
same in Tcl 8.1 as it did in Tcl 8.0. Most problems in converting from
Tcl 8.0 to 8.1 occur in: 1) using non-Latin characters, 2) reading and
writing strings from a channel, and 3) writing code that assumes that
each character in a string is a fixed byte width (for example, one
byte per character).

General String Manipulation

Beginning in Tcl 8.1, all Tcl string manipulation functions expect
and return Unicode strings encoded in UTF-8 format. Because the use of
Unicode/UTF-8 encoding is internal to Tcl, you should see no
difference in Tcl 8.0 and 8.1 string handling in your scripts.

The Tcl string functions properly handle multi-byte UTF-8
characters as single characters. For example in the following
commands, Tcl treats the string «Café» as a four-character
string, even though the internal representation in UTF-8 format
requires five bytes. (As with previous versions of Tcl, string indexes
start with «0»; that is, the first character is index «0», the second
character is index «1», etc.)

% set unistr "Café"
Café
% string length $unistr
4
% string index $unistr 3
é

Furthermore, the new regular expression implementation introduced
in Tcl 8.1 handles the full range of Unicode characters.
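The difference between character counts and byte counts, and the Unicode awareness of regular expressions, can be checked with a few commands. This sketch writes the accented letter as the escape \u00e9 so that the script stays 7-bit ASCII; the pattern is an arbitrary example.

```tcl
# "Caf\u00e9" is the four-character string used in the example above.
set unistr "Caf\u00e9"

puts [string length $unistr]                             ;# 4 characters
puts [string length [encoding convertto utf-8 $unistr]]  ;# 5 bytes as UTF-8
puts [regexp {^\w+$} $unistr]                             ;# 1: \w matches the accented letter
```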

The «\uxxxx» escape sequence allows you to specify a
Unicode character by its four-digit, hexadecimal Unicode code
value. For example, the following assigns to a variable two ideograph
characters corresponding to the Chinese transliteration of «Tcl»
(TAI-KU):

set tclstr "\u592a\u9177"
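To confirm which code values a string actually holds, each character can be converted back to its numeric value. The helper name codeValues below is hypothetical; it relies only on the split, scan, and format commands.

```tcl
# Hypothetical helper: list the code value of each character as four hex digits.
proc codeValues {str} {
    set result {}
    foreach ch [split $str ""] {
        scan $ch %c code
        lappend result [format %04x $code]
    }
    return $result
}

set tclstr "\u592a\u9177"
puts [string length $tclstr]    ;# 2
puts [codeValues $tclstr]       ;# 592a 9177
puts [codeValues "Caf\u00e9"]   ;# 0043 0061 0066 00e9
```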