SKF(1) SKF(1)
NAME
skf - simple Kanji Filter
SYNOPSIS
skf [-AEIJKNQRSXZabdehjknqrsuvxz] [-i multi_byte_charset ]
[-o single_byte_charset ] [ long_format_options ]
[infiles..]
DESCRIPTION
skf is yet another i18n-capable kanji filter, which
enables users to read various Japanese kanji-coded files
on the Net. It converts input kanji texts or streams into
a character stream using the designated kanji code and
writes it to standard output. Specifically, skf is intended
to be a versatile filter for reading documents in various
code sets, and does not have fancy features that are not
directly related to code conversion (such as folding or
MIME encoding support).
Like nkf, skf automatically recognizes the input file code
when it is some kind of ISO-2022 code, and also recognizes
Microsoft JIS (SJIS) code and EUC if the input file does
not include X0201 kana. skf 1.9x can read various ISO-2022
compliant codesets, including JIS Kanji code (X0208, X0212
and X0213), EUC encoding, ISO European Latin sets
(ISO-8859-1/2/3/4/6/7/10/11/14/15/16), BS 4730, NF Z
62-010 and X0201 kana with ESC-(-I, SS0, Locking shift.
skf also supports some non-ISO-2022 compliant sets,
including Microsoft Shifted-JIS code, KOI-8-R/U, Unicode
standard (UCS2/UTF-16, UTF7 and UTF8), X0221 JIS (2 octet
only) and some vendor specific codes (KEIS83 and JEF).
Supported output codesets are X-0208/X-0212 JIS, X-0201
JIS, ASCII, EUC, Microsoft JIS and Unicode.
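The restriction on X0201 kana can be illustrated with a small sketch. This is not skf's detection logic; it assumes Python's built-in `euc_jp`, `shift_jis` and `utf_8` codecs, and `plausible_codecs` is a hypothetical helper. It only shows that some byte sequences are well formed under more than one encoding.

```python
def plausible_codecs(data, candidates=("euc_jp", "shift_jis", "utf_8")):
    """Return the candidate codecs under which data decodes without error."""
    ok = []
    for name in candidates:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue
        ok.append(name)
    return ok


# 0xB6 0xC5 is two half-width (X0201) kana in Shift JIS, but it is
# also a well-formed two-byte EUC-JP sequence, so both codecs accept it.
sample = b"\xb6\xc5"
print(plausible_codecs(sample))
```

When more than one candidate survives, only an explicit input option can settle which one was meant.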
Unlike nkf, skf is designed to convert input code into
some kind of human-readable form under a local environment
(i.e. codeset), and has several extra conversion features.
Such conversions include Windows/Macintosh specific code
swap and old-new jis glyph change, html-format/TeX format
conversion and variant unifications.
If file names are specified, skf reads the files and writes
the converted stream to stdout. If no file names are given,
input is taken from stdin and output goes to stdout. Options
are taken from the environment variables SKFENV and skfenv
and then from the command line, in this order. Environment
variables are not used when skf is executed as root.
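The option sources described above can be sketched as follows. `gather_options` is a hypothetical helper, not part of skf; splitting the variables on shell-like whitespace is an assumption, and how skf resolves conflicting options is not modelled here.

```python
import shlex


def gather_options(environ, argv, is_root):
    """Collect option words in the order SKFENV, skfenv, command line."""
    words = []
    if not is_root:  # environment variables are skipped for root
        for name in ("SKFENV", "skfenv"):
            words.extend(shlex.split(environ.get(name, "")))
    words.extend(argv)
    return words


env = {"SKFENV": "-e", "skfenv": "--lineend-unix"}
print(gather_options(env, ["-z", "in.txt"], is_root=False))
print(gather_options(env, ["-z", "in.txt"], is_root=True))
```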
skf does not use locale-related environment variables for
conversion, but error message output is controlled by the
given locale.
OPTIONS
skf is internally a different program from nkf. However,
skf is intended to be a plug-in replacement for nkf (v1.4)
and accepts a subset of nkf options.
skf 1.9x recognizes the following options.
-u use unbuffered output.
-b use buffered output. This is default.
Input/Output codeset settings
-n -j output encoding is 7-bit JIS code using JIS
X0208(1983/1990) character set.
-s -x output encoding is Microsoft JIS using JIS
X0208(1983/1990) character set.
-a -e output encoding is EUC using JIS X0208(1983/1990)
character set.
-q output encoding is Unicode UTF-16 (v3.2). Output is
little endian byte ordered by default, and includes
an endian mark unless --suppress-endian-mark is
specified. Code points beyond the UCS2 range are
output as surrogate pairs unless --limit-to-ucs2 is
specified.
-z output encoding is UTF-8 encoded Unicode (v3.2)
-y output encoding is UTF-7 encoded Unicode (v3.2)
-k (experimental)
output encoding and character set is KEIS83.
-i_ use ESC-$-_ as the designate sequence for JIS Kanji
(default is B). This setting is independent of the
output codeset setting.
-o_ use ESC-(-_ as the designate sequence for single-byte
roman characters (default is J). This setting is
independent of the output codeset setting and does
not specify the output codeset.
-A, -E, --input-euc
Assume input code set is EUC.
--input-euc-x0213 (experimental)
Assume input code set is EUC with X-0213 1st plane
extension.
-N, --input-jis
Assume input code set is JIS X-0208.
-S, -X, --input-sjis
Assume input code set is Microsoft JIS.
--input-sjis-x0213 (experimental)
Assume input code set is Microsoft JIS with X-0213
extension (i.e. JIS X-0213 Shift encoding).
-Q, --input-ucs2 --input-utf16
Assume input code set is Unicode UTF-16. Default
endian is BIG, and byte order marking is recog-
nized.
-Y, --input-utf7
Assume input code set is UTF-7 encoded Unicode.
--no-utf7
Assume input code set is *NOT* UTF-7 encoded Uni-
code. This option disables input utf7 testing.
-Z, --input-utf8
Assume input code set is UTF-8 encoded Unicode.
-K, --input-keis (experimental)
Assume input code set is KEIS83 code.
--input-jef (experimental)
Assume input code set is JEF-ebcdik kana code.
--input-jef-small (experimental)
Assume input code set is JEF with ebcdik latin-
small code.
EXTENDED OPTIONS
skf has various features to fit the output file to the
local environment, and many of these are controlled by the
extended control switches described in this section.
X-0201 Kana handling
skf by default converts X-0201 kana to X-0208 kana. To
output X-0201 kana as is, use one of the following options.
When the output is EUC or SJIS, these options enable X-0201
kana output in the way provided by each code set. When
Unicode output is specified, output of the (equivalent)
kana part is controlled by --use-compat, not by the
following switches.
--kana-jis7
use SI/SO locking shift sequence to designate
X-0201 kana.
--kana-jis8
output X-0201 kana using 8-bit code right plane.
--kana-esci --kana-call
use ESC-(-I to designate X-0201 kana.
--kana-enable
use X-0201 kana when EUC (with G2) or SJIS output
code is used. With JIS output, it is the same as
--kana-call.
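The default X-0201 to X-0208 kana conversion can be pictured at the Unicode level with the sketch below. It is an illustration only: it assumes Unicode NFKC normalization as a stand-in for skf's own tables, and NFKC also rewrites other compatibility characters that skf may treat differently. `widen_kana` is a hypothetical name.

```python
import unicodedata


def widen_kana(text):
    """Map half-width kana to full-width forms via NFKC normalization."""
    return unicodedata.normalize("NFKC", text)


half = "\uff76\uff85"    # half-width katakana KA, NA
voiced = "\uff76\uff9e"  # half-width KA followed by a voiced sound mark
print(widen_kana(half))    # two full-width katakana
print(widen_kana(voiced))  # a single full-width voiced katakana
```

Note that a kana plus its voiced sound mark collapses into one character, so the character count is not preserved.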
JIS X-0212(Supplement Kanji code) Support
--x0212-enable
skf by default does not output JIS X-0212 code.
This option enables use of the JIS X-0212 part. The
output code set must be neither Microsoft code nor
KEIS. For Unicode variant encodings, this option is
on by default.
Latin code handling
With Unicode(tm) family output codings, skf outputs the
non-ASCII Latin character part as is, but with other output
codings, skf converts these characters using the following
rules:
(1) If the code is defined in iso-8859-1 and
--use-iso8859-1 is specified, it is output as is using
iso-8859-1 as GR.
(2) If html convert mode is enabled and the code is defined
in the html/sgml codeset, it is converted to an html escape
sequence.
(3) If tex convert mode is enabled and the code is defined
in the tex codeset, it is converted to tex format.
(4) If the code is defined in X-0208/X0212, it is converted
to X-0208/X0212 respectively.
--use-iso8859-1
Enable iso-8859-1 output. Iso-8859-1 is invoked to
G1 and set to GR plane. This mode is cleared by
--reset.
--convert-html --convert-sgml
Enable html convert mode. This mode is disabled by
--reset. These two options are aliases, and are
treated as the same option.
--convert-html-decimal
Enable html code-point decimal convert mode. This
mode is cleared by --reset.
--convert-html-hexadecimal
Enable html code-point hexadecimal convert mode.
This mode is cleared by --reset.
--convert-tex
Enable tex convert mode. This mode is cleared by
--reset.
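The rule order for non-ASCII Latin characters described in this section can be sketched as a simple cascade. This is an illustration under stated assumptions, not skf's code: `convert_latin` and `TEX_TABLE` are hypothetical, the HTML names come from Python's `html.entities` (HTML 4.01), and rule (4), the X-0208/X-0212 lookup, is not modelled.

```python
from html.entities import codepoint2name

# Hypothetical, tiny TeX table; the real tex codeset is much larger.
TEX_TABLE = {"\u00e9": r"\'e", "\u00fc": r'\"u', "\u00df": r"\ss{}"}


def convert_latin(ch, use_iso8859_1=False, html=False, tex=False):
    """Apply rules (1) to (3) in order; return None if none applies."""
    cp = ord(ch)
    if use_iso8859_1 and 0xA0 <= cp <= 0xFF:
        return ch.encode("iso-8859-1")              # rule (1): output as GR
    if html and cp in codepoint2name:
        return f"&{codepoint2name[cp]};".encode("ascii")  # rule (2)
    if tex and ch in TEX_TABLE:
        return TEX_TABLE[ch].encode("ascii")        # rule (3)
    return None  # rule (4) would be tried here


print(convert_latin("\u00e9", use_iso8859_1=True))
print(convert_latin("\u00e9", html=True))
print(convert_latin("\u00e9", tex=True))
```

Because the rules are tried in order, enabling several modes at once means the earliest matching rule decides the output.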
Codeset/Vendor Specific codeset handling flags
skf by default assumes machine specific parts of kanji
code are Microsoft Windows compatible. Here are some
options that control this behavior.
--disable-gaiji-support
Assume machine specific part is undefined.
--use-apple-gaiji
Assume machine specific part in input file is Mac-
intosh(Kanjitalk7) compatible.
--dsbl-ibm-gaiji
Disable machine specific part in input file.
--disable-chart
Do not use Moji-keisen characters. This is for old
Macintosh system compatibility.
--disable-jis90
Disable the two characters added in JIS X-0208(1990).
If this option is specified, these two characters are
replaced by Kanji variants. This option is off by
default.
--input-detect-jis78
Distinguish JIS X-0208(1978) codeset and JIS
X-0208(1983/90) codeset. This option is valid only
when input encoding is JIS (ISO-2022). This option
needs -DDYNAMIC_LOADING at compile time.
--output-jis78
Use JIS X-0208(1978) as the codeset for the JIS table
on output. This option is valid when the output
encoding is JIS, EUC or Microsoft code (cp932).
--convert-jis78-jis83
In the 1983 revision of JIS X-0208, some characters
in JIS X-0208(1978) were moved to JIS X-0212(1990).
This switch makes skf output these characters as
their variants in X-0208(1983).
ISO-2022 Specific controls
--set-g0=`char_set'
Set the code set predefined to plane 0 (G0). Supported
`char_set' values are `ascii' (default) and `x0201'.
It is automatically invoked to GL (iso-2022-jp-1/2/3
assumption). This option works only with JIS
input.
--set-g1=`char_set'
Set the code set predefined to the right plane (G1).
Supported `char_set' values are `x0201' (default),
`iso8859-1', `iso8859-2', `iso8859-3', `iso8859-7',
`iso8859-14', `iso8859-15', `koi8-r' and `x0212'.
This option works only with JIS input.
--set-g2=`char_set'
Set the code set predefined to the G2 plane. Supported
`char_set' values are `x0201' (default),
`iso8859-1', `iso8859-2', `iso8859-3', `iso8859-7',
`iso8859-14', `iso8859-15', `koi8-r' and `x0212'.
This option works with EUC and JIS input.
--set-g3=`char_set'
Set the code set predefined to the G3 plane. Supported
`char_set' values are `x0201' (default),
`iso8859-1', `iso8859-2', `iso8859-3', `iso8859-7',
`iso8859-14', `iso8859-15', `koi8-r' and `x0212'.
This option works with EUC and JIS input.
--euc-protect-g1
In EUC input mode, suppress sequences to set a
charset to G1. Such sequences are discarded.
--old-nec-compat
Enable old NEC kanji sequence (ESC-K,H). Needs com-
pile option -DOLD_NEC_COMPAT.
--add-annon
Add announcer for JIS X-0208(1990) to X-0208 desig-
nate sequence. This option works only with JIS out-
put.
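To make the designate sequences discussed in this manual concrete, the sketch below lists the escape sequences found in a 7-bit JIS byte stream. It assumes Python's `iso2022_jp` codec as the producer, which returns to ASCII with ESC-(-B rather than the ESC-(-J default mentioned for -o_; `designations` is a hypothetical helper and its pattern covers only the common designation forms.

```python
import re

# ESC, optional '$' (multi-byte set), optional intermediate, final byte.
DESIGNATE = re.compile(rb"\x1b\$?[()*+\-./]?[0-~]")


def designations(data):
    """Return the ISO-2022 designate sequences found in data, in order."""
    return DESIGNATE.findall(data)


encoded = "\u6f22\u5b57".encode("iso2022_jp")  # two kanji
print(encoded)
print(designations(encoded))
```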
Unicode coding specific control
--use-compat
When output is one of the Unicode transformation
formats, enable characters in the compatibility plane
(0xfxxx). skf by default does not use these
characters.
--use-ms-compat
When output is Unicode, make the translation Microsoft
Windows compatible. This only affects some symbols
in JIS-Kanji, and adding the --use-compat option is
recommended.
--little-endian
When output is Unicode, use little endian byte-
order. This is default.
--big-endian
When output is Unicode, use big endian byte-order.
--suppress-endian-mark
When output is UTF-16, do not use byte order
marking. To make UTF-8N, use this option with
--little-endian. This is off by default.
--enable-endian-mark
When output is UTF-8, output byte order marking.
This is off by default.
--input-little-endian
When input is Unicode, assume input is little
endian byte-ordered. This is default, but skf
respects byte-order mark.
--input-big-endian
When input is Unicode, assume input is big endian
byte-ordered. Note that skf respects byte-order
mark.
--endian-protect
Do not use endian mark in the input stream. Endian
mark is just discarded.
--use-replace-char
skf by default converts undefined (except 0x2xxx
part) characters into the "geta (U+3013)" code. This
option makes skf use the replacement char
(0xfffc in UCS2) instead.
--limit-to-ucs2
Do not use code points at or above 0x10000 in Unicode
(i.e. limit output to the ucs2 area).
--suppress-cjk-extension
Treat CJK extension A/B area as undefined.
--old-hangle-location
Treat the U-3400 area as Hangul (Unicode 1.0
compatibility).
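The byte order, endian mark and surrogate pair behavior described in this section can be illustrated with a short sketch. It uses Python's standard codecs rather than skf itself; `to_utf16` and `fits_ucs2` are hypothetical helpers, and their defaults merely mirror the defaults stated above for UTF-16 output.

```python
import codecs


def to_utf16(text, little_endian=True, endian_mark=True):
    """Encode text as UTF-16 with a chosen byte order and optional mark."""
    codec = "utf-16-le" if little_endian else "utf-16-be"
    mark = b""
    if endian_mark:
        mark = codecs.BOM_UTF16_LE if little_endian else codecs.BOM_UTF16_BE
    return mark + text.encode(codec)


def fits_ucs2(text):
    """True if every code point lies below 0x10000 (no surrogates needed)."""
    return all(ord(ch) < 0x10000 for ch in text)


print(to_utf16("A").hex())                        # mark, then little endian
print(to_utf16("A", little_endian=False).hex())   # mark, then big endian
# A CJK extension B code point needs a surrogate pair in UTF-16.
print(to_utf16("\U00020000", little_endian=False, endian_mark=False).hex())
print(fits_ucs2("\U00020000"))
```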
Encoding controls
--decode=`encoding scheme'
Specify the encoding scheme of the input stream.
Supported encoding schemes are `hex', `mime',
`mime_q', `mime_b' and `rot47', meaning CAP hex code,
MIME, MIME Q-encoding, MIME B-encoding and rot13/47
respectively. When MIME decoding is specified, the
base text is assumed to be EUC encoded unless
specified otherwise.
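Two of these schemes can be sketched with the Python standard library. This is an illustration, not skf's decoder: `rot47` here rotates printable ASCII only, `decode_mime_word` is a hypothetical wrapper around `email.header.decode_header`, and the sample encoded word names its charset explicitly instead of relying on the EUC default described above.

```python
import base64
from email.header import decode_header


def rot47(text):
    """Rotate printable ASCII (33..126) by 47 positions; self-inverse."""
    out = []
    for ch in text:
        cp = ord(ch)
        if 33 <= cp <= 126:
            cp = 33 + (cp - 33 + 47) % 94
        out.append(chr(cp))
    return "".join(out)


def decode_mime_word(word):
    """Decode MIME B- or Q-encoded words into a text string."""
    parts = []
    for payload, charset in decode_header(word):
        if isinstance(payload, bytes):
            payload = payload.decode(charset or "ascii")
        parts.append(payload)
    return "".join(parts)


# Build a B-encoded word from two kanji in 7-bit JIS, then decode it.
raw = "\u6f22\u5b57".encode("iso2022_jp")
word = "=?ISO-2022-JP?B?" + base64.b64encode(raw).decode("ascii") + "?="
print(word)
print(decode_mime_word(word))
print(rot47("Hello"))
```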
End of line controls
--lineend-thru
Output the end of line code as is. Also output the ^Z
code as is. This is the default.
--lineend-cr --lineend-mac
Use CR as end of line code. Also delete ^Z code
from input stream.
--lineend-lf --lineend-unix
Use LF as end of line code. Also delete ^Z code
from input stream.
--lineend-crlf --lineend-windows
Use CRLF as end of line code. Also delete ^Z code
from input stream.
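The end of line modes above can be sketched as a small byte-level filter. `convert_lineend` and its mode names are hypothetical; the sketch assumes that any of CRLF, CR or LF in the input counts as one line end and that ^Z is the byte 0x1A.

```python
import re

EOL = {"cr": b"\r", "lf": b"\n", "crlf": b"\r\n"}
ANY_EOL = re.compile(rb"\r\n|\r|\n")  # try CRLF first so it stays one unit


def convert_lineend(data, mode="thru"):
    """Rewrite line ends; 'thru' passes data (including ^Z) unchanged."""
    if mode == "thru":
        return data
    data = data.replace(b"\x1a", b"")  # drop ^Z
    return ANY_EOL.sub(EOL[mode], data)


mixed = b"a\r\nb\rc\n\x1a"
print(convert_lineend(mixed, "lf"))
print(convert_lineend(mixed, "crlf"))
```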
File controls
--filewise-detect --force-reset
Reset and re-detect input code set at the start of
each file.
--linewise-detect
Reset and re-detect input code set at the start of
each line. This option needs -DKUNIMOTO at compile
time.
Misc. Controls
--suppress-space-convert
skf by default converts an ideographic space into
two ascii spaces. This option suppresses this
behavior.
--reset
Reset all flags specified by extended controls and
given input code.
--inquiry
skf detects the input code and writes the detection
result to stdout. No filtered output is produced.
--show-filename
When inquiry(--inquiry) is on, this option adds
each file name to output. Enabled by default when
multiple input files are specified.
--invis-strip
Delete all escape sequences not belonging to
ISO-2022 code extension. This is intended to
replace invisstrip command bundled in inews pack-
age.
--html-sanitize
Convert several characters in HTML document to
entity reference expression. Specifically,
"!#$&%()/<>:;?' is escaped by entity expression.
-I Warn if input has unassigned code points.
-v print version and exit.
-h print brief help.
FILES
/usr/(local/)share/skf/lib/
where external codeset conversion tables go. The
location that the current skf assumes is shown by
the -h option.
AUTHOR
skf is written by Seiji Kaneko (skaneko@a2.mbn.or.jp),
based on ideas from nkf written by Itaru Ichikawa
(ichikawa@flab.fujitsu.co.jp). The X-0213 code table is
derived from the work of earthian@tama.or.jp.
ACKNOWLEDGEMENT
skf is inspired by works or requests by
shinoda@cs.titech, kato@cs.titech, uematsu@cs.titech,
void@global, ohta@ricoh, Hinata(HKE), Ashizawa(CRL) and
Kunimoto(SDL).
BUGS AND LIMITATIONS
1. skf can handle mixed coding with some limitations. How-
ever, code detection easily fails for mixed code, and giv-
ing explicit input code set is strongly encouraged.
In case of emergency, --linewise-detect option may help.
2. When using UCS2, UTF-16, UTF-8 and UTF-7, skf tries to
detect the input code, but giving an explicit code set is
encouraged. skf doesn't support UCS4, but does support
UTF-16/UTF-32 (i.e. surrogate pairs). skf just passes
composite characters to the output. No further processing
is performed.
3. skf implements ISO-2022 with the following exceptions:
(1) GL 0x20 is always space.
(2) if an unknown sequence is given to G[0-3], G[0-3] is
set to ascii, and locking/single shift is cleared.
(3) the standard return sequence is ignored.
(4) Sequences related to C1 and C2 are just ignored.
(5) Sequences for 96 character multibyte coding are
ignored.
4. Since skf by default tests input to detect utf7
coding, skf sometimes misdetects pure ascii text as utf7.
If this occurs, use the --no-utf7 option.
5. The coding of error output is controlled by the LOCALE
environment variables on UN*X systems. Since skf cannot
recognize that stdout and stderr are redirected into the
same stream, the user must take care of this case.
6. IBM CCSID 1394 is not supported.
7. skf-1.91 converts KEIS/JIS X-0213 code using CJK
extension B and the CJK compatibility area. For this
reason, X-0213 and KEIS conversion results vary depending
on the --use-compat and --limit-to-ucs2 switches.
8. Current external table format supports only UCS2 char-
acters.
10. JIS X-0207(1979) is not supported. JIS X-0211(1987) is
designed to be supported (i.e. common terminal control
sequence is transparently passed to output).
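The utf7 misdetection mentioned in item 4 can be reproduced in miniature. The sketch below uses Python's `utf-7` codec, not skf's detector, and `ambiguous_as_utf7` is a hypothetical helper: it reports ASCII input that is also valid UTF-7 with a different meaning.

```python
def ambiguous_as_utf7(data):
    """True if data is pure ASCII yet reads differently when taken as UTF-7."""
    try:
        as_ascii = data.decode("ascii")
        as_utf7 = data.decode("utf-7")
    except UnicodeDecodeError:
        return False
    return as_ascii != as_utf7


sample = b"1+AGE-2"  # '+AGE-' is the UTF-7 shift sequence for 'a'
print(sample.decode("ascii"))
print(sample.decode("utf-7"))
print(ambiguous_as_utf7(sample))
```

No detector can tell from the bytes alone which reading was intended, which is why an explicit switch such as --no-utf7 exists.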
Note
1. Extended options have changed extensively since skf-1.3.
Some archaic options (e.g. -B, -@ and -r) have been deleted
from this version.
2. From version 1.9, default code set assumed by skf has
changed to JIS X-0208(1990) with Microsoft Japanese Win-
dows gaiji (i.e. CP932).
3. From version 1.9, skf supports iso8859 and other
charsets by using Unicode as its internal code set. For
this reason, skf-1.9 behaves differently from earlier
versions.
4. Code autodetection is not perfect by design. If it has
failed to detect input code properly, please give input
code information explicitly.
5. Some ligatures in Unicode, cp932 gaiji and KEIS83 are
converted using JIS X-0124 and other conventions. Byte
length is not preserved during this conversion.
6. skf is intended to pass ANSI compatible terminal con-
trol code transparently, but this is not guaranteed.
7. There are some undocumented options. These options
should be considered as highly experimental.
Notice
Unicode(TM) is a trademark of Unicode, Inc. Microsoft and
Windows are registered trademarks of Microsoft Corporation.
Macintosh is a registered trademark of Apple Computer Inc.
Other names and terms may be trademarks or registered
trademarks of their respective owners. The trademark
symbol (TM) is omitted in this manual page.
09/MAY/2002 SKF(1)