ctype for sjis characters
Commit

Commit MetaInfo

Revision	bf405c21bf7602f23590b1d6a23c7b85353f8f08 (tree)
Time	2013-07-29 13:50:10
Author	Joel Matthew Rees <reiisi@user...>
Committer	Joel Matthew Rees

Log Message

Residual source issues from adding the hiragana-crypt example.

Change Summary

modified: Makefile (diff)
modified: slowsjctype.c (diff)

Diff

--- a/Makefile
+++ b/Makefile
@@ -15,25 +15,33 @@
 
 CFLAGS = -Wall
 
-sjcobjects = slowsjctype.o sjisctypetest.o showch16.o showch8.o port.o
+sjcobjects = slowsjctype.o sjisctypetest.o showch16.o showch8.o port.o hiragana-crypt.o
 
+sjcexecutables = sjisctypetest showch16 showch8 hiragana-crypt
 
-all: slowsjctype.o sjisctypetest showch16 showch8
 
+all: slowsjctype.o $(sjcexecutables)
+
+# The example:
+hiragana-crypt: hiragana-crypt.o slowsjctype.o
+
+# The library:
 slowsjctype.o: slowsjctype.h sj16bitChars.h sj8bitChars.h sjctypenv.h
 
+# The test file (by calibrated eyeball):
 sjisctypetest: sjisctypetest.o port.o slowsjctype.o sj16bitChars.h sj8bitChars.h sjctypenv.h
 
+# A helper:
 showch16: showch16.o port.o sj16bitChars.h sjctypenv.h
 
+# Another helper
 showch8: showch8.o port.o
 
+# For making working on the old (classic, pre-Mac OS X) Macintosh livable:
 port.o: port.h
 
 
-
-
 .PHCLEAN: sjcclean
 sjcclean:
-	-rm $(sjcobjects)
+	-rm $(sjcobjects) $(sjcexecutables)
 
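
The functions in the slowsjctype.c diff below report a byte count (2 for a two-byte character, 1 for a one-byte character, 0 for no match) rather than a boolean. The following is a minimal sketch of how a caller might use that convention to step through a NUL-terminated Shift-JIS buffer. The names is_lead, is_trail, is_single, guess_count, and count_chars are hypothetical stand-ins written for this illustration and are not part of the library; only the byte ranges are taken from slowsjIsPHighByte, slowsjIsPLowByte, and slowsjIsPOneByte.

```c
#include <stddef.h>

/* Hypothetical stand-ins for illustration only.
 * The ranges mirror slowsjIsPHighByte, slowsjIsPLowByte and slowsjIsPOneByte. */
int is_lead(unsigned char b)
{
    return (b >= 0x81 && b <= 0x9f) || (b >= 0xe0 && b <= 0xfc);
}

int is_trail(unsigned char b)
{
    return b >= 0x40 && b <= 0xfc && b != 0x7f;
}

int is_single(unsigned char b)
{
    return b < 0x80 || (b >= 0xa1 && b <= 0xdf);
}

/* Returns 2, 1 or 0, following the same convention as slowsjPGuessCount.
 * The second byte is only read when the first is a lead byte, so a
 * NUL-terminated buffer is never read past its terminator. */
int guess_count(const char *p)
{
    const unsigned char *u = (const unsigned char *)p;

    if (is_lead(u[0]) && is_trail(u[1]))
        return 2;
    return is_single(u[0]) ? 1 : 0;
}

/* Counts characters in a NUL-terminated buffer, advancing by the byte
 * count each test returns. Returns -1 at the first byte sequence that
 * is neither a valid one-byte nor a valid two-byte character. */
long count_chars(const char *s)
{
    long n = 0;

    while (*s != '\0') {
        int len = guess_count(s);
        if (len == 0)
            return -1;
        s += len;
        n++;
    }
    return n;
}
```

Given the bytes 0x41 0x82 0xA0 0x42, count_chars would report three characters, because 0x82 0xA0 is consumed as a single two-byte unit.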

--- a/slowsjctype.c

+++ b/slowsjctype.c

@@ -1 +1 @@
-/* slowsjctype.c v00.00.01.jmr // Near-ctype functions for shift-JIS characters, slow version. // Written by Joel Matthew Rees, Amagasaki, Hyogo, Japan, beginning April 2001. // joel_rees@sannet.ne.jp // // Shifting strategy for usability in current C environments: // pass char pointers instead of unsigned char pointers. // Also, adding P to names to emphasize pointer usage. // // Copyright 2000, 2001 Joel Matthew Rees. // All rights reserved. // // Assignment of Stewardship, or Terms of Use: // // The author grants permission to use and/or redistribute the code in this // file, in either source or translated form, under the following conditions: // 1. When redistributing the source code, the copyright notices and terms of // use must be neither removed nor modified. // 2. When redistributing in a form not generally read by humans, the // copyright notices and terms of use, with proper indication of elements // covered, must be reproduced in the accompanying documentation and/or // other materials provided with the redistribution. In addition, if the // source includes statements designed to compile a copyright notice // into the output object code, the redistributor is required to take // such steps as necessary to preserve the notice in the translated // object code. // 3. Modifications must be annotated, with attribution, including the name(s) // of the author(s) and the contributor(s) thereof, the conditions for // distribution of the modification, and full indication of the date(s) // and scope of the modification. Rights to the modification itself // shall necessarily be retained by the author(s) thereof. // 4. These grants shall not be construed as an assignment or assumption of // liability of any sort or to any degree. Neither shall these grants be // construed as endorsement or represented as such. 
Any party using this // code in any way does so under the agreement to entirely indemnify the // author and any contributors concerning the code and any use thereof. // Specifically, THIS SOFTWARE IS PROVIDED AT NO COST, AS IT IS, WITHOUT // ANY EXPRESS OR IMPLIED WARRANTY OF ANY SORT, INCLUDING, BUT NOT LIMITED // TO, WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. // UNDER NO CIRCUMSTANCES SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE FOR // ANY DAMAGES WHATSOEVER ARISING FROM ITS USE OR MISUSE, EVEN IF ADVISED // OF THE EXISTENCE OF THE POSSIBILITY OF SUCH DAMAGE. // 5. This code should not be used for any illegal or immoral purpose, // including, but not limited to, the theft of property or services, // deliberate communication of false information, the distribution of drugs // for purposes other than medical, the distribution of pornography, the // provision of illicit sexual services, the maintenance of oppressive // governments or organizations, or the imposture of false religion and // false science. // Any illegal or immoral use incurs natural and legal penalties, which the // author invokes in full force upon the heads of those who so use it. // 6. Alternative redistribution arrangements: // a. If the above conditions are unacceptable, redistribution under the // following commonly used public licenses is expressly permitted: // i. The GNU General Public License (GPL) of the Free Software // Foundation. // ii. The Perl Artistic License, only as a part of Perl. // iii. The Apple Public Source License, only as a part of Darwin or // a Macintosh Operating System using Darwin. // b. No other alternative redistribution arrangement is permitted. // (The original author reserves the right to add to this list.) // c. When redistributing this code under an alternative license, the // specific license being invoked shall be noted immediately beneath // the body of the terms of use. 
The terms of the license so specified // shall apply only to the redistribution of the source so noted. // 7. In no case shall the rights of the original author to the original work // be impaired by any distribution or redistribution arrangement. // // End of the Assignment of Stewardship, or terms of use. // // License invoked: Assignment of Stewardship. // Notes concerning license: // Compiler directives are strongly encouraged as a means of meeting // the attribution requirements in the Assignment of Stewardship. */ /* Primary references for the ranges chosen below: // // Character palette from Apple's Kotoeri input method, systems 7/8/9. // Publisher: Apple, included with Apple's Macintosh operating systems. // The character palettes since sys. 8.0 or 8.1 have included primary pronunciations, // as well as JIS, kuten, and UNICODE assignments, in a detailed view. // Since at least sys. 8.5 or 8.6, a flag appears when a non-standard character is selected. // Newer versions track the changes to the various standards. // // Pasokon/Waapuro Kanji Jiten, 1987 Edition // Compiler: Tsutomu Uegaki; Publisher: Natsume-sha (Chiyouda-ku). // Lists and tables of Kanji and other JIS characters and character codes. // Contains a nice rectanglular arrangement of Kanji on pages 588-599. // // Waapuro/Pasokon Saishin Kanji Jiten, 1st Edition (1994) // Compiler: Shougakukan Dictionary Editors Department; // Publisher: Shougakukan (Chiyouda-ku). // Lists and tables of Kanji and other JIS characters and character codes. // Includes a list of the proposed annex characters, with annex numbers. // The annex characters have been assigned actual codes since this edition was published. // // Pasokon Yougo Jiten, 1992-93 Edition // Authors: Shigeru Okamoto, Ichirou Senba, Yoshiaki Nakamura, Kazuko Takahashi; // Publisher: Gijutsu Hyouron-sha (Shinjuku-ku). // Dictionary of personal computer terminology, // particularly referenced the JIS/ISO/ANSI 8-bit character tables starting page 409. 
*/ #include "sjctypenv.h" #include "sj8bitChars.h" #include "sj16bitChars.h" #include "slowsjctype.h" /* Because char is probably signed, // it is usually liable to induce errors to use escaped char constant notation. // '\x80' may well be something like 0xffffff80, rather than 0x80. // Hopefully, I have been consistent about this. <erg/> // Note the problems when comparing a char variable with a character constant: // char scan; . . . while ( scan <= 0x9f ) // will produce an infinite loop, which is probably not the desired effect. // 0x9f is an integer equal to decimal 159. // '\x9f' is a char and promotes to integer with sign extension: // ( -( 256 - 159 ) ) == ( -97 ) // Two's complement. // . . . while ( scan <= 'x9f' ) // will probably produce the desired result, but by an un-expected calculation. // For instance, // scan = 0x9e; if ( scan < '\x9f' ) // yields true because -98 is less than -97, not because 158 is less than 159. // I tend to forget which is which in the middle of loops, // so I usually use long integers in loops (which is a good idea anyway) // and avoid comparing to integer constants. // This is also a reason I use symbolic constants instead of directly using characters. // // This shows one of the many reasons for having some means of dialect control, // instead of constraining the one-and-only standard in ways that turn out to be non-optimal. */ /* Cleared the unwanted dependency on sjctypenv.h (bool) -- JMR2001.05.31 // This required changing the bool typed functions to int typed functions, as noted below. // This mod by Joel Matthew Rees, released under original terms of use. 
*/ int slowsjIsPOneByte( char * chp ) /* changed from bool to int JMR2001.05.31 */ { int b = * ( (ubyte *) chp ); return b < 0x80 || ( b >= 0xa1 && b <= 0xdf ); } int slowsjIsPHighByte( char * chp ) /* changed from bool to int JMR2001.05.31 */ { int bHi = ( (ubyte *) chp )[ 0 ]; return ( bHi >= 0x81 && bHi <= 0x9f ) || ( bHi >= 0xe0 && bHi <= 0xfc ); } int slowsjIsPLowByte( char * chp ) /* changed from bool to int JMR2001.05.31 */ { int bLo = * ( (ubyte *) chp ); return bLo >= 0x40 && bLo <= 0xfc && bLo != 0x7f; } int slowsjIsP7bit( char * chp ) /* changed from bool to int JMR2001.05.31 */ { int bLo = * ( (ubyte *) chp ); return bLo < 0x80; } int slowsjPGuessCount( char * chp ) { return ( slowsjIsPHighByte( chp ) && slowsjIsPLowByte( chp + 1 ) ) ? 2 : slowsjIsPOneByte( chp ) ? 1 : 0; } int slowsjIsPCntrl( char * chp ) { int uch = (ubyte) chp[ 0 ]; return ( uch <= 0x1f || uch == 0x7f ) ? 1 : 0; /* DEL added JMR2001.05.23 */ /* The standard doesn't know for unit separator. */ } int slowsjIsPSpace( char * chp ) { ubyte * uchp = (ubyte *) chp; switch ( * uchp ) { case b7_HT: case b7_LF: case b7_VT: case b7_FF: case b7_CR: case b7_SP: return 1; default: return ( uchp[ 0 ] == b16_SP[ 0 ] && uchp[ 1 ] == b16_SP[ 1 ] ) ? 2 : 0; /* 0x8140 is sjis 2-byte space */ } } int slowsjIsPDigit( char * chp ) { ubyte * uchp = (ubyte *) chp; int b = * uchp; if ( b == b16_ZERO[ 0 ] ) { b = uchp[ 1 ]; return ( b >= b16_ZERO[ 1 ] && b <= b16_NINE[ 1 ] ) ? 2 : 0; } else { return ( b >= b7_ZERO && b <= b7_NINE ) ? 1 : 0; } } int slowsjIsPXDigit( char * chp ) { ubyte * uchp = (ubyte *) chp; int b = * uchp; if ( b == b16_A[ 0 ] ) { b = uchp[ 1 ]; return ( ( b >= b16_A[ 1 ] && b <= b16_F[ 1 ] ) || ( b >= b16_a[ 1 ] && b <= b16_f[ 1 ] ) ) ? 2 : slowsjIsPDigit( chp ); } else { return ( ( b >= b7_A && b <= b7_F ) || ( b >= b7_a && b <= b7_f ) ) ? 
1 : slowsjIsPDigit( chp ); } } int slowsjIsPRomanLower( char * chp ) { ubyte * uchp = (ubyte *) chp; int b = * uchp; if ( b == b16_a[ 0 ] ) { b = uchp[ 1 ]; return ( b >= b16_a[ 1 ] && b <= b16_z[ 1 ] && b != 0x7f ) ? 2 : 0; } else { return ( b >= b7_a && b <= b7_z ) ? 1 : 0; } } int slowsjIsPRomanUpper( char * chp ) { ubyte * uchp = (ubyte *) chp; int b = * uchp; if ( b == b16_A[ 0 ] ) { b = uchp[ 1 ]; return ( b >= b16_A[ 1 ] && b <= b16_Z[ 1 ] && b != 0x7f ) ? 2 : 0; } else { return ( b >= b7_A && b <= b7_Z ) ? 1 : 0; } } /* Time biased against upper case, but we don't care on the slow version. */ int slowsjIsPRoman( char * chp ) { int result = slowsjIsPRomanLower( chp ); if ( result == 0 ) result = slowsjIsPRomanUpper( chp ); return result; } int slowsjIsPGreekLower( char * chp ) { ubyte * uchp = (ubyte *) chp; int b = uchp[ 1 ]; return ( ( * uchp == b16_alpha[ 0 ] ) && ( b >= b16_alpha[ 1 ] && b <= b16_omega[ 1 ] && b != 0x7f ) )? 2 : 0; } int slowsjIsPGreekUpper( char * chp ) { ubyte * uchp = (ubyte *) chp; int b = uchp[ 1 ]; return ( ( * uchp == b16_ALPHA[ 0 ] ) && ( b >= b16_ALPHA[ 1 ] && b <= b16_OMEGA[ 1 ] && b != 0x7f ) ) ? 2 : 0; } /* Time biased against upper case, but we don't care on the slow version. */ int slowsjIsPGreek( char * chp ) { int result = slowsjIsPGreekLower( chp ); if ( result == 0 ) slowsjIsPGreekUpper( chp ); return result; } int slowsjIsPRussianLower( char * chp ) { ubyte * uchp = (ubyte *) chp; int b = uchp[ 1 ]; return ( ( * uchp == b16_Russian_a[ 0 ] ) && ( b >= b16_Russian_a[ 1 ] && b <= b16_Russian_ya[ 1 ] && b != 0x7f ) )? 2 : 0; } int slowsjIsPRussianUpper( char * chp ) { ubyte * uchp = (ubyte *) chp; int b = uchp[ 1 ]; return ( ( * uchp == b16_Russian_A[ 0 ] ) && ( b >= b16_Russian_A[ 1 ] && b <= b16_Russian_YA[ 1 ] && b != 0x7f ) ) ? 2 : 0; } /* Time biased against upper case, but we don't care on the slow version. 
*/ int slowsjIsPRussian( char * chp ) { int result = slowsjIsPRussianLower( chp ); if ( result == 0 ) slowsjIsPRussianUpper( chp ); return result; } /* Time biased against Greek and Russian, but we don't care on the slow version. */ int slowsjIsPUpper( char * chp ) { int result = slowsjIsPRomanUpper( chp ); if ( result == 0 ) result = slowsjIsPGreekUpper( chp ); if ( result == 0 ) result = slowsjIsPRussianUpper( chp ); return result; } /* Time biased against Greek and Russian, but we don't care on the slow version. */ int slowsjIsPLower( char * chp ) { int result = slowsjIsPRomanLower( chp ); if ( result == 0 ) result = slowsjIsPGreekLower( chp ); if ( result == 0 ) result = slowsjIsPRussianLower( chp ); return result; } /* Time biased against Greek and Russian, but we don't care on the slow version. */ int slowsjIsPEurAsianAlpha( char * chp ) { int result = slowsjIsPRoman( chp ); if ( result == 0 ) result = slowsjIsPGreek( chp ); if ( result == 0 ) result = slowsjIsPRussian( chp ); return result; } int slowsjIsPQuasiEurAsianAlpha( char * chp ) { ubyte * uchp = (ubyte *) chp; int b = * uchp; if ( b == b16_AccentAcute_Prime[ 0 ] ) { b = uchp[ 1 ]; return ( b == b16_AccentAcute_Prime[ 1 ] || b == b16_AccentGrave[ 1 ] || b == b16_Umlaut[ 1 ] || b == b16_AccentCircumflex[ 1 ] || b == b16_Overline_Negate[ 1 ] || b == b16_QuarterDash_Hyphen[ 1 ] || b == b16_WavyDash_Tilde[ 1 ] ) ? 2 : 0; } else { return ( b == b7_HYPHEN || b == b7_ACCENTGRAVE || b == b7_TILDE || b == b7_CARET ) ? 1 : 0; } } int slowsjIsPHiragana( char * chp ) { ubyte * uchp = (ubyte *) chp; int b = uchp[ 1 ]; return ( ( * uchp == b16_hiraganaSub_a[ 0 ] ) && ( b >= b16_hiraganaSub_a[ 1 ] && b <= b16_hiragana_ng[ 1 ] && b != 0x7f ) )? 2 : 0; } int slowsjIsPKatakana( char * chp ) { ubyte * uchp = (ubyte *) chp; int b = * uchp; if ( b == b16_katakanaSub_a[ 0 ] ) { b = uchp[ 1 ]; return ( b >= b16_katakanaSub_a[ 1 ] && b <= b16_katakanaSub_ke[ 1 ] && b != 0x7f ) ? 
2 : 0; } else { return ( ( b >= b8_katakana_wo && b <= b8_katakanaSub_tu ) || ( b >= b8_katakana_a && b <= b8_katakana_ng ) ) ? 1 : 0; } } /* Time biased against katakana, but we don't care on the slow version. */ int slowsjIsPKana( char * chp ) { int result = slowsjIsPHiragana( chp ); if ( result == 0 ) result = slowsjIsPKatakana( chp ); return result; } int slowsjIsPQuasiKana( char * chp ) { ubyte * uchp = (ubyte *) chp; int b = * uchp; if ( b == b16_DakuTen[ 0 ] ) { b = uchp[ 1 ]; return ( b == b16_DakuTen[ 1 ] || b == b16_HanDakuTen[ 1 ] || b == b16_KatakanaRepeat[ 1 ] || b == b16_KatakanaRepeatVoiced[ 1 ] || b == b16_HiraganaRepeat[ 1 ] || b == b16_HiraganaRepeatVoiced[ 1 ] || b == b16_ChoOn[ 1 ] ) ? 2 : 0; } else { return ( b == b8_ChoOn || b == b8_DakuTen || b == b8_HandakuTen ) ? 1 : 0; } } /* This has even time-bias for JIS level 1. */ int slowsjIsPKanji( char * chp ) { ubyte * uchp = (ubyte *) chp; int bHi = uchp[ 0 ]; int bLo = uchp[ 1 ]; if ( slowsjIsPHighByte( chp ) && slowsjIsPLowByte( chp + 1 ) && ( ( bHi == b16_kanji1Low_a[ 0 ] && bLo >= b16_kanji1Low_a[ 1 ] ) || ( bHi > b16_kanji1Low_a[ 0 ] && bHi < b16_kanji1High_ude[ 0 ] ) || ( bHi == b16_kanji1High_ude[ 0 ] && bLo <= b16_kanji1High_ude[ 1 ] ) || ( bHi == b16_kanji2aLow_ichi[ 0 ] && bLo >= b16_kanji2aLow_ichi[ 1 ] ) || ( bHi > b16_kanji2aLow_ichi[ 0 ] && bHi <= b16_kanji2aHigh_jou[ 0 ] ) /* The rows at the end of 2a and beginning of 2b are complete. */ || ( bHi >= b16_kanji2bLow_you[ 0 ] && bHi <= b16_kanji2bHigh_hikaru[ 0 ] ) || ( bHi == b16_kanji2bHigh_hikaru[ 0 ] && bLo <= b16_kanji2bHigh_hikaru[ 1 ] ) ) ) return 2; else return 0; } /* This is completely time-biased against kanji, and a little harder to mentally verify. 
{ ubyte * uchp = (ubyte *) chp; int bHi = uchp[ 0 ]; int bLo = uchp[ 1 ]; if ( !slowsjIsPHighByte( chp ) || !slowsjIsPLowByte( chp + 1 ) || bHi < b16_kanji1Low_a_sub[ 0 ] || ( bHi == b16_kanji1Low_a_sub[ 0 ] && bLo < b16_kanji1Low_a_sub[ 1 ] ) || ( bHi == b16_kanji1High_ude_arm[ 0 ] && bLo > b16_kanji1High_ude_arm[ 1 ] && bLo < b16_kanji2aLow_ichi_formalOne[ 1 ] ) || ( bHi > b16_kanji2aHigh_ude_arm[ 0 ] && bHi < b16_kanji2bLow_yo_e040[ 0 ] ) || ( bHi == b16_kanji2bHigh_hikaru_eaa4[ 0 ] && bLo > b16_kanji2bHigh_hikaru_eaa4[ 1 ] ) || bHi > b16_kanji2bHigh_hikaru_eaa4[ 0 ] ) return 0; else return 2; } */ int slowsjIsPQuasiKanji( char * chp ) { ubyte * uchp = (ubyte *) chp; int b = uchp[ 1 ]; return ( ( * uchp == b16_KanjiIbid[ 0 ] ) && ( b >= b16_KanjiIbid[ 1 ] /* This might be a proper Kanji? */ || b <= b16_Ditto[ 1 ] /* Should this be only with European mods? */ || b <= b16_Shime[ 1 ] /* Probably not Kanji? */ || b <= b16_KanjiZero[ 1 ] /* Should this be Kanji? */ || b <= b16_OpenCircle_Maru[ 1 ] /* Often used as fill-in-th-blank. */ || b <= b16_KanjiRepeat[ 1 ] ) )? 2 : 0; } /* Run-time bias against everybody. // Should give fairly even timing in general use // and give best timing for generating tables. */ int slowsjIsPAlpha( char * chp ) { int result = slowsjIsPKanji( chp ); if ( result == 0 ) result = slowsjIsPKana( chp ); if ( result == 0 ) result = slowsjIsPEurAsianAlpha( chp ); return result; } /* Use the same bias as alpha, just to be obnoxious. */ int slowsjIsPQuasiAlpha( char * chp ) { int result = slowsjIsPQuasiKanji( chp ); if ( result == 0 ) result = slowsjIsPQuasiKana( chp ); if ( result == 0 ) result = slowsjIsPQuasiEurAsianAlpha( chp ); return result; } /* Bias? What bias? */ int slowsjIsPAlNum( char * chp ) { int result = slowsjIsPDigit( chp ); if ( result == 0 ) result = slowsjIsPAlpha( chp ); return result; } /* Bias? What bias? 
*/ int slowsjIsPAlNumQuasi( char * chp ) { int result = slowsjIsPQuasiAlpha( chp ); if ( result == 0 ) result = slowsjIsPAlNum( chp ); return result; } int slowsjIsPLineDraw( char * chp ) { ubyte * uchp = (ubyte *) chp; int b = uchp[ 1 ]; return ( ( * uchp == b16_LineDraw_1H[ 0 ] ) && ( b >= b16_LineDraw_1H[ 1 ] && b <= b16_LineDraw_1H2V[ 1 ] && b != 0x7f ) )? 2 : 0; } int slowsjIsPPunct( char * chp ) { ubyte * uchp = (ubyte *) chp; int b = * uchp; if ( b == b16_ToTen[ 0 ] ) /* Nice of the JIS comittee to put them all together. */ { b = uchp[ 1 ]; return ( b != 0x7f /* Check and excuse later */ && ( ( b >= b16_ToTen[ 1 ] && b <= b16_Geta[ 1 ] ) || ( b >= b16_Element[ 1 ] && b <= b16_Intersection[ 1 ] ) || ( b >= b16_Conjunction_And[ 1 ] && b <= b16_Exists[ 1 ] ) || ( b >= b16_Angle[ 1 ] && b <= b16_DoubleIntegral[ 1 ] ) || ( b >= b16_Angstrom[ 1 ] && b <= b16_Paragraph[ 1 ] ) || ( b == b16_CompositionCircle[ 1 ] ) ) ) ? 2 : 0; } else { return ( ( b >= b7_EXCLAIM && b <= b7_SLASH ) || ( b >= b7_COLON && b <= b7_ATEACH ) || ( b >= b7_LEFTBRACKET && b <= b7_ACCENTGRAVE ) || ( b >= b7_LEFTBRACE && b <= b7_TILDE ) || ( b >= b8_Kuten && b <= b8_ChuTen ) || ( b == b8_ChoOn ) || ( b >= b8_DakuTen && b <= b8_HandakuTen ) ) ? 1 : 0; } } int slowsjIsPGraph( char * chp ) { int result = slowsjIsPAlNum( chp ); if ( result == 0 ) result = slowsjIsPPunct( chp ); return result; } int slowsjIsPPrint( char * chp ) { ubyte * uchp = (ubyte *) chp; if ( * uchp == b7_SP ) return 1; else if ( uchp[ 0 ] == b16_SP[ 0 ] && uchp[ 1 ] == b16_SP[ 1 ] ) return 2; else return slowsjIsPGraph( chp ); } /* Macro to isprint() works just fine because there are no two-byte control characters. int slowsjIsP2Byte( char * chp ) {} */ /* ToLower/Upper will have to test the 7f gap specifically for each range that suffers it. // Some are entirely above and some entirely below. // JIS Roman/Greek/Russian doesn't include any caseless characters in my materials. 
// But if they did I could test the converted character for validity before returning it. // Just for fun, I'll include the test anyway. */ int slowsjPToLowerRoman( char * chpin, char * chpout ) { int count = slowsjIsPRomanUpper( chpin ); char temp[ 4 ] = { 0 }; switch ( count ) { case 1: temp[ 0 ] = chpin[ 0 ] + ( b7_a - b7_A ); break; case 2: temp[ 0 ] = chpin[ 0 ]; temp[ 1 ] = chpin[ 1 ] + ( b16_a[ 1 ] - b16_A[ 1 ] ); /* No gap */ break; } if ( count > 0 && slowsjIsPRomanLower( temp ) == count ) { chpout[ 0 ] = temp[ 0 ]; if ( count > 1 ) chpout[ 1 ] = temp[ 1 ]; } else count = 0; return count; } int slowsjPToUpperRoman( char * chpin, char * chpout ) { int count = slowsjIsPRomanLower( chpin ); char temp[ 4 ] = { 0 }; switch ( count ) { case 1: temp[ 0 ] = chpin[ 0 ] - ( b7_a - b7_A ); break; case 2: temp[ 0 ] = chpin[ 0 ]; temp[ 1 ] = chpin[ 1 ] - ( b16_a[ 1 ] - b16_A[ 1 ] ); /* No gap */ break; } if ( count > 0 && slowsjIsPRomanUpper( temp ) == count ) { chpout[ 0 ] = temp[ 0 ]; if ( count > 1 ) chpout[ 1 ] = temp[ 1 ]; } else count = 0; return count; } int slowsjPToLowerGreek( char * chpin, char * chpout ) { int count = slowsjIsPGreekUpper( chpin ); char temp[ 4 ] = { 0 }; if ( count == 2 ) { temp[ 0 ] = chpin[ 0 ]; temp[ 1 ] = chpin[ 1 ] + ( b16_alpha[ 1 ] - b16_ALPHA[ 1 ] ); /* No gap */ } if ( count == 2 && slowsjIsPGreekLower( temp ) == count ) { chpout[ 0 ] = temp[ 0 ]; chpout[ 1 ] = temp[ 1 ]; } else count = 0; return count; } int slowsjPToUpperGreek( char * chpin, char * chpout ) { int count = slowsjIsPGreekLower( chpin ); char temp[ 4 ] = { 0 }; if ( count == 2 ) { temp[ 0 ] = chpin[ 0 ]; temp[ 1 ] = chpin[ 1 ] - ( b16_alpha[ 1 ] - b16_ALPHA[ 1 ] ); /* No gap */ } if ( count == 2 && slowsjIsPGreekUpper( temp ) == count ) { chpout[ 0 ] = temp[ 0 ]; chpout[ 1 ] = temp[ 1 ]; } else count = 0; return count; } int slowsjPToLowerRussian( char * chpin, char * chpout ) { int count = slowsjIsPRussianUpper( chpin ); char temp[ 4 ] = { 0 }; if ( count == 2 ) { temp[ 0 ] = 
chpin[ 0 ]; temp[ 1 ] = chpin[ 1 ] + ( b16_Russian_a[ 1 ] - b16_Russian_A[ 1 ] ); if ( temp[ 1 ] >= 0x7f ) /* Adjust for the gap. */ temp[ 1 ] += 1; } if ( count == 2 && slowsjIsPRussianLower( temp ) == count ) { chpout[ 0 ] = temp[ 0 ]; chpout[ 1 ] = temp[ 1 ]; } else count = 0; return count; } int slowsjPToUpperRussian( char * chpin, char * chpout ) { int count = slowsjIsPRussianLower( chpin ); /* Checks the gap. */ char temp[ 4 ] = { 0 }; if ( count == 2 ) { temp[ 0 ] = chpin[ 0 ]; temp[ 1 ] = chpin[ 1 ] - ( b16_Russian_a[ 1 ] - b16_Russian_A[ 1 ] ); if ( chpin[ 1 ] > 0x7f ) /* Adjust for the gap (0x7f already filtered above). */ temp[ 1 ] -= 1; } if ( count == 2 && slowsjIsPRussianUpper( temp ) == count ) { chpout[ 0 ] = temp[ 0 ]; chpout[ 1 ] = temp[ 1 ]; } else count = 0; return count; } /* Again, time-biased in favor of the most likely. (Russian and Greek are not as commonly used.) // Would be faster to test directly, but that increases logical coupling // (increases the chance for algorithmic errors). // Reducing errors is a higher priority than speed. */ int slowsjPToLower( char * chpin, char * chpout ) { int count = slowsjPToLowerRoman( chpin, chpout ); if ( count == 0 ) count = slowsjPToLowerGreek( chpin, chpout ); if ( count == 0 ) count = slowsjPToLowerRussian( chpin, chpout ); return count; } int slowsjPToUpper( char * chpin, char * chpout ) { int count = slowsjPToUpperRoman( chpin, chpout ); if ( count == 0 ) count = slowsjPToUpperGreek( chpin, chpout ); if ( count == 0 ) count = slowsjPToUpperRussian( chpin, chpout ); return count; } /* ToLower/Upper will have to test the 7f gap specifically for each range that suffers it. Some are entirely above and some entirely below. JIS Roman/Greek/Russian doesn't include caseless. For converting katakana to hiragana, I can test whether the result is valid before returning it. 
int slowsjToUpper( unsigned char * mbcin, unsigned char * mbcout ) Converts cased word forming characters to upper case, including 8 bit JIS-Latin; 16 bit JIS-Roman, -Greek, and -Russian characters, but no kanji. Returns byte count converted or zero. */ /* So, the initial, standard function headers: int slowsjIsCntrl( unsigned char * mbc ) As near as I can tell, all one byte, between 0 and 0x1f, inclusive. Returns byte count. int slowsjIsSpace( unsigned char * mbc ) Adds one two byte version of the space character. Returns byte count. int slowsjIsPrint( unsigned char * mbc ) All graphic characters, including non-control space characters. Returns byte count. int slowsjIsGraph( unsigned char * mbc ) All graphic non-space characters. Returns byte count. int slowsjIsPunct( unsigned char * mbc ) All non-word-forming characters. Will later be subdivided for the richer JIS set. Returns byte count. int slowsjIsDigit( unsigned char * mbc ) The standard digits 0..9, as specified in ANSI/ISO ctype. Includes both one and two byte digits. Does not include kanji numbers. Returns byte count. int slowsjIsXDigit( unsigned char * mbc ) The standard hexadecimal digits specified in ANSI/ISO ctype. Includes both one and two byte digits. Does not include kanji numbers. Returns byte count. int slowsjIsAlpha( unsigned char * mbc ) Characters used to form words, as used by non-programmers. Does not include the standard decimal digits, but does include the kanji numbers. Includes a lot of caseless characters, of course. Returns byte count. int slowsjIsAlNum( unsigned char * mbc ) Characters used to form words, as used by programmers, thus including digits. Returns byte count. int slowsjIsUpper( unsigned char * mbc ) Upper cased characters, includes 8 bit JIS-Latin; 16 bit JIS-Roman, -Greek, and -Russian characters, but no kanji. Returns byte count. 
int slowsjIsLower( unsigned char * mbc ) Lower cased characters, includes 8 bit JIS-Latin; 16 bit JIS-Roman, -Greek, and -Russian characters, but no kanji. Returns byte count. int slowsjToLower( unsigned char * mbcin, unsigned char * mbcout ) Converts cased word forming characters to lower case, including 8 bit JIS-Latin; 16 bit JIS-Roman, -Greek, and -Russian characters, but no kanji. Returns byte count converted or zero. int slowsjToUpper( unsigned char * mbcin, unsigned char * mbcout ) Converts cased word forming characters to upper case, including 8 bit JIS-Latin; 16 bit JIS-Roman, -Greek, and -Russian characters, but no kanji. Returns byte count converted or zero. int slowsjIs1Byte( unsigned char * mbc ) Valid one byte character. Returns byte count. int slowsjIs2Byte( unsigned char * mbc ) Valid two byte character? Returns byte count. int slowsjCouldBe2Byte( unsigned char * mbc ) A combination of valid lead byte and valid tail byte? Returns byte count. The second, or fast version slowsjIsXX() functions will use constants of the pattern slowsjIsXX_k. The constants and the general call will also be provided in the source header, as mentioned above, for optimization: int slowsjCType( unsigned long type, unsigned char * mbc ) Test the type formed by the bit-or of the type constants passed as the first parameter. Returns byte count on test true or zero on test false. The initial slow version functions will have names of the pattern slow_slowsjIsXX() so they can co-exist during debugging. slowsjrIsXX()? Now, some of the foreseeable necessary extensions: int slowsjIsMath( unsigned char * mbc ) The plethora of math and logic symbols in JIS. Returns byte count. int slowsjIsUnit( unsigned char * mbc ) The plethora of unit symbols in JIS, but not system specific extensions like m2. Does not include kanji. Returns byte count. int slowsjIsQuote( unsigned char * mbc ) The plethora of quoting and parenthetic characters in JIS. Returns byte count. 
int slowsjIsKanji( unsigned char * mbc ) All the proper kanji characters. Returns byte count. int isNumberKanji( unsigned char * mbc ) All the number kanji, including the special ones used, for example, on currency and bank notes. Returns byte count. int slowsjIsKana( unsigned char * mbc ) All the katakana and hiragana characters, including the one byte katakana. Also including the free-standing voicing and plosive symbols, dakuten and handakuten. Returns byte count. int slowsjIsKata( unsigned char * mbc ) All the katakana, including the SJIS one byte katakana, but not the free-standing voicing and plosive symbols, dakuten and handakuten. Returns byte count. int slowsjIsHira( unsigned char * mbc ) All the hiragana, not including the free-standing voicing and plosive symbols, dakuten and handakuten. Returns byte count. int slowsjToKata( unsigned char * mbcin, unsigned char * mbcout ) Converts hiragana to katakana. Returns byte count converted or zero. int slowsjToHira( unsigned char * mbcin, unsigned char * mbcout ) Converts katakana to hiragana, where possible. Moves the unconvertable katakana as they are. Does not convert the one byte katakana. Returns byte count converted or zero. int slowsjTo16Kata( unsigned char * mbcin, unsigned char * mbcout ) Converts the one byte katakana to two byte katakana. Round trip slowsjTo16Kata() -> slowsjTo8Kata() should be guaranteeable. Returns byte count converted or zero. int slowsjTo8Kata( unsigned char * mbcin, unsigned char * mbcout ) Converts two byte katakana to one byte katakana, where possible. Round trip slowsjTo8Kata() -> slowsjTo16Kata() may be guaranteeable, I'm not sure yet. Returns byte count converted or zero. Some of the hypothetical extensions: int slowsjIsMusic( unsigned char * mbc ) The music symbols in JIS. Returns byte count. int slowsjIsKanjiUnit( unsigned char * mbc ) The kanji version of units, including also ten, hundred, thousand, ten-thousand, etc. Returns byte count. 
int slowsjIsRoman( unsigned char * mbc ) All the JIS Roman (two byte Latin) characters. Returns byte count. int slowsjIsGreek( unsigned char * mbc ) All the JIS Greek characters. Returns byte count. int slowsjIsRussian( unsigned char * mbc ) All the JIS Russian characters. Returns byte count. int slowsjIsLatin( unsigned char * mbc ) All the Latin characters, including the two byte Roman (Latin) and one byte Latin. Returns byte count. int slowsjToRoman( unsigned char * mbcin, unsigned char * mbcout ) Convert one byte Latin to two byte JIS Roman (Latin). Returns byte count converted or zero. int slowsjToLatin( unsigned char * mbcin, unsigned char * mbcout ) Convert two byte JIS Roman (Latin) to one byte Latin. Returns byte count converted or zero. */
\ No newline at end of file
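
The comment near the top of the removed source explains why comparing a plain char against an integer constant such as 0x9f misbehaves where char is signed. The following is a small sketch of that pitfall and of one common way around it (converting to unsigned char before comparing). The function names byte_value, naive_le_9f, and safe_le_9f are hypothetical and are not part of the library.

```c
#include <limits.h>

/* Hypothetical helpers for illustration only. */

/* Byte value in 0..255, whether or not plain char is signed. */
int byte_value(char c)
{
    return (unsigned char)c;
}

/* The pitfall: the char is widened to int first. Where char is signed,
 * bytes 0x80..0xff widen to negative values, so every possible char
 * compares as <= 0x9f and a loop on this condition never ends. */
int naive_le_9f(char c)
{
    int widened = c;   /* sign-extends where char is signed */
    return widened <= 0x9f;
}

/* The workaround: compare the unsigned byte value instead. */
int safe_le_9f(char c)
{
    return byte_value(c) <= 0x9f;
}
```

With the byte 0xe0, safe_le_9f reports false on any platform, while naive_le_9f reports true exactly where CHAR_MIN is negative, that is, where plain char is signed.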
+/* slowsjctype.c v00.00.01.jmr // Near-ctype functions for shift-JIS characters, slow version. // Written by Joel Matthew Rees, Amagasaki, Hyogo, Japan, beginning April 2001. // joel_rees@sannet.ne.jp // // Shifting strategy for usability in current C environments: // pass char pointers instead of unsigned char pointers. // Also, adding P to names to emphasize pointer usage. // // Copyright 2000, 2001 Joel Matthew Rees. // All rights reserved. // // Assignment of Stewardship, or Terms of Use: // // The author grants permission to use and/or redistribute the code in this // file, in either source or translated form, under the following conditions: // 1. When redistributing the source code, the copyright notices and terms of // use must be neither removed nor modified. // 2. When redistributing in a form not generally read by humans, the // copyright notices and terms of use, with proper indication of elements // covered, must be reproduced in the accompanying documentation and/or // other materials provided with the redistribution. In addition, if the // source includes statements designed to compile a copyright notice // into the output object code, the redistributor is required to take // such steps as necessary to preserve the notice in the translated // object code. // 3. Modifications must be annotated, with attribution, including the name(s) // of the author(s) and the contributor(s) thereof, the conditions for // distribution of the modification, and full indication of the date(s) // and scope of the modification. Rights to the modification itself // shall necessarily be retained by the author(s) thereof. // 4. These grants shall not be construed as an assignment or assumption of // liability of any sort or to any degree. Neither shall these grants be // construed as endorsement or represented as such. 
Any party using this // code in any way does so under the agreement to entirely indemnify the // author and any contributors concerning the code and any use thereof. // Specifically, THIS SOFTWARE IS PROVIDED AT NO COST, AS IT IS, WITHOUT // ANY EXPRESS OR IMPLIED WARRANTY OF ANY SORT, INCLUDING, BUT NOT LIMITED // TO, WARRANTIES OF MERCHANTABILITY OR FITNESS FOR A PARTICULAR PURPOSE. // UNDER NO CIRCUMSTANCES SHALL THE AUTHOR OR CONTRIBUTORS BE LIABLE FOR // ANY DAMAGES WHATSOEVER ARISING FROM ITS USE OR MISUSE, EVEN IF ADVISED // OF THE EXISTENCE OF THE POSSIBILITY OF SUCH DAMAGE. // 5. This code should not be used for any illegal or immoral purpose, // including, but not limited to, the theft of property or services, // deliberate communication of false information, the distribution of drugs // for purposes other than medical, the distribution of pornography, the // provision of illicit sexual services, the maintenance of oppressive // governments or organizations, or the imposture of false religion and // false science. // Any illegal or immoral use incurs natural and legal penalties, which the // author invokes in full force upon the heads of those who so use it. // 6. Alternative redistribution arrangements: // a. If the above conditions are unacceptable, redistribution under the // following commonly used public licenses is expressly permitted: // i. The GNU General Public License (GPL) of the Free Software // Foundation. // ii. The Perl Artistic License, only as a part of Perl. // iii. The Apple Public Source License, only as a part of Darwin or // a Macintosh Operating System using Darwin. // b. No other alternative redistribution arrangement is permitted. // (The original author reserves the right to add to this list.) // c. When redistributing this code under an alternative license, the // specific license being invoked shall be noted immediately beneath // the body of the terms of use. 
The terms of the license so specified // shall apply only to the redistribution of the source so noted. // 7. In no case shall the rights of the original author to the original work // be impaired by any distribution or redistribution arrangement. // // End of the Assignment of Stewardship, or terms of use. // // License invoked: Assignment of Stewardship. // Notes concerning license: // Compiler directives are strongly encouraged as a means of meeting // the attribution requirements in the Assignment of Stewardship. */ /* Primary references for the ranges chosen below: // // Character palette from Apple's Kotoeri input method, systems 7/8/9. // Publisher: Apple, included with Apple's Macintosh operating systems. // The character palettes since sys. 8.0 or 8.1 have included primary pronunciations, // as well as JIS, kuten, and UNICODE assignments, in a detailed view. // Since at least sys. 8.5 or 8.6, a flag appears when a non-standard character is selected. // Newer versions track the changes to the various standards. // // Pasokon/Waapuro Kanji Jiten, 1987 Edition // Compiler: Tsutomu Uegaki; Publisher: Natsume-sha (Chiyouda-ku). // Lists and tables of Kanji and other JIS characters and character codes. // Contains a nice rectanglular arrangement of Kanji on pages 588-599. // // Waapuro/Pasokon Saishin Kanji Jiten, 1st Edition (1994) // Compiler: Shougakukan Dictionary Editors Department; // Publisher: Shougakukan (Chiyouda-ku). // Lists and tables of Kanji and other JIS characters and character codes. // Includes a list of the proposed annex characters, with annex numbers. // The annex characters have been assigned actual codes since this edition was published. // // Pasokon Yougo Jiten, 1992-93 Edition // Authors: Shigeru Okamoto, Ichirou Senba, Yoshiaki Nakamura, Kazuko Takahashi; // Publisher: Gijutsu Hyouron-sha (Shinjuku-ku). // Dictionary of personal computer terminology, // particularly referenced the JIS/ISO/ANSI 8-bit character tables starting page 409. 
*/ #include "sjctypenv.h" #include "sj8bitChars.h" #include "sj16bitChars.h" #include "slowsjctype.h" /* Because char is probably signed, // it is usually liable to induce errors to use escaped char constant notation. // '\x80' may well be something like 0xffffff80, rather than 0x80. // Hopefully, I have been consistent about this. <erg/> // Note the problems when comparing a char variable with a character constant: // char scan; . . . while ( scan <= 0x9f ) // will produce an infinite loop, which is probably not the desired effect. // 0x9f is an integer equal to decimal 159. // '\x9f' is a char and promotes to integer with sign extension: // ( -( 256 - 159 ) ) == ( -97 ) // Two's complement. // . . . while ( scan <= '\x9f' ) // will probably produce the desired result, but by an un-expected calculation. // For instance, // scan = 0x9e; if ( scan < '\x9f' ) // yields true because -98 is less than -97, not because 158 is less than 159. // I tend to forget which is which in the middle of loops, // so I usually use long integers in loops (which is a good idea anyway) // and avoid comparing to integer constants. // This is also a reason I use symbolic constants instead of directly using characters. // // This shows one of the many reasons for having some means of dialect control, // instead of constraining the one-and-only standard in ways that turn out to be non-optimal. */ /* Cleared the unwanted dependency on sjctypenv.h (bool) -- JMR2001.05.31 // This required changing the bool typed functions to int typed functions, as noted below. // This mod by Joel Matthew Rees, released under original terms of use. 
*/ int slowsjIsPOneByte( char * chp ) /* changed from bool to int JMR2001.05.31 */ { int b = *( (ubyte *) chp ); return b < 0x80 || ( b >= 0xa1 && b <= 0xdf ); } int slowsjIsPHighByte( char * chp ) /* changed from bool to int JMR2001.05.31 */ { int bHi = ( (ubyte *) chp )[ 0 ]; return ( bHi >= 0x81 && bHi <= 0x9f ) || ( bHi >= 0xe0 && bHi <= 0xfc ); } int slowsjIsPLowByte( char * chp ) /* changed from bool to int JMR2001.05.31 */ { int bLo = *( (ubyte *) chp ); return bLo >= 0x40 && bLo <= 0xfc && bLo != 0x7f; } int slowsjIsP7bit( char * chp ) /* changed from bool to int JMR2001.05.31 */ { int bLo = *( (ubyte *) chp ); return bLo < 0x80; } int slowsjPGuessCount( char * chp ) { return ( slowsjIsPHighByte( chp ) && slowsjIsPLowByte( chp + 1 ) ) ? 2 : slowsjIsPOneByte( chp ) ? 1 : 0; } int slowsjIsPCntrl( char * chp ) { int uch = (ubyte) chp[ 0 ]; return ( uch <= 0x1f || uch == 0x7f ) ? 1 : 0; /* DEL added JMR2001.05.23 */ /* The standard doesn't know for unit separator. */ } int slowsjIsPSpace( char * chp ) { ubyte * uchp = (ubyte *) chp; switch ( *uchp ) { case b7_HT: case b7_LF: case b7_VT: case b7_FF: case b7_CR: case b7_SP: return 1; default: return ( uchp[ 0 ] == b16_SP[ 0 ] && uchp[ 1 ] == b16_SP[ 1 ] ) ? 2 : 0; /* 0x8140 is sjis 2-byte space */ } } int slowsjIsPDigit( char * chp ) { ubyte * uchp = (ubyte *) chp; int b = *uchp; if ( b == b16_ZERO[ 0 ] ) { b = uchp[ 1 ]; return ( b >= b16_ZERO[ 1 ] && b <= b16_NINE[ 1 ] ) ? 2 : 0; } else { return ( b >= b7_ZERO && b <= b7_NINE ) ? 1 : 0; } } int slowsjIsPXDigit( char * chp ) { ubyte * uchp = (ubyte *) chp; int b = *uchp; if ( b == b16_A[ 0 ] ) { b = uchp[ 1 ]; return ( ( b >= b16_A[ 1 ] && b <= b16_F[ 1 ] ) || ( b >= b16_a[ 1 ] && b <= b16_f[ 1 ] ) ) ? 2 : slowsjIsPDigit( chp ); } else { return ( ( b >= b7_A && b <= b7_F ) || ( b >= b7_a && b <= b7_f ) ) ? 
1 : slowsjIsPDigit( chp ); } } int slowsjIsPRomanLower( char * chp ) { ubyte * uchp = (ubyte *) chp; int b = *uchp; if ( b == b16_a[ 0 ] ) { b = uchp[ 1 ]; return ( b >= b16_a[ 1 ] && b <= b16_z[ 1 ] && b != 0x7f ) ? 2 : 0; } else { return ( b >= b7_a && b <= b7_z ) ? 1 : 0; } } int slowsjIsPRomanUpper( char * chp ) { ubyte * uchp = (ubyte *) chp; int b = *uchp; if ( b == b16_A[ 0 ] ) { b = uchp[ 1 ]; return ( b >= b16_A[ 1 ] && b <= b16_Z[ 1 ] && b != 0x7f ) ? 2 : 0; } else { return ( b >= b7_A && b <= b7_Z ) ? 1 : 0; } } /* Time biased against upper case, but we don't care on the slow version. */ int slowsjIsPRoman( char * chp ) { int result = slowsjIsPRomanLower( chp ); if ( result == 0 ) result = slowsjIsPRomanUpper( chp ); return result; } int slowsjIsPGreekLower( char * chp ) { ubyte * uchp = (ubyte *) chp; int b = uchp[ 1 ]; return ( ( *uchp == b16_alpha[ 0 ] ) && ( b >= b16_alpha[ 1 ] && b <= b16_omega[ 1 ] && b != 0x7f ) )? 2 : 0; } int slowsjIsPGreekUpper( char * chp ) { ubyte * uchp = (ubyte *) chp; int b = uchp[ 1 ]; return ( ( *uchp == b16_ALPHA[ 0 ] ) && ( b >= b16_ALPHA[ 1 ] && b <= b16_OMEGA[ 1 ] && b != 0x7f ) ) ? 2 : 0; } /* Time biased against upper case, but we don't care on the slow version. */ int slowsjIsPGreek( char * chp ) { int result = slowsjIsPGreekLower( chp ); if ( result == 0 ) slowsjIsPGreekUpper( chp ); return result; } int slowsjIsPRussianLower( char * chp ) { ubyte * uchp = (ubyte *) chp; int b = uchp[ 1 ]; return ( ( *uchp == b16_Russian_a[ 0 ] ) && ( b >= b16_Russian_a[ 1 ] && b <= b16_Russian_ya[ 1 ] && b != 0x7f ) )? 2 : 0; } int slowsjIsPRussianUpper( char * chp ) { ubyte * uchp = (ubyte *) chp; int b = uchp[ 1 ]; return ( ( *uchp == b16_Russian_A[ 0 ] ) && ( b >= b16_Russian_A[ 1 ] && b <= b16_Russian_YA[ 1 ] && b != 0x7f ) ) ? 2 : 0; } /* Time biased against upper case, but we don't care on the slow version. 
*/ int slowsjIsPRussian( char * chp ) { int result = slowsjIsPRussianLower( chp ); if ( result == 0 ) slowsjIsPRussianUpper( chp ); return result; } /* Time biased against Greek and Russian, but we don't care on the slow version. */ int slowsjIsPUpper( char * chp ) { int result = slowsjIsPRomanUpper( chp ); if ( result == 0 ) result = slowsjIsPGreekUpper( chp ); if ( result == 0 ) result = slowsjIsPRussianUpper( chp ); return result; } /* Time biased against Greek and Russian, but we don't care on the slow version. */ int slowsjIsPLower( char * chp ) { int result = slowsjIsPRomanLower( chp ); if ( result == 0 ) result = slowsjIsPGreekLower( chp ); if ( result == 0 ) result = slowsjIsPRussianLower( chp ); return result; } /* Time biased against Greek and Russian, but we don't care on the slow version. */ int slowsjIsPEurAsianAlpha( char * chp ) { int result = slowsjIsPRoman( chp ); if ( result == 0 ) result = slowsjIsPGreek( chp ); if ( result == 0 ) result = slowsjIsPRussian( chp ); return result; } int slowsjIsPQuasiEurAsianAlpha( char * chp ) { ubyte * uchp = (ubyte *) chp; int b = *uchp; if ( b == b16_AccentAcute_Prime[ 0 ] ) { b = uchp[ 1 ]; return ( b == b16_AccentAcute_Prime[ 1 ] || b == b16_AccentGrave[ 1 ] || b == b16_Umlaut[ 1 ] || b == b16_AccentCircumflex[ 1 ] || b == b16_Overline_Negate[ 1 ] || b == b16_QuarterDash_Hyphen[ 1 ] || b == b16_WavyDash_Tilde[ 1 ] ) ? 2 : 0; } else { return ( b == b7_HYPHEN || b == b7_ACCENTGRAVE || b == b7_TILDE || b == b7_CARET ) ? 1 : 0; } } int slowsjIsPHiragana( char * chp ) { ubyte * uchp = (ubyte *) chp; int b = uchp[ 1 ]; return ( ( *uchp == b16_hiraganaSub_a[ 0 ] ) && ( b >= b16_hiraganaSub_a[ 1 ] && b <= b16_hiragana_ng[ 1 ] && b != 0x7f ) )? 2 : 0; } int slowsjIsPKatakana( char * chp ) { ubyte * uchp = (ubyte *) chp; int b = *uchp; if ( b == b16_katakanaSub_a[ 0 ] ) { b = uchp[ 1 ]; return ( b >= b16_katakanaSub_a[ 1 ] && b <= b16_katakanaSub_ke[ 1 ] && b != 0x7f ) ? 
2 : 0; } else { return ( ( b >= b8_katakana_wo && b <= b8_katakanaSub_tu ) || ( b >= b8_katakana_a && b <= b8_katakana_ng ) ) ? 1 : 0; } } /* Time biased against katakana, but we don't care on the slow version. */ int slowsjIsPKana( char * chp ) { int result = slowsjIsPHiragana( chp ); if ( result == 0 ) result = slowsjIsPKatakana( chp ); return result; } int slowsjIsPQuasiKana( char * chp ) { ubyte * uchp = (ubyte *) chp; int b = *uchp; if ( b == b16_DakuTen[ 0 ] ) { b = uchp[ 1 ]; return ( b == b16_DakuTen[ 1 ] || b == b16_HanDakuTen[ 1 ] || b == b16_KatakanaRepeat[ 1 ] || b == b16_KatakanaRepeatVoiced[ 1 ] || b == b16_HiraganaRepeat[ 1 ] || b == b16_HiraganaRepeatVoiced[ 1 ] || b == b16_ChoOn[ 1 ] ) ? 2 : 0; } else { return ( b == b8_ChoOn || b == b8_DakuTen || b == b8_HandakuTen ) ? 1 : 0; } } /* This has even time-bias for JIS level 1. */ int slowsjIsPKanji( char * chp ) { ubyte * uchp = (ubyte *) chp; int bHi = uchp[ 0 ]; int bLo = uchp[ 1 ]; if ( slowsjIsPHighByte( chp ) && slowsjIsPLowByte( chp + 1 ) && ( ( bHi == b16_kanji1Low_a[ 0 ] && bLo >= b16_kanji1Low_a[ 1 ] ) || ( bHi > b16_kanji1Low_a[ 0 ] && bHi < b16_kanji1High_ude[ 0 ] ) || ( bHi == b16_kanji1High_ude[ 0 ] && bLo <= b16_kanji1High_ude[ 1 ] ) || ( bHi == b16_kanji2aLow_ichi[ 0 ] && bLo >= b16_kanji2aLow_ichi[ 1 ] ) || ( bHi > b16_kanji2aLow_ichi[ 0 ] && bHi <= b16_kanji2aHigh_jou[ 0 ] ) /* The rows at the end of 2a and beginning of 2b are complete. */ || ( bHi >= b16_kanji2bLow_you[ 0 ] && bHi <= b16_kanji2bHigh_hikaru[ 0 ] ) || ( bHi == b16_kanji2bHigh_hikaru[ 0 ] && bLo <= b16_kanji2bHigh_hikaru[ 1 ] ) ) ) return 2; else return 0; } /* This is completely time-biased against kanji, and a little harder to mentally verify. 
{ ubyte * uchp = (ubyte *) chp; int bHi = uchp[ 0 ]; int bLo = uchp[ 1 ]; if ( !slowsjIsPHighByte( chp ) || !slowsjIsPLowByte( chp + 1 ) || bHi < b16_kanji1Low_a_sub[ 0 ] || ( bHi == b16_kanji1Low_a_sub[ 0 ] && bLo < b16_kanji1Low_a_sub[ 1 ] ) || ( bHi == b16_kanji1High_ude_arm[ 0 ] && bLo > b16_kanji1High_ude_arm[ 1 ] && bLo < b16_kanji2aLow_ichi_formalOne[ 1 ] ) || ( bHi > b16_kanji2aHigh_ude_arm[ 0 ] && bHi < b16_kanji2bLow_yo_e040[ 0 ] ) || ( bHi == b16_kanji2bHigh_hikaru_eaa4[ 0 ] && bLo > b16_kanji2bHigh_hikaru_eaa4[ 1 ] ) || bHi > b16_kanji2bHigh_hikaru_eaa4[ 0 ] ) return 0; else return 2; } */ int slowsjIsPQuasiKanji( char * chp ) { ubyte * uchp = (ubyte *) chp; int b = uchp[ 1 ]; return ( ( *uchp == b16_KanjiIbid[ 0 ] ) && ( b >= b16_KanjiIbid[ 1 ] /* This might be a proper Kanji? */ || b <= b16_Ditto[ 1 ] /* Should this be only with European mods? */ || b <= b16_Shime[ 1 ] /* Probably not Kanji? */ || b <= b16_KanjiZero[ 1 ] /* Should this be Kanji? */ || b <= b16_OpenCircle_Maru[ 1 ] /* Often used as fill-in-th-blank. */ || b <= b16_KanjiRepeat[ 1 ] ) )? 2 : 0; } /* Run-time bias against everybody. // Should give fairly even timing in general use // and give best timing for generating tables. */ int slowsjIsPAlpha( char * chp ) { int result = slowsjIsPKanji( chp ); if ( result == 0 ) result = slowsjIsPKana( chp ); if ( result == 0 ) result = slowsjIsPEurAsianAlpha( chp ); return result; } /* Use the same bias as alpha, just to be obnoxious. */ int slowsjIsPQuasiAlpha( char * chp ) { int result = slowsjIsPQuasiKanji( chp ); if ( result == 0 ) result = slowsjIsPQuasiKana( chp ); if ( result == 0 ) result = slowsjIsPQuasiEurAsianAlpha( chp ); return result; } /* Bias? What bias? */ int slowsjIsPAlNum( char * chp ) { int result = slowsjIsPDigit( chp ); if ( result == 0 ) result = slowsjIsPAlpha( chp ); return result; } /* Bias? What bias? 
*/ int slowsjIsPAlNumQuasi( char * chp ) { int result = slowsjIsPQuasiAlpha( chp ); if ( result == 0 ) result = slowsjIsPAlNum( chp ); return result; } int slowsjIsPLineDraw( char * chp ) { ubyte * uchp = (ubyte *) chp; int b = uchp[ 1 ]; return ( ( *uchp == b16_LineDraw_1H[ 0 ] ) && ( b >= b16_LineDraw_1H[ 1 ] && b <= b16_LineDraw_1H2V[ 1 ] && b != 0x7f ) )? 2 : 0; } int slowsjIsPPunct( char * chp ) { ubyte * uchp = (ubyte *) chp; int b = *uchp; if ( b == b16_ToTen[ 0 ] ) /* Nice of the JIS comittee to put them all together. */ { b = uchp[ 1 ]; return ( b != 0x7f /* Check and excuse later */ && ( ( b >= b16_ToTen[ 1 ] && b <= b16_Geta[ 1 ] ) || ( b >= b16_Element[ 1 ] && b <= b16_Intersection[ 1 ] ) || ( b >= b16_Conjunction_And[ 1 ] && b <= b16_Exists[ 1 ] ) || ( b >= b16_Angle[ 1 ] && b <= b16_DoubleIntegral[ 1 ] ) || ( b >= b16_Angstrom[ 1 ] && b <= b16_Paragraph[ 1 ] ) || ( b == b16_CompositionCircle[ 1 ] ) ) ) ? 2 : 0; } else { return ( ( b >= b7_EXCLAIM && b <= b7_SLASH ) || ( b >= b7_COLON && b <= b7_ATEACH ) || ( b >= b7_LEFTBRACKET && b <= b7_ACCENTGRAVE ) || ( b >= b7_LEFTBRACE && b <= b7_TILDE ) || ( b >= b8_Kuten && b <= b8_ChuTen ) || ( b == b8_ChoOn ) || ( b >= b8_DakuTen && b <= b8_HandakuTen ) ) ? 1 : 0; } } int slowsjIsPGraph( char * chp ) { int result = slowsjIsPAlNum( chp ); if ( result == 0 ) result = slowsjIsPPunct( chp ); return result; } int slowsjIsPPrint( char * chp ) { ubyte * uchp = (ubyte *) chp; if ( *uchp == b7_SP ) return 1; else if ( uchp[ 0 ] == b16_SP[ 0 ] && uchp[ 1 ] == b16_SP[ 1 ] ) return 2; else return slowsjIsPGraph( chp ); } /* Macro to isprint() works just fine because there are no two-byte control characters. int slowsjIsP2Byte( char * chp ) {} */ /* ToLower/Upper will have to test the 7f gap specifically for each range that suffers it. // Some are entirely above and some entirely below. // JIS Roman/Greek/Russian doesn't include any caseless characters in my materials. 
// But if they did I could test the converted character for validity before returning it. // Just for fun, I'll include the test anyway. */ int slowsjPToLowerRoman( char * chpin, char * chpout ) { int count = slowsjIsPRomanUpper( chpin ); ubyte * uchpin = (ubyte *) chpin; ubyte * uchpout = (ubyte *) chpout; ubyte temp[ 4 ] = { 0 }; switch ( count ) { case 1: temp[ 0 ] = uchpin[ 0 ] + ( b7_a - b7_A ); break; case 2: temp[ 0 ] = uchpin[ 0 ]; temp[ 1 ] = uchpin[ 1 ] + ( b16_a[ 1 ] - b16_A[ 1 ] ); /* No gap */ break; } if ( count > 0 && slowsjIsPRomanLower( (char *) temp ) == count ) { uchpout[ 0 ] = temp[ 0 ]; if ( count > 1 ) uchpout[ 1 ] = temp[ 1 ]; } else count = 0; return count; } int slowsjPToUpperRoman( char * chpin, char * chpout ) { int count = slowsjIsPRomanLower( chpin ); ubyte * uchpin = (ubyte *) chpin; ubyte * uchpout = (ubyte *) chpout; ubyte temp[ 4 ] = { 0 }; switch ( count ) { case 1: temp[ 0 ] = uchpin[ 0 ] - ( b7_a - b7_A ); break; case 2: temp[ 0 ] = uchpin[ 0 ]; temp[ 1 ] = uchpin[ 1 ] - ( b16_a[ 1 ] - b16_A[ 1 ] ); /* No gap */ break; } if ( count > 0 && slowsjIsPRomanUpper( (char *) temp ) == count ) { uchpout[ 0 ] = temp[ 0 ]; if ( count > 1 ) uchpout[ 1 ] = temp[ 1 ]; } else count = 0; return count; } int slowsjPToLowerGreek( char * chpin, char * chpout ) { int count = slowsjIsPGreekUpper( chpin ); ubyte * uchpin = (ubyte *) chpin; ubyte * uchpout = (ubyte *) chpout; ubyte temp[ 4 ] = { 0 }; if ( count == 2 ) { temp[ 0 ] = uchpin[ 0 ]; temp[ 1 ] = uchpin[ 1 ] + ( b16_alpha[ 1 ] - b16_ALPHA[ 1 ] ); /* No gap */ } if ( count == 2 && slowsjIsPGreekLower( (char *) temp ) == count ) { uchpout[ 0 ] = temp[ 0 ]; uchpout[ 1 ] = temp[ 1 ]; } else count = 0; return count; } int slowsjPToUpperGreek( char * chpin, char * chpout ) { int count = slowsjIsPGreekLower( chpin ); ubyte * uchpin = (ubyte *) chpin; ubyte * uchpout = (ubyte *) chpout; ubyte temp[ 4 ] = { 0 }; if ( count == 2 ) { temp[ 0 ] = uchpin[ 0 ]; temp[ 1 ] = uchpin[ 1 ] - ( b16_alpha[ 1 ] - b16_ALPHA[ 1 ] ); /* No gap 
*/ } if ( count == 2 && slowsjIsPGreekUpper( (char *) temp ) == count ) { uchpout[ 0 ] = temp[ 0 ]; uchpout[ 1 ] = temp[ 1 ]; } else count = 0; return count; } int slowsjPToLowerRussian( char * chpin, char * chpout ) { int count = slowsjIsPRussianUpper( chpin ); ubyte * uchpin = (ubyte *) chpin; ubyte * uchpout = (ubyte *) chpout; ubyte temp[ 4 ] = { 0 }; if ( count == 2 ) { temp[ 0 ] = uchpin[ 0 ]; temp[ 1 ] = uchpin[ 1 ] + ( b16_Russian_a[ 1 ] - b16_Russian_A[ 1 ] ); if ( temp[ 1 ] >= 0x7f ) /* Adjust for the gap. */ temp[ 1 ] += 1; } if ( count == 2 && slowsjIsPRussianLower( (char *) temp ) == count ) { uchpout[ 0 ] = temp[ 0 ]; uchpout[ 1 ] = temp[ 1 ]; } else count = 0; return count; } int slowsjPToUpperRussian( char * chpin, char * chpout ) { int count = slowsjIsPRussianLower( chpin ); /* Checks the gap. */ ubyte * uchpin = (ubyte *) chpin; ubyte * uchpout = (ubyte *) chpout; ubyte temp[ 4 ] = { 0 }; if ( count == 2 ) { temp[ 0 ] = uchpin[ 0 ]; temp[ 1 ] = uchpin[ 1 ] - ( b16_Russian_a[ 1 ] - b16_Russian_A[ 1 ] ); if ( uchpin[ 1 ] > 0x7f ) /* Adjust for the gap (0x7f already filtered above). */ temp[ 1 ] -= 1; } if ( count == 2 && slowsjIsPRussianUpper( (char *) temp ) == count ) { uchpout[ 0 ] = temp[ 0 ]; uchpout[ 1 ] = temp[ 1 ]; } else count = 0; return count; } /* Again, time-biased in favor of the most likely. (Russian and Greek are not as commonly used.) // Would be faster to test directly, but that increases logical coupling // (increases the chance for algorithmic errors). // Reducing errors is a higher priority than speed. 
*/ int slowsjPToLower( char * chpin, char * chpout ) { int count = slowsjPToLowerRoman( chpin, chpout ); if ( count == 0 ) count = slowsjPToLowerGreek( chpin, chpout ); if ( count == 0 ) count = slowsjPToLowerRussian( chpin, chpout ); return count; } int slowsjPToUpper( char * chpin, char * chpout ) { int count = slowsjPToUpperRoman( chpin, chpout ); if ( count == 0 ) count = slowsjPToUpperGreek( chpin, chpout ); if ( count == 0 ) count = slowsjPToUpperRussian( chpin, chpout ); return count; } /* ToLower/Upper will have to test the 7f gap specifically for each range that suffers it. Some are entirely above and some entirely below. JIS Roman/Greek/Russian doesn't include caseless. For converting katakana to hiragana, I can test whether the result is valid before returning it. int slowsjToUpper( unsigned char * mbcin, unsigned char * mbcout ) Converts cased word forming characters to upper case, including 8 bit JIS-Latin; 16 bit JIS-Roman, -Greek, and -Russian characters, but no kanji. Returns byte count converted or zero. */ /* So, the initial, standard function headers: int slowsjIsCntrl( unsigned char * mbc ) As near as I can tell, all one byte, between 0 and 0x1f, inclusive. Returns byte count. int slowsjIsSpace( unsigned char * mbc ) Adds one two byte version of the space character. Returns byte count. int slowsjIsPrint( unsigned char * mbc ) All graphic characters, including non-control space characters. Returns byte count. int slowsjIsGraph( unsigned char * mbc ) All graphic non-space characters. Returns byte count. int slowsjIsPunct( unsigned char * mbc ) All non-word-forming characters. Will later be subdivided for the richer JIS set. Returns byte count. int slowsjIsDigit( unsigned char * mbc ) The standard digits 0..9, as specified in ANSI/ISO ctype. Includes both one and two byte digits. Does not include kanji numbers. Returns byte count. int slowsjIsXDigit( unsigned char * mbc ) The standard hexadecimal digits specified in ANSI/ISO ctype. 
Includes both one and two byte digits. Does not include kanji numbers. Returns byte count. int slowsjIsAlpha( unsigned char * mbc ) Characters used to form words, as used by non-programmers. Does not include the standard decimal digits, but does include the kanji numbers. Includes a lot of caseless characters, of course. Returns byte count. int slowsjIsAlNum( unsigned char * mbc ) Characters used to form words, as used by programmers, thus including digits. Returns byte count. int slowsjIsUpper( unsigned char * mbc ) Upper cased characters, includes 8 bit JIS-Latin; 16 bit JIS-Roman, -Greek, and -Russian characters, but no kanji. Returns byte count. int slowsjIsLower( unsigned char * mbc ) Lower cased characters, includes 8 bit JIS-Latin; 16 bit JIS-Roman, -Greek, and -Russian characters, but no kanji. Returns byte count. int slowsjToLower( unsigned char * mbcin, unsigned char * mbcout ) Converts cased word forming characters to lower case, including 8 bit JIS-Latin; 16 bit JIS-Roman, -Greek, and -Russian characters, but no kanji. Returns byte count converted or zero. int slowsjToUpper( unsigned char * mbcin, unsigned char * mbcout ) Converts cased word forming characters to upper case, including 8 bit JIS-Latin; 16 bit JIS-Roman, -Greek, and -Russian characters, but no kanji. Returns byte count converted or zero. int slowsjIs1Byte( unsigned char * mbc ) Valid one byte character. Returns byte count. int slowsjIs2Byte( unsigned char * mbc ) Valid two byte character? Returns byte count. int slowsjCouldBe2Byte( unsigned char * mbc ) A combination of valid lead byte and valid tail byte? Returns byte count. The second, or fast version slowsjIsXX() functions will use constants of the pattern slowsjIsXX_k. The constants and the general call will also be provided in the source header, as mentioned above, for optimization: int slowsjCType( unsigned long type, unsigned char * mbc ) Test the type formed by the bit-or of the type constants passed as the first parameter. 
Returns byte count on test true or zero on test false. The initial slow version functions will have names of the pattern slow_slowsjIsXX() so they can co-exist during debugging. slowsjrIsXX()? Now, some of the foreseeable necessary extensions: int slowsjIsMath( unsigned char * mbc ) The plethora of math and logic symbols in JIS. Returns byte count. int slowsjIsUnit( unsigned char * mbc ) The plethora of unit symbols in JIS, but not system specific extensions like m2. Does not include kanji. Returns byte count. int slowsjIsQuote( unsigned char * mbc ) The plethora of quoting and parenthetic characters in JIS. Returns byte count. int slowsjIsKanji( unsigned char * mbc ) All the proper kanji characters. Returns byte count. int isNumberKanji( unsigned char * mbc ) All the number kanji, including the special ones used, for example, on currency and bank notes. Returns byte count. int slowsjIsKana( unsigned char * mbc ) All the katakana and hiragana characters, including the one byte katakana. Also including the free-standing voicing and plosive symbols, dakuten and handakuten. Returns byte count. int slowsjIsKata( unsigned char * mbc ) All the katakana, including the SJIS one byte katakana, but not the free-standing voicing and plosive symbols, dakuten and handakuten. Returns byte count. int slowsjIsHira( unsigned char * mbc ) All the hiragana, not including the free-standing voicing and plosive symbols, dakuten and handakuten. Returns byte count. int slowsjToKata( unsigned char * mbcin, unsigned char * mbcout ) Converts hiragana to katakana. Returns byte count converted or zero. int slowsjToHira( unsigned char * mbcin, unsigned char * mbcout ) Converts katakana to hiragana, where possible. Moves the unconvertable katakana as they are. Does not convert the one byte katakana. Returns byte count converted or zero. int slowsjTo16Kata( unsigned char * mbcin, unsigned char * mbcout ) Converts the one byte katakana to two byte katakana. 
Round trip slowsjTo16Kata() -> slowsjTo8Kata() should be guaranteeable. Returns byte count converted or zero. int slowsjTo8Kata( unsigned char * mbcin, unsigned char * mbcout ) Converts two byte katakana to one byte katakana, where possible. Round trip slowsjTo8Kata() -> slowsjTo16Kata() may be guaranteeable, I'm not sure yet. Returns byte count converted or zero. Some of the hypothetical extensions: int slowsjIsMusic( unsigned char * mbc ) The music symbols in JIS. Returns byte count. int slowsjIsKanjiUnit( unsigned char * mbc ) The kanji version of units, including also ten, hundred, thousand, ten-thousand, etc. Returns byte count. int slowsjIsRoman( unsigned char * mbc ) All the JIS Roman (two byte Latin) characters. Returns byte count. int slowsjIsGreek( unsigned char * mbc ) All the JIS Greek characters. Returns byte count. int slowsjIsRussian( unsigned char * mbc ) All the JIS Russian characters. Returns byte count. int slowsjIsLatin( unsigned char * mbc ) All the Latin characters, including the two byte Roman (Latin) and one byte Latin. Returns byte count. int slowsjToRoman( unsigned char * mbcin, unsigned char * mbcout ) Convert one byte Latin to two byte JIS Roman (Latin). Returns byte count converted or zero. int slowsjToLatin( unsigned char * mbcin, unsigned char * mbcout ) Convert two byte JIS Roman (Latin) to one byte Latin. Returns byte count converted or zero. */
		\ No newline at end of file
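
The comment in slowsjctype.c about signed char, together with the byte ranges used by slowsjIsPOneByte(), slowsjIsPHighByte(), slowsjIsPLowByte() and slowsjPGuessCount(), can be illustrated with a small standalone sketch. This is not the author's code: the sketch_* names are hypothetical, and the ubyte typedef and b7_/b16_ constants from the project headers are deliberately not used so that the fragment compiles on its own. It assumes a NUL-terminated input string.

```c
#include <stddef.h>

/* Hypothetical helpers for illustration only; not part of slowsjctype.c. */

/* Read one byte as 0..255 whether or not plain char is signed.
 * Comparing a plain char against 0x9f directly would fail where char
 * is signed, because the byte 0x9f would promote to -97. */
static int sketch_byte( const char * p )
{
    return (unsigned char) *p;
}

/* Shift-JIS lead byte: 0x81..0x9f or 0xe0..0xfc
 * (the ranges tested by slowsjIsPHighByte in the diff). */
static int sketch_is_lead( const char * p )
{
    int b = sketch_byte( p );
    return ( b >= 0x81 && b <= 0x9f ) || ( b >= 0xe0 && b <= 0xfc );
}

/* Shift-JIS trail byte: 0x40..0xfc, skipping the gap at 0x7f
 * (the range tested by slowsjIsPLowByte in the diff). */
static int sketch_is_trail( const char * p )
{
    int b = sketch_byte( p );
    return b >= 0x40 && b <= 0xfc && b != 0x7f;
}

/* One-byte character: 7-bit range, or one-byte katakana 0xa1..0xdf. */
static int sketch_is_one_byte( const char * p )
{
    int b = sketch_byte( p );
    return b < 0x80 || ( b >= 0xa1 && b <= 0xdf );
}

/* Byte count of the character at p: 2, 1, or 0 if neither form fits.
 * Reading p[ 1 ] is safe on a NUL-terminated string because a lead
 * byte is never NUL, so p[ 1 ] is at worst the terminator. */
static int sketch_guess_count( const char * p )
{
    if ( sketch_is_lead( p ) && sketch_is_trail( p + 1 ) )
        return 2;
    return sketch_is_one_byte( p ) ? 1 : 0;
}

/* Count characters in a NUL-terminated string.
 * Returns -1 at the first byte sequence that fits neither form. */
static long sketch_count_chars( const char * s )
{
    long n = 0;
    while ( *s != '\0' )
    {
        int step = sketch_guess_count( s );
        if ( step == 0 )
            return -1;
        s += step;
        ++n;
    }
    return n;
}
```

Routing every comparison through a cast to unsigned char keeps the range tests readable as plain hexadecimal values, which is the same reason the source casts to its own unsigned byte type before comparing.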
