Forum: Help (Thread #37445)

help implementing a not so posix script (2016-01-21 10:34 by trek00 #77527)

first of all, I know that my script is not 100% POSIX-conformant, but it runs fine on other shells and I would like to find a safe way to run it on yash too, because it is a really fast shell (see the benchmark in the .tar.gz)

the script converts a string to a list of integers (like the output of od -A n -t u1 -v) and then restores the string starting from the integer list

how? the function n_str_set reads one character at a time from the string, checks the character in a big case statement and adds the corresponding integer to the list; then the function n_str_get reads one integer at a time and adds the corresponding character to the string
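To make the description above concrete, here is a minimal sketch of the character-by-character approach. It is not the real n_str_set from lib/str.sh: the function name sketch_str_to_codes and its variables are made up for illustration, and only three ASCII characters are mapped, where the real function is described as having one case branch per character.

```shell
# Hypothetical sketch, NOT the real n_str_set from lib/str.sh.
# Prints the integer codes of a string, separated by spaces.
sketch_str_to_codes() {
    _s_rest=$1
    _s_out=
    while [ -n "$_s_rest" ]; do
        # Isolate the first character: compute the tail, then strip it.
        _s_tail=${_s_rest#?}
        _s_char=${_s_rest%"$_s_tail"}
        case $_s_char in
            A) _s_code=65 ;;
            K) _s_code=75 ;;
            O) _s_code=79 ;;
            *) _s_code=63 ;;   # unmapped characters become '?' in this sketch
        esac
        # Append the code, with a separating space after the first one.
        _s_out="${_s_out}${_s_out:+ }${_s_code}"
        _s_rest=$_s_tail
    done
    printf '%s\n' "$_s_out"
}

sketch_str_to_codes OK   # prints: 79 75
```

The reverse direction would walk the integer list and append the matching character through a second case statement.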

with the ASCII characters (1-127) there is no problem, but 8bit characters (128-255) cannot be read by yash from the script file if the locale's character encoding is not an 8bit single-byte one (like *.iso88591)

in my head, there are two possible workarounds:
- adding an alternative yash version of the functions that only allows ASCII characters to be converted (but I don't really like this solution)
- setting LANG to an 8bit single-byte character encoding before sourcing the file (but the script must discover a compatible locale in the hope that it is installed)
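For the second workaround, discovering a compatible locale could look roughly like the sketch below. The helper name find_8bit_locale is hypothetical, and it only recognises names ending in .iso88591 or .ISO-8859-1 as listed by locale -a; other single-byte locales are ignored here.

```shell
# Hypothetical helper: print the first installed locale whose name ends in
# an ISO 8859-1 charset suffix, or nothing if none is found.
find_8bit_locale() {
    locale -a 2>/dev/null | while IFS= read -r _l_name; do
        case $_l_name in
            *.[Ii][Ss][Oo]88591 | *.[Ii][Ss][Oo]-8859-1)
                printf '%s\n' "$_l_name"
                break
                ;;
        esac
    done
}

find_8bit_locale

# Possible use (assumption: the locale has to be in effect when yash starts,
# as in the LANG=... yash -c '...' invocations in this post):
#   loc=$(find_8bit_locale)
#   [ -n "$loc" ] && LANG=$loc yash -c '. lib/str.sh; ...'
```

Whether yash would pick up a locale changed after startup is not established here, so the sketch only reports the name and leaves the choice of how to apply it open.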

if you want to test the different behaviors, unpack the .tar.gz file and then cd into naive-0.0.2

with a posix shell:

$ LANG=C sh -c '. lib/str.sh; n_str_set bin OK; n_str_get "$bin" str; echo $str'
OK

with yash and C locale:

$ LANG=C yash -c '. lib/str.sh; n_str_set bin OK; n_str_get "$bin" str; echo $str'
.: cannot read input: Invalid or incomplete multibyte or wide character
.: cannot read input: Invalid or incomplete multibyte or wide character
lib/str.sh:157: syntax error: the single quotation is not closed
lib/str.sh:157: syntax error: `)' is missing
lib/str.sh:157: syntax error: `esac' is missing
lib/str.sh:157: syntax error: `done' is missing
lib/str.sh:157: syntax error: `}' is missing
yash: no such command `n_str_set'
yash: no such command `n_str_get'

with yash and iso8859 charset:

$ LANG=en_GB.iso88591 yash -c '. lib/str.sh; n_str_set bin OK; n_str_get "$bin" str; echo $str'
OK

note: inside the n_str_set function LC_ALL is set to C and then restored at the end, so yash needs iso88591 to read the file, but then it can manage 8bit characters with the already parsed code even with the C locale

note2: yash can print an 8bit char even in the C locale:

$ LANG=C yash -c 'echo \\0364 | od -An -c'
364 \n

I'm using the version 2.36-1 included in Debian GNU/Linux 8 (jessie)

the functions n_str_set and n_str_get are in the lib/str.sh file included in http://www.trek.eu.org/devel/naive/naive-0.0.2.tar.gz

thanks for your time!


Re: help implementing a not so posix script (2016-01-21 23:39 by magicant #77531)

This is an interesting but difficult challenge. One of the important things is that yash converts byte characters to wide characters (char -> wchar_t) when reading the script file. Thus, the bytes must be valid characters in the current locale when the file is read (not when the case statement is parsed or executed thereafter).

Also note that _n_buf=${_n_str#?} trims the first character, not the first byte. It will matter in multibyte locales like UTF-8.
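The character-versus-byte difference can be observed with a small experiment. This is only a sketch: the function name bytes_after_trim is made up, and the result depends on the shell and on the locale in effect when it runs.

```shell
# "é" encoded in UTF-8 (bytes 0xC3 0xA9) followed by an ASCII "x": 3 bytes.
s=$(printf '\303\251x')

# Hypothetical helper: trim one "character" as the shell sees it,
# then report how many bytes are left.
bytes_after_trim() {
    _t_rest=${1#?}
    printf '%s' "$_t_rest" | wc -c | tr -d ' '
}

bytes_after_trim "$s"
```

A shell that honours a UTF-8 locale should remove both bytes of the first character and report 1, while a byte-oriented shell, or any shell running in the C locale, should remove a single byte and report 2.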

For the purpose of byte-to-integer conversion, it would be best to use external commands like "od" and "xxd", which do not convert bytes to wide characters. If you insist on performing the conversion in the shell script, some kind of workaround would be inevitable, since POSIX defines no portable 8bit-safe locale.
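As a rough illustration of the external-command route, the sketch below wraps od with the same options mentioned in the first post (od -A n -t u1 -v). The function name str_to_codes_od is hypothetical.

```shell
# Hypothetical wrapper: print the byte values of a string, space-separated.
str_to_codes_od() {
    # The unquoted expansion is intentional: field splitting collapses
    # od's padding and line breaks. od prints only digits and blanks here.
    set -- $(printf '%s' "$1" | od -A n -t u1 -v)
    printf '%s\n' "$*"
}

str_to_codes_od OK   # prints: 79 75
```

Because od works on bytes, the result does not depend on the locale, but every call costs a command substitution and an external process.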

I'll follow up if I come up with another workaround.
Reply to: #77527

Re: help implementing a not so posix script (2016-01-22 19:40 by trek00 #77536)

[Reply To Message #77531]
> This is an interesting but difficult challenge. One of the important things is that yash converts byte characters to wide characters (char -> wchar_t) when reading the script file. Thus, the bytes must be valid characters in the current locale when the file is read (not when the case statement is parsed or executed thereafter).

you confirmed my speculations on yash's inner workings (merely based on the glibc error message and some tests)

> Also note that _n_buf=${_n_str#?} trims the first character, not the first byte. It will matter in multibyte locales like UTF-8.

yes, I discovered it with LANG=C.UTF-8 in ksh, bash and yash, while the other shells probably don't deal well with UTF-8

so I'm setting LANG=C (or more precisely LC_ALL) before trimming characters

> For the purpose of byte-to-integer conversion, it would be the best to use external commands like "od" and "xxd" which do not convert bytes to wide characters. If you insist on performing conversion in the shell script, some kind of workaround would be inevitable, since POSIX defines no portable 8bit-safe locale.

od is actually used in bytes.sh (which handles the full 8 bits, including NUL) to read an entire file, but it cannot be used in str.sh

the n_str_set function is used by the associative arrays (hashtab.sh and map.sh via the n_str_hash function) and forking od for each call would be catastrophic for performance (for an example, check the hashtab chapter in the API.txt file, as compress.sh doesn't use n_str_hash)

probably, the best option I have is to check the locale when running under yash:
- if it is single-byte full 8bits (like iso88591) behave normally
- if it is 7bits (C/POSIX), I will use a reduced version of str.sh (only the first 127 characters), as the script file it is sourced from cannot contain the other characters in any way, because yash would not parse it
- if it is UTF-8 (or any multi-byte encoding), I should fall back to the 7bits version, but the other characters are ignored in the hash computation (and this may lead to hash collisions and bad performance)
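The three cases above could be sketched as follows. This is a guess at a possible shape, not code from naive-0.0.2: the function classify_charmap and the mode names full8, ascii and ascii-lossy are made up, the list of charmap names is incomplete (unknown single-byte charsets end up in the conservative ascii-lossy case), and the yash check assumes that yash sets the YASH_VERSION variable.

```shell
# Hypothetical classifier: map a charmap name (as printed by `locale charmap`)
# to one of the three cases in the list above.
classify_charmap() {
    case $1 in
        ISO-8859-* | ISO8859-*) echo full8 ;;                   # single-byte, full 8 bits
        ANSI_X3.4-1968 | US-ASCII | ASCII | 646) echo ascii ;;  # 7 bits (C/POSIX)
        *) echo ascii-lossy ;;                                  # UTF-8, other, or unknown
    esac
}

# Only yash needs the check; other shells keep the normal behaviour.
if [ -n "${YASH_VERSION-}" ]; then
    str_mode=$(classify_charmap "$(locale charmap 2>/dev/null)")
else
    str_mode=full8
fi
printf 'str.sh mode: %s\n' "$str_mode"
```

The script could then pick the full or the reduced version of the functions depending on the value of str_mode.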

yes, it's far from perfect, but the UTF-8/multibyte thing was entirely missed in my design and only spotted in the testing phase :-!

> I'll follow up if I come up with another workaround.

thanks again, your reply already led me to a better understanding and, again, let me say the yash code is very efficient, as its speed is comparable to ksh and dash

if you are interested I will post here a link to the full benchmarks once released (yash is always one of the first three)

c-ya!
Reply to: #77531