What are useful Perl one-liners for working with UTF-8?
From Wiki.csoft.at
These examples assume that you have Perl 5.8.1 or newer and that you work in a UTF-8 locale (i.e., “locale charmap” outputs “UTF-8”). For Perl 5.8.0, option -C is not needed and the examples without -C will not work in a UTF-8 locale. You really should no longer use Perl 5.8.0, as its Unicode support had lots of bugs.

Print the euro sign (U+20AC) to stdout:
perl -C -e 'print pack("U",0x20ac)."\n"'
perl -C -e 'print "\x{20ac}\n"' # works only from U+0100 upwards
Locate malformed UTF-8 sequences:
perl -ne '/^(([\x00-\x7f]|[\xc0-\xdf][\x80-\xbf]|[\xe0-\xef][\x80-\xbf]{2}|[\xf0-\xf7][\x80-\xbf]{3})*)(.*)$/;print "$ARGV:$.:".($-[3]+1).":$_" if length($3)'
Locate non-ASCII bytes:
perl -ne '/^([\x00-\x7f]*)(.*)$/;print "$ARGV:$.:".($-[2]+1).":$_" if length($2)'

Convert non-ASCII characters into SGML/HTML/XML-style decimal numeric character references (e.g. Ş becomes &#350;):
perl -C -pe 's/([^\x00-\x7f])/sprintf("&#%d;", ord($1))/ge;'
Convert (hexa)decimal numeric character references to UTF-8:
perl -C -pe 's/&\#(\d+);/chr($1)/ge;s/&\#x([a-fA-F\d]+);/chr(hex($1))/ge;'