NAME
    Lingua::JA::NormalizeText - Text Normalizer

SYNOPSIS
      use Lingua::JA::NormalizeText;
      use utf8;

      my @options = ( qw/nfkc decode_entities/, \&dearinsu_to_desu );
      my $normalizer = Lingua::JA::NormalizeText->new(@options);

      print $normalizer->normalize('鳥が㌧㌦でありんす&hearts;');
      # -> 鳥がトンドルです♥

      sub dearinsu_to_desu
      {
          my $text = shift;
          $text =~ s/でありんす/です/g;

          return $text;
      }

      # or

      use Lingua::JA::NormalizeText qw/old2new_kanji/;
      use utf8;

      print old2new_kanji('惡の華');
      # -> 悪の華

DESCRIPTION
    Lingua::JA::NormalizeText normalizes text.

METHODS
  new(@options)
    Creates a new Lingua::JA::NormalizeText instance.

    The following options are available:

      OPTION                 SAMPLE INPUT                 OUTPUT FOR SAMPLE INPUT
      ---------------------  ---------------------------  -----------------------
      lc                     DdD                          ddd
      uc                     DdD                          DDD
      nfkc                   ㌦                           ドル (length: 2)
      nfkd                   ㌦                           ドル (length: 3)
      nfc
      nfd
      decode_entities        &hearts;                     ♥
      strip_html             <em>あ</em>                  あ
      alnum_z2h              ＡＢＣ１２３                 ABC123
      alnum_h2z              ABC123                       ＡＢＣ１２３
      space_z2h
      space_h2z
      katakana_z2h           ハァハァ                     ﾊｧﾊｧ
      katakana_h2z           ｽｰﾊｰｽｰﾊｰ                     スーハースーハー
      katakana2hiragana      パンツ                       ぱんつ
      hiragana2katakana      ぱんつ                       パンツ
      wave2tilde             〜                           ～
      tilde2wave             ～                           〜
      wavetilde2long         〜, ～                       ー
      wave2long              〜                           ー
      tilde2long             ～                           ー
      fullminus2long         −                            ー
      dashes2long            —                            ー
      drawing_lines2long     ─                            ー
      unify_long_repeats     ヴァーーー                   ヴァー
      nl2space               (LF)(CR)(CRLF)               (space)(space)(space)
      unify_nl               (LF)(CR)(CRLF)               \n\n\n
      unify_long_spaces      あ(space)(space)あ           あ(space)あ
      unify_whitespaces      \x{00A0}                     (space)
      trim                   (space)あ(space)あ(space)    あ(space)あ
      ltrim                  (space)あ(space)             あ(space)
      rtrim                  ああ(space)(space)           ああ
      old2new_kana           ゐヰゑヱヸヹ                 いイえエイ゙エ゙
      old2new_kanji          亞逸鬭                       亜逸闘
      tab2space              (tab)(tab)                   (space)(space)
      remove_controls        あ\x{0000}あ                 ああ
      dakuon_normalize       さ\x{3099}                   ざ
      handakuon_normalize    は\x{309A}                   ぱ
      all_dakuon_normalize   さ\x{3099}は\x{309A}         ざぱ

    Options are applied in the order in which they appear in @options:
    the first element is applied first, and the last element is applied
    last.

    You can also add your own functions by passing code references. (See
    the dearinsu_to_desu function in the SYNOPSIS section.)

  normalize($text)
    Normalizes $text and returns the result.
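
    The ordering rule for new(@options) can be illustrated with a small
    sketch. It assumes the module is installed, and abc_to_x is a
    hypothetical helper written only for this example; it is not part of
    the module. The same two steps give different results depending on
    which one comes first in @options.

```perl
use strict;
use warnings;
use Lingua::JA::NormalizeText;

# Hypothetical external function (not provided by the module):
# replaces every lowercase "abc" with "X".
sub abc_to_x
{
    my $text = shift;
    $text =~ s/abc/X/g;

    return $text;
}

# 'lc' runs first, so "ABC" becomes "abc" before abc_to_x sees it.
my $lc_first     = Lingua::JA::NormalizeText->new('lc', \&abc_to_x);
my $lc_then_custom = $lc_first->normalize('ABC');

# abc_to_x runs first and finds no lowercase "abc"; 'lc' then lowercases.
my $custom_first   = Lingua::JA::NormalizeText->new(\&abc_to_x, 'lc');
my $custom_then_lc = $custom_first->normalize('ABC');

print "$lc_then_custom\n";   # X
print "$custom_then_lc\n";   # abc
```

    When an option depends on the output of another (for example, a
    custom substitution that expects already-normalized text), list the
    prerequisite option earlier in @options.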

OPTIONS
  dashes2long
    Note that this option does not convert hyphens into the long vowel
    mark (ー).

  unify_long_spaces
    Note that this option unifies only SPACE (U+0020) and IDEOGRAPHIC
    SPACE (U+3000).

  remove_controls
    Note that this option does not remove the following characters:

      CHARACTER TABULATION
      LINE FEED
      CARRIAGE RETURN

  unify_whitespaces
    This option converts the following characters into SPACE (U+0020):

      LINE TABULATION
      FORM FEED
      NEXT LINE
      NO-BREAK SPACE
      OGHAM SPACE MARK
      MONGOLIAN VOWEL SEPARATOR
      EN QUAD
      EM QUAD
      EN SPACE
      EM SPACE
      THREE-PER-EM SPACE
      FOUR-PER-EM SPACE
      SIX-PER-EM SPACE
      FIGURE SPACE
      PUNCTUATION SPACE
      THIN SPACE
      HAIR SPACE
      LINE SEPARATOR
      PARAGRAPH SEPARATOR
      NARROW NO-BREAK SPACE
      MEDIUM MATHEMATICAL SPACE

    Note that this option does not convert the following characters:

      CHARACTER TABULATION
      LINE FEED
      CARRIAGE RETURN
      IDEOGRAPHIC SPACE

AUTHOR
    pawa <pawapawa@cpan.org>

SEE ALSO
    Table of old and new kanji forms (新旧字体表):
    <http://www.asahi-net.or.jp/~ax2s-kmtn/ref/old_chara.html>

    Lingua::JA::Regular::Unicode

    Lingua::JA::Dakuon

    Lingua::JA::Moji

    Unicode::Normalize

    HTML::Entities

    HTML::Scrubber

LICENSE
    This library is free software; you can redistribute it and/or modify
    it under the same terms as Perl itself.