NAME
    Text::Ngram - Basis for n-gram analysis

SYNOPSIS
      use Text::Ngram qw(ngram_counts add_to_counts);
      my $text   = "abcdefghijklmnop";
      my $hash_r = ngram_counts($text, 3); # Window size = 3
      # $hash_r => { abc => 1, bcd => 1, ... }

      my $more_text = "qrstuvwxyz";
      add_to_counts($more_text, 3, $hash_r); # Adds to the existing counts

DESCRIPTION
    N-gram analysis is a technique in textual analysis which slides a
    fixed-width window over a text and counts the character sequences it
    finds, in order to aid topic analysis, language determination and so
    on. The n-gram spectrum of a document can be used to compare and
    filter documents in multiple languages, prepare word prediction
    networks, and perform spelling correction.

    The neat thing about n-grams, though, is that they're really easy to
    determine. For n=3, for instance, we compute the n-gram counts like so:

        the cat sat on the mat
        ---                     $counts{"the"}++;
         ---                    $counts{"he "}++;
          ---                   $counts{"e c"}++;
           ...
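
    The sliding window above can be sketched in a few lines of pure Perl.
    This is only an illustration of the counting step, not the module's XS
    implementation; "sliding_counts" is a hypothetical helper, and no text
    preparation is applied here.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of Text::Ngram): count every
# $n-character window of $text, sliding one character at a time.
sub sliding_counts {
    my ($text, $n) = @_;
    my %counts;
    # The last window starts at length - $n; the range is empty
    # when the text is shorter than the window.
    for my $i (0 .. length($text) - $n) {
        $counts{ substr($text, $i, $n) }++;
    }
    return \%counts;
}

my $counts = sliding_counts("the cat sat on the mat", 3);
printf "%-5s %d\n", "'$_'", $counts->{$_} for sort keys %$counts;
```

    For this sentence, "the", "he " and "at " are each counted twice,
    because the window passes over them at two different positions.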

    This module provides an efficient XS-based implementation of n-gram
    spectrum analysis.

    There are two functions which can be imported:

        $href = ngram_counts($text[, $window]);

    This first function returns a hash reference with the n-gram histogram
    of the text for the given window size. If the window size is omitted,
    it defaults to 5 (5-grams), which seems to be a relatively standard
    choice.

        add_to_counts($more_text, $window, $href)

    This incrementally adds to the supplied hash; if $window is zero or
    undefined, then the window size is computed from the hash keys.
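
    One way the window size could be recovered from an existing histogram
    is sketched below. "window_from_counts" is a hypothetical helper, not
    the module's own code, and it assumes every key has the same length,
    which holds when the hash was built with a single window size. How the
    module itself treats an empty hash is not stated here; the sketch
    simply returns undef.

```perl
use strict;
use warnings;

# Hypothetical helper (not the module's code): recover the window
# size from an existing histogram by measuring any one of its keys.
sub window_from_counts {
    my ($href) = @_;
    my ($any_key) = keys %$href;    # any key will do; all share a length
    return defined($any_key) ? length($any_key) : undef;
}

my $href   = { abc => 1, bcd => 1, cde => 1 };
my $window = window_from_counts($href);    # 3 for this histogram
```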

Important note on text preparation
    Most of the published algorithms for textual n-gram analysis assume that
    the only characters you're interested in are alphabetic characters and
    spaces. So before the text is counted, the following preparation is
    made.

    All characters are lowercased (most papers use upper-casing, but that
    just feels so 1970s); punctuation and numerals are replaced by stop
    characters flanked by blanks; and multiple spaces are compressed into
    a single space.

    After the counts are made, n-grams containing stop characters are
    dropped from the hash.
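
    The preparation steps above might look roughly like this in pure Perl.
    This is a sketch under stated assumptions rather than the module's
    actual code: '#' stands in for the stop character, each run of
    punctuation or numerals becomes a single stop character, only ASCII
    letters count as alphabetic, and "prepare_text" and "prepared_counts"
    are hypothetical names.

```perl
use strict;
use warnings;

# Assumption: '#' stands in for the stop character; the character the
# module really uses is not documented in this text.
my $STOP = '#';

# Hypothetical helper mirroring the preparation steps described above.
sub prepare_text {
    my ($text) = @_;
    $text = lc $text;                  # lowercase everything
    $text =~ s/[^a-z\s]+/ $STOP /g;    # punctuation/numerals -> " # "
    $text =~ s/\s+/ /g;                # compress runs of whitespace
    return $text;
}

# Count n-grams of the prepared text, then drop any n-gram that
# contains the stop character.
sub prepared_counts {
    my ($text, $n) = @_;
    my $clean = prepare_text($text);
    my %counts;
    $counts{ substr($clean, $_, $n) }++ for 0 .. length($clean) - $n;
    delete @counts{ grep { index($_, $STOP) >= 0 } keys %counts };
    return \%counts;
}

my $prepared = prepare_text("The cat, aged 7, sat.");
my $trigrams = prepared_counts("The cat, aged 7, sat.", 3);
```

    With these assumptions, n-grams never span a punctuation mark or a
    number: a window such as "t #" is counted and then deleted, so only
    sequences of letters and spaces remain in the histogram.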

    If you prefer to do your own text preparation, use the internal routines
    "process_text" and "process_text_incrementally" instead of
    "ngram_counts" and "add_to_counts" respectively.

SEE ALSO
    Cavnar, W. B. (1993). N-gram-based text filtering for TREC-2. In D.
    Harman (Ed.), *Proceedings of TREC-2: Text Retrieval Conference 2*.
    Washington, DC: National Bureau of Standards.

    Shannon, C. E. (1951). Prediction and entropy of printed English. *The
    Bell System Technical Journal, 30*. 50-64.

    Ullmann, J. R. (1977). A binary n-gram technique for automatic
    correction of substitution, deletion, insertion and reversal errors in
    words. *Computer Journal, 20*. 141-147.

SUPPORT
    Beep... beep... this is a recorded announcement:

    I've released this software because I find it useful, and I hope you
    might too. But I am a being of finite time and I'd like to spend more of
    it writing cool modules like this and less of it answering email, so
    please excuse me if the support isn't as great as you'd like.

    Nevertheless, there is a general discussion list for users of all my
    modules, to be found at
    http://lists.netthink.co.uk/listinfo/module-mayhem

    If you have a problem with this module, someone there will probably have
    it too.

AUTHOR
    Simon Cozens, "simon@cpan.org"

COPYRIGHT AND LICENSE
    Copyright 2003 by Simon Cozens

    This library is free software; you can redistribute it and/or modify it
    under the same terms as Perl itself.