Python’s Natural Language Toolkit (NLTK) and Hadoop – Part 3

Welcome back to part 3 of Ben’s talk about Big Data and Natural Language Processing. (Click through to see the intro, part 1, and part 2).

[Slide 12: NLP of Big Data using NLTK and Hadoop]

We chose NLTK (Natural Language Toolkit) particularly because it’s not Stanford. Stanford is kind of a magic black box, and it costs money to get a commercial license. NLTK is open source and it’s Python. But more importantly, NLTK is completely built around stochastic analysis techniques and comes with data sets and training mechanisms built in. Particularly because the magic and foo of Big Data with NLP requires using your own domain knowledge and data set, NLTK is extremely valuable from a leadership perspective! And anyway, it does come with out-of-the-box NLP: use the Viterbi parser with a PCFG (Probabilistic Context Free Grammar, also called a Weighted Grammar) trained on the Penn Treebank, and you’ll get excellent parses immediately.
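
To give a feel for that out-of-the-box path, here is a minimal sketch (not from the talk) that induces a PCFG from the Penn Treebank sample shipped with NLTK and hands it to the Viterbi parser; the 100-sentence slice is an arbitrary illustrative choice, and it assumes the treebank corpus has been downloaded with nltk.download('treebank'):

<code>
    #!/usr/bin/env python

    import nltk
    from nltk.corpus import treebank

    # Collect productions from a slice of the parsed Treebank sample
    productions = []
    for tree in treebank.parsed_sents()[:100]:
        productions += tree.productions()

    # Induce a PCFG rooted at S and build the most-likely-parse (Viterbi) parser
    grammar = nltk.induce_pcfg(nltk.Nonterminal('S'), productions)
    parser = nltk.ViterbiParser(grammar)

    # Parse a sentence from the corpus itself so every token is in the
    # grammar's lexicon; expect this to take a little while on a grammar
    # this size. Recent NLTK releases return an iterator of trees from
    # parse(); older releases returned the single best tree directly.
    for tree in parser.parse(treebank.sents()[0]):
        print(tree)</code>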

[Slide 13: NLP of Big Data using NLTK and Hadoop]

Choosing Hadoop might seem obvious given that we’re talking about Data Science and particularly Big Data. But I do want to point out that the NLP tasks we’re going to talk about right off the bat are embarrassingly parallel, meaning they are extremely well suited to the MapReduce paradigm. If you take the sentence as the unit of natural language, then each sentence (at least to begin with) can be analyzed on its own, with little knowledge required about how the surrounding sentences are processed.
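
As a rough sketch of what that independence looks like in code (assuming the punkt sentence tokenizer and the default POS tagger models have been downloaded via nltk.download()), every sentence below is segmented, tokenized, and tagged with no shared state, which is exactly the shape of work a single mapper call can do:

<code>
    #!/usr/bin/env python

    import nltk

    text = ("Hadoop splits the corpus into records. "
            "Each sentence stands alone. "
            "NLTK can tag them independently.")

    for sentence in nltk.sent_tokenize(text):
        tokens = nltk.word_tokenize(sentence)
        # no state carries over between iterations, so each sentence
        # could just as easily be handled by a separate mapper
        print(nltk.pos_tag(tokens))</code>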

[Slide 14: NLP of Big Data using NLTK and Hadoop]

Combine that with the many flavors of Hadoop and the fact that you can get a cluster going in your closet on the cheap, and it’s the right price for a startup!

[Slide 15: NLP of Big Data using NLTK and Hadoop]

The glue that makes NLTK (Python) and Hadoop (Java) play nicely together is Hadoop Streaming. Hadoop Streaming lets you build a mapper and a reducer out of any executable, and expects that the executable will read key-value pairs from stdin and write them to stdout. Keep in mind that all the other Hadoopy-ness is still there, e.g. the FileInputFormat, HDFS, and job scheduling; all you get to replace is the mapper and the reducer. But that’s enough to bring in NLTK, so you’re off and running!

Here’s an example of a mapper and reducer to get you started doing token counts with NLTK. Note that these aren’t word counts: to computational linguists, words are language elements that have senses and therefore convey meaning. Instead, you’ll be counting tokens, the syntactic base for words in this example, and you might be surprised by what counts as a token. Trust me, it isn’t as simple as splitting on whitespace and punctuation!
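
As a quick illustration of that last point before the Hadoop code, the tokenizer used in the mapper splits on much more than whitespace (the sample sentence is just an example):

<code>
    >>> from nltk.tokenize import wordpunct_tokenize
    >>> wordpunct_tokenize("Don't count on whitespace-based splitting.")
    ['Don', "'", 't', 'count', 'on', 'whitespace', '-', 'based', 'splitting', '.']</code>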

mapper.py

<code>
    #!/usr/bin/env python

    import sys
    from nltk.tokenize import wordpunct_tokenize

    def read_input(file):
        for line in file:
            # split the line into tokens
            yield wordpunct_tokenize(line)

    def main(separator='\t'):
        # input comes from STDIN (standard input)
        data = read_input(sys.stdin)
        for tokens in data:
            # write the results to STDOUT (standard output);
            # what we output here will be the input for the
            # Reduce step, i.e. the input for reducer.py
            #
            # tab-delimited; the trivial token count is 1
            for token in tokens:
                print '%s%s%d' % (token, separator, 1)

    if __name__ == "__main__":
        main()</code>

reducer.py

<code>
    #!/usr/bin/env python

    from itertools import groupby
    from operator import itemgetter
    import sys

    def read_mapper_output(file, separator='\t'):
        for line in file:
            yield line.rstrip().split(separator, 1)

    def main(separator='\t'):
        # input comes from STDIN (standard input)
        data = read_mapper_output(sys.stdin, separator=separator)
        # groupby groups multiple word-count pairs by word,
        # and creates an iterator that returns consecutive keys and their group:
        #   current_word - string containing a word (the key)
        #   group - iterator yielding all ["<current_word>", "<count>"] items
        for current_word, group in groupby(data, itemgetter(0)):
            try:
                total_count = sum(int(count) for current_word, count in group)
                print "%s%s%d" % (current_word, separator, total_count)
            except ValueError:
                # count was not a number, so silently discard this item
                pass

    if __name__ == "__main__":
        main()</code>

Running the Job

<code>
    hduser@ubuntu:/usr/local/hadoop$ bin/hadoop jar contrib/streaming/hadoop-*streaming*.jar \
        -file /home/hduser/mapper.py    -mapper /home/hduser/mapper.py \
        -file /home/hduser/reducer.py   -reducer /home/hduser/reducer.py \
        -input /user/hduser/gutenberg/* -output /user/hduser/gutenberg-output</code>

Some interesting notes about using Hadoop Streaming relate to memory usage. NLP can be a memory-intensive task, since you have to load up training data to compute various aspects of your processing, and that loading can take minutes at the beginning of a run. With Hadoop Streaming, though, the interpreter is loaded once per task rather than once per record, which saves you from repeating that loading process over and over. Similarly, you can use generators and other Python iteration techniques to carve through mountains of data very easily. There are also some Python libraries, including dumbo, mrjob, and hadoopy, that can make all of this a bit easier.
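
To give a sense of how much boilerplate those libraries remove, here is a hedged sketch of the same token count written as an mrjob job (the file and class names are hypothetical, and it assumes mrjob and NLTK are installed):

<code>
    # mr_token_count.py -- illustrative sketch of the token count above using mrjob.
    # Run locally with:    python mr_token_count.py input.txt
    # Or against Hadoop:   python mr_token_count.py -r hadoop hdfs:///user/hduser/gutenberg

    from mrjob.job import MRJob
    from nltk.tokenize import wordpunct_tokenize

    class MRTokenCount(MRJob):

        def mapper(self, _, line):
            # mrjob hands the mapper one line of raw input at a time
            for token in wordpunct_tokenize(line):
                yield token, 1

        def reducer(self, token, counts):
            # counts is an iterator over all the 1s emitted for this token
            yield token, sum(counts)

    if __name__ == '__main__':
        MRTokenCount.run()</code>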
