Tokenization of Words and Sentences using NLTK

Tokenization is the process by which a string is divided into smaller parts called tokens.

Tokenization is the first step toward solving problems like text classification, sentiment analysis, and smart chatbots using the Natural Language Toolkit.

The Natural Language Toolkit provides a tokenizer interface through its tokenize module, which is divided into two main parts:

  1. word tokenize
  2. sentence tokenize

NLTK Tokenizer Package

The tokenizer package is used to find the words, sentences, and punctuation in a given string.

To import this package in Python, we use:
import nltk.tokenize

1. Word Tokenize Using nltk.tokenize
In this step we split a sentence into words.

from nltk.tokenize import word_tokenize

input_string = '''Random-access memory (RAM /ræm/) is a form of computer memory that can be read and changed in any order,
typically used to store working data and machine code. It cost around $40.99 of 8gb.'''

# applying the word_tokenize method
output = word_tokenize(input_string)

print(output)

Result:

['Random-access', 'memory', '(', 'RAM', '/ræm/', ')', 'is', 'a', 'form', 'of', 'computer', 'memory', 'that', 'can', 'be', 'read', 'and', 'changed', 'in', 'any', 'order', ',', 'typically', 'used', 'to', 'store', 'working', 'data', 'and', 'machine', 'code', '.', 'It', 'cost', 'around', '$', '40.99', 'of', '8gb', '.']

Word + Punctuation Tokenize Using nltk.tokenize

In this step we split the input text into separate word and punctuation tokens.

from nltk.tokenize import wordpunct_tokenize

input_string = '''Random-access memory (RAM /ræm/) is a form of computer memory that can be read and changed in any order,
typically used to store working data and machine code. It cost around $40.99 of 8gb.'''

output = wordpunct_tokenize(input_string)

print(output)

Result:

['Random', '-', 'access', 'memory', '(', 'RAM', '/', 'ræm', '/)', 'is', 'a', 'form', 'of', 'computer', 'memory', 'that', 'can', 'be', 'read', 'and', 'changed', 'in', 'any', 'order', ',', 'typically', 'used', 'to', 'store', 'working', 'data', 'and', 'machine', 'code', '.', 'It', 'cost', 'around', '$', '40', '.', '99', 'of', '8gb', '.']
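Notice that `wordpunct_tokenize` splits "$40.99" into "$", "40", ".", "99", unlike `word_tokenize`. That is because it is a simple regular-expression tokenizer: NLTK documents it as a `RegexpTokenizer` with the pattern `\w+|[^\w\s]+`. The same split can be sketched with Python's standard `re` module (a simplified equivalent, not NLTK's actual code):

```python
import re

# Runs of word characters, or runs of punctuation/symbols;
# whitespace is discarded entirely.
pattern = re.compile(r"\w+|[^\w\s]+")

print(pattern.findall("Random-access memory (RAM)."))
# → ['Random', '-', 'access', 'memory', '(', 'RAM', ').']
```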

2. Sentence Tokenize Using nltk.tokenize
In this step we split the input text into sentences.

from nltk.tokenize import sent_tokenize

input_string = '''Random-access memory (RAM /ræm/) is a form of computer memory that can be read and changed in any order,
typically used to store working data and machine code. It cost around $40.99 of 8gb.'''

output = sent_tokenize(input_string)

print(output)

Result:

['Random-access memory (RAM /ræm/) is a form of computer memory that can be read and changed in any order, \ntypically used to store working data and machine code.', 'It cost around $40.99 of 8gb.']

