StreamTokenizer Tokenize Stream

April 18, 2011 | I/O streams in java

Breaking a string or stream into meaningful, independent words is known as tokenization, a common task for tool developers. The java.util package includes StringTokenizer, which tokenizes a string into independent words; its source is a string. The java.io package has a similar class, StreamTokenizer, whose source is a stream. StreamTokenizer tokenizes a whole stream into independent tokens.
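As a quick reminder of the string-based case, here is a minimal sketch of StringTokenizer splitting a string on its default delimiters (whitespace). The class name StringTokenizerSketch and the helper method words are made up for this illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.StringTokenizer;

// Hypothetical demo class; the names are not part of the Java API.
class StringTokenizerSketch
{
  // Splits the text on the default delimiters (space, tab, newline, carriage return, form feed).
  static List<String> words(String text)
  {
    List<String> result = new ArrayList<>();
    StringTokenizer tokenizer = new StringTokenizer(text);
    while (tokenizer.hasMoreTokens())
    {
      result.add(tokenizer.nextToken());
    }
    return result;
  }

  public static void main(String[] args)
  {
    // Every token comes back as a String, including "10" and "20.5".
    System.out.println(words("hello 10 world 20.5"));   // prints [hello, 10, world, 20.5]
  }
}
```

Note that StringTokenizer does not distinguish words from numbers; every token is returned as a plain String.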

About StreamTokenizer

For StreamTokenizer, the source is a character stream (a Reader). StreamTokenizer splits the stream into distinct tokens and lets you read them one by one. Unlike a plain word splitter, it also reports the type of each token, for example whether the token is a word or a number. For general-purpose tokenization there is java.util.Scanner, where each word or line can be read with the next() and nextLine() methods. StreamTokenizer provides the nextToken() method, which can be called in a while loop to read all the tokens.
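For comparison, the Scanner methods next() and nextLine() can be sketched as follows. The class name ScannerSketch and the helper firstWordAndRest are assumptions made for this illustration only.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

// Hypothetical demo class; the names are not part of the Java API.
class ScannerSketch
{
  // Reads the first word with next(), then the remainder of that line with nextLine().
  static List<String> firstWordAndRest(String text)
  {
    List<String> parts = new ArrayList<>();
    try (Scanner scanner = new Scanner(text))
    {
      parts.add(scanner.next());       // first whitespace-delimited word
      parts.add(scanner.nextLine());   // rest of the line, leading space included
    }
    return parts;
  }

  public static void main(String[] args)
  {
    System.out.println(firstWordAndRest("hello 10 world 20.5"));
  }
}
```

Here next() returns "hello" and nextLine() returns the rest of the line, " 10 world 20.5". Scanner returns strings unless a typed method such as nextDouble() is called, whereas StreamTokenizer classifies each token itself.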

StreamTokenizer comes with the following fields, three instance variables and four constants, used to determine the type and value of a token.

  1. int ttype: After nextToken() returns, this field holds the type of the token just read.
  2. static final int TT_EOF: Indicates that the end of the stream has been reached.
  3. static final int TT_EOL: Indicates that the end of a line has been reached. It is returned only when eolIsSignificant(true) has been called; otherwise line ends are treated as whitespace.
  4. static final int TT_NUMBER: Indicates that the token read by nextToken() is a number.
  5. static final int TT_WORD: Indicates that the token read by nextToken() is a word.
  6. String sval: If the token is a word, this field contains the word.
  7. double nval: If the token is a number, this field contains the number.
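The fields listed above can be seen together in the following minimal sketch, which reads from a StringReader instead of a file so that it runs without any input file. The class name TokenTypeSketch, the helper describe and the sample text are assumptions made for this illustration.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class; the names are not part of the Java API.
class TokenTypeSketch
{
  // Returns one description per token, chosen by comparing ttype with the TT_ constants.
  static List<String> describe(String text)
  {
    List<String> out = new ArrayList<>();
    StreamTokenizer st = new StreamTokenizer(new StringReader(text));
    st.eolIsSignificant(true);   // without this call TT_EOL is never returned
    try
    {
      while (st.nextToken() != StreamTokenizer.TT_EOF)
      {
        switch (st.ttype)
        {
          case StreamTokenizer.TT_NUMBER: out.add("NUMBER " + st.nval); break;
          case StreamTokenizer.TT_WORD:   out.add("WORD " + st.sval);   break;
          case StreamTokenizer.TT_EOL:    out.add("EOL");               break;
          default:                        out.add("CHAR " + (char) st.ttype);   // ordinary character
        }
      }
    }
    catch (IOException e)
    {
      throw new UncheckedIOException(e);   // not expected when reading from a StringReader
    }
    return out;
  }

  public static void main(String[] args)
  {
    // prints [WORD hello, NUMBER 10.0, EOL, WORD world, CHAR +, NUMBER 20.5]
    System.out.println(describe("hello 10\nworld + 20.5"));
  }
}
```

For a token that is neither a word, a number nor a line end, such as the + sign in the sample text, ttype holds the character itself.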

The class signature is as follows.

public class StreamTokenizer extends Object

Two programs that use StreamTokenizer are given.

  1. First Program: Tokenizing text file contents
  2. Second Program: Printing file contents

First Program: Tokenizing Text File Contents

The following program, STDemo.java, reads the AllContent.txt file and prints the sum of all the numbers, the number of words, and the total number of characters in those words.

File Name: AllContent.txt

hello 10 world 20.5

The above file contents are tokenized in the following program.

import java.io.*;
public class STDemo
{
  public static void main(String args[]) throws IOException
  {
    FileReader freader = new FileReader("AllContent.txt");
    StreamTokenizer st = new StreamTokenizer(freader);

    double sum = 0;
    int numWords = 0, numChars = 0;

    // nextToken() returns the token type; TT_EOF marks the end of the stream
    while(st.nextToken() != StreamTokenizer.TT_EOF)
    {
      if(st.ttype == StreamTokenizer.TT_NUMBER)
      {
        sum += st.nval;                  // numeric value of the token
      }
      else if(st.ttype == StreamTokenizer.TT_WORD)
      {
        numWords++;
        numChars += st.sval.length();    // text of the word token
      }
    }
    freader.close();                     // release the file handle

    System.out.println("Sum of total numbers in the file: " + sum);
    System.out.println("Total words (does not include numbers) in the file: " + numWords);
    System.out.println("No. of characters available in words: " + numChars);
  }
}


FileReader freader = new FileReader("AllContent.txt");
StreamTokenizer st = new StreamTokenizer(freader);

The StreamTokenizer constructor is passed a FileReader object, freader. The FileReader opens the file AllContent.txt as a character stream, which StreamTokenizer then tokenizes.

Each call to nextToken() reads the next token from the stream and returns its type, which is also stored in ttype. The type can be checked against the StreamTokenizer constants TT_NUMBER and TT_WORD. The nval (numeric value) and sval (string value) fields contain the actual number or word read by nextToken().

Second Program: Printing File Contents

File Name: Certification.txt

SCJP certification adds value to your resume.

The contents of the above file, Certification.txt, are tokenized and each word is printed.

import java.io.*;
public class FileTokens
{
  public static void main(String args[]) throws IOException
  {
    FileReader freader = new FileReader("Certification.txt");
    StreamTokenizer st = new StreamTokenizer(freader);

    while(st.nextToken() != StreamTokenizer.TT_EOF)   // until the end-of-file marker is reached
    {
      if(st.ttype == StreamTokenizer.TT_WORD)         // if the token is a word
      {
        System.out.println(st.sval);
      }
    }
    freader.close();                                  // release the file handle
  }
}


The Certification.txt file is tokenized, and each token is read in a while loop. nextToken() returns the type of the token, which is also stored in ttype; when the token is a word, its text is stored in sval and printed at the command prompt.
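To see which words the tokenizer produces for this sentence without creating a file, the same loop can be sketched over a StringReader. The class name WordListSketch and the helper words are assumptions made for this illustration.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical demo class; the names are not part of the Java API.
class WordListSketch
{
  // Collects the sval of every TT_WORD token, using the default tokenizer settings.
  static List<String> words(String text)
  {
    List<String> result = new ArrayList<>();
    StreamTokenizer st = new StreamTokenizer(new StringReader(text));
    try
    {
      while (st.nextToken() != StreamTokenizer.TT_EOF)
      {
        if (st.ttype == StreamTokenizer.TT_WORD)
        {
          result.add(st.sval);
        }
      }
    }
    catch (IOException e)
    {
      throw new UncheckedIOException(e);   // not expected when reading from a StringReader
    }
    return result;
  }

  public static void main(String[] args)
  {
    // prints [SCJP, certification, adds, value, to, your, resume.]
    System.out.println(words("SCJP certification adds value to your resume."));
  }
}
```

With the default settings the final period stays attached to the last word ("resume."), because the tokenizer treats '.' as a numeric character and numeric characters are allowed to continue a word once it has started.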