StreamTokenizer Tokenize Stream


Breaking a string or stream into meaningful, independent words is known as tokenization. Tokenization is a common task for tool developers. The java.util package includes StringTokenizer, which tokenizes a string into independent words; for StringTokenizer, the source is a string. The java.io package provides a similar tokenizer, StreamTokenizer, for which the source is a stream. Here, the StreamTokenizer tokenizes a whole stream into independent words.

About StreamTokenizer

For StreamTokenizer, the source is a character stream, a Reader. StreamTokenizer splits the stream into distinct tokens and allows them to be read one by one. It is not a general-purpose tokenizer, but it knows the type of each token, such as whether the token is a word or a number. For general tokenization, there is java.util.Scanner, where each word or line can be read with the next() and nextLine() methods. StreamTokenizer includes a method nextToken() that can be used in a while loop to print all the tokens.
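The difference between the two classes can be seen in a small sketch. It is only an illustration: the class name ScannerVsStreamTokenizer and the helper methods scannerTokens() and typedTokens() are made up for this example, and a StringReader stands in for a file so the code runs without any input file.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;
import java.util.Scanner;

public class ScannerVsStreamTokenizer
{
  // Hypothetical helper: Scanner returns every token as a plain String
  static List<String> scannerTokens(String text)
  {
    List<String> tokens = new ArrayList<>();
    try(Scanner sc = new Scanner(text))
    {
      while(sc.hasNext())
      {
        tokens.add(sc.next());
      }
    }
    return tokens;
  }

  // Hypothetical helper: StreamTokenizer also reports the kind of each token
  static List<String> typedTokens(String text)
  {
    StreamTokenizer st = new StreamTokenizer(new StringReader(text));
    List<String> tokens = new ArrayList<>();
    try
    {
      while(st.nextToken() != StreamTokenizer.TT_EOF)
      {
        if(st.ttype == StreamTokenizer.TT_NUMBER)
        {
          tokens.add("number " + st.nval);   // numbers arrive as double
        }
        else if(st.ttype == StreamTokenizer.TT_WORD)
        {
          tokens.add("word " + st.sval);     // words arrive as String
        }
      }
    }
    catch(IOException e)
    {
      throw new UncheckedIOException(e);
    }
    return tokens;
  }

  public static void main(String args[])
  {
    System.out.println(scannerTokens("hello 10 world 20.5"));
    System.out.println(typedTokens("hello 10 world 20.5"));
  }
}
```

With the default settings, the Scanner version prints [hello, 10, world, 20.5], while the StreamTokenizer version prints [word hello, number 10.0, word world, number 20.5].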

The StreamTokenizer comes with the following fields (constants and instance variables) used to decide the type of the token and to read its value.

  1. int ttype: After nextToken() returns, this field holds the type of the token just read.
  2. static final int TT_EOF: This constant indicates that the end of the stream (end of file) is reached.
  3. static final int TT_EOL: This constant indicates that the end of a line is reached. It is reported only after eolIsSignificant(true) is called; otherwise line ends are skipped as white space.
  4. static final int TT_NUMBER: ttype equals this constant when the token returned by the nextToken() method is a number.
  5. static final int TT_WORD: ttype equals this constant when the token returned by the nextToken() method is a word.
  6. String sval: If the token is a word, this field contains the word that can be used in programming.
  7. double nval: If the token is a number, this field contains the number that can be used in programming.
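The fields listed above can be tried out in a short sketch that counts words, numbers and line ends. The class name TokenTypeDemo and the helper countTokenTypes() are hypothetical names chosen for this illustration, and the text is read from a StringReader instead of a file.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;

public class TokenTypeDemo
{
  // Hypothetical helper: counts words, numbers and line ends in the text
  static String countTokenTypes(String text)
  {
    StreamTokenizer st = new StreamTokenizer(new StringReader(text));
    st.eolIsSignificant(true);   // without this call TT_EOL is never returned

    int words = 0, numbers = 0, lineEnds = 0;
    try
    {
      while(st.nextToken() != StreamTokenizer.TT_EOF)
      {
        switch(st.ttype)
        {
          case StreamTokenizer.TT_WORD:   words++;    break;
          case StreamTokenizer.TT_NUMBER: numbers++;  break;
          case StreamTokenizer.TT_EOL:    lineEnds++; break;
          default:                                    break;  // any other single character
        }
      }
    }
    catch(IOException e)
    {
      throw new UncheckedIOException(e);
    }
    return "words=" + words + " numbers=" + numbers + " lineEnds=" + lineEnds;
  }

  public static void main(String args[])
  {
    System.out.println(countTokenTypes("hello 10\nworld 20.5\n"));
  }
}
```

For the two-line input used in main(), the sketch prints words=2 numbers=2 lineEnds=2.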

Following is the class signature

public class StreamTokenizer extends Object

Two programs are given on StreamTokenizer.

  1. First Program: Tokenizing text file contents
  2. Second Program: Printing file contents

First Program: Tokenizing Text File Contents

The following program, STDemo.java, reads the AllContent.txt file and prints the total value of all the numbers, the number of words and the number of characters that appear in the words.

File Name : AllContent.txt

hello 10 world 20.5

The above file contents are tokenized in the following program.

import java.io.*;
public class STDemo
{
  public static void main(String args[]) throws IOException
  {
    FileReader freader = new FileReader("AllContent.txt");
    StreamTokenizer st = new StreamTokenizer(freader);

    double sum = 0;
    int numWords = 0, numChars = 0;

    // nextToken() returns the token type and also stores it in st.ttype
    while(st.nextToken() != StreamTokenizer.TT_EOF)
    {
      if(st.ttype == StreamTokenizer.TT_NUMBER)
      {
        sum += st.nval;                    // numeric tokens are available in nval
      }
      else if(st.ttype == StreamTokenizer.TT_WORD)
      {
        numWords++;
        numChars += st.sval.length();      // word tokens are available in sval
      }
    }
    freader.close();                       // release the file once tokenizing is over

    System.out.println("Sum of total numbers in the file: " + sum);
    System.out.println("Total words (does not include numbers) in the file: " + numWords);
    System.out.println("No. of characters available in words: " + numChars);
  }
}

FileReader freader = new FileReader("AllContent.txt");
StreamTokenizer st = new StreamTokenizer(freader);

The StreamTokenizer constructor is passed freader, an object of the FileReader class. The FileReader opens the file AllContent.txt as a character stream and passes it to StreamTokenizer for tokenization.

The nextToken() method returns the tokens available in the stream one after another. Each call returns the type of the token and also stores it in ttype; the programmer can check this type against the StreamTokenizer constants TT_NUMBER and TT_WORD. The nval (numerical value) and sval (string value) fields contain the actual number or word read by the nextToken() method.
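Not every token is a word or a number. With the default settings, a single character such as = or ; comes back as an "ordinary" character, and in that case ttype holds the character itself. The following sketch shows this; the class name OrdinaryCharDemo and the helper describe() are hypothetical names used only for this illustration.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

public class OrdinaryCharDemo
{
  // Hypothetical helper: labels each token as WORD, NUMBER or CHAR
  static List<String> describe(String text)
  {
    StreamTokenizer st = new StreamTokenizer(new StringReader(text));
    List<String> tokens = new ArrayList<>();
    try
    {
      while(st.nextToken() != StreamTokenizer.TT_EOF)
      {
        if(st.ttype == StreamTokenizer.TT_WORD)
        {
          tokens.add("WORD " + st.sval);
        }
        else if(st.ttype == StreamTokenizer.TT_NUMBER)
        {
          tokens.add("NUMBER " + st.nval);
        }
        else
        {
          // for an ordinary character, ttype is the character value itself
          tokens.add("CHAR " + (char) st.ttype);
        }
      }
    }
    catch(IOException e)
    {
      throw new UncheckedIOException(e);
    }
    return tokens;
  }

  public static void main(String args[])
  {
    System.out.println(describe("total = 10 + 5;"));
  }
}
```

The sketch prints [WORD total, CHAR =, NUMBER 10.0, CHAR +, NUMBER 5.0, CHAR ;]. Note that the minus sign and the dot are treated as parts of a number by default, so they do not show up as ordinary characters.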

5 thoughts on “StreamTokenizer Tokenize Stream”

    1. resetSyntax() is a method
      This method resets this tokenizer’s syntax table so that all characters are “ordinary.” See the ordinaryChar method for more information on a character being ordinary.
