Breaking a string or stream into meaningful, independent words is known as tokenization. Tokenization is common practice among tool developers. The java.util package includes StringTokenizer, which tokenizes a string into independent words; for StringTokenizer, the source is a string. The java.io package offers a similar tokenizer, StreamTokenizer, for which the source is a stream. StreamTokenizer tokenizes a whole stream into independent words.
About StreamTokenizer
For StreamTokenizer, the source is a character stream, a Reader. StreamTokenizer splits the stream into distinct tokens and allows you to read them one by one. It is not a general-purpose tokenizer; its strength is that it knows the type of each token, such as whether the token is a word or a number. For general tokenization, there is java.util.Scanner, where each word or line can be read with the next() and nextLine() methods. StreamTokenizer includes a method nextToken() that can be called in a while loop to read all the tokens.
The StreamTokenizer comes with the following fields (three instance variables and four constants) used to decide the type and the value of the token.
- int ttype: After nextToken() reads a token, this field holds the type of that token.
- static final int TT_EOF: This constant indicates that the end of the stream has been reached.
- static final int TT_EOL: This constant indicates that the end of a line has been read. It is reported only when eolIsSignificant(true) has been called on the tokenizer.
- static final int TT_NUMBER: This constant indicates that the token read by the nextToken() method is a number.
- static final int TT_WORD: This constant indicates that the token read by the nextToken() method is a word.
- String sval: If the token is a word, this field contains the word, which can be used in programming.
- double nval: If the token is a number, this field contains the number, which can be used in programming.
Following is the class signature:
public class StreamTokenizer extends Object
Two programs are given on StreamTokenizer.
- First Program: Tokenizing text file contents
- Second Program: Printing file contents
First Program: Tokenizing a Text File Contents
The following program, STDemo.java, reads the AllContent.txt file and prints the sum of all the numbers, the number of words, and the total number of characters in those words.
File Name : AllContent.txt
hello 10 world 20.5
The above file contents are tokenized in the following program.
import java.io.*;
public class STDemo
{
  public static void main(String args[]) throws IOException
  {
    FileReader freader = new FileReader("AllContent.txt");
    StreamTokenizer st = new StreamTokenizer(freader);
    double sum = 0;                  // running total of all numeric tokens
    int numWords = 0, numChars = 0;  // word count and characters in those words
    // nextToken() returns the token type; TT_EOF marks the end of the stream
    while(st.nextToken() != StreamTokenizer.TT_EOF)
    {
      if(st.ttype == StreamTokenizer.TT_NUMBER)
      {
        sum += st.nval;              // nval holds the number just read
      }
      else if(st.ttype == StreamTokenizer.TT_WORD)
      {
        numWords++;
        numChars += st.sval.length();  // sval holds the word just read
      }
    }
    freader.close();                 // release the file handle
    System.out.println("Sum of total numbers in the file: " + sum);
    System.out.println("Total words (does not include numbers) in the file: " + numWords);
    System.out.println("No. of characters available in words: " + numChars);
  }
}

FileReader freader = new FileReader("AllContent.txt");
StreamTokenizer st = new StreamTokenizer(freader);
The StreamTokenizer constructor is passed an object freader of the FileReader class. The FileReader opens the file AllContent.txt as a character stream and passes it to StreamTokenizer for tokenization.
The nextToken() method reads the tokens available in the stream one after another and returns the type of each. The ttype field contains the type of the token, which the programmer can check against the StreamTokenizer constants TT_NUMBER and TT_WORD. The nval (numerical value) and sval (string value) fields contain the actual number or word read by the nextToken() method.
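To see what ttype, sval and nval hold for each token, here is a minimal sketch that tokenizes a string instead of a file. The class name TokenTypesDemo and the method describeTokens() are made up for illustration, and a StringReader stands in for the FileReader so the example runs without AllContent.txt. It relies on the default syntax of StreamTokenizer, where a character such as '=' is neither a word nor a number and is reported through ttype itself.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Illustrative class name, not part of the original programs
class TokenTypesDemo {

    // Returns one description per token found in the given text
    static List<String> describeTokens(String text) {
        StreamTokenizer st = new StreamTokenizer(new StringReader(text));
        List<String> out = new ArrayList<>();
        try {
            while (st.nextToken() != StreamTokenizer.TT_EOF) {
                if (st.ttype == StreamTokenizer.TT_NUMBER) {
                    out.add("NUMBER " + st.nval);      // nval is always a double
                } else if (st.ttype == StreamTokenizer.TT_WORD) {
                    out.add("WORD " + st.sval);
                } else {
                    // Ordinary character: ttype holds the character itself
                    out.add("CHAR " + (char) st.ttype);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out;
    }

    public static void main(String[] args) {
        for (String line : describeTokens("total = 20.5")) {
            System.out.println(line);
        }
    }
}
```

For the input "total = 20.5" the program prints WORD total, CHAR = and NUMBER 20.5 on separate lines. Note that a whole number such as 10 is also returned as a double, so it appears as 10.0.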
Can you give an example of TT_EOL?
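A possible sketch for TT_EOL: by default StreamTokenizer skips line ends as whitespace, so TT_EOL is returned only after eolIsSignificant(true) is called. The example below counts the tokens on each line. The class name EolDemo and the method tokensPerLine() are invented for this illustration, and a StringReader is used in place of a file.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Illustrative class name, not part of the original programs
class EolDemo {

    // Returns the number of tokens found on each line of the text
    static List<Integer> tokensPerLine(String text) {
        StreamTokenizer st = new StreamTokenizer(new StringReader(text));
        st.eolIsSignificant(true);   // without this call TT_EOL is never returned
        List<Integer> counts = new ArrayList<>();
        int current = 0;
        try {
            int type;
            while ((type = st.nextToken()) != StreamTokenizer.TT_EOF) {
                if (type == StreamTokenizer.TT_EOL) {
                    counts.add(current);   // a line just ended
                    current = 0;
                } else {
                    current++;             // any other token on this line
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        if (current > 0) {
            counts.add(current);           // last line had no trailing newline
        }
        return counts;
    }

    public static void main(String[] args) {
        System.out.println(tokensPerLine("hello 10\nworld 20.5 java\n"));
    }
}
```

With the two-line input shown in main(), the program prints [2, 3].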
Sir, can you please tell me what the resetSyntax() and wordChars(int i, int j) methods do?
We use wordChars() in StreamTokenizer; as for resetSyntax(), I did not come across it in Java.
resetSyntax() is a method of StreamTokenizer.
This method resets this tokenizer’s syntax table so that all characters are “ordinary.” See the ordinaryChar method for more information on a character being ordinary.
What is resetSyntax()? Have I used it anywhere?
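For the resetSyntax() and wordChars() question, a small sketch may help. resetSyntax() makes every character ordinary, so each character comes back as its own token. wordChars(low, hi) then marks a range of characters as parts of words, and whitespaceChars(low, hi) marks a range as separators. The class ResetSyntaxDemo and its two methods are hypothetical names used only for this illustration.

```java
import java.io.IOException;
import java.io.StreamTokenizer;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Illustrative class name, not part of the original programs
class ResetSyntaxDemo {

    // After resetSyntax() alone, every character (even a blank) is returned
    // as a separate token, with ttype holding the character
    static List<String> rawChars(String text) {
        StreamTokenizer st = new StreamTokenizer(new StringReader(text));
        st.resetSyntax();
        List<String> out = new ArrayList<>();
        try {
            while (st.nextToken() != StreamTokenizer.TT_EOF) {
                out.add(String.valueOf((char) st.ttype));
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out;
    }

    // Builds a small custom syntax: commas and blanks separate tokens,
    // every other character from '!' to 0xFF belongs to a word
    static List<String> commaFields(String text) {
        StreamTokenizer st = new StreamTokenizer(new StringReader(text));
        st.resetSyntax();
        st.wordChars('!', 0xFF);
        st.whitespaceChars(0, ' ');
        st.whitespaceChars(',', ',');   // overrides the word setting for ','
        List<String> out = new ArrayList<>();
        try {
            while (st.nextToken() != StreamTokenizer.TT_EOF) {
                out.add(st.sval);       // number parsing is off, so all tokens are words
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(rawChars("ab 1"));
        System.out.println(commaFields("10,abc-2, x.y"));
    }
}
```

The first call returns the four single characters a, b, a blank and 1. The second returns the words 10, abc-2 and x.y; because resetSyntax() also switches off number parsing, 10 arrives in sval as a word rather than in nval.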