Class NexusTokenizer
- java.lang.Object
-
- pal.io.NexusTokenizer
-
public final class NexusTokenizer extends java.lang.Object
A simple token pull-parser for the NEXUS file format as specified in:
Maddison, D. R., Swofford, D. L., & Maddison, W. P., Systematic Biology, 46(4), pp. 590 - 621.
The parser is designed to break a NEXUS file into tokens which are read individually. Tokens come in four different types:
- Punctuation: any of the punctuation characters (see constants)
- Whitespace: sequences of characters composed of ' ' or '\t'. Whitespace is only returned if the option is set
- Word: any string of characters delimited by whitespace or punctuation
- Newline: '\r', '\n' or '\r\n'. The parser will return the character unless convertNL is set, in which case it will replace the token with the user-specified newline character
The parser has a set of options allowing tokens to be modified before they are returned (such as case modification or newline substitution).
Each read by the parser moves forward in the stream. At present there is no support for unreading tokens or for moving bi-directionally through the stream.
NB: in this implementation, the token #NEXUS is considered special and when read by the parser, it will return one token: '#NEXUS' not two: '#' and 'NEXUS'. This token has special meaning and is reflected in it having its own token type
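The token rules above can be illustrated with a small standalone splitter. This is a minimal sketch, not the PAL implementation: the class name TokenSplitSketch, the exact punctuation set (inferred from the names of the character constants) and the case-insensitive match on #NEXUS are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative splitter for the four token types plus the #NEXUS header (hypothetical, not PAL code). */
final class TokenSplitSketch {
    // Assumed punctuation set, inferred from the names of the character constants.
    private static final String PUNCTUATION = "()[]{}/\\,;:=*'\"`+-<>#.";

    static List<String> split(String input, boolean readWhiteSpace) {
        List<String> tokens = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (c == '\r' || c == '\n') {
                // "\r\n" is treated as a single newline token
                boolean crlf = c == '\r' && i + 1 < input.length() && input.charAt(i + 1) == '\n';
                int end = crlf ? i + 2 : i + 1;
                tokens.add(input.substring(i, end));
                i = end;
            } else if (c == ' ' || c == '\t') {
                int end = i;
                while (end < input.length() && (input.charAt(end) == ' ' || input.charAt(end) == '\t')) {
                    end++;
                }
                // whitespace runs are only returned when the option is set
                if (readWhiteSpace) {
                    tokens.add(input.substring(i, end));
                }
                i = end;
            } else if (input.regionMatches(true, i, "#NEXUS", 0, 6)) {
                // the header is returned as one token, not '#' followed by 'NEXUS'
                tokens.add(input.substring(i, i + 6));
                i += 6;
            } else if (PUNCTUATION.indexOf(c) >= 0) {
                tokens.add(String.valueOf(c));
                i++;
            } else {
                // a word runs until the next whitespace, newline or punctuation character
                int end = i;
                while (end < input.length() && !isDelimiter(input.charAt(end))) {
                    end++;
                }
                tokens.add(input.substring(i, end));
                i = end;
            }
        }
        return tokens;
    }

    private static boolean isDelimiter(char c) {
        return c == ' ' || c == '\t' || c == '\r' || c == '\n' || PUNCTUATION.indexOf(c) >= 0;
    }

    public static void main(String[] args) {
        for (String token : split("#NEXUS BEGIN taxa;", false)) {
            System.out.println("Token: " + token);
        }
    }
}
```

Unlike the real class, the sketch works on a complete String rather than pulling tokens from a stream, and it ignores comments.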
Usage
NexusTokenizer ntp = new NexusTokenizer(new PushbackReader(new FileReader("afile")));
ntp.setReadWhiteSpace(false); // ignore whitespace
ntp.setIgnoreComments(true); // ignore comments
ntp.setWordModification(NexusTokenizer.WORD_UPPERCASE); // all tokens in uppercase
String nToken = ntp.readToken();
while(nToken != null) {
System.out.println("Token: " + nToken);
System.out.println("Col: " + ntp.getCol());
System.out.println("Row: " + ntp.getRow());
nToken = ntp.readToken(); // read the next token, otherwise the loop never terminates
}
-
-
Field Summary
Fields
- static char ADDITION
- static char ASTERIX
- static char B_SLASH
- static char B_TICK
- static char C_RETURN
- static char COLON
- static char COMMA
- static char D_QUOTE
- static char DASH
- static char EQUALS
- static char F_SLASH
- static char G_THAN
- static char HASH
- static int HEADER_TOKEN: Flag indicating last token read was the header token #NEXUS
- static char L_BRACE
- static char L_BRACKET
- static char L_FEED
- static char L_PARENTHESIS
- static char L_THAN
- static int NEWLINE_TOKEN: Flag indicating last token read was a newline symbol/word
- static char PERIOD
- static int PUNCTUATION_TOKEN: Flag indicating last token read was a punctuation symbol
- static char R_BRACE
- static char R_BRACKET
- static char R_PARENTHESIS
- static char S_QUOTE
- static char SEMI_COLON
- static char SPACE
- static char TAB
- static int UNDEFINED_TOKEN: Flag indicating last token read was undefined
- static int WHITESPACE_TOKEN: Flag indicating last token read was whitespace
- static int WORD_LOWERCASE: Flag indicating words should be converted to lowercase
- static int WORD_TOKEN: Flag indicating last token read was a word
- static int WORD_UNMODIFIED: Flag indicating words should be untouched
- static int WORD_UPPERCASE: Flag indicating words should be converted to uppercase
-
Constructor Summary
Constructors
- NexusTokenizer(java.io.PushbackReader pr): Constructor for a NexusTokenizer
- NexusTokenizer(java.lang.String file): Constructor for a NexusTokenizer
-
Method Summary
All Methods
- boolean convertNewLine(): Gets the flag indicating whether this parser instance should convert newline characters.
- int getCol(): Gets the current column position of the cursor.
- java.lang.String getLastReadToken(): Returns the last read token.
- int getLastTokenType(): Determines the type of the last read token.
- int getRow(): Gets the current row position of the cursor.
- int getWordModification(): Gets the word modification flag currently in use.
- java.lang.String readToken(): Reads a token in from the underlying stream.
- boolean readWhiteSpace(): Gets the flag indicating whether this parser object is reading (and returning) whitespace.
- java.lang.String seek(int tokenType): Seeks through the stream to find the next token of the specified type.
- java.lang.String seek(java.lang.String token): Seeks through the stream to find the token argument.
- void setConvertNewLine(boolean b): Sets the convertNL flag.
- void setIgnoreComments(boolean b): Sets the ignoreComments flag.
- void setNewLineChar(char nl): Sets the character that newline characters are converted into.
- void setReadWhiteSpace(boolean b): Sets the readWS flag.
- void setWordModification(int flag): Sets the flag value for word modification.
-
-
-
Field Detail
-
L_PARENTHESIS
public static final char L_PARENTHESIS
- See Also:
- Constant Field Values
-
R_PARENTHESIS
public static final char R_PARENTHESIS
- See Also:
- Constant Field Values
-
L_BRACKET
public static final char L_BRACKET
- See Also:
- Constant Field Values
-
R_BRACKET
public static final char R_BRACKET
- See Also:
- Constant Field Values
-
L_BRACE
public static final char L_BRACE
- See Also:
- Constant Field Values
-
R_BRACE
public static final char R_BRACE
- See Also:
- Constant Field Values
-
F_SLASH
public static final char F_SLASH
- See Also:
- Constant Field Values
-
B_SLASH
public static final char B_SLASH
- See Also:
- Constant Field Values
-
COMMA
public static final char COMMA
- See Also:
- Constant Field Values
-
SEMI_COLON
public static final char SEMI_COLON
- See Also:
- Constant Field Values
-
COLON
public static final char COLON
- See Also:
- Constant Field Values
-
EQUALS
public static final char EQUALS
- See Also:
- Constant Field Values
-
ASTERIX
public static final char ASTERIX
- See Also:
- Constant Field Values
-
S_QUOTE
public static final char S_QUOTE
- See Also:
- Constant Field Values
-
D_QUOTE
public static final char D_QUOTE
- See Also:
- Constant Field Values
-
B_TICK
public static final char B_TICK
- See Also:
- Constant Field Values
-
ADDITION
public static final char ADDITION
- See Also:
- Constant Field Values
-
DASH
public static final char DASH
- See Also:
- Constant Field Values
-
L_THAN
public static final char L_THAN
- See Also:
- Constant Field Values
-
G_THAN
public static final char G_THAN
- See Also:
- Constant Field Values
-
HASH
public static final char HASH
- See Also:
- Constant Field Values
-
PERIOD
public static final char PERIOD
- See Also:
- Constant Field Values
-
L_FEED
public static final char L_FEED
- See Also:
- Constant Field Values
-
C_RETURN
public static final char C_RETURN
- See Also:
- Constant Field Values
-
TAB
public static final char TAB
- See Also:
- Constant Field Values
-
SPACE
public static final char SPACE
- See Also:
- Constant Field Values
-
WORD_UPPERCASE
public static final int WORD_UPPERCASE
Flag indicating words should be converted to uppercase
- See Also:
- Constant Field Values
-
WORD_LOWERCASE
public static final int WORD_LOWERCASE
Flag indicating words should be converted to lowercase
- See Also:
- Constant Field Values
-
WORD_UNMODIFIED
public static final int WORD_UNMODIFIED
Flag indicating words should be untouched
- See Also:
- Constant Field Values
-
UNDEFINED_TOKEN
public static final int UNDEFINED_TOKEN
Flag indicating last token read was undefined
- See Also:
- Constant Field Values
-
WORD_TOKEN
public static final int WORD_TOKEN
Flag indicating last token read was a word
- See Also:
- Constant Field Values
-
PUNCTUATION_TOKEN
public static final int PUNCTUATION_TOKEN
Flag indicating last token read was a punctuation symbol
- See Also:
- Constant Field Values
-
NEWLINE_TOKEN
public static final int NEWLINE_TOKEN
Flag indicating last token read was a newline symbol/word
- See Also:
- Constant Field Values
-
WHITESPACE_TOKEN
public static final int WHITESPACE_TOKEN
Flag indicating last token read was whitespace
- See Also:
- Constant Field Values
-
HEADER_TOKEN
public static final int HEADER_TOKEN
Flag indicating last token read was the header token #NEXUS
- See Also:
- Constant Field Values
-
-
Constructor Detail
-
NexusTokenizer
public NexusTokenizer(java.lang.String file) throws java.io.IOException
Constructor for a NexusTokenizer
- Parameters:
- file - File name for the NEXUS file
- Throws:
- java.io.IOException - I/O errors
-
NexusTokenizer
public NexusTokenizer(java.io.PushbackReader pr) throws java.io.IOException
Constructor for a NexusTokenizer
- Parameters:
- pr - PushbackReader
- Throws:
- java.io.IOException - I/O errors
-
-
Method Detail
-
readWhiteSpace
public boolean readWhiteSpace()
Gets the flag indicating whether or not this parser object is reading (and returning) whitespace
- Returns:
- the readWS flag
-
convertNewLine
public boolean convertNewLine()
Gets the flag indicating whether this parser instance should convert newline characters. As the specification says (see the reference in the class description above), newline characters may be '\r', '\n' or '\r\n'. To provide some uniformity, the parser can convert these symbols into a single specified character. By default, this feature is off.
- Returns:
- the convertNL flag
-
setReadWhiteSpace
public void setReadWhiteSpace(boolean b)
Sets the readWS flag. True means that the parser will return whitespace characters as a token (where whitespace = ' ' or '\t').
- Parameters:
- b - flag value for readWS
-
setConvertNewLine
public void setConvertNewLine(boolean b)
Sets the convertNL flag. True means that the parser will convert newline characters ('\r', '\n' or '\r\n') into either the default ('\n' if setNewLineChar() is not called) or a user-specified newline character.
- Parameters:
- b - flag value for convertNL
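The newline handling described here can be sketched with a PushbackReader, the same reader type the constructor accepts. This is a minimal sketch under stated assumptions: the class NewlineSketch and its method readNewline are hypothetical names, and the real parser may organise the logic differently.

```java
import java.io.IOException;
import java.io.PushbackReader;
import java.io.StringReader;

/** Illustrative newline handling (hypothetical, not PAL code). */
final class NewlineSketch {
    /**
     * Reads one newline sequence ('\r', '\n' or '\r\n') from the reader.
     * Returns the raw sequence, or the replacement character when convertNL is true.
     * Returns null, leaving the reader untouched, if the next character is not a newline.
     */
    static String readNewline(PushbackReader in, boolean convertNL, char newLineChar) throws IOException {
        int c = in.read();
        if (c != '\r' && c != '\n') {
            if (c != -1) {
                in.unread(c); // not a newline: push the character back
            }
            return null;
        }
        String raw = String.valueOf((char) c);
        if (c == '\r') {
            int next = in.read();
            if (next == '\n') {
                raw = "\r\n"; // Windows-style pair counts as one newline
            } else if (next != -1) {
                in.unread(next); // lone '\r': the following character belongs to the next token
            }
        }
        return convertNL ? String.valueOf(newLineChar) : raw;
    }

    public static void main(String[] args) throws IOException {
        PushbackReader in = new PushbackReader(new StringReader("\r\nnext"));
        String nl = readNewline(in, true, '\n');
        System.out.println("Newline token length after conversion: " + nl.length());
    }
}
```

A single-character pushback buffer (the PushbackReader default) is enough here, because at most one character is pushed back at a time.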
-
setIgnoreComments
public void setIgnoreComments(boolean b)
Sets the ignoreComments flag. True means that the tokenizer will ignore comments (i.e. sections of a NEXUS file delimited by '[...]'). When set to true, the tokenizer will return the first token available after a comment.
- Parameters:
- b - flag value for ignoreComments
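As an illustration of what ignoring comments means, the following sketch drops every '[...]' section from a character stream. It is not the PAL implementation: the class CommentSkipSketch is a hypothetical name, tracking nested brackets by depth is an assumption, and brackets inside quoted words are not treated specially.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

/** Illustrative comment skipping (hypothetical, not PAL code). */
final class CommentSkipSketch {
    /** Returns the input with all '[...]' sections removed. */
    static String stripComments(Reader in) throws IOException {
        StringBuilder out = new StringBuilder();
        int depth = 0; // number of currently open '[' brackets
        int c;
        while ((c = in.read()) != -1) {
            if (c == '[') {
                depth++;
            } else if (c == ']' && depth > 0) {
                depth--;
            } else if (depth == 0) {
                out.append((char) c); // outside any comment: keep the character
            }
        }
        return out.toString();
    }

    public static void main(String[] args) throws IOException {
        System.out.println(stripComments(new StringReader("BEGIN [a comment] taxa;")));
    }
}
```

The real tokenizer skips comments while pulling tokens rather than filtering the whole stream up front. The sketch only shows which characters are discarded.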
-
setNewLineChar
public void setNewLineChar(char nl)
Sets the character that newline characters are converted into
- Parameters:
- nl - Replacement newline character
-
getCol
public int getCol()
Gets the current column position of the cursor. Changed after each read.
- Returns:
- Column number (zero indexed)
-
getRow
public int getRow()
Gets the current row position of the cursor. Changed after each read.
- Returns:
- Row number (zero indexed)
-
getWordModification
public int getWordModification()
Gets the word modification flag currently in use
- Returns:
- Flag value for word modification
-
setWordModification
public void setWordModification(int flag)
Sets the flag value for word modification. The token case can be changed to lowercase or uppercase once it has been read from the stream (depending on the set flag). WORD_UNMODIFIED indicates that tokens should be returned in the case in which they are read from the stream. This value can be set at any time between token reads, so the next token read will be altered according to this value. The default is WORD_UNMODIFIED.
- Parameters:
- flag - Flag value, one of WORD_LOWERCASE, WORD_UPPERCASE or WORD_UNMODIFIED
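The effect of the three flags can be sketched as a small dispatch on the flag value. The class WordCaseSketch and the numeric constant values below are hypothetical; the real values are listed under Constant Field Values and are not reproduced here.

```java
import java.util.Locale;

/** Illustrative word modification (hypothetical, not PAL code). */
final class WordCaseSketch {
    // Placeholder values chosen for this sketch only; they are not the PAL constants.
    static final int WORD_UNMODIFIED = 0;
    static final int WORD_UPPERCASE = 1;
    static final int WORD_LOWERCASE = 2;

    /** Applies the case modification selected by the flag to a word token. */
    static String apply(String word, int flag) {
        switch (flag) {
            case WORD_UPPERCASE:
                return word.toUpperCase(Locale.ROOT);
            case WORD_LOWERCASE:
                return word.toLowerCase(Locale.ROOT);
            default:
                return word; // WORD_UNMODIFIED: return the word as read
        }
    }

    public static void main(String[] args) {
        System.out.println(apply("Begin", WORD_UPPERCASE));
        System.out.println(apply("Begin", WORD_LOWERCASE));
        System.out.println(apply("Begin", WORD_UNMODIFIED));
    }
}
```

Locale.ROOT is used so the result does not depend on the default locale of the machine. Whether the real class does the same is not stated in this documentation.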
-
readToken
public java.lang.String readToken() throws java.io.IOException, NexusParseException
Reads a token in from the underlying stream. Tokens are individual chunks read from the underlying stream. Each token is one of the four basic types:
- Word: any string of characters delimited by whitespace or punctuation
- Punctuation: any of the punctuation characters (see constants)
- Whitespace: sequences of characters composed of ' ' or '\t'. Whitespace is only returned if the option is set
- Newline: '\r', '\n' or '\r\n'. The parser will return the character unless convertNL is set, in which case it will replace the token with the user-specified newline character
- Returns:
- a String token or null if EOF is reached (i.e. no more tokens to read)
- Throws:
- java.io.IOException - I/O errors
- NexusParseException - Parsing errors
-
getLastTokenType
public int getLastTokenType()
Determines the type of the last read token. After readToken() has been called, the type of token returned can be determined by calling getLastTokenType(). This returns one of six different constants:
- UNDEFINED_TOKEN: default before anything is read from the stream
- WORD_TOKEN: word token was read
- PUNCTUATION_TOKEN: punctuation token was read
- NEWLINE_TOKEN: newline token was read
- WHITESPACE_TOKEN: whitespace token was read (never returned unless whitespace is being returned)
- HEADER_TOKEN: last token was the special word #NEXUS
- Returns:
- Type of the last token read.
-
seek
public java.lang.String seek(int tokenType) throws java.io.IOException, NexusParseException
Seeks through the stream to find the next token of the specified type. The type value can be one of:
- WORD_TOKEN
- PUNCTUATION_TOKEN
- NEWLINE_TOKEN
- WHITESPACE_TOKEN
- HEADER_TOKEN
- Returns:
- a String token or null if EOF is reached (i.e. no more tokens to read)
- Throws:
- java.io.IOException - I/O errors
- NexusParseException - Thrown by parsing errors or if tokenType == WHITESPACE_TOKEN && readWhiteSpace() == false
-
seek
public java.lang.String seek(java.lang.String token) throws java.io.IOException, NexusParseException
Seeks through the stream to find the token argument.
- Returns:
- a String token or null if token is not found (i.e. EOF is reached)
- Throws:
- java.io.IOException - I/O errors
- NexusParseException - Thrown by parsing errors or if token is whitespace && readWhiteSpace() == false
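The seek contract (consume tokens until a match is found, return null at EOF) can be sketched against a stand-in token source. The interface TokenSource and the class SeekSketch are hypothetical names introduced for this example, and the exact-match, case-sensitive comparison is an assumption; this documentation does not say how the real method compares tokens.

```java
import java.util.Arrays;
import java.util.Iterator;

/** Illustrative seek loop (hypothetical, not PAL code). */
final class SeekSketch {
    /** Stand-in for the tokenizer: returns the next token, or null at EOF. */
    interface TokenSource {
        String readToken();
    }

    /** Reads tokens until one equals the wanted token; returns it, or null if EOF is reached first. */
    static String seek(TokenSource src, String wanted) {
        String token;
        while ((token = src.readToken()) != null) {
            if (token.equals(wanted)) {
                return token;
            }
        }
        return null;
    }

    /** Builds a token source over a fixed list of tokens, for demonstration. */
    static TokenSource of(String... tokens) {
        Iterator<String> it = Arrays.asList(tokens).iterator();
        return () -> it.hasNext() ? it.next() : null;
    }

    public static void main(String[] args) {
        TokenSource src = of("BEGIN", "taxa", ";", "END");
        System.out.println("Found: " + seek(src, ";"));
        System.out.println("Next token after seek: " + src.readToken());
    }
}
```

Because every read moves forward and tokens cannot be unread, the tokens skipped during a seek are consumed. After a failed seek the source is at EOF.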
-
getLastReadToken
public java.lang.String getLastReadToken()
Returns the last read token. Each call to readToken() stores the returned token so that it can be retrieved again. However, each consuming readToken() call replaces this buffer with the new token.
- Returns:
- the last read token
-
-