java.lang.Object
  org.pdfbox.util.PDFStreamEngine
    org.pdfbox.util.PDFTextStripper
public class PDFTextStripper
This class takes a PDF document and extracts all of its text, ignoring the formatting.
| Field Summary | | |
|---|---|---|
| protected java.util.Vector | charactersByArticle | The charactersByArticle is used to extract text by article divisions. |
| protected java.io.Writer | output | The stream to write the output to. |

| Constructor Summary | |
|---|---|
| PDFTextStripper() | Instantiate a new PDFTextStripper object. |
| PDFTextStripper(java.util.Properties props) | Instantiate a new PDFTextStripper object. |

| Method Summary | | |
|---|---|---|
| protected void | endDocument(PDDocument pdf) | This method is available for subclasses of this class. |
| protected void | endPage(PDPage page) | End a page. |
| protected void | endParagraph() | End a paragraph. |
| protected void | flushText() | This will print the text to the output stream. |
| protected java.util.List | getCharactersByArticle() | Character strings are grouped by articles. |
| protected int | getCurrentPageNo() | Get the current page number that is being processed. |
| PDOutlineItem | getEndBookmark() | Get the bookmark where text extraction should end, inclusive. |
| int | getEndPage() | This will get the last page that will be extracted. |
| java.lang.String | getLineSeparator() | This will get the line separator. |
| protected java.io.Writer | getOutput() | The output stream that is being written to. |
| java.lang.String | getPageSeparator() | This will get the page separator. |
| PDOutlineItem | getStartBookmark() | Get the bookmark where text extraction should start, inclusive. |
| int | getStartPage() | This is the page that the text extraction will start on. |
| java.lang.String | getText(COSDocument doc) | Deprecated. |
| java.lang.String | getText(PDDocument doc) | This will return the text of a document. |
| java.lang.String | getWordSeparator() | This will get the word separator. |
| protected void | processPage(PDPage page, COSStream content) | This will process the contents of a page. |
| protected void | processPages(java.util.List pages) | This will process all of the pages and the text that is in them. |
| void | setEndBookmark(PDOutlineItem aEndBookmark) | Set the bookmark where the text extraction should stop. |
| void | setEndPage(int endPageValue) | This will set the last page to be extracted by this class. |
| void | setLineSeparator(java.lang.String separator) | Set the desired line separator for output text. |
| void | setPageSeparator(java.lang.String separator) | Set the desired page separator for output text. |
| void | setShouldSeparateByBeads(boolean aShouldSeparateByBeads) | Set if the text stripper should group the text output by a list of beads. |
| void | setSortByPosition(boolean newSortByPosition) | The order of the text tokens in a PDF file may not be the same as the order in which they appear visually on the screen. |
| void | setStartBookmark(PDOutlineItem aStartBookmark) | Set the bookmark where text extraction should start, inclusive. |
| void | setStartPage(int startPageValue) | This will set the first page to be extracted by this class. |
| void | setSuppressDuplicateOverlappingText(boolean suppressDuplicateOverlappingTextValue) | By default the text stripper will attempt to remove text that overlaps. |
| void | setWordSeparator(java.lang.String separator) | Set the desired word separator for output text. |
| boolean | shouldSeparateByBeads() | This will tell if the text stripper should separate by beads. |
| boolean | shouldSortByPosition() | This will tell if the text stripper should sort the text tokens before writing to the stream. |
| boolean | shouldSuppressDuplicateOverlappingText() | This will tell if the text stripper should suppress duplicate overlapping text. |
| protected void | showCharacter(TextPosition text) | This will add a character to the list of characters to be printed to the text file. |
| protected void | startDocument(PDDocument pdf) | This method is available for subclasses of this class. |
| protected void | startPage(PDPage page) | Start a new page. |
| protected void | startParagraph() | Start a new paragraph. |
| protected void | writeCharacters(TextPosition text) | Write the string to the output stream. |
| void | writeText(COSDocument doc, java.io.Writer outputStream) | Deprecated. |
| void | writeText(PDDocument doc, java.io.Writer outputStream) | This will take a PDDocument and write the text of that document to the given writer. |

| Methods inherited from class org.pdfbox.util.PDFStreamEngine |
|---|
| getColorSpaces, getCurrentPage, getFonts, getGraphicsStack, getGraphicsState, getGraphicsStates, getResources, getTextLineMatrix, getTextMatrix, getXObjects, processOperator, processOperator, processStream, processSubStream, registerOperatorProcessor, resetEngine, setColorSpaces, setFonts, setGraphicsStack, setGraphicsState, setGraphicsStates, setTextLineMatrix, setTextMatrix, showString |

| Methods inherited from class java.lang.Object |
|---|
| clone, equals, finalize, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |

| Field Detail |
|---|
protected java.util.Vector charactersByArticle
protected java.io.Writer output
| Constructor Detail |
|---|
public PDFTextStripper()
throws java.io.IOException
java.io.IOException - If there is an error loading the properties.
public PDFTextStripper(java.util.Properties props)
throws java.io.IOException
props - The properties containing the mapping of operators to PDFOperator classes.
java.io.IOException - If there is an error reading the properties.

| Method Detail |
|---|
public java.lang.String getText(PDDocument doc)
throws java.io.IOException
doc - The document to get the text from.
java.io.IOException - if the doc state is invalid or it is encrypted.
public java.lang.String getText(COSDocument doc)
throws java.io.IOException
doc - The document to extract the text from.
java.io.IOException - If there is an error extracting the text.
See Also: getText(PDDocument)
public void writeText(COSDocument doc,
java.io.Writer outputStream)
throws java.io.IOException
doc - The document to extract the text from.
outputStream - The stream to write the text to.
java.io.IOException - If there is an error extracting the text.
See Also: writeText(PDDocument, Writer)
public void writeText(PDDocument doc,
java.io.Writer outputStream)
throws java.io.IOException
doc - The document to get the data from.
outputStream - The location to put the text.
java.io.IOException - If the doc is in an invalid state.
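The getText and writeText pairs above differ only in where the text ends up: a returned String or a caller-supplied Writer. One common way to relate such methods is to have the String form buffer the Writer form in memory. The sketch below illustrates that relationship with plain lists of lines. WriterBackedExtractor is a hypothetical stand-in, not a PDFBox class, and whether PDFTextStripper is implemented this way is an assumption.

```java
import java.io.IOException;
import java.io.StringWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.util.List;

// Hypothetical stand-in for an extractor; not a PDFBox class.
class WriterBackedExtractor {
    // Streams each line to the supplied Writer, so large outputs never
    // have to be held in memory by the extractor itself.
    static void writeText(List<String> lines, Writer out) throws IOException {
        for (String line : lines) {
            out.write(line);
            out.write("\n");
        }
        out.flush();
    }

    // Convenience form: collect the same output in a StringWriter and return it.
    static String getText(List<String> lines) {
        StringWriter buffer = new StringWriter();
        try {
            writeText(lines, buffer);
        } catch (IOException e) {
            // Not expected for an in-memory StringWriter; rethrown unchecked
            // only to keep this sketch easy to call.
            throw new UncheckedIOException(e);
        }
        return buffer.toString();
    }
}
```

With this shape, the Writer form suits files and network streams, while the String form suits small documents.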
protected void processPages(java.util.List pages)
throws java.io.IOException
pages - The pages object in the document.
java.io.IOException - If there is an error parsing the text.
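processPages works together with the startPage and endPage properties, which bound the pages that are extracted. The following self-contained sketch models that filtering on a plain list. PageRangeSketch and selectPages are hypothetical names, and the sketch assumes page numbers are 1-based with both bounds inclusive; check the behaviour of your PDFBox version before relying on either assumption.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper that models start/end page filtering; not part of PDFBox.
class PageRangeSketch {
    // Returns the pages whose page number lies in [startPage, endPage].
    // Assumption: page numbers are 1-based and both bounds are inclusive.
    static <T> List<T> selectPages(List<T> pages, int startPage, int endPage) {
        List<T> selected = new ArrayList<T>();
        int currentPageNo = 0;
        for (T page : pages) {
            currentPageNo++;
            if (currentPageNo >= startPage && currentPageNo <= endPage) {
                selected.add(page);
            }
        }
        return selected;
    }
}
```

For a four-page list, selectPages(pages, 2, 3) keeps the second and third entries, and passing Integer.MAX_VALUE as endPage keeps everything from startPage onward.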
protected void startDocument(PDDocument pdf)
throws java.io.IOException
pdf - The PDF document that is being processed.
java.io.IOException - If an IO error occurs.
protected void endDocument(PDDocument pdf)
throws java.io.IOException
pdf - The PDF document that is being processed.
java.io.IOException - If an IO error occurs.
protected void processPage(PDPage page,
COSStream content)
throws java.io.IOException
page - The page to process.
content - The contents of the page.
java.io.IOException - If there is an error processing the page.
protected void startParagraph()
throws java.io.IOException
java.io.IOException - If there is any error writing to the stream.
protected void endParagraph()
throws java.io.IOException
java.io.IOException - If there is any error writing to the stream.
protected void startPage(PDPage page)
throws java.io.IOException
page - The page we are about to process.
java.io.IOException - If there is any error writing to the stream.
protected void endPage(PDPage page)
throws java.io.IOException
page - The page that has just been processed.
java.io.IOException - If there is any error writing to the stream.
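startDocument, endDocument, startPage, endPage, startParagraph and endParagraph are protected hooks intended for subclasses. The sketch below shows the general pattern with a simplified, self-contained model: a base class drives the hooks and a subclass overrides only the ones it needs. HookedExtractor and PageMarkingExtractor are hypothetical classes, the hooks take a page number instead of a PDPage, and the call order shown is an assumption inferred from the method names rather than taken from the PDFBox source.

```java
import java.util.List;

// Hypothetical base class that models the hook sequence; not the PDFBox implementation.
class HookedExtractor {
    protected final StringBuilder out = new StringBuilder();

    // Hooks do nothing by default; subclasses override the ones they need.
    protected void startDocument() {}
    protected void endDocument() {}
    protected void startPage(int pageNo) {}
    protected void endPage(int pageNo) {}

    // Calls the hooks around each page's text. pages.get(i) holds the text of page i + 1.
    final String extract(List<String> pages) {
        out.setLength(0);
        startDocument();
        for (int i = 0; i < pages.size(); i++) {
            startPage(i + 1);
            out.append(pages.get(i));
            endPage(i + 1);
        }
        endDocument();
        return out.toString();
    }
}

// A subclass that labels each page and ends it with a newline.
class PageMarkingExtractor extends HookedExtractor {
    @Override
    protected void startPage(int pageNo) {
        out.append("[page ").append(pageNo).append("]");
    }

    @Override
    protected void endPage(int pageNo) {
        out.append("\n");
    }
}
```

A real subclass of PDFTextStripper would typically write to the Writer returned by getOutput() inside the overridden hooks instead of appending to a StringBuilder.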
protected void flushText()
throws java.io.IOException
java.io.IOException - If there is an error writing the text.
protected void writeCharacters(TextPosition text)
throws java.io.IOException
text - The text to write to the stream.
java.io.IOException - If there is an error when writing the text.
protected void showCharacter(TextPosition text)
Overrides: showCharacter in class PDFStreamEngine
text - The description of the character to display.
public int getStartPage()
public void setStartPage(int startPageValue)
startPageValue - New value of property startPage.
public int getEndPage()
public void setEndPage(int endPageValue)
endPageValue - New value of property endPage.
public void setLineSeparator(java.lang.String separator)
separator - The desired line separator string.
public java.lang.String getLineSeparator()
public void setPageSeparator(java.lang.String separator)
separator - The desired page separator string.
public java.lang.String getWordSeparator()
public void setWordSeparator(java.lang.String separator)
separator - The desired word separator string.
public java.lang.String getPageSeparator()
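The word, line and page separators configured through the setters above are plain strings inserted into the output. The self-contained sketch below models how three such separators could combine. SeparatorSketch and render are hypothetical names, and the exact placement (a separator between words, and a terminator after every line and every page) is an assumption, not a description of PDFBox's output.

```java
import java.util.List;

// Hypothetical model of how the three separators could shape the output; not PDFBox code.
class SeparatorSketch {
    // The input is nested as document -> pages -> lines -> words.
    static String render(List<List<List<String>>> pages,
                         String wordSeparator,
                         String lineSeparator,
                         String pageSeparator) {
        StringBuilder sb = new StringBuilder();
        for (List<List<String>> page : pages) {
            for (List<String> line : page) {
                for (int w = 0; w < line.size(); w++) {
                    if (w > 0) {
                        sb.append(wordSeparator); // only between words
                    }
                    sb.append(line.get(w));
                }
                sb.append(lineSeparator); // assumption: every line is terminated
            }
            sb.append(pageSeparator); // assumption: every page is terminated
        }
        return sb.toString();
    }
}
```

Choosing a distinctive page separator, such as a form feed, makes it straightforward to split the extracted text back into pages afterwards.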
public boolean shouldSuppressDuplicateOverlappingText()
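Duplicate overlapping text can occur, for example, when a producer draws the same string twice at almost the same position to simulate bold. The sketch below is a minimal, self-contained model of one way to suppress such duplicates: a fragment is dropped when a fragment with the same text has already been kept within a small distance. OverlapSketch, Fragment and the tolerance parameter are hypothetical, and the rule PDFBox actually applies may differ.

```java
import java.util.ArrayList;
import java.util.List;

class OverlapSketch {
    // Hypothetical stand-in for a positioned piece of text; not PDFBox's TextPosition.
    static final class Fragment {
        final float x;
        final float y;
        final String text;

        Fragment(float x, float y, String text) {
            this.x = x;
            this.y = y;
            this.text = text;
        }
    }

    // Keeps a fragment unless an already kept fragment has the same text
    // and lies within `tolerance` units of it on both axes.
    static List<Fragment> suppressDuplicates(List<Fragment> input, float tolerance) {
        List<Fragment> kept = new ArrayList<Fragment>();
        for (Fragment candidate : input) {
            boolean duplicate = false;
            for (Fragment seen : kept) {
                if (seen.text.equals(candidate.text)
                        && Math.abs(seen.x - candidate.x) < tolerance
                        && Math.abs(seen.y - candidate.y) < tolerance) {
                    duplicate = true;
                    break;
                }
            }
            if (!duplicate) {
                kept.add(candidate);
            }
        }
        return kept;
    }
}
```

The same word repeated elsewhere on the page is kept, because only fragments that both match and overlap are treated as duplicates.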
protected int getCurrentPageNo()
protected java.io.Writer getOutput()
protected java.util.List getCharactersByArticle()
public void setSuppressDuplicateOverlappingText(boolean suppressDuplicateOverlappingTextValue)
suppressDuplicateOverlappingTextValue - The suppressDuplicateOverlappingText to set.
public boolean shouldSeparateByBeads()
public void setShouldSeparateByBeads(boolean aShouldSeparateByBeads)
aShouldSeparateByBeads - The new grouping of beads.
public PDOutlineItem getEndBookmark()
public void setEndBookmark(PDOutlineItem aEndBookmark)
aEndBookmark - The ending bookmark.
public PDOutlineItem getStartBookmark()
public void setStartBookmark(PDOutlineItem aStartBookmark)
aStartBookmark - The starting bookmark.
public boolean shouldSortByPosition()
public void setSortByPosition(boolean newSortByPosition)
newSortByPosition - Tell PDFBox to sort the text positions.
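Sorting by position matters because the order of text tokens in the file need not match their visual order. The sketch below is a simplified, self-contained model of such a sort: fragments are ordered top to bottom and then left to right. PositionSortSketch and Fragment are hypothetical, y is assumed to grow downward from the top of the page, and rows are matched by exact y value, whereas a real implementation would likely allow some tolerance.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

class PositionSortSketch {
    // Hypothetical stand-in for a positioned piece of text; not PDFBox's TextPosition.
    static final class Fragment {
        final float x;
        final float y;
        final String text;

        Fragment(float x, float y, String text) {
            this.x = x;
            this.y = y;
            this.text = text;
        }
    }

    // Returns a copy ordered top to bottom, then left to right.
    // Assumption: y grows downward from the top of the page.
    static List<Fragment> sortByPosition(List<Fragment> fragments) {
        List<Fragment> sorted = new ArrayList<Fragment>(fragments);
        Collections.sort(sorted, new Comparator<Fragment>() {
            public int compare(Fragment a, Fragment b) {
                int byRow = Float.compare(a.y, b.y);
                return byRow != 0 ? byRow : Float.compare(a.x, b.x);
            }
        });
        return sorted;
    }
}
```

Leaving the sort off preserves the order in which the tokens appear in the content stream, which can be preferable for documents whose stream order already follows the reading order.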