org.pdfbox.searchengine.lucene
Class LucenePDFDocument
java.lang.Object
  org.pdfbox.searchengine.lucene.LucenePDFDocument
public final class LucenePDFDocument
extends java.lang.Object
This class creates a document for the Lucene search engine.
It should plug easily into the IndexHTML or IndexFiles classes that come with
the Lucene project. This class populates the following fields.
| Lucene Field Name | Description |
|---|---|
| path | File system path if loaded from a file |
| url | URL to PDF document |
| contents | Entire contents of PDF document, indexed but not stored |
| summary | First 500 characters of content |
| modified | The modified date/time according to the url or path |
| uid | A unique identifier for the Lucene document. |
| CreationDate | From PDF meta-data if available |
| Creator | From PDF meta-data if available |
| Keywords | From PDF meta-data if available |
| ModificationDate | From PDF meta-data if available |
| Producer | From PDF meta-data if available |
| Subject | From PDF meta-data if available |
| Trapped | From PDF meta-data if available |
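
The summary field in the table above is described as the first 500 characters of the content. The following is a minimal sketch of that truncation using only the Java standard library. The class name SummarySketch and the method summarize are hypothetical names chosen for illustration, and the handling of text shorter than 500 characters is an assumption; the actual LucenePDFDocument implementation may differ.

```java
// Illustrative sketch only; this is not the PDFBox implementation.
final class SummarySketch {
    // Length taken from the field table: "First 500 characters of content".
    static final int SUMMARY_LENGTH = 500;

    // Returns at most the first SUMMARY_LENGTH characters of the text.
    // Assumption: shorter text is returned unchanged.
    static String summarize(String contents) {
        int end = Math.min(contents.length(), SUMMARY_LENGTH);
        return contents.substring(0, end);
    }

    public static void main(String[] args) {
        // Build a 600-character body to exercise the truncation path.
        StringBuilder longText = new StringBuilder();
        for (int i = 0; i < 600; i++) {
            longText.append('x');
        }
        System.out.println(summarize("A short PDF body.").length());
        System.out.println(summarize(longText.toString()).length());
    }
}
```

Using Math.min for the end index avoids a StringIndexOutOfBoundsException when the extracted text is shorter than the summary length.
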
| Modifier and Type | Method and Description |
|---|---|
| Document | convertDocument(File file): Takes a reference to a PDF document and creates a Lucene document. |
| Document | convertDocument(InputStream is): Converts the PDF stream to a Lucene document. |
| Document | convertDocument(URL url): Converts the document from a PDF to a Lucene document. |
| DateTools.Resolution | getDateTimeResolution(): Gets the Lucene date/time resolution. |
| static Document | getDocument(File file): Gets a Lucene document from a PDF file. |
| static Document | getDocument(InputStream is): Gets a Lucene document from a PDF file. |
| static Document | getDocument(URL url): Gets a Lucene document from a PDF file. |
| static void | main(String[] args): Tests creating a document. |
| void | setDateTimeResolution(DateTools.Resolution resolution): Sets the Lucene date/time resolution. |
| void | setTextStripper(PDFTextStripper aStripper): Sets the text stripper that will be used during extraction. |
LucenePDFDocument
public LucenePDFDocument()
Constructor.
convertDocument
public Document convertDocument(File file)
throws IOException
Takes a reference to a PDF document and creates a Lucene document.
Parameters:
file - A reference to a PDF document.
Returns:
The converted Lucene document.
convertDocument
public Document convertDocument(InputStream is)
throws IOException
Converts the PDF stream to a Lucene document.
Parameters:
is - The input stream to read the PDF from.
Returns:
The input stream converted to a Lucene document.
convertDocument
public Document convertDocument(URL url)
throws IOException
Converts the document from a PDF to a Lucene document.
Parameters:
url - A URL to a PDF document.
Returns:
The PDF converted to a Lucene document.
getDateTimeResolution
public DateTools.Resolution getDateTimeResolution()
Gets the Lucene date/time resolution.
Returns:
The current date/time resolution.
getDocument
public static Document getDocument(File file)
throws IOException
Gets a Lucene document from a PDF file.
Parameters:
file - The file to get the document for.
getDocument
public static Document getDocument(InputStream is)
throws IOException
Gets a Lucene document from a PDF file.
Parameters:
is - The stream to read the PDF from.
getDocument
public static Document getDocument(URL url)
throws IOException
Gets a Lucene document from a PDF file.
Parameters:
url - The URL of the PDF to get the document for.
main
public static void main(String[] args)
throws IOException
Tests creating a document.
Usage: java org.pdfbox.searchengine.lucene.LucenePDFDocument <pdf-document>
Parameters:
args - The command-line arguments.
setDateTimeResolution
public void setDateTimeResolution(DateTools.Resolution resolution)
Sets the Lucene date/time resolution.
Parameters:
resolution - The new date/time resolution.
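
To illustrate what a date/time resolution means for an indexed value, here is a minimal standard-library sketch. Lucene's DateTools encodes dates as GMT strings truncated to the chosen resolution, for example yyyyMMdd for DAY and yyyyMMddHHmmss for SECOND. The ResolutionSketch class, its nested Resolution enum, and toIndexString are hypothetical stand-ins written for this example, not part of Lucene or PDFBox, and how LucenePDFDocument applies the resolution to its date fields is an assumption here.

```java
import java.time.Instant;
import java.time.ZoneOffset;
import java.time.format.DateTimeFormatter;

// Illustrative sketch only; mimics the string shapes of Lucene's DateTools.
final class ResolutionSketch {
    // Hypothetical stand-in for DateTools.Resolution; only three levels shown.
    enum Resolution {
        DAY("yyyyMMdd"),
        MINUTE("yyyyMMddHHmm"),
        SECOND("yyyyMMddHHmmss");

        final DateTimeFormatter formatter;

        Resolution(String pattern) {
            // Format in UTC so the result does not depend on the local zone.
            this.formatter = DateTimeFormatter.ofPattern(pattern).withZone(ZoneOffset.UTC);
        }
    }

    // Formats a timestamp, dropping every unit finer than the resolution.
    static String toIndexString(long epochMillis, Resolution resolution) {
        return resolution.formatter.format(Instant.ofEpochMilli(epochMillis));
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis();
        for (Resolution r : Resolution.values()) {
            System.out.println(r + " -> " + toIndexString(now, r));
        }
    }
}
```

A coarser resolution produces fewer distinct terms in the index, at the cost of less precise date comparisons.
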
setTextStripper
public void setTextStripper(PDFTextStripper aStripper)
Sets the text stripper that will be used during extraction.
Parameters:
aStripper - The new PDF text stripper.