java.lang.Object
  org.faceless.pdf2.PageExtractor
public class PageExtractor
This class enables the extraction of text and images from a PDFPage. You can get one by calling the PDFParser.getPageExtractor(int) method, assuming the PDF has the rights to let you extract text and/or images.
Once you've got one, you can extract the text of the page as a StringBuffer by calling getTextAsStringBuffer(). Note that extracting text from PDFs is not an exact science: the internals of a PDF allow text to be displayed in any order, and features like superscript, subscript, rotated text and so on, which are easy to display in PDF, can only be approximated in plain text.
Features like tables have to be determined using heuristics, and some PDFs are encoded in a way that makes extracting their text almost impossible (storing each letter as an image, for example).
Depending on how the font has been stored, the library may replace unknown characters with a Unicode character in the private range (U+EF00 - U+EFFF). These replacements will be consistent, so if you find that U+EF01 is in fact the letter 'A', you can easily run a String.replace() on the string to correct the letters.
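The correction described above can be sketched in plain Java. This is only an illustration: the class name PrivateUseFixer, the mapping of U+EF01 to 'A' and U+EF02 to 'D', and the sample string are hypothetical, and in practice the mapping has to be worked out by inspecting the extracted text of each document.

```java
import java.util.HashMap;
import java.util.Map;

final class PrivateUseFixer {
    /** Replace each private-use character found in the mapping with its real letter. */
    static String fix(CharSequence extracted, Map<Character, Character> mapping) {
        StringBuilder out = new StringBuilder(extracted.length());
        for (int i = 0; i < extracted.length(); i++) {
            char c = extracted.charAt(i);
            Character real = mapping.get(c);
            // Characters without a known replacement are copied through unchanged
            out.append(real != null ? real.charValue() : c);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // Stand-in for the StringBuffer returned by getTextAsStringBuffer()
        StringBuffer extracted = new StringBuffer("\uEF01 P\uEF02F");
        Map<Character, Character> mapping = new HashMap<>();
        mapping.put('\uEF01', 'A'); // hypothetical: found by inspecting the output
        mapping.put('\uEF02', 'D'); // hypothetical
        System.out.println(fix(extracted, mapping));
    }
}
```

A single pass over the text with a lookup table avoids calling String.replace() once per character when several replacements are needed.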
Extracting bitmap images is a much simpler process. The PageExtractor.Image class represents an image on the current page. There is one instance for each time an image is drawn, although if an image is repeated, each instance may contain the same RenderedImage. You can retrieve the list of images by calling the getImages() method.
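Because a repeated image may yield several instances wrapping the same RenderedImage, it can be useful to reduce the drawn images to the distinct ones before saving them. The following is a minimal sketch using only the standard library; it assumes "the same" means the same object instance, and the class name DistinctImages is hypothetical. The input list stands in for the RenderedImage values collected from each PageExtractor.Image.

```java
import java.awt.image.BufferedImage;
import java.awt.image.RenderedImage;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.IdentityHashMap;
import java.util.List;
import java.util.Set;

final class DistinctImages {
    /** Keep the first occurrence of each RenderedImage instance, compared by identity. */
    static List<RenderedImage> distinct(List<? extends RenderedImage> drawn) {
        // An identity-based set: two images count as equal only if they are the same object
        Set<RenderedImage> seen = Collections.newSetFromMap(new IdentityHashMap<>());
        List<RenderedImage> unique = new ArrayList<>();
        for (RenderedImage image : drawn) {
            if (seen.add(image)) {
                unique.add(image);
            }
        }
        return unique;
    }

    public static void main(String[] args) {
        RenderedImage logo = new BufferedImage(1, 1, BufferedImage.TYPE_INT_RGB);
        RenderedImage photo = new BufferedImage(1, 1, BufferedImage.TYPE_INT_RGB);
        // The logo is drawn twice on the page, so it appears twice in the list
        List<RenderedImage> drawn = Arrays.asList(logo, photo, logo);
        System.out.println(distinct(drawn).size() + " distinct image(s)");
    }
}
```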
This class requires the Extended Edition plus Viewer license to operate. Although it may be freely used in the trial version of the library, the extracted text will have the letter 'e' replaced with the letter 'a'.
Nested Class Summary | |
---|---|
class | PageExtractor.Image - A class representing a bitmap image which is extracted from the PageExtractor. |
class | PageExtractor.Text - A class representing a piece of text which is extracted from the PageExtractor. |
Method Summary | |
---|---|
static Collection | cropText(Collection all, Shape shape) - Given a Collection of PageExtractor.Text items, as returned by getMatchingText(), getTextUnordered() or getTextInDisplayOrder(), return a new Collection which contains only Text that falls completely inside the specified Shape. |
Collection | getImages() - Return every PageExtractor.Image on the page, in the order they were added to the page. |
Collection | getMatchingText(Pattern pattern) - Return a Collection of PageExtractor.Text items on this page that match the specified Regular Expression. |
Collection | getMatchingText(String query) - Return a Collection of PageExtractor.Text items on this page that are equal to the specified substring. |
Collection | getMatchingText(String[] queries) - Return a Collection of PageExtractor.Text items on this page that are equal to one of the specified substrings. |
Collection | getMatchingText(String[] queries, boolean caseinsensitive) - Return a Collection of PageExtractor.Text items on this page that are equal to one of the specified substrings. |
PDFPage | getPage() - Return the PDFPage this PageExtractor relates to. |
AttributedString | getStyledText(PageExtractor.Text first, int firstchar, PageExtractor.Text last, int lastchar, boolean displayorder) - Return an AttributedString containing a contiguous range of text from this PageExtractor. |
StringBuffer | getText(PageExtractor.Text first, int firstchar, PageExtractor.Text last, int lastchar, boolean displayorder) - Return a StringBuffer containing a contiguous range of text from this PageExtractor. |
StringBuffer | getTextAsStringBuffer() - Parse and return all the text on the page as a StringBuffer. |
StringBuffer | getTextAsStringBuffer(float x1, float y1, float x2, float y2) - Parse and return the text in the specified area on the page as a StringBuffer. |
Collection | getTextInDisplayOrder() - Return every PageExtractor.Text item on the page, in the order they are displayed on the screen, so the first item in the returned collection will be nearest to the top left of the page. |
Collection | getTextUnordered() - Return every PageExtractor.Text item on the page, in the order they were added to the page. |
boolean | isExtracted() - Return true if the extraction has been run, false otherwise. |
void | setOption(String key, String value) - Set an option to control text extraction. |
void | setSpaceTolerance(double zero, double one, double many) - Set the "space tolerance": tunable parameters for the extractor to determine when two adjacent phrases of text are to be separated by zero, one or more than one space. |
Methods inherited from class java.lang.Object |
---|
equals, getClass, hashCode, notify, notifyAll, toString, wait, wait, wait |
Method Detail |
---|
public boolean isExtracted()
Return true if the extraction has been run, false otherwise.
public void setOption(String key, String value)
Set an option to control text extraction. Options include:
Key | Value | Description |
---|---|---|
IgnoreArtifacts | all, or a regex | Set to all to ignore any text flagged as an "artifact" in the PDF stream, or set to a regular expression to match against artifacts to be ignored. |
RawText | true or false | Set to true to prevent any post-processing of the extracted text. If this option is set, getTextInDisplayOrder, getTextAsStringBuffer and similar methods cannot be used; only getTextUnordered() will work. |
public void setSpaceTolerance(double zero, double one, double many)
Set the "space tolerance" - tunable parameters for the extractor to determine when two adjacent phrases of text are to be separated by zero, one or more than one space. Typically this won't need to be tuned by the end user, but if you find the spacing between the extracted text is less than ideal, you can tune it to some degree with this method. The perfect value will depend on the font, the language, layout, line justification and kerning.
The values are multipliers of the width of the "space" character, so 1 means "the width of one space". Typically the parameter to tune is "one", within a rough range of 0.25 to 0.7: reduce it if you find words are being joined together, and increase it if words are being split into two.
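To make the role of the three thresholds concrete, here is one plausible reading of them as a small classifier. It is a sketch only, not the library's actual algorithm: the class SpaceToleranceSketch, the Join enum, and the treatment of gaps smaller than "zero" as separate text are all assumptions. The values -0.5, 0.666 and 1.5 are the defaults documented for this method.

```java
final class SpaceToleranceSketch {
    enum Join { NO_SPACE, ONE_SPACE, SEPARATE }

    /**
     * Classify the gap between two adjacent characters.
     * The gap is expressed as a multiple of the width of the space character.
     */
    static Join classify(double gap, double zero, double one, double many) {
        if (gap < zero) {
            return Join.SEPARATE;  // assumption: a large overlap means unrelated text
        }
        if (gap < one) {
            return Join.NO_SPACE;  // close enough to be part of the same word
        }
        if (gap < many) {
            return Join.ONE_SPACE; // a normal word break
        }
        return Join.SEPARATE;      // too far apart to be the same piece of text
    }

    public static void main(String[] args) {
        double gap = 0.5; // half the width of a space
        // With the documented default for "one", the two characters are joined
        System.out.println(classify(gap, -0.5, 0.666, 1.5));
        // Reducing "one" turns the same gap into a word break
        System.out.println(classify(gap, -0.5, 0.3, 1.5));
    }
}
```

Under this reading, lowering "one" makes the extractor more willing to insert a space, which matches the advice to reduce it when words are being joined together.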
Parameters:
zero - how far apart two characters should be in order for them to be joined by zero spaces. The default value is -0.5.
one - at least how far apart two characters must be in order to be joined by one space. The default value is 0.666.
many - at least how far apart two characters must be in order to be considered separate pieces of text. The default value is 1.5.
public Collection getImages()
Return every PageExtractor.Image on the page, in the order they were added to the page. Some images may be displayed more than once, in which case the value returned by PageExtractor.Image.getImage() will be identical.
Returns: a Collection of PageExtractor.Image elements.
public Collection getTextUnordered()
Return every PageExtractor.Text item on the page, in the order they were added to the page. The ordering may not be consistent with the order in which items are positioned on screen.
Returns: a Collection of PageExtractor.Text elements.
public Collection getTextInDisplayOrder()
Return every PageExtractor.Text item on the page, in the order they are displayed on the screen, so the first item in the returned collection will be nearest to the top left of the page.
Returns: a Collection of PageExtractor.Text elements.
public Collection getMatchingText(String query)
Return a Collection of PageExtractor.Text items on this page that are equal to the specified substring. The Text items returned from getTextInDisplayOrder() are searched, and possibly substrings extracted from them, to create this collection. In this case the co-ordinates of the returned Text items will reflect the substring, not the original Text object.
As an example, the following method could be used to search a PDF for a specified word and add a "highlight" annotation over it. The PDF can then be rendered or saved as normal.
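    import java.util.Collection;
    import java.util.Iterator;
    import org.faceless.pdf2.*;

    // Highlight every occurrence of "word" on every page of the PDF.
    void highlightWords(PDF pdf, String word) {
        PDFParser parser = new PDFParser(pdf);
        for (int i = 0; i < pdf.getNumberOfPages(); i++) {
            PageExtractor extractor = parser.getPageExtractor(i);
            Collection co = extractor.getMatchingText(word);
            for (Iterator j = co.iterator(); j.hasNext();) {
                PageExtractor.Text text = (PageExtractor.Text) j.next();
                // Create a "Highlight" markup annotation covering the matched text
                AnnotationMarkup annot = text.createAnnotationMarkup("Highlight");
                text.getPage().getAnnotations().add(annot);
            }
        }
    }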
Parameters:
query - the String to search for
Returns: a Collection of PageExtractor.Text objects.
public Collection getMatchingText(String[] queries)
Return a Collection of PageExtractor.Text items on this page that are equal to one of the specified substrings. This method runs exactly like getMatchingText(String) but allows more than one substring to be matched.
Parameters:
queries - a list of zero or more Strings to search for
Returns: a Collection of PageExtractor.Text objects.
public Collection getMatchingText(String[] queries, boolean caseinsensitive)
Return a Collection of PageExtractor.Text items on this page that are equal to one of the specified substrings. This method runs exactly like getMatchingText(String) but allows more than one substring to be matched, optionally ignoring case.
Parameters:
queries - a list of zero or more Strings to search for
caseinsensitive - whether the search should be performed without regard to case
Returns: a Collection of PageExtractor.Text objects.
public Collection getMatchingText(Pattern pattern)
Return a Collection of PageExtractor.Text items on this page that match the specified Regular Expression. This is likely to be more efficient than the version of this method that takes multiple strings.
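If you already have a list of literal search strings, one way to take advantage of the Pattern version is to combine them into a single alternation. The sketch below uses only java.util.regex; the class QueryPattern and the method anyOf are hypothetical helpers, not part of this library, and whether the resulting matches are identical to those of the String[] version is an assumption.

```java
import java.util.regex.Pattern;

final class QueryPattern {
    /** Build one Pattern that matches any of the given literal strings. */
    static Pattern anyOf(String[] queries, boolean caseInsensitive) {
        if (queries.length == 0) {
            // An empty alternation would match the empty string everywhere
            throw new IllegalArgumentException("at least one query is required");
        }
        StringBuilder regex = new StringBuilder();
        for (String query : queries) {
            if (regex.length() > 0) {
                regex.append('|');
            }
            // Pattern.quote stops characters like '.' or '(' being treated as regex syntax
            regex.append(Pattern.quote(query));
        }
        int flags = caseInsensitive ? (Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE) : 0;
        return Pattern.compile(regex.toString(), flags);
    }

    public static void main(String[] args) {
        Pattern pattern = anyOf(new String[] { "invoice", "total (EUR)" }, true);
        System.out.println(pattern.matcher("Invoice 42").find());
    }
}
```

The resulting Pattern could then be passed to getMatchingText(Pattern).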
Parameters:
pattern - the Pattern to search for
Returns: a Collection of PageExtractor.Text objects.
public StringBuffer getTextAsStringBuffer()
Parse and return all the text on the page as a StringBuffer.
public StringBuffer getTextAsStringBuffer(float x1, float y1, float x2, float y2)
Parse and return the text in the specified area on the page as a StringBuffer.
Parameters:
x1 - the left-most X co-ordinate of the text
y1 - the top-most Y co-ordinate of the text
x2 - the right-most X co-ordinate of the text
y2 - the bottom-most Y co-ordinate of the text
public StringBuffer getText(PageExtractor.Text first, int firstchar, PageExtractor.Text last, int lastchar, boolean displayorder)
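Return a StringBuffer containing a contiguous range of text from this PageExtractor. The range is specified by the first and last PageExtractor.Text object, and the offsets into those strings. This method is chiefly intended for use with a GUI that allows a range of text to be selected.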
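The offset convention (an inclusive start index in the first item, an exclusive end index in the last) can be illustrated with plain strings. This is a sketch of the indexing only: the class TextRangeSketch is hypothetical, list positions stand in for the first and last Text objects, and joining items with a single space is an assumption, since the library decides the spacing itself.

```java
import java.util.Arrays;
import java.util.List;

final class TextRangeSketch {
    /**
     * Extract from character firstchar of items[first] up to, but not including,
     * character lastchar of items[last].
     */
    static String range(List<String> items, int first, int firstchar, int last, int lastchar) {
        if (first == last) {
            return items.get(first).substring(firstchar, lastchar);
        }
        // Tail of the first item
        StringBuilder out = new StringBuilder(items.get(first).substring(firstchar));
        // Every item strictly between first and last is taken whole
        for (int i = first + 1; i < last; i++) {
            out.append(' ').append(items.get(i));
        }
        // Head of the last item; lastchar is exclusive
        out.append(' ').append(items.get(last), 0, lastchar);
        return out.toString();
    }

    public static void main(String[] args) {
        List<String> items = Arrays.asList("Hello", "big", "World");
        System.out.println(range(items, 0, 3, 2, 2));
    }
}
```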
Parameters:
first - the first Text from this PageExtractor to be extracted
firstchar - the index of the first character from "first" to be extracted
last - the last Text from this PageExtractor to be extracted
lastchar - the index after the index of the last character from "last" to be extracted
displayorder - if true, the iteration from first to last will go in display order, right to left and top to bottom. If false, the iteration will run in the order the text items exist in the document.
public AttributedString getStyledText(PageExtractor.Text first, int firstchar, PageExtractor.Text last, int lastchar, boolean displayorder)
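Return an AttributedString containing a contiguous range of text from this PageExtractor. The range is specified by the first and last PageExtractor.Text object, and the offsets into those strings. This method is chiefly intended for use with a GUI that allows a range of text to be selected.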
Parameters:
first - the first Text from this PageExtractor to be extracted
firstchar - the index of the first character from "first" to be extracted
last - the last Text from this PageExtractor to be extracted
lastchar - the index after the index of the last character from "last" to be extracted
displayorder - if true, the iteration from first to last will go in display order, right to left and top to bottom. If false, the iteration will run in the order the text items exist in the document.
public PDFPage getPage()
Return the PDFPage this PageExtractor relates to.
public static Collection cropText(Collection all, Shape shape)
Given a Collection of PageExtractor.Text items, as returned by getMatchingText(), getTextUnordered() or getTextInDisplayOrder(), return a new Collection which contains only Text that falls completely inside the specified Shape.
For example, to get all the text in a specific rectangle:
    Shape rect = new Rectangle2D.Float(0, 0, 100, 100);
    // cropText is a static method, so it is called on the class
    Collection all = PageExtractor.cropText(extractor.getTextUnordered(), rect);
Parameters:
all - a Collection of Text objects
shape - the Shape to trim the text to
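The "falls completely inside" rule can be illustrated with java.awt.geom alone. In this sketch each piece of text is represented by a bounding Rectangle2D, which is an assumption about how the test might work rather than a description of the library's implementation; the class CropSketch is hypothetical.

```java
import java.awt.Shape;
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class CropSketch {
    /** Keep only the boxes that lie entirely within the shape. */
    static List<Rectangle2D> inside(List<? extends Rectangle2D> boxes, Shape shape) {
        List<Rectangle2D> kept = new ArrayList<>();
        for (Rectangle2D box : boxes) {
            // Shape.contains(Rectangle2D) is true only if the whole rectangle is inside
            if (shape.contains(box)) {
                kept.add(box);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Shape crop = new Rectangle2D.Float(0, 0, 100, 100);
        List<Rectangle2D> boxes = Arrays.<Rectangle2D>asList(
            new Rectangle2D.Float(10, 10, 20, 20),  // fully inside
            new Rectangle2D.Float(90, 90, 20, 20),  // straddles the edge
            new Rectangle2D.Float(200, 200, 5, 5)); // fully outside
        System.out.println(inside(boxes, crop).size() + " box(es) kept");
    }
}
```

Text that only partly overlaps the shape is dropped under this rule, so a crop rectangle usually needs a small margin around the area of interest.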