pdftron::PDF::TextExtractor Class Reference

TextExtractor is used to analyze a PDF page and extract words and logical structure within a given region.

#include <TextExtractor.h>

Classes

class  Line
 TextExtractor::Line object represents a line of text on a PDF page.
class  Style
 A class representing predominant text style associated with a given Line, a Word, or a Glyph.
class  Word
 TextExtractor::Word object represents a word on a PDF page.

Public Types

enum  ProcessingFlags {
  e_no_ligature_exp = 1, e_no_dup_remove = 2, e_punct_break = 4, e_remove_hidden_text = 8,
  e_no_invisible_text = 16
}
 Processing options that can be passed to the Begin() method to direct the flow of content recognition algorithms.
enum  XMLOutputFlags { e_words_as_elements = 1, e_output_bbox = 2, e_output_style_info = 4 }
 Flags controlling the structure of XML output in a call to GetAsXML().

Public Member Functions

 TextExtractor ()
 Constructor and destructor.
 ~TextExtractor ()
void Begin (Page page, const Rect *clip_ptr=0, UInt32 flags=0)
 Start reading the page.
int GetWordCount ()
void GetAsText (UString &out_str, bool dehyphen=true)
 Get all words in the current selection as a single string.
void GetAsXML (UString &out_xml, UInt32 xml_output_flags=0)
 Get text content in the form of an XML string.
int GetNumLines ()
Line GetFirstLine ()


Detailed Description

TextExtractor is used to analyze a PDF page and extract words and logical structure within a given region.

The resulting list of lines and words can be traversed element by element or accessed as a string buffer. The class also includes utility methods to extract PDF text as HTML or XML.

Possible use case scenarios for TextExtractor include:

The main task of TextExtractor is to interpret PDF pages and offer a simple to use API to:

Note: TextExtractor analyzes only the textual content of the page. Rasterized text (e.g. in scanned pages) and vectorized text (where glyphs are converted to path outlines) will not be recognized as text. It is still possible to extract this content using the pdftron.PDF.ElementReader interface.

In some cases TextExtractor may extract text that does not appear to be on the visible page (e.g. when text is obscured by an image or a rectangle). In these situations it is possible to use processing flags such as 'e_remove_hidden_text' and 'e_no_invisible_text' to remove hidden text.

A sample use case (in C++):

 // ... Initialize PDFNet ...
 PDFDoc doc(filein);
 doc.InitSecurityHandler();
 Page page = *doc.PageBegin();
 TextExtractor txt;
 txt.Begin(page, 0, TextExtractor::e_remove_hidden_text);
 UString text;
 txt.GetAsText(text);
 // or traverse words one by one...
 TextExtractor::Line line = txt.GetFirstLine(), lend;
 TextExtractor::Word word, wend;
 for (; line!=lend; line=line.GetNextLine()) {
  for (word=line.GetFirstWord(); word!=wend; word=word.GetNextWord()) {
    text.Assign(word.GetString(), word.GetStringLen());
    cout << text << '\n';
  }
 }

A sample use case (in C#):

 // ... Initialize PDFNet ...
 PDFDoc doc = new PDFDoc(filein);
 doc.InitSecurityHandler();
 Page page = doc.PageBegin().Current();
 TextExtractor txt = new TextExtractor();
 txt.Begin(page, 0, TextExtractor.ProcessingFlags.e_remove_hidden_text);
 string text = txt.GetAsText();
 // or traverse words one by one...
 TextExtractor.Word word;
 for (TextExtractor.Line line = txt.GetFirstLine(); line.IsValid(); line=line.GetNextLine()) {
   for (word=line.GetFirstWord(); word.IsValid(); word=word.GetNextWord()) {
     Console.WriteLine(word.GetString());
   }
 }

For full sample code, see the TextExtract sample project.


Member Enumeration Documentation

enum pdftron::PDF::TextExtractor::ProcessingFlags

Processing options that can be passed to the Begin() method to direct the flow of content recognition algorithms.

Enumerator:
e_no_ligature_exp 
e_no_dup_remove 
e_punct_break 
e_remove_hidden_text 
e_no_invisible_text 
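
The enumerator values given in the class summary are distinct powers of two, so several options can be combined with bitwise OR into the single `flags` argument of Begin(). The following is a minimal sketch of that idea. It assumes nothing about PDFNet internals: the enum is mirrored locally so the snippet compiles on its own, and the helper names `HiddenTextFlags` and `HasFlag` are hypothetical, not part of the library.

```cpp
#include <cstdint>

// Local mirror of TextExtractor::ProcessingFlags (values copied from the
// enum summary on this page) so that this sketch compiles without PDFNet.
enum ProcessingFlags : std::uint32_t {
    e_no_ligature_exp    = 1,
    e_no_dup_remove      = 2,
    e_punct_break        = 4,
    e_remove_hidden_text = 8,
    e_no_invisible_text  = 16
};

// Hypothetical helper: combines the two flags that the class description
// suggests for dropping text that is not visible on the page.
inline std::uint32_t HiddenTextFlags() {
    return e_remove_hidden_text | e_no_invisible_text;
}

// Hypothetical helper: true if 'flag' is set in 'mask'.
inline bool HasFlag(std::uint32_t mask, ProcessingFlags flag) {
    return (mask & flag) != 0;
}
```

With PDFNet available, the same kind of mask would presumably be passed directly, for example `txt.Begin(page, 0, TextExtractor::e_remove_hidden_text | TextExtractor::e_no_invisible_text);`.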

enum pdftron::PDF::TextExtractor::XMLOutputFlags

Flags controlling the structure of XML output in a call to GetAsXML().

Enumerator:
e_words_as_elements 
e_output_bbox 
e_output_style_info 


Constructor & Destructor Documentation

pdftron::PDF::TextExtractor::TextExtractor (  ) 

Constructor and destructor.

pdftron::PDF::TextExtractor::~TextExtractor (  ) 


Member Function Documentation

void pdftron::PDF::TextExtractor::Begin ( Page  page,
const Rect *  clip_ptr = 0,
UInt32  flags = 0 
)

Start reading the page.

Parameters:
page Page to read.
clip_ptr A pointer to the optional clipping rectangle. This parameter can be used to selectively read text from a given rectangle.
flags A bitwise OR combination of ProcessingFlags used to control the text extraction algorithm.

int pdftron::PDF::TextExtractor::GetWordCount (  ) 

Returns:
The number of words on the page.

void pdftron::PDF::TextExtractor::GetAsText ( UString &  out_str,
bool  dehyphen = true 
)

Get all words in the current selection as a single string.

Parameters:
out_str The string containing all words in the current selection. Words will be separated with space (i.e. ' ') or new line (i.e. '\n') characters.
dehyphen If true, finds and removes hyphens that split words across two lines. Hyphens are often used at the end of lines as an indicator that a word spans two lines. Hyphen detection enables removal of the hyphen character and merging of text runs to form a single word. This option has no effect on Tagged PDF files.
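
To make the effect of `dehyphen` concrete, here is a small illustrative sketch of joining lines with and without hyphen merging. It is not PDFNet's hyphen detection, whose details are not described in this reference. The function name `JoinLines` is hypothetical, and the sketch simply treats any trailing '-' as a line-break hyphen.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Illustrative only: joins lines into one string. When 'dehyphen' is true,
// a line ending in '-' is merged with the next line (hyphen removed, no
// separator). Otherwise lines are separated with '\n'.
std::string JoinLines(const std::vector<std::string>& lines, bool dehyphen) {
    std::string out;
    for (std::size_t i = 0; i < lines.size(); ++i) {
        const std::string& line = lines[i];
        const bool has_next = i + 1 < lines.size();
        if (dehyphen && has_next && !line.empty() && line.back() == '-') {
            // Drop the hyphen and add no separator so the two halves merge.
            out.append(line, 0, line.size() - 1);
        } else {
            out += line;
            if (has_next) out += '\n';
        }
    }
    return out;
}
```

For the input lines "text extrac-" and "tion works", the sketch yields "text extraction works" with `dehyphen` set to true and keeps the hyphen and line break otherwise.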

void pdftron::PDF::TextExtractor::GetAsXML ( UString &  out_xml,
UInt32  xml_output_flags = 0 
)

Get text content in the form of an XML string.

Parameters:
out_xml - The string containing XML output.
xml_output_flags - Flags controlling XML output. For more information, see TextExtractor::XMLOutputFlags.
XML output will be encoded in UTF-8 and will have the following structure:
 <Page num="1" crop_box="0, 0, 612, 792" media_box="0, 0, 612, 792" rotate="0">
  <Flow id="1">
   <Para id="1">
    <Line box="72, 708.075, 467.895, 10.02" style="font-family:Calibri; font-size:10.02; color: #000000;">
      <Word box="72, 708.075, 30.7614, 10.02">PDFNet</Word>
      <Word box="106.188, 708.075, 15.9318, 10.02">SDK</Word>
      <Word box="125.617, 708.075, 6.22242, 10.02">is</Word>
      ...
    </Line>
   </Para>     
  </Flow>
 </Page>                 

The above XML output was generated by passing the following union of flags in the call to GetAsXML(): (TextExtractor::e_words_as_elements | TextExtractor::e_output_bbox | TextExtractor::e_output_style_info)

If 'xml_output_flags' is not specified, the default XML output looks as follows:

 <Page num="1" crop_box="0, 0, 612, 792" media_box="0, 0, 612, 792" rotate="0">
  <Flow id="1">
   <Line>PDFNet SDK is an amazingly comprehensive, high-quality PDF developer toolkit...</Line>
   <Line>levels. Using the PDFNet PDF library, ...</Line>
   ...
  </Flow>
 </Page>
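
When e_output_bbox is set, each `box` attribute in the XML holds four comma-separated numbers, as in the sample output earlier in this section. A consumer of the XML might read such a value as sketched below. The function name `ParseBox` is hypothetical, and the sketch deliberately does not interpret the four fields, since their meaning is not defined in this reference.

```cpp
#include <array>
#include <cstddef>
#include <sstream>
#include <string>

// Parses a box attribute value such as "72, 708.075, 30.7614, 10.02" into
// four numbers. Returns false if fewer than four numbers can be read.
bool ParseBox(const std::string& attr, std::array<double, 4>& out) {
    std::istringstream in(attr);
    for (std::size_t i = 0; i < out.size(); ++i) {
        if (i > 0) {
            // Skip optional whitespace and the comma between fields.
            in >> std::ws;
            if (in.peek() == ',') in.get();
        }
        if (!(in >> out[i])) return false;
    }
    return true;
}
```

The attribute string itself would come from whatever XML parser the application already uses; only the numeric parsing is shown here.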

int pdftron::PDF::TextExtractor::GetNumLines (  ) 

Returns:
The number of lines of text on the selected page.

Line pdftron::PDF::TextExtractor::GetFirstLine (  ) 

Returns:
The first line of text on the selected page.
Note:
To traverse the list of all text lines on the page use line.GetNextLine().

To traverse the list of all words on a given line use line.GetFirstWord().


© 2002-2010 PDFTron Systems Inc.