pdftron::PDF::TextSearch Class Reference

TextSearch searches through a PDF document for a user-given search pattern.

#include <TextSearch.h>

Public Types

enum  {
  e_reg_expression = 0x0001, e_case_sensitive = e_reg_expression << 1, e_whole_word = e_case_sensitive << 1, e_search_up = e_whole_word << 1,
  e_page_stop = e_search_up << 1, e_highlight = e_page_stop << 1, e_ambient_string = e_highlight << 1
}
 Search modes that control how searching is conducted.
enum  ResultCode { e_done = 0, e_page = 1, e_found = 2 }
 The code indicating the reason that the search process returns.
typedef TRN_UInt32 Mode
 Typedef the search mode.

Public Member Functions

 TextSearch ()
 Constructor and destructor.
 ~TextSearch ()
bool Begin (PDFDoc &doc, const UString &pattern, Mode mode, int start_page=-1, int end_page=-1)
 Initialize the search process.
ResultCode Run (int &page_num, UString &result_str, UString &ambient_str, Highlights &hlts)
 Search the document and return when one of the following occurs: a) the end of the document is reached; b) the end of a page is reached (only if mode 'e_page_stop' is set); c) an instance matching the search pattern is found.
bool SetPattern (const UString &pattern)
 Set the current search pattern.
Mode GetMode () const
 Retrieve the current search mode.
void SetMode (Mode mode)
 Set the current search mode.
int GetCurrentPage () const
 Retrieve the number of the page currently being searched.


Detailed Description

TextSearch searches through a PDF document for a user-given search pattern.

The current implementation supports both verbatim search and search using regular expressions. The regular expression syntax is described at:

http://www.boost.org/doc/libs/1_42_0/libs/regex/doc/html/boost_regex/syntax/perl_syntax.html

TextSearch also provides several search modes and, besides the found string that matches the pattern, extra information about each match. TextSearch can either keep running until a matched string is found or be set to return periodically so that the caller can perform any necessary updates (e.g., UI updates). The search modes can also be changed on the fly while searching through a document.

Possible use case scenarios for TextSearch include:

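Note:

Hyphens are often used in PDF documents to join the two pieces of a word that is broken at the end of a line. For example, given the text: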

"TextSearch is powerful for finding patterns in PDF files; yes, it is really pow- erful."

a search for "powerful" should return both instances. However, not all end-of-line hyphens are hyphens added to connect a broken word; some of them could be "real" hyphens. In addition, an input search pattern may also contain hyphens that complicate the situation. To tackle this problem, the following conventions are adopted:

a) In verbatim search mode, when the pattern contains no hyphen, a matching string is returned if it is exactly the same as the pattern or differs from it only by end-of-line or start-of-line hyphens. For example, as mentioned above, a search for "powerful" returns both instances.

b) In verbatim search mode, when the pattern contains one or more hyphens, a matching string is returned only if it matches the pattern exactly. For example, a search for "pow-erful" returns only the second instance, and a search for "power-ful" returns nothing.

c) When searching with regular expressions, hyphens are not handled implicitly; users must account for them in the pattern. For example, to find both "powerful" instances, the input pattern can be "pow-{0,1}erful".
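The three conventions can be illustrated with a small stand-alone sketch. It does not call PDFNet: VerbatimMatches is a hypothetical model of conventions a) and b) that, as a simplification, treats every hyphen in the candidate text as a line-break hyphen, and CountRegexMatches uses std::regex (ECMAScript grammar) as a stand-in for the Boost Perl syntax referenced above. The quantifier in "pow-{0,1}erful" means the same thing in both grammars.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <regex>
#include <string>

// Hypothetical model of conventions a) and b); this is not PDFNet code.
// `candidate` is a word as it appears in the document, possibly split by a hyphen.
bool VerbatimMatches(const std::string& candidate, const std::string& pattern) {
    if (pattern.find('-') != std::string::npos) {
        // b) The pattern contains a hyphen: only an exact match counts.
        return candidate == pattern;
    }
    // a) No hyphen in the pattern: ignore hyphens in the candidate.
    // Simplification: every hyphen is assumed to be a line-break hyphen.
    std::string joined;
    for (char c : candidate) {
        if (c != '-') joined.push_back(c);
    }
    return joined == pattern;
}

// c) Counts non-overlapping regex matches; the pattern must handle hyphens itself.
std::size_t CountRegexMatches(const std::string& text, const std::string& pattern) {
    assert(!pattern.empty());  // an empty pattern would match at every position
    const std::regex re(pattern);
    const std::sregex_iterator first(text.begin(), text.end(), re);
    const std::sregex_iterator last;
    return static_cast<std::size_t>(std::distance(first, last));
}
```

With a text that contains both "powerful" and "pow-erful", CountRegexMatches(text, "pow-{0,1}erful") counts both spellings, whereas the plain pattern "powerful" counts only the unhyphenated one.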

A sample use case (in C++):

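 // ... Initialize PDFNet ...
 PDFDoc doc(filein);
 doc.InitSecurityHandler();
 int page_num;
 char buf[32];
 UString result_str, ambient_string;
 Highlights hlts;
 TextSearch txt_search;
 TextSearch::Mode mode = TextSearch::e_whole_word | TextSearch::e_page_stop;
 UString pattern( "joHn sMiTh" );
 // Begin() must be called before the first call to Run().
 txt_search.Begin( doc, pattern, mode );
 while ( true )
 {
           TextSearch::ResultCode code = txt_search.Run(page_num, result_str, ambient_string, hlts );
           if ( code == TextSearch::e_found )
           {
                   // Copy the match into the fixed-size ASCII buffer for printing.
                   result_str.ConvertToAscii(buf, 32, true);
                   cout << "found one instance: " << buf << endl;
           }
           else if ( code == TextSearch::e_page )
           {
                   // e_page_stop is set, so Run() also returns at the end of each page.
                   // Perform any updates (e.g., UI updates) here, then keep searching.
           }
           else
           {
                   // e_done: the end of the document has been reached.
                   break;
           }
 }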

For a full sample, please take a look at the TextSearch sample project.


Member Typedef Documentation

typedef TRN_UInt32 pdftron::PDF::TextSearch::Mode

Typedef the search mode.


Member Enumeration Documentation

anonymous enum

Search modes that control how searching is conducted.

Enumerator:
e_reg_expression 
e_case_sensitive 
e_whole_word 
e_search_up 
e_page_stop 
e_highlight 
e_ambient_string 
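Because each enumerator is the previous one shifted left by one bit, the modes are independent bit flags that can be combined with bitwise OR. The following is a minimal stand-alone sketch of that arithmetic. The namespace sketch and the helpers HasFlag, WithFlag and WithoutFlag are hypothetical and are not part of TextSearch; only the enumerator names and values are taken from the listing above.

```cpp
#include <cstdint>

namespace sketch {

// Stand-in for TextSearch::Mode and its flags, copied from the enum listing.
using Mode = std::uint32_t;

enum : Mode {
    e_reg_expression = 0x0001,
    e_case_sensitive = e_reg_expression << 1,
    e_whole_word     = e_case_sensitive << 1,
    e_search_up      = e_whole_word << 1,
    e_page_stop      = e_search_up << 1,
    e_highlight      = e_page_stop << 1,
    e_ambient_string = e_highlight << 1
};

// Each flag occupies its own bit, so the seventh flag is 1 << 6.
static_assert(e_ambient_string == 0x0040, "seven flags occupy bits 0 through 6");

// True when every bit of `flag` is set in `mode`.
constexpr bool HasFlag(Mode mode, Mode flag) { return (mode & flag) == flag; }

// Return `mode` with `flag` switched on or off.
constexpr Mode WithFlag(Mode mode, Mode flag) { return mode | flag; }
constexpr Mode WithoutFlag(Mode mode, Mode flag) { return mode & ~flag; }

}  // namespace sketch
```

The same OR and AND-NOT operations apply to a real TextSearch::Mode value obtained from GetMode() before it is passed back to SetMode().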

enum pdftron::PDF::TextSearch::ResultCode

The code indicating the reason that the search process returns.

Enumerator:
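e_done the end of the document has been reached.
e_page the end of a page has been reached (returned only if 'e_page_stop' is set).
e_found an instance matching the search pattern has been found.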


Constructor & Destructor Documentation

pdftron::PDF::TextSearch::TextSearch (  ) 

Constructor and destructor.

pdftron::PDF::TextSearch::~TextSearch (  ) 


Member Function Documentation

bool pdftron::PDF::TextSearch::Begin ( PDFDoc &  doc,
const UString &  pattern,
Mode  mode,
int  start_page = -1,
int  end_page = -1 
)

Initialize the search process.

This should be called before starting the actual search with method Run().

Parameters:
doc the PDF document to search in.
pattern the pattern to search for. In regular expression mode it is the expression; in verbatim mode it is the exact string to search for.
mode the mode of the search process.
start_page the start page of the page range to search in. The default value is -1 indicating the range starts from the first page.
end_page the end page of the page range to search in. The default value is -1 indicating the range ends at the last page.
Returns:
true if the initialization has succeeded.
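The meaning of the default page arguments can be modelled with a small stand-alone sketch. ResolvePageRange and PageRange are hypothetical names, not part of the TextSearch API, and the sketch assumes 1-based page numbers and a document with at least one page.

```cpp
namespace sketch {

struct PageRange {
    int first;
    int last;
};

// Hypothetical helper modelling the documented defaults of Begin():
// -1 for start_page means the first page, -1 for end_page means the last page.
// Assumes 1-based page numbers and page_count >= 1.
constexpr PageRange ResolvePageRange(int start_page, int end_page, int page_count) {
    return PageRange{start_page == -1 ? 1 : start_page,
                     end_page == -1 ? page_count : end_page};
}

}  // namespace sketch
```

Under these assumptions, leaving both arguments at -1 for a 10-page document corresponds to searching pages 1 through 10.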

ResultCode pdftron::PDF::TextSearch::Run ( int &  page_num,
UString &  result_str,
UString &  ambient_str,
Highlights &  hlts 
)

Search the document and return when one of the following occurs: a) the end of the document is reached; b) the end of a page is reached (only if mode 'e_page_stop' is set); c) an instance matching the search pattern is found.

Note that this method should be called in a loop in order to find all matching instances; in other words, the search is conducted in an incremental fashion. In addition, the resulting information only makes sense when the returned code is 'e_found'.

Parameters:
page_num the number of the page the found instance is on.
result_str the found string that matches the search pattern.
ambient_str the ambient string of the found string (computed if 'e_ambient_string' is set).
hlts the Highlights info associated with the found string (computed if 'e_highlight' is set).
Returns:
the code indicating why the search returned. The output parameters are meaningful only when the returned code is 'e_found'.
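The incremental calling pattern can be sketched without PDFNet by driving the loop against a scripted stand-in. ScriptedSearch, Summary and CollectAll are hypothetical names invented for this illustration; the stand-in's Run() takes only two of the four output parameters and simply replays a fixed list of result codes. The sketch also assumes that page numbers reported with a match are positive.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

namespace sketch {

enum ResultCode { e_done = 0, e_page = 1, e_found = 2 };

// Hypothetical stand-in for TextSearch: replays a fixed script of events.
class ScriptedSearch {
public:
    struct Event {
        ResultCode code;
        int page;
        std::string text;
    };

    explicit ScriptedSearch(std::vector<Event> events) : events_(std::move(events)) {}

    // Outputs are written only for e_found, mirroring the note that the
    // resulting information is meaningful only for that code.
    ResultCode Run(int& page_num, std::string& result_str) {
        if (next_ >= events_.size()) return e_done;
        const Event& e = events_[next_++];
        if (e.code == e_found) {
            page_num = e.page;
            result_str = e.text;
        }
        return e.code;
    }

private:
    std::vector<Event> events_;
    std::size_t next_ = 0;
};

struct Summary {
    int pages_finished = 0;
    std::vector<std::string> hits;
};

// Calls Run() until e_done, recording matches and counting page boundaries.
Summary CollectAll(ScriptedSearch& search) {
    Summary summary;
    int page_num = 0;
    std::string result;
    for (;;) {
        const ResultCode code = search.Run(page_num, result);
        if (code == e_found) {
            assert(page_num > 0);  // the sketch assumes positive page numbers
            summary.hits.push_back(std::to_string(page_num) + ": " + result);
        } else if (code == e_page) {
            ++summary.pages_finished;  // a caller could update its UI here
        } else {
            break;  // e_done
        }
    }
    return summary;
}

}  // namespace sketch
```

The key design point is that e_page is treated as a reason to continue, not to stop; only e_done ends the loop.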

bool pdftron::PDF::TextSearch::SetPattern ( const UString &  pattern  ) 

Set the current search pattern.

Note that it is not necessary to call this method since the search pattern is already set when calling the Begin() method. This method is provided for users to change the search pattern while searching through a document.

Parameters:
pattern the search pattern to set.
Returns:
true if the setting has succeeded.

Mode pdftron::PDF::TextSearch::GetMode (  )  const

Retrieve the current search mode.

Returns:
the current search mode.

void pdftron::PDF::TextSearch::SetMode ( Mode  mode  ) 

Set the current search mode.

For example, the following code turns on regular expression mode:

 TextSearch ts;
 ...
 TextSearch::Mode mode = ts.GetMode();
 mode |= TextSearch::e_reg_expression;
 ts.SetMode(mode);
 ...

Parameters:
mode the search mode to set.

int pdftron::PDF::TextSearch::GetCurrentPage (  )  const

Retrieve the number of the page currently being searched.

If the returned value is -1, it indicates the search process has not been initialized (e.g., Begin() is not called yet); if the returned value is 0, it indicates the search process has finished, and if the returned value is positive, it is a valid page number.

Returns:
the current page number.
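The three documented cases of the return value can be captured in a small classification helper. This is a stand-alone sketch: SearchState and ClassifyCurrentPage are hypothetical names, and the Unknown case for values below -1 is an assumption, since the description above does not define them.

```cpp
namespace sketch {

enum class SearchState { NotInitialized, Finished, OnPage, Unknown };

// Maps a value returned by GetCurrentPage() to the documented meaning.
// Values below -1 are not described in the reference; they map to Unknown here.
constexpr SearchState ClassifyCurrentPage(int current_page) {
    return current_page > 0   ? SearchState::OnPage
         : current_page == 0  ? SearchState::Finished
         : current_page == -1 ? SearchState::NotInitialized
                              : SearchState::Unknown;
}

}  // namespace sketch
```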


© 2002-2010 PDFTron Systems Inc.