Re: how to extract Text from a pdf file

Hi,

I am successfully extracting text from a PDF with PDWordFinder, but there is an issue with ligature text.

Can anyone tell me whether it is possible to stop ligature expansion, and if so, how?

My PDF file contains the word "office", and it is being extracted as "offi ce".

Here is my code:

    PDWordFinderConfigRec wfConfig;    /* WordFinder configuration record */

    memset(&wfConfig, 0, sizeof(PDWordFinderConfigRec));
    wfConfig.noXYSort = true;
    wfConfig.noLigatureExp = false;

    wordFinder = PDDocCreateWordFinderEx(pdDoc, WF_LATEST_VERSION, toUnicode, &wfConfig);
    pageNum = AVPageViewGetPageNum(pageView);
    PDWordFinderAcquireWordList(wordFinder, pageNum, &wInfo, NULL, NULL, &count);

    for (i = 0; i < count; i++)
    {
        memset(str, '\0', MAX_PATH);
        word = PDWordFinderGetNthWord(wordFinder, i);
        PDWordGetString(word, str, PDWordGetLength(word));

        attrib = PDWordGetAttrEx(word, 0);

        if ((attrib & WXE_ADJACENT_TO_SPACE) && !(attrib & WXE_LAST_WORD_ON_LINE) && !(attrib & WXE_HAS_LIGATURE))
            strcat(str, " ");

        fprintf(pFileTexts, "%s", str);
    }
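
One workaround I am considering, in case it helps frame the question: going by its name, setting wfConfig.noLigatureExp to true might leave the ligature in the word as a single character instead of expanding it. I could then expand it myself without any word break being inserted. Below is a rough sketch of such a post-processing step in plain C. It does not call the Acrobat SDK at all. The function name expand_ligatures_utf8 is made up for this example, and it assumes the extracted string is UTF-8 and that the ligature appears as one of the Unicode presentation forms U+FB00 to U+FB04. Both assumptions would need to be checked against what the word finder actually returns.

```c
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper, not part of the Acrobat SDK.
 * Copies src to dst, replacing the UTF-8 encoded Latin ligatures
 * U+FB00..U+FB04 (ff, fi, fl, ffi, ffl) with their plain letters.
 * Assumes src is a NUL-terminated UTF-8 string.
 * Returns 0 on success, -1 if dst is too small (dst stays NUL-terminated).
 */
int expand_ligatures_utf8(const char *src, char *dst, size_t dst_size)
{
    static const char *const expansions[] = { "ff", "fi", "fl", "ffi", "ffl" };
    const unsigned char *p = (const unsigned char *)src;
    size_t out = 0;

    if (dst_size == 0)
        return -1;

    while (*p != '\0') {
        /* U+FB00..U+FB04 are encoded in UTF-8 as EF AC 80 .. EF AC 84. */
        if (p[0] == 0xEF && p[1] == 0xAC && p[2] >= 0x80 && p[2] <= 0x84) {
            const char *rep = expansions[p[2] - 0x80];
            size_t len = strlen(rep);

            if (out + len >= dst_size) {
                dst[out] = '\0';
                return -1;
            }
            memcpy(dst + out, rep, len);
            out += len;
            p += 3;
        } else {
            /* Any other byte is copied through unchanged. */
            if (out + 1 >= dst_size) {
                dst[out] = '\0';
                return -1;
            }
            dst[out++] = (char)*p++;
        }
    }

    dst[out] = '\0';
    return 0;
}
```

With this, a word that comes back as "o", the single ffi ligature character, then "ce" would be written out as "office" with no space in the middle.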

