Hi,
I am successfully extracting text from a PDF with PDWordFinder, but there is an issue with ligature text.
Can anyone let me know whether it is possible to stop ligature expansion, and if so, how?
There is a word "office" in my PDF file, and it is getting expanded as "offi ce".
Here is my code:
PDWordFinderConfigRec wfConfig; /* WordFinder configuration record */
memset(&wfConfig, 0, sizeof(PDWordFinderConfigRec));
wfConfig.noXYSort = true;
wfConfig.noLigatureExp = false;
wordFinder = PDDocCreateWordFinderEx (pdDoc, WF_LATEST_VERSION, toUnicode, &wfConfig);
pageNum = AVPageViewGetPageNum (pageView);
PDWordFinderAcquireWordList (wordFinder, pageNum, &wInfo, NULL, NULL, &count);
for (i = 0; i < count; i++)
{
    memset (str, '\0', MAX_PATH);
    word = PDWordFinderGetNthWord (wordFinder, i);
    PDWordGetString (word, str, PDWordGetLength(word));
    attrib = PDWordGetAttrEx (word, 0);
    if ((attrib & WXE_ADJACENT_TO_SPACE) && !(attrib & WXE_LAST_WORD_ON_LINE) && !(attrib & WXE_HAS_LIGATURE))
        strcat (str, " ");
    fprintf (pFileTexts, "%s", str);
}
However, for every word, (attrib & WXE_HAS_LIGATURE) never evaluates to true,
so I am not able to detect words that contain ligatures.
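
One thing that may be worth checking is the polarity of the flag. The sketch below is a standalone illustration, not SDK code: `DemoWordFinderConfig`, `demo_config_init` and `demo_ligature_expansion_enabled` are hypothetical names that only mirror the two fields set in the code above. It assumes that a "no..." field turns the feature off when it is true, which is how the field name reads; the SDK documentation for PDWordFinderConfigRec should be treated as the authority.

```c
#include <stdbool.h>
#include <string.h>

/* Hypothetical stand-in for the two PDWordFinderConfigRec fields set above.
   This is NOT the SDK type; it only illustrates the flag polarity. */
typedef struct {
    bool noXYSort;
    bool noLigatureExp;
} DemoWordFinderConfig;

/* Zero the record, as the memset in the original code does.
   Every "no..." flag therefore starts out false. */
void demo_config_init(DemoWordFinderConfig *cfg)
{
    memset(cfg, 0, sizeof *cfg);
}

/* Assumption: a "no..." flag switches the feature OFF when it is true,
   so expansion is active whenever noLigatureExp is false. */
bool demo_ligature_expansion_enabled(const DemoWordFinderConfig *cfg)
{
    return !cfg->noLigatureExp;
}
```

Under that assumption, a zeroed record with noLigatureExp left at false keeps ligature expansion switched on, so setting the field to true would be the first thing to try when the goal is to stop the expansion.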