Channel: Adobe Community: Message List - Acrobat SDK

Re: Built-in Encoding text extraction


Thank you for posting the files. I agree with you that something is happening which I did not think was possible. The action of Create Tags is making the text extractable.

 

The rules for text extraction are documented in the PDF Reference, and Acrobat is following the rules correctly in both cases. The difference in the AFTER file (and I can reproduce this) is that a ToUnicode CMap has been added which contains the correct information. I am surprised. Somehow Create Tags decided that the text extraction was bad, did a better job, and created a ToUnicode CMap. Perhaps it used information in the PDF; perhaps it used information in my locally installed Verdana font.
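
If you want to confirm the same difference in your own BEFORE and AFTER files, a rough check is to look for a /ToUnicode entry in the raw bytes. This is only a sketch: the function name is my own, and it assumes the font dictionaries are stored uncompressed. Files saved with object streams (PDF 1.5 and later) can hide them, and then you would need a real PDF parser.

```python
import re

# Matches a /ToUnicode key followed by an indirect reference such as "12 0 R".
TOUNICODE_RE = re.compile(rb"/ToUnicode\s+\d+\s+\d+\s+R")


def count_tounicode_refs(pdf_bytes: bytes) -> int:
    """Count /ToUnicode references visible in the raw (uncompressed) bytes.

    Hypothetical helper for illustration only. A result of 0 does not prove
    the entry is absent: it may sit inside a compressed object stream.
    """
    return len(TOUNICODE_RE.findall(pdf_bytes))


# Synthetic font dictionaries standing in for the BEFORE and AFTER files.
before = b"<< /Type /Font /Subtype /TrueType /BaseFont /Verdana >>"
after = b"<< /Type /Font /Subtype /TrueType /BaseFont /Verdana /ToUnicode 12 0 R >>"

print(count_tounicode_refs(before), count_tounicode_refs(after))
```

For real files, read each one in binary mode and pass the bytes to the function; a higher count in the AFTER file is consistent with what I describe above.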

 

Anyway, there it is. I would not care to try to reproduce this behaviour, so perhaps adding tags is what you need to automate. I cannot, however, say if there is an API to do that. Perhaps someone knows the answer to that.


Even detecting this is difficult: you cannot rely on random discoveries like a built-in encoding or MacRomanEncoding (which belongs to a different font anyway). Perhaps the only way is to extract the text and apply a heuristic that says "this does not look like text in natural language". Sometimes the heuristic will be wrong, but probably not for fixed classes of source (like newspapers). This will not be a magic bullet; in most cases I imagine files that are not extractable will stay that way.
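
To make the heuristic idea concrete, here is a minimal sketch of one possible version. The function name, the two thresholds, and the notion of a "word-like" token are all my own assumptions, not anything Acrobat does. It flags text that is full of control or private-use characters, or whose tokens mostly do not look like words. It will not catch garbage that happens to consist of ordinary letters; that would need a dictionary or letter-frequency check for the language in question.

```python
import re
import unicodedata

# A token is "word-like" if it is made of letters (with optional inner
# apostrophes or hyphens), allowing some leading and trailing punctuation.
WORD_RE = re.compile(r"""["'(\[]*[^\W\d_]+(?:['’\-][^\W\d_]+)*[.,;:!?"')\]]*""")

# Unicode categories that rarely appear in correctly extracted prose:
# control characters, private-use code points, unassigned code points.
SUSPICIOUS_CATEGORIES = {"Cc", "Co", "Cn"}


def looks_like_natural_language(text, min_clean=0.95, min_wordlike=0.7):
    """Crude guess at whether extracted text is readable prose.

    Thresholds are illustrative assumptions; tune them per class of source.
    """
    chars = [ch for ch in text if not ch.isspace()]
    tokens = text.split()
    if not chars or not tokens:
        return False

    # Fraction of non-space characters that are not suspicious.
    suspicious = sum(
        1 for ch in chars
        if ch == "\ufffd" or unicodedata.category(ch) in SUSPICIOUS_CATEGORIES
    )
    clean_ratio = 1 - suspicious / len(chars)

    # Fraction of whitespace-separated tokens that look like words.
    wordlike = sum(1 for tok in tokens if WORD_RE.fullmatch(tok))
    wordlike_ratio = wordlike / len(tokens)

    return clean_ratio >= min_clean and wordlike_ratio >= min_wordlike
```

You would run this on the text extracted from each page and treat a False result as "probably not extractable". As said, expect some wrong answers, for example on pages that are mostly numbers or tables.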

