Channel: Adobe Community: Message List - Acrobat SDK

Re: Refrying PDF with subset embedded fonts fixes text extraction

No, entirely expected. The ToUnicode CMap is a part of PDF with no purpose except to aid text extraction. It plays no role in display (or printing). As part of refrying you "print" the PDF, so Acrobat writes PostScript which will look exactly like the PDF - and that's all. Just as the PostScript contains no interactive features (e.g. links) it contains no ToUnicode CMap; indeed there is nowhere in the PostScript to put one.

 

A plug-in can certainly remove the ToUnicode CMap. However, in >90% of cases it makes extraction better or no worse, so removing it for the sake of a few broken files will probably make things worse for you overall. And be under no illusion: the files ARE broken, because text extraction must, by definition, use a ToUnicode CMap if there is one, so every correct way of extracting text will give the wrong text. Only apps that don't follow those rules (or workflows that lose the ToUnicode CMap) fall back to looking at the fonts.
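
To make the lookup order concrete, here is a minimal Python sketch of that rule. It is not how Acrobat or any particular library does it: the sample CMap fragment, the function names `parse_bfchar` and `extract_text`, and the `font_fallback` table are all made up for illustration, and only `bfchar` entries are handled (real CMaps also use `bfrange`).

```python
import re

# Hypothetical, heavily simplified ToUnicode CMap fragment (bfchar entries only).
SAMPLE_CMAP = """
2 beginbfchar
<01> <0048>
<02> <0069>
endbfchar
"""

def parse_bfchar(cmap_text):
    """Return {character code: Unicode string} from the bfchar sections."""
    mapping = {}
    for section in re.findall(r"beginbfchar(.*?)endbfchar", cmap_text, re.S):
        for src, dst in re.findall(r"<([0-9A-Fa-f]+)>\s*<([0-9A-Fa-f]+)>", section):
            # Destination values are UTF-16BE code units written as hex.
            mapping[int(src, 16)] = bytes.fromhex(dst).decode("utf-16-be")
    return mapping

def extract_text(codes, to_unicode=None, font_fallback=None):
    """Map character codes to text.

    If a ToUnicode mapping is present it is always used, even when it is
    wrong. The font-derived table (an assumption standing in for whatever
    an app reads from the font) is consulted only when there is no CMap.
    """
    table = to_unicode if to_unicode is not None else (font_fallback or {})
    return "".join(table.get(code, "\ufffd") for code in codes)

good_cmap = parse_bfchar(SAMPLE_CMAP)
broken_cmap = {1: "X", 2: "Y"}   # stands in for a broken file
from_font = {1: "H", 2: "i"}     # what the font itself would suggest

print(extract_text([1, 2], good_cmap, from_font))
print(extract_text([1, 2], broken_cmap, from_font))
print(extract_text([1, 2], None, from_font))
```

The second call shows the broken-file case: the CMap is wrong, but a conforming extractor still uses it. The third call corresponds to a refried file, where the CMap is gone and only the font information is left.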

