Channel: Adobe Community: Message List - Acrobat SDK

When getting text from a PDF using CreateTextSelect, some text does not match the PDF.

I have a VB.NET (VS2010) function that writes the text from a PDF to a text file. It works correctly in general, but in one group of PDFs a couple of words come out garbled when the text is pulled from the PDF. The project references Acrobat X.

Example: the extraction returns "Foucrltihn icQaularter" where the PDF reads "Fourth Quarter". All other text is extracted correctly to the text file.
If I do a "Save As" to XML or plain text format, this text is correct.
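
A quick check that may help narrow this down (a sketch, not a diagnosis): the garbled string looks like "Fourth Quarter" with extra letters mixed in. The hypothetical helper below, RemoveSubsequence in a module I have called InterleaveCheck, walks the garbled string, consumes the expected characters in order, and returns whatever is left over. Neither name is part of the Acrobat SDK; this is plain string handling.

```vbnet
Imports System
Imports System.Text

Module InterleaveCheck

    ' Hypothetical helper: consumes the characters of "expected" in order while
    ' walking "garbled" once. Returns the leftover characters, or Nothing if
    ' "expected" is not a subsequence of "garbled".
    Function RemoveSubsequence(ByVal garbled As String, ByVal expected As String) As String
        Dim leftover As New StringBuilder()
        Dim pos As Integer = 0

        For Each ch As Char In garbled
            If pos < expected.Length AndAlso ch = expected(pos) Then
                pos += 1              ' this character belongs to the expected text
            Else
                leftover.Append(ch)   ' this character came from somewhere else
            End If
        Next

        If pos < expected.Length Then Return Nothing
        Return leftover.ToString()
    End Function

    Sub Main()
        Console.WriteLine(RemoveSubsequence("Foucrltihn icQaularter", "Fourth Quarter"))
    End Sub

End Module
```

For the string above, the leftover letters spell "clinical". That suggests, but does not prove, that two pieces of text occupying roughly the same position on the page are being merged character by character during extraction.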
Here is my code:

' fileNameTxt, fileNamePDF and numLines come from the enclosing function (not shown).
Dim oSW As System.IO.StreamWriter = New StreamWriter(fileNameTxt)
Dim PDDoc As New Acrobat.AcroPDDoc
Dim CAcroRect As New Acrobat.AcroRect
Dim PDPage As Acrobat.AcroPDPage
Dim PDTxtSelect As Acrobat.AcroPDTextSelect
Dim CAcroPoint As Acrobat.AcroPoint
Dim sPgTxt As String = String.Empty
Dim iNumWords As Integer
Dim iMax As Integer
Dim arPdfLines() As String
Dim i As Integer

If PDDoc.Open(fileNamePDF) Then
    ' Build a rectangle that covers the whole first page
    PDPage = PDDoc.AcquirePage(0)
    CAcroPoint = PDPage.GetSize()
    CAcroRect.Top = CAcroPoint.y
    CAcroRect.Left = 0
    CAcroRect.right = CAcroPoint.x
    CAcroRect.bottom = 0

    ' Select all text inside that rectangle on page 0
    PDTxtSelect = PDDoc.CreateTextSelect(0, CAcroRect)
    iNumWords = PDTxtSelect.GetNumText
    iMax = iNumWords - 1

    ' Concatenate every text element of the selection
    For i = 0 To iMax
        sPgTxt = sPgTxt & PDTxtSelect.GetText(i)
    Next

    ' Split the string on newlines,
    ' put each line in an array element
    arPdfLines = Split(sPgTxt, vbCrLf)
    iMax = UBound(arPdfLines)
    If iMax < numLines Then numLines = iMax
    For i = 0 To numLines
        oSW.Write(CStr(i) & ": " & arPdfLines(i) & vbCrLf)
    Next

    ' Release the selection and the document
    PDTxtSelect.Destroy()
    PDDoc.Close()
End If
' Flush and close the output file
oSW.Close()

Any ideas on what is causing this? Only these two words come over garbled.

Thanks

