I have a VB.NET (VS2010) function that extracts the text from a PDF and writes it to a text file. The project references Acrobat X. This works correctly in general, but for one group of PDFs a couple of words come out garbled.
Example: the extracted text is "Foucrltihn icQaularter", where the PDF displays "Fourth Quarter". All other text is extracted to the text file correctly.
If I do a "Save As" to XML or plain-text format, this text comes out correctly.
Here is my code:
' Requires Imports System.IO and a project reference to the Acrobat type library.
' fileNamePDF, fileNameTxt and numLines are supplied by the enclosing function.
Dim PDDoc As New Acrobat.AcroPDDoc
Dim CAcroRect As New Acrobat.AcroRect
Dim PDPage As Acrobat.AcroPDPage
Dim PDTxtSelect As Acrobat.AcroPDTextSelect
Dim CAcroPoint As Acrobat.AcroPoint
Dim sPgTxt As New System.Text.StringBuilder()
Dim iNumWords As Integer
Dim iMax As Integer
Dim arPdfLines() As String
Dim i As Integer

If PDDoc.Open(fileNamePDF) Then
    ' Build a rectangle that covers the whole of page 0
    PDPage = PDDoc.AcquirePage(0)
    CAcroPoint = PDPage.GetSize()
    CAcroRect.Top = CAcroPoint.y
    CAcroRect.Left = 0
    CAcroRect.right = CAcroPoint.x
    CAcroRect.bottom = 0

    ' Select all text inside the rectangle and concatenate the pieces
    PDTxtSelect = PDDoc.CreateTextSelect(0, CAcroRect)
    iNumWords = PDTxtSelect.GetNumText
    For i = 0 To iNumWords - 1
        sPgTxt.Append(PDTxtSelect.GetText(i))
    Next

    ' Split the string on newlines,
    ' put each line in an array element
    arPdfLines = Split(sPgTxt.ToString(), vbCrLf)
    iMax = UBound(arPdfLines)
    If iMax < numLines Then numLines = iMax

    ' Using closes the output file even if a write fails
    Using oSW As New StreamWriter(fileNameTxt)
        For i = 0 To numLines
            oSW.Write(CStr(i) & ": " & arPdfLines(i) & vbCrLf)
        Next
    End Using
    PDDoc.Close()
End If
Any ideas on what is causing this? Only these two words come through garbled.
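One thing that may help narrow this down: the garbled string seems to contain every character of "Fourth Quarter" in the right order, with extra characters mixed in between. The sketch below checks that and collects the extra characters. It is self-contained and does not touch Acrobat. The module name InterleaveCheck and the function LeftoverAfterSubsequence are hypothetical names made up for this test, not part of any library.

```vbnet
Imports System
Imports System.Text

Module InterleaveCheck
    ' Walks through garbled and matches the characters of expected in order.
    ' Returns the characters that were not matched, or Nothing if expected
    ' does not occur as an in-order subsequence of garbled.
    Function LeftoverAfterSubsequence(ByVal garbled As String, ByVal expected As String) As String
        Dim leftover As New StringBuilder()
        Dim j As Integer = 0 ' index of the next unmatched character in expected
        For Each ch As Char In garbled
            If j < expected.Length AndAlso ch = expected(j) Then
                j += 1
            Else
                leftover.Append(ch)
            End If
        Next
        If j < expected.Length Then Return Nothing
        Return leftover.ToString()
    End Function

    Sub Main()
        ' Prints: clinical
        Console.WriteLine(LeftoverAfterSubsequence("Foucrltihn icQaularter", "Fourth Quarter"))
    End Sub
End Module
```

The leftover characters spell "clinical", so the extracted string looks like two pieces of text interleaved character by character rather than random corruption. One possible explanation, and this is only a guess, is that the PDF has a second text run positioned on top of "Fourth Quarter" and the selection returns the characters of both runs merged together.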
Thanks
When getting text from a PDF using CreateTextSelect some text does not match PDF.