Caring Programmer: C# extracting garbage characters from PDF

This week I ran into an interesting problem.

How do you extract garbage characters from a PDF and use them in C#?

What are garbage characters in a PDF?

Basically, when you select some text in any PDF reader (Foxit Reader, Adobe Reader, Nitro PDF Reader) and paste it into a plain text editor like Notepad, the characters are not recognized. Instead of the text you get boxes, musical notes and so on.
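
If a program receives pasted or extracted text like this, it can at least notice that something is wrong. Here is a minimal sketch of such a check. The class name `GarbageText`, the method `LooksLikeGarbage` and the 0.2 threshold are my own assumptions for illustration, not part of any PDF library, and the heuristic is rough: it only counts control characters, private-use characters and U+FFFD.

```csharp
using System.Globalization;

// Hypothetical helper for illustration only, not part of any PDF library.
public static class GarbageText
{
    // Returns true when a noticeable share of the characters look like
    // failed glyph-to-Unicode mappings. The default threshold is a guess.
    public static bool LooksLikeGarbage(string text, double threshold = 0.2)
    {
        if (string.IsNullOrEmpty(text)) return false;

        int suspicious = 0;
        foreach (char c in text)
        {
            // Ordinary whitespace is a control character too, so skip it.
            if (c == '\r' || c == '\n' || c == '\t') continue;

            bool isControl = char.IsControl(c);
            bool isPrivateUse = char.GetUnicodeCategory(c) == UnicodeCategory.PrivateUse;
            bool isReplacement = c == '\uFFFD';

            if (isControl || isPrivateUse || isReplacement) suspicious++;
        }

        // Share of suspicious characters in the whole string.
        return (double)suspicious / text.Length >= threshold;
    }
}
```

Calling `GarbageText.LooksLikeGarbage(pastedText)` on normal text returns false, while a string made mostly of control or private-use characters returns true.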

So I wondered what was happening and whether I could solve it. I thought I could, but I was wrong.
I investigated the PDF with a lot of tools and also looked inside the PDF format itself to figure out how to get those characters into my C# program.

You may ask yourself: how can I see those characters in the reader but not extract them?

Well, the thing is that readers are not showing plain text. They are drawing glyphs, which are small shapes defined in the font that the reader uses to render the text.
A glyph is normally connected to a Unicode character through the font's ToUnicode table inside the PDF. So each glyph can be mapped to some Unicode character.

When you get garbage characters, it means that the ToUnicode table for that font and those glyphs is either corrupted or missing.

So when you copy and paste text from the reader, it also uses the ToUnicode table to "map" the copied glyphs to Unicode characters. When it can't, it produces garbage characters.
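
To make the idea concrete, here is a simplified sketch of that lookup. It is only a model: the names `GlyphDecoder` and `Decode` are made up, the table is a plain dictionary rather than a real ToUnicode table parsed from a PDF, and the fallback (emitting the raw glyph code as a character) is my assumption about why boxes and musical notes show up, not documented behavior of any particular reader.

```csharp
using System.Collections.Generic;
using System.Text;

// Hypothetical model of a ToUnicode lookup, not a real PDF library API.
public static class GlyphDecoder
{
    // Converts glyph codes to text using a ToUnicode-style table.
    // A glyph code without an entry is emitted as a raw character,
    // which usually ends up as a control character or a wrong symbol.
    public static string Decode(IEnumerable<int> glyphCodes,
                                IDictionary<int, string> toUnicode)
    {
        var sb = new StringBuilder();
        foreach (int code in glyphCodes)
        {
            string text;
            if (toUnicode != null && toUnicode.TryGetValue(code, out text))
            {
                sb.Append(text);          // mapping found: real text
            }
            else if (code >= 0 && code <= 0xFFFF)
            {
                sb.Append((char)code);    // assumed fallback: raw code
            }
            else
            {
                sb.Append('\uFFFD');      // code does not fit in a char
            }
        }
        return sb.ToString();
    }
}
```

With a complete table, glyph codes 1, 2, 3 mapped to "P", "D", "F" decode to "PDF". Remove an entry from the table and the same codes decode to text with control characters in it, which is the kind of output you see in Notepad.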

OK, so basically I didn't manage to extract the text. So what did I do next to get myself out of the problem?

Well, I was lucky: the project didn't require any manipulation of the text, and the coordinates of the text I needed were practically the same in every PDF I encountered. So I went with a different solution: convert the PDF into an image, extract that portion of the text as an image, and show it in the form.
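
The cropping step on its own can be sketched like this. It assumes the page has already been rendered to a `Bitmap` by some PDF tool (that part is not shown), that `System.Drawing` is available, and that the region is given in pixels of that bitmap. `PageCropper` and `CropRegion` are names I made up for the example.

```csharp
using System;
using System.Drawing;

// Hypothetical helper: cuts a fixed region out of an already rendered page.
public static class PageCropper
{
    // Returns a new bitmap containing only the requested region.
    // The caller is responsible for disposing both bitmaps.
    public static Bitmap CropRegion(Bitmap page, Rectangle region)
    {
        if (page == null) throw new ArgumentNullException("page");

        // Clamp the region to the page so Clone does not throw
        // when the coordinates run slightly past the edge.
        Rectangle bounds = new Rectangle(0, 0, page.Width, page.Height);
        Rectangle clipped = Rectangle.Intersect(bounds, region);

        if (clipped.Width <= 0 || clipped.Height <= 0)
            throw new ArgumentException("Region lies outside the page.", "region");

        return page.Clone(clipped, page.PixelFormat);
    }
}
```

The result could then be assigned to something like a `PictureBox.Image` on the form. Since the coordinates were nearly the same in every PDF, one fixed `Rectangle` would be enough in a case like mine.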

I will talk about this in my next posts.

By the way, I use Notepad++ as my editor for these things, and I must say it works great for me.

Here is the link: http://notepad-plus-plus.org/ if you want to try it.


Friday, 3 February 2012
