Sunday, February 5, 2012

Friday, February 3, 2012

C# how to read a TIFF and crop it

This is very simple, and it is the final step of my work on extracting garbage characters from a PDF.

I will put the code here so you have an example of how to do it.
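A minimal sketch of reading a TIFF and cropping it, assuming the standard System.Drawing classes (Bitmap, Rectangle, ImageFormat). The names TiffCropper, Crop and CropFile are made up for this example, and so are the paths and coordinates in the usage line below.

```csharp
using System;
using System.Drawing;
using System.Drawing.Imaging;

public static class TiffCropper
{
    // Returns a copy of the given region of the source image.
    // The region is clipped to the image bounds first, because
    // Bitmap.Clone throws when the rectangle reaches outside the image.
    public static Bitmap Crop(Bitmap source, Rectangle region)
    {
        Rectangle bounds = new Rectangle(0, 0, source.Width, source.Height);
        Rectangle clipped = Rectangle.Intersect(bounds, region);
        if (clipped.Width == 0 || clipped.Height == 0)
            throw new ArgumentException("Crop region lies outside the image.", "region");
        return source.Clone(clipped, source.PixelFormat);
    }

    // Loads a TIFF from disk, crops it and saves the result as a new TIFF.
    public static void CropFile(string inputPath, string outputPath, Rectangle region)
    {
        using (Bitmap source = new Bitmap(inputPath))
        using (Bitmap cropped = Crop(source, region))
        {
            cropped.Save(outputPath, ImageFormat.Tiff);
        }
    }
}
```

It could be called like this: TiffCropper.CropFile(@"C:\temp\page.tif", @"C:\temp\text.tif", new Rectangle(100, 200, 400, 50)); where the rectangle is x, y, width and height in pixels of the area that holds the text.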


.NET "The specified module could not be found" error in a project

While working on some of my projects I encountered this error.

It happens mostly because some of the DLLs used by the .NET application are missing, or because they were built with one configuration while your project is built with another.

One solution is to download Dependency Walker

http://www.dependencywalker.com/


and check the DLL that raises the error for its dependencies. If some of the dependencies are missing, put them in a location where the application can find them (for example next to the executable).

Another problem: if you are using a DLL that was built for, say, a 32-bit system and you build your project as Any CPU, this won't work on 64-bit systems, because an Any CPU executable normally runs there as a 64-bit process and cannot load a 32-bit DLL.

You have to build your project for the x86 architecture as well so you can use that 32-bit DLL normally in your project.
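To see which mode your application really runs in, a small check like the following can help. This is only a sketch: the class name PlatformInfo is made up for this example, and Environment.Is64BitOperatingSystem assumes .NET Framework 4 or later (IntPtr.Size works on older versions too).

```csharp
using System;

public static class PlatformInfo
{
    // IntPtr.Size is 4 in a 32-bit process and 8 in a 64-bit process,
    // so it tells you how an Any CPU build was actually started.
    public static string Describe()
    {
        int processBits = IntPtr.Size * 8;
        string osBits = Environment.Is64BitOperatingSystem ? "64-bit" : "32-bit";
        return "Process is " + processBits + "-bit, OS is " + osBits;
    }
}
```

Calling Console.WriteLine(PlatformInfo.Describe()); at startup shows whether the process is 64-bit. If it is, and the DLL you depend on is 32-bit, that mismatch is a likely cause of the load failure.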

You can set a specific architecture for your project in Visual Studio like this:

Right-click on Solution -> Properties -> Configuration Properties -> Configuration Manager


If you don't have, for example, the x86 architecture, go to Configuration Manager, open the Platform drop-down and select New.


You can then choose which platform you want to create.

C# converting PDF to image

This is a continuation of my post about extracting non-standard text from a PDF.

Basically, I did not succeed in doing that the way I wanted, but I was lucky that all of the PDFs had the same format.

So I came up with the idea of converting the PDF to an image and then extracting the portion with the "text" as an image, basically cropping the image to the rectangle where the text is.

To complete this task I used the Ghostscript library
http://www.ghostscript.com/download/gsdnld.html


There is also a good article on CodeProject about how to implement this in C#
http://www.codeproject.com/Articles/32274/How-To-Convert-PDF-to-Image-Using-Ghostscript-API

With these two I managed to convert the PDF to an image. But in what format?

Well, I tested with the CodeProject application from the article and found that I get the clearest text with the TIFF format, which is not so strange because TIFF is widely used in printing.
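As a rough sketch of the conversion step: the CodeProject article calls the Ghostscript DLL directly, while the example below takes a simpler route and starts the Ghostscript console program with Process.Start. The class name PdfToTiff, the gsExePath parameter and the choice of the tiff24nc device at a given DPI are assumptions made for this example. The switches themselves (-dSAFER, -dBATCH, -dNOPAUSE, -sDEVICE, -r, -sOutputFile) are standard Ghostscript options.

```csharp
using System;
using System.Diagnostics;

public static class PdfToTiff
{
    // Builds the Ghostscript command line that renders a PDF
    // to a 24-bit colour TIFF at the requested resolution.
    public static string BuildArguments(string pdfPath, string tiffPath, int dpi)
    {
        return string.Format(
            "-dSAFER -dBATCH -dNOPAUSE -sDEVICE=tiff24nc -r{0} \"-sOutputFile={1}\" \"{2}\"",
            dpi, tiffPath, pdfPath);
    }

    // Runs the Ghostscript console executable and waits for it to finish.
    // gsExePath is an assumption: point it at the console executable
    // of your own Ghostscript installation (for example gswin32c.exe).
    public static void Convert(string gsExePath, string pdfPath, string tiffPath, int dpi)
    {
        ProcessStartInfo info = new ProcessStartInfo(
            gsExePath, BuildArguments(pdfPath, tiffPath, dpi));
        info.UseShellExecute = false;
        info.CreateNoWindow = true;

        using (Process gs = Process.Start(info))
        {
            gs.WaitForExit();
            if (gs.ExitCode != 0)
                throw new InvalidOperationException(
                    "Ghostscript exited with code " + gs.ExitCode);
        }
    }
}
```

A higher -r value gives sharper text at the cost of a bigger file, so it is worth trying a few resolutions on one of your own PDFs.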

So the next step was to read the TIFF into a C# application and crop the image.

NOTE: there is also a good library for converting PDF to image named ImageMagick, which also has a C# wrapper
http://imagemagick.codeplex.com/


It also has a lot of other capabilities like resizing, cropping, etc.
But I did not use it: after setting up a test project with this library and trying out how it converts a PDF to an image, I was not satisfied with the time the conversion takes, so I decided to go with Ghostscript.

C# extracting garbage characters from PDF

This week I encountered an interesting problem.

How do you extract garbage characters from a PDF and use them in C#?

What are garbage characters in a PDF?

Basically, when you select some text in any PDF reader (Foxit Reader, Adobe Reader, Nitro PDF Reader) and paste it into a plain text editor like Notepad, the characters are not recognized: you get boxes, music symbols, etc.

So I wondered what was happening and whether I could solve it. Well, I thought I could, but I was wrong.
I investigated the PDF with a lot of tools and also looked inside the PDF format to figure out how to get those characters into my C# program.

You may ask yourself: how can I see those characters in the reader but not extract them?

Well, the thing is that readers do not show plain text. They show glyphs, which are small shapes defined in the font that the reader uses to draw the text.
A glyph is normally connected to a Unicode character through the ToUnicode table inside the PDF, so each glyph can be represented by some Unicode character.


When the characters come out as garbage, it means that for that font and those glyphs the ToUnicode table is either corrupted or missing.


So when you copy and paste text from the reader, it also uses the ToUnicode table to "map" the copied glyphs to Unicode characters. When that fails, it produces garbage characters.
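To make the idea concrete, here is a toy model of that mapping. It is not how any real PDF reader is implemented, and it does not parse a PDF: the class GlyphDecoder and the plain dictionary standing in for the ToUnicode table are made up for illustration. It only shows why a missing entry turns into a garbage character.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public static class GlyphDecoder
{
    // Maps each glyph code to text through the lookup table.
    // Codes with no entry become U+FFFD, the Unicode replacement
    // character, which plays the role of the "garbage" here.
    public static string Decode(IEnumerable<int> glyphCodes,
                                IDictionary<int, string> toUnicode)
    {
        StringBuilder text = new StringBuilder();
        foreach (int code in glyphCodes)
        {
            string mapped;
            text.Append(toUnicode.TryGetValue(code, out mapped) ? mapped : "\uFFFD");
        }
        return text.ToString();
    }
}
```

With a table that only knows glyph 1 as "H" and glyph 2 as "i", decoding the codes 1, 2, 3 gives "Hi" followed by one replacement character, because glyph 3 has no mapping even though the font can still draw it on screen.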

OK, so basically I did not succeed at that. So what did I do next to get myself out of the problem?

Well, I was lucky that the project did not require any manipulation of the text and that the coordinates of the desired text were practically the same in every PDF I encountered, so I went with a different solution: convert the PDF into an image, extract that portion of text as an image, and show it in the form.

I will talk about this in my next posts.

By the way, I use Notepad++ as my editor for these things and I must say it works great for me.

Here is the link: http://notepad-plus-plus.org/ if you want to try it.