Paperless Office with Djvu

Scanning

In order to scan everything that comes in you usually need to have two scanners: a flatbed one and an AMF scanner. I use … which is my printer at the same time and … which is only a scanner. Most of the documents I can scan with … which is quite fast as I can put multiple pages (maybe 10) at the same time. The flatbed scanner is used for documents which are either to thick (books) or to large (posters) for the other scanner.

Sometimes I scan directly to PDF but sometimes I split the pages up to jpg. This has the advantage that I can remove pages that are not needed and if one page is not scanned well, I just need to rescan this page and can easily replace it. Usually I scan with 300dpi. This gives me a good quality.

To convert multiple JPEG-Files to a PDF document the following command can be used:

OCR

In order to recognize the text of the files and to make it searchable I convert the files to the djvu format. This is easier to hanle with open source tools than PDF. The conversion can be done like this:

Now we can run OCR on all those djvu Files:

Remark: as most of my documents are in german I specified that language. Otherwise it will not recognize the Umlaute. The command also doesn’t work if the filename contains any non-ascii characters.

Other useful commands

OCR for a single pdf-file:

Converting images directly to DJVU

Converting DJVU back to PDF

Spliting DJVU files into single pages

Converting DJVU single pages to images

pre class=”code”>
ddjvu –format=tiff page.djvu page.tiff

Extract specific pages from a djvu file and save them as image:

Show the text that was recognized by OCR:

https://mail.gnome.org/archives/tracker-list/2010-August/msg00020.html

Leave a Reply

Your email address will not be published. Required fields are marked *