Paperless Office with DjVu


To scan everything that comes in, you usually need two scanners: a flatbed scanner and an ADF (automatic document feeder) scanner. I use … which is also my printer, and … which is only a scanner. I can scan most documents with … which is quite fast because I can load multiple pages (around 10) at once. The flatbed scanner is for documents that are either too thick (books) or too large (posters) for the other scanner.

Sometimes I scan directly to PDF, and sometimes I save each page as a separate JPEG. The latter has the advantage that I can remove pages that are not needed, and if a page did not scan well, I only need to rescan that one page and can easily replace it. I usually scan at 300 dpi, which gives good quality.

To convert multiple JPEG files into a single PDF document, the following ImageMagick command can be used:

convert *.jpg "Filename.pdf"


To recognize the text in the files and make it searchable, I convert them to the DjVu format, which is easier to handle with open source tools than PDF. The conversion can be done like this:

for i in *.pdf; do pdf2djvu -o "${i%%.pdf}.djvu" "$i"; done 

Now we can run OCR on all those DjVu files:

for i in *.djvu; do ocrodjvu --in-place "$i" --render all --language deu --clear-text; done 

Remark: as most of my documents are in German, I specified that language. Otherwise the umlauts are not recognized. The command also doesn't work if the filename contains any non-ASCII characters.
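One possible way around the filename limitation is to run OCR on a temporary copy that has a plain ASCII name and then copy the result back. The following is only a sketch under that assumption: the function name ocr_via_ascii_copy and the temporary name input.djvu are made up for illustration, and I have not verified that this avoids the problem in every setup.

```shell
# Sketch: OCR a DjVu file through a temporary copy with an ASCII-only name.
# ocr_via_ascii_copy and input.djvu are hypothetical names, not part of ocrodjvu.
ocr_via_ascii_copy() {
    oac_src=$1
    oac_tmpdir=$(mktemp -d) || return 1
    oac_tmp="$oac_tmpdir/input.djvu"
    oac_status=0
    if cp -- "$oac_src" "$oac_tmp" &&
       ocrodjvu --in-place "$oac_tmp" --render all --language deu --clear-text
    then
        # Overwrite the original only after OCR has succeeded.
        cp -- "$oac_tmp" "$oac_src" || oac_status=1
    else
        echo "OCR failed, file left unchanged: $oac_src" >&2
        oac_status=1
    fi
    rm -rf -- "$oac_tmpdir"
    return "$oac_status"
}
```

It can be used in the same loop as before: for i in *.djvu; do ocr_via_ascii_copy "$i"; done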

Other useful commands

OCR for a single DjVu file:

ocrodjvu --in-place 'alice.djvu'

Converting images directly to DJVU

img2djvu -c1 -d600 -v1 ./out

Converting DJVU back to PDF

ddjvu --format=pdf inputfile.djvu outputfile.pdf

Splitting DJVU files into single pages

djvmcvt -i input.djvu /path/to/out/dir output-index.djvu

Converting DJVU single pages to images

ddjvu --format=tiff page.djvu page.tiff

Extract specific pages from a DjVu file and save them as an image:

ddjvu --format=tiff --page=1-10 input.djvu output.tiff

Show the text that was recognized by OCR:

djvused -e print-txt 'alice.djvu' | head -n10
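Once the files carry a text layer, the archive can be searched from the command line. The sketch below assumes the djvutxt tool from DjVuLibre is installed. The function name search_djvu is made up for illustration. It prints the name of every DjVu file in a directory whose recognized text contains the search term, ignoring case.

```shell
# Sketch: list the DjVu files in a directory whose OCR text contains a term.
# search_djvu is a hypothetical helper name; djvutxt ships with DjVuLibre.
search_djvu() {
    sd_term=$1
    sd_dir=${2:-.}
    for sd_file in "$sd_dir"/*.djvu; do
        # Skip the unexpanded pattern when the directory has no DjVu files.
        [ -e "$sd_file" ] || continue
        if djvutxt "$sd_file" | grep -qi -- "$sd_term"; then
            printf '%s\n' "$sd_file"
        fi
    done
}
```

For example, search_djvu Rechnung ~/scans would list all files under ~/scans (a placeholder path) that mention the word.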
