Scanning
To scan everything that comes in, you usually need two scanners: a flatbed scanner and an ADF (automatic document feeder) scanner. I use … which is also my printer, and … which is only a scanner. I scan most documents with …, which is quite fast because I can load multiple pages (maybe 10) at once. The flatbed scanner is for documents that are either too thick (books) or too large (posters) for the other scanner.
Sometimes I scan directly to PDF, but sometimes I scan each page to a separate JPEG file. This has the advantage that I can remove pages that are not needed, and if one page is scanned badly, I only need to rescan that page and can easily replace it. I usually scan at 300 dpi, which gives me good quality.
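To get a feeling for what 300 dpi means in pixels, here is a small calculation that converts a page size in millimetres to pixel dimensions. The A4 page size (210 x 297 mm) and the helper name mm_to_px are assumptions for illustration only.

```shell
# mm_to_px MM DPI: convert a length in millimetres to pixels (1 inch = 25.4 mm).
# Integer arithmetic only; the "+ 127" rounds to the nearest pixel.
mm_to_px() {
    echo $(( ($1 * $2 * 10 + 127) / 254 ))
}

# Assumed example: an A4 page (210 x 297 mm) scanned at 300 dpi.
echo "A4 at 300 dpi: $(mm_to_px 210 300) x $(mm_to_px 297 300) pixels"
# prints: A4 at 300 dpi: 2480 x 3508 pixels
```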
To convert multiple JPEG files to a PDF document, the following command (from ImageMagick) can be used:
convert *.jpg "Filename.pdf"
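One thing to watch out for: the shell expands *.jpg in alphabetical order, so page10.jpg would end up before page2.jpg in the PDF. The following is a minimal sketch of a workaround, assuming GNU sort (for the -V option) and file names without newlines. The function name list_pages and the page file names are made up for illustration.

```shell
# list_pages DIR: print the JPEG files in DIR in natural (version) order,
# one per line, so that page2.jpg comes before page10.jpg.
# Assumes GNU sort (-V) and file names that contain no newlines.
list_pages() {
    find "$1" -maxdepth 1 -type f -name '*.jpg' | sort -V
}

# Usage sketch in bash (needs ImageMagick's convert, so it is left as a comment):
#   mapfile -t pages < <(list_pages .)
#   convert "${pages[@]}" "Filename.pdf"
```

If the scanner already writes zero-padded names (page001.jpg, page002.jpg, ...), the plain *.jpg glob is enough and this step is not needed.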
OCR
To recognize the text in the files and make it searchable, I convert them to the DjVu format, which is easier to handle with open source tools than PDF. The conversion can be done like this:
for i in *.pdf; do pdf2djvu -o "${i%%.pdf}.djvu" "$i"; done
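The output name in that loop comes from shell parameter expansion, which strips the .pdf suffix before .djvu is appended. A small sketch of just that step, with a hypothetical helper to_djvu_name and made-up file names:

```shell
# "${1%.pdf}" removes a trailing ".pdf" from the first argument.
# Because the pattern contains no wildcard, "%" (shortest match) and
# "%%" (longest match) give the same result here.
to_djvu_name() {
    printf '%s\n' "${1%.pdf}.djvu"
}

to_djvu_name "invoice 2010.pdf"     # prints: invoice 2010.djvu
to_djvu_name "archive.pdf.pdf"      # prints: archive.pdf.djvu
```

Only one trailing .pdf is removed, and the quotes keep names with spaces intact.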
Now we can run OCR on all those DjVu files:
for i in *.djvu; do ocrodjvu --in-place "$i" --render all --language deu --clear-text; done
Remark: as most of my documents are in German, I specified that language; otherwise the umlauts are not recognized. The command also doesn’t work if the filename contains any non-ASCII characters.
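To find the problematic file names before starting a long OCR run, something like the following could be used. It is only a sketch: the function name has_non_ascii is my own, and it flags every byte outside the printable ASCII range (so control characters are reported too).

```shell
# has_non_ascii NAME: succeed (exit 0) if NAME contains any byte outside the
# printable ASCII range (space to tilde), e.g. an umlaut encoded as UTF-8.
has_non_ascii() {
    # Delete all printable ASCII bytes; whatever is left is outside that range.
    rest=$(printf '%s' "$1" | LC_ALL=C tr -d '\040-\176')
    [ -n "$rest" ]
}

# Report DjVu files in the current directory that should be renamed first.
for f in *.djvu; do
    [ -e "$f" ] || continue          # no match: the pattern stays literal
    if has_non_ascii "$f"; then
        echo "rename before OCR: $f"
    fi
done
```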
Other useful commands
OCR for a single DjVu file:
ocrodjvu --in-place 'alice.djvu'
Converting images directly to DjVu:
img2djvu -c1 -d600 -v1 ./out
Converting DjVu back to PDF:
ddjvu --format=pdf inputfile.djvu outputfile.pdf
Splitting DjVu files into single pages:
djvmcvt -i input.djvu /path/to/out/dir output-index.djvu
Converting DjVu single pages to images:
ddjvu --format=tiff page.djvu page.tiff
Extract specific pages from a DjVu file and save them as an image:
ddjvu --format=tiff --page=1-10 input.djvu output.tiff
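As far as I understand, the command above writes all selected pages into one multi-page TIFF. If one file per page is wanted instead, a loop over the page numbers is one way to do it. This is a minimal sketch: the function names page_file and export_pages and the output naming scheme are my own assumptions, and only the ddjvu options already shown above are used.

```shell
# page_file PREFIX N: build a zero-padded output name such as "page-0003.tiff".
page_file() {
    printf '%s-%04d.tiff\n' "$1" "$2"
}

# export_pages INPUT FIRST LAST PREFIX: write each page FIRST..LAST of INPUT
# to its own TIFF file. Requires ddjvu (DjVuLibre); stops at the first error.
export_pages() {
    p=$2
    while [ "$p" -le "$3" ]; do
        ddjvu --format=tiff --page="$p" "$1" "$(page_file "$4" "$p")" || return 1
        p=$((p + 1))
    done
}

# Example call (hypothetical file names), left as a comment:
#   export_pages input.djvu 1 10 page
```

The zero padding keeps the resulting files in page order when they are listed or globbed later.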
Show the text that was recognized by OCR:
djvused -e print-txt 'alice.djvu' | head -n10
https://mail.gnome.org/archives/tracker-list/2010-August/msg00020.html