Compare two Excel files

If you are running a Professional edition of Microsoft Office, there is a hidden tool for this. You can find it here:

Depending on the version of Office that you are running, the folder “Office15” may have a different number. The tool lets you select two files and view them side by side, with a graphical overview of the differences.

PDF OCR with Fedora 24 and Tesseract

Run the following commands:

Now you can convert a file like this:

If you don’t install the tesseract-osd package, the conversion still works, but the following error message appears:

Mount Amazon S3 on Fedora 24

There is no ready-made package to install, so you need to download and compile the code yourself. First, install some development libraries by executing the following commands:

Then you need to create the directory where you want to mount your bucket:

Now you need to prepare your credentials. Both the AwsAccessKeyId and the AwsSecretAccessKey are needed:
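
A minimal sketch of this step, assuming the tool being compiled is s3fs-fuse (the text does not name it), which reads its credentials from a file in the form ACCESS_KEY_ID:SECRET_ACCESS_KEY. The function name and the example path are only illustrative.

```shell
# Write an s3fs-style credentials file (assumption: the tool is s3fs-fuse).
# Usage: write_s3_credentials ACCESS_KEY_ID SECRET_ACCESS_KEY FILE
write_s3_credentials() {
    key_id=$1
    secret=$2
    file=$3
    # Create the file with owner-only permissions before writing the secret.
    ( umask 077 && printf '%s:%s\n' "$key_id" "$secret" > "$file" )
    # Enforce mode 600 even if the file already existed with looser permissions.
    chmod 600 "$file"
}

# Example (placeholders, not real keys):
# write_s3_credentials "MY_ACCESS_KEY_ID" "MY_SECRET_ACCESS_KEY" "$HOME/.passwd-s3fs"
```

The restrictive permissions matter because s3fs refuses a credentials file that other users can read.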

Now you can mount your bucket:
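
Assuming s3fs-fuse, the mount call could look like the sketch below. The bucket name, mount point and credentials path in the example are placeholders.

```shell
# Mount a bucket with s3fs (assumption: s3fs-fuse is the tool that was compiled).
# Usage: mount_bucket BUCKET MOUNTPOINT PASSWD_FILE
mount_bucket() {
    s3fs "$1" "$2" -o passwd_file="$3"
}

# Example:
# mount_bucket mybucket "$HOME/s3" "$HOME/.passwd-s3fs"
```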

Unfortunately, if something goes wrong (for example, wrong credentials), no error message is shown. The folder is just empty. In such a case you can run the command in debug mode to see more clearly what is going on:
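
A sketch of such a debug call, again assuming s3fs-fuse. The options are the ones its current README suggests for troubleshooting; older releases may expect different flags.

```shell
# Mount in the foreground with verbose logging (assumption: s3fs-fuse).
# -f keeps s3fs in the foreground, dbglevel=info raises the log level,
# and curldbg additionally prints the libcurl traffic.
# Usage: mount_bucket_debug BUCKET MOUNTPOINT PASSWD_FILE
mount_bucket_debug() {
    s3fs "$1" "$2" -o passwd_file="$3" -o dbglevel=info -f -o curldbg
}

# Example (stop it again with Ctrl+C):
# mount_bucket_debug mybucket "$HOME/s3" "$HOME/.passwd-s3fs"
```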

Paperless Office with DjVu

Scanning

In order to scan everything that comes in, you usually need two scanners: a flatbed scanner and an ADF (automatic document feeder) scanner. I use … which is also my printer, and … which is only a scanner. I can scan most documents with … which is quite fast because I can load multiple pages (maybe 10) at once. The flatbed scanner is used for documents that are either too thick (books) or too large (posters) for the other scanner.

Sometimes I scan directly to PDF, but sometimes I scan each page to a separate JPEG file. This has the advantage that I can remove pages that are not needed, and if one page is not scanned well, I only need to rescan that page and can easily replace it. I usually scan at 300 dpi, which gives me good quality.

To convert multiple JPEG files to a PDF document, the following command can be used:
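
One way to do this, as a sketch that assumes ImageMagick and its convert command are installed (the function name and the file names are made up for the example):

```shell
# Combine scanned JPEG pages into one PDF (assumption: ImageMagick is installed).
# Usage: jpegs_to_pdf OUTPUT.pdf PAGE1.jpg [PAGE2.jpg ...]
jpegs_to_pdf() {
    out=$1
    shift
    # Pages appear in the PDF in the order in which they are given.
    convert "$@" "$out"
}

# Example:
# jpegs_to_pdf letter.pdf scan-01.jpg scan-02.jpg
```

Because the page order follows the argument order, numbering the scans with leading zeros keeps a glob such as scan-*.jpg sorted correctly.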

OCR

In order to recognize the text in the files and make it searchable, I convert the files to the DjVu format. This is easier to handle with open source tools than PDF. The conversion can be done like this:
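
A sketch of the conversion, assuming pdf2djvu is the converter (the text does not say which tool was used):

```shell
# Convert PDF files to DjVu (assumption: pdf2djvu is the converter).
# Each output file is written next to its input with a .djvu extension.
# Usage: pdfs_to_djvu FILE.pdf [FILE.pdf ...]
pdfs_to_djvu() {
    for pdf in "$@"; do
        pdf2djvu -o "${pdf%.pdf}.djvu" "$pdf"
    done
}

# Example:
# pdfs_to_djvu *.pdf
```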

Now we can run OCR on all those DjVu files:
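
A sketch of this step, assuming ocrodjvu with its Tesseract engine. The language code deu in the example is Tesseract's code for German.

```shell
# Add an OCR text layer to DjVu files in place (assumption: ocrodjvu is used).
# Usage: ocr_djvu LANGUAGE FILE.djvu [FILE.djvu ...]
ocr_djvu() {
    lang=$1
    shift
    for f in "$@"; do
        ocrodjvu --in-place -l "$lang" "$f"
    done
}

# Example for German documents:
# ocr_djvu deu *.djvu
```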

Remark: as most of my documents are in German, I specified that language. Otherwise the umlauts are not recognized. The command also fails if the filename contains any non-ASCII characters.

Other useful commands

OCR for a single PDF file:

Converting images directly to DjVu
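
A possible command for this, assuming c44 from DjVuLibre, which reads JPEG and PNM input and suits color or grayscale scans:

```shell
# Encode one scanned image as a single-page DjVu file
# (assumption: c44 from DjVuLibre is installed).
# Usage: image_to_djvu IMAGE [DPI]
image_to_djvu() {
    img=$1
    dpi=${2:-300}   # default matches the 300 dpi used for scanning
    c44 -dpi "$dpi" "$img" "${img%.*}.djvu"
}

# Example:
# image_to_djvu scan-01.jpg
```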

Converting DjVu back to PDF
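
As a sketch, assuming ddjvu from DjVuLibre (the same tool that appears further down for TIFF export):

```shell
# Render a DjVu document to PDF (assumption: ddjvu from DjVuLibre).
# Usage: djvu_to_pdf INPUT.djvu OUTPUT.pdf
djvu_to_pdf() {
    ddjvu -format=pdf "$1" "$2"
}

# Example:
# djvu_to_pdf document.djvu document.pdf
```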

Splitting DjVu files into single pages
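
One way to split a document, assuming djvmcvt from DjVuLibre, which can turn a bundled file into an "indirect" document with one file per page. The index file name index.djvu is an arbitrary choice for this sketch.

```shell
# Split a bundled DjVu document into one file per page
# (assumption: djvmcvt from DjVuLibre, indirect mode).
# Usage: split_djvu INPUT.djvu OUTPUT_DIR
split_djvu() {
    mkdir -p "$2"
    # index.djvu is the index file that ties the page files together.
    djvmcvt -i "$1" "$2" index.djvu
}

# Example:
# split_djvu document.djvu pages
```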

Converting single DjVu pages to images

ddjvu -format=tiff page.djvu page.tiff

Extract specific pages from a DjVu file and save them as images:
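
A sketch using the -page option of ddjvu (assuming DjVuLibre's ddjvu). The page specification can be a single page such as 3 or a range such as 2-5.

```shell
# Export selected pages as a TIFF file (assumption: ddjvu from DjVuLibre).
# PAGES is a ddjvu page specification, for example "3" or "2-5".
# Usage: extract_pages INPUT.djvu PAGES OUTPUT.tiff
extract_pages() {
    ddjvu -format=tiff -page="$2" "$1" "$3"
}

# Example:
# extract_pages document.djvu 2-5 pages.tiff
```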

Show the text that was recognized by OCR:
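
For example, assuming djvutxt from DjVuLibre, which prints the hidden text layer to standard output:

```shell
# Print the OCR text layer of a DjVu file (assumption: djvutxt from DjVuLibre).
# Usage: show_ocr_text FILE.djvu
show_ocr_text() {
    djvutxt "$1"
}

# Example:
# show_ocr_text document.djvu | less
```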

https://mail.gnome.org/archives/tracker-list/2010-August/msg00020.html