Dealing with PST files under Linux

I prefer to handle Outlook archives (PST files) by extracting the messages into a folder structure and saving each message as an eml file (the mail format Thunderbird uses). This can be achieved as follows:

readpst -o 'Archived Messages' -D -j 4 -r -tea -u -w -m ./some.pst

If the command cannot be found, you might need to install the package libpst first. The command creates msg and eml files with an increasing number as the filename.
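
As far as I can tell from the readpst man page, the directory given with -o has to exist before the command runs. The following is only a minimal sketch of a wrapper that takes care of this; the function name extract_pst and its two arguments are made up for illustration, and the readpst options are simply the ones from the command above.

```shell
# Hypothetical helper (not part of the original notes): extract a PST file
# into an output directory, creating that directory first.
extract_pst() {
    pst_file=$1
    out_dir=${2:-Archived Messages}

    # Fail early with a hint if libpst is not installed.
    if ! command -v readpst >/dev/null 2>&1; then
        echo "readpst not found; install the libpst package" >&2
        return 127
    fi
    if [ ! -f "$pst_file" ]; then
        echo "no such file: $pst_file" >&2
        return 1
    fi

    mkdir -p "$out_dir" || return 1
    # Same options as in the command above.
    readpst -o "$out_dir" -D -j 4 -r -tea -u -w -m "$pst_file"
}

# Usage: extract_pst ./some.pst 'Archived Messages'
```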

Then you can go into the individual folders and run the script eml_renamer.pl. It renames each eml file so that the filename contains the date and the subject. When I used it, it sometimes threw the following error:

Use of uninitialized value $input in concatenation (.) or string at /usr/share/perl5/vendor_perl/DateTime/Format/Builder.pm line 154.
Invalid date format:  at ./eml_renamer.pl line 17.

For that reason I updated the original script so that it shows the filename that causes the issue (these were in fact mails without date information) and removes those files. You can find the updated script here: eml_renamer.pl

If you have issues running the script, you might need to install the following packages: perl-File-Slurp, perl-File-Next and perl-DateTime-Format-Flexible
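
Since the problem files were mails without date information, they can also be spotted before the renamer runs. Below is a rough sketch of that idea; it is not taken from eml_renamer.pl, and the function names has_date_header and list_undated_eml are invented here. It only checks whether a Date: header appears in the header block, that is, before the first empty line.

```shell
# Hypothetical helper: succeed if the mail file has a Date: header
# in its header block (everything before the first empty line).
has_date_header() {
    awk '
        /^\r?$/ { exit }                           # empty line: headers end
        tolower($0) ~ /^date:/ { found = 1; exit }
        END { exit (found ? 0 : 1) }
    ' "$1"
}

# Hypothetical helper: print every .eml file in a folder that has no Date: header.
list_undated_eml() {
    dir=${1:-.}
    for f in "$dir"/*.eml; do
        [ -f "$f" ] || continue
        has_date_header "$f" || printf '%s\n' "$f"
    done
}

# Usage: list_undated_eml 'Archived Messages/Inbox'
```

The list can be reviewed by hand before anything is deleted, which seemed safer to me than removing files automatically.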

Source

Compare two Excel files

There is a hidden tool if you are running a Professional version of Office. You can find it here:

C:\Program Files (x86)\Microsoft Office\Office15\DCF\SPREADSHEETCOMPARE.EXE

Depending on the version of Office that you are running, the folder “Office15” can have a different number. The tool lets you select two files and view them side by side, with a nice graphical overview of the differences.

PDF OCR with Fedora 24 and Tesseract

Run the following commands:

sudo dnf install python3-pip python3-devel libffi-devel qpdf tesseract tesseract-langpack-deu tesseract-osd
sudo python3 -m pip install ocrmypdf 

Now you can convert a file like this:

ocrmypdf -l deu input.pdf output.pdf

If you don’t install the tesseract-osd package, the conversion still works, but the following error message appears:

Error opening data file /usr/share/tesseract/tessdata/osd.traineddata
   INFO -    8: [tesseract] Please make sure the TESSDATA_PREFIX environment variable is set to the parent directory of your "tessdata" directory.
   INFO -    8: [tesseract] Failed loading language 'osd'
   INFO -    8: [tesseract] Tesseract couldn't load any languages!
   INFO -    8: [tesseract] Warning: Auto orientation and script detection requested, but osd language failed to load
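
To see up front whether the language data is there, a small check like the following might help. It is only a sketch: the function name tessdata_has is made up, and the default directory is simply the one that shows up in the error message above, so it may be different on other systems.

```shell
# Hypothetical helper: succeed if the traineddata file for a language exists.
# Default directory: the path from the error message above (an assumption
# for other distributions or Tesseract versions).
tessdata_has() {
    lang=$1
    dir=${2:-/usr/share/tesseract/tessdata}
    [ -f "$dir/$lang.traineddata" ]
}

# Usage:
# for lang in deu osd; do
#     tessdata_has "$lang" || echo "missing: $lang.traineddata" >&2
# done
```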

Mount Amazon S3 on Fedora 24

There is no ready-made package to install, so you need to download and compile the code yourself. First you need to install some development libraries. Execute the following commands:

sudo dnf install fuse-devel libcurl-devel libxml2-devel
git clone https://github.com/s3fs-fuse/s3fs-fuse.git
cd s3fs-fuse
./autogen.sh
./configure
make
sudo make install

Then you need to create the directory where you want to mount your bucket:

sudo mkdir /mnt/mybucket

Now you need to prepare your credentials. Both the AwsAccessKeyId and the AwsSecretAccessKey are needed:

echo myAwsAccessKeyId:AwsSecretAccessKey > /home/myuser/.credentials_s3fs
chmod 600 /home/myuser/.credentials_s3fs
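
With echo followed by chmod, the file exists for a short moment with the default permissions. A possible alternative is to create it with restrictive permissions from the start. This is just a sketch; the function name write_s3fs_credentials is invented, and the key values in the usage line are the same placeholders as above.

```shell
# Hypothetical helper: write "keyid:secret" to a file that is only
# readable and writable by its owner.
write_s3fs_credentials() {
    key_id=$1
    secret=$2
    file=$3
    (
        umask 077    # files created in this subshell get mode 600
        printf '%s:%s\n' "$key_id" "$secret" > "$file"
    ) && chmod 600 "$file"    # also tightens a file that already existed
}

# Usage:
# write_s3fs_credentials myAwsAccessKeyId AwsSecretAccessKey /home/myuser/.credentials_s3fs
```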

Now you can mount your bucket:

s3fs mybucketname /mnt/mybucket -o passwd_file=/home/myuser/.credentials_s3fs -o umask=000

Unfortunately, if something goes wrong (for example, wrong credentials), it doesn’t show you an error message; the folder is just empty. In that case you can run the command in debug mode to see more clearly what is going on:

s3fs mybucketname /mnt/mybucket -o passwd_file=/home/myuser/.credentials_s3fs -d -d -f -o f2 -o curldbg
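
Because a failed mount just leaves an empty folder, a script can at least check whether anything is mounted at the mount point. The sketch below assumes a Linux system with /proc/mounts; the function name is_mounted is made up, and the optional second argument exists only so that the check can be tried against a sample file. It tells you whether a filesystem is attached there, not whether the credentials were accepted.

```shell
# Hypothetical helper: succeed if something is mounted at the given directory.
# Pass the mount point without a trailing slash, e.g. /mnt/mybucket.
is_mounted() {
    mount_point=$1
    mounts_file=${2:-/proc/mounts}
    # The second field of each line in /proc/mounts is the mount point.
    awk -v mp="$mount_point" '
        $2 == mp { found = 1 }
        END { exit (found ? 0 : 1) }
    ' "$mounts_file"
}

# Usage:
# is_mounted /mnt/mybucket || echo "nothing mounted at /mnt/mybucket" >&2
```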

Paperless Office with Djvu

Scanning

In order to scan everything that comes in, you usually need two scanners: a flatbed scanner and an ADF (automatic document feeder) scanner. I use … which is my printer at the same time and … which is only a scanner. Most of the documents I can scan with … which is quite fast, as I can load multiple pages (maybe 10) at once. The flatbed scanner is used for documents that are either too thick (books) or too large (posters) for the other scanner.

Sometimes I scan directly to PDF, but sometimes I split the pages up into JPEG files. This has the advantage that I can remove pages that are not needed, and if a page was not scanned well, I only need to rescan that page and can easily replace it. I usually scan at 300 dpi, which gives me good quality.

To convert multiple JPEG files to a PDF document, the following command can be used:

convert *.jpg "Filename.pdf"
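
One thing to watch out for: the shell expands *.jpg in alphabetical order, so page10.jpg comes before page2.jpg unless the numbers are zero-padded. The following sketch passes the files in numeric order instead. It assumes GNU sort (for the -V option) and filenames without newlines; the function names jpgs_in_page_order and jpgs_to_pdf are invented for this example.

```shell
# Hypothetical helper: print the .jpg files of a folder in natural
# (version) order, so that page2.jpg comes before page10.jpg.
jpgs_in_page_order() {
    dir=${1:-.}
    for f in "$dir"/*.jpg; do
        if [ -f "$f" ]; then
            printf '%s\n' "$f"
        fi
    done | sort -V
}

# Hypothetical helper: combine the pages into one PDF with ImageMagick.
jpgs_to_pdf() {
    out=$1
    dir=${2:-.}
    set --    # reuse the positional parameters as the list of pages
    while IFS= read -r f; do
        if [ -n "$f" ]; then
            set -- "$@" "$f"
        fi
    done <<EOF
$(jpgs_in_page_order "$dir")
EOF
    if [ "$#" -eq 0 ]; then
        echo "no .jpg files in $dir" >&2
        return 1
    fi
    convert "$@" "$out"
}

# Usage: jpgs_to_pdf "Filename.pdf" .
```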

OCR

In order to recognize the text of the files and to make it searchable, I convert the files to the djvu format. This is easier to handle with open source tools than PDF. The conversion can be done like this:

for i in *.pdf; do pdf2djvu -o "${i%%.pdf}.djvu" "$i"; done 

Now we can run OCR on all those djvu files:

for i in *.djvu; do ocrodjvu --in-place "$i" --render all --language deu --clear-text; done 

Remark: as most of my documents are in German, I specified that language. Otherwise the umlauts are not recognized. The command also doesn’t work if the filename contains any non-ASCII characters.
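
To avoid running into the filename problem in the middle of a batch, the loop can skip such files and report them. This is only a sketch: the function names is_ascii_name and ocr_djvu_dir are made up, and I assume here that only the filename itself (not the folder path) has to be plain ASCII. The ocrodjvu call is the same as in the loop above.

```shell
# Hypothetical helper: succeed if the name consists only of printable
# ASCII characters (space to tilde).
is_ascii_name() {
    # Delete all printable ASCII bytes; anything left over is non-ASCII.
    rest=$(printf '%s' "$1" | LC_ALL=C tr -d '\040-\176')
    [ -z "$rest" ]
}

# Hypothetical helper: OCR every .djvu file in a folder, skipping files
# whose name contains non-ASCII characters.
ocr_djvu_dir() {
    dir=${1:-.}
    for f in "$dir"/*.djvu; do
        [ -f "$f" ] || continue
        if is_ascii_name "$(basename "$f")"; then
            ocrodjvu --in-place "$f" --render all --language deu --clear-text
        else
            echo "skipped (non-ASCII filename): $f" >&2
        fi
    done
}

# Usage: ocr_djvu_dir .
```

The skipped files can then be renamed and processed in a second run.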

Other useful commands

OCR for a single djvu file:

ocrodjvu --in-place 'alice.djvu'

Converting images directly to DJVU

img2djvu -c1 -d600 -v1 ./out

Converting DJVU back to PDF

ddjvu --format=pdf inputfile.djvu outputfile.pdf

Splitting DJVU files into single pages

djvmcvt -i input.djvu /path/to/out/dir output-index.djvu

Converting DJVU single pages to images

ddjvu --format=tiff page.djvu page.tiff

Extract specific pages from a djvu file and save them as an image:

ddjvu --format=tiff --page=1-10 input.djvu output.tiff

Show the text that was recognized by OCR:

djvused -e print-txt 'alice.djvu' | head -n10
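
Building on the print-txt command, a whole folder of documents can be searched for a word. The following is a rough sketch; the function name search_djvu is invented. As far as I know, djvused may print non-ASCII characters in escaped form, so this is most reliable with plain ASCII search terms.

```shell
# Hypothetical helper: print the .djvu files in a folder whose recognized
# text contains the given term (case-insensitive, fixed string).
search_djvu() {
    term=$1
    dir=${2:-.}
    for f in "$dir"/*.djvu; do
        [ -f "$f" ] || continue
        if djvused -e print-txt "$f" | grep -qiF -- "$term"; then
            printf '%s\n' "$f"
        fi
    done
}

# Usage: search_djvu invoice /path/to/documents
```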

https://mail.gnome.org/archives/tracker-list/2010-August/msg00020.html