Get a large corpus of text data

I needed a large corpus of text data in order to test some full-text-search functionality. The data also had to be spread across many small files rather than sitting in one big file. This led me to Project Gutenberg and to Wikipedia.

Here is how you can get lots of data (we exclude the zip files from Gutenberg because they just contain the other files that are downloaded anyway):

mkdir Gutenberg-orig/
cd Gutenberg-orig/
# -r recursive, -l copy symlinks, -H preserve hard links, -t preserve times, -S handle sparse files, -v verbose
rsync -rlHtSv --delete --exclude '*.zip' ftp@ftp.ibiblio.org::gutenberg ./
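
The full Gutenberg mirror is large, so it can be worth seeing what rsync would transfer before committing to the download. A minimal sketch using rsync's standard --dry-run and --stats flags, with the same module as above:

# Preview the transfer without downloading anything and print a summary.
rsync -rlHtSv --delete --exclude '*.zip' --dry-run --stats ftp@ftp.ibiblio.org::gutenberg ./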

# --no-parent keeps wget from ascending above the given directory while mirroring
wget --recursive --no-parent http://dumps.wikimedia.org/other/static_html_dumps/current/
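
Once both downloads have finished, you can sanity-check that you really ended up with many small files rather than a few big ones. A quick sketch using find and wc; the Gutenberg directory name matches the commands above, and the Wikipedia directory name assumes wget's default behavior of mirroring into a directory named after the host:

# Count the plain-text files fetched from Gutenberg.
find Gutenberg-orig/ -name '*.txt' | wc -l

# Count the HTML pages in the Wikipedia static dump.
find dumps.wikimedia.org/ -name '*.html' | wc -l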
