Kybernetes blog

The place with the world's knowledge about cybernetics, artificial intelligence and automation.

Command-line conversion
from .doc to .txt

10.11.2016

Converting .doc documents to .txt from the command line is possible with:

1. http://www.libreoffice.org (http://www.openoffice.org/),
2. http://www.calligra.org,
3. http://www.abiword.org,
4. http://wvware.sourceforge.net.

The LibreOffice command-line conversion syntax is:

oowriter --convert-to txt file.doc

for Calligra it is:

calligraconverter file.doc file.txt

for AbiWord:

abiword -t txt -o file.txt file.doc

and the wvWare syntax is:

wvText file.doc > file.txt
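
To convert a whole directory, any of the commands above can be wrapped in a loop.  The following is only a minimal sketch that uses the AbiWord syntax shown above; the function names txt_name and convert_dir are made up for this example and are not part of any of the tools.

```shell
#!/bin/sh
# txt_name and convert_dir are illustrative names, not part of any tool.

# Map an input name to its output name: dir/report.doc -> dir/report.txt
txt_name() {
    printf '%s\n' "${1%.*}.txt"
}

# Convert every .doc file in directory $1 with the AbiWord syntax from the article.
convert_dir() {
    for doc in "$1"/*.doc; do
        [ -e "$doc" ] || continue    # the glob matched nothing
        out=$(txt_name "$doc")
        # Report a failed conversion on stderr and carry on with the next file.
        abiword -t txt -o "$out" "$doc" || echo "failed: $doc" >&2
    done
}
```

It would be called as: convert_dir ./documents .  The quoting keeps file names with spaces intact.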

To get the result in UTF-8 encoding, an appropriate locale environment has to be set.  The setting is:

export LC_ALL=en_US.UTF-8
export LANG=en_US.UTF-8
export LANGUAGE=en_US.UTF-8

Alternatively, you can use a national-language locale such as sk_SK, but the English locale with UTF-8 works fine too.  The exception seems to be AbiWord, which produces UTF-8 output by default, even with LC_ALL=C set.  I wanted a command-line switch to set the output encoding for each file, but I could not find one for any of these applications.
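
The locale can also be set for a single command instead of being exported for the whole shell, and the result can be checked afterwards.  The sketch below assumes that iconv is installed; is_utf8 is a helper name made up for this example.  Note that a pure ASCII file also passes the check, because ASCII is a subset of UTF-8.

```shell
#!/bin/sh
# is_utf8 is an illustrative helper name.
# iconv exits with a non-zero status when the input contains a byte
# sequence that is not valid UTF-8, so the exit status is the answer.
is_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" > /dev/null 2>&1
}

# Setting the locale for one conversion only, then checking the output
# (left as comments so the block can be sourced without a converter installed):
#   LC_ALL=en_US.UTF-8 calligraconverter file.doc file.txt
#   is_utf8 file.txt && echo "file.txt is valid UTF-8"
```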

Subjectively, I prefer the text output of AbiWord or Calligra over that of LibreOffice.  wvWare is also good, but it cannot convert .docx documents.  In my test document, LibreOffice separated paragraphs in the text output by just a single newline, which is not a robust solution.

The text output of AbiWord and Calligra was very similar, but I preferred the .html output of AbiWord.  The .html output is necessary when you want to preserve or extract italic or bold text, which is not possible in the plain .txt format.
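
For the .html output, the AbiWord command shown earlier only needs a different target format.  A small sketch follows; doc_to_html is a name made up for this example, and it writes the .html file next to the input file.

```shell
#!/bin/sh
# doc_to_html is an illustrative name.
# Convert report.doc to report.html using AbiWord's -t (target format)
# and -o (output file) switches.
doc_to_html() {
    abiword -t html -o "${1%.*}.html" "$1"
}
```

It would be called as: doc_to_html report.doc .  How AbiWord marks up bold and italic text in the .html file is not covered here, so it is worth inspecting the output before writing any extraction rules.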

The oowriter, abiword and calligraconverter programs link to all the libraries needed for their graphical interfaces: Qt for Calligra, Java UI support for LibreOffice and GTK for AbiWord.  On the Linux distribution I used for the test, this meant about 80 libraries for Calligra, 110 for AbiWord and 120 for LibreOffice.  The wvWare converter linked to only about 30 libraries, so it seems the best candidate for installation in a server environment.
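
The counting method is not described here; one way to get comparable numbers on your own system is ldd.  The sketch below assumes a Linux system with ldd available, and count_libs is a name made up for this example.  It counts only the lines where ldd resolves a library to a path (the lines containing "=>").  If the command on the PATH is a wrapper script rather than a binary, which may be the case for oowriter, ldd has to be pointed at the real executable instead.

```shell
#!/bin/sh
# count_libs is an illustrative name.
# Print the number of shared libraries that ldd resolves for a command.
count_libs() {
    bin=$(command -v "$1") || return 1    # locate the command on PATH
    ldd "$bin" | grep -c '=>'
}
```

It would be called as: count_libs abiword .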

The development version of the wvWare suite can be downloaded from the AbiWord website: http://www.abiword.org/downloads/wv

Ing. Rudolf Jakša PhD.
Kybernetes s.r.o.

Ján Liguš, Editor
Ján Sarnovský, Editor
Rudolf Jakša, Editor

contact: blog@kybernetes.sk
web: blog.kybernetes.sk

Copyright © 2016 Kybernetes s.r.o.