Extracting English text from MS Word files ...

Below is a simple TB program listing for extracting English language text from a Microsoft Word '.doc' file.
It reads the whole Word file into string variable z$, 'parses' every character in it, and creates a
printable string variable x$ which can be examined for English language text, using either Notepad
or, preferably, Microsoft Wordpad.

String variable x$ contains only the ASCII characters from ASCII 32 (space) to ASCII 127.

All other characters in z$, from ASCII 0 to ASCII 31 and from ASCII 128 to ASCII 255, are passed
over in creating x$.

Here is the TB program listing:


! Filename: PARSE_FILE_20100430.tru, tjm, 20100430

! Choose one of the following program lines (remove its leading "!") to fetch a MS Word doc file
! INPUT prompt "Enter an unambiguous filename to parse ? ": fname$
! LET fname$="your Microsoft Word '.doc' filename"

OPEN #1: name fname$, org byte, access input ! fetch file to parse
ASK #1: FILESIZE fs
READ #1, bytes fs: z$
CLOSE #1
LET x$=""
LET i=1
DO
LET a$=z$[i:i]
IF ord(a$)>31 and ord(a$)<128 then LET x$=x$ & a$ ! keep only ASCII 32 to 127
LET i=i+1
LOOP until i>fs ! stop only after the last byte has been examined
PRINT
PRINT x$
PRINT
PRINT " Done ... "
END
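
For readers without True BASIC, the same filter can be sketched in Python. This is only an illustration of the idea described above (keep ASCII 32 to 127, pass over everything else), not a translation of the listing. The function name printable_ascii and the sample bytes are made up for the example.

```python
def printable_ascii(data: bytes) -> str:
    """Return only the bytes of `data` in the range 32..127, as text."""
    # Iterating over a bytes object yields integers 0..255,
    # which plays the role of ord(a$) in the TB listing.
    return "".join(chr(b) for b in data if 32 <= b <= 127)


# Hypothetical sample: readable text mixed with control and high bytes.
sample = b"\x00\x01Hello\x00 W\x00orld\xff"
print(printable_ascii(sample))  # prints "Hello World"
```

To run it on a real file, one would read the file in binary mode (open(name, "rb")) and pass the bytes to printable_ascii.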

When RUNning PARSE_FILE_20100430.TRU, it is wise to create a 'text' file that contains only x$.
I recommend creating and SAVEing the text file "OUTPUT.DAT". To do this, one RUNs the program
from the TB Editor COMMAND line at the bottom of the TB Editor screen, using the command:


RUN >> OUTPUT.DAT

Then, to examine OUTPUT.DAT for English text, open OUTPUT.DAT, using Microsoft Wordpad, and
use the keyboard arrow keys to search for printable text. Any text found can be highlighted,
copied to the computer Clipboard and then pasted into another file.

One can also open OUTPUT.DAT using Microsoft Notepad, but my experiments revealed that Wordpad
made it easier to navigate to the English text. Regards ... Tom M

Comments

Text from Word files

Actually, Microsoft Office 2007 and 2010 Word files (.docx filename extension) do have text in them, but it isn't very useful. These files are actually Zip archives (you can unzip them after changing the extension to .zip) which contain several files in a directory. The document text is in an XML file, littered with much XML verbiage of no interest to humans. More usefully, all the included graphics are also there, in their original file formats, if you need to recover a JPEG or something.
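
As a rough sketch of what that looks like in practice, here is a short Python example (Python rather than TB, since it has Zip and XML support built in). It assumes the main document part is stored as word/document.xml inside the archive and that the visible text sits in w:t elements grouped into w:p paragraphs, which is the usual layout of a .docx file. The function name docx_text is made up for the example, and it ignores headers, footers and footnotes.

```python
import zipfile
import xml.etree.ElementTree as ET

# XML namespace used by WordprocessingML elements such as w:p and w:t.
W_NS = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_text(path_or_file) -> str:
    """Return the body text of a .docx file, one line per paragraph."""
    # A .docx file is a Zip archive; the body lives in word/document.xml.
    with zipfile.ZipFile(path_or_file) as archive:
        xml_bytes = archive.read("word/document.xml")

    root = ET.fromstring(xml_bytes)
    paragraphs = []
    for para in root.iter(W_NS + "p"):
        # Each paragraph is split into runs; join their text pieces.
        pieces = [node.text or "" for node in para.iter(W_NS + "t")]
        paragraphs.append("".join(pieces))
    return "\n".join(paragraphs)
```

Calling docx_text("letter.docx") (a hypothetical filename) returns the text as one string, which can then be written to a .txt file.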

But why reinvent the wheel (or, more appropriately for the fiendishly complicated Word formats, the helicopter)? Open the file in Word (or maybe Wordpad) and save it as .txt. This is going to be easier than a parsing program no matter how many files you have. Then LINE INPUT should work as expected.

Re: Text from Word files ...

The reason one might want to extract text from a MS Word file is that one doesn't own MS Word. The program listing I wrote does produce a text file, named OUTPUT.DAT.

As I also said, with the coming of MS Word versions 2007 and 2010, there isn't any English text that can be extracted by a TB program like mine. Regards ... Tom M

Re: Text from Word files ...

If you don't own Word, the version of Wordpad that came (free) with Windows XP will open .DOC files (not always formatted correctly) and save them as .TXT (according to Wikipedia). Windows 7 Wordpad will do the same for .DOCX files. Or, get the free Word Viewer download from Microsoft, open either of these formats, copy all the text, paste it into Notepad, and save that as .TXT. Admittedly more manual labor than an extraction program, but quicker unless you have thousands of files.

OpenOffice.org

I personally use Open Office - an open-source suite of applications that mimics the functions of MS Office without the price tag (it's free). OO.org has its own file types but also works with MS Office file formats, as well as some others which I haven't tried.

It may or may not do what is called for here by the OP, but I thought I'd make the suggestion.

www.openoffice.org

-Anne

Can't extract English text from a .docx file

To all ... Microsoft Office 2007 and Office 2010 Word files (.docx filename extension) don't have any English text in them that a byte-by-byte TB program like mine can extract. The good old days are gone! Regards ... Tom M