Searching Text Files

I have a couple of large documents to compare (two versions of Human Engineering standards) and I've exported them as text files.

I'd like to be able to read the first file line by line and compare it with the second file line by line - looking for similarities and place the matching compared lines in a third file based on their similarities.

So, I'm looking for some way to look for similarities between two string values which would be unique enough to keep me from getting all the lines with "the", "in", "at", etc.

I'm thinking I might have to break up each line individually based on spacing, ignoring all words with less than three or four letters and comparing the remaining words one by one. The routine should not be too difficult but might end up taking hours to run.
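The word-by-word idea above could be sketched roughly as follows. This is only an illustration, written in Python rather than True BASIC; the names `significant_words`, `shared_words`, `matching_lines` and the cut-offs `MIN_LEN` and `min_shared` are assumptions made for the example, not anything from an existing program.

```python
import re

# Assumption: a "significant" word has at least MIN_LEN letters; tune to taste.
MIN_LEN = 4

def significant_words(line, min_len=MIN_LEN):
    """Return the set of lowercased words in `line` with at least min_len letters."""
    words = re.findall(r"[A-Za-z]+", line.lower())
    return {w for w in words if len(w) >= min_len}

def shared_words(line_a, line_b, min_len=MIN_LEN):
    """Return the significant words that appear in both lines."""
    return significant_words(line_a, min_len) & significant_words(line_b, min_len)

def matching_lines(lines_a, lines_b, min_shared=2):
    """Yield (line_a, line_b, shared) for pairs sharing at least min_shared words."""
    # Split every line of the second file once, up front, instead of
    # re-splitting it for each line of the first file.
    sets_b = [(b, significant_words(b)) for b in lines_b]
    for a in lines_a:
        words_a = significant_words(a)
        for b, words_b in sets_b:
            shared = words_a & words_b
            if len(shared) >= min_shared:
                yield a, b, sorted(shared)
```

Because each line is reduced to a set of words only once, the comparison itself is a cheap set intersection, so even a few thousand lines per file should finish far sooner than hours. The matched pairs could then be written to the third file.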

Before I go down this road, anybody have any ideas?

Thanks, Ron

Comments

Soundex Function to Find Similar Words Correction!

Here's a Soundex function I wrote. It returns a code string that is the same for similar-sounding words. Try it with "kimberly", then "kimberlee", then "kimberley", for example.


DECLARE FUNCTION soundex$

! Test loop: print the code for each string entered; a blank entry quits.
DO
   LINE INPUT PROMPT "Enter string: ": m$
   IF m$ = "" THEN EXIT DO
   PRINT soundex$(m$)
LOOP
END

! Returns a four-character code: the first letter plus up to three digits.
EXTERNAL FUNCTION soundex$(a$)
   LET a$ = UCASE$(a$)
   LET code$ = a$(1:1)
   IF code$ >= "A" AND code$ <= "Z" THEN
      FOR i = 2 TO LEN(a$)
         LET temp$ = a$(i:i)
         ! Skip a letter that repeats the previous one, and the letters Soundex ignores.
         IF temp$ <> a$(i-1:i-1) AND POS("AEHIOUWY",temp$) = 0 THEN
            SELECT CASE temp$
            CASE "B","F","P","V"
               LET code$ = code$ & "1"
            CASE "C","G","J","K","Q","S","X","Z"
               LET code$ = code$ & "2"
            CASE "D","T"
               LET code$ = code$ & "3"
            CASE "L"
               LET code$ = code$ & "4"
            CASE "M","N"
               LET code$ = code$ & "5"
            CASE "R"
               LET code$ = code$ & "6"
            CASE ELSE
               ! Anything that is not a letter ends the word.
               EXIT FOR
            END SELECT
         END IF
      NEXT i

      ! Pad with zeros, then trim to four characters.
      LET code$ = code$ & "000"
      LET code$ = code$(1:4)
      LET soundex$ = code$
   ELSE
      LET soundex$ = "0000"
   END IF

END FUNCTION

SOUNDEX

Tom,

Interesting routine - I'll try it out on line inputs when I can - for now the scope of the project has diminished enough that I can do it manually. Thanks!
Ron Irvine
Charleston, SC

Searching Text Files

Ron ... Back in my CP/M days, and "Mallard" BASIC programs I had to copy from the PCW Magazine, there was a "checksum" program that opened one's copied program listing and then printed a four-HEX-digit number, appended at the end of each program line as a "program line comment". If your typed program listing's "program line comments" didn't agree with the magazine's "program line comments", you would edit your typed program lines until they all did.

There's a little mathematics involved in calculating the HEX number for a program - or text - line so that the chance of calculating the same number for two different text lines is very small (a four-hex-digit number has 65,536 possible values).

I'm fairly certain I can locate the PCW checksum BASIC program, and I think I remember the mathematics for computing the unique line numbers. And, of course, the line numbers can be decimal integers. Comparing the document line numbers in multiple documents is, as they say, "an exercise for the student". Regards ... Tom M
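As a rough illustration of the per-line checksum idea, here is a minimal sketch in Python (not True BASIC). It is not the PCW magazine algorithm, which isn't reproduced in this thread; a standard CRC-16 from Python's `binascii` module stands in for it, and the function names `line_checksum` and `common_lines` are made up for the example.

```python
import binascii

def line_checksum(line):
    """Return a four-hex-digit CRC-16 of the line, ignoring surrounding whitespace."""
    # Assumption: CRC-16 (binascii.crc_hqx) stands in for the magazine's checksum.
    data = line.strip().encode("utf-8")
    return format(binascii.crc_hqx(data, 0), "04X")

def common_lines(lines_a, lines_b):
    """Return the non-blank lines of lines_a that also occur in lines_b."""
    # Index the second document by checksum, keeping the text so that a
    # rare checksum collision cannot produce a false match.
    index = {}
    for b in lines_b:
        index.setdefault(line_checksum(b), set()).add(b.strip())
    return [a for a in lines_a
            if a.strip() and a.strip() in index.get(line_checksum(a), ())]
```

A checksum like this only flags lines that are identical apart from leading and trailing spaces; a single changed word gives a completely different number, so it finds exact matches rather than similar ones.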

Ultra compare

You might want to try UltraCompare (http://www.ultraedit.com/products/ultracompare.html) which already does much of what you are looking to do.

Ernest Gundel
Synovate, Inc
ernest.gundel@synovate.com

Ultra compare

Ernest,

Interesting piece of software but it didn't do what I was looking for - these files are totally disparate. The only commonality is the general subject.

Thanks anyway.

Ron Irvine
Charleston, SC

Searching text

Ron ... If you haven't done so already, Google 'search text'. The first "hit" is "Full Text Search" (Google that too). That article should give you some help. Regards ... Tom M

Comparing text strings

Hi Ron,

I don't think your proposed text comparison program will take hours to run. I have a text search program that searches entire folders and sub folders (including exe files) looking for phrases or words, and this is very quick. The reason I wrote this is that the Windows "search" option under the START button doesn't work properly.

Obviously the logic in your program is going to be more complex than mine, but yours only compares two files whereas my program reads hundreds. In my case I read the entire file as a string and I look for ANY instances of the target word or phrase. In your case you are looking for the target word in a specific place. I think the big problem will be in deciding what to do next, e.g. if you don't find the target word, do you continue searching, or do you take the next word in the original document and search for that, making a note of all the missing stuff in between?

Your idea of making the comparison line by line is probably a good one because the datum point, i.e. the position of the search word in the original file and its position in the target file, will be reset at the end of each line.

Good luck,

Big John

Comparing text strings

John,

I did something like that last year when I was searching problem reports for failure modes (folks who designed the report didn't think of that and I had hundreds of them).

In that instance, I put them all in one directory, did a DOS listing (dir > *.txt) and modified that to open the files one after another, searching for particular phrases. It worked very well but, as in your case, I knew what I was looking for.

Years ago, when SATCOM first came into NAVY use, there was a software adjustment in the computer called a "QRK" factor. This allowed the system to receive message traffic which had slightly garbled addressees. So, a message intended for USS NEVERSAIL could still get a message addressed to USS EVERSAIL (or similar) if the "QRK" was set to something like .9.

Anyway, I think what I'm looking for is some sort of comparator which will give me a numerical value based on the string contents. If I can find that, I'll be able to set a threshold for accepting the comparison.
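One possible comparator of this kind, sketched in Python rather than True BASIC, is the `SequenceMatcher` class in Python's standard `difflib` module. Its `ratio()` method returns a number between 0 and 1 based on how many characters the two strings have in common, in order. The helper names `similarity` and `accept` and the 0.9 default are assumptions for the example; this is not a claim about how the SATCOM "QRK" factor actually worked.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Return a score from 0.0 (nothing in common) to 1.0 (identical), ignoring case."""
    return SequenceMatcher(None, a.upper(), b.upper()).ratio()

def accept(a, b, threshold=0.9):
    """Return True when the two strings are similar enough to count as a match."""
    return similarity(a, b) >= threshold

# 12 matching characters out of 13 + 12 in total: 2 * 12 / 25
print(round(similarity("USS NEVERSAIL", "USS EVERSAIL"), 2))  # 0.96
```

With the threshold at 0.9 the garbled address above would be accepted, while an unrelated name would score far lower and be rejected.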

For now, I'm just reading one and running a search on the later version (and they still don't always use the same terms).

Think I better stop before I start venting my feelings about ISO, Lean Six Sigma, DOORS and Human Engineering Design (we're being smothered in this stuff).

:-)

Ron

Ron Irvine
Charleston, SC

Text comparisons

Hi Ron,

This is fascinating stuff. We humans can do the job relatively easily, even if we are a bit slow and we get bored very quickly. However, duplicating what we do and how we think into a program is extremely difficult.

I like the idea of using a comparator. Probably differences in punctuation and spacing should be discounted. Maybe a simple comparison based on the number of identical characters versus the total number of characters in a line would be sufficient.
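A minimal sketch of that simple comparison, in Python rather than True BASIC, might look like this. The names `normalize` and `char_overlap` are invented for the example, and counting shared characters regardless of their position is one assumed reading of "number of identical characters".

```python
from collections import Counter

def normalize(line):
    """Lowercase the line and keep only letters and digits."""
    return "".join(ch for ch in line.lower() if ch.isalnum())

def char_overlap(a, b):
    """Return shared characters (counted with repeats) divided by the longer line's length."""
    na, nb = normalize(a), normalize(b)
    if not na or not nb:
        return 0.0
    # Counter & Counter keeps, for each character, the smaller of the two counts.
    shared = sum((Counter(na) & Counter(nb)).values())
    return shared / max(len(na), len(nb))
```

Since character order is ignored, two lines built from the same letters in a different order also score 1.0, so this works best as a coarse first screen before a human looks at the candidates.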

I would be tempted to use a computer/human combination, i.e. use the computer to screen all the text and print out major differences, and let a human examine the differences.

Regards
Big John