Missing values

Hi;

I'm reading comma delimited files which frequently have missing values. This causes the program to abort so I'm wondering if TB has a way of handling this.

For example, in FORTRAN you could use a "bz" format flag to tell it to replace blanks with zeros.

Naturally I could do a LINE INPUT to a string and parse for ", ," cases but with variable length numbers this could get awkward.

Roger

Comments

Parsing CSV files

I created a library to parse a CSV file. It takes each line, parses it on commas, and returns the values in an array. It gets complicated because some values in the CSV are strings with commas inside quotation marks, so you have to be careful not to break up a string value. It also handles empty values, as long as the commas are there.

Enjoy
! ParseLib.tru
!
! CALL StringtoArray(a$,array$())
! CALL ParseString(a$,b$)
!
! Call StrArrayCount(a$,n)
! Input a$ is a text string from a CSV file
! Output n is the number of values stored in a$
!
! Call StringtoArray(a$,array$())
! Input a$ is a text string from a CSV file
! Output array$() stores each value from the CSV input.
! array$ is redimensioned to equal the number of values in a$
!
! Call ParseString(a$,b$)
! Input a$ is a text string from a CSV file
! Output b$ is the first value (up to the first "," or to the end of the line)
! b$ is null if a$ is null
! a$ is modified in this routine by stripping b$ from the beginning
!

EXTERNAL

MODULE Parse

SHARE dummy$,delim$

LET delim$=","

SUB StrArrayCount(a$,n)
dim d$(0)
CALL StringtoArray(a$,d$)
let n=Size(d$)
END SUB

SUB StringtoArray(a$,array$())
LET dummy$=a$
LET narray=0
DO while LEN(dummy$)>0
LET narray = narray+1
MAT redim array$(narray)
CALL ParseString(dummy$,array$[narray])
LOOP
END SUB

SUB ParseString(a$,b$)
LET p=POS(a$,delim$)

IF p=0 then
LET b$=a$
LET a$=""
EXIT SUB
END IF

LET a$=Trim$(a$) ! strip the blanks
LET t=POS(a$,"""") !check to see if there are any quotes
LET p=POS(a$,delim$)

IF t>p or t=0 then ! retrieve data
LET b$=a$[1:p-1]
LET a$[1:p]=""
EXIT SUB
END IF

! if quote comes before comma then either comma is encased by quotes
! or the comma is after the encased value. In either case, the first
! character must be a quote. If not, then cause an error

IF a$[1:1]<>"""" then
LET e$ = "First character is (" & a$[1:1] & "). It must be ("")."
CAUSE ERROR 201,e$
END IF

LET z = LEN(a$)

! First, find the end of the string encased by quotes.

! check to see if something is encased between quotes
FOR i=2 to z-1
IF a$[i:i]="""" then ! maybe this is the end quote
! check to see if it is followed by another quote
IF a$[i+1:i+1]="""" then ! skip to the next potential pair
LET i=i+1
ELSE
LET p = i !it must be the end of the quote
EXIT FOR
!as a double check, the next character should be a comma
END IF
END IF
NEXT i
LET b$=a$[2:p-1]
LET a$[1:p+1]=""

! find double quotes and make them single quotes
FOR i=LEN(b$) to 2 step -1
IF b$[i:i]="""" then ! maybe this is the end quote
! check to see if it is followed by another quote
IF b$[i-1:i-1]="""" then ! skip to the next potential pair
LET b$[i:i]=""
LET i=i-1
END IF
END IF
NEXT i

END SUB

END MODULE
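[Editor's note: for comparison, the quote-aware splitting that ParseString implements (commas inside quoted fields preserved, doubled quotes collapsed to one) can be sketched in Python. Python's standard csv module already implements exactly these rules, so this illustrates the intended behaviour rather than translating the TB code.]

```python
import csv

def string_to_array(line):
    """Split one CSV line into fields, honoring commas inside
    quoted fields and collapsing doubled quotes, in the spirit
    of the StringtoArray/ParseString pair above."""
    # csv.reader accepts any iterable of lines, so a one-element
    # list gives us a parser for a single line
    return next(csv.reader([line]))

fields = string_to_array('a,"b,c",,"say ""hi""",d')
# fields: ['a', 'b,c', '', 'say "hi"', 'd']
```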

Screen Dumps ... A simple program for reading byte files

Roger ... I've completed a simple 50-line TB program that reads and prints the random numbers I created in a 5-row, 4-column Excel array file. I deleted the 2nd number in the 2nd row, the 2nd and 3rd numbers in the 3rd row, and the 2nd, 3rd & 4th numbers in the 4th row, and then used Save As... to store the Excel file as a .csv file, a MAC.csv file and a Tab.txt file.

The TB program OPEN'd each of the stored files, one at a time, and printed data to the monitor screen. The TB program opens the data files with ORG BYTE, ACCESS INPUT attributes. The number of file bytes is ASK'd for, and the input data is delivered to a single string variable, y$. The y$ length of the .csv and .txt files was 186 bytes in both cases; the MAC.csv y$ length was 250 bytes.

All of the numbers were plain text, not 8-byte IEEE format numbers.

Non-ASCII-text characters were printed in a list. The only control bytes printed to the screen were #9, #10 or #13, and of course they appeared as just rectangular blobs. No other Ctrl codes were detected.

To keep this message short, I suggest you email me and I will respond with the plain text program listing and the 3 Excel data files, so you can see what I did. I'm sure you will be able to extract the numeric variables embedded in y$.

If you can, I'd like to receive a file of a web page example, so I can edit my program to extract your file's numbers into the TB number format. Perhaps you could attach it to your email to me. Regards ... Tom M

This kind of examining the

This kind of examining the contents of a file needs no programming at all; it can be done using a good text editor or a file dumping tool (usually displaying the contents in hex format).

More importantly, no examination or experimenting at all is needed to discover that a csv file is a plain text file. A csv file is a plain text file by definition (precisely for the purpose that tables could be processed as plain text rather than binary data).

That is why you haven't found ‘invisible’ ASCII characters other than those with numbers #9, #10 and #13. The last two are used alone or in combination (depending on the o.s.) as line terminators. #9 is the "Tab" character and was, apparently, only present where Tab was used instead of comma as a delimiter between the values.

Also because csv is a text format, there is no reason to expect to find a binary-stored data, such as 8-byte IEEE-format numbers, in it.

It follows that Roger's problem is best solved by reading strings, a line at a time, and parsing them. There is no need to read an entire file, let alone to have to ask for its size.

Now, as Roger says, it is the explicit parsing that he wants to avoid, but this is obviously impossible, because in TB there is no way to implicitly tokenize a string based on a delimiter (comma or whatever).

On the other hand, such parsing is easily automated by a simple procedure such as the 10-line one that I posted in this thread: it works correctly for any number of missing and present values, it is parameterized on the delimiter (so the latter can be changed to, say, ";" or ":" or chr$(9)), and it can easily be extended to also handle non-numeric values, such as strings.

Roger,
I wonder, what did you mean by saying ‘with variable length numbers this could get awkward’? What can get awkward?

Regards,
Boyko

Boyko; Sometimes I type

Boyko;

Sometimes I type faster than I think. In this case I was thinking it would be awkward to have to search for ",,",", ,",", ", etc.

It's bad enough to have to use POS to find commas and then chr$ to see what's between them.

Roger

Ah, I see now... This can be

Ah, I see now...
This can be really annoying, but once it's done right it need not be repeated. You do it once and for all. Rather than directly doing LINE INPUT, you'll just have to call a procedure...

The following is a slight extension of the parsing code I posted before, together with a main program for testing it (on #0). The get_csv_row procedure takes an input channel number n, a delimiter d$, and a numeric array a as arguments. It reads a line of text from #n and parses it, based on d$ and skipping any whitespace, into a sequence of numbers which it stores in a. It also resizes a so that it accommodates precisely that many numbers.

The only non-automated thing that remains is storing the obtained values in individual variables, if that is needed: e.g. let x = values(1), let y = values(2) etc. in the main program.

I hope this helps.

Regards,
Boyko

option base 1
dim values(0)
call get_csv_row(0,",",values)
let m = size(values,1)
print m; "numbers read:"
for i=1 to m
  print values(i);
next i
!
sub get_csv_row(n,d$,a())
local i,k,x,s$,w$
line input #n: s$
! this part only needed if we wish to resize a()
let k = 1
let i = 0
do
  let i = pos(s$,d$,i+1)
  if i=0 then exit do
  let k = k+1
loop
mat redim a(k)
! parsing follows
let s$ = s$ & d$
let k = 0
do
  let i = pos(s$,d$)
  if i=1 then let w$ = "" else let w$ = trim$(s$(1:i-1))
  if len(w$)=0 then let x = 0 else let x = val(w$)
  let k = k+1
  let a(k) = x
  if i=len(s$) then exit do
  let s$ = s$(i+1:len(s$))
loop
end sub
!
end
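[Editor's note: for readers more at home in another language, the read–count–split–convert logic of get_csv_row can be sketched in Python. This is an illustrative equivalent, not a translation endorsed by the original poster; the file-object argument f stands in for TB's channel #n.]

```python
import io

def get_csv_row(f, d=","):
    """Read one line from file object f and return its fields as
    floats, treating blank or missing fields as 0."""
    line = f.readline().rstrip("\r\n")
    values = []
    for field in line.split(d):
        field = field.strip()                          # skip surrounding blanks
        values.append(float(field) if field else 0.0)  # missing value -> 0
    return values

# mimic reading one line from an open channel
row = get_csv_row(io.StringIO("27, 32.5, , 14\n"))
# row == [27.0, 32.5, 0.0, 14.0]
```

Note the same design choice as in the TB routine: a zero-length field after trimming is taken to mean "missing", so runs of commas simply yield zeros.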

Re: Screen dumps - an experiment

Roger ... Running Microsoft Excel, I've just created a 5-row, 4-column rectangular array of random numbers, using the Excel formula =100*RAND()-50, so I would create both positive and negative numbers that display 10 digits (two integer and 8 decimal).

I saved to disk memory three files, named as follows:

Random_Comma.csv, Random_Comma_MAC.csv and Random_Tab.txt

It's after midnight, and my wife and I will be attending the Philadelphia Area Computer Society (pacsnet.org) monthly meeting tomorrow morning and afternoon, so I won't be able to get back to this project until tomorrow evening. Then I will write a TB program that will address your problems, including "missing values".

I'm anxious to learn if the numeric data cells might be IEEE format-compliant - and - if the numbers are stored with the eight number bytes in "little endian" order. Regards ... Tom M
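[Editor's note: what an 8-byte IEEE little-endian double actually looks like is easy to demonstrate, for instance in Python; this is purely illustrative, and as later posts point out, a .csv file holds only plain text, never these raw bytes.]

```python
import struct

x = -37.25
raw = struct.pack("<d", x)       # 8-byte IEEE 754 double, "<" = little-endian
# len(raw) == 8; raw.hex() shows the bytes in storage order
(y,) = struct.unpack("<d", raw)  # round-trip the bytes back to a number
# y == x exactly, since -37.25 is representable in binary floating point
```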

parsing a csv string

Roger,

Of course you are right — you don't need to read the file as a byte sequence. Reading a line at a time is perfect.

Now, assuming that:
  (1) all values in a line are numeric;
  (2) they are separated by commas;
  (3) missing values are treated as 0s,
then the following code parses a string s$ and prints the corresponding sequence of numbers.

Of course, instead of printing the values of x, you can store them in an array etc.

This code works correctly no matter how many of the numbers are missing and which ones, i.e. all sequences of two or more commas are parsed as needed, including in leading or trailing position in s$.

Any possible spaces in s$ are skipped as immaterial.

If another separator need be used instead of the comma, change the value of d$ correspondingly.

let d$ = ","
let s$ = s$ & d$
do
  let i = pos(s$,d$)
  if i=1 then let w$ = "" else let w$ = trim$(s$(1:i-1))
  if len(w$)=0 then let x = 0 else let x = val(w$)
  PRINT X;   !! or store X, or do whatever needed with it
  if i=len(s$) then exit do
  let s$ = s$(i+1:len(s$))
loop

Is that what you need?
Regards, Boyko

Yes, see my reply to Tom,

Yes, see my reply to Tom, below.

Roger

Missing Values

The following will separate strings into their component parts. A null value won't bother it at all.

Tom Lake


DO
LINE INPUT prompt "Enter comma delimited string: ":s$
DO while pos(s$,",") > 0
LET p=pos(s$,",")
PRINT s$(1:p-1)
LET s$=s$(p+1:maxnum)
LOOP
PRINT s$
LOOP
END

Yes, this is what I've been

Yes, this is what I've been doing as a work-around.

But what I wanted was a way to read the values without tinkering with the characters. For example, if the line was:
27, 32.5, , 14
I want INPUT #3: a,b,c,d to return a=27, b=32.5, c=0, d=14.

Roger

Re: Your "numbers" inputted from Web pages

Roger ... Your "numbers" inputted from Web pages are revealing. They seem to consist of ASCII character "strings", and not the 8-byte IEEE numbers that True BASIC works with. Accordingly, the INPUT #3: program line should be written as:

INPUT #3: a$, b$, c$, d$, etc.

followed by program lines

a = VAL(a$)

b = VAL(b$) etc.

Of course, my revised program lines might return some ASCII character 000 (zero, the NUL character). How do you want to handle that case? That's the problem with INPUT and PRINT "text"-type files. "Byte"-type files use READ and WRITE TB program lines, even with "text"-type files.

When one READs any type of file - text, stream, random, record or byte - the READ variable can assign the entire stored target file into a single string of ASCII characters, i.e., y$. y$ can be printed to your monitor screen, along with LEN(y$).

One can parse y$ to identify every ASCII character number in y$. I do this routinely. I never use "text" files unless I know beforehand the files are strictly "plain text", and can be recovered using LINE INPUT program lines.

Parsing y$ I can detect the carriage return and line feed characters (13 & 10), so I can deduce the file was written on a PC, a MAC or Unix machine.

This message is long enough, and interrupting our 9 PM snack time, so ... Regards ... Tom M
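[Editor's note: Tom's line-terminator check — parsing for chr$(13) and chr$(10) to deduce the writing machine — can be sketched in Python on the raw bytes of a file. The classification below reflects the usual conventions, not a guarantee about any particular file.]

```python
def line_ending_style(data):
    """Classify the line terminators found in raw file bytes:
    CR+LF -> PC, bare CR -> classic Mac, bare LF -> Unix."""
    if b"\r\n" in data:
        return "CRLF (PC)"
    if b"\r" in data:
        return "CR (Mac)"
    if b"\n" in data:
        return "LF (Unix)"
    return "none found"

# e.g. line_ending_style(open("quakes.csv", "rb").read())
```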

Comma delimited files

rogerh ... If you are NOT reading your target file as a byte file, you're in trouble, and you're wasting your time. You MUST be using READ statements, and WRITE statements (if you want to alter your target file), and NOT INPUT and PRINT statements.

When you parse a byte file, you can be certain that NO "housekeeping" characters or Control characters have been introduced into the target file by an INPUT or PRINT statement - including a LINE INPUT statement.

READ THE MANUAL pages that cover byte-type files. Regards ... Tom M

Byte files

Hi Tom,

Very sound advice.

The file reading routine in the new editor uses the byte-file technique to read incoming files. In this way it is possible to import TRU, TRC and even MAC-formatted files without raising an error.

Big John

Big John; While I would

Big John;

While I would normally defer to your greater experience, in this case I disagree.

I've been using TB for over 30 years and have never had occasion to read by bytes.

All I'm doing is to read in text files, extract data by columns or comma delimited format and use it or write it to another file. I can read in a line and examine it character by character if need be. You don't need byte files for that.

Additionally, you have to specify how many bytes to read in which isn't too useful with variable (and unknown) length lines.

Roger

Parsing file characters

Roger ... If your target file consists of columns (of numbers, I assume), then it also contains end-of-line characters, like crlf$, which is chr$(13) & chr$(10), assuming your target (array?) file contains more than one row of numbers.

An Excel spreadsheet numeric-variable file is an example of such a file, but it may contain the TAB control character, chr$(9), rather than the comma character, chr$(44), to separate numbers.

Opening an array (of numbers) as a byte file permits one to determine precisely what every file character is, from ASCII zero to ASCII 255.

If your target file contains numbers, you haven't revealed whether your numbers are IEEE 64-bit double-precision numbers or just strings of ASCII characters (between ASCII 048 and ASCII 057), as in Appendix A of your TB manual. Please tell us which. Regards ... Tom M

Tom; I have no idea but it's

Tom;

I have no idea but it's never been a problem. I'm assuming they are ASCII as I'm doing cut/paste from web pages.

Roger

RE: Tom; I have no idea ...

Roger ... If memory serves me, I think we know each other from the "old True BASIC Forum". I think we're both engineers. I am old & retired.

I'm curious to know about the web pages you have been "cutting & pasting". Are they actually rectangular numeric arrays you highlight? Do you find them in Excel file pages?

I have downloaded & printed Excel arrays found on the "rfcafe" website, in the Smith Chart pages. I think I will revisit the site and experiment with the Excel arrays. I've never thought of extracting the S-parameter numbers into TB.

Some years ago I wrote TB programs that duplicated rf circuit design examples that exist in Tom Cuthbert's book "Circuit Design using Personal Computers (1983)", which used PET BASIC. I used TB's MATRIX statements & functions, and rendered complex numbers as 2-by-2 square arrays. Simplified vector analysis greatly.

I'm going to ignore further reply responses posted in your Forum message. Regarding byte-type files, I say "Don't bother me with facts! My mind is made up!" Regards ... Tom M

Tom; I'm old and retired but

Tom;

I'm old and retired but a seismologist, not an engineer.

I'm reading the results of earthquake searches. It's columns unless I ask for spreadsheet format in which case it's csv format.

My problem is that in either case there can be missing values for seconds and/or magnitude, so if I say INPUT #3: yr,mo,da,hr,mn,sec a missing value will cause an error, and so will sec = val(lin$[23:28]).

Roger
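[Editor's note: Roger's fixed-column case — sec = val(lin$[23:28]) erroring on a blank field — suggests a small guard. Here is a sketch in Python, using 1-based inclusive columns to mirror TB's lin$[23:28] notation; the column numbers in the example are made up for illustration.]

```python
def field(line, start, end, default=0.0):
    """Extract a numeric field from fixed columns start..end
    (1-based, inclusive, like TB's lin$[23:28]).
    Blank or absent fields return default instead of erroring."""
    w = line[start - 1:end].strip()
    return float(w) if w else default

row = "27      5.1"
# field(row, 1, 2)  -> 27.0
# field(row, 3, 8)  -> 0.0   (blank middle field, e.g. missing seconds)
# field(row, 9, 11) -> 5.1
```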

Seismology

Roger ... Yes, I remember you now. Can you give me the name of the website you visit, so I can see the tables you work with? Regards ... Tom M

Tom; Sure thing,

Tom;

Sure thing, it's:

http://earthquake.usgs.gov/eqcenter/recenteqsww/Quakes/quakes_all.php

That's one of them. I either cut/paste into Crimson Editor or use the "save page as..." option in my browser.

Roger

The earthquake link

Thanks for the earthquake data source. I visited the site & downloaded the "M1+earthquakes, past 7 days (86kB)" file. When I opened it in Wordpad, I could see it would open in an Excel spreadsheet, and that's what I did. I named the file rogerh_20081022.xls and saved it. I also saved the Excel file as a .csv file and a TAB-delimited .txt file.

I printed out the first page of the Excel file. The first row of the "earthquake" file was the "field" headings for the 10 columns of data, namely Src, Eqid, Version, Datetime, Lat, Lon, Magnitude, Depth, NST and Region. The Excel file is a "flat" database with 3 columns of text & 7 columns of numbers.

I edited the TB program I wrote last weekend & changed the target file names to agree with the above Excel .csv & .txt file names. My program ran OK; no errors had to be troubleshot. The length of the y$'s was 88,890 bytes for the .csv file, and 84,797 bytes for the .txt file. The difference between the two was due to the different number of Ctrl characters in the two files. It would be easy to extract the "number" columns.

Have you ever opened a csv file, first as a Wordpad file, and then as an Excel database? Regards ... Tom M

Yes, but Wordpad has a nasty

Yes, but Wordpad has a nasty way of trimming excess blanks, which can mess up a column-based file. That's a problem when I want to reformat by deleting some of the columns.

The other problem would be in opening a large file. I sometimes deal with catalog searches that have 250,000 lines of text.

Roger

Re: Yes, but ...

Roger ... Can you point me to a large catalog search file? And, any file that has missing numeric data.

The .csv file I downloaded didn't have any missing numeric data values, and I didn't try any reformatting until I was in the Excel file. I did some sorts on the Magnitude & Depth columns, to see their range of values, but didn't save the sorted files. Regards ... Tom M

Actually, I can't because

Actually, I can't because the search routine won't allow you to get that many in one pass. But I have files that big, made by consolidating multiple searches.

Go here and do a global search for as many years as it will allow and you'll find a lot of missing values.

http://neic.usgs.gov/neis/epic/epic.html

Roger

Missing values, earthquakes

Roger ... I visited the earthquake link in your last message and downloaded moderate snippets of two of the .csv files I opened. On the first try, I used Ctrl+A to prepare the file for copy & paste into Wordpad. NOT a good idea! Too big of a file!!!

On the next try, I highlighted about 50 to 100 lines, from two different data files that did have missing numerical data lines. Got them saved as an rtf file.

Then I had to shut down and travel to the Philly Main Line, where we have a condo. That's where Dottie & I are spending the weekend, socializing. I'll be back in NE Philly Monday. I want to get the two data file segments into Excel, and ultimately into TB. Regards ... Tom M