Monday, March 31, 2008

Cola Array Range Notation and Range Lists

One of the common things I use Perl or UNIX command line tools for is processing CSV or fixed-width data files. Even though Perl has some nice regex notation, I've often found that the UNIX 'cut' utility is more direct and clear than a Perl regex, substr or pack/unpack. Say I'm given a fixed-width ASCII file with first name, last name, zip code and a lot of other fields, and I'm told to create a new file with only those 3 fields. The spec says each field is 10 characters wide and the first name, last name and zip code are the 1st, 2nd and 7th fields respectively. In UNIX, extracting them is as simple as:
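cut -c1-10,11-20,61-70 myfile.dat > newfile.dat
Note that 'cut' numbers character positions from 1, so the 7th ten-character field is 61-70 (a range starting at 0 is rejected). I don't know of any briefer way to write it than with 'cut'. With Perl I'd usually use a regex or substr for this.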
 # Perl: substr() takes (STRING, OFFSET, LENGTH), with offsets starting at 0
 while(<>) {
   print substr($_, 0, 10) . substr($_, 10, 10) . substr($_, 60, 10) . "\n";
 }
Not as clear. Where substr usually becomes less clear is that it doesn't match up with the START:STOP ranges given in the spec or in something like an Oracle SQL*Loader file definition, so I have to remind myself that Perl's substr() uses START:LENGTH notation, not START:END notation. If the problem is a bit more complex, like splicing in new fields or substituting field values, then Perl is usually the way to go. I'm sure some shell gurus will argue, but by the time I need three tools (cut + sed + awk) I just prefer one (Perl).
For Cola I'm playing with the idea of range notation and range lists for both arrays and strings (which I treat as arrays).
 // Cola
 string s;
 while(s = readln()) {
   print( s[0..9, 10..19, 60..69] );
 }
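One way to read the rvalue form of a range list is "concatenate the slices, in the order written, into a new value of the same type". Here is a minimal sketch of that reading in Python rather than Cola, since Cola's semantics are still being worked out. The helper name `select` and the inclusive `(start, stop)` tuples are assumptions made for illustration, not part of Cola.

```python
def select(seq, *ranges):
    """Return a new sequence of the same type as seq holding the elements
    covered by each inclusive (start, stop) range, in the order given."""
    out = seq[:0]  # empty sequence of the same type (str, list, tuple)
    for start, stop in ranges:
        # Python slices exclude the end index, so add 1 for inclusive ranges.
        out += seq[start:stop + 1]
    return out


# A fake fixed-width record: eight fields, each padded to 10 characters.
line = "".join(f"f{i}".ljust(10) for i in range(8))

print(repr(select(line, (0, 9), (10, 19), (60, 69))))
print(select([0, 1, 2, 3, 4, 5, 6, 7], (0, 2), (5, 7)))  # [0, 1, 2, 5, 6, 7]
```

The same function handles a string and an array of int, which matches the idea of treating strings as arrays: the result type follows the type of the thing being indexed.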
It's easy to implement this as an rvalue, but eventually people will want to assign to it (as an lvalue), so the question becomes: what is the type of the expression
   foo[0..9, 10..19]
As an rvalue, I expect the expression to have the same type as foo, so if foo is an array of int, the expression should be a new array of int containing only those ranges. The question becomes: can I assign to it in any meaningful way? The tangent to this is immutable (or mutable) strings. Click said link for hours of reading from fanatics in either camp (see my post on mutable strings .. link to be added). C# and Java don't have mutable strings, but Perl has a smart string. I'm leaning towards smart strings. For the sake of this discussion, forget immutable strings, or forget that we are talking about strings at all, and assume arrays. Is it useful to be able to assign to an array with range notation?
   foo[0..2] = {1, 3, 7};   // sure
   foo[0..2, 5..7] = {1, 3, 7, 11, 13, 17};  // makes sense to me
   foo[0..2, 5..7] = { {1, 3, 7},  {11, 13, 17} };   // nested arrays ?
The problem with the 3rd example is that it lacks orthogonality. The principle of orthogonality dictates that if the expression evaluates to type X, then when I assign to it I must assign a value of type X, not X[]. Other points to ponder: if the target range is not the same size as the source, what happens?
   foo[0..2] = {1};   // leave range 1..2 untouched or collapse and discard them
   s[0..2] = "ABCDE"; // replaces outside the range, illegal or squeeze it into the array?
I think collapse-or-expand has more practical utility in this notation, but it's probably more bug-prone. One option is to generate a compiler warning for this sort of expression; the other is to just make it illegal. Perl's substr() assignment takes the collapse-or-expand route: a shorter replacement shrinks the string and a longer one grows it, and only a target that lies entirely outside the string is a fatal error. The question is where Cola fits. I've added some Perly features, but it still sticks closer to C#/Java syntax, and Cola is in fact a statically typed language. Then again, C#/Java do not do compile-time bounds checking on arrays, and this range feature falls right in line with that.
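To make the two policies concrete, here is a rough Python model of each; it is only a sketch, not a proposal for how Cola would implement them. The function names `assign_strict` and `assign_splice` are made up for illustration, ranges are inclusive `(start, stop)` pairs, and negative indices are not considered.

```python
def assign_strict(arr, ranges, values):
    """Copy a flat list of values into the inclusive (start, stop) ranges
    of arr, in order. Raises unless source and target sizes match exactly
    (the "make it illegal" option, checked at run time here)."""
    targets = [i for start, stop in ranges for i in range(start, stop + 1)]
    if targets and max(targets) >= len(arr):
        raise IndexError("range extends past the end of the array")
    if len(targets) != len(values):
        raise ValueError(f"{len(values)} values for {len(targets)} slots")
    for i, v in zip(targets, values):
        arr[i] = v


def assign_splice(arr, start, stop, values):
    """Replace the inclusive range start..stop with values, letting the
    array shrink or grow (the "collapse or expand" option)."""
    arr[start:stop + 1] = values


foo = list(range(10))
assign_strict(foo, [(0, 2), (5, 7)], [1, 3, 7, 11, 13, 17])
print(foo)  # [1, 3, 7, 3, 4, 11, 13, 17, 8, 9]

bar = [0, 1, 2, 3, 4]
assign_splice(bar, 0, 2, [1])  # collapse: elements 1..2 are discarded
print(bar)  # [1, 3, 4]

s = list("abcdef")
assign_splice(s, 0, 2, list("ABCDE"))  # expand: the array grows by two
print("".join(s))  # ABCDEdef
```

The strict version takes a flat source for a range list, so what is assigned has the same shape as what the expression evaluates to. The splice version is limited to a single range here because, with several ranges and a mismatched source, it is not obvious how the values should be divided among them.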
