Saturday, September 30, 2017

Max Column Sum by Key (part 3)



Continuing on from my previous explorations that followed Learn to Code Grand Rapids “Building a Real World App in VS 2017, Part I” of August 10 I decided to try Kotlin after having attended a MeetUp presentation of this new language. 

It was supposed to be an extension of Java 8 (I think it was) for Android devices and to be able to gradually go from Java code to Kotlin with them working together while gradually switching over.  As a new language it was supposed to be getting frequent releases.  Well I must not have downloaded a real recent release – at least I couldn't compile with part Kotlin code and part Java code (of course my Java download was back with a 3.6 release or something like that so no where near Java 8).

In my previous two Max Column Sum by Key documents I described my initial C# attempt before I had the file containing the data that was aborted for that reason, then my application using Ada as the language after getting the data file, then with C# once again using the data file, and finally with Python as a new language for me.  Then, in the second document, using Java and redoing the C# application once again.

This document will describe the application in Kotlin which ended up somewhat simpler since I had to work through some problems.  While trying to find a way to do a similar structure to that used in "Max Column Sum by Key (part 2)" in Java and then C# once again and failing to accomplish it in Kotlin, I gave up and then a simpler solution occurred to me.

Following that discussion will be one concerning doing the same application in C.

Discussion of the Kotlin Application


I started out following the program structure of the Java version of the application with a Key class to encapsulate the data of the key field of the file.  Remember that the file being used consisted of 20 records for each of 5 different keys with a pair of values in each record associated with the key with the objective of finding the key and value that had the most instances in the file.  (Each key following "who cares" leading text and the key and value fields preceded by a horizontal tab character and the original file having a line feed character following the second value.  Since I am using the UltraEdit editor which likes to ask if you want to convert to a DOS format when opening a file, I happened to reply in the affirmative early on so that records got a carriage return preceding the trailing LF when I was examining the format of the file.)

In the Java application and the later C# application there was the Key class to organize the data for a key including arrays to retain the different values associated with the particular key (taking the place of a C struct structure).  Then a KeyTable class to keep track of all the data with an array of the Key class to retain the data on each different key found in the file.  Plus an Update function to access the two classes to build the tables.

I tried to duplicate this in Kotlin – after all it was supposed to be able to gradually convert from a Java application to Kotlin.  Kotlin had array constructs where a class or object could be assigned as the type of the array.  (object being something similar to a class it seemed.)  But try as I might I couldn't get an array of the Key class to work as I had in Java and C#. 

So I decided to use a double dimension array of the Int type.  Again there was a problem since it seemed from online searches that multiple dimension arrays weren't supported although some said it had been added supplying what looked to me as a strange syntax.  But, most likely not in the version I had downloaded even if I could have figured out the syntax.

After fussing with the array of the Key class once again it dawned on me that I could do a work around for a double dimension array where I made a single dimension array of the Int type contain imaginary arrays of a second dimension.  In that way I could have the multiple arrays of the past Key class value data for a particular key embedded in the single dimension array – one group of the value data for each key as follows.
  // The double dimension arrays are simulated as
  //  +-----------+-----------+-----------+-----------+-----------+-----------+
  //  |           |           |           |           |           |           |
  //  +-----------+-----------+-----------+-----------+-----------+-----------+
  //  |           |
  //  | elements  +--> boundary at end of first array
  //    of first
  //     array
  // Access the double array via v + k * (first array size) since arrays begin at 0
  //  [0,0] computes to 0 for the single dimension array;
  //  [1,0] computes to 1; assuming 30 elements in the k array
  //  [0,1] computes to 30;
  //  [1,1] computes to 31
  var keyCount : Int = 0        // number of different keys
  var keys = IntArray(30)       // value of key - one in each array position
  var valueCount = IntArray(30) // number of different values for a particular key
  var values = IntArray(30*20)  // particular values for each key
  var sums = IntArray(30*20)    // sum of instances of each value for each key
Here "elements of first array" means the values associated with the first key (the k index of the illustration) with v being the value index within a particular set of elements.  Thus the code only needs to compute the actual index from the k and v indexes.

After getting this worked out, the Key class no longer had any purpose and so was discarded.

Then while implementing the Kotlin version of the Update function accessing the KeyTable class functions it occurred to me that it would be much simpler to move the update function inside the KeyTable class to directly access the data and that this could be accomplished more simply than I had in the past implementations by passing in all three fields from each record at a time.  This is when it occurred to me that this new KeyTable function had the same purpose as the previous Update function so I changed the name I had given it to update (following the usual C convention of making the first character of a variable or function name lower case).  When the code is presented it can be seen how this version of update is simpler than the previous versions.

At the beginning I had had some problems remembering the syntax for declaring variables and function parameters and function data return declarations since it is non-C like.  It took me a while to get used to it and then I realized it had some characteristics of that of Ada – that is, that the variable is named first followed by a ':' and then its type.  For instance
var index : Int = 0
val HT : Byte = 9
which in Ada would be (where case doesn't matter)
index : Integer := 0;
HT : constant Byte := 9;
where the Byte type in Ada needs to be defined as an 8 bit unsigned integer.
Whereas in C (for instance) it is
int index = 0;
with the type given first.

Also Kotlin can have constants by using the keyword val instead of var.  While, of course, Ada, C, etc don't have a keyword to specify when a variable is being created.

As with Java an ending ';' can be given but isn't need.

For function declarations Kotlin uses
fun name(param1 : Int, param2 : String) : Int
whereas Ada would use
function name(param1 : Integer; param2 : String) return Integer;
or
function name(param1 : in Integer; param2 : in String) return Integer;
where it is specified that a function can only have parameters that input data to the function.  Ada also has
procedure name(param1 : in out Integer; param2 : in String, param3 : out Integer);
that doesn't have a return statement but can output data via the out keyword (with param1 in this case being both input to the procedure and output).  C function declarations, of course, are of the form
int name(int param1);
void name(int param1);
without the keyword to specify a function and with the return type preceding the function name and the type of the parameters preceding the parameter name.

So, in this regard, Katlin syntax is more like Ada (and the older and deader language Pascal) with the name first and then the type.

As can be seen in the Kotlin code, I also modified the parse function to just build an array of 3 integers to capture the data fields of interest (the key and the two values) and just count the number of fields processed (including the leading don't care field) rather than use a switch type of statement.

My Kotlin code for the application is (where Java IO is imported to read the file)
import java.io.File
import java.io.InputStream

// Update structure to contain the different values for each key with the
// number of instances of each combination
class KeyTable
{
  // Data tracking which value of which key has the greatest number of references
  var maxKey : Int = -1   // key with the greatest number of references for a value
  var maxValue : Int = -1 // value with the greatest number of references
  var maxSum : Int = -1   // number of references for key, value combination
 
  // This class retains the data for all possible keys with their associated
  // values.  As such it simulates double dimension arrays for the values and
  // the sum of the number of instances for a particular value.  The first
  // simulated dimension is the same as that for the keys array; one index for
  // each key.  The second simulated dimension is of the various different values
  // for a particular key.
  //
  // The double dimension arrays are simulated as
  //  +-----------+-----------+-----------+-----------+-----------+-----------+
  //  |           |           |           |           |           |           |
  //  +-----------+-----------+-----------+-----------+-----------+-----------+
  //  |           |
  //  | elements  +--> boundary at end of first array for values of first key
  //    of first
  //      key
  // Access the double array via v + k * (first array size) since arrays begin at 0
  //  [0,0] computes to 0 for the single dimension array;
  //  [1,0] computes to 1; assuming 30 elements in the key array and 20 for the values
  //  [0,1] computes to 30 for first value of 2nd key
  //  [1,1] computes to 31 for second value of 2nd key
  var keyCount : Int = 0        // number of different keys
  var keys = IntArray(30)       // value of key - one in each array position
  var valueCount = IntArray(30) // number of different values for a particular key
  var values = IntArray(30*20)  // particular values for each key
  var sums = IntArray(30*20)    // sum of instances of each value for each key

  var savedData = IntArray(3) // global values from call to update

  fun initialize()
  {
    for (k: Int in 0..29)
    {
      keys[k] = 0
      valueCount[k] = 0
    }
    for (kv: Int in 0..(30*20)-1)
    {
      values[kv] = 0
      sums[kv] = 0
    }
  } // end initialize

  // Add key and its values
  fun update(data : IntArray)
  {
    // This function first checks if the key is new and, if so, adds it to the
    // keys array.  It then does similar for the values associated with it.
    // Note: data[0] is the key while data[1] and data[2] are the two values
    //       associated with the key.
   
    savedData = data // save data to be updated for use by updateTotals
   
    // Check whether the key is already in the table
    var keyIndex : Int = -1
    for (k : Int in 0..keyCount-1)
    {
      if (keys[k] == data[0])
      {
        keyIndex = k // key already in the table
        break; // exit loop
      }
    } // end for loop

    if (keyIndex < 0) // key not in the table
    { // add the key
      keyIndex = keyCount
      keys[keyIndex] = data[0]
      keyCount++
    } // end if

    // add the values for the key
    addValues(keyIndex, data[1], data[2])

  } // end update

  // Add the values or increment their sums
  fun addValues(keyIndex : Int, value1 : Int, value2 : Int)
  {
    // Check whether the first value is already in the table
    var valueIndex : Int = -1
    var index : Int = keyIndex * 30 // location of any first value (0, 30, ...)
    for (v : Int in 0..valueCount[keyIndex]-1)
    {
      index = index + v
      if (values[index] == value1)
      { // value already in the table
        valueIndex = index
        sums[valueIndex]++ // increment its number of references
        if (value1 == value2) // 2nd value the same
        {
          sums[valueIndex]++ // increment again
        }
        updateTotals(savedData[0],value1,sums[valueIndex]) // check if new max
        break; // exit loop
      }
    } // end loop
    if (valueIndex < 0) // value not yet in table
    { // add value to the table - index points to last value checked
      index = index + valueCount[keyIndex] // next location
      values[index] = value1
      sums[index] = 1
      valueCount[keyIndex]++
      if (value1 == value2)
      {
        sums[index]++ // increment
        updateTotals(savedData[0],value1,sums[index]) // check if new max
      }
      else
      {
        add2ndValue(keyIndex, value2)
      }
    } // end outer if

  } // end addValues

  // Add the second value or increment its sum
  fun add2ndValue(keyIndex : Int, value2 : Int)
  {
    // Check whether the second value is already in the table
    var valueIndex : Int = -1
    var index : Int = keyIndex * 30 // location of any second value (0, 30, ...)
    for (v : Int in 0..valueCount[keyIndex]-1)
    {
      index = index + v
      if (values[index] == value2)
      { // value already in the table
        valueIndex = index
        sums[valueIndex]++ // increment its number of references
        updateTotals(savedData[0],value2,sums[valueIndex]) // check if new max
        break; // exit loop
      }
    } // end loop
    if (valueIndex < 0) // value not yet in table
    { // add value to the table - index points to last value checked
      index++ // index to next location
      values[index] = value2
      sums[index] = 1
      valueCount[keyIndex]++
    }
       
  } // end add2ndValue

  fun updateTotals(key : Int, value : Int, sum : Int)
  {
    if (sum > maxSum)
    {
      maxKey = key
      maxValue = value
      maxSum = sum
    }
  } // end updateTotals
 
  // Report the key and value with the most references
  fun report()
  {
    println("Key " + maxKey + " with value " + maxValue + " with maximum references of " + maxSum)
  }
 
} // end class KeyTable

// Instantiate KeyTable class
val keyTable = KeyTable()


// Convert Byte array to Int
fun toInt(data: ByteArray, iS : Int, iE: Int): Int
{
  val NINE: Int = 57 // ASCII character for digit 9
  val ZERO: Int = 48 // ASCII character for digit 0

  var index: Int = iE // loop in reverse
  var digit: Int
  var m: Int = 1      // multiplier for shift
  var number: Int = 0 // Numeric result

  while (index >= iS)
  {
    if ((data[index] >= ZERO) && (data[index] <= NINE))
    {
      digit = data[index] - ZERO; // convert ASCII to digit
      number = number + (m * digit);
      m = m * 10;
      index--;
    }
  }

  return number;

} // end toInt

// Parse
fun parse(buffer: ByteArray, length: Int)
{
  val CR: Byte = 13 // carriage return
  val HT: Byte = 9  // horizontal tab
  val LF: Byte = 10 // line feed
 
  // Parse each line of data in the buffer to obtain the three fields of
  // interest, converting those fields to integers into an array, and
  // then updating a data structure to retain the data for evaluation
  // when the complete buffer has been parsed.
  // Also, output the contents of the file buffer when not commented out.
  var nextField: Int = 0  // index of the beginning of next field
  var startField: Int     // range of indexes of
  var numFields: Int = 0  // index into dataFields array
  var dataFields = IntArray(3) // Integer values of the three fields of interest
 
  var index: Int = 0
  while (true)
  {
    // Parse the buffer line
    if (buffer[index] == HT) // beginning of a field
    {
      startField = nextField // save starting index
      nextField = index + 1  // the next byte will contain part of next field
      if (numFields > 0)
      {
        dataFields[numFields-1] = toInt(buffer, startField, index - 1) //convert and store in dataFields
      } // end if numFields > 0
      numFields++
    } // end if HT
    if ((buffer[index] == CR) || (buffer[index] == LF))
    {
      if (numFields < 4) // only do CR or LF once
      {
        dataFields[numFields-1] = toInt(buffer, nextField, index - 1) //convert and store in dataFields
        keyTable.update(dataFields)

        numFields++
      }
      if (buffer[index] == LF)
      {
        numFields = 0
      }
    } // end else if

    index++
    if (index == length)
    {
      break // out of while loop
    }
  } // end while

} // end of parse function

// Main entry point
fun main(args: Array<String>)
{
  // Open the file.
  try
  {
    // Open the inputStream.
    var inputStream: InputStream = File( "C:\\Source\\LearnToCodeGR\\max-col-sum-by-Key.tsv").inputStream()

    // Read from this input stream into an array of bytes
    var buffer = ByteArray(8192)
    var length = inputStream.read(buffer)

    keyTable.initialize() // initialize the KeyTable class

    // Parse the buffer and keep track of the needed data.
    parse( buffer, length)
   
    // Report the results
    keyTable.report()
   
  } // end try
  catch (e: NoSuchFileException)
  {
    System.out.println("File not found" + e);
  }

} // end main function

Kotlin, like Java and Python, doesn’t seem to have a compiler that runs in a Windows window.  So it has to be run in a Command Prompt window.  I, as usual, changed the folder in the window to that where I was developing the application (c:\Source\Kotlin) so the command was just
> kotlinc learntocodegr.kc -include-runtime -d learntocodegr.jar
since LearnToCodeGR.kc was the file name that I used and kotlinc.exe is the kotlin compiler.  (With the path to the binary folder added to the Windows system path.)

To run the application I used the command
> java –jar learntocodegr.jar
which result in the output in the Command Prompt window of
Key 3000 with value 1 with maximum references of 20
as output by the report function of the KeyTable class.  So the use of Java shows up here.


Discussion of the C Application


Once again I had the "main" function open the max-col-sum-by-Key.tsv file and read it.  However, as I have been meaning to do for a while now, instead of reading the contents of the entire file into a buffer, I used a C function to only read in a line at a time.  That is, until the Line Feed character and read the file as containing characters since it is now known that the file contains lines of characters rather than, for instance, a mix with binary numeric values.  This also has the advantage of being able to process a much larger file since the buffer can be sized to a much smaller value.

And I used C's struct type that I've found missing in C#, etc, etc so its use can be observed.  The object of the struct type are
  struct KeyData keydata[30]; // array of possible keys with their values
in the declaration of the KeyTable structure type and
struct KeyTable keyTable; // static table
for the keyTable variable.

Otherwise, the C application is a rework of the Kotlin version of the application.  So the two languages can be directly compared.

The contents of the max-col-sum-by-Key.tsv file is displayed in the main function.  I've left it in but normally it would either be removed or commented out.  When the application is run there is a blank line after each line read from the file since each line (except the last) has its own ending new line character and then my code outputs another new line using the \n new line character of C.  Of course, the horizontal tabs in the lines are only seen as blank space since HTs (\t), like the LF (\n), aren't printable characters.

#include <stdio.h>

// Data about a particular key
struct KeyData {
  int key;        // numeric key
  int valueCount; // number of different values associated with key
  int values[20]; // different values associated with key
  int sums[20];   // number of references to a value
};

// Data about all the keys of the file
struct KeyTable {
  int keyCount; // number of different keys in the file
  struct KeyData keydata[30]; // array of possible keys with their values
};

struct KeyTable keyTable; // static table

// Initialize keyTable
void initialize()
{
  keyTable.keyCount = 0;
  for (int i = 0; i < 30; i++)
  {
    keyTable.keydata[i].valueCount = 0;
  }
} // end initialize

// Data to be retained as the key and its associated value with the maximum
// number of references
int maxKey = 0;
int maxValue = 0;
int maxSum = 0;

int savedData[3];

// Report the key and value with the most references
void report()
{
  printf("\n");
  printf("Key %d with value %d with maximum references of %d \n", maxKey, maxValue, maxSum );
}

void updateTotals(int key, int value, int sum)
{
  if (sum > maxSum)
  {
    maxKey = key;
    maxValue = value;
    maxSum = sum;
  }
} // end updateTotals

// Add the second value or increment its sum
void add2ndValue(int keyIndex, int value2)
{
  // Check whether the second value is already in the table
  int valueIndex = -1;
  for (int v = 0; v < keyTable.keydata[keyIndex].valueCount; v++)
  {
    if (keyTable.keydata[keyIndex].values[v] == value2)
    { // value already in the table
      valueIndex = v;
      keyTable.keydata[keyIndex].sums[valueIndex]++; // increment its number of references
      // check if new max
      updateTotals(savedData[0],value2,keyTable.keydata[keyIndex].sums[valueIndex]);
      break; // exit loop
    }
  } // end loop
  if (valueIndex < 0) // value not yet in table
  { // add value to the table
    int index = keyTable.keydata[keyIndex].valueCount;
    keyTable.keydata[keyIndex].values[index] = value2;
    keyTable.keydata[keyIndex].sums[index] = 1;
    keyTable.keydata[keyIndex].valueCount++;
  }
       
} // end add2ndValue

// Add the values or increment their sums
void addValues(int keyIndex, int value1, int value2)
{
  // Check whether the first value is already in the table
  int valueIndex = -1;
  for (int v = 0; v < keyTable.keydata[keyIndex].valueCount; v++)
  {
    if (keyTable.keydata[keyIndex].values[v] == value1)
    { // value already in the table
      valueIndex = v;
      keyTable.keydata[keyIndex].sums[v]++; // increment its number of references
      if (value1 == value2) // 2nd value the same
      {
        keyTable.keydata[keyIndex].sums[v]++; // increment again
      }
      // check if new max
      updateTotals(savedData[0],value1,keyTable.keydata[keyIndex].sums[v]);
      break; // exit loop
    }
  } // end loop
  if (valueIndex < 0) // value not yet in table
  { // add value to the table - index points to last value checked
    //    index = index + valueCount[keyIndex] // next location
    int index = keyTable.keydata[keyIndex].valueCount;
    keyTable.keydata[keyIndex].values[index] = value1;
    keyTable.keydata[keyIndex].sums[index] = 1;
    keyTable.keydata[keyIndex].valueCount++;
    if (value1 == value2)
    {
      keyTable.keydata[keyIndex].sums[index]++; // increment
      // check if new max
      updateTotals(savedData[0],value1,keyTable.keydata[keyIndex].sums[index]);
    }
    else
    {
      add2ndValue(keyIndex, value2);
    }
  } // end outer if

} // end addValues

// Update keyTable with data from a parsed line
void update(int* data)
{
  // This function first checks if the key is new and, if so, adds it to the
  // keys array.  It then does similar for the values associated with it.
  // Note: data[0] is the key while data[1] and data[2] are the two values
  //       associated with the key.

  // save data to be updated for use by updateTotals since needed by multiple functions
  savedData[0] = data[0];
  savedData[1] = data[1];
  savedData[2] = data[2];

  // Check whether the key is already in the table
  int keyIndex = -1;
  for (int k = 0; k < keyTable.keyCount; k++)
  {
    if (keyTable.keydata[k].key == data[0])
    {
      keyIndex = k; // key already in the table
      break; // exit loop
    }
  } // end for loop

  if (keyIndex < 0) // key not in the table
  { // add the key
    keyIndex = keyTable.keyCount;
    keyTable.keydata[keyIndex].key = data[0];
    keyTable.keyCount++;
  } // end if

  // add the values for the key
  addValues(keyIndex, savedData[1], savedData[2]); //data[1], data[2]);

} // end update


// Convert character array to int
int toInt(char* data, int iS, int iE)
{
  #define NINE '9' //57 // ASCII character for digit 9
  #define ZERO '0' //48 // ASCII character for digit 0

  int index = iE; // loop in reverse
  int digit;
  int m = 1;      // multiplier for shift
  int number = 0; // Numeric result

  while (index >= iS)
  {
    if ((data[index] >= ZERO) && (data[index] <= NINE))
    {
      digit = data[index] - ZERO; // convert ASCII to digit
      number = number + (m * digit);
      m = m * 10;
      index--;
    }
  }

  return number;

} // end toInt

// Extract the fields of interest from the file's line and retain the data
void parse(int count, char* data)
{

  #define CR '\r' // carriage return
  #define HT '\t' // horizontal tab
  #define LF '\n' // line feed
  #define TRUE 1
 
  // Parse each line of data in the buffer to obtain the three fields of
  // interest, converting those fields to integers into an array, and
  // then updating a data structure to retain the data for evaluation
  // when the complete buffer has been parsed.
  int nextField = 0;  // index of the beginning of next field
  int startField;     // range of indexes of
  int numFields = 0;  // index into dataFields array
  int dataFields[3];  // Integer values of the three fields of interest
 
  int index = 0;
  while (TRUE)
  {
    // Parse the buffer line
    if (data[index] == HT) // beginning of a field
    {
      startField = nextField; // save starting index
      nextField = index + 1;  // the next byte will contain part of next field
      if (numFields > 0)
      { //convert and store in dataFields
        dataFields[numFields-1] = toInt(data, startField, index - 1);
      } // end if numFields > 0
      numFields++;
    } // end if HT
    if ((data[index] == CR) || (data[index] == LF))
    {
      if (numFields < 4) // only do CR or LF once
      { //convert and store in dataFields
        dataFields[numFields-1] = toInt(data, nextField, index - 1);
        // save the data to determine the key, value combination with
        // maximum number of references
        update(dataFields);

        return; // finished with the line in the
      }
    } // end if

    index++;
    if (index == count)
    {
      break; // out of while loop
    }
  } // end while

} // end parse


void main(argc, argv)
int argc;
char* argv[];
{
  FILE* file;
  char buffer[30];
  file = fopen("C:\\Source\\LearnToCodeGR\\max-col-sum-by-Key.tsv", "r");

  int currPos;
  int lastPos = 0;
  int length;

  initialize();

  if(file == NULL)
  {
    perror("Error opening file");
  }
  else
  { // Read file records until end-of-file
    while (!feof(file))
    {
      fgets(buffer, 30, (FILE*)file);
      currPos = ftell(file);
      length = currPos - lastPos; // number of characters read into buffer
      lastPos = currPos;
      for (int i=0; i < length; i++)
        printf("%c ", buffer[i]);
      printf("\n");
     
      // Parse the line and update to retain all the necessary data.
      parse(length, buffer);
    }
    fclose(file);

    // Report the results.
    report();
  }

} // end main


No comments: