Cross-tabulation Algorithms

by Martin Barghoorn (barg@cs.tu-berlin.de)

Abstract

The cross-tabulation technique is a common tool for data analysis and other statistical purposes. But when the datasets are large the standard software such as spreadsheet programs cannot process all the data or the database programs are too slow. It is therefore necessary to offer efficient APL programs which are well designed for cross-tabulations. There are various algorithms in APL but which ones are appropriate for such tasks?

1. Introduction

A very important tool for data analysis and statistical evaluation is the cross-tabulation method. We distinguish between the elements of a vector or a matrix representing measurements (e.g. weights or sizes) and a corresponding set of classification variables. Often the classification vectors are the result of defining classes for a continuous variable (frequency distribution). By analyzing the classification variables alone we can count frequencies within the classes. Depending on the chosen number n of classification variables it is a n-way classicification problem. There are many techniques of calculating n-way cross-tabulations with APL. I’ll try to document some and find out an algorithm which works fast and which is able to handle large data sets.

2. Cross-tabs with APL

Outer and inner product

Consider the classification vector ZUV and at minimum two methods to get the corresponding bitmatrix. Both work with outer matrix product.

    FRAME¨ZUV (BITMAT1„(¼MAX ZUV)°.=ZUV)

ÚÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÌ ÚÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÌ Û 1 3 2 2 3 4 4 5 1 1 Û Û 1 0 0 0 0 0 0 0 1 1 Û ÀÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÙ Û 0 0 1 1 0 0 0 0 0 0 Û Û 0 1 0 0 1 0 0 0 0 0 Û Û 0 0 0 0 0 1 1 0 0 0 Û Û 0 0 0 0 0 0 0 1 0 0 Û ÀÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÙ

FRAME¨ZUV (BITMAT2„(SORTZ UNIQUE ZUV)°.=ZUV) ÚÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÌ ÚÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÌ Û 1 3 2 2 3 4 4 5 1 1 Û Û 1 0 0 0 0 0 0 0 1 1 Û ÀÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÙ Û 0 0 1 1 0 0 0 0 0 0 Û Û 0 1 0 0 1 0 0 0 0 0 Û Û 0 0 0 0 0 1 1 0 0 0 Û Û 0 0 0 0 0 0 0 1 0 0 Û ÀÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÙ

The data-vector consists of the numbers from 1 to 10

    FRAME DAT„¼10

ÚÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÌ Û 1 2 3 4 5 6 7 8 9 10 Û ÀÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÙ

In both cases we get the 1-way accumulation for DAT by using the inner matrix product.

    FRAME¨(BITMAT1+.×DAT)(BITMAT2+.×DAT)

ÚÎÎÎÎÎÎÎÎÎÎÎÎÎÌ ÚÎÎÎÎÎÎÎÎÎÎÎÎÎÌ Û 20 7 7 13 8 Û Û 20 7 7 13 8 Û ÀÎÎÎÎÎÎÎÎÎÎÎÎÎÙ ÀÎÎÎÎÎÎÎÎÎÎÎÎÎÙ

If we summarize +/BITMAT1 or +/BITMAT2 we obtain the number of cases in each cell (frequency count). This method is simple but inefficient if the classification variables have many classes. In the past we sometimes got WS FULL errors.

2.2 Accumulation with Rotate (without outer and inner product)

First we define a rotate-vector for 1-way tabulation.

      FRAME ROTAT„ZUV-1

ÚÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÌ Û 0 2 1 1 2 3 3 4 0 0 Û ÀÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÙ

We compare the data before and after Rotate.

       FRAME¨(10 5†10 1½DAT) (MATS„(-ROTAT)²10 5†10 1½DAT)

ÚÎÎÎÎÎÎÎÎÎÎÎÎÌ ÚÎÎÎÎÎÎÎÎÎÎÎÎÌ Û 1 0 0 0 0 Û Û 1 0 0 0 0 Û Û 2 0 0 0 0 Û Û 0 0 2 0 0 Û Û 3 0 0 0 0 Û Û 0 3 0 0 0 Û Û 4 0 0 0 0 Û Û 0 4 0 0 0 Û Û 5 0 0 0 0 Û Û 0 0 5 0 0 Û Û 6 0 0 0 0 Û Û 0 0 0 6 0 Û Û 7 0 0 0 0 Û Û 0 0 0 7 0 Û Û 8 0 0 0 0 Û Û 0 0 0 0 8 Û Û 9 0 0 0 0 Û Û 9 0 0 0 0 Û Û 10 0 0 0 0 Û Û 10 0 0 0 0 Û ÀÎÎÎÎÎÎÎÎÎÎÎÎÙ ÀÎÎÎÎÎÎÎÎÎÎÎÎÙ

Comparison of results of different techniques:

      FRAME¨(BITMAT1+.×DAT)(+/[1]MATS)

ÚÎÎÎÎÎÎÎÎÎÎÎÎÎÌ ÚÎÎÎÎÎÎÎÎÎÎÎÎÎÌ Û 20 7 7 13 8 Û Û 20 7 7 13 8 Û ÀÎÎÎÎÎÎÎÎÎÎÎÎÎÙ ÀÎÎÎÎÎÎÎÎÎÎÎÎÎÙ

This technique does not seem to be very good for multi-way classification.

3. Cross-tabs with APL2

3.1 One-Way Classification (with sorting)

The 1-way classification technique which includes sorting is simple because only one classification variable has to be sorted.

      DAT
1 2 3 4 5 6 7 8 9 10
      FRAME+/¨(SORTZ ZUV)›DAT[“ZUV]

ÚÎÎÎÎÎÎÎÎÎÎÎÎÎÌ Û 20 7 7 13 8 Û ÀÎÎÎÎÎÎÎÎÎÎÎÎÎÙ

3.2 Multi-way-analysis

A more than 1-way classification with sorting is more complex because it is not possible to sort all classification variables one after the other.

Let us start with the classification matrix S and data D.

      ½¨S D
 200 2  200

Define one classification variable with 5 classes and another with 11 classes (random numbers).

      S[;1]„?200½5
      S[;2]„?200½11
      FRAME¨15†[1]¨S D

ÚÎÎÎÎÎÎÌ ÚÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÌ Û 3 3 Û Û 14 76 46 54 22 5 68 68 94 39 52 84 4 6 53 Û Û 1 9 Û ÀÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÙ Û 1 9 Û Û 2 10 Û Û 3 11 Û Û 4 11 Û Û 3 10 Û Û 4 1 Û Û 2 1 Û Û 1 4 Û Û 4 4 Û Û 4 9 Û Û 4 8 Û Û 1 11 Û Û 2 9 Û ÀÎÎÎÎÎÎÙ

We have two programs CROSS_LIST und CROSS_SORT. Both start with an array of zeros. The size of this arrays depends on the numbers of classes in the classification vectors. CROSS_LIST uses Catenate and CROSS_SORT uses selective assignment and thus we get different results.

The results of classification of the first 15 datasets with the two different programs are different with respect to the zeros in some subarrays.

      FRAME FRAME¨(15†[1]S) CROSS_LIST 15†D

ÚÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÌ Û ÚÎÎÎÌ ÚÎÎÎÌ ÚÎÎÎÌ ÚÎÎÎÎÎÎÌ ÚÎÎÎÌ ÚÎÎÎÌ ÚÎÎÎÌ ÚÎÎÎÌ ÚÎÎÎÎÎÎÎÎÎÌ ÚÎÎÎÌ ÚÎÎÎÎÎÌ Û Û Û 0 Û Û 0 Û Û 0 Û Û 0 39 Û Û 0 Û Û 0 Û Û 0 Û Û 0 Û Û 0 76 46 Û Û 0 Û Û 0 6 Û Û Û ÀÎÎÎÙ ÀÎÎÎÙ ÀÎÎÎÙ ÀÎÎÎÎÎÎÙ ÀÎÎÎÙ ÀÎÎÎÙ ÀÎÎÎÙ ÀÎÎÎÙ ÀÎÎÎÎÎÎÎÎÎÙ ÀÎÎÎÙ ÀÎÎÎÎÎÙ Û Û Û Û ÚÎÎÎÎÎÎÌ ÚÎÎÎÌ ÚÎÎÎÌ ÚÎÎÎÌ ÚÎÎÎÌ ÚÎÎÎÌ ÚÎÎÎÌ ÚÎÎÎÌ ÚÎÎÎÎÎÎÌ ÚÎÎÎÎÎÎÌ ÚÎÎÎÌ Û Û Û 0 94 Û Û 0 Û Û 0 Û Û 0 Û Û 0 Û Û 0 Û Û 0 Û Û 0 Û Û 0 53 Û Û 0 54 Û Û 0 Û Û Û ÀÎÎÎÎÎÎÙ ÀÎÎÎÙ ÀÎÎÎÙ ÀÎÎÎÙ ÀÎÎÎÙ ÀÎÎÎÙ ÀÎÎÎÙ ÀÎÎÎÙ ÀÎÎÎÎÎÎÙ ÀÎÎÎÎÎÎÙ ÀÎÎÎÙ Û Û Û Û ÚÎÎÎÌ ÚÎÎÎÌ ÚÎÎÎÎÎÎÌ ÚÎÎÎÌ ÚÎÎÎÌ ÚÎÎÎÌ ÚÎÎÎÌ ÚÎÎÎÌ ÚÎÎÎÌ ÚÎÎÎÎÎÎÌ ÚÎÎÎÎÎÎÌ Û Û Û 0 Û Û 0 Û Û 0 14 Û Û 0 Û Û 0 Û Û 0 Û Û 0 Û Û 0 Û Û 0 Û Û 0 68 Û Û 0 22 Û Û Û ÀÎÎÎÙ ÀÎÎÎÙ ÀÎÎÎÎÎÎÙ ÀÎÎÎÙ ÀÎÎÎÙ ÀÎÎÎÙ ÀÎÎÎÙ ÀÎÎÎÙ ÀÎÎÎÙ ÀÎÎÎÎÎÎÙ ÀÎÎÎÎÎÎÙ Û Û Û Û ÚÎÎÎÎÎÎÌ ÚÎÎÎÌ ÚÎÎÎÌ ÚÎÎÎÎÎÎÌ ÚÎÎÎÌ ÚÎÎÎÌ ÚÎÎÎÌ ÚÎÎÎÎÎÌ ÚÎÎÎÎÎÎÌ ÚÎÎÎÌ ÚÎÎÎÎÎÌ Û Û Û 0 68 Û Û 0 Û Û 0 Û Û 0 52 Û Û 0 Û Û 0 Û Û 0 Û Û 0 4 Û Û 0 84 Û Û 0 Û Û 0 5 Û Û Û ÀÎÎÎÎÎÎÙ ÀÎÎÎÙ ÀÎÎÎÙ ÀÎÎÎÎÎÎÙ ÀÎÎÎÙ ÀÎÎÎÙ ÀÎÎÎÙ ÀÎÎÎÎÎÙ ÀÎÎÎÎÎÎÙ ÀÎÎÎÙ ÀÎÎÎÎÎÙ Û Û Û Û ÚÎÎÎÌ ÚÎÎÎÌ ÚÎÎÎÌ ÚÎÎÎÌ ÚÎÎÎÌ ÚÎÎÎÌ ÚÎÎÎÌ ÚÎÎÎÌ ÚÎÎÎÌ ÚÎÎÎÌ ÚÎÎÎÌ Û Û Û 0 Û Û 0 Û Û 0 Û Û 0 Û Û 0 Û Û 0 Û Û 0 Û Û 0 Û Û 0 Û Û 0 Û Û 0 Û Û Û ÀÎÎÎÙ ÀÎÎÎÙ ÀÎÎÎÙ ÀÎÎÎÙ ÀÎÎÎÙ ÀÎÎÎÙ ÀÎÎÎÙ ÀÎÎÎÙ ÀÎÎÎÙ ÀÎÎÎÙ ÀÎÎÎÙ Û ÀÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÙ

FRAME FRAME¨(15†[1]S) CROSS_SORT 15†D

ÚÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÌ Û ÚÎÎÎÌ ÚÎÎÎÌ ÚÎÎÎÌ ÚÎÎÎÎÌ ÚÎÎÎÌ ÚÎÎÎÌ ÚÎÎÎÌ ÚÎÎÎÌ ÚÎÎÎÎÎÎÎÌ ÚÎÎÎÌ ÚÎÎÎÌ Û Û Û 0 Û Û 0 Û Û 0 Û Û 39 Û Û 0 Û Û 0 Û Û 0 Û Û 0 Û Û 76 46 Û Û 0 Û Û 6 Û Û Û ÀÎÎÎÙ ÀÎÎÎÙ ÀÎÎÎÙ ÀÎÎÎÎÙ ÀÎÎÎÙ ÀÎÎÎÙ ÀÎÎÎÙ ÀÎÎÎÙ ÀÎÎÎÎÎÎÎÙ ÀÎÎÎÙ ÀÎÎÎÙ Û Û Û Û ÚÎÎÎÎÌ ÚÎÎÎÌ ÚÎÎÎÌ ÚÎÎÎÌ ÚÎÎÎÌ ÚÎÎÎÌ ÚÎÎÎÌ ÚÎÎÎÌ ÚÎÎÎÎÌ ÚÎÎÎÎÌ ÚÎÎÎÌ Û Û Û 94 Û Û 0 Û Û 0 Û Û 0 Û Û 0 Û Û 0 Û Û 0 Û Û 0 Û Û 53 Û Û 54 Û Û 0 Û Û Û ÀÎÎÎÎÙ ÀÎÎÎÙ ÀÎÎÎÙ ÀÎÎÎÙ ÀÎÎÎÙ ÀÎÎÎÙ ÀÎÎÎÙ ÀÎÎÎÙ ÀÎÎÎÎÙ ÀÎÎÎÎÙ ÀÎÎÎÙ Û Û Û Û ÚÎÎÎÌ ÚÎÎÎÌ ÚÎÎÎÎÌ ÚÎÎÎÌ ÚÎÎÎÌ ÚÎÎÎÌ ÚÎÎÎÌ ÚÎÎÎÌ ÚÎÎÎÌ ÚÎÎÎÎÌ ÚÎÎÎÎÌ Û Û Û 0 Û Û 0 Û Û 14 Û Û 0 Û Û 0 Û Û 0 Û Û 0 Û Û 0 Û Û 0 Û Û 68 Û Û 22 Û Û Û ÀÎÎÎÙ ÀÎÎÎÙ ÀÎÎÎÎÙ ÀÎÎÎÙ ÀÎÎÎÙ ÀÎÎÎÙ ÀÎÎÎÙ ÀÎÎÎÙ ÀÎÎÎÙ ÀÎÎÎÎÙ ÀÎÎÎÎÙ Û Û Û Û ÚÎÎÎÎÌ ÚÎÎÎÌ ÚÎÎÎÌ ÚÎÎÎÎÌ ÚÎÎÎÌ ÚÎÎÎÌ ÚÎÎÎÌ ÚÎÎÎÌ ÚÎÎÎÎÌ ÚÎÎÎÌ ÚÎÎÎÌ Û Û Û 68 Û Û 0 Û Û 0 Û Û 52 Û Û 0 Û Û 0 Û Û 0 Û Û 4 Û Û 84 Û Û 0 Û Û 5 Û Û Û ÀÎÎÎÎÙ ÀÎÎÎÙ ÀÎÎÎÙ ÀÎÎÎÎÙ ÀÎÎÎÙ ÀÎÎÎÙ ÀÎÎÎÙ ÀÎÎÎÙ ÀÎÎÎÎÙ ÀÎÎÎÙ ÀÎÎÎÙ Û Û Û Û ÚÎÎÎÌ ÚÎÎÎÌ ÚÎÎÎÌ ÚÎÎÎÌ ÚÎÎÎÌ ÚÎÎÎÌ ÚÎÎÎÌ ÚÎÎÎÌ ÚÎÎÎÌ ÚÎÎÎÌ ÚÎÎÎÌ Û Û Û 0 Û Û 0 Û Û 0 Û Û 0 Û Û 0 Û Û 0 Û Û 0 Û Û 0 Û Û 0 Û Û 0 Û Û 0 Û Û Û ÀÎÎÎÙ ÀÎÎÎÙ ÀÎÎÎÙ ÀÎÎÎÙ ÀÎÎÎÙ ÀÎÎÎÙ ÀÎÎÎÙ ÀÎÎÎÙ ÀÎÎÎÙ ÀÎÎÎÙ ÀÎÎÎÙ Û ÀÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÙ

After Each-accumulation and Transpose we can compare the results for the whole dataset.

      FRAME¨³¨+/¨¨(S CROSS_LIST D)(S CROSS_SORT D)

ÚÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÌ ÚÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÌ Û 201 158 188 396 432 Û Û 201 158 188 396 432 Û Û 64 142 118 174 49 Û Û 64 142 118 174 49 Û Û 99 345 285 0 216 Û Û 99 345 285 0 216 Û Û 209 177 259 52 182 Û Û 209 177 259 52 182 Û Û 199 154 76 44 0 Û Û 199 154 76 44 0 Û Û 67 401 130 257 173 Û Û 67 401 130 257 173 Û Û 179 131 115 231 83 Û Û 179 131 115 231 83 Û Û 136 199 352 408 329 Û Û 136 199 352 408 329 Û Û 357 242 236 318 253 Û Û 357 242 236 318 253 Û Û 220 131 237 122 102 Û Û 220 131 237 122 102 Û Û 52 194 201 51 448 Û Û 52 194 201 51 448 Û ÀÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÙ ÀÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÙ

For frequency counts we take a logical function.

      FRAME RAND +/¨0<S CROSS_SORT D

ÚÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÌ Û 3 2 1 5 4 2 4 4 6 4 3 38 Û Û 2 3 5 3 3 7 2 3 5 2 4 39 Û Û 3 3 7 6 2 3 3 6 5 4 4 46 Û Û 6 3 0 1 2 4 4 7 4 3 3 37 Û Û 7 1 5 3 0 3 2 5 4 2 8 40 Û Û 21 12 18 18 11 19 15 25 24 15 22 200 Û ÀÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÙ

We can also look for local maxima in the subarrays.

      FRAME —/¨S CROSS_SORT_SP D

ÚÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÌ Û 89 33 99 77 90 60 92 68 99 85 42 Û Û 94 74 99 94 85 92 92 91 74 77 77 Û Û 95 69 95 100 66 60 66 100 76 94 89 Û Û 92 78 0 52 37 91 87 100 100 84 36 Û Û 90 49 91 89 0 95 73 94 91 74 95 Û ÀÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÙ

4 Cross-tab-programs with 4 methods

4.1 Method BITMA

The classical APL-BITMA-technique uses outer and inner matrix product.

        ’CROSS_BITMA[Œ]’
[0]   R„class CROSS_BITMA dat
[1]   ©Crosstabs with APL, outer and inner matrix product
[2]   R„dat+.×STRUC class  ©STRUC forms the bitmatrix

To get the multi-dimensional bitmatrix I apply my old program STRUC.

      ’STRUC[Œ]’
[0]   R„STRUC D;H;I;Z;S
[1]   © forms bitmatrix R
[2]    (Z S)„½D           ©SHAPE
[3]    I„1
[4]    R„Z½1
[5]   LOOP:R„(R×[1]D[;I])°.=¼—/D[;I]
[6]    I„I+1
[8]    …(IˆS)/LOOP   © Nr. of col.

4.2 Method LIST

With CROSS_LIST we do not have to sort the classification vectors or data. Each line of the classification matrix is the n-way index to catenate the data at the right place.

      ’CROSS_LIST[Œ]’
[0]   R„class CROSS_LIST dat;I;big;del;ind;sel
[1]   © program: without sorting class and data
[2]   © class: classification variable(s)
[3]   © dat: data variable
[4]    …(2=½½class)/OK1                   © is class matrix?
[5]    class„(2†(½class),1)½class         © always matrix
[6]   OK1:…(0=+/,class=0)/OK2             © containing zero
[7]    sel„0=+/class=0                    © vector sel
[8]    class„sel/[1]class ª dat„sel/dat   © selection
[9]   OK2:big„—/[1]class ª I„0 ª R„big½0  © initial values
[10]   ind„›[2]class                      © assignment vector
[11]   del„–¨(1†½class)½›'((Iœind)ÞR)„›(¹(Iœind)ÞR),dat[I„I+1]'

4.3 Method SORT

With CROSS_SORT we have disadvantages (sort) and advantages (grouping with Partition and Unique). First we sort the classification matrix like a character matrix.

      ’CROSS_SORT[Œ]’
[0]   R„class CROSS_SORT dat;I;big;del;ind;par;sel;sortcl;uni
[1]   © program: sorting class und data
[2]   © class: classification variable(s)
[3]   © dat: data variable
[4]    …(2=½½class)/OK1                     © class matrix?
[5]    class„(2†(½class),1)½class           © always matrix
[6]   OK1:…(0=+/,class=0)/OK2               © containing zero
[7]    sel„0=+/class=0                      © vector sel
[8]    class„sel/[1]class ª dat„sel/dat     © selection
[9]   OK2:big„—/[1]class                    © MAX of class
[10]   R„big½I„0                            © initial values
[11]   ind„ŒAV“•class                       © sort index
[12]   sortcl„(›[2]class)[ind]             © partition vector
[13]   par„(uni„(Ÿ/¨sortcl¬(1‡sortcl),¯1)/sortcl)¼sortcl©unique
[14]   dat„par›dat[ind]                     © partition
[15]   del„–¨(1†½uni)½›'((Iœuni)ÞR)„dat[I„I+1]'

4.4.1 CROSS_SORT_SP (IBM-APL2)

The basic idea of CROSS_SORT_SP is to accelerate CROSS_SORT by simplifying the calculations after encoding the index values of the classification matrix with inner product and then decoding before assignment.

     ’CROSS_SORT_SP[Œ]’
[0]   R„class CROSS_SORT_SP dat;I;big;code;del;ind;par;sel;uni
[1]   © class: classification variable(s)
[2]   © dat: data variable; SP(speed): encode, decode
[3]    …(2=½½class)/OK1                     © is class matrix?
[4]    class„(2†(½class),1)½class           © always matrix
[5]   OK1:…(0=+/,class=0)/OK2               © containing zero
[6]    sel„0=+/class=0                      © vector sel
[7]    class„sel/[1]class ª dat„sel/dat     © selection
[8]   OK2:big„—/[1]class                    © MAX of class
[9]    R„big½I„0                            © initial values
[10]   code„class+.ײ×\¯1‡²(del„1+big),1 © encode (inner prod.)
[11]   par„code„code[ind„“code]             © partition vector
[12]   uni„(code¬(1‡code),¯1)/code          © unique
[13]   dat„par›dat[ind] ª sel„›[1]del‚uni   © partition, decode
[14]   del„–¨(1†½sel)½›'((Iœsel)ÞR)„dat[I„I+1]'© loops

4.4.2 CROSS_SORT_SP (Dyalog and Manugistics APL)

With Dyalog and and Manugistics APL we need other partition vectors but we have the advantage of the direct assignment (no pseudo-loops).

     R„class CROSS_SORT_SPd dat;big;code;del;ind;par;sel;uni;par
[1]   © class: classification variable(s), VERSION Dyalog
[2]   © dat: data variable; SP(speed): encode, decode
[3]    …(2=½½class)/OK1                   © is class matrix?
[4]    class„(2†(½class),1)½class         © always matrix
[5]   OK1:…(0=+/,class=0)/OK2             © containing zero
[6]    sel„0=+/class=0                    © vector sel
[7]    class„sel/[1]class ª dat„sel/dat   © selection
[8]   OK2:big„—/[1]class                  © MAX of class
[9]    R„big½0                            © initial values
[10]   code„class+.ײ×\¯1‡²(del„1+big),1  © encode (inner prod.)
[11]   code„code[ind„“code]               © partition vector
[12]   uni„(par„(¯1²code)¬code)/code      © unique
[13]   dat„par›dat[ind] ª sel„‡[1]del‚uni © partition, encode
[14]   R[sel]„dat                         © replace zero

5. Performance tests

More and more data are taken. Tests were carried through on a Sun-workstation with APL2.

Workspace available?

      ŒWA
6076144

5.1 Two-way classification

We start with 5000 sets of data.

      S„5000 2½S
      D„5000½D
      FRAME¨½¨S D
 ÚÎÎÎÎÎÎÎÎÌ  ÚÎÎÎÎÎÎÌ 
 Û 5000 2 Û  Û 5000 Û 
 ÀÎÎÎÎÎÎÎÎÙ  ÀÎÎÎÎÎÎÙ 
      TIME 0
      ½S CROSS_BITMA D
5 11
      ½+/¨S CROSS_LIST D
5 11
      ½+/¨S CROSS_SORT D
5 11
      ½+/¨S CROSS_SORT_SP D
5 11
      FRAME TIME 1
ÚÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÌ
Û    Time Procent      Program       Û
Û  1 7.29 48.31013917  CROSS_LIST    Û
Û  1 5.37 35.58648111  CROSS_SORT    Û
Û  1 1.05  6.958250497 CROSS_BITMA   Û
Û  1 0.85  5.63286945  STRUC         Û
Û  1 0.35  2.319416832 CROSS_SORT_SP Û
Û  1 0.18  1.192842942 TIME          Û
ÀÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÙ
      S„10000 2½S
      D„10000½D
      FRAME¨½¨S D
 ÚÎÎÎÎÎÎÎÎÎÌ  ÚÎÎÎÎÎÎÎÌ 
 Û 10000 2 Û  Û 10000 Û 
 ÀÎÎÎÎÎÎÎÎÎÙ  ÀÎÎÎÎÎÎÎÙ 
      TIME 0
      ½S CROSS_BITMA D
5 11
      ½+/¨S CROSS_LIST D
5 11
      ½+/¨S CROSS_SORT D
5 11
      ½+/¨S CROSS_SORT_SP D
5 11
      FRAME TIME 1

ÚÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÌ Û Time Procent Program Û Û 1 14.89 46.31415241 CROSS_LIST Û Û 1 12.59 39.16018663 CROSS_SORT Û Û 1 2.14 6.6562986 CROSS_BITMA Û Û 1 1.73 5.381026439 STRUC Û Û 1 0.63 1.959564541 CROSS_SORT_SP Û Û 1 0.17 0.528771384 TIME Û ÀÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÙ

The speadsheet programs such as MS Excel are normally limited to 2*14 lines of data. As we see this is also the limit for CROSS_LIST.

      S„16384 2½S
      D„16384½D
      FRAME¨½¨S D
 ÚÎÎÎÎÎÎÎÎÎÌ  ÚÎÎÎÎÎÎÎÌ 
 Û 16384 2 Û  Û 16384 Û 
 ÀÎÎÎÎÎÎÎÎÎÙ  ÀÎÎÎÎÎÎÎÙ 
      TIME 0
      ½S CROSS_BITMA D
5 11
      ½+/¨S CROSS_LIST D
WS FULL
      ((Iœind)ÞR)„›(¹(Iœind)ÞR),dat[I„I+1]
                  ^            ^
      ½+/¨S CROSS_SORT D
5 11
      ½+/¨S CROSS_SORT_SP D
5 11
      FRAME TIME 1
ÚÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÌ
Û    Time  Procent     Program        Û
Û  1 24.87 75.38648075 CROSS_SORT     Û
Û  1  3.47 10.51833889 CROSS_BITMA    Û
Û  1  2.93  8.88147923 STRUC          Û
Û  1  0.99  3.00090936 CROSS_SORT_SP  Û
Û  1  0.17  0.51530766 TIME           Û
ÀÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÙ

The limit for CROSS_SORT is the following:

      S„50000 2½S
      D„50000½D

FRAME¨½¨S D ÚÎÎÎÎÎÎÎÎÎÌ ÚÎÎÎÎÎÎÎÌ Û 50000 2 Û Û 50000 Û ÀÎÎÎÎÎÎÎÎÎÙ ÀÎÎÎÎÎÎÎÙ TIME 0 ½S CROSS_BITMA D 5 11 ½+/¨S CROSS_SORT D WS FULL CROSS_SORT[13] par„(uni„(Ÿ/¨sortcl¬(1‡sortcl),¯1)/sortcl)¼sortcl © unique ½S CROSS_SORT_SP D 5 11

FRAME TIME 1 ÚÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÌ Û Time Procent Program Û Û 1 10.62 8.058274528 CROSS_BITMA Û Û 1 8.66 6.57106002 STRUC Û Û 1 2.9 2.200470445 CROSS_SORT_SP Û Û 2 0.41 0.311100994 TIME Û ÀÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÙ

The limit for BITMA is the following:

      S„100000 2½S
      D„100000½D
      FRAME¨½¨S D
 ÚÎÎÎÎÎÎÎÎÎÎÌ  ÚÎÎÎÎÎÎÎÎÌ 
 Û 100000 2 Û  Û 100000 Û 
 ÀÎÎÎÎÎÎÎÎÎÎÙ  ÀÎÎÎÎÎÎÎÎÙ 
      TIME 0
      ½S CROSS_BITMA D
WS FULL
STRUC[6]  SCHLEIFE:R„(R×[1]D[;I])°.=¼—/D[;I]
                    ^           ^
      ½+/¨S CROSS_SORT_SP D
5 11
      FRAME TIME 1
ÚÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÌ
Û    Time Procent     Program        Û
Û  1 6.23 97.19188768 CROSS_SORT_SP  Û
Û  1 0.18 2.808112324 TIME           Û
ÀÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÙ

If we count the number of cases and summarize over all axes we obtain:

      FRAME RAND +/¨0<S CROSS_SORT_SP D
ÚÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÌ
Û  1500 1000  500 2500 2000 1000 2000  2000  3000 2000  1500  19000 Û
Û  1000 1500 2500 1500 1500 3500 1000  1500  2500 1000  2000  19500 Û
Û  1500 1500 3500 3000 1000 1500 1500  3000  2500 2000  2000  23000 Û
Û  3000 1500    0  500 1000 2000 2000  3500  2000 1500  1500  18500 Û
Û  3500  500 2500 1500    0 1500 1000  2500  2000 1000  4000  20000 Û
Û 10500 6000 9000 9000 5500 9500 7500 12500 12000 7500 11000 100000 Û
ÀÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÙ

5.2 Four-way-classification

      S„S,S
      FRAME¨½¨S D
 ÚÎÎÎÎÎÎÎÎÎÎÌ  ÚÎÎÎÎÎÎÎÎÌ 
 Û 100000 4 Û  Û 100000 Û 
 ÀÎÎÎÎÎÎÎÎÎÎÙ  ÀÎÎÎÎÎÎÎÎÙ 
      TIME 0
      ½+/¨S CROSS_SORT_SP D
5 11 5 11
      FRAME TIME 1
ÚÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÌ
Û     Time Procent     Program       Û
Û  1  8.1  97.94437727 CROSS_SORT_SP Û
Û  1  0.17  2.05562273 TIME          Û
ÀÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÙ

5.3 Five-way-classification

      S„S,S[;1]
      FRAME¨½¨S D
 ÚÎÎÎÎÎÎÎÎÎÎÌ  ÚÎÎÎÎÎÎÎÎÌ 
 Û 100000 5 Û  Û 100000 Û 
 ÀÎÎÎÎÎÎÎÎÎÎÙ  ÀÎÎÎÎÎÎÎÎÙ 
      TIME 0
      ½+/¨S CROSS_SORT_SP D
5 11 5 11 5
      FRAME TIME 1
ÚÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÌ
Û    Time  Procent     Program        Û
Û  1 10.44 98.49056604 CROSS_SORT_SP  Û
Û  1  0.16  1.50943396 TIME           Û
ÀÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÎÙ

6. Classification Variables

As in the real world classification variables are often nominal or character variables and not numeric or integer. For these cases there are a lot of efficient algorithms in APL to transform these variables from character to numeric (Unique, Match, or Index Of) before starting the cross-tabulation programs.

7. References

  1. Berquist, G.: “APL Advanced techniques and utilities”, Zark Incorporated 1987, pp. 92-115
  2. Jaffe, S. B.: “Topics for a second Course in APL”, APL Quote Quad 1986, Vol 16, Nr. 4, p. 118
  3. Sykes. A: “Counting in APL”, Vector 1989, Vol 5, Nr. 2, p. 9
  4. Sykes, A: “Counting in APL (Part Two)”, Vector 1989, Vol 5, Nr. 3, p. 19
  5. Thomson, N.: “APL Programs for the mathematical Classroom”, Springer 1989, pp. 109-112
  6. Langlet, G.: “APL RISC programming Style”, Vector 1990, Vol 6, Nr. 2, p. 23
  7. Berquist, G. “Frequency Counts, Accumulations and Cross-Tabulations in APL”, APL Quote Quad 1993, Vol 24, Nr. 2, p. 24
  8. Karman, J. “Plus/Reducing Arrays”, APL Quote Quad 1994, Vol 25, Nr. 2, p.24
Martin Barghoorn
Technical University of Berlin
Fachbereich Informatik
Sekr. FR 6-9
Franklinstr. 28
D-14129 BERLIN, Germany

Tel: +49 30 314 24392
FAX: +49 30 314 21103
e-mail: barg@cs.tu-berlin.de
www: http://www.cs.tu-berlin.de/fachgebiete/stat/apl.html