Tuesday, April 28, 2020

Notes q

You Tube - Introduction to KDB+ and q
An Overview of the q language



You Tube - Introduction to KDB+ and q
========================================
Tutorial: https://www.youtube.com/watch?v=8eoysfqO3UY&list=PLypX5sYuDqvrwBD2EMWadIMiTqJZmVsqm
The following is a transcribe of the details in the above introduction course. It is meant as rough notes to capture the syntax and used as a reference, and may not give good worded descriptions.

#lesson 1
NO: control flow, loop, if-else, threads, shared globals, objects, inheritance
interpretive language, dynamically type, vector programming, functional programming, list, dictionaries, tables (1st class items, based on list, dicts)

#lesson 2
function has TYPE
42  -> 64 bit
42i -> 32 bit
4%2 -> % is division
2f  -> float

#lesson 3
1 2 3+10 20 30     -> vecotr operation, results is 11 22 33, spaces are used to separate list items
count 1.0 2.0 3.0  -> 3 is the result
1+10 20 30         -> 11 21 31  scalar(atom) + vector
# All operations have precedence, operations done right to left
2*3+4              -> (3+4) =7 then *2  =14
til 100            -> generates 01 2 ... up to 100 numbers
count til 100      -> 100
101+2*til 1000     -> generate (0 .. 999), then x 2, then add 100 to every item. Result: 101 103 105 ....

#lesson 4
0b      -> false
1b      -> true
101b    -> list of booleans true, false, true, NOT separated by space, and NOT a bitmask representation
42=6*7  -> 1b (true)
1 2 3=10 2 30  -> 010b  vector results of false, true, false
2018.01.01     -> actually count of days since millenium (yyyy.mm.dd)
2000.01.01=0   -> 1b True
1999.12.31=-1  -> 1b True
2000.02.01-2000.01.01  -> 31i (31 days, in INTEGER 'i') date arithmetic
2000.01.01+til 31      -> make 31 integers then add to 1st Jan 2000 -> gives the dates of that month. List-Date arithmetic
2000.01m    -> month
2000.01m=0  -> 1b    Comparing the 1st month since millennium
2000.02m=1  -> 1b    Comparing the 2nd month since millennium
2000.01.01=2000.01m  Comparing 1st day ever, to 1st month ever, q assumes the month is 1st day of month when comparing to date, hence true.

#lesson 5 - Casting
`long$1.0  -> cast to long
`float$1   -> cast to float
`boolean$1 -> cast to boolean
`date$31   -> cast to date format, ie 31 days since millennium is 2000.02.01
2000.01m+til 12   -> list of 12 months  2000.01 2000.02 2000.03 ...
15+2000.01m+til 3 -> 2001.04 2001.05 2001.06m    1st 3 months, after 15 months from the millenium

# lesson 6 - Operations on List
Declarative programming - what we want in q. Other languages are imperative, ie write loops to tell how to compute.
0 +/ 10 20 30 40 50    Initialise to zero then add the items on the list. The +/ symbol called 'over' in q, 'fold' in functional language.
(+/) 10 20 30 40 50    Same as above - add across a list. Does not need to be initialised.
'/'                    The slash is ALSO behaving like an iterator
sum 10 20 30 40 50     ->  same as above
(*/) 1+til 5           Does Factorial(5)
4|5                    -> 5 returns larger number
10&2                   -> 2 returns smaller number
0b|1b                  -> 1b  Larger operator for Boolean is equivalent to 'OR' operator
0b&1b                  -> 0b  Smaller operator for Boolean is equivalent to 'AND' operator
(|/) 10 30 20          -> 30
max 10 30 20           -> 30
(&/) 10 30 20          -> 10
min 10 30 20           -> 10
(+\) 10 20 30 40 50    -> 10 30 60 100 150. Backslash means create a CUMULATIVE / RUNNING list. also called 'sums'
(&\) 40 20 30 10 50    -> 40 20 20 10 10. Running minimum. Also called 'mins'. Similarly with 'maxs'

# Lesson 7 - Functions
{[x]x*x}               -> [x] is the parameter
{[x]x*x}5              -> substitutes 5 into x. Result is 5*5. Can also use {[x]x*x}[5]
{x*x}5                 -> implicit parameter, don't need to declare.

# Lesson 8 - Newton Raphson and Fibonacci Sequence
Say Function is f(x) 2-x^2. So f'(x) = -2x. NewtRaph algo is xn = xn - f(x)/f'(x)
{[xn]xn+(2-xn*xn)%2*xn}/[1.0]   -> 1.414214   .... result from Newton Raphson using 1.0 as initial point, being put into the function on the left iteratively.
{[xn]xn+(2-xn*xn)%2*xn}\[1.0]   -> 1  1.5  1.4.....  shows the intermediate results from Newton Raphson
\P 16                   -> shows 16 decimals. iteration stops until diff is 1e-14
2#10 20 30 40 50        -> 10 20   '#' retrieves the integers from the left of list
-2#10 20 30 40 50       -> 40 50   '#' retrieves the integers from the right of list
10 20, 100 200 300      -> 10 20 100 200 300      joins lists
{x, sum -2#x}/[10;1 1]  -> returns the Fibonacci Sequence. Iterate 10 times, starting with [1 1]

# Lesson 9 - Function, Variables
deltas 110 120 130      -> 110 10 10   takes differences between two numbers. The first result assumes difference with zero.
deltas sums 110 120 130 -> is the same as sums deltas 110 120 130   -> result is 110 120 130. Invariance in q.
a:42                    -> assign value of 42 to variable a.
buys:2 1 4 3 5 4        ->
sell:12                 -> want to sell 12 units from the various buys
sums buys               -> 2 3 7 10 15 19
sell&sums buys          -> 2 3 7 10 12 12 - creates a list of minimum between two lists.
deltas sell&sums buys   -> 1 2 4 3 2 0    - list of what how should buy

# Lesson 10 - Tables
Table = Collection of columns in q. Operations are column based, hence vector operations.
dates:2018.01.01+10000000?31  -> list of 10 million numbers between 0(including BUT not stated) and 31(stated, but EXCLUSIVE)
times:10000000?24:00:00.0000  -> 10 million random nums, between 0 (including, BUT not stated) and midnight (EXCLUSIVE)
qtys:100*1+10000000?100   -> 10 mil random numbes from 0 to 100, then +1, then * 100
ixs:10000000?3            -> 10 mil rand num, from 0 1 2.          store random numbers in ixs
10 20 30 ixs              -> 10 10 30 20 30 20 ......     Since ixs has 3 unique values, so the 10 20 30 will apply to 10 20 30, MAPPED into the 10 million entries of ixs.
syms:`appl`amzn`googl ixs -> `googl`aapl`amzn`googl`googl ....     applies symbols to list, store in syms. The 3 stock quotes is MAPPED to the 3 unique values of ixs for all 10 million values of ixs
172.0 1189.0 1073.0 ixs   -> maps the 3 unique values into the 10million numbers of ixs
pxs:(1+10000000?.03) 172.0 1189.0 1073.0 ixs           -> adds some random noise to the 10 million stock prices.
t:([] date:dates;time:times;sym:syms;qty:qtys;px:pxs)  -> creates a table, with specified column names, and columns of data
t:`date`time xasc t       -> sort ascending time, then date
5#t                       -> returns top(head) 5 rows

# Lesson 11 - qSQL
select date,time,qty,px from t where sym=`aapl 
\t select date,time,qty,px from t where sym=`aapl    -> returns time in milliseconds
first 10 20 30 40 50   -> first item on list
last 10 20 30 40 50    -> last item on list
select open:first px,close:last px by date,time from t where sym=`aapl

#Lesson 12 - Complex Queries(*)
4 3 2 1 wavg 10 20 30 40      -> 20f  calculate weighted average
5 xbar 0 1 2 3 4 5 10 11 21   -> 0 0 0 0 0 5 10 10 20   returns bucketed result. Useful for time domain values that needs bucketing
select max px - mins px from t where sym=`appl   -> is the Max idealised profit (ie buy at the lowest price of the day and sell at the highest price of the day. The function is the same as select (max (px - (mins px from t where sym=`appl) )). The mins is a running cumulative vector of the minimum, while the max just returns the single value.

#Lesson 13 - Interprocess Communication(*)
Server and Client - ie running two different q.
On Server
\p 4242            Starts the server on port 4242
On Client
h:hopen`::4242     Where machine name is omitted between two colons, since it is on the same machine. h is the handler
h "6*7"            Do compute on remote server
sq:{x*x}           Send function to remote server
h (sq; 5)          Execute on remote server
DON'T DO ABOVE since open ports are dangerous
On Server
cub3:{0N!x*x*x}    0N! displays computation on local server first, before shifting to client
On Client
h (`cub3; 5)       Safer this way, CALL the name of the function. First Argument is function name, the rest are arguments

#Lesson 14 - Callbacks(*)
Previously the client server were synchronous, meaning the parties wait for each other after sending a message.
For Asynchronous, when message is sent, the sender does not need to wait.
The handler h is an 'integers'.
On Client
h                      -> 3i  is the integer 3. this is a function handler
(neg h) (`cub3;5)      -> h is positive, neg h is negative. So "(neg h) (`cub3; 5)" is an ASYNCHRONOUS call.
Callback: Without callback, an asynchronous function would send message to server, then receives no response. With callback provided on the server, the server will send the result back whenever it is ready.
Client uses 'h' as its handler to call the server
Server uses '.z.w' as its handler of the function send from the client.  .z.w returns 6i

On Client (FINAL)
continue:{0N!x}                   -> displays x over here
(neg h) (`worker; 5; `continue)   -> (neg h) for asynchronous, call process 'worker' on server, with argument '5', use the callback label 'continue'. First Argument `worker is function name, the rest are arguments [5; `continue]

On Server (FINAL)
cub3:{0N!.z.w; x*x*x}
worker:{[arg; callback] r:cub3 arg; (neg .z.w) (callback; r)  -> 'worker' is process called by client, defined here on server. [arg; callback] are the parameters passed from client. 'callback' is the dummy name corresponding to 'continue'. Runs cub3 on arg, store result in r. Then (neg .z.w) triggers the callback to client asynchronously. Thirdly execute the 'callback' (which is 'continue') with result r.

# Lesson 15 - I/O
Strings - abit different than other languages. Strings are a list of characters, not a primitive type.
"jab"    is a list
`jab     is a variable name
Binary files - operations have 1 in their names
Text files - operations have 0 in their names
File records are list of strings to q. Which means they are a list of list of characters to q.

("So long"; "and thanks"; "for all the dish")    - a list of list of chars
count ("So long"; "and thanks"; "for all the dish")    -> three list items
`:/data/solong.txt  0: ("So long"; "and thanks"; "for all the dish")     -> `:/data/solong.txt   returns this file handler as acknowledgment
'                  error messages are 'Straight ticks'
`                  File handlers are 'back ticks'
`:                 is the file handler or names a file.
/data/solong.txt   is the local path
0:                 is the name of the WRITE to text file function
read0              is the name of the READ from text file operation
read0 `:/data/solong.txt                                  -> prints out the lines of the file
`:/data/answer.txt  0: txt,txt:read0 `:/data/solong.txt   -> `:/data/answer.txt   returns this file handler as acknowledgment


Date - starts from the millennum
Vector Column operations - not Rows
select from 10million rows in 150ms
Single line of code for even for non-trivial functions

select max px - mins px from t where sym=`appl    Algorithm for Idealised Profit in Trading (sym is symbol column, px is price column of table t )
{[xn]xn+(2-xn*xn)%2*xn}\[1.0]                     Algorithm for Newton Raphson with initial point as 1.0
{x, sum -2#x}/[10;1 1]                            Algorithm for Fibonacci Sequence, initial points 1 1 , for 10 iterations



An Overview of the q language
================================

'q' is a proprietary language build on top of an existing 'k' language (developed by ex-Morgan Stanley computer scientist Arthur Whitney) and the kdb+ datatbase (The 'q' language / kdb+ is free for non-commercial use). They are all proprietary products of Kx Systems and it is perhaps more widely known (if not the de facto) in the finance and investment banking industry more than any other industries. As such, it is aimed at time-series based data, large volumes and high transfer rates. Very complex applications can be written in the 'q' language with an extremely small footprint. A large number of Altair's real time visualisation software, Panopticon, connect to kdb+ databases and utilises q.

A recent exploration into the depths of the q language reveals a very sophisticated and expressive computational programming language. It is both like and unlike various programming languages. Perhaps the first point is the vector programming underlying its core, yet this language may not be known widely in the HPC community (or perhaps I have not looked wide enough while I was in HPC, it happens). Vector supercomputers once ruled the HPC world, but when they approached extinction, the vector processing capability were re-incarnated into commodity CPUs and is now powering our PCs (since the days of Intel MMX, SSE, etc...). And it appears that around that similar time shortly after the Millennium, 'q' was born independently of Vector Supercomputers or vectorised CPU instructions, but rather tracing its origin back to the 60s from APL (A Programming Language), thus having both vector and functional programming influence.

For reference, a free book is available at: https://code.kx.com/q4m3/0_Overview/
And a Tutorial series is at: https://www.youtube.com/playlist?list=PLypX5sYuDqvrwBD2EMWadIMiTqJZmVsqm

For those coming from scientific, computational, HPC background, thinking in terms of vectors will ease our understanding to 'q'. For the general programmers who are familiar with C, Java, Python, and the rest, the good advice from the YouTube link would be to throw away your understanding of control flow, loop, if-else, threads, shared globals, objects, inheritance. The syntax itself is short, succinct and raw, yet expresses beautifully the operations it performs. The following will describe some of the noteworthy things about 'q' which are different than the current popular programming languages.

'q' is similar to many scripting language, ie interpretative (no compilation), dynamically type (no pre-declaration of data type). The two fundamental data structures are lists and dictionary, again no surprises here. However it is the vector operations that are perhaps new to most modern day programmers. For example a vector 1 2 3 can be added directly to another vector 10 20 30. Yet these vector expressions should be reminiscent of those in Matlab and Fortran, but the similarities don't extend much further. The syntax is perhaps quite different (though perhaps not too dissimilar to other functional programming languages). The vector (list) items are delimited by spaces (not commas). The mathematical operators has equal precedence (no BODMAS rule) and the order is designed so that the operations are performed in the right-to-left order. For example: 2*3+4 is actually 2*7, which is 14. This right-to-left and single line way of writing code is perhaps the biggest change a programmer of other languages need to overcome.

Control flow and loops goes out the window because an operation acts on all the items in the vector. There is no point in having an explicit index to loop over. As such, an operation over 10 or 10 million items can be written in one line. In fact, entire functions or methods are written in one line, as shown in examples later. In terms of data types, there are the familiar char, boolean, integer, float and so on. Strings are like in Fortran or old C, a string in q is actually a character of arrays. The really most interesting data type I found is the Date representation. Instead of some arbitrary representation having an initial value at 1970 or 1900; q's date are actually integers starting from 1 Jan 2000. This means date computations benefit from the speed of raw integer operations, as well as being able to perform fancy date arithmetic.

List operations are powerful, since this is one of the core data structures of q. List, or vectors can be operated with another vector or scalar, not dissimilar at all to their mathematical counterparts. In addition to aggregate operations like 'max', 'min', 'sum', there are more interesting versions 'maxs', 'mins', 'sums' which actually returns a cumulative vector of the aggregation process. The highlight of the 'q' language of its expressiveness and power would be its function definitions. Instead of describing, ponder upon the following one-liner, and consider how such functions would need to be written in Python, Java, C, etc:
    Algorithm for Newton Raphson with initial point as 1.0
{[xn]xn+(2-xn*xn)%2*xn}\[1.0]
    Algorithm for Fibonacci Sequence, inital points 1 1 , for 10 iterations
{x, sum -2#x}[10;1 1]
    Algorithm for Idealised Profit in Trading (sym is symbol column, px is price column of table t )
select max px - mins px from t where sym=`appl

As for Data Analytics and complex data processing, there are tables, qSQL and a few more useful functions. Tables are collection of columns, like Fortran/Matlab but opposite to C-based languages. Columns in tables generally mean they are of the same data type that means they are easily stored as vectors. For q, this means performing operations on large datasets are extremely fast. Combining the single line functional syntax with vector based operations, very complex yet efficient programs can be written in a terse manner. The qSQL syntax enables very SQL-like expressions to be written. This may be a comfort for SQL programmers, but it still pays to understand the underlying concepts of q, since there are fundamental differences that can lead to incorrect expectations. Native functions like 'xbar' that creates bucketing are extremely powerful when applied to time-series based analysis. In an example of 10million rows, a certain select operation took 150ms.

Most programming languages have the capability of File I/O. q also has File I/O for both text and binary files (again there is the similarity with Fortran's binary files capability, rather than the streams based files I/O of other languages). Not surprisingly by now, q's way or reading and writing files are one-liners. There is no need to open or close a file or the many other lines of instructions needed by other languages. Besides File I/O, q is also capable of Interprocess Communications and Asynchronous processes. Together, these capabilities enable q to write client-server applications without using any other external tools.

From this brief high level exploration, q seems to be an interesting, powerful and useful programmig language. It has the expressiveness of functional programming, the power of vector processing and the applicability to real high volume data computational needs. The fact that it is born within the Financial Trading sector by no means exclude it from being used in other industries. Whereever there is large numerical datasets that requires computation, q may be a good choice. These other industries that may benefit include: data science, scientific research,  defence, telecommunications, etc. In terms of providing real-time visualisations of the large and fast datasets, Altair's Panopticon with kdb+/q would make a formidable combination.

No comments: