<2024-11-14 Thu>

Array programming LLM Building Blocks with BQN

This post on HN made me interested in array programming, APL, and their ilk, something I had not done before. Seeing Uiua and the BQN compiler code tempted me to dive in, and that is what I have been doing for the last two weeks.

Here are some of my first functions and my learnings from playing with BQN. BQN is downright amazing. I used the online repl for most of my exploration, though setting up Emacs was not too difficult. Yes, the character set is difficult, especially to type (using the mouse to point and click? Ugh! But it works), but over a few days it became quite easy to read. In the repl, clicking on a symbol brings up the documentation for that entity.

I explain my understanding of the combinators and trains that enable tacit programming in BQN. Towards the end, I share some common LLM building blocks implemented in BQN, for fun.

1. Factorial

Let's start small. Here is the code for factorial:

n ← 5
×´ 1+↕n

This should be easy enough to follow.
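For readers who, like me, come from more conventional languages, here is roughly the same fold-over-a-range in plain Python (my gloss for comparison, not part of the BQN):

```python
import math

n = 5
# ×´ 1+↕n : fold multiplication over the range 1..n
fact = math.prod(range(1, n + 1))
print(fact)  # 120
```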

2. Some more simple examples

Palindrome:

a ← 1‿0‿0‿1
×´ a=⌽a

Compress (P9 from the 99 Problems Listing):

a ← 2‿0‿0‿2‿2‿3‿3‿3‿0
(0≠a-»a)/a
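A rough Python gloss of both snippets (mine, for comparison only):

```python
a = [1, 0, 0, 1]
# ×´ a=⌽a : compare a with its reverse, then multiply the booleans together
is_palindrome = all(x == y for x, y in zip(a, reversed(a)))
print(is_palindrome)  # True

b = [2, 0, 0, 2, 2, 3, 3, 3, 0]
# (0≠a-»a)/a : keep each element that differs from its predecessor
compressed = [x for i, x in enumerate(b) if i == 0 or x != b[i - 1]]
print(compressed)  # [2, 0, 2, 3, 0]
```

One subtlety: » shifts in a fill of 0, so the BQN expression compares the first element against 0 rather than keeping it unconditionally; that is fine for this input.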

3. Some more complex examples

Run length encode (P11 in the 99 Problems set):

a ← 5‿5‿5‿5‿2‿2‿3‿1‿1
b ← (0≠a-»a) 
c ← (¯1↓«b∾1) / 1+↕≠b
(b/a) ⋈¨ c-»c
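Python gets this one almost for free via itertools.groupby, which makes a handy cross-check (my code, not the post's):

```python
from itertools import groupby

a = [5, 5, 5, 5, 2, 2, 3, 1, 1]
# (b/a) ⋈¨ c-»c : pair each run's value with its length
rle = [(k, len(list(g))) for k, g in groupby(a)]
print(rle)  # [(5, 4), (2, 2), (3, 1), (1, 2)]
```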

Sieve of Eratosthenes:

n ← 10
m ← 1 + ⍉ n‿n ⥊ n / ↕n
o ← ⍉ 0 = (1+↕n) |¨ m
p ← 2 = +´˘ o # technically needs 2 ´, but I don't understand the distinction yet
(p≠0) / (1+↕n)
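The BQN above is really divisor counting rather than a true incremental sieve: build an n×n remainder table, count each number's divisors, and keep the numbers with exactly two. The same idea in Python (my sketch):

```python
n = 10
# a number in 1..n is prime iff it has exactly two divisors in 1..n
primes = [k for k in range(1, n + 1)
          if sum(k % d == 0 for d in range(1, n + 1)) == 2]
print(primes)  # [2, 3, 5, 7]
```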

Nth Fibonacci number. Full disclosure: I couldn't derive it myself; I looked it up on BQNcrate and then derived it back:

n ← 10
⊑ +`∘⌽ ⍟n 0‿1

This last example introduces Repeat (⍟). Surprisingly, it was not needed until this example.

I would like to dedicate a moment here to the sheer elegance of this Fibonacci solution. We take the array [0, 1] and apply the function n times, the function being: reverse the array, then prefix-sum it.
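In Python, the same reverse-then-prefix-sum iteration looks like this (my unrolled gloss):

```python
n = 10
# ⊑ +`∘⌽ ⍟n 0‿1 : starting from [0, 1], repeat n times, then take the head
v = [0, 1]
for _ in range(n):
    r = v[::-1]               # ⌽ : reverse
    v = [r[0], r[0] + r[1]]   # +` : prefix sum of the 2-element list
print(v[0])  # 55
```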

4. Tacit Programming: Combinators and Trains

If you look at the examples above, I have freely mixed variables with functions; that is, I have used an argument in more than one place. This is Python programming in the APL world, not APL programming.

In the APL world, we Tacit Program using combinators and trains.

Combinators are the ones documented in the BQN docs; I'll talk about trains in a moment, but I found the trains page to be really useful. I am going to go through these in the order I think is easiest to reason about.

4.1. Atop or the 2 Train

Atop is the simplest combinator. F of G is a concept well known in mathematics and is used as-is here. Atop is also the 2-train, so basically you learn 2 concepts in 1 go!

The only thing you need to remember is that both the arguments are passed to G and the result is sent to F.

As an example, consider +´√ 1‿4‿9. First we take the square root of each element, then we add up the results. Here we have already modified +, which normally adds two arrays, into a fold version with ´. By placing the two functions +´ and √ next to each other, we create a 2-train.

You can instead use the explicit Atop (∘) in between, as in +´∘√, which becomes useful for longer trains.
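If it helps, Atop modelled as a higher-order function in Python (a sketch of the semantics, not BQN itself; this covers only the one-argument case, whereas dyadically both arguments go to G):

```python
import math

# F∘G : the argument goes to G, and G's result goes to F
def atop(f, g):
    return lambda x: f(g(x))

# +´∘√ : square-root each element, then sum
sum_of_roots = atop(sum, lambda xs: [math.sqrt(e) for e in xs])
print(sum_of_roots([1, 4, 9]))  # 6.0
```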

4.2. Over

Over as in w F○G x is F(G(w), G(x)), that is apply G to both arguments (independently) and then apply F with these two as arguments.

Let's take an example. 2 {(𝕨×𝕨) + (𝕩×𝕩)} 3 squares and adds the two operands. We use the explicit block notation {} to make this function, where we explicitly refer to the left (𝕨) and right (𝕩) arguments. Now, multiplying a number with itself can be done using the Self modifier (˜), so the same thing can be refactored into 2 {(ט𝕨) + (ט𝕩)} 3. Try it out and see that it works.

Now, the Over pattern can be used here. We want to apply the "Square" function on both operands, first. Then we want to "Sum" them.

2 (+○(ט)) 3 is the same thing expressed using the Over combinator. In this version, we don't refer to the arguments explicitly anymore; instead we have a single tacit function.
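Over modelled as a higher-order function in Python (again just a sketch of the semantics):

```python
# w F○G x = F(G(w), G(x))
def over(f, g):
    return lambda w, x: f(g(w), g(x))

square = lambda y: y * y
sum_of_squares = over(lambda p, q: p + q, square)
print(sum_of_squares(2, 3))  # 13
```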

4.3. 3-Train or Fork

In the previous example, we had the same function applied on both args. Instead, let us now consider the following: 2 {(𝕨+𝕩) × (𝕨-𝕩)} 3. Add the two numbers, subtract the same two numbers and multiply the result.

Compare this with previous scenario:

  1. You want to run 2 different functions.
  2. You want these functions to take both inputs as arguments.

This can be massively simplified using the 3-train to the following: 2 (+×-) 3. Note the use of the parentheses when using 3-trains inline. If you instead define a function like H ← +×-, you can use it directly, like 2 H 3.

Look at the 3-train and think about it carefully. It is also called a fork: you have the two outer functions, and then the inner function acting on their results.

Between 3-trains and Atop (or 2-trains) all sorts of things can be done. Note that you can nest the 3-trains to make longer chains of functions.

At this juncture, it is good to understand the identity functions Left (⊣) and Right (⊢).

The previous example with Over can be rewritten as 2 ((ט⊣)+(ט⊢)) 3.
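And the fork, sketched the same way in Python:

```python
import operator as op

# w (F G H) x = G(F(w, x), H(w, x))
def fork(f, g, h):
    return lambda w, x: g(f(w, x), h(w, x))

h = fork(op.add, op.mul, op.sub)  # the BQN train (+×-)
print(h(2, 3))  # (2+3) × (2-3) = -5
```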

4.4. Before and After

Before and After can be thought of as special cases of 3-trains. What if you don't want to run a function on both sides, but just on one of them?

For example, consider a ← ⟨1, 4, 9, 6⟩. Now (⊏∨a) - a gives us the difference between the maximum and each element of a. If you look at the expression, a appears twice, which makes it neither a train nor a tacit program. What can we do to make it a single function call?

From before, we can apply a 3-train as: ((⊏∨)-⊢) a. Notice that we don't want to do anything to the second copy of the argument, so we use the identity function Right (⊢).

But instead of this, we can use the built-in Before (⊸), as in: (⊏∨)⊸- a. That is: call the function on the left first (before) on a, then call the second function with that result and a itself (unmodified) as the arguments.

The same applies to After (⟜). Have a look at the diagrams in the linked documentation page; they are very useful for getting the hang of Before and After.
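Before and After sketched in Python (one-argument case only):

```python
# F⊸G 𝕩 = G(F(𝕩), 𝕩)   (Before)
def before(f, g):
    return lambda x: g(f(x), x)

# F⟜G 𝕩 = F(𝕩, G(𝕩))   (After)
def after(f, g):
    return lambda x: f(x, g(x))

a = [1, 4, 9, 6]
sub_each = lambda m, xs: [m - e for e in xs]
# (⊏∨)⊸- : maximum of a, minus each element
print(before(max, sub_each)(a))  # [8, 5, 0, 3]
```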

4.5. Others

There are more combinators, each more complicated than the last; in order:

  1. Repeat (⍟)
  2. Choose (◶)
  3. Under (⌾)

I will let you explore these. Reading only helps so much: you need to try composing a few to get some grip on these. I can't say I can use Under comfortably yet.

5. Building blocks of LLMs

With this knowledge, I implemented a few building blocks of LLMs in BQN. Presented below with minimal commentary, except to say that I did verify their outputs against PyTorch versions!

R ← •rand.Range∘0¨↕ # generate a random matrix of shape arg x‿y
M ← +´∘×´∘⋈⌜○(<˘)⟜⍉ # matmul 2 matrices of right shape

# usage: prob Dr m — drop with probability prob, scale survivors by 1÷(1-prob)
## (nested) train form
Dr ← (1÷(1-⊣))×((⊣<(R∘≢⊢))×⊢)

# Layer Norm, without gamma and beta
# Note that we calculate Population SD, not Sample SD
LNorm ← {
  M ← +´÷≠
  Sd ← √∘M∘ט-⟜M
  ((-⟜M)÷Sd) 𝕩
}

Relu ← {0<𝕩}◶⟨0, ⊢⟩¨
# approximation lifted from
# https://jaykmody.com/blog/gpt-from-scratch#gelu
# and here: https://pytorch.org/docs/stable/generated/torch.nn.GELU.html
Gelu ← {
    # this extra paren over the first term is a bit surprising
    b ← (√(2÷π)) × (𝕩 + 0.044715×(𝕩⋆3))
    a ← 1 + •math.Tanh(b)
    0.5×𝕩×a
}

# with max correction for numeric stability
SoftMax ← (⊢÷+´)∘{⋆𝕩}-⟜(⊑∨)
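For anyone wanting to do a similar cross-check, here are rough NumPy versions of the last two (my code, requires NumPy; the BQN above computes the same thing over whole arrays):

```python
import numpy as np

def layer_norm(x):
    # population SD (ddof=0), matching LNorm above; no gamma/beta
    return (x - x.mean()) / x.std()

def softmax(x):
    # subtract the max for numeric stability, as in SoftMax above
    e = np.exp(x - x.max())
    return e / e.sum()

x = np.array([1.0, 2.0, 3.0])
print(layer_norm(x))  # ≈ [-1.2247, 0, 1.2247]
print(softmax(x))     # sums to 1
```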

6. Conclusion

Languages like APL can expand your brain in unexpected ways. This was just my first foray into this world, and I have been enjoying it so far.

Looking forward to playing with BQN more and reporting results here!