Julia: first impressions
Summary

A small but useful application written in Julia, to test how it compares against the Python + Cython combination, both in terms of performance and ease of use. The results show that Julia is a very practical alternative.

Tags: Julia, data science, recommender


1 Intro

I have been a happy Python user for more than 20 years. I use it daily as a data scientist with the classic scientific stack and a little PyTorch when I get fancy.

Recently I decided to give Julia a try. I must admit I was skeptical. Was it worth it? I think so.

Let me say that I was surprised by Julia's design, but that is the subject for another post. This post is about what everyone thinks is Julia's differentiating advantage: speed. The main conclusion is that it is possible, in a practical way, to obtain near-C speed. Julia will not make your code run fast automagically, but it is a power tool that in trained hands can achieve very good results.

I'm still a Julia newbie, so I wouldn't be surprised if something I did could be done better or something I wrote is not correct. Please let me know if that is the case.

Just jump to the results if you don't feel like reading such a lengthy post.

2 Recommendation algorithms

I wanted to tackle a problem familiar enough to me, with the right amount of complexity and with practical applications, so I decided to implement a recommender algorithm. Recommender systems are a fascinating problem that I feel has not made much progress recently. New deep learning approaches have not shown the same qualitative improvements that have been achieved in other areas, and I think there are great opportunities in the field yet to be found. Anyway, this post is about Julia the language and not about recommender systems.

I wanted to compare speed against a well-optimized and mature Cython implementation, LightFM. LightFM implements good old matrix factorization with some additional functionality to incorporate user and item features, and it also implements some loss functions suitable for implicit problems, like WARP, which has given me good results in the past. WARP is also interesting because it has been designed with sparseness in mind and doesn't lend itself to efficient implementations using automatic differentiation frameworks like PyTorch or, in the case of Julia, Zygote. At least I have not been able to use them for this problem: it would require generating sparse gradient updates and preventing memory allocations. This also means the GPU offers little advantage over a CPU, if any. There is also Implicit, which runs very fast leveraging the GPU if you wish, but in my use cases WARP + LightFM has given better results.

My implementation is far from feature parity with LightFM. In particular, I have not added user and item features and I have only implemented the WARP loss, without regularization. Nevertheless, I'm quite satisfied with the results, and everything amounts to less than 300 lines of code.

3 Matrix factorization

In case you don't know, or don't remember, the idea is to assign to each user a vector, which we will call an embedding, of some size n_components, which is a hyperparameter of our model. The higher it is, the higher our model's capacity. We do the same with items and also consider an item bias, a single scalar, for each item (more popular items will have greater bias). The affinity, preference or score of a user for an item is just the scalar product of the user embedding with the item embedding, plus the item bias. So if the ith user has embedding \(\boldsymbol{u}_i\) and the jth item has embedding \(\mathbf{v}_j\), the score \(s (i, j)\) that the user assigns to the item is simply:

\(\displaystyle s (i, j) =\boldsymbol{u}_i \cdot \mathbf{v}_j + b_j\)

If we want to recommend the best 5 items for a given user, we compute the score of that user against all item embeddings, which can be done efficiently with a matrix-vector multiplication, and then sort the items by score and keep the 5 with the highest score. If you have not found this explanation satisfactory, the author of the Implicit library has a nice blog post giving more details about implicit matrix factorization.
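As a minimal sketch of that top-5 computation, with illustrative names (the actual model struct comes next):

# item_embeddings: n_components × n_items matrix, item_bias: vector of length n_items,
# u: the embedding vector of the user we want recommendations for.
scores = item_embeddings' * u .+ item_bias        # one score per item
top5 = partialsortperm(scores, 1:5, rev=true)     # indices of the 5 highest-scoring items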

Let's now translate the above math into code, starting with a struct that holds the model parameters:

struct Factorization{T <: AbstractFloat}
    user_embeddings::Matrix{T}
    item_embeddings::Matrix{T}
    item_bias::Vector{T}

    function Factorization{T}(n_users::Int, n_items::Int, n_components::Int) where {T<:AbstractFloat}
        return new{T}(
            zeros(T, n_components, n_users),
            zeros(T, n_components, n_items),
            zeros(T, n_items))
    end
end

If you are new to Julia, the above just declares a struct with 3 fields. It has a type parameter T which can be any floating point type (Float16, Float32, Float64). The type parameter appears inside the fields and constrains them to be arrays containing elements of that type (Matrix{T} is just a type alias for Array{T, 2} and Vector{T} for Array{T, 1}). This is a very important point: it means that any indexing operation will return elements of that type and the compiler will be able to generate very efficient code, so this extra syntax noise is worth the effort, not just for documentation purposes but for performance. I would love to say that it also allows catching errors, but I have been unable to configure my editor setup, Emacs + julia-lsp, for that. It seems that static type checking is limited, but I will have to take a look at it anyway.

Julia's type system is not complicated at all and is a necessary part of the language. You need it for method dispatch, an alternative to OOP which is quite natural and easy to pick up if you are coming, for example, from Python. Besides, a lot of Python code is just checking argument types at the beginning of functions; for example, have a look at a method inside scikit-learn, selected at random.
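For readers who have never seen it, here is a toy illustration of dispatch on argument types (the names are made up and have nothing to do with the recommender code):

describe(x::AbstractFloat) = "a float: $x"
describe(x::Integer) = "an integer: $x"
describe(x) = "something else"

describe(1.5)   # "a float: 1.5"
describe(3)     # "an integer: 3"
describe("hi")  # "something else"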

One of the problems with optional type annotations is the need to find The Golden Path that leads you to performant code: there are lots of ways to do something but only one that makes it go fast. This also happens in Julia, but I think the story is much better than, say, in Cython. For example, the following struct declarations are all valid Julia code but suffer from different problems.

struct Factorization
    user_embeddings
    item_embeddings
    item_bias
end

This first variant is perfectly valid but doesn't have any type annotations.

Some type annotations:

struct Factorization
    user_embeddings::AbstractArray
    item_embeddings::AbstractArray
    item_bias::AbstractArray
end

This is also valid:

struct Factorization
    user_embeddings::Array
    item_embeddings::Array
    item_bias::Array
end

And this:

struct Factorization
    user_embeddings::Matrix
    item_embeddings::Matrix
    item_bias::Vector
end

In order to add more typing you might be tempted to do the following:

struct Factorization
    user_embeddings::Matrix{AbstractFloat}
    item_embeddings::Matrix{AbstractFloat}
    item_bias::Vector{AbstractFloat}
end

The above has two problems, AFAIK. First, the arrays can have different element types; for example, user_embeddings could be Float32 and item_embeddings could be Float64, which would incur float conversions when doing arithmetic. The other, more important problem is that you could actually mix Float32 and Float64 inside the same array. Even if you don't mix them, Julia will be forced to consider that possibility, and since both types have different sizes it will store references inside the array instead of values, which adds even more performance penalties (and storage). These explanations, and much more, are better developed in the performance tips section of Julia's documentation. It's not complicated but requires some learning.
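A quick way to see the difference is to ask whether the element type can be stored inline:

julia> isbitstype(eltype(zeros(Float32, 3)))          # concrete element type, stored inline
true

julia> isbitstype(eltype(AbstractFloat[1.0f0, 2.0]))  # abstract element type, boxed references
false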

Let's continue and implement some little helper functions, using the single-expression function declaration syntax:

n_users(model::Factorization) = size(model.user_embeddings, 2)
n_items(model::Factorization) = size(model.item_embeddings, 2)
n_components(model::Factorization) = size(model.user_embeddings, 1)

And the dot product to compute the user/item score:

function (model::Factorization)(user::Int, item::Int)
    s = model.item_bias[item]
    for i=1:n_components(model)
        s += model.item_embeddings[i, item] * model.user_embeddings[i, user]
    end
    return s
end

The above code snippet defines the equivalent of Python's __call__ special method and lets us write, for example, model(42, 22) to compute the score between user 42 and item 22.

Some comments about performance again. First, notice that the above code doesn't have parametric types anywhere (we just write model::Factorization instead of, say, model::Factorization{T}). This is perfectly correct and normal and does not incur any performance penalty. Second, we could have written the dot product in an alternative form which takes us off our Golden Path:

(model::Factorization)(user::Int, item::Int) = model.item_bias[item] + model.item_embeddings[:, item]' * model.user_embeddings[:, user]

Actually I would have written the above if working from the REPL, since it's more succinct and closer to the mathematical definition. But this seemingly innocent expression has several problems. First, in Julia, array slicing creates copies of the data by default, which I think is great, because you have the alternative of creating views. Simply use:

(model::Factorization)(user::Int, item::Int) = model.item_bias[item] + view(model.item_embeddings, :, item)' * view(model.user_embeddings, :, user)

This avoids copying the data but still allocates memory for structs of type SubArray. You can check it with the very useful @allocated macro, which has proven essential for me when debugging performance problems:

julia> @allocated model(10, 10)
432

The above shows that a single call to the function allocates 432 bytes, a very significant amount of memory for a function that will be called hundreds of millions or billions of times.

With NumPy you need to learn which operations create views and which create copies. It's not complicated, but you have to reason about whether a view can be created based on how arrays are represented internally (pointer to data + strides). For example, in Python, indexing with a boolean array creates a copy, but slicing creates a view. I very rarely make mistakes of this kind, but I must admit I have sometimes introduced a bug because of it: code that modified a view in place, which in turn modified the original array, because I forgot I was working with a view.
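In Julia the distinction is explicit rather than something you infer from the indexing operation; a tiny example:

A = [1 2; 3 4]
B = A[:, 1]         # slicing copies: mutating B leaves A untouched
V = view(A, :, 1)   # view: V aliases the first column of A
V[1] = 99
A[1, 1] == 99       # true, the original array was modified through the view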

As you can see, there are several ways of doing the same thing in Julia. I think there are always at least two: one form suitable for exploration, where you write vectorized functions in the spirit of Matlab or NumPy, and another where you write loops and annotate types as you would in C. The good thing is that the transition between the two is completely smooth.

4 Training data

I have decided to represent user/item interactions in a similar way to LightFM. Whenever a user has interacted (clicked, bought, viewed, whatever depending on the application) with an item we store the value of the interaction (a 0-5 rating, +1/-1, 0/1) in the cell of a matrix where the row is the item and the column is the user.

When our data is implicit, the presence of an entry means that the user somehow liked the item. The absence of an entry can mean that she didn't like the item or, most probably, that she didn't know about it. We therefore build a very big sparse matrix, with maybe millions of columns and rows. The matrix is called interactions in the code and is represented with a SparseMatrixCSC.
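To make the representation concrete, here is a toy interactions matrix (sizes and values made up): 3 users as columns, 4 items as rows, and 4 interactions:

using SparseArrays

items  = [1, 3, 2, 4]   # row index of each interaction
users  = [1, 1, 2, 3]   # column index of each interaction
values = [1, 1, 1, 1]   # value stored for each interaction
interactions = sparse(items, users, values, 4, 3)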

We will be using the Last.fm dataset to have something concrete on which to perform benchmarks. The following function processes the TSV file, converting user hashes and artist ids to numeric indices suitable for rows and columns. It returns an array with one row per interaction and 3 columns: user, item and number of times the artist has been played. We don't really care how many times the artist has been played, just that it has been played, but we store the amount anyway:

function load_lastfm(path)
    function index!(dict, name)
        i = get(dict, name, 0)
        if i == 0
            dict[name] = i = length(dict) + 1
        end
        return i
    end

    data = Array{Int}(undef, countlines(open(path)), 3)
    users = Dict{String, Int}()
    items = Dict{String, Int}()
    for (i, line) in enumerate(eachline(path))
        cells = split(line, '\t')
        data[i, 1] = index!(users, cells[1])
        data[i, 2] = index!(items, cells[2])
        data[i, 3] = parse(Int, cells[4])
    end
    return data
end

We can convert the above to the sparse matrix representation easily:

julia> lastfm = load_lastfm("path/to/lastfm.tsv");
julia> using SparseArrays
julia> lastfm = sparse(view(lastfm, :, 2), view(lastfm, :, 1), view(lastfm, :, 3));

And it's convenient to serialize to disk for easy and fast reading back:

julia> using Serialization
julia> serialize("lastfm.jlser", lastfm)

Now if we restart Julia we can simply read back this data:

julia> using Serialization
julia> lastfm = deserialize("lastfm.jlser")

The dataset has 360K users and 160K different artists and only 0.03% of the interactions matrix contains non-zero entries (17M entries).
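These figures are easy to double-check once the sparse matrix is built:

size(lastfm)                          # (n_items, n_users)
nnz(lastfm)                           # number of stored interactions
nnz(lastfm) / prod(size(lastfm))      # fraction of non-zero entries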

5 WARP loss

Learning to rank items, as in recommender systems but also in other domains like search, has the difficulty that the metric we want to optimize directly (actually not completely true for recommender systems, but this would require talking about online/offline metrics and digressing a lot) is either non-differentiable or constant. This means that the gradient is either non-existent or zero. Let's say, for example, we have 5 items whose correct order is:

\(\displaystyle ABCDE\)

Our model has computed the following scores:

\begin{eqnarray*} s_A & = & 4.1\\ s_B & = & 5.3\\ s_C & = & 2.5\\ s_D & = & 1.4\\ s_E & = & 3.0 \end{eqnarray*}

This corresponds to the ordering:

\(\displaystyle BAECD\)

As you can see, changing the score of A from 4.1 to 4.2 doesn't change the ordering (gradient zero). If, however, it changes from 4.1 to 5.4, the ranking changes and whatever loss we have chosen between the two orderings, like for example the count of items correctly ordered, experiences a sudden jump (gradient undefined).

One common approach is to use what are called “pairwise losses”. The main idea is that you compare pairs of items and if they are out of order they add to the loss with some function that depends on their scores. In the above example the following pairs are out of order:

\(\displaystyle \begin{array}{l} (B, A)\\ (E, C)\\ (E, D) \end{array}\)

So we could build the following differentiable loss function, for example:

\(\displaystyle L = \sigma (s_B - s_A) + \sigma (s_E - s_C) + \sigma (s_E - s_D)\)

where \(\sigma\) is a differentiable function like the sigmoid or just the identity. Obviously a perfect pairwise loss is equivalent to a perfect ordering, but it's not obvious that a perfect ordering can be achieved by repeated application of pairwise losses, and I actually think it has not been proved. If anyone reading this has any information on the subject I would love to know about it. There have been some recent publications on directly differentiating sorting and ranking, in order to plug them into TensorFlow, PyTorch or similar automatic differentiation libraries, like the paper Fast Differentiable Sorting and Ranking, which achieves optimal asymptotic complexity. I did have a look at the paper, although the math is quite dense and I need to study more combinatorial optimization, and I have not tested its practicality. Anyway, back to WARP.
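Just to make the example concrete, here is the pairwise loss above computed with \(\sigma\) as the logistic sigmoid (a throwaway sketch, not part of the recommender code):

σ(x) = 1 / (1 + exp(-x))
s = Dict(:A => 4.1, :B => 5.3, :C => 2.5, :D => 1.4, :E => 3.0)
L = σ(s[:B] - s[:A]) + σ(s[:E] - s[:C]) + σ(s[:E] - s[:D])  # one term per inverted pair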

WARP is one such pairwise loss, and it implements a clever trick to avoid computing lots of user/item scores. Take into account that it is quite possible to have a million items in a recommender system, and you can easily have millions or even hundreds of millions of users. The idea is to pick some positive item and compute its score, and then start picking random items until you find one that has a higher score but shouldn't, what I will call an inversion. Then you add the pair to the loss, scaled by how difficult it was to find the inversion (the number of random items you needed to test). As soon as you find the pair you perform the gradient step that modifies the user embedding and the two item embeddings that compose the inverted pair. In order to improve the model we require that the difference in score between the two items has some margin.

In the context of recommender systems with implicit data we usually just discriminate between positive and negative items and don't care about the ranking within the positives or within the negatives. For example, if we again have 5 items and \(A\) and \(B\) are positive while the rest are negative, then these are a few of the equivalent perfect orderings of the items:

\(\displaystyle \begin{array}{l} \operatorname{ABCDE}\\ \operatorname{BACDE}\\ \operatorname{ABDCE}\\ \operatorname{BADCE} \end{array}\)

The code to find inversions:

function is_positive(interactions::SparseMatrixCSC{S, Int}, user::Int, item::Int) where {S}
    for i=interactions.colptr[user]:interactions.colptr[user+1] - 1
        if item == interactions.rowval[i]
            return true
        end
    end
    return false
end

function find_inversion(model::Factorization, user::Int, item::Int,
                        interactions::SparseMatrixCSC{S, Int}, max_sample::Int) where {S}
    n = n_items(model)
    s = model(user, item)
    for i=1:max_sample
        j = abs(rand(typeof(item)) % n) + 1
        t = model(user, j)
        if t > (s - 1) && !is_positive(interactions, user, j)
            return (i, j)
        end
    end
    return (max_sample, 0)
end

As you can see, the function find_inversion returns the number of steps (random samples) it took to find an inversion and the item that provokes the inversion. Notice that the functions have an unconstrained type parameter S. This is because type parameters in Julia are invariant, which means that, for example, SparseMatrixCSC{Float32, Int} is not a subtype of SparseMatrixCSC{Number, Int}:

julia> Float32 <: Number
true

julia> SparseMatrixCSC{Float32, Int} <: SparseMatrixCSC{Number, Int}
false

Had we specified SparseMatrixCSC{Number, Int} in the function declaration we would most likely have gotten a MethodError: no method matching… kind of error. The type parameter S is unconstrained because I don't use the values inside the matrix at all; it is just there to allow type matching. It may seem limited at first sight, but I personally prefer this extra verbosity to thinking about type covariance/contravariance. Also, notice the typeof call: it generates the random number with the correct type and avoids either a type conversion or a method match error.

Once we find an inversion we just perform a step on the following error, which is the WARP loss formula:

log((n_items(model) - 1) / n)*(model(user, item_n) - model(user, item_p) + 1)

I would have loved to apply automatic differentiation to the problem, with more complex models in mind for the future, but as I said I have been unable to, so let's write the gradient functions directly:

warp_constant(model::Factorization{T}, n) where {T<:AbstractFloat} =
    convert(T, log((n_items(model) - 1) / n))

function warp_embeddings_grad(model::Factorization, user::Int, item_p::Int, item_n::Int, i::Int)
    return (model.item_embeddings[i, item_n] - model.item_embeddings[i, item_p],
            -model.user_embeddings[i, user],
            +model.user_embeddings[i, user])
end

We will reuse the above functions to write the different optimization algorithms.

6 Optimization algorithms

We are going to implement three optimization algorithms: good old SGD, Adadelta and Adagrad. I recommend this well known blog post about gradient descent algorithms, which I have used as a reference.

Let's start with SGD. All the optimizers will be subtypes of a common abstract type Optimizer{T, N}:

abstract type Optimizer{T, N} end  # common supertype shared by all the optimizers

struct SGD{T <: AbstractFloat, N} <: Optimizer{T, N}
    θ::Array{T, N}
    ϵ::T

    SGD(θ::Array{T, N}, ϵ::AbstractFloat) where {T <: AbstractFloat, N} =
        new{T, N}(θ, convert(T, ϵ))
end

function step!(optimizer::SGD{T, N}, g::T, i::Vararg{Int, N}) where {T <: AbstractFloat, N}
    optimizer.θ[i...] -= optimizer.ϵ * g
end

As you can see, the optimizer wraps the parameters to be updated as \(\theta\) and also adds whatever hyperparameters it needs, in this case just the learning rate \(\epsilon\). A more conventional signature for the step! method would eliminate the indices i and pass the gradients g as an array of the correct shape instead of as a scalar. However, in our training almost all gradient elements are zero on each update, and passing full arrays would be wasteful. Passing sparse matrices would be efficient only if we accumulated lots of these updates, but we want to update the parameters as soon as possible.

There is little more to say about this code, except perhaps the use of convert. The reason is that I want the constructor to accept any AbstractFloat and not only the exact same type as the model; otherwise, if my model were Float32, a call like SGD(model.user_embeddings, 1e-2) would not match any method, since the literal 1e-2 is a Float64. It's sometimes a little annoying to be thinking about these kinds of problems, using functions like typeof, convert, promote, zero or oneunit, but I don't see any alternative if you want fast code.
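As a tiny usage example (assuming a Float32 model, as in the benchmarks later; the gradient value is made up):

opt = SGD(model.user_embeddings, 1e-2)  # the Float64 literal is converted to Float32
step!(opt, 0.25f0, 1, 42)               # apply gradient 0.25 to component 1 of user 42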

Let's now implement Adadelta and Adagrad. I implemented them because they are the ones LightFM uses, and also to showcase something that is not very important but usually catches people's attention: the use of Unicode characters for math symbols inside Julia code. In this case I found it a welcome addition because it allowed me to translate more easily what I read in the algorithm descriptions. Here they are:

struct Adadelta{T <: AbstractFloat, N} <: Optimizer{T, N}
    θ::Array{T, N}
    Eg²::Array{T, N}
    EΔ²::Array{T, N}
    ϵ::T
    η::T

    function Adadelta(θ::Array{T, N}, ϵ::AbstractFloat, η::AbstractFloat) where {T <: AbstractFloat, N}
        return new{T, N}(θ, zero(θ), zero(θ), convert(T, ϵ), convert(T, η))
    end
end

function step!(optimizer::Adadelta{T, N}, g::T, i::Vararg{Int, N}) where {T<:AbstractFloat, N}
    EΔ²i = optimizer.EΔ²[i...]
    Eg²i = optimizer.Eg²[i...]
    ϵ = optimizer.ϵ
    η = optimizer.η
    Eg²i = η*Eg²i + (1 - η)*g^2
    Δi = -sqrt((EΔ²i + ϵ)/(Eg²i + ϵ))*g
    EΔ²i = η*EΔ²i + (1 - η)*Δi^2
    optimizer.θ[i...] += Δi
    optimizer.EΔ²[i...] = EΔ²i
    optimizer.Eg²[i...] = Eg²i
end

struct Adagrad{T <: AbstractFloat, N} <: Optimizer{T, N}
    θ::Array{T, N}
    G::Array{T, N}
    ϵ::T
    η::T

    function Adagrad(θ::Array{T, N}, ϵ::AbstractFloat, η::AbstractFloat) where {T <: AbstractFloat, N}
        return new{T, N}(θ, zero(θ), convert(T, ϵ), convert(T, η))
    end
end

function step!(optimizer::Adagrad{T, N}, g::T, i::Vararg{Int, N}) where {T<:AbstractFloat, N}
    Gi = optimizer.G[i...]
    ϵ = optimizer.ϵ
    η = optimizer.η
    Gi += g^2
    Δi = -η/sqrt(Gi + ϵ)*g
    optimizer.θ[i...] += Δi
    optimizer.G[i...] = Gi
end

7 Macros

The above optimization algorithms are generic versions which can be reused for other problems. We are now going to create an abstract type FactorizationOptimizer, with concrete subtypes that wrap optimizers for the two embedding matrices and the biases with our particular problem in mind.

A first try looks like this:

abstract type FactorizationOptimizer{T <: AbstractFloat} end

struct FactorizationSGD{T} <: FactorizationOptimizer{T}
    user_embeddings::SGD{T, 2}
    item_embeddings::SGD{T, 2}
    item_bias::SGD{T, 1}

    function FactorizationSGD(model::Factorization{T}, ϵ::AbstractFloat) where {T}
        return new{T}(SGD(model.user_embeddings, ϵ),
                      SGD(model.item_embeddings, ϵ),
                      SGD(model.item_bias, ϵ))
    end
end

struct FactorizationAdadelta{T <: AbstractFloat} <: FactorizationOptimizer{T}
    user_embeddings::Adadelta{T, 2}
    item_embeddings::Adadelta{T, 2}
    item_bias::Adadelta{T, 1}

    function FactorizationAdadelta(model::Factorization{T}, ϵ::AbstractFloat, η::AbstractFloat) where {T}
        return new{T}(Adadelta(model.user_embeddings, ϵ, η),
                      Adadelta(model.item_embeddings, ϵ, η),
                      Adadelta(model.item_bias, ϵ, η))
    end
end

struct FactorizationAdagrad{T <: AbstractFloat} <: FactorizationOptimizer{T}
    user_embeddings::Adagrad{T, 2}
    item_embeddings::Adagrad{T, 2}
    item_bias::Adagrad{T, 1}

    function FactorizationAdagrad(model::Factorization{T}, ϵ::AbstractFloat, η::AbstractFloat) where {T}
        return new{T}(Adagrad(model.user_embeddings, ϵ, η),
                      Adagrad(model.item_embeddings, ϵ, η),
                      Adagrad(model.item_bias, ϵ, η))
    end
end

I'm sure you have noticed how similar the 3 structs are: they follow a common template. If we were only going to use these 3 algorithms I would leave them as they are, but let's suppose we are going to implement more. This is a good use case for macros. We want to create the different structures with a single call like:

@optimizertype(SGD, ϵ)
@optimizertype(Adadelta, ϵ, η)
@optimizertype(Adagrad, ϵ, η)

Here is the corresponding macro:

macro optimizertype(OptimizerType, params...)
    structname = Symbol("Factorization", OptimizerType)
    return quote
        struct $structname{T} <: FactorizationOptimizer{T}
            user_embeddings::$OptimizerType{T, 2}
            item_embeddings::$OptimizerType{T, 2}
            item_bias::$OptimizerType{T, 1}

            function $structname(model::Factorization{T}, $(params...)) where {T}
                return new{T}(
                    $OptimizerType(model.user_embeddings, $(params...)),
                    $OptimizerType(model.item_embeddings, $(params...)),
                    $OptimizerType(model.item_bias, $(params...)))
            end
        end
    end
end

We have avoided some copy-pasting at the cost of some readability. With 10 different algorithms it would be worth it; with 3, I wouldn't use a macro. The crossing point is somewhere in between, although I don't know exactly where.
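When writing this kind of macro, @macroexpand is handy to check that the generated definition is what you expect:

julia> @macroexpand @optimizertype(SGD, ϵ)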

8 Training algorithm

The optimization algorithm proceeds as follows: we select a random user/item interaction from the training set; by definition this means that the selected item is a positive item for that user. We then find an inversion for that user/item pair, which gives a negative item and the number of random samples it took to find it. We then perform a gradient step using one of the optimization algorithms. We repeat until we have visited all interactions in the training dataset.

Selecting a random user/item interaction is just picking a random index inside the interactions. Obtaining the positive item is direct from the SparseMatrixCSC, but in order to obtain the user without searching we must build an auxiliary array mapping interaction index to user. To avoid recomputing this array at each training epoch I have defined an Interactions struct that is just a wrapper for the sparse matrix and builds the auxiliary array when constructed:

struct Interactions{S}
    sparse::SparseMatrixCSC{S, Int}
    users::Vector{Int}

    function Interactions(interactions::SparseMatrixCSC{S, Int}) where {S}
        n_interactions = length(interactions.nzval)
        users = Array{Int}(undef, n_interactions)
        for i=1:length(interactions.colptr)-1
            for j=interactions.colptr[i]:(interactions.colptr[i+1] - 1)
                users[j] = i
            end
        end
        return new{S}(interactions, users)
    end
end

And now the training function, which relies on the warpstep! function we will implement next and on the FactorizationOptimizer types defined above:

using Random  # randperm and rand! come from the Random standard library

function epoch!(model::Factorization, optimizer::FactorizationOptimizer,
                interactions::Interactions, max_sample::Int)
    max_sample = min(n_items(model) - 1, max_sample)
    n_interactions = length(interactions.sparse.nzval)
    @Threads.threads for i in randperm(n_interactions)
        user = interactions.users[i]
        item_p = interactions.sparse.rowval[i]
        n, item_n = find_inversion(
            model, user, item_p, interactions.sparse, max_sample)
        if item_n > 0
            warpstep!(model, optimizer, user, item_p, item_n, n)
        end
    end
end

Notice the @Threads.threads: it automagically parallelizes the for loop. We are using here the same algorithm as LightFM, Hogwild, which basically means not caring about data races at all. If you think this can corrupt data you are right, but we simply don't care. First, it will happen sparingly for our problem, since the probability of the same item appearing simultaneously in two different threads is low, and for users it is even lower. Second, even if it happens, who cares? We are already using a noisy optimization procedure and we actually like the noise. You need to start the Julia interpreter correctly configured; in my case I added the line export JULIA_NUM_THREADS=8 to my ~/.profile. If, like me, you are coming from Python, the fact that this worked so well on the first try will make you cry with joy.
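You can confirm the setting was picked up from a fresh REPL (the 8 simply reflects my JULIA_NUM_THREADS value):

julia> Threads.nthreads()
8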

The function warpstep! is in charge of computing the gradients and passing them to the optimizers for the embeddings and the item biases:

function warpstep!(model::Factorization{T}, optimizer::FactorizationOptimizer{T},
                   user::Int, item_p::Int, item_n::Int, n::Int) where {T}
    C = warp_constant(model, n)
    for i=1:n_components(model)
        gu, gp, gn = warp_embeddings_grad(model, user, item_p, item_n, i)
        step!(optimizer.user_embeddings, C*gu, i, user)
        step!(optimizer.item_embeddings, C*gp, i, item_p)
        step!(optimizer.item_embeddings, C*gn, i, item_n)
    end
    step!(optimizer.item_bias, convert(T, -1), item_p)
    step!(optimizer.item_bias, convert(T, +1), item_n)
end

Finally, for completeness, this function initializes the embeddings and the biases with sensible values:

function init!(interactions::Interactions, model::Factorization)
    interactions = interactions.sparse
    fill!(model.item_bias, 0)
    for i=1:length(interactions.nzval)
        model.item_bias[interactions.rowval[i]] += 1
    end
    model.item_bias ./= sum(model.item_bias)

    scale = 1.0 / n_components(model)
    rand!(model.user_embeddings)
    rand!(model.item_embeddings)
    model.user_embeddings .-= 0.5
    model.item_embeddings .-= 0.5
    model.user_embeddings .*= scale
    model.item_embeddings .*= scale
end
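Putting the pieces together, a training driver would look roughly like this (a sketch only: the hyperparameters, number of epochs, embedding size and max_sample value are illustrative, not the ones used in the benchmarks below):

using Serialization

lastfm = deserialize("lastfm.jlser")                  # items as rows, users as columns
interactions = Interactions(lastfm)
model = Factorization{Float32}(size(lastfm, 2), size(lastfm, 1), 32)
init!(interactions, model)
optimizer = FactorizationAdagrad(model, 1e-6, 1e-2)   # ϵ and η chosen arbitrarily here
for epoch in 1:5
    epoch!(model, optimizer, interactions, 100)       # max_sample = 100, also arbitrary
end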

9 Metrics

Before measuring speed we must take an important point into account: if everything goes well, we expect each training epoch to take more time than the previous one. The reason is in the WARP loss: as the parameters fit better, it becomes more difficult to find an inversion.

We are therefore going to measure the precision@k metric as we train. LightFM has its own function to measure it, which by the way is quite slow and I had to reimplement at my job. Here is our version in Julia:

function select_k(scores::Matrix{<:AbstractFloat}, k::Int)
    n = size(scores, 2)
    y = Array{Int}(undef, k, n)
    ix = Array{Int}(undef, size(scores, 1))
    for i=1:n
        partialsortperm!(ix, view(scores, :, i), k, rev=true)
        y[:, i] = view(ix, 1:k)
    end
    return y
end

function precision_at_k(model::Factorization, interactions::SparseMatrixCSC,
                        users::Vector{Int}, k::Int)
    scores = model.item_bias .+ model.item_embeddings' * model.user_embeddings[:, users]
    recommendations = select_k(scores, k)
    pk = zeros(Float32, length(users))
    for i=1:size(recommendations, 2)
        for j=1:size(recommendations, 1)
            user = users[i]
            item = recommendations[j, i]
            if is_positive(interactions, user, item)
                pk[i] += 1.0f0
            end
        end
    end
    return pk / k
end

We should get comparable values of precision@k between LightFM and our little test code if we are to compare speed. Also, we hope the precision increases. We evaluate on a subset of users who are part of the training data, so don't draw conclusions about how good the model is.

10 Results

And so, here are the benchmarks!

Python (LightFM):

Epoch time: 3.34min precision@5: 0.502
Epoch time: 4.83min precision@5: 0.575
Epoch time: 5.46min precision@5: 0.604
Epoch time: 5.80min precision@5: 0.622
Epoch time: 5.81min precision@5: 0.628

Julia:

Epoch time: 2.01min precision@5: 0.518
Epoch time: 3.07min precision@5: 0.574
Epoch time: 3.26min precision@5: 0.599
Epoch time: 3.50min precision@5: 0.628
Epoch time: 3.61min precision@5: 0.638

We can get even a little more juice out of Julia with additional options:

julia --math-mode=fast --check-bounds=no --threads=auto test.jl

To get:

Epoch time: 1.60min precision@5: 0.515
Epoch time: 2.37min precision@5: 0.572
Epoch time: 2.84min precision@5: 0.604
Epoch time: 2.87min precision@5: 0.616
Epoch time: 2.97min precision@5: 0.634

The precisions are more or less similar. Sometimes our implementation gets slightly better scores and sometimes LightFM does, because we evaluate on different random sets of users. Timings are roughly the same across runs because it's always the same training data.

Even though we have not used the additional features of LightFM, maybe supporting them constrains its implementation, so it's premature to say our implementation is faster, although it looks like it by a generous margin. I think it's safe to say that Julia is at least as fast as Cython.

You can find all the code to replicate the benchmarks at this repo.

If these results encourage you to give Julia a try, that's great, but I would advise not focusing on performance at first. In my case I started learning by doing exercises at Exercism. Have fun!