Item2Vec for Sparse Purchase Histories: Embedding Products in Implicit Feedback Regimes

The recommendation engine on our cross-border commerce platform had a specific constraint: purchase history per user was sparse. Most customers had fewer than a dozen completed orders at the time we were building the system. Classical collaborative filtering degrades badly in this regime. Matrix factorisation models trained with ALS or BPR need sufficient co-occurrence density to learn meaningful latent factors. We had neither the user count nor the interaction density to make that work reliably.

Item2Vec offered a different framing.

The Core Idea

Item2Vec, introduced by Barkan and Koenigstein (2016), adapts Word2Vec’s skip-gram architecture to item recommendation. Instead of treating a sentence as a sequence of words, you treat a user’s purchase basket (or browsing session) as a sequence of items. The model learns to predict which items occur in the same context as a given item.

Formally, given a set of items $\mathcal{I}$ and a corpus of sessions $\mathcal{S} = \{S_1, S_2, \ldots, S_M\}$ where each $S_k \subseteq \mathcal{I}$, the objective is to maximise:

$$\mathcal{L} = \frac{1}{M} \sum_{k=1}^{M} \frac{1}{|S_k|} \sum_{i \in S_k} \sum_{\substack{j \in S_k \\ j \neq i}} \log p(j \mid i)$$

The conditional probability is defined via softmax over item embeddings:

$$p(j \mid i) = \frac{\exp\!\left(\mathbf{v}_i^\top \mathbf{v}_j\right)}{\sum_{l \in \mathcal{I}} \exp\!\left(\mathbf{v}_i^\top \mathbf{v}_l\right)}$$

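where $\mathbf{v}_i \in \mathbb{R}^d$ is the embedding vector for item $i$ and $d$ is the embedding dimension (we used $d = 128$). Word2Vec-style implementations, Gensim included, keep separate target and context vectors for each item; the single $\mathbf{v}$ here is a notational simplification, and the vectors we export for inference are the target vectors.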
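As a minimal sketch of the softmax definition above, the following computes $p(j \mid i)$ over a toy catalogue. The SKUs, the two-dimensional vectors, and the helper name `softmax_probs` are all hypothetical and chosen purely for illustration.

```python
import math

def softmax_probs(embeddings: dict[str, list[float]], item: str) -> dict[str, float]:
    """Return p(j | item) for every j in the catalogue, following the softmax definition."""
    v_i = embeddings[item]
    # Dot product of v_i with every item vector; the denominator runs over the whole catalogue.
    scores = {j: sum(a * b for a, b in zip(v_i, v_j)) for j, v_j in embeddings.items()}
    # Subtract the maximum score before exponentiating for numerical stability.
    max_score = max(scores.values())
    exps = {j: math.exp(s - max_score) for j, s in scores.items()}
    normaliser = sum(exps.values())
    return {j: e / normaliser for j, e in exps.items()}

# Hypothetical toy embeddings (d = 2 instead of 128).
toy_embeddings = {
    "sku_tea": [1.0, 0.0],
    "sku_teapot": [0.9, 0.1],
    "sku_phone_case": [0.0, 1.0],
}
probs = softmax_probs(toy_embeddings, "sku_tea")
print(probs)
```

The normaliser touches every item in the catalogue for every training pair, which is the cost that motivates sampling-based training objectives for large catalogues.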

Why It Handles Sparsity Better

The critical distinction from user-based collaborative filtering is that Item2Vec makes no use of user identity at inference time. The embeddings are item-space representations learned from co-occurrence patterns across all users. A user who has made only two purchases can still receive a meaningful recommendation: we look up the embeddings of their purchased items and return the nearest neighbours in the embedding space.

The similarity function at inference is simply cosine similarity:

$$\text{sim}(i, j) = \frac{\mathbf{v}_i^\top \mathbf{v}_j}{\|\mathbf{v}_i\| \cdot \|\mathbf{v}_j\|}$$

At serving time we do not score every catalogue item exhaustively. Neighbours are retrieved via approximate nearest neighbour search (we used faiss with an IVF index), which keeps latency below 20 ms at the 95th percentile even as the catalogue grows.
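To make the inference step concrete, here is a brute-force sketch in plain Python. It rests on one assumption that the text above does not specify: when a user has several purchases, their item vectors are unit-normalised and averaged into a single query vector. The SKUs, vectors, and function names are hypothetical, and a production system would swap the exhaustive scan for an approximate index.

```python
import math

def normalise(v: list[float]) -> list[float]:
    """Scale v to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in v))
    return v if norm == 0 else [x / norm for x in v]

def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity, computed as the dot product of unit-normalised vectors."""
    return sum(a * b for a, b in zip(normalise(u), normalise(v)))

def recommend(embeddings: dict[str, list[float]], purchased: list[str], k: int = 10) -> list[str]:
    """Rank unpurchased items by cosine similarity to the user's query vector."""
    known = [normalise(embeddings[i]) for i in purchased if i in embeddings]
    if not known:
        return []  # no purchased item has an embedding, so there is nothing to query with
    dim = len(known[0])
    # Assumption: the query vector is the mean of the purchased items' unit vectors.
    query = [sum(v[d] for v in known) / len(known) for d in range(dim)]
    candidates = [i for i in embeddings if i not in purchased]
    candidates.sort(key=lambda i: cosine(query, embeddings[i]), reverse=True)
    return candidates[:k]

# Hypothetical toy catalogue.
toy_embeddings = {
    "sku_tea": [1.0, 0.0],
    "sku_teapot": [0.9, 0.1],
    "sku_strainer": [0.8, 0.3],
    "sku_phone_case": [0.0, 1.0],
}
recs = recommend(toy_embeddings, ["sku_tea"], k=2)
print(recs)
```

Because the vectors are unit-normalised first, cosine similarity reduces to a plain inner product, which is the form that inner-product nearest-neighbour indexes operate on.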

Training Details

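We concatenated purchase baskets and browsing sessions into a single training corpus, treating each basket or session as an unordered set (Gensim’s Word2Vec in skip-gram mode, sg=1, with window set large enough to cover the full session). Rather than evaluating the full softmax denominator, we trained with negative sampling, which maximises the following objective for each positive pair $(i, j)$: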

$$\log \sigma\!\left(\mathbf{v}_i^\top \mathbf{v}_j\right) + \sum_{k=1}^{K} \mathbb{E}_{j_k \sim P_n} \left[\log \sigma\!\left(-\mathbf{v}_i^\top \mathbf{v}_{j_k}\right)\right]$$

where $P_n(j) \propto f(j)^{3/4}$ is the smoothed unigram noise distribution, $f(j)$ is the frequency of item $j$ in the training corpus, and we draw $K = 15$ negative samples per positive pair.
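The sketch below shows how the smoothed noise distribution and the per-pair objective fit together. It is an illustration under stated assumptions, not what Gensim does internally: the sessions, vectors, and function names are hypothetical, the expectation is replaced by a handful of concrete samples, and the demo draws 3 negatives rather than 15 to stay readable.

```python
import math
import random
from collections import Counter

def noise_distribution(sessions: list[list[str]], power: float = 0.75) -> dict[str, float]:
    """P_n(j) proportional to f(j)^power, where f(j) is the corpus frequency of item j."""
    counts = Counter(item for session in sessions for item in session)
    weights = {item: count ** power for item, count in counts.items()}
    total = sum(weights.values())
    return {item: w / total for item, w in weights.items()}

def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigma(x))."""
    return -math.log1p(math.exp(-x)) if x >= 0 else x - math.log1p(math.exp(x))

def dot(u: list[float], v: list[float]) -> float:
    return sum(a * b for a, b in zip(u, v))

def sgns_objective(v_i: list[float], v_j: list[float], negatives: list[list[float]]) -> float:
    """Objective for one positive pair (i, j) plus the sampled negative vectors."""
    positive_term = log_sigmoid(dot(v_i, v_j))
    negative_term = sum(log_sigmoid(-dot(v_i, v_n)) for v_n in negatives)
    return positive_term + negative_term

# Hypothetical toy corpus: item "a" appears 4 times, "b" twice, "c" once.
sessions = [["a", "b"], ["a", "c"], ["a", "b"], ["a"]]
p_n = noise_distribution(sessions)

# Draw negatives from P_n; real implementations also skip draws that equal the positive item.
rng = random.Random(0)
negative_ids = rng.choices(list(p_n), weights=list(p_n.values()), k=3)

vectors = {"a": [1.0, 0.0], "b": [0.8, 0.2], "c": [-0.5, 0.9]}
value = sgns_objective(vectors["a"], vectors["b"], [vectors[n] for n in negative_ids])
print(p_n, negative_ids, value)
```

The 3/4 exponent flattens the distribution: in this toy corpus the most frequent item is sampled less often than its raw frequency share of 4/7, and the rarest item more often than 1/7.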

```python
from gensim.models import Word2Vec

# Each session is a list of product SKUs (strings).
# load_sessions() is our own loader and is not shown here.
sessions: list[list[str]] = load_sessions()

model = Word2Vec(
    sentences=sessions,
    vector_size=128,
    window=999,        # treat session as unordered bag
    min_count=5,       # ignore items with fewer than 5 occurrences
    sg=1,              # skip-gram
    negative=15,
    epochs=20,
    workers=8,
    seed=42,
)

# Persist only the item vectors (KeyedVectors) for serving.
model.wv.save("item2vec.kv")
```

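The window=999 trick, setting the window larger than any session length, is a common way to approximate the unordered bag-of-items interpretation in Gensim’s implementation. One caveat: by default Gensim samples the effective window size for each target uniformly between 1 and window, so a target will occasionally see a truncated context. In Gensim 4.1 and later, passing shrink_windows=False fixes the window at its full size.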
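To see why an oversized window turns skip-gram into a set-based model, the sketch below enumerates the (target, context) pairs a fixed window produces. It is a simplified model of the pairing step only: the function name and SKUs are hypothetical, and Gensim's internal sampling and subsampling are deliberately ignored.

```python
from itertools import permutations

def windowed_pairs(session: list[str], window: int) -> list[tuple[str, str]]:
    """Emit (target, context) pairs for every context within `window` positions of the target."""
    pairs = []
    for pos, target in enumerate(session):
        lo = max(0, pos - window)
        hi = min(len(session), pos + window + 1)
        for ctx_pos in range(lo, hi):
            if ctx_pos != pos:
                pairs.append((target, session[ctx_pos]))
    return pairs

# Hypothetical session of three distinct SKUs.
session = ["sku_tea", "sku_teapot", "sku_strainer"]

full = windowed_pairs(session, window=999)   # window exceeds the session length
narrow = windowed_pairs(session, window=1)   # only adjacent items are paired
all_pairs = list(permutations(session, 2))   # every ordered pair with i != j

print(len(full), len(narrow))
```

With the oversized window the pairs coincide with every ordered pair of distinct items in the session, which matches the double sum over $i$ and $j \neq i$ in the objective. With a narrow window the first and last items are never paired, so item order would leak into the model.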

Evaluation

Without ground-truth held-out ratings (implicit feedback only), we evaluated using two proxies:

Hit Rate @ K: for each test session, mask the last item, retrieve the top-$K$ recommendations using the remaining items as context, and check whether the masked item appears in the top-$K$ set.

$$\text{HR}@K = \frac{1}{|\mathcal{T}|} \sum_{t \in \mathcal{T}} \mathbf{1}\!\left[\text{target}_t \in \text{top-}K(\text{context}_t)\right]$$

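Mean Reciprocal Rank: the average reciprocal rank of the masked item, where $\text{rank}_t$ is the 1-based position of the masked item in the ranked recommendation list for test session $t$: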

$$\text{MRR} = \frac{1}{|\mathcal{T}|} \sum_{t \in \mathcal{T}} \frac{1}{\text{rank}_t}$$
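Both metrics are short to compute once each test session has a ranked recommendation list. The sketch below is illustrative only: the ranked lists and targets are made up, the function names are hypothetical, and it assumes that a target missing from the ranked list contributes zero to MRR.

```python
def hit_rate_at_k(ranked_lists: list[list[str]], targets: list[str], k: int) -> float:
    """Fraction of test sessions whose masked item appears in the top-k recommendations."""
    hits = sum(1 for ranked, target in zip(ranked_lists, targets) if target in ranked[:k])
    return hits / len(targets)

def mean_reciprocal_rank(ranked_lists: list[list[str]], targets: list[str]) -> float:
    """Average of 1 / rank, with rank being the 1-based position of the masked item."""
    total = 0.0
    for ranked, target in zip(ranked_lists, targets):
        if target in ranked:
            total += 1.0 / (ranked.index(target) + 1)
        # Assumption: a target that was never retrieved adds 0 to the sum.
    return total / len(targets)

# Three hypothetical test sessions: ranked recommendations and the masked item for each.
ranked_lists = [["a", "b", "c"], ["b", "c", "a"], ["c", "a", "b"]]
targets = ["a", "a", "d"]

hr_at_2 = hit_rate_at_k(ranked_lists, targets, k=2)
mrr = mean_reciprocal_rank(ranked_lists, targets)
print(hr_at_2, mrr)
```

In this toy case only the first session scores a hit at $K = 2$, giving a hit rate of 1/3, while the reciprocal ranks are 1, 1/3, and 0, giving an MRR of 4/9.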

Our model achieved HR@10 of 0.31 and MRR of 0.14 on the held-out test set, which compared favourably to a popularity baseline (HR@10 = 0.19, MRR = 0.09) and a simple TF-IDF category similarity baseline (HR@10 = 0.22, MRR = 0.11).

Integration in the Platform

The trained embeddings are exported to a faiss IVF index at training time and served by a lightweight FastAPI sidecar process. The recommendations service layer in the main Django application queries the sidecar over internal HTTP, and the API exposes the results through the /api/v1/recommendations/ endpoint, which both the React SPA and the Android client consume.

Retraining runs nightly as a scheduled task and replaces the faiss index atomically, so the serving process never reads a partially written index mid-swap.
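One common way to get an atomic replacement is to write the new artefact to a temporary file in the same directory and then rename it over the live path. The sketch below illustrates that pattern with placeholder bytes. It is an assumption about how such a swap could work rather than a description of our pipeline, and the function name and file name are hypothetical.

```python
import os
import tempfile

def atomic_publish(data: bytes, final_path: str) -> None:
    """Write data to a temp file beside final_path, then rename it into place."""
    directory = os.path.dirname(os.path.abspath(final_path))
    # The temp file must live on the same filesystem for the rename to be atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())  # make sure the bytes are on disk before the rename
        os.replace(tmp_path, final_path)  # readers see either the old or the new file
    except BaseException:
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise

# Demo in a throwaway directory with placeholder bytes standing in for a serialised index.
with tempfile.TemporaryDirectory() as tmp_dir:
    index_path = os.path.join(tmp_dir, "items.index")
    atomic_publish(b"index-v1", index_path)
    atomic_publish(b"index-v2", index_path)
    with open(index_path, "rb") as f:
        published = f.read()
    leftovers = [name for name in os.listdir(tmp_dir) if name.endswith(".tmp")]

print(published, leftovers)
```

A serving process that has already loaded the old index into memory keeps using it until it reloads, so the rename only guarantees that whatever it reads from disk is complete.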

Limitations and Open Questions

Item2Vec learns good relative relationships between items that co-occur frequently, but it cannot learn reliable embeddings for long-tail catalogue items. Items with fewer than five appearances in the training corpus are dropped by the min_count=5 hard filter and receive no embedding at all. Cold-start items (newly listed products with no purchase history) fall back to category-based similarity until they accumulate sufficient co-occurrences.

A natural extension would be to incorporate product attribute features (category, weight, origin country) into a hybrid embedding that can handle cold start. We are evaluating whether a simple content-based warm-start initialisation, seeding the embedding of a new item with the mean vector of its category cluster, is sufficient until the item accumulates organic co-occurrences.
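The warm-start idea can be sketched in a few lines. Everything here is hypothetical (the SKUs, the category mapping, and the function name), and the sketch assumes that a category cluster simply means all items sharing a category label that already have a trained embedding.

```python
from typing import Optional

def category_mean_vector(
    embeddings: dict[str, list[float]],
    item_category: dict[str, str],
    category: str,
) -> Optional[list[float]]:
    """Mean embedding of all trained items in `category`, or None if there are none."""
    members = [
        embeddings[item]
        for item, cat in item_category.items()
        if cat == category and item in embeddings
    ]
    if not members:
        return None  # no trained item in this category, so no warm start is possible
    dim = len(members[0])
    return [sum(v[d] for v in members) / len(members) for d in range(dim)]

# Hypothetical trained embeddings and category labels.
embeddings = {
    "sku_tea": [1.0, 0.0],
    "sku_teapot": [0.8, 0.2],
    "sku_phone_case": [0.0, 1.0],
}
item_category = {
    "sku_tea": "kitchen",
    "sku_teapot": "kitchen",
    "sku_phone_case": "accessories",
    "sku_new_kettle": "kitchen",  # newly listed, no embedding yet
}

seed_vector = category_mean_vector(embeddings, item_category, "kitchen")
if seed_vector is not None:
    embeddings["sku_new_kettle"] = seed_vector
print(seed_vector)
```

Every new item in a category starts from the same vector under this scheme, so the seed can place an item in the right neighbourhood but cannot distinguish it from its category peers until real co-occurrences arrive.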

Barkan, O. & Koenigstein, N. (2016). Item2Vec: Neural Item Embedding for Collaborative Filtering. MLSP 2016.