Kalakari: AI Sketch Recognition

Kalakari

A sketch recognition experiment.
Draw. The neural network guesses.
Can you communicate with AI?

Built by Shiven Saini
Powered by SketchNet
shiven.dev

50 categories · draw any of these

🐱 cat
🐶 dog
🚗 car
🏠 house
🌳 tree
🐟 fish
🐦 bird
✈️ airplane
🚲 bicycle
🕐 clock
☀️ sun
🌙 moon
⭐ star
🌸 flower
🍎 apple
🍌 banana
🍕 pizza
🎸 guitar
🎩 hat
👟 shoe
🐘 elephant
🦒 giraffe
🐧 penguin
🐬 dolphin
🦋 butterfly
🍓 strawberry
🍍 pineapple
🍉 watermelon
🍇 grapes
📷 camera
☎️ telephone
💻 laptop
📺 television
🛋️ couch
🪑 chair
🛏️ bed
🚪 door
🖼️ picture frame
🪜 ladder
🌉 bridge
⛵ sailboat
🚌 bus
🚂 train
🚁 helicopter
🎈 hot air balloon
⚔️ sword
👑 crown
💎 diamond
⏳ hourglass
🕯️ candle

Rounds

30s

Per Round

96%+

Model Acc.

ONNX

Runtime

Tip: draw fast and simple shapes for best guesses.

Under the hood

SketchNet

~2.1M

Parameters

64×64

Input Size

96%+

Validation Accuracy

50

Classes

Forward pass — Input → Output

Input: sketch rendered from strokes → (1,64,64)
Block 1: Conv 3×3, ResBlock, MaxPool /2 → (32,32,32)
Block 2: Conv 3×3, ResBlock ×3, MaxPool /2 → (64,16,16)
Multi-Scale: 4 parallel Inception-style branches, concat to 128ch → (128,16,16)
Block 3: DW-Sep Conv, ResBlock, MaxPool /2 → (192,8,8)
Block 4: DW-Sep ×2, ResBlock, MaxPool /2 → (384,4,4)
GAP: Global Average Pooling → (384,)
Output: Linear + Softmax, 50 classes → (50,)
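
The shape annotations in the diagram can be checked with a few lines of arithmetic. The sketch below is an illustration, not SketchNet's code: it assumes every convolution preserves height and width (same padding) and every MaxPool /2 halves them. `STAGES` and `trace_shapes` are hypothetical names.

```python
# Stage table transcribed from the forward-pass diagram:
# (stage name, output channels, ends with MaxPool /2?)
STAGES = [
    ("Block 1", 32, True),
    ("Block 2", 64, True),
    ("Multi-Scale", 128, False),
    ("Block 3", 192, True),
    ("Block 4", 384, True),
]


def trace_shapes(input_shape=(1, 64, 64)):
    """Return [(stage, (C, H, W)), ...], assuming same-padded convs and stride-2 pools."""
    _, h, w = input_shape
    shapes = [("Input", input_shape)]
    for name, channels, pooled in STAGES:
        if pooled:
            h, w = h // 2, w // 2  # MaxPool /2 halves each spatial dimension
        shapes.append((name, (channels, h, w)))
    return shapes


for stage, shape in trace_shapes():
    print(f"{stage:12s} {shape}")

# After global average pooling only the channel axis remains: (384,).
# A 384 -> 50 linear layer then holds 384 * 50 weights + 50 biases.
CLASSIFIER_PARAMS = 384 * 50 + 50
```

Four pooling stages take the 64×64 canvas down to 4×4, which is why Block 4 ends at (384,4,4).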

Multi-scale block — stroke-width invariance

1×1 Conv

Fine details. Stroke tips & sharp corners.

3×3 Conv

Medium strokes. Most common width.

5×5 Conv

Thick & gestural. Bold drawing styles.

Pool + 1×1

Context features. Surrounding structure.

All 4 branches run in parallel, then concatenate into 128 channels, so a clock drawn thick or thin looks much the same to the network.
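
To make the parallel-branch idea concrete, here is a toy sketch in plain Python. It is not the SketchNet block: fixed averaging kernels stand in for learned weights, each branch emits a single channel (how the real block splits its 128 channels is not stated), and the names `box_filter`, `max_pool_same` and `multi_scale` are hypothetical.

```python
def box_filter(img, k):
    """Mean over a k×k window with zero padding; output keeps the input's size."""
    h, w, r = len(img), len(img[0]), k // 2
    return [
        [
            sum(
                img[yy][xx]
                for yy in range(max(0, y - r), min(h, y + r + 1))
                for xx in range(max(0, x - r), min(w, x + r + 1))
            ) / (k * k)
            for x in range(w)
        ]
        for y in range(h)
    ]


def max_pool_same(img, k=3):
    """Stride-1 max pooling over a k×k window; output keeps the input's size."""
    h, w, r = len(img), len(img[0]), k // 2
    return [
        [
            max(
                img[yy][xx]
                for yy in range(max(0, y - r), min(h, y + r + 1))
                for xx in range(max(0, x - r), min(w, x + r + 1))
            )
            for x in range(w)
        ]
        for y in range(h)
    ]


def multi_scale(img):
    """Run four branches on the same input and stack them along a channel axis."""
    return [
        box_filter(img, 1),                    # 1×1 branch: per-pixel detail
        box_filter(img, 3),                    # 3×3 branch: medium strokes
        box_filter(img, 5),                    # 5×5 branch: thick strokes
        box_filter(max_pool_same(img, 3), 1),  # pool + 1×1 branch: local context
    ]


# A one-pixel-wide vertical stroke on a 7×7 canvas.
thin_stroke = [[1.0 if x == 3 else 0.0 for x in range(7)] for _ in range(7)]
features = multi_scale(thin_stroke)  # 4 channels, each still 7×7
```

Because every branch preserves height and width, the outputs can be concatenated channel-wise. The wider windows respond to the stroke over a wider area, which is the intuition behind handling different stroke widths.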

Design decisions

Residual blocks

Skip connections mitigate vanishing gradients and have a regularizing effect, which matters for sketch data where everyone draws differently.
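
A scalar toy model shows why the skip path helps gradients. It assumes every layer has the same small local derivative, which real networks do not. The function name and the numbers are illustrative only.

```python
def gradient_through_stack(local_grad, depth, residual):
    """Derivative of output w.r.t. input for `depth` stacked scalar layers.

    Plain layer:    y = f(x)      ->  dy/dx = f'(x)
    Residual layer: y = x + f(x)  ->  dy/dx = 1 + f'(x)
    The chain rule multiplies the per-layer derivatives together.
    """
    per_layer = 1.0 + local_grad if residual else local_grad
    return per_layer ** depth


plain = gradient_through_stack(0.1, 10, residual=False)  # 0.1 ** 10, about 1e-10
skip = gradient_through_stack(0.1, 10, residual=True)    # 1.1 ** 10, about 2.59
print(f"plain: {plain:.2e}  residual: {skip:.2f}")
```

The identity term keeps each factor near 1, so the product does not collapse toward zero as depth grows.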

DW-Sep convolutions

Blocks 3 & 4 use depthwise separable convolutions: close to 9× fewer parameters than a standard Conv2d with a 3×3 kernel, at similar accuracy.
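
The ~9× figure can be reproduced by counting weights. This sketch assumes 3×3 kernels and ignores biases and normalization layers. The channel pairs are read off the stage shapes in the forward-pass diagram, and the exact layer layout inside Blocks 3 and 4 is an assumption.

```python
def conv_params(c_in, c_out, k=3):
    """Weights in a standard k×k convolution (bias ignored)."""
    return k * k * c_in * c_out


def dw_sep_params(c_in, c_out, k=3):
    """Depthwise k×k (one filter per input channel) plus pointwise 1×1 (bias ignored)."""
    return k * k * c_in + c_in * c_out


# Hypothetical layers matching the channel changes into Block 3 and Block 4.
for c_in, c_out in [(128, 192), (192, 384)]:
    standard = conv_params(c_in, c_out)
    separable = dw_sep_params(c_in, c_out)
    print(f"{c_in}->{c_out}: {standard} vs {separable} ({standard / separable:.2f}x fewer)")
```

The ratio works out to 1 / (1/c_out + 1/k²), so it approaches k² = 9 as the output channel count grows and sits a little below 9 for these widths.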

Global avg pool

GAP gives approximate translation invariance: a cat drawn in any corner of the canvas produces nearly the same feature vector.
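
A minimal sketch of the idea on a single 4×4 feature map: averaging over all positions discards where an activation occurred. In a full CNN the effect is only approximate, because padding and pooling grids treat edge positions slightly differently. The helper names are hypothetical.

```python
def global_average_pool(feature_map):
    """Collapse an H×W map to one number: the mean of all its values."""
    values = [v for row in feature_map for v in row]
    return sum(values) / len(values)


def place_blob(size, top, left):
    """Return a size×size map of zeros with a 2×2 blob of ones at (top, left)."""
    grid = [[0.0] * size for _ in range(size)]
    for y in range(top, top + 2):
        for x in range(left, left + 2):
            grid[y][x] = 1.0
    return grid


top_left = global_average_pool(place_blob(4, 0, 0))
bottom_right = global_average_pool(place_blob(4, 2, 2))
print(top_left, bottom_right)  # both 0.25: the position of the blob is averaged away
```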

Mixup training

Training blends random pairs of drawings (mixup, α=0.4), which encourages smoother decision boundaries between confusable pairs such as clock/sun and cat/dog.
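
A minimal sketch of mixup on flattened images and one-hot labels, using the standard formulation in which the blend weight is drawn from Beta(α, α). Whether Kalakari mixes labels directly or mixes the two losses is not stated, so treat the `mixup` function and its signature as an assumption.

```python
import random


def mixup(x_a, y_a, x_b, y_b, alpha=0.4, rng=random):
    """Blend two (pixels, one-hot label) pairs with a weight lam ~ Beta(alpha, alpha)."""
    lam = rng.betavariate(alpha, alpha)
    x = [lam * a + (1.0 - lam) * b for a, b in zip(x_a, x_b)]
    y = [lam * a + (1.0 - lam) * b for a, b in zip(y_a, y_b)]
    return x, y, lam


# Two tiny "drawings" with two pixels each, and labels over two classes.
clock_pixels, clock_label = [1.0, 0.0], [1.0, 0.0]
sun_pixels, sun_label = [0.0, 1.0], [0.0, 1.0]

mixed_x, mixed_y, lam = mixup(
    clock_pixels, clock_label, sun_pixels, sun_label, rng=random.Random(0)
)
print(lam, mixed_x, mixed_y)
```

With α below 1 the Beta distribution is U-shaped, so most blends stay close to one of the two originals and only occasionally land near an even mix.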

SGD + OneCycleLR

SGD with Nesterov momentum, scheduled by OneCycleLR (warmup followed by cosine decay). Converges in ~25 epochs from random initialization.
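
The shape of a one-cycle schedule can be sketched without any framework. The version below is a simplified stand-in for a scheduler such as PyTorch's `OneCycleLR` with cosine annealing, not a reimplementation of it. All hyperparameter values (`max_lr`, `pct_start`, `div_factor`, `final_div_factor`) are illustrative assumptions, not Kalakari's settings.

```python
import math


def _cos_anneal(start, end, pct):
    """Move from `start` (pct=0) to `end` (pct=1) along half a cosine wave."""
    return end + (start - end) / 2.0 * (math.cos(math.pi * pct) + 1.0)


def one_cycle_lr(step, total_steps, max_lr=0.1, pct_start=0.3,
                 div_factor=25.0, final_div_factor=1e4):
    """Learning rate at `step`: warm up to max_lr, then decay to a tiny final value."""
    initial_lr = max_lr / div_factor
    final_lr = initial_lr / final_div_factor
    warmup_steps = max(1, round(pct_start * total_steps))
    if step < warmup_steps:
        return _cos_anneal(initial_lr, max_lr, step / warmup_steps)
    decay_steps = max(1, total_steps - 1 - warmup_steps)
    pct = min(1.0, (step - warmup_steps) / decay_steps)
    return _cos_anneal(max_lr, final_lr, pct)


# Hypothetical run of 100 optimizer steps.
schedule = [one_cycle_lr(s, 100) for s in range(100)]
print(schedule[0], max(schedule), schedule[-1])
```

The rate starts at `max_lr / div_factor`, peaks after the warmup fraction of training, and then decays smoothly so the final steps make only small updates.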

Client inference

Model exported to ONNX and runs entirely in-browser via ONNX Runtime Web (WebGPU / WASM). No server needed.

Author

Shiven Saini

Kalakari Project · 2026
