Clustering

Clustering seeks to group data into clusters based on their properties and then allows us to predict which cluster a new point belongs to.

import numpy as np
import matplotlib.pyplot as plt

We’ll use a dataset generator that is part of scikit-learn called make_moons. This generates data that falls into 2 different sets, each with a shape that looks like a half-moon.

from sklearn import datasets
def generate_data():
    # make_moons() returns the (x, y) coordinates of each point
    # and its cluster label (0 or 1)
    xvec, val = datasets.make_moons(200, noise=0.2)
    return np.array(xvec), np.array(val)
x, v = generate_data()

Let’s look at a point and its value

print(f"x = {x[0]}, value = {v[0]}")
x = [-0.36939631  0.78652053], value = 0

Now let’s plot the data

def plot_data(x, v):
    xpt = [q[0] for q in x]
    ypt = [q[1] for q in x]

    fig, ax = plt.subplots()
    ax.scatter(xpt, ypt, s=40, c=v, cmap="viridis")
    ax.set_aspect("equal")
    return fig
fig = plot_data(x, v)
[Figure: scatter plot of the two half-moon clusters, colored by label]

We want to partition this domain into 2 regions, such that when we come in with a new point, we know which group it belongs to.

First we set up and train our network

from keras.models import Sequential
from keras.layers import Dense, Input
from keras.optimizers import RMSprop
model = Sequential()
model.add(Input(shape=(2,)))
model.add(Dense(50, activation="relu"))
model.add(Dense(20, activation="relu"))
model.add(Dense(1, activation="sigmoid"))
rms = RMSprop()
model.compile(loss='binary_crossentropy',
              optimizer=rms, metrics=['accuracy'])
model.summary()
Model: "sequential"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Layer (type)                    ┃ Output Shape           ┃       Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ dense (Dense)                   │ (None, 50)             │           150 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_1 (Dense)                 │ (None, 20)             │         1,020 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_2 (Dense)                 │ (None, 1)              │            21 │
└─────────────────────────────────┴────────────────────────┴───────────────┘
 Total params: 1,191 (4.65 KB)
 Trainable params: 1,191 (4.65 KB)
 Non-trainable params: 0 (0.00 B)
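
The parameter counts follow directly from the layer sizes: a Dense layer has (inputs × outputs) weights plus one bias per output, so 2×50 + 50 = 150, 50×20 + 20 = 1,020, and 20×1 + 1 = 21, giving 1,191 in total.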

We seem to need a lot of epochs here to get a good result

epochs = 100
results = model.fit(x, v, batch_size=50, epochs=epochs, verbose=2)
Epoch 1/100
4/4 - 0s - 6ms/step - accuracy: 0.6250 - loss: 0.6576
Epoch 2/100
4/4 - 0s - 6ms/step - accuracy: 0.7700 - loss: 0.6068
Epoch 3/100
4/4 - 0s - 5ms/step - accuracy: 0.8300 - loss: 0.5743
Epoch 4/100
4/4 - 0s - 5ms/step - accuracy: 0.8450 - loss: 0.5484
Epoch 5/100
4/4 - 0s - 5ms/step - accuracy: 0.8450 - loss: 0.5254
Epoch 6/100
4/4 - 0s - 5ms/step - accuracy: 0.8550 - loss: 0.5051
Epoch 7/100
4/4 - 0s - 5ms/step - accuracy: 0.8600 - loss: 0.4862
Epoch 8/100
4/4 - 0s - 5ms/step - accuracy: 0.8700 - loss: 0.4679
Epoch 9/100
4/4 - 0s - 5ms/step - accuracy: 0.8650 - loss: 0.4511
Epoch 10/100
4/4 - 0s - 5ms/step - accuracy: 0.8700 - loss: 0.4366
Epoch 11/100
4/4 - 0s - 6ms/step - accuracy: 0.8700 - loss: 0.4222
Epoch 12/100
4/4 - 0s - 6ms/step - accuracy: 0.8750 - loss: 0.4085
Epoch 13/100
4/4 - 0s - 19ms/step - accuracy: 0.8800 - loss: 0.3960
Epoch 14/100
4/4 - 0s - 5ms/step - accuracy: 0.8800 - loss: 0.3846
Epoch 15/100
4/4 - 0s - 6ms/step - accuracy: 0.8800 - loss: 0.3734
Epoch 16/100
4/4 - 0s - 5ms/step - accuracy: 0.8800 - loss: 0.3632
Epoch 17/100
4/4 - 0s - 5ms/step - accuracy: 0.8800 - loss: 0.3536
Epoch 18/100
4/4 - 0s - 5ms/step - accuracy: 0.8800 - loss: 0.3446
Epoch 19/100
4/4 - 0s - 5ms/step - accuracy: 0.8800 - loss: 0.3355
Epoch 20/100
4/4 - 0s - 8ms/step - accuracy: 0.8800 - loss: 0.3287
Epoch 21/100
4/4 - 0s - 5ms/step - accuracy: 0.8800 - loss: 0.3200
Epoch 22/100
4/4 - 0s - 6ms/step - accuracy: 0.8800 - loss: 0.3133
Epoch 23/100
4/4 - 0s - 6ms/step - accuracy: 0.8850 - loss: 0.3078
Epoch 24/100
4/4 - 0s - 6ms/step - accuracy: 0.8750 - loss: 0.3020
Epoch 25/100
4/4 - 0s - 21ms/step - accuracy: 0.8750 - loss: 0.2976
Epoch 26/100
4/4 - 0s - 5ms/step - accuracy: 0.8750 - loss: 0.2924
Epoch 27/100
4/4 - 0s - 5ms/step - accuracy: 0.8800 - loss: 0.2893
Epoch 28/100
4/4 - 0s - 5ms/step - accuracy: 0.8800 - loss: 0.2852
Epoch 29/100
4/4 - 0s - 5ms/step - accuracy: 0.8800 - loss: 0.2818
Epoch 30/100
4/4 - 0s - 5ms/step - accuracy: 0.8750 - loss: 0.2803
Epoch 31/100
4/4 - 0s - 5ms/step - accuracy: 0.8750 - loss: 0.2787
Epoch 32/100
4/4 - 0s - 5ms/step - accuracy: 0.8850 - loss: 0.2748
Epoch 33/100
4/4 - 0s - 5ms/step - accuracy: 0.8850 - loss: 0.2729
Epoch 34/100
4/4 - 0s - 5ms/step - accuracy: 0.8850 - loss: 0.2706
Epoch 35/100
4/4 - 0s - 5ms/step - accuracy: 0.8800 - loss: 0.2718
Epoch 36/100
4/4 - 0s - 5ms/step - accuracy: 0.8800 - loss: 0.2687
Epoch 37/100
4/4 - 0s - 5ms/step - accuracy: 0.8900 - loss: 0.2661
Epoch 38/100
4/4 - 0s - 5ms/step - accuracy: 0.8900 - loss: 0.2649
Epoch 39/100
4/4 - 0s - 5ms/step - accuracy: 0.8900 - loss: 0.2642
Epoch 40/100
4/4 - 0s - 5ms/step - accuracy: 0.8850 - loss: 0.2617
Epoch 41/100
4/4 - 0s - 5ms/step - accuracy: 0.8900 - loss: 0.2602
Epoch 42/100
4/4 - 0s - 5ms/step - accuracy: 0.8850 - loss: 0.2604
Epoch 43/100
4/4 - 0s - 5ms/step - accuracy: 0.8900 - loss: 0.2574
Epoch 44/100
4/4 - 0s - 6ms/step - accuracy: 0.8850 - loss: 0.2569
Epoch 45/100
4/4 - 0s - 5ms/step - accuracy: 0.8900 - loss: 0.2555
Epoch 46/100
4/4 - 0s - 5ms/step - accuracy: 0.8850 - loss: 0.2543
Epoch 47/100
4/4 - 0s - 5ms/step - accuracy: 0.8900 - loss: 0.2534
Epoch 48/100
4/4 - 0s - 6ms/step - accuracy: 0.8900 - loss: 0.2521
Epoch 49/100
4/4 - 0s - 5ms/step - accuracy: 0.8900 - loss: 0.2498
Epoch 50/100
4/4 - 0s - 5ms/step - accuracy: 0.8850 - loss: 0.2502
Epoch 51/100
4/4 - 0s - 5ms/step - accuracy: 0.8900 - loss: 0.2478
Epoch 52/100
4/4 - 0s - 7ms/step - accuracy: 0.8900 - loss: 0.2485
Epoch 53/100
4/4 - 0s - 6ms/step - accuracy: 0.8900 - loss: 0.2457
Epoch 54/100
4/4 - 0s - 30ms/step - accuracy: 0.8900 - loss: 0.2466
Epoch 55/100
4/4 - 0s - 5ms/step - accuracy: 0.8850 - loss: 0.2438
Epoch 56/100
4/4 - 0s - 5ms/step - accuracy: 0.8900 - loss: 0.2430
Epoch 57/100
4/4 - 0s - 5ms/step - accuracy: 0.8850 - loss: 0.2418
Epoch 58/100
4/4 - 0s - 5ms/step - accuracy: 0.8900 - loss: 0.2423
Epoch 59/100
4/4 - 0s - 5ms/step - accuracy: 0.8900 - loss: 0.2388
Epoch 60/100
4/4 - 0s - 6ms/step - accuracy: 0.8950 - loss: 0.2388
Epoch 61/100
4/4 - 0s - 5ms/step - accuracy: 0.8900 - loss: 0.2392
Epoch 62/100
4/4 - 0s - 5ms/step - accuracy: 0.8950 - loss: 0.2355
Epoch 63/100
4/4 - 0s - 5ms/step - accuracy: 0.8900 - loss: 0.2342
Epoch 64/100
4/4 - 0s - 10ms/step - accuracy: 0.8900 - loss: 0.2334
Epoch 65/100
4/4 - 0s - 5ms/step - accuracy: 0.9000 - loss: 0.2317
Epoch 66/100
4/4 - 0s - 5ms/step - accuracy: 0.9000 - loss: 0.2315
Epoch 67/100
4/4 - 0s - 6ms/step - accuracy: 0.8950 - loss: 0.2293
Epoch 68/100
4/4 - 0s - 6ms/step - accuracy: 0.8950 - loss: 0.2281
Epoch 69/100
4/4 - 0s - 7ms/step - accuracy: 0.8950 - loss: 0.2261
Epoch 70/100
4/4 - 0s - 6ms/step - accuracy: 0.9000 - loss: 0.2246
Epoch 71/100
4/4 - 0s - 21ms/step - accuracy: 0.8900 - loss: 0.2257
Epoch 72/100
4/4 - 0s - 6ms/step - accuracy: 0.9050 - loss: 0.2225
Epoch 73/100
4/4 - 0s - 5ms/step - accuracy: 0.9000 - loss: 0.2219
Epoch 74/100
4/4 - 0s - 5ms/step - accuracy: 0.9100 - loss: 0.2220
Epoch 75/100
4/4 - 0s - 5ms/step - accuracy: 0.9100 - loss: 0.2215
Epoch 76/100
4/4 - 0s - 5ms/step - accuracy: 0.9000 - loss: 0.2167
Epoch 77/100
4/4 - 0s - 5ms/step - accuracy: 0.9100 - loss: 0.2163
Epoch 78/100
4/4 - 0s - 5ms/step - accuracy: 0.9100 - loss: 0.2144
Epoch 79/100
4/4 - 0s - 6ms/step - accuracy: 0.9050 - loss: 0.2148
Epoch 80/100
4/4 - 0s - 5ms/step - accuracy: 0.9150 - loss: 0.2116
Epoch 81/100
4/4 - 0s - 5ms/step - accuracy: 0.9100 - loss: 0.2103
Epoch 82/100
4/4 - 0s - 5ms/step - accuracy: 0.9050 - loss: 0.2094
Epoch 83/100
4/4 - 0s - 6ms/step - accuracy: 0.9100 - loss: 0.2074
Epoch 84/100
4/4 - 0s - 5ms/step - accuracy: 0.9100 - loss: 0.2067
Epoch 85/100
4/4 - 0s - 5ms/step - accuracy: 0.9050 - loss: 0.2043
Epoch 86/100
4/4 - 0s - 5ms/step - accuracy: 0.9100 - loss: 0.2033
Epoch 87/100
4/4 - 0s - 5ms/step - accuracy: 0.9100 - loss: 0.2014
Epoch 88/100
4/4 - 0s - 6ms/step - accuracy: 0.9100 - loss: 0.2016
Epoch 89/100
4/4 - 0s - 5ms/step - accuracy: 0.9150 - loss: 0.1982
Epoch 90/100
4/4 - 0s - 5ms/step - accuracy: 0.9150 - loss: 0.1969
Epoch 91/100
4/4 - 0s - 5ms/step - accuracy: 0.9100 - loss: 0.1950
Epoch 92/100
4/4 - 0s - 5ms/step - accuracy: 0.9150 - loss: 0.1938
Epoch 93/100
4/4 - 0s - 5ms/step - accuracy: 0.9100 - loss: 0.1930
Epoch 94/100
4/4 - 0s - 5ms/step - accuracy: 0.9150 - loss: 0.1911
Epoch 95/100
4/4 - 0s - 6ms/step - accuracy: 0.9100 - loss: 0.1899
Epoch 96/100
4/4 - 0s - 5ms/step - accuracy: 0.9150 - loss: 0.1874
Epoch 97/100
4/4 - 0s - 5ms/step - accuracy: 0.9150 - loss: 0.1878
Epoch 98/100
4/4 - 0s - 5ms/step - accuracy: 0.9150 - loss: 0.1845
Epoch 99/100
4/4 - 0s - 5ms/step - accuracy: 0.9200 - loss: 0.1838
Epoch 100/100
4/4 - 0s - 33ms/step - accuracy: 0.9150 - loss: 0.1817
score = model.evaluate(x, v, verbose=0)
print(f"score = {score[0]}")
print(f"accuracy = {score[1]}")
score = 0.17942476272583008
accuracy = 0.9200000166893005
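
We can also look at how the loss and accuracy evolved over the epochs. fit() returns a History object whose history dictionary holds the per-epoch metrics; a minimal sketch using the results object from above:

fig, ax = plt.subplots()
ax.plot(results.history["loss"], label="loss")
ax.plot(results.history["accuracy"], label="accuracy")
ax.set_xlabel("epoch")
ax.legend()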

Let’s look at a prediction. Even for a single point, we need to feed it in as an array of shape (N, 2), where N is the number of points

res = model.predict(np.array([[-2, 2]]))
res
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 4ms/step
array([[6.1211895e-06]], dtype=float32)

We see that we get a floating point number between 0 and 1 (the output of the final sigmoid layer). We will need to convert this to 0 or 1 by rounding.
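
As a quick sketch, rounding the prediction from above gives the class label:

label = int(np.rint(res[0, 0]))   # round the sigmoid output to 0 or 1
print(label)   # 0 -- the point (-2, 2) belongs to cluster 0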

Let’s plot the partitioning

M = 128
N = 128

xmin = -1.75
xmax = 2.5
ymin = -1.25
ymax = 1.75

xpt = np.linspace(xmin, xmax, M)
ypt = np.linspace(ymin, ymax, N)

To make the prediction go faster, we want to feed in all of the grid points at once, as an array of (x, y) pairs of the form:

[[x0, y0],
 [x1, y1],
 ...
]

We can use np.meshgrid to pack the grid points into this form, with one row for every combination of a point from xpt and a point from ypt:

pairs = np.array(np.meshgrid(xpt, ypt)).T.reshape(-1, 2)
pairs[0]
array([-1.75, -1.25])
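
As a quick check, the packed array has one row for each of the M × N grid points:

print(pairs.shape)   # (16384, 2) -- 128 * 128 rows, one (x, y) pair per row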

Now we do the prediction. We will get a vector out, which we reshape to match the original domain.

res = model.predict(pairs, verbose=0)
res.shape = (M, N)

Finally, round to 0 or 1

domain = np.where(res > 0.5, 1, 0)

and we can plot the data

fig, ax = plt.subplots()
ax.imshow(domain.T, origin="lower",
          extent=[xmin, xmax, ymin, ymax], alpha=0.25)
xpt = [q[0] for q in x]
ypt = [q[1] for q in x]

ax.scatter(xpt, ypt, s=40, c=v, cmap="viridis")
<matplotlib.collections.PathCollection at 0x7f1cfc7cdbd0>
[Figure: the learned partitioning of the plane, shaded by predicted class, with the data points overlaid]