Clustering


Clustering seeks to group data into clusters based on their properties, so that we can then predict which cluster a new member belongs to.

import numpy as np
import matplotlib.pyplot as plt

We’ll use a dataset generator that is part of scikit-learn called make_moons. This generates data that falls into 2 different sets with a shape that looks like half-moons.

from sklearn import datasets
def generate_data():
    xvec, val = datasets.make_moons(200, noise=0.2)

    # copy each point and its label into lists
    # (make_moons already returns NumPy arrays, so this is just a repack)
    x = []
    v = []
    for xv, vv in zip(xvec, val):
        x.append(np.array(xv))
        v.append(vv)

    return np.array(x), np.array(v)
x, v = generate_data()

Let’s look at a point and its value:

print(f"x = {x[0]}, value = {v[0]}")
x = [1.10629383 0.71598317], value = 0
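Because make_moons adds random noise, the point printed above will differ from run to run. As a minimal sketch, assuming reproducible data is wanted, the generator can be given a fixed seed through its random_state argument (the seed value 42 and the names x_a, v_a, x_b, v_b are arbitrary choices for illustration, not part of the original notebook):

```python
import numpy as np
from sklearn import datasets

# Two calls with the same seed should give identical data.
# The seed value 42 is an arbitrary choice for this sketch.
x_a, v_a = datasets.make_moons(200, noise=0.2, random_state=42)
x_b, v_b = datasets.make_moons(200, noise=0.2, random_state=42)

print(x_a.shape, v_a.shape)        # points are (200, 2), labels are (200,)
print(np.array_equal(x_a, x_b))    # same seed, same points
print(np.bincount(v_a))            # number of points in each half-moon
```

The labels are already integers 0 and 1, one per point, so they can be used directly as targets for a network with a single sigmoid output.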

Now let’s plot the data

def plot_data(x, v):
    xpt = [q[0] for q in x]
    ypt = [q[1] for q in x]

    fig, ax = plt.subplots()
    ax.scatter(xpt, ypt, s=40, c=v, cmap="viridis")
    ax.set_aspect("equal")
    return fig
fig = plot_data(x, v)
../_images/e9dc139a650c3cef4299c08421529ba827dc89e058d41c98e95475681f477388.png

We want to partition this domain into 2 regions, such that when we come in with a new point, we know which group it belongs to.

First we set up and train our network:

from keras.models import Sequential
from keras.layers import Dense, Input
from keras.optimizers import RMSprop
model = Sequential()
model.add(Input(shape=(2,)))
model.add(Dense(50, activation="relu"))
model.add(Dense(20, activation="relu"))
model.add(Dense(1, activation="sigmoid"))
rms = RMSprop()
model.compile(loss='binary_crossentropy',
              optimizer=rms, metrics=['accuracy'])
model.summary()
Model: "sequential"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Layer (type)                    ┃ Output Shape           ┃       Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ dense (Dense)                   │ (None, 50)             │           150 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_1 (Dense)                 │ (None, 20)             │         1,020 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_2 (Dense)                 │ (None, 1)              │            21 │
└─────────────────────────────────┴────────────────────────┴───────────────┘
 Total params: 1,191 (4.65 KB)
 Trainable params: 1,191 (4.65 KB)
 Non-trainable params: 0 (0.00 B)
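The parameter counts in the summary can be checked by hand: a dense layer with n_in inputs and n_out outputs has n_in * n_out weights plus n_out biases. A small sketch of that arithmetic follows (the helper name dense_params is made up for this illustration):

```python
def dense_params(n_in, n_out):
    """Number of trainable parameters in a fully connected layer:
    one weight per input/output pair plus one bias per output."""
    return n_in * n_out + n_out

# layer widths of the network above: 2 inputs -> 50 -> 20 -> 1 output
sizes = [2, 50, 20, 1]
counts = [dense_params(a, b) for a, b in zip(sizes[:-1], sizes[1:])]

print(counts)        # per-layer parameter counts
print(sum(counts))   # total number of trainable parameters
```

This gives 150, 1,020, and 21 parameters for the three layers, for a total of 1,191, matching the summary.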

We seem to need a lot of epochs here to get a good result, so we train for 100.

epochs = 100
results = model.fit(x, v, batch_size=50, epochs=epochs, verbose=2)
Epoch 1/100
4/4 - 1s - 147ms/step - accuracy: 0.6900 - loss: 0.6694
Epoch 2/100
4/4 - 0s - 6ms/step - accuracy: 0.7850 - loss: 0.6438
Epoch 3/100
4/4 - 0s - 6ms/step - accuracy: 0.8050 - loss: 0.6222
Epoch 4/100
4/4 - 0s - 6ms/step - accuracy: 0.8150 - loss: 0.6035
Epoch 5/100
4/4 - 0s - 6ms/step - accuracy: 0.8150 - loss: 0.5865
Epoch 6/100
4/4 - 0s - 6ms/step - accuracy: 0.8150 - loss: 0.5704
Epoch 7/100
4/4 - 0s - 6ms/step - accuracy: 0.8150 - loss: 0.5536
Epoch 8/100
4/4 - 0s - 6ms/step - accuracy: 0.8200 - loss: 0.5373
Epoch 9/100
4/4 - 0s - 6ms/step - accuracy: 0.8200 - loss: 0.5219
Epoch 10/100
4/4 - 0s - 6ms/step - accuracy: 0.8150 - loss: 0.5063
Epoch 11/100
4/4 - 0s - 6ms/step - accuracy: 0.8200 - loss: 0.4908
Epoch 12/100
4/4 - 0s - 6ms/step - accuracy: 0.8300 - loss: 0.4752
Epoch 13/100
4/4 - 0s - 6ms/step - accuracy: 0.8200 - loss: 0.4603
Epoch 14/100
4/4 - 0s - 8ms/step - accuracy: 0.8350 - loss: 0.4462
Epoch 15/100
4/4 - 0s - 6ms/step - accuracy: 0.8250 - loss: 0.4329
Epoch 16/100
4/4 - 0s - 6ms/step - accuracy: 0.8350 - loss: 0.4196
Epoch 17/100
4/4 - 0s - 6ms/step - accuracy: 0.8350 - loss: 0.4066
Epoch 18/100
4/4 - 0s - 6ms/step - accuracy: 0.8350 - loss: 0.3950
Epoch 19/100
4/4 - 0s - 6ms/step - accuracy: 0.8350 - loss: 0.3844
Epoch 20/100
4/4 - 0s - 6ms/step - accuracy: 0.8400 - loss: 0.3741
Epoch 21/100
4/4 - 0s - 6ms/step - accuracy: 0.8450 - loss: 0.3652
Epoch 22/100
4/4 - 0s - 6ms/step - accuracy: 0.8450 - loss: 0.3564
Epoch 23/100
4/4 - 0s - 6ms/step - accuracy: 0.8500 - loss: 0.3486
Epoch 24/100
4/4 - 0s - 6ms/step - accuracy: 0.8400 - loss: 0.3410
Epoch 25/100
4/4 - 0s - 6ms/step - accuracy: 0.8550 - loss: 0.3358
Epoch 26/100
4/4 - 0s - 6ms/step - accuracy: 0.8550 - loss: 0.3285
Epoch 27/100
4/4 - 0s - 6ms/step - accuracy: 0.8550 - loss: 0.3233
Epoch 28/100
4/4 - 0s - 6ms/step - accuracy: 0.8400 - loss: 0.3187
Epoch 29/100
4/4 - 0s - 6ms/step - accuracy: 0.8550 - loss: 0.3134
Epoch 30/100
4/4 - 0s - 6ms/step - accuracy: 0.8550 - loss: 0.3104
Epoch 31/100
4/4 - 0s - 6ms/step - accuracy: 0.8550 - loss: 0.3049
Epoch 32/100
4/4 - 0s - 6ms/step - accuracy: 0.8600 - loss: 0.3005
Epoch 33/100
4/4 - 0s - 6ms/step - accuracy: 0.8500 - loss: 0.2970
Epoch 34/100
4/4 - 0s - 6ms/step - accuracy: 0.8700 - loss: 0.2931
Epoch 35/100
4/4 - 0s - 6ms/step - accuracy: 0.8700 - loss: 0.2904
Epoch 36/100
4/4 - 0s - 6ms/step - accuracy: 0.8700 - loss: 0.2871
Epoch 37/100
4/4 - 0s - 6ms/step - accuracy: 0.8700 - loss: 0.2832
Epoch 38/100
4/4 - 0s - 6ms/step - accuracy: 0.8700 - loss: 0.2802
Epoch 39/100
4/4 - 0s - 6ms/step - accuracy: 0.8750 - loss: 0.2784
Epoch 40/100
4/4 - 0s - 6ms/step - accuracy: 0.8750 - loss: 0.2761
Epoch 41/100
4/4 - 0s - 6ms/step - accuracy: 0.8800 - loss: 0.2720
Epoch 42/100
4/4 - 0s - 6ms/step - accuracy: 0.8700 - loss: 0.2710
Epoch 43/100
4/4 - 0s - 6ms/step - accuracy: 0.8750 - loss: 0.2691
Epoch 44/100
4/4 - 0s - 6ms/step - accuracy: 0.8800 - loss: 0.2653
Epoch 45/100
4/4 - 0s - 6ms/step - accuracy: 0.8800 - loss: 0.2640
Epoch 46/100
4/4 - 0s - 6ms/step - accuracy: 0.8850 - loss: 0.2618
Epoch 47/100
4/4 - 0s - 6ms/step - accuracy: 0.8850 - loss: 0.2593
Epoch 48/100
4/4 - 0s - 6ms/step - accuracy: 0.8800 - loss: 0.2573
Epoch 49/100
4/4 - 0s - 6ms/step - accuracy: 0.8850 - loss: 0.2556
Epoch 50/100
4/4 - 0s - 6ms/step - accuracy: 0.8850 - loss: 0.2544
Epoch 51/100
4/4 - 0s - 6ms/step - accuracy: 0.8850 - loss: 0.2528
Epoch 52/100
4/4 - 0s - 8ms/step - accuracy: 0.8850 - loss: 0.2489
Epoch 53/100
4/4 - 0s - 6ms/step - accuracy: 0.8800 - loss: 0.2497
Epoch 54/100
4/4 - 0s - 6ms/step - accuracy: 0.8900 - loss: 0.2463
Epoch 55/100
4/4 - 0s - 6ms/step - accuracy: 0.8850 - loss: 0.2445
Epoch 56/100
4/4 - 0s - 6ms/step - accuracy: 0.8850 - loss: 0.2437
Epoch 57/100
4/4 - 0s - 6ms/step - accuracy: 0.8850 - loss: 0.2416
Epoch 58/100
4/4 - 0s - 6ms/step - accuracy: 0.8850 - loss: 0.2399
Epoch 59/100
4/4 - 0s - 6ms/step - accuracy: 0.8850 - loss: 0.2379
Epoch 60/100
4/4 - 0s - 6ms/step - accuracy: 0.8850 - loss: 0.2366
Epoch 61/100
4/4 - 0s - 6ms/step - accuracy: 0.8900 - loss: 0.2362
Epoch 62/100
4/4 - 0s - 6ms/step - accuracy: 0.8950 - loss: 0.2327
Epoch 63/100
4/4 - 0s - 6ms/step - accuracy: 0.9050 - loss: 0.2321
Epoch 64/100
4/4 - 0s - 6ms/step - accuracy: 0.9000 - loss: 0.2309
Epoch 65/100
4/4 - 0s - 6ms/step - accuracy: 0.8950 - loss: 0.2296
Epoch 66/100
4/4 - 0s - 6ms/step - accuracy: 0.9050 - loss: 0.2288
Epoch 67/100
4/4 - 0s - 6ms/step - accuracy: 0.9000 - loss: 0.2254
Epoch 68/100
4/4 - 0s - 6ms/step - accuracy: 0.9050 - loss: 0.2241
Epoch 69/100
4/4 - 0s - 6ms/step - accuracy: 0.9050 - loss: 0.2259
Epoch 70/100
4/4 - 0s - 6ms/step - accuracy: 0.9100 - loss: 0.2232
Epoch 71/100
4/4 - 0s - 6ms/step - accuracy: 0.9100 - loss: 0.2215
Epoch 72/100
4/4 - 0s - 6ms/step - accuracy: 0.9150 - loss: 0.2187
Epoch 73/100
4/4 - 0s - 6ms/step - accuracy: 0.9100 - loss: 0.2220
Epoch 74/100
4/4 - 0s - 6ms/step - accuracy: 0.9150 - loss: 0.2172
Epoch 75/100
4/4 - 0s - 6ms/step - accuracy: 0.9100 - loss: 0.2153
Epoch 76/100
4/4 - 0s - 6ms/step - accuracy: 0.9150 - loss: 0.2146
Epoch 77/100
4/4 - 0s - 6ms/step - accuracy: 0.9150 - loss: 0.2129
Epoch 78/100
4/4 - 0s - 6ms/step - accuracy: 0.9150 - loss: 0.2118
Epoch 79/100
4/4 - 0s - 6ms/step - accuracy: 0.9100 - loss: 0.2134
Epoch 80/100
4/4 - 0s - 6ms/step - accuracy: 0.9150 - loss: 0.2098
Epoch 81/100
4/4 - 0s - 6ms/step - accuracy: 0.9150 - loss: 0.2072
Epoch 82/100
4/4 - 0s - 6ms/step - accuracy: 0.9200 - loss: 0.2069
Epoch 83/100
4/4 - 0s - 6ms/step - accuracy: 0.9150 - loss: 0.2047
Epoch 84/100
4/4 - 0s - 6ms/step - accuracy: 0.9200 - loss: 0.2075
Epoch 85/100
4/4 - 0s - 6ms/step - accuracy: 0.9250 - loss: 0.2032
Epoch 86/100
4/4 - 0s - 6ms/step - accuracy: 0.9250 - loss: 0.2033
Epoch 87/100
4/4 - 0s - 6ms/step - accuracy: 0.9250 - loss: 0.2014
Epoch 88/100
4/4 - 0s - 6ms/step - accuracy: 0.9250 - loss: 0.1987
Epoch 89/100
4/4 - 0s - 9ms/step - accuracy: 0.9200 - loss: 0.2008
Epoch 90/100
4/4 - 0s - 6ms/step - accuracy: 0.9200 - loss: 0.1973
Epoch 91/100
4/4 - 0s - 6ms/step - accuracy: 0.9250 - loss: 0.1967
Epoch 92/100
4/4 - 0s - 6ms/step - accuracy: 0.9250 - loss: 0.1956
Epoch 93/100
4/4 - 0s - 6ms/step - accuracy: 0.9250 - loss: 0.1934
Epoch 94/100
4/4 - 0s - 6ms/step - accuracy: 0.9250 - loss: 0.1948
Epoch 95/100
4/4 - 0s - 6ms/step - accuracy: 0.9300 - loss: 0.1917
Epoch 96/100
4/4 - 0s - 6ms/step - accuracy: 0.9300 - loss: 0.1916
Epoch 97/100
4/4 - 0s - 6ms/step - accuracy: 0.9300 - loss: 0.1910
Epoch 98/100
4/4 - 0s - 6ms/step - accuracy: 0.9300 - loss: 0.1879
Epoch 99/100
4/4 - 0s - 6ms/step - accuracy: 0.9300 - loss: 0.1875
Epoch 100/100
4/4 - 0s - 6ms/step - accuracy: 0.9250 - loss: 0.1854
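The "4/4" in each line of the log is the number of batches per epoch. As a rough sketch of where it comes from, using the 200 points, batch size, and epoch count from above, and relying on the fact that fit performs one weight update per batch:

```python
import math

n_samples = 200    # points produced by make_moons above
batch_size = 50    # value passed to model.fit
epochs = 100

# each epoch sweeps the data once, one batch at a time
steps_per_epoch = math.ceil(n_samples / batch_size)
total_updates = steps_per_epoch * epochs

print(f"steps per epoch = {steps_per_epoch}")
print(f"total weight updates = {total_updates}")
```

With only 4 updates per epoch, the 100 epochs amount to 400 updates, which is one way to see why so many epochs are needed on such a small dataset.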
score = model.evaluate(x, v, verbose=0)
print(f"score = {score[0]}")
print(f"accuracy = {score[1]}")
score = 0.18385840952396393
accuracy = 0.9300000071525574
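The two numbers returned by evaluate are the loss (here the binary cross-entropy) and the accuracy metric. To make them concrete, here is a NumPy sketch of both quantities on a tiny made-up example. The arrays y_true and y_prob are hypothetical values chosen for illustration, and the helper functions only approximate what Keras computes internally (for example, the clipping constant eps is an assumption):

```python
import numpy as np

def binary_crossentropy(y_true, y_prob, eps=1e-7):
    """Mean of -[y log(p) + (1 - y) log(1 - p)] over all samples."""
    p = np.clip(y_prob, eps, 1 - eps)   # avoid log(0)
    return float(np.mean(-(y_true * np.log(p) + (1 - y_true) * np.log(1 - p))))

def binary_accuracy(y_true, y_prob, threshold=0.5):
    """Fraction of samples whose thresholded prediction matches the label."""
    return float(np.mean((y_prob > threshold).astype(int) == y_true))

# hypothetical labels and predicted probabilities, for illustration only
y_true = np.array([0, 1, 1, 0])
y_prob = np.array([0.1, 0.9, 0.4, 0.2])

loss = binary_crossentropy(y_true, y_prob)
acc = binary_accuracy(y_true, y_prob)
print(f"loss = {loss:.4f}, accuracy = {acc}")
```

In this toy example, three of the four points land on the correct side of 0.5, so the accuracy is 0.75, while the loss also penalizes how far each probability is from its label.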

Let’s look at a prediction. The model expects its input as an array of shape (N, 2), where N is the number of points, so even a single point is passed as an array of shape (1, 2).

res = model.predict(np.array([[-2, 2]]))
res
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 30ms/step

1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 41ms/step
array([[1.16590035e-08]], dtype=float32)

We see that we get a floating-point number between 0 and 1, since the last layer uses a sigmoid activation. We will need to convert this to 0 or 1 by rounding.
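As a small sketch of that rounding step, the network output can be thresholded at 0.5 with np.where. Only the first value below comes from the prediction above; the other two are made-up probabilities added for illustration:

```python
import numpy as np

# first entry is the prediction shown above; the rest are hypothetical
probs = np.array([[1.16590035e-08],
                  [0.73],
                  [0.5]], dtype=np.float32)

# anything strictly above 0.5 is assigned to class 1, the rest to class 0
labels = np.where(probs > 0.5, 1, 0)
print(labels.ravel())
```

With this convention a value of exactly 0.5 is assigned to class 0; that tie-breaking choice is arbitrary.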

Let’s plot the partitioning

M = 128
N = 128

xmin = -1.75
xmax = 2.5
ymin = -1.25
ymax = 1.75

xpt = np.linspace(xmin, xmax, M)
ypt = np.linspace(ymin, ymax, N)

To make the prediction go faster, we want to feed in all of the grid points at once, as a single array that pairs every x value with every y value:

[[xpt[0], ypt[0]],
 [xpt[0], ypt[1]],
 ...
]

The following expression packs the grid into such an array; its first entry is the corner (xmin, ymin):

pairs = np.array(np.meshgrid(xpt, ypt)).T.reshape(-1, 2)
pairs[0]
array([-1.75, -1.25])
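The ordering produced by the meshgrid / transpose / reshape combination is easier to see on a tiny grid. The sketch below uses made-up coordinates xs and ys (3 values in x, 2 in y) rather than the real grid:

```python
import numpy as np

xs = np.array([0.0, 1.0, 2.0])   # stands in for xpt (M = 3)
ys = np.array([10.0, 20.0])      # stands in for ypt (N = 2)

# same construction as above, on the small grid
pairs = np.array(np.meshgrid(xs, ys)).T.reshape(-1, 2)
print(pairs)

# y varies fastest, so reshaping a per-point result to (M, N)
# gives an array indexed as [x index, y index]
x_of_point = pairs[:, 0].reshape(len(xs), len(ys))
print(x_of_point)
```

Each x value is paired with every y value before moving on to the next x, which is why the prediction is later reshaped to (M, N) and then transposed for plotting.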

Now we do the prediction. We will get a vector out, which we reshape to match the original domain.

res = model.predict(pairs, verbose=0)
res = res.reshape(M, N)

Finally, round to 0 or 1

domain = np.where(res > 0.5, 1, 0)

Now we can plot the partitioned domain together with the data:

fig, ax = plt.subplots()
ax.imshow(domain.T, origin="lower",
          extent=[xmin, xmax, ymin, ymax], alpha=0.25)
xpt = [q[0] for q in x]
ypt = [q[1] for q in x]

ax.scatter(xpt, ypt, s=40, c=v, cmap="viridis")
../_images/6264a61763c754e10ee09ecfd9b05f0031e5c8d44d4746a92f85b38d1be894bd.png