Clustering

Clustering seeks to group data into clusters based on their properties, so that we can then predict which cluster a new member belongs to.

import numpy as np
import matplotlib.pyplot as plt

We’ll use a dataset generator that is part of scikit-learn called make_moons. This generates data that falls into 2 different sets with a shape that looks like half-moons.

from sklearn import datasets
def generate_data():
    # make_moons returns the points, shape (200, 2), and a 0/1 label for each
    xvec, val = datasets.make_moons(200, noise=0.2)

    # return both as NumPy arrays
    return np.array(xvec), np.array(val)
x, v = generate_data()

Let’s look at a point and its value

print(f"x = {x[0]}, value = {v[0]}")
x = [0.7827843  0.72758715], value = 0
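As a quick check on what the generator returns, the sketch below calls make_moons directly and inspects the shapes and the label counts. The names points and labels are illustrative, and random_state=0 is an assumption added only to make the sketch repeatable; the tutorial code does not fix a seed.

```python
from sklearn import datasets

# random_state is fixed here only so the sketch is repeatable
points, labels = datasets.make_moons(200, noise=0.2, random_state=0)

print(points.shape)   # (200, 2): one (x, y) pair per sample
print(labels.shape)   # (200,): one label per sample

# the samples are split evenly between the two half-moons
n_zero = int((labels == 0).sum())
n_one = int((labels == 1).sum())
print(n_zero, n_one)  # 100 100
```

Each row of points is one location in the plane, and the matching entry of labels says which half-moon it was drawn from.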

Now let’s plot the data

def plot_data(x, v):
    xpt = [q[0] for q in x]
    ypt = [q[1] for q in x]

    fig, ax = plt.subplots()
    ax.scatter(xpt, ypt, s=40, c=v, cmap="viridis")
    ax.set_aspect("equal")
    return fig
fig = plot_data(x, v)
../_images/6dfb847c553ad73ad14fac6021fecee4aa1a9c8636d4f8d33064dff242fb5ef0.png

We want to partition this domain into 2 regions, such that when we come in with a new point, we know which group it belongs to.

First we set up and train our network

from keras.models import Sequential
from keras.layers import Dense, Dropout, Activation, Input
from keras.optimizers import RMSprop
model = Sequential()
model.add(Input(shape=(2,)))
model.add(Dense(50, activation="relu"))
model.add(Dense(20, activation="relu"))
model.add(Dense(1, activation="sigmoid"))
rms = RMSprop()
model.compile(loss='binary_crossentropy',
              optimizer=rms, metrics=['accuracy'])
model.summary()
Model: "sequential"
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━━━━━━━━┓
┃ Layer (type)                    ┃ Output Shape           ┃       Param # ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━━━━━━━━┩
│ dense (Dense)                   │ (None, 50)             │           150 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_1 (Dense)                 │ (None, 20)             │         1,020 │
├─────────────────────────────────┼────────────────────────┼───────────────┤
│ dense_2 (Dense)                 │ (None, 1)              │            21 │
└─────────────────────────────────┴────────────────────────┴───────────────┘
 Total params: 1,191 (4.65 KB)
 Trainable params: 1,191 (4.65 KB)
 Non-trainable params: 0 (0.00 B)
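The parameter counts in the summary follow from the layer widths: a Dense layer with n_in inputs and n_out outputs has n_in * n_out weights plus n_out biases. The NumPy sketch below rebuilds the same 2 → 50 → 20 → 1 layout with random stand-in weights (not the trained Keras weights) and pushes two points through it. The names sizes, layers, and forward are illustrative, and this is not how Keras implements the model internally.

```python
import numpy as np

rng = np.random.default_rng(0)

# layer widths: 2 inputs -> 50 -> 20 -> 1 output, as in the model above
sizes = [2, 50, 20, 1]

# one (weights, biases) pair per Dense layer; small random values stand in
# for the trained weights
layers = [(rng.normal(scale=0.1, size=(n_in, n_out)), np.zeros(n_out))
          for n_in, n_out in zip(sizes[:-1], sizes[1:])]

# each Dense layer has n_in * n_out weights plus n_out biases
param_counts = [W.size + b.size for W, b in layers]
print(param_counts, sum(param_counts))   # [150, 1020, 21] 1191


def forward(points):
    """Propagate an (N, 2) array through the untrained stand-in network."""
    a = points
    for W, b in layers[:-1]:
        a = np.maximum(a @ W + b, 0.0)          # ReLU on the hidden layers
    W, b = layers[-1]
    return 1.0 / (1.0 + np.exp(-(a @ W + b)))   # sigmoid on the output layer


out = forward(np.array([[-2.0, 2.0], [1.0, 0.0]]))
print(out.shape)   # (2, 1): one value between 0 and 1 per input point
```

The sigmoid on the last layer is what keeps the output between 0 and 1, which is why it can be read as a score for class 1.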

We seem to need a lot of epochs here to get a good result. In the run below, the accuracy is still creeping upward after 90 epochs.

epochs = 100
results = model.fit(x, v, batch_size=50, epochs=epochs, verbose=2)
Epoch 1/100
4/4 - 1s - 147ms/step - accuracy: 0.7450 - loss: 0.6313
Epoch 2/100
4/4 - 0s - 6ms/step - accuracy: 0.8650 - loss: 0.5835
Epoch 3/100
4/4 - 0s - 6ms/step - accuracy: 0.8700 - loss: 0.5510
Epoch 4/100
4/4 - 0s - 6ms/step - accuracy: 0.8700 - loss: 0.5253
Epoch 5/100
4/4 - 0s - 6ms/step - accuracy: 0.8600 - loss: 0.5018
Epoch 6/100
4/4 - 0s - 6ms/step - accuracy: 0.8500 - loss: 0.4807
Epoch 7/100
4/4 - 0s - 6ms/step - accuracy: 0.8550 - loss: 0.4612
Epoch 8/100
4/4 - 0s - 6ms/step - accuracy: 0.8500 - loss: 0.4424
Epoch 9/100
4/4 - 0s - 6ms/step - accuracy: 0.8500 - loss: 0.4250
Epoch 10/100
4/4 - 0s - 6ms/step - accuracy: 0.8500 - loss: 0.4073
Epoch 11/100
4/4 - 0s - 6ms/step - accuracy: 0.8550 - loss: 0.3904
Epoch 12/100
4/4 - 0s - 6ms/step - accuracy: 0.8500 - loss: 0.3751
Epoch 13/100
4/4 - 0s - 6ms/step - accuracy: 0.8500 - loss: 0.3618
Epoch 14/100
4/4 - 0s - 6ms/step - accuracy: 0.8500 - loss: 0.3501
Epoch 15/100
4/4 - 0s - 6ms/step - accuracy: 0.8500 - loss: 0.3393
Epoch 16/100
4/4 - 0s - 6ms/step - accuracy: 0.8550 - loss: 0.3303
Epoch 17/100
4/4 - 0s - 6ms/step - accuracy: 0.8500 - loss: 0.3231
Epoch 18/100
4/4 - 0s - 6ms/step - accuracy: 0.8650 - loss: 0.3141
Epoch 19/100
4/4 - 0s - 6ms/step - accuracy: 0.8600 - loss: 0.3078
Epoch 20/100
4/4 - 0s - 6ms/step - accuracy: 0.8700 - loss: 0.3015
Epoch 21/100
4/4 - 0s - 6ms/step - accuracy: 0.8700 - loss: 0.2961
Epoch 22/100
4/4 - 0s - 6ms/step - accuracy: 0.8650 - loss: 0.2918
Epoch 23/100
4/4 - 0s - 6ms/step - accuracy: 0.8700 - loss: 0.2861
Epoch 24/100
4/4 - 0s - 6ms/step - accuracy: 0.8650 - loss: 0.2817
Epoch 25/100
4/4 - 0s - 6ms/step - accuracy: 0.8700 - loss: 0.2782
Epoch 26/100
4/4 - 0s - 6ms/step - accuracy: 0.8700 - loss: 0.2740
Epoch 27/100
4/4 - 0s - 6ms/step - accuracy: 0.8700 - loss: 0.2703
Epoch 28/100
4/4 - 0s - 6ms/step - accuracy: 0.8700 - loss: 0.2678
Epoch 29/100
4/4 - 0s - 6ms/step - accuracy: 0.8650 - loss: 0.2643
Epoch 30/100
4/4 - 0s - 6ms/step - accuracy: 0.8750 - loss: 0.2612
Epoch 31/100
4/4 - 0s - 6ms/step - accuracy: 0.8750 - loss: 0.2577
Epoch 32/100
4/4 - 0s - 6ms/step - accuracy: 0.8750 - loss: 0.2545
Epoch 33/100
4/4 - 0s - 6ms/step - accuracy: 0.8750 - loss: 0.2515
Epoch 34/100
4/4 - 0s - 6ms/step - accuracy: 0.8750 - loss: 0.2496
Epoch 35/100
4/4 - 0s - 6ms/step - accuracy: 0.8750 - loss: 0.2458
Epoch 36/100
4/4 - 0s - 6ms/step - accuracy: 0.8750 - loss: 0.2444
Epoch 37/100
4/4 - 0s - 6ms/step - accuracy: 0.8800 - loss: 0.2417
Epoch 38/100
4/4 - 0s - 6ms/step - accuracy: 0.8800 - loss: 0.2398
Epoch 39/100
4/4 - 0s - 6ms/step - accuracy: 0.8750 - loss: 0.2363
Epoch 40/100
4/4 - 0s - 6ms/step - accuracy: 0.8750 - loss: 0.2349
Epoch 41/100
4/4 - 0s - 6ms/step - accuracy: 0.8750 - loss: 0.2327
Epoch 42/100
4/4 - 0s - 6ms/step - accuracy: 0.8850 - loss: 0.2293
Epoch 43/100
4/4 - 0s - 6ms/step - accuracy: 0.8950 - loss: 0.2272
Epoch 44/100
4/4 - 0s - 6ms/step - accuracy: 0.8950 - loss: 0.2254
Epoch 45/100
4/4 - 0s - 6ms/step - accuracy: 0.8900 - loss: 0.2241
Epoch 46/100
4/4 - 0s - 6ms/step - accuracy: 0.8950 - loss: 0.2207
Epoch 47/100
4/4 - 0s - 6ms/step - accuracy: 0.8950 - loss: 0.2191
Epoch 48/100
4/4 - 0s - 6ms/step - accuracy: 0.9000 - loss: 0.2172
Epoch 49/100
4/4 - 0s - 6ms/step - accuracy: 0.8950 - loss: 0.2157
Epoch 50/100
4/4 - 0s - 6ms/step - accuracy: 0.9000 - loss: 0.2130
Epoch 51/100
4/4 - 0s - 6ms/step - accuracy: 0.9050 - loss: 0.2108
Epoch 52/100
4/4 - 0s - 6ms/step - accuracy: 0.9050 - loss: 0.2099
Epoch 53/100
4/4 - 0s - 6ms/step - accuracy: 0.9000 - loss: 0.2069
Epoch 54/100
4/4 - 0s - 6ms/step - accuracy: 0.9050 - loss: 0.2067
Epoch 55/100
4/4 - 0s - 6ms/step - accuracy: 0.9100 - loss: 0.2037
Epoch 56/100
4/4 - 0s - 6ms/step - accuracy: 0.9050 - loss: 0.2034
Epoch 57/100
4/4 - 0s - 6ms/step - accuracy: 0.9100 - loss: 0.2029
Epoch 58/100
4/4 - 0s - 6ms/step - accuracy: 0.9100 - loss: 0.1989
Epoch 59/100
4/4 - 0s - 6ms/step - accuracy: 0.9100 - loss: 0.1976
Epoch 60/100
4/4 - 0s - 6ms/step - accuracy: 0.9100 - loss: 0.1963
Epoch 61/100
4/4 - 0s - 6ms/step - accuracy: 0.9100 - loss: 0.1962
Epoch 62/100
4/4 - 0s - 6ms/step - accuracy: 0.9100 - loss: 0.1936
Epoch 63/100
4/4 - 0s - 6ms/step - accuracy: 0.9050 - loss: 0.1932
Epoch 64/100
4/4 - 0s - 6ms/step - accuracy: 0.9100 - loss: 0.1916
Epoch 65/100
4/4 - 0s - 6ms/step - accuracy: 0.9100 - loss: 0.1904
Epoch 66/100
4/4 - 0s - 6ms/step - accuracy: 0.9100 - loss: 0.1875
Epoch 67/100
4/4 - 0s - 6ms/step - accuracy: 0.9100 - loss: 0.1870
Epoch 68/100
4/4 - 0s - 6ms/step - accuracy: 0.9100 - loss: 0.1881
Epoch 69/100
4/4 - 0s - 6ms/step - accuracy: 0.9100 - loss: 0.1846
Epoch 70/100
4/4 - 0s - 6ms/step - accuracy: 0.9100 - loss: 0.1837
Epoch 71/100
4/4 - 0s - 6ms/step - accuracy: 0.9100 - loss: 0.1811
Epoch 72/100
4/4 - 0s - 6ms/step - accuracy: 0.9050 - loss: 0.1808
Epoch 73/100
4/4 - 0s - 6ms/step - accuracy: 0.9100 - loss: 0.1787
Epoch 74/100
4/4 - 0s - 6ms/step - accuracy: 0.9000 - loss: 0.1801
Epoch 75/100
4/4 - 0s - 6ms/step - accuracy: 0.9100 - loss: 0.1761
Epoch 76/100
4/4 - 0s - 6ms/step - accuracy: 0.9100 - loss: 0.1738
Epoch 77/100
4/4 - 0s - 6ms/step - accuracy: 0.9100 - loss: 0.1738
Epoch 78/100
4/4 - 0s - 6ms/step - accuracy: 0.9050 - loss: 0.1729
Epoch 79/100
4/4 - 0s - 6ms/step - accuracy: 0.9200 - loss: 0.1704
Epoch 80/100
4/4 - 0s - 6ms/step - accuracy: 0.9150 - loss: 0.1704
Epoch 81/100
4/4 - 0s - 6ms/step - accuracy: 0.9100 - loss: 0.1704
Epoch 82/100
4/4 - 0s - 6ms/step - accuracy: 0.9200 - loss: 0.1667
Epoch 83/100
4/4 - 0s - 6ms/step - accuracy: 0.9200 - loss: 0.1647
Epoch 84/100
4/4 - 0s - 6ms/step - accuracy: 0.9250 - loss: 0.1632
Epoch 85/100
4/4 - 0s - 6ms/step - accuracy: 0.9200 - loss: 0.1621
Epoch 86/100
4/4 - 0s - 6ms/step - accuracy: 0.9200 - loss: 0.1618
Epoch 87/100
4/4 - 0s - 6ms/step - accuracy: 0.9250 - loss: 0.1622
Epoch 88/100
4/4 - 0s - 6ms/step - accuracy: 0.9200 - loss: 0.1580
Epoch 89/100
4/4 - 0s - 6ms/step - accuracy: 0.9300 - loss: 0.1568
Epoch 90/100
4/4 - 0s - 6ms/step - accuracy: 0.9200 - loss: 0.1548
Epoch 91/100
4/4 - 0s - 6ms/step - accuracy: 0.9200 - loss: 0.1549
Epoch 92/100
4/4 - 0s - 6ms/step - accuracy: 0.9300 - loss: 0.1553
Epoch 93/100
4/4 - 0s - 6ms/step - accuracy: 0.9300 - loss: 0.1506
Epoch 94/100
4/4 - 0s - 6ms/step - accuracy: 0.9300 - loss: 0.1501
Epoch 95/100
4/4 - 0s - 6ms/step - accuracy: 0.9300 - loss: 0.1494
Epoch 96/100
4/4 - 0s - 6ms/step - accuracy: 0.9350 - loss: 0.1496
Epoch 97/100
4/4 - 0s - 6ms/step - accuracy: 0.9350 - loss: 0.1462
Epoch 98/100
4/4 - 0s - 6ms/step - accuracy: 0.9350 - loss: 0.1440
Epoch 99/100
4/4 - 0s - 6ms/step - accuracy: 0.9300 - loss: 0.1439
Epoch 100/100
4/4 - 0s - 6ms/step - accuracy: 0.9300 - loss: 0.1423
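One way to judge how many epochs are needed is to look at the per-epoch metrics rather than the printed log. The object returned by model.fit has a history attribute, a dictionary mapping each metric name to a list with one value per epoch. The sketch below uses a hand-copied stand-in for that dictionary (the first five epochs of the log above) and a hypothetical helper, first_epoch_reaching, that is not part of Keras.

```python
def first_epoch_reaching(values, threshold):
    """Return the 1-based epoch at which values first reaches threshold, else None."""
    for epoch, value in enumerate(values, start=1):
        if value >= threshold:
            return epoch
    return None


# stand-in for results.history, copied from the first five epochs of the log
history = {
    "accuracy": [0.7450, 0.8650, 0.8700, 0.8700, 0.8600],
    "loss": [0.6313, 0.5835, 0.5510, 0.5253, 0.5018],
}

print(first_epoch_reaching(history["accuracy"], 0.85))  # 2
print(first_epoch_reaching(history["accuracy"], 0.93))  # None
```

Applied to the full results.history["accuracy"] from the run above, the same helper would report epoch 89 as the first to reach 0.93.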
score = model.evaluate(x, v, verbose=0)
print(f"score = {score[0]}")
print(f"accuracy = {score[1]}")
score = 0.13965898752212524
accuracy = 0.9350000023841858
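The two numbers returned by model.evaluate are the loss followed by the metric requested at compile time, so score[0] is the binary cross-entropy and score[1] is the accuracy. The NumPy sketch below shows how those two quantities can be computed from labels and predicted probabilities. The function names and the small clipping constant eps are assumptions for illustration, and the toy arrays are made up; this is not Keras's exact implementation.

```python
import numpy as np


def binary_crossentropy(labels, probs, eps=1e-7):
    """Mean of -[y log p + (1 - y) log(1 - p)] over all samples."""
    probs = np.clip(probs, eps, 1.0 - eps)   # avoid log(0)
    terms = labels * np.log(probs) + (1 - labels) * np.log(1 - probs)
    return float(-np.mean(terms))


def accuracy(labels, probs):
    """Fraction of samples whose thresholded prediction matches the label."""
    return float(np.mean((probs > 0.5).astype(int) == labels))


# made-up labels and predicted probabilities for four points
labels = np.array([0, 1, 1, 0])
probs = np.array([0.1, 0.8, 0.4, 0.3])

loss = binary_crossentropy(labels, probs)
acc = accuracy(labels, probs)
print(loss)
print(acc)   # 0.75: the third point is predicted as 0 but labeled 1
```

The loss penalizes confident wrong answers more heavily than the accuracy does, which is why the loss can keep falling while the accuracy plateaus.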

Let’s look at a prediction. The predict method expects an array of shape (N, 2), where N is the number of points, so a single point is passed in as an array of shape (1, 2)

res = model.predict(np.array([[-2, 2]]))
res
1/1 ━━━━━━━━━━━━━━━━━━━━ 0s 30ms/step
array([[6.0088795e-10]], dtype=float32)

We see that we get a floating point number between 0 and 1, the output of the sigmoid in the final layer. We will need to convert this to 0 or 1 by rounding.
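As a small illustration of that rounding step, the sketch below thresholds the network output at 0.5. The helper name to_label and its threshold parameter are hypothetical; the first input is the value printed above.

```python
import numpy as np


def to_label(prob, threshold=0.5):
    """Map outputs between 0 and 1 to integer class labels 0 or 1."""
    return (np.asarray(prob) > threshold).astype(int)


# the prediction for the point (-2, 2) shown above
res = np.array([[6.0088795e-10]], dtype=np.float32)
print(to_label(res))                # [[0]]

# a value of exactly 0.5 is assigned to class 0 with this convention
print(to_label([0.2, 0.5, 0.93]))   # [0 0 1]
```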

Let’s plot the partitioning

M = 128
N = 128

xmin = -1.75
xmax = 2.5
ymin = -1.25
ymax = 1.75

xpt = np.linspace(xmin, xmax, M)
ypt = np.linspace(ymin, ymax, N)

To make the prediction go faster, we want to feed in all of the grid points at once, as a single array of (x, y) pairs covering every combination of xpt and ypt, of the form:

[[xpt[0], ypt[0]],
 [xpt[0], ypt[1]],
 ...
]

We can see that this packs them into the vector

pairs = np.array(np.meshgrid(xpt, ypt)).T.reshape(-1, 2)
pairs[0]
array([-1.75, -1.25])
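To see the ordering that this meshgrid expression produces, the sketch below repeats it on a tiny grid. The arrays xs and ys and the stand-in function x + y are illustrative only; the point is that the y index varies fastest, so reshaping a per-pair result to (M, N) puts the value for (xs[i], ys[j]) at index [i, j].

```python
import numpy as np

xs = np.array([0.0, 1.0, 2.0])   # M = 3 x-coordinates
ys = np.array([10.0, 20.0])      # N = 2 y-coordinates

# same packing expression as above, on the small grid
pairs = np.array(np.meshgrid(xs, ys)).T.reshape(-1, 2)
print(pairs)
# [[ 0. 10.]
#  [ 0. 20.]
#  [ 1. 10.]
#  [ 1. 20.]
#  [ 2. 10.]
#  [ 2. 20.]]

# stand-in for model.predict: one value per (x, y) pair
vals = pairs[:, 0] + pairs[:, 1]

# reshape back to the grid: vals[i, j] corresponds to (xs[i], ys[j])
vals = vals.reshape(len(xs), len(ys))
print(vals[2, 1])   # 22.0, i.e. xs[2] + ys[1]
```

Because the first index runs over x, an image plotted with imshow, which treats the first index as the vertical axis, needs the transpose of such an array.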

Now we do the prediction. We will get a vector out, which we reshape to match the original domain.

res = model.predict(pairs, verbose=0)
res.shape = (M, N)

Finally, round to 0 or 1

domain = np.where(res > 0.5, 1, 0)

and we can plot the data

fig, ax = plt.subplots()
ax.imshow(domain.T, origin="lower",
          extent=[xmin, xmax, ymin, ymax], alpha=0.25)
xpt = [q[0] for q in x]
ypt = [q[1] for q in x]

ax.scatter(xpt, ypt, s=40, c=v, cmap="viridis")
../_images/1c703738594acb31983dbc3baf3225bab529351ebf2f007fa43f89a62c8c5d69.png