This repository contains a complete, end-to-end implementation of a fully unsupervised pipeline that clusters facial behavior (lip openness, eye openness, pupil gaze direction, and head pose) directly from unlabeled face images.
The pipeline was developed and validated on a subset of AffectNet-YOLO (25 266 faces at 96×96 px) and cross-validated on FER (48×48 px). It yields semantically meaningful clusters (e.g. closed vs. open mouth, eye openness levels, five pupil‐gaze directions, three head orientations) without any labels or pretrained embeddings.
├── README.md
├── shape_predictor_68_face_landmarks.dat # dlib pretrained model (not included here; download separately)
├── data/
│ ├── AFFECTNET-YOLO/ # Unlabeled images used for training (96×96 px)
│ └── FER/ # (Optional) 48×48 px images for cross-dataset validation
├── outputs/ # All intermediate and final results will be saved here
│ ├── Alignment/ # Nose-aligned images
│ ├── HeadAngleDisp_Images/ # Annotated images for angle & displacement
│ ├── Head_Clusters/ # Head-pose clustering results (plots, CSVs, cluster folders)
│ ├── LipClustering1/ # Lip ratio clustering (annotated images & plots)
│ ├── Eye_Emo/ # Eye ratio extraction & clustering
│ ├── Pupil_Annotated/ # Pupil detection & heuristic labeling
│ └── Pupil_Clusters/ # Pupil clustering results (plots, CSVs)
├── LipClusteringForSubfolderGreyImageSSHyper.py # Lip ratio extraction + clustering script
├── EyeClusteringForSubfolderGreyImageSSHyper.py # Eye ratio extraction + clustering script
├── PupilMoveClassification.py # Pupil localization & K-Means clustering
├── HeadNoseAlignment.py # Compute average nose tip → align images
├── HeadAngleDisp.py # Compute eye-line displacement & angle for each image
├── HeadClusteringAngleDisp.py # Perform K-Means on head pose features (angle, displacement)
└── requirements.txt # Python dependencies
pip install -r requirements.txt
The main dependencies are:
- dlib (v19.22 or compatible) – face detection & 68-point landmarks
- opencv-python (v4.5) – image I/O & processing
- numpy (v1.21) – numerical computations
- pandas (v1.3) – CSV handling
- scikit-learn (v1.0) – K-Means, StandardScaler, clustering metrics
- matplotlib (v3.4) – plotting
- seaborn (v0.11) – optional, used in head clustering plots

You also need the pretrained landmark model shape_predictor_68_face_landmarks.dat in this top-level directory. Download it from:
http://dlib.net/files/shape_predictor_68_face_landmarks.dat.bz2
and decompress it (it is bz2-compressed) so that shape_predictor_68_face_landmarks.dat is available.
- Place your unlabeled training images in data/AFFECTNET-YOLO/, preserving any subfolder structure.
- (Optional) Place 48×48 px validation images in data/FER/.
- All intermediate and final results are written to outputs/ (see Output & Folder Structure for details).

At a high level, the pipeline consists of three stages:

1. Alignment: translate every face so its nose tip matches the dataset-wide average nose tip.
2. Descriptor extraction: four geometric descriptors per face: the lip ratio r_lip, the eye ratio r_eye, the normalized pupil position (nx, ny), and the head-pose pair (d, Δθ).
3. Clustering: K-Means on each descriptor, with k chosen via the elbow method.

Below is a detailed breakdown of each stage.
Goal: Translate every face such that its nose tip aligns with the dataset-wide average nose-tip coordinate. This removes global translation, so that subsequent head-pose features only capture rotation & displacement relative to this global center.
HeadNoseAlignment.py
- Iterates over data/AFFECTNET-YOLO/ and detects the nose tip (landmark #34 in dlib’s 0-indexed 68-point model) for each face.
- Accumulates the (x, y) nose-tip coordinates across all images to compute:
avg_nose_tip = ( Σ p34_i ) / N
where p34_i = (xi, yi). For each image, the translation offsets are:
dx = avg_nose_tip.x − xi
dy = avg_nose_tip.y − yi
M = [[1, 0, dx],
[0, 1, dy]]
repositioned = warpAffine(image, M)
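The two-pass alignment above can be sketched in NumPy. This is a minimal stand-in that assumes the nose-tip landmarks are already extracted and replaces cv2.warpAffine with pure-NumPy slicing for the integer translation; the real script uses dlib and OpenCV as described.

```python
import numpy as np

def align_to_average_nose(images, nose_tips):
    """Translate each image so its nose tip lands on the dataset-average
    nose tip (a pure-NumPy equivalent of the warpAffine translation)."""
    tips = np.asarray(nose_tips, dtype=float)
    avg = tips.mean(axis=0)                  # avg_nose_tip = (sum p34_i) / N
    aligned = []
    for img, (x, y) in zip(images, tips):
        dx = int(round(avg[0] - x))          # dx = avg_nose_tip.x - xi
        dy = int(round(avg[1] - y))          # dy = avg_nose_tip.y - yi
        h, w = img.shape[:2]
        out = np.zeros_like(img)             # borders are zero-filled
        # destination window for the shifted copy, clipped to the frame
        y0, y1 = max(dy, 0), min(h, h + dy)
        x0, x1 = max(dx, 0), min(w, w + dx)
        if y1 > y0 and x1 > x0:
            out[y0:y1, x0:x1] = img[y0 - dy:y1 - dy, x0 - dx:x1 - dx]
        aligned.append(out)
    return aligned
```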
- Saves each translated image to outputs/Alignment/.
- outputs/Alignment/ contains all nose-aligned images, preserving original filenames (overwritten if they already exist).
- The script also reports the dataset-wide Average Nose Tip Coordinate (x, y): ….

After nose alignment, we extract four low-dimensional geometric descriptors from each aligned face:
LipClusteringForSubfolderGreyImageSSHyper.py
- Iterates over data/AFFECTNET-YOLO/ (or any input folder) and uses dlib to detect the first face and its 68 landmarks.
- Extracts the inner-lip landmarks:
  P61 = (landmarks.part(61).x, landmarks.part(61).y)
  P65 = (landmarks.part(65).x, landmarks.part(65).y)
  P63 = (landmarks.part(63).x, landmarks.part(63).y)
  P67 = (landmarks.part(67).x, landmarks.part(67).y)
- Computes:
  width = ||P61 − P65||
  height = ||P63 − P67||
  ratio = height / width
- Annotates the first 50 images with “Length: <width> Height: <height> Ratio: <ratio>” and saves them under outputs/LipClustering1/output/.
- Collects all ratio values in a NumPy array, then:
  - Scales them with StandardScaler.
  - Runs K-Means (n_clusters=5, or user-adjustable) on the scaled lip ratios.
  - Writes outputs/LipClustering1/results.txt (one line per image: Image: <path>, Cluster: <id>).
  - Saves plots under outputs/LipClustering1/plots/:
outputs/LipClustering1/
├── output/ # Annotated images (first 50 only)
├── plots/
│ ├── cluster_visualization.png
│ └── ratios_histogram.png
└── results.txt # One line per processed image with cluster ID
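The width/height/ratio computation can be isolated as a small function. A sketch, assuming `landmarks` is anything indexable by dlib's 0-based part numbers and yielding (x, y) tuples:

```python
import math

def lip_ratio(landmarks):
    """Inner-lip openness ratio, as computed by the lip clustering script."""
    p61, p65 = landmarks[61], landmarks[65]   # inner-lip left/right corners
    p63, p67 = landmarks[63], landmarks[67]   # inner-lip top/bottom points
    width = math.dist(p61, p65)               # width  = ||P61 - P65||
    height = math.dist(p63, p67)              # height = ||P63 - P67||
    return height / width                     # ratio  = height / width
```

For example, with a mouth opening half as tall as it is wide, the ratio is 0.5.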
Descriptor:

\[
r_{\mathrm{eye}} = \frac{\lVert P_{37} - P_{40} \rVert}{\left\lVert \dfrac{P_{38} + P_{39}}{2} - \dfrac{P_{41} + P_{42}}{2} \right\rVert}
\]

where the numerator is the horizontal distance between the outer eye corners and the denominator is the vertical distance between the eyelid midpoints. The ratio is inversely related to eye openness: a wider-open eye yields a smaller ratio.
EyeClusteringForSubfolderGreyImageSSHyper.py
- Iterates over data/AFFECTNET-YOLO/ (or any input folder), reads each image, and converts it to grayscale.
- Extracts the eye landmarks (0-indexed dlib parts):
  P37 = landmarks.part(36), P40 = landmarks.part(39),
  P38 = landmarks.part(37), P39 = landmarks.part(38),
  P41 = landmarks.part(40), P42 = landmarks.part(41).
- Computes:
  width = ||P37 − P40||
midpoint_top = (P38 + P39)/2
midpoint_bot = (P41 + P42)/2
height = ||midpoint_top − midpoint_bot||
ratio = width / height
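The descriptor above can be sketched as a small helper (hypothetical function name; `landmarks` is indexable by dlib's 0-based part numbers and yields (x, y) tuples):

```python
import math

def eye_ratio(landmarks):
    """Eye width/height ratio (r_eye) for one eye, per the steps above."""
    p37, p40 = landmarks[36], landmarks[39]   # eye corners (parts 36, 39)
    p38, p39 = landmarks[37], landmarks[38]   # upper-eyelid points
    p41, p42 = landmarks[40], landmarks[41]   # lower-eyelid points
    width = math.dist(p37, p40)
    mid_top = ((p38[0] + p39[0]) / 2, (p38[1] + p39[1]) / 2)
    mid_bot = ((p41[0] + p42[0]) / 2, (p41[1] + p42[1]) / 2)
    height = math.dist(mid_top, mid_bot)
    return width / height                     # ratio = width / height
```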
- Writes <Image Name, Height, Width, Ratio> rows to a CSV: outputs/Eye_Emo/Eye_ExpFull_landmark_ratios.csv.
- Standardizes the Ratio column (via StandardScaler), then:
  - Runs the elbow method and saves outputs/Eye_Emo/elbow_method_plot.png.
  - Runs the final K-Means with n_clusters=4 (optimal per the elbow plot).
  - Appends a Cluster column to the CSV and saves it as clustered_data_ratio.csv.
  - Creates outputs/Eye_Emo/Cluster_0/…/Cluster_3/ and moves each annotated image into its cluster folder.
- Saves plots under outputs/Eye_Emo/:
  - clusters_plot_ratio.png: scatter (index vs. Ratio) colored by K-Means cluster.
  - clusters_ratio_vs_mean_plot.png: ratio distribution per cluster with red centroids.

outputs/Eye_Emo/
├── Eye_ExpFull_landmark_ratios.csv # Raw ratios CSV
├── elbow_method_plot.png
├── clustered_data_ratio.csv # Ratios + cluster labels
├── Cluster_0/, Cluster_1/, Cluster_2/, Cluster_3/ # images with landmarks overlaid
├── clusters_plot_ratio.png
└── clusters_ratio_vs_mean_plot.png
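The standardize → elbow → final K-Means flow shared by the clustering scripts can be sketched with scikit-learn; the ratios here are synthetic stand-ins for the CSV's Ratio column:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

def elbow_inertias(values, k_max=8, seed=42):
    """K-Means inertia for k = 1..k_max on standardized 1-D values;
    the 'elbow' in this curve suggests the final n_clusters."""
    X = StandardScaler().fit_transform(np.asarray(values, float).reshape(-1, 1))
    return [KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X).inertia_
            for k in range(1, k_max + 1)]

# synthetic ratios with 4 well-separated groups (illustrative only)
rng = np.random.default_rng(0)
ratios = np.concatenate([rng.normal(m, 0.05, 50) for m in (1.0, 2.0, 3.0, 4.0)])
inertias = elbow_inertias(ratios)     # plot these to get the elbow figure

# final clustering at the chosen k (4, as in the eye script)
X = StandardScaler().fit_transform(ratios.reshape(-1, 1))
labels = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(X)
```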
Descriptor: the normalized pupil position (nx, ny) within the eye bounding box [0,1]×[0,1], mapped into five regions:

- center (dead zone around (0.5, 0.5)).
- left if nx < 0.5, else right.
- top if ny < 0.5, else bottom.

We collect all (nx, ny) pairs, standardize them (mean = 0, var = 1), and run K-Means with n_clusters=5, selected via the elbow method or domain knowledge.

PupilMoveClassification.py

- Iterates over data/AFFECTNET-YOLO/ and detects faces and 68 landmarks for the first face per image.
- Computes the eye bounding box (x, y, w, h) via cv2.boundingRect(...).
- Thresholds the eye crop with THRESH_BINARY_INV at T=30 and locates the pupil centroid (cx, cy) in the crop.
- Normalizes to (nx, ny) and applies the heuristic (dead-zone margin 0.1): center inside the dead zone, otherwise left/right vs. top/bottom.
- Draws the label (center, left, right, top, bottom) above the eye and saves the annotated image to outputs/Pupil_Annotated/.
- Records one entry per image:
{
image_path, annotated_path,
label, norm_x, norm_y,
eye_x, eye_y, eye_w, eye_h,
pupil_x, pupil_y
}
- Collects all entries into results_df and saves it to outputs/Pupil_Annotated/clustering_results.csv.
- Saves plots under outputs/Pupil_Clusters/:
  - cluster_distribution.png: bar chart of the five heuristic label counts.
  - pupil_scatter.png: scatter of (norm_x, norm_y) color-coded by heuristic label.
- Runs K-Means on [(norm_x, norm_y)] with n_clusters=5:
  - Appends a kmeans_label column.
  - Re-saves the CSV (now including kmeans_label) as clustering_results.csv (overwriting).
  - Saves kmeans_pupil_scatter.png: scatter of (norm_x, norm_y) color-coded by K-Means cluster.

outputs/Pupil_Annotated/
├── <annotated images>.jpg
├── clustering_results.csv
└── outputs/Pupil_Clusters/
├── cluster_distribution.png
├── pupil_scatter.png
└── kmeans_pupil_scatter.png
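The five-region heuristic can be sketched as a small function. The dominant-axis tie-break between left/right and top/bottom is an assumption; the script's exact precedence may differ:

```python
def pupil_region(nx, ny, margin=0.1):
    """Map a normalized pupil position (nx, ny) in [0,1]x[0,1] to one of
    center / left / right / top / bottom (margin = dead-zone half-width)."""
    dx, dy = nx - 0.5, ny - 0.5
    if abs(dx) <= margin and abs(dy) <= margin:
        return "center"
    # assumption: the axis with the larger offset decides the label
    if abs(dx) >= abs(dy):
        return "left" if nx < 0.5 else "right"
    return "top" if ny < 0.5 else "bottom"
```

For example, `pupil_region(0.2, 0.5)` returns "left".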
We encode head pose with two features:

- Displacement d: the Euclidean distance between the midpoint of the actual eye line (p37 → p46) and the midpoint of the reference eye line (avg_37 → avg_46).
- Angle difference Δθ: between the per-image eye-line angle, angle_i = atan2(y46 − y37, x46 − x37), and the reference angle, angle_ref = atan2(ȳ46 − ȳ37, x̄46 − x̄37).

HeadAngleDisp.py
- Pass 1: iterates over data/AFFECTNET-YOLO/ to accumulate all p37 and p46 across detected faces, then computes
  avg_37 = Σ p37_i / N, avg_46 = Σ p46_i / N.
- Pass 2, for each image with p37 = (x37, y37) and p46 = (x46, y46):
  - actual_mid = ((x37 + x46)/2, (y37 + y46)/2).
  - ref_mid = ((avg_37.x + avg_46.x)/2, (avg_37.y + avg_46.y)/2).
  - displacement = ||actual_mid − ref_mid||.
  - angle_i = atan2(y46 − y37, x46 − x37); angle_ref = atan2(avg_46.y − avg_37.y, avg_46.x − avg_37.x).
  - angle_diff = (angle_i − angle_ref) × (180/π).
  - Draws the reference eye line (avg_37 → avg_46) and the actual eye line (p37 → p46) on the image.
  - Saves the annotated image to outputs/HeadAngleDisp_Images/.
  - Appends [filename, angle_diff, displacement] to a list, written at the end to outputs/HeadAngleDisp_Images/colds_AngleDisp_Data.csv
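The per-image angle/displacement computation reduces to a few lines; a sketch with a hypothetical helper taking plain (x, y) tuples:

```python
import math

def head_pose_features(p37, p46, avg37, avg46):
    """Return (angle_diff in degrees, displacement in pixels) for one face,
    given the eye-corner landmarks and their dataset-wide averages."""
    actual_mid = ((p37[0] + p46[0]) / 2, (p37[1] + p46[1]) / 2)
    ref_mid = ((avg37[0] + avg46[0]) / 2, (avg37[1] + avg46[1]) / 2)
    displacement = math.dist(actual_mid, ref_mid)
    angle_i = math.atan2(p46[1] - p37[1], p46[0] - p37[0])
    angle_ref = math.atan2(avg46[1] - avg37[1], avg46[0] - avg37[0])
    return math.degrees(angle_i - angle_ref), displacement
```

An eye line identical to the reference gives (0.0, 0.0); a 90°-rotated one gives an angle_diff of 90.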
HeadClusteringAngleDisp.py
- Loads outputs/HeadAngleDisp_Images/colds_AngleDisp_Data.csv.
- Builds three feature sets:
  X_angle = df[['Angle (degrees)']]
  X_displacement = df[['Displacement (pixels)']]
  X_combined = df[['Angle (degrees)', 'Displacement (pixels)']]
- Standardizes each with StandardScaler().fit_transform(...).
- Creates cluster folders:
  outputs/Head_Clusters/Angle/Cluster_0…Cluster_2
  outputs/Head_Clusters/Displacement/Cluster_0…Cluster_2
  outputs/Head_Clusters/Angle_Displacement/Cluster_0…Cluster_4
- Runs K-Means with n_clusters=3 for each single feature (angle only, displacement only) and n_clusters=5 for the combined (angle + displacement) features, using random_state=42, n_init=10, max_iter=500.
- Appends the cluster labels to the DataFrame (e.g. df['Angle Cluster']).
- Copies each image from data/AFFECTNET-YOLO/ into its corresponding cluster folder.
- Saves per-feature scatter plots (index vs. value) colored by cluster, plus a combined (angle, displacement) scatter colored by cluster.
- Writes:
  outputs/Head_Clusters/KMeans_Clustering_Results.csv
  outputs/Head_Clusters/Clustering_Evaluation_Metrics.csv
outputs/Head_Clusters/
├── Angle/
│ ├── Cluster_0/, Cluster_1/, Cluster_2/ # aligned images in each angle cluster
│ └── KMeans_Clustering_Angle.png
├── Displacement/
│ └── similarly structured
├── Angle_Displacement/
│ └── Cluster_0…Cluster_4, plots, CSV subsets
├── KMeans_Clustering_Results.csv # combined clusters
└── Clustering_Evaluation_Metrics.csv
After you run all scripts in sequence (alignment → descriptor extraction → clustering), you’ll see the following top-level outputs/ structure:
outputs/
├── Alignment/ # Nose-aligned images
│ └── *.jpg, *.png, …
├── HeadAngleDisp_Images/ # Head pose annotation & data
│ ├── <annotated images>.jpg
│ └── colds_AngleDisp_Data.csv
├── Head_Clusters/
│ ├── Angle/
│ │ ├── Cluster_0/, Cluster_1/, Cluster_2/
│ │ └── KMeans_Clustering_Angle.png
│ ├── Displacement/
│ │ ├── Cluster_0/, Cluster_1/, Cluster_2/
│ │ └── KMeans_Clustering_Displacement.png
│ ├── Angle_Displacement/
│ │ ├── Cluster_0/…/Cluster_4/
│ │ └── KMeans_Clustering_Angle_Displacement.png
│ ├── KMeans_Clustering_Results.csv
│ └── Clustering_Evaluation_Metrics.csv
├── LipClustering1/
│ ├── output/ # first 50 lip‐annotated images
│ ├── plots/
│ │ ├── cluster_visualization.png
│ │ └── ratios_histogram.png
│ └── results.txt
├── Eye_Emo/
│ ├── Eye_ExpFull_landmark_ratios.csv # raw height,width,ratio
│ ├── elbow_method_plot.png
│ ├── clustered_data_ratio.csv # ratio + cluster
│ ├── Cluster_0/…/Cluster_3/ # images with landmarks overlaid
│ ├── clusters_plot_ratio.png
│ └── clusters_ratio_vs_mean_plot.png
└── Pupil_Annotated/
├── <annotated images>.jpg
├── clustering_results.csv # “heuristic + kmeans” labels
└── Pupil_Clusters/
├── cluster_distribution.png
├── pupil_scatter.png
└── kmeans_pupil_scatter.png
Every script creates and populates its respective subfolder under outputs/. You can safely delete or re-run them in any order, but a recommended execution sequence is:
1. HeadNoseAlignment.py → outputs/Alignment/
2. HeadAngleDisp.py → outputs/HeadAngleDisp_Images/
3. HeadClusteringAngleDisp.py → outputs/Head_Clusters/
4. LipClusteringForSubfolderGreyImageSSHyper.py → outputs/LipClustering1/
5. EyeClusteringForSubfolderGreyImageSSHyper.py → outputs/Eye_Emo/
6. PupilMoveClassification.py → outputs/Pupil_Annotated/ and outputs/Pupil_Clusters/

For every clustering module (lips, eyes, pupils, head), the following metrics are computed and/or available:
| Module | k clusters | Silhouette | Calinski–Harabasz | Davies–Bouldin |
|---|---|---|---|---|
| Lips | 4 | 0.5781 | 76 789.41 | 0.5781 |
| Eyes | 4 | 0.6143 | 34 515.84 | 0.4879 |
| Pupils | 5 | 0.6500 (≈) | 18 000 (≈) | 0.5200 (≈) |
| Head | 3 (angle) | 0.5600 (≈) | 30 658 | 0.5600 (≈) |
| | 3 (disp) | 0.5500 (≈) | 37 520 | 0.5600 (≈) |
| | 5 (comb) | 0.3400 (≈) | 10 288 | 0.8700 (≈) |
Cross-dataset validation (AffectNet-YOLO vs. FER):

| Module | Dataset | Silhouette | Calinski–Harabasz | Davies–Bouldin |
|---|---|---|---|---|
| Lips | AffectNet-YOLO | 0.5781 | 76 789.41 | 0.5781 |
| | FER | 0.7067 | 20 988.36 | 0.4588 |
| Eyes | AffectNet-YOLO | 0.6143 | 34 515.84 | 0.4879 |
| | FER | 0.6020 | 5 587.74 | 0.5289 |
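The three metrics in the tables come straight from scikit-learn. A sketch on synthetic data (the values will not match the tables, which were computed on the real descriptors):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import (calinski_harabasz_score, davies_bouldin_score,
                             silhouette_score)

# synthetic 1-D "ratio" data with 4 clearly separated groups
rng = np.random.default_rng(0)
X = np.concatenate([rng.normal(m, 0.05, 100) for m in (1, 2, 3, 4)]).reshape(-1, 1)
labels = KMeans(n_clusters=4, n_init=10, random_state=42).fit_predict(X)

metrics = {
    "silhouette": silhouette_score(X, labels),                # in [-1, 1], higher is better
    "calinski_harabasz": calinski_harabasz_score(X, labels),  # higher is better
    "davies_bouldin": davies_bouldin_score(X, labels),        # lower is better
}
```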
Advanced Landmark Detectors
Replace dlib’s 68-point with a contrastive/self-supervised keypoint extractor [6, 7] to improve robustness under occlusion & extreme pose.
Temporal Dynamics
Extend to video streams: spatio-temporal clustering of descriptor trajectories → model micro-expressions, blink patterns, head-motion sequences.
Multimodal Fusion
Fuse depth, infrared, or thermal modalities to mitigate low-light & privacy‐sensitive failure modes.
Semi-/Weakly-Supervised Refinement
Incorporate a small set of annotations to map clusters to emotion or gaze labels automatically while retaining overall unsupervised learning benefits.
On-Device Optimization
Prune/quantize dlib model, use MiniBatch-KMeans, and optimize pre/postprocessing for edge/mobile real-time applications (assistive HCI, on-device biometrics).