Using OpenCV and Neural Networks for hand landmarking with Google MediaPipe
Leo Gao
This page walks through the process of taking images from the raspicam and recognizing hand gestures in them with MediaPipe and a pre-trained neural-network classifier.
Summary
First, make sure the node is subscribed to the camera topic.
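A minimal sketch of that subscription, assuming the raspicam publishes sensor_msgs/CompressedImage; the topic name below is an assumption, so check yours with `rostopic list`:

import rospy
from sensor_msgs.msg import CompressedImage

# Inside your node's __init__ (topic name is an assumption):
self.cv_image = None
self.image_sub = rospy.Subscriber(
    "/raspicam_node/image/compressed",
    CompressedImage,
    self.image_cb,
    queue_size=1,
)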
Once images are arriving, convert each incoming CompressedImage message into an OpenCV-readable format:
def image_cb(self, msg):
    """Process incoming images from the camera."""
    try:
        np_arr = np.frombuffer(msg.data, np.uint8)
        self.cv_image = cv.imdecode(np_arr, cv.IMREAD_COLOR)
    except Exception as e:
        rospy.logerr(f"Image conversion failed: {e}")
For each detected hand (hands are processed one at a time), convert MediaPipe's normalized landmark coordinates into pixel coordinates using the image dimensions; for example, a landmark at (0.5, 0.25) in a 640x480 frame maps to pixel (320, 120):
def calc_landmark_list(self, image, hand_landmarks):
    image_width, image_height = image.shape[1], image.shape[0]
    return [
        [
            int(landmark.x * image_width),
            int(landmark.y * image_height),
        ]
        for landmark in hand_landmarks.landmark
    ]
Normalize the coordinates relative to the wrist (the first landmark) and scale them so every value falls in [-1, 1]:
def preprocess_landmarks(self, landmark_list):
    # Work on a copy so the pixel coordinates stay intact for drawing later.
    temp_landmarks = copy.deepcopy(landmark_list)
    base_x, base_y = temp_landmarks[0]
    for landmark in temp_landmarks:
        landmark[0] -= base_x
        landmark[1] -= base_y
    flattened = list(itertools.chain.from_iterable(temp_landmarks))
    max_value = max(map(abs, flattened)) or 1  # avoid division by zero on a degenerate frame
    normalized = [n / max_value for n in flattened]
    return normalized
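As a quick sanity check of the math (the numbers below are illustrative only, and the call assumes you are inside the node class):

# Illustrative input in pixel coordinates, with the wrist first:
example = [[200, 300], [220, 260], [160, 220]]
# After subtracting the wrist: [[0, 0], [20, -40], [-40, -80]]
# Flattened: [0, 0, 20, -40, -40, -80]; the largest absolute value is 80.
print(self.preprocess_landmarks(example))
# -> [0.0, 0.0, 0.25, -0.5, -0.5, -1.0]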
Draw the landmarks for the detected hand:
def draw_landmarks(self, image, landmark_list):
    # Draw each landmark as a filled blue circle, then return the annotated image.
    for point in landmark_list:
        cv.circle(image, tuple(point), 5, (255, 0, 0), -1)
    return image
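If you also want the connections between the joints drawn, MediaPipe ships its own drawing helper. A minimal sketch, assuming mp is the imported mediapipe module and that you pass the raw hand_landmarks object rather than the pixel list:

import mediapipe as mp

mp_drawing = mp.solutions.drawing_utils

def draw_landmarks_with_connections(image, hand_landmarks):
    # Draws the 21 keypoints plus the skeleton lines between them.
    mp_drawing.draw_landmarks(
        image, hand_landmarks, mp.solutions.hands.HAND_CONNECTIONS
    )
    return image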
Putting it all together: flip the image so it reads like a mirror, convert it from BGR to RGB for MediaPipe, then detect, classify, and annotate each frame:
def process_frame(self):
    image = cv.flip(self.cv_image, 1)  # Mirror image for convenience
    debug_image = copy.deepcopy(image)

    # Convert image to RGB for MediaPipe
    rgb_image = cv.cvtColor(image, cv.COLOR_BGR2RGB)
    results = self.hands.process(rgb_image)

    if results.multi_hand_landmarks:
        for hand_landmarks in results.multi_hand_landmarks:
            # Process hand landmarks
            landmark_list = self.calc_landmark_list(image, hand_landmarks)
            preprocessed_landmarks = self.preprocess_landmarks(landmark_list)

            # Classify gesture
            hand_sign_id = self.keypoint_classifier(preprocessed_landmarks)
            hand_sign_label = self.keypoint_classifier_labels[hand_sign_id]

            # Annotate image
            brect = self.calc_bounding_rect(image, hand_landmarks)
            debug_image = self.draw_landmarks(debug_image, landmark_list)
            debug_image = self.draw_bounding_rect(debug_image, brect)
            debug_image = self.draw_info_text(debug_image, brect, hand_sign_label)

    cv.imshow("Gesture Recognition", debug_image)
    cv.waitKey(1)
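For context, here is a sketch of how the remaining pieces might be wired up and driven. calc_bounding_rect, draw_bounding_rect, and draw_info_text are assumed helper methods not shown above, and KeyPointClassifier, the label list, and the constructor arguments are placeholders for whatever you trained (a sketch of such a classifier wrapper follows at the end of this page):

import mediapipe as mp
import rospy


class GestureRecognitionNode:
    def __init__(self):
        # Camera subscription and image_cb as shown earlier; image_cb fills self.cv_image.
        self.cv_image = None
        # MediaPipe Hands, tracking at most one hand per frame.
        self.hands = mp.solutions.hands.Hands(
            max_num_hands=1,
            min_detection_confidence=0.7,
            min_tracking_confidence=0.5,
        )
        # Placeholder names: your trained classifier wrapper and its label list.
        self.keypoint_classifier = KeyPointClassifier("keypoint_classifier.tflite")
        self.keypoint_classifier_labels = ["Fist", "Open", "Point"]

    def run(self):
        rate = rospy.Rate(30)  # roughly the camera frame rate
        while not rospy.is_shutdown():
            if self.cv_image is not None:
                self.process_frame()
            rate.sleep()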
The classification itself is handled by a neural network that you will have pre-trained on your own gesture data:
hand_sign_id = self.keypoint_classifier(preprocessed_landmarks)
hand_sign_label = self.keypoint_classifier_labels[hand_sign_id]
self.keypoint_classifier:
This is a neural network that was trained on a dataset of hand landmarks corresponding to various gestures. Input: normalized landmarks (e.g., [x1, y1, x2, y2, ..., x21, y21]). Output: gesture ID (e.g., 0 for "Fist").
When called, it takes the normalized coordinates as input and returns a gesture class ID, an integer representing the specific gesture it recognized.
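The classifier can be anything that maps the 42 normalized values (21 landmarks x 2 coordinates) to a gesture ID. One common choice, as in the widely used MediaPipe keypoint-classification examples, is a small TensorFlow Lite model. A sketch of such a wrapper, assuming a model file named keypoint_classifier.tflite whose output is one score per gesture:

import numpy as np
import tensorflow as tf


class KeyPointClassifier:
    def __init__(self, model_path="keypoint_classifier.tflite"):
        # Load the TFLite model once and keep the interpreter around.
        self.interpreter = tf.lite.Interpreter(model_path=model_path)
        self.interpreter.allocate_tensors()
        self.input_details = self.interpreter.get_input_details()
        self.output_details = self.interpreter.get_output_details()

    def __call__(self, landmark_list):
        # landmark_list: 42 normalized floats (21 landmarks x 2 coordinates).
        input_tensor = np.array([landmark_list], dtype=np.float32)
        self.interpreter.set_tensor(self.input_details[0]["index"], input_tensor)
        self.interpreter.invoke()
        scores = self.interpreter.get_tensor(self.output_details[0]["index"])
        # The predicted gesture ID is the index with the highest score.
        return int(np.argmax(np.squeeze(scores)))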