SIFT (Scale-Invariant Feature Transform) is a computer vision algorithm for detecting and describing local features in images. Developed by David Lowe in 1999, SIFT has become a fundamental tool for various applications due to its robustness and versatility.

Use Cases of SIFT

👀 Object Recognition

SIFT is widely used for detecting and identifying specific objects within complex scenes. Its features are invariant to scale and rotation and robust to illumination changes, which makes it effective for recognizing objects across different viewpoints and conditions.

🪡 Image Stitching

One of SIFT’s prominent applications is in image stitching. The algorithm helps find corresponding points between overlapping images, allowing for seamless alignment and blending to create panoramas.
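
As a quick illustration, OpenCV ships a high-level Stitcher API that runs a feature-matching pipeline of this kind internally. A minimal sketch (the file names are placeholders):

import cv2

# Load two overlapping photos (placeholder file names)
images = [cv2.imread('left.jpg'), cv2.imread('right.jpg')]

# Stitcher matches local features, estimates homographies,
# and blends the aligned images into a panorama
stitcher = cv2.Stitcher_create()
status, pano = stitcher.stitch(images)

if status == cv2.Stitcher_OK:
    cv2.imwrite('panorama.jpg', pano)
else:
    print(f'Stitching failed with status code {status}')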

🗿 3D Reconstruction

SIFT is used in 3D modeling and reconstruction tasks. By identifying matching points between images taken from different angles, it enables triangulation of 3D point positions and thus the reconstruction of scenes or objects in 3D space.
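
For intuition, here is a minimal triangulation sketch. The projection matrices and matched point coordinates are made-up placeholders; in practice the matrices come from camera calibration and pose estimation:

import numpy as np
import cv2

# Placeholder 3x4 projection matrices for two calibrated cameras
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])                  # first camera at the origin
P2 = np.hstack([np.eye(3), np.array([[-1.0], [0.0], [0.0]])])  # second camera, translated

# Matched keypoint coordinates (2xN arrays), e.g. from SIFT matching
pts1 = np.array([[100.0, 250.0], [120.0, 240.0]]).T
pts2 = np.array([[ 95.0, 248.0], [115.0, 238.0]]).T

# Triangulate to homogeneous 4D points, then normalize to 3D
pts4d = cv2.triangulatePoints(P1, P2, pts1, pts2)
pts3d = (pts4d[:3] / pts4d[3]).T
print(pts3d)  # one 3D point per matched pair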

🤖 Robot Navigation and Mapping

In robotics, SIFT is employed for navigation and mapping purposes. Autonomous robots can use SIFT features to localize themselves within an environment and build maps of their surroundings.

Minimal Example

This example demonstrates the use of SIFT for feature detection and matching between two images using NumPy and OpenCV. It includes steps for keypoint detection, descriptor computation, matching, filtering good matches with Lowe's ratio test, and applying RANSAC to find a homography matrix. The results are visualized by drawing the matches between the two images.

Input images

👇 image_1.jpg

SIFT Input Image 1

👇 image_2.jpg

SIFT Input Image 2

👇 image_3.jpg

SIFT Input Image 3

👇 image_4.jpg

SIFT Input Image 4

👇 image_5.jpg

SIFT Input Image 5

Setup and feature detection

import numpy as np
import cv2
import sys

# Print Python and OpenCV versions for reference
print(f"Python version: {sys.version}")
print(f"OpenCV version: {cv2.__version__}")

# Load an image
# Repeat with 5 different input images to detect keypoints
img1 = cv2.imread('data/image_1.jpg', cv2.IMREAD_GRAYSCALE)  # input image

# Initialize SIFT detector
sift = cv2.SIFT_create()  # use cv2.xfeatures2d.SIFT_create() on OpenCV < 4.4

# Detect keypoints and compute descriptors for both images
kp1, des1 = sift.detectAndCompute(img1, None)
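
The keypoint counts and feature visualizations shown below can be produced with cv2.drawKeypoints, for example:

# Report the keypoint count and draw the detected features
print(f'{len(kp1)} keypoints detected')
img_kp = cv2.drawKeypoints(
    img1, kp1, None,
    flags=cv2.DRAW_MATCHES_FLAGS_DRAW_RICH_KEYPOINTS)  # draw size and orientation
cv2.imwrite('image_1_features.jpg', img_kp)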

Detected features

👇 image_1.jpg (3238 keypoints)

SIFT Image 1 Features

👇 image_2.jpg (2830 keypoints)

SIFT Image 2 Features

👇 image_3.jpg (1721 keypoints)

SIFT Image 3 Features

👇 image_4.jpg (1612 keypoints)

SIFT Image 4 Features

👇 image_5.jpg (1660 keypoints)

SIFT Image 5 Features

Descriptor Matching

# Minimum number of good matches required
MIN_MATCH_COUNT = 4

# Load the query and train images in grayscale
img1 = cv2.imread('data/image_1.jpg', cv2.IMREAD_GRAYSCALE)  # queryImage
img2 = cv2.imread('data/image_2.jpg', cv2.IMREAD_GRAYSCALE)  # trainImage

# Initialize SIFT detector
sift = cv2.SIFT_create()  # use cv2.xfeatures2d.SIFT_create() on OpenCV < 4.4

# Detect keypoints and compute descriptors for both images
kp1, des1 = sift.detectAndCompute(img1, None)
kp2, des2 = sift.detectAndCompute(img2, None)

# Initialize brute-force matcher and perform k-nearest neighbor matching
bf = cv2.BFMatcher()
matches = bf.knnMatch(des1, des2, k=2)

# Print total number of matches before RANSAC
print(f'Matches before RANSAC: {len(matches)}')

# Apply Lowe's ratio test to filter good matches
good = [m for m, n in matches if m.distance < 0.7 * n.distance]

# Print number of good matches
print(f'Number of "good" matches: {len(good)}')

# Sort good matches based on distance
good = sorted(good, key=lambda val: val.distance)

# Check if there are enough good matches
if len(good) > MIN_MATCH_COUNT:
    # Extract coordinates of matched keypoints
    src_pts = np.float32([kp1[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
    dst_pts = np.float32([kp2[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)

    # Find homography matrix using RANSAC
    M, mask = cv2.findHomography(src_pts, dst_pts, cv2.RANSAC, 5.0)
    matchesMask = mask.ravel().tolist()

    # Print homography matrix
    print("Homography matrix:")
    print(M)

    # Get dimensions of query image
    h, w = img1.shape
    pts = np.float32([[0, 0], [0, h-1], [w-1, h-1], [w-1, 0]]).reshape(-1, 1, 2)
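    # Project the query image's corners into the train image via the
    # homography and outline the detected object (one common way to
    # visualize the result; draws on img2 before cv2.drawMatches)
    dst = cv2.perspectiveTransform(pts, M)
    img2 = cv2.polylines(img2, [np.int32(dst)], True, 255, 3, cv2.LINE_AA)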
else:
    print(f"Not enough matches are found - {len(good)}/{MIN_MATCH_COUNT}")
    matchesMask = None

# Set drawing parameters for matches
draw_params = dict(
    matchColor=(0, 255, 0),  # Draw matches in green color
    singlePointColor=None,
    matchesMask=matchesMask,  # Draw only inliers
    flags=cv2.DRAW_MATCHES_FLAGS_NOT_DRAW_SINGLE_POINTS  # draw only match lines
)

# Print number of matches after applying homography (RANSAC)
if matchesMask:
    print(f'Number of matches after RANSAC: {matchesMask.count(1)}')

# Draw matches on the images
img_matches = cv2.drawMatches(img1, kp1, img2, kp2, good, None, **draw_params)
cv2.imshow('Matches', img_matches)

# Wait for a key press and close all windows
cv2.waitKey(0)
cv2.destroyAllWindows()

Top matches before RANSAC

A good match is one that passes Lowe's ratio test: the distance of the best match must be shorter than 0.7 times that of the second-best match (m.distance < 0.7 * n.distance).

👇 Between image_1.jpg and image_3.jpg (409 good matches)

Image 1 and 3

👇 Between image_1.jpg and image_4.jpg (235 good matches)

Image 1 and 4

👇 Between image_1.jpg and image_5.jpg (559 good matches)

Image 1 and 5

👇 Between image_2.jpg and image_3.jpg (199 good matches)

Image 2 and 3

👇 Between image_2.jpg and image_4.jpg (213 good matches)

Image 2 and 4

👇 Between image_2.jpg and image_5.jpg (48 good matches)

Image 2 and 5

Top matches after RANSAC

RANSAC (Random Sample Consensus) is an iterative algorithm used to estimate parameters of a mathematical model from a set of observed data that contains outliers.

These results are non-deterministic due to random sampling used in RANSAC.
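
For intuition, a toy version of the RANSAC loop for homography estimation might look like the sketch below (illustrative only; cv2.findHomography implements a tuned, production version of this idea):

import numpy as np
import cv2

def ransac_homography(src, dst, n_iters=1000, thresh=5.0):
    # src and dst are Nx2 float arrays of matched point coordinates
    best_H, best_inliers = None, 0
    n = len(src)
    for _ in range(n_iters):
        # 1. Randomly sample the minimal set: 4 correspondences
        idx = np.random.choice(n, 4, replace=False)
        H = cv2.getPerspectiveTransform(
            src[idx].astype(np.float32), dst[idx].astype(np.float32))
        # 2. Count inliers: points whose reprojection error is below the threshold
        proj = cv2.perspectiveTransform(
            src.reshape(-1, 1, 2).astype(np.float32), H)
        err = np.linalg.norm(proj.reshape(-1, 2) - dst, axis=1)
        inliers = np.sum(err < thresh)
        # 3. Keep the model supported by the most inliers
        if inliers > best_inliers:
            best_H, best_inliers = H, inliers
    return best_H, best_inliers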

👇 Between image_1.jpg and image_3.jpg (471 good matches)

Image 1 and 3 after RANSAC

👇 Between image_1.jpg and image_4.jpg (56 good matches)

Image 1 and 4 after RANSAC

👇 Between image_1.jpg and image_5.jpg (620 good matches)

Image 1 and 5 after RANSAC

👇 Between image_2.jpg and image_3.jpg (80 good matches)

Image 2 and 3 after RANSAC

👇 Between image_2.jpg and image_4.jpg (262 good matches)

Image 2 and 4 after RANSAC

👇 Between image_2.jpg and image_5.jpg (36 good matches)

Image 2 and 5 after RANSAC

Analysis of Results

How well does the method work?

The method works well. The results show good matches between our object image and images containing multiple objects. Notably, the objects were detected reliably even when partially occluded by obstacles.

Does it work equally well on the different examples?

Yes and no. The number of matches varies widely with the position and color-intensity distribution of the objects. Partially occluded objects yield far fewer keypoints and matches. The second textbook (“Computer Vision and Pattern Recognition”, image_2.jpg) generally produces fewer matches because of the small font of the book title and its large expanse of dark gray tones.

However, the perspective transformations derived from the homography matrices returned by cv2.findHomography() align very well with the objects across the different images. I would therefore conclude that the method works well, although not equally across all examples.

Current Status of SIFT

While SIFT remains a powerful and widely used algorithm in computer vision, several newer alternatives offer improved performance in certain areas (a drop-in ORB sketch follows the list):

  • SURF (Speeded Up Robust Features)
  • ORB (Oriented FAST and Rotated BRIEF)
  • AKAZE (Accelerated-KAZE)
  • BRISK (Binary Robust Invariant Scalable Keypoints)
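
For instance, ORB can serve as a near drop-in replacement in the matching pipeline above. Since its descriptors are binary, the brute-force matcher should use Hamming distance; a minimal sketch:

import cv2

img1 = cv2.imread('data/image_1.jpg', cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread('data/image_2.jpg', cv2.IMREAD_GRAYSCALE)

# ORB is patent-free, fast, and produces binary descriptors
orb = cv2.ORB_create(nfeatures=2000)
kp1, des1 = orb.detectAndCompute(img1, None)
kp2, des2 = orb.detectAndCompute(img2, None)

# Binary descriptors are compared with Hamming distance
bf = cv2.BFMatcher(cv2.NORM_HAMMING)
matches = bf.knnMatch(des1, des2, k=2)
good = [m for m, n in matches if m.distance < 0.7 * n.distance]
print(f'{len(good)} good ORB matches')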