Chapter 01  ·  Vision · Language · Diffusion

Talk To Segment

Edit only what you mean: segment by text, rewrite by prompt, keep everything else intact.

GroundingDINO · SAM · Stable Diffusion Inpainting · HuggingFace Diffusers · Prompt-based Segmentation
Concept Note

  • Text prompts identify regions without manual masks — zero-shot segmentation via open-vocabulary grounding.
  • Segmentation and editing stages are modular and independently replaceable.
  • Unmasked context is fully preserved for natural-looking, photorealistic outputs.

The Project Brief

A text-driven image editing pipeline that enables precise, mask-free modifications to any region of an image using natural language prompts.

How it works:

  1. A text prompt (e.g., “hands”) drives automatic segmentation: GroundingDINO detects a bounding box for the prompt, and SAM refines it into a pixel-level mask.
  2. The generated mask is passed to a Stable Diffusion Inpainting model alongside a new edit prompt (e.g., “replace hands with gloves”).
  3. The model synthesizes a photorealistic edit while preserving all unmasked regions.
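Step 3's guarantee — that unmasked pixels survive untouched — comes down to compositing the model's output back over the original using the mask. A minimal sketch of that idea (function and array names are illustrative, not taken from the repo):

```python
import numpy as np

def composite(original: np.ndarray, edited: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Blend the inpainted result back into the source image.

    original, edited: HxWx3 float arrays in [0, 1].
    mask: HxW float array, 1.0 inside the edited region, 0.0 outside.
    """
    m = mask[..., None]  # add a channel axis so the mask broadcasts over RGB
    return m * edited + (1.0 - m) * original
```

Outside the mask the result is bit-identical to the input, which is what makes the edit read as if it was always part of the photo.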

Key contributions:

  • Zero-shot segmentation via open-vocabulary grounding — no domain-specific fine-tuning required.
  • Modular design: segmentation and inpainting stages can be swapped independently.
  • Supports creative editing, content removal, attribute transfer, and artistic stylization.
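The modularity claim can be made concrete by treating each stage as a plain callable, so any detector+segmenter or inpainter with the matching signature can be dropped in. A sketch with stub stages (all names hypothetical, standing in for GroundingDINO+SAM and SD Inpainting):

```python
from typing import Callable
import numpy as np

# image, object prompt -> mask
Segmenter = Callable[[np.ndarray, str], np.ndarray]
# image, mask, edit prompt -> image
Inpainter = Callable[[np.ndarray, np.ndarray, str], np.ndarray]

def edit_image(image, object_prompt, edit_prompt,
               segment: Segmenter, inpaint: Inpainter):
    """Two-stage pipeline; either stage can be swapped independently."""
    mask = segment(image, object_prompt)
    return inpaint(image, mask, edit_prompt)

def stub_segment(image, prompt):
    mask = np.zeros(image.shape[:2])
    mask[1:3, 1:3] = 1.0  # pretend the prompt grounded to this region
    return mask

def stub_inpaint(image, mask, prompt):
    # fill the masked region with white, leave the rest untouched
    return np.where(mask[..., None] > 0, 1.0, image)

out = edit_image(np.zeros((4, 4, 3)), "hands", "gloves",
                 stub_segment, stub_inpaint)
```

Replacing `stub_inpaint` with a Diffusers inpainting call (or `stub_segment` with a different grounding model) changes nothing else in the pipeline.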

Stack: Python · GroundingDINO · Segment Anything Model (SAM) · Stable Diffusion · HuggingFace Diffusers


Key Components

🎯 Grounding DINO: open-vocabulary object detection from text
✂️ Segment Anything: zero-shot region masking from a bounding box
🎨 SD Inpainting: prompt-guided photorealistic region fill
🔗 Modular Pipeline: swap any stage without breaking the rest
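The hand-off between the first two components is a bounding box: GroundingDINO emits box coordinates for the text query, and SAM takes that box as a prompt. As a rough stand-in for what SAM receives, here is a rasterizer turning a normalized `(x0, y0, x1, y1)` box into a binary mask (the helper name and normalized-coordinate convention are assumptions for illustration):

```python
import numpy as np

def box_to_mask(box, height, width):
    """Rasterize a normalized (x0, y0, x1, y1) box into an HxW binary mask."""
    x0, y0, x1, y1 = box
    mask = np.zeros((height, width), dtype=np.uint8)
    r0, r1 = int(y0 * height), int(np.ceil(y1 * height))
    c0, c1 = int(x0 * width), int(np.ceil(x1 * width))
    mask[r0:r1, c0:c1] = 1
    return mask
```

In the real pipeline SAM refines this rectangle into a tight, pixel-accurate object mask rather than a filled box.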

Project Highlights

1. Pipeline: GroundingDINO + SAM produce a target mask, then Stable Diffusion Inpainting applies the transformation using a second prompt.

2. Use Cases: creative edits, retouching, object replacement, and controllable image manipulation research.

3. Result Structure: the README presents input image, segmentation mask, and edited output side by side as the core demonstration format.
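The input / mask / output comparison the README describes can be assembled into a single side-by-side strip; a small sketch of that demo figure (the helper is purely illustrative):

```python
import numpy as np

def comparison_strip(inp, mask, out):
    """Stack input image, mask, and edited output horizontally for a demo figure."""
    # lift the HxW mask to HxWx3 so it tiles alongside the RGB images
    mask_rgb = np.repeat(mask[..., None], 3, axis=-1).astype(inp.dtype)
    return np.hstack([inp, mask_rgb, out])
```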

Quickstart

  1. Clone the repo and install dependencies from the requirements file.
  2. Run segmentation with an object prompt (for example: “hands”) to generate a mask.
  3. Feed the mask and an edit prompt (for example: “replace hands with gloves”) to the inpainting stage.
"The best edit is the one the image never knew happened."