Talk To Segment
Edit only what you mean: segment by text, rewrite by prompt, keep everything else intact.
Text prompts identify regions without manual masks — zero-shot segmentation via open-vocabulary grounding.
Segmentation and editing stages are modular and independently replaceable.
Unmasked context is fully preserved for natural-looking, photorealistic outputs.
The Project Brief
A text-driven image editing pipeline that enables precise, mask-free modifications to any region of an image using natural language prompts.
How it works:
- A text prompt (e.g., “hands”) is fed to GroundingDINO + SAM to automatically segment the target region.
- The generated mask is passed to a Stable Diffusion Inpainting model alongside a new edit prompt (e.g., “replace hands with gloves”).
- The model synthesizes a photorealistic edit while preserving all unmasked regions.
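The preservation guarantee in the last step comes down to a mask-based composite: only masked pixels are taken from the synthesized image. A minimal NumPy sketch of that blend (toy arrays, not the actual pipeline code):

```python
import numpy as np

def composite_edit(original: np.ndarray, edited: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Blend the inpainted result into the original image.

    Pixels where mask == 1 come from the edited image; everything else
    is copied verbatim from the original, so unmasked regions are
    preserved exactly.
    """
    mask3 = mask[..., None].astype(original.dtype)  # broadcast mask over RGB channels
    return edited * mask3 + original * (1 - mask3)

# Toy example: 4x4 RGB images, edit only the top-left 2x2 patch.
original = np.zeros((4, 4, 3), dtype=np.float32)
edited = np.ones((4, 4, 3), dtype=np.float32)
mask = np.zeros((4, 4), dtype=np.float32)
mask[:2, :2] = 1.0

out = composite_edit(original, edited, mask)
```

In practice the inpainting model may already return a full image; re-compositing with the mask is a common final step to guarantee unmasked pixels are bit-identical to the input.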
Key contributions:
- Zero-shot segmentation via open-vocabulary grounding — no domain-specific fine-tuning required.
- Modular design: segmentation and inpainting stages can be swapped independently.
- Supports creative editing, content removal, attribute transfer, and artistic stylization.
Stack: Python · GroundingDINO · Segment Anything Model (SAM) · Stable Diffusion · HuggingFace Diffusers
Project Highlights
GroundingDINO + SAM produces a target mask, then Stable Diffusion Inpainting applies the transformation using a second prompt.
Use cases: creative edits, retouching, object replacement, and controllable image manipulation research.
The README presents side-by-side comparisons of input image, segmentation mask, and edited output as the core demonstration format.
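One detail in the GroundingDINO-to-SAM hand-off: GroundingDINO emits boxes as normalized (cx, cy, w, h) fractions of the image, while SAM's box prompt expects absolute (x1, y1, x2, y2) pixel corners. A small conversion sketch:

```python
import numpy as np

def cxcywh_norm_to_xyxy_pixels(boxes: np.ndarray, width: int, height: int) -> np.ndarray:
    """Convert normalized (cx, cy, w, h) boxes to pixel-space (x1, y1, x2, y2).

    GroundingDINO outputs center/size fractions of the image; SAM's
    box prompt wants absolute corner coordinates.
    """
    cx, cy, w, h = boxes[:, 0], boxes[:, 1], boxes[:, 2], boxes[:, 3]
    x1 = (cx - w / 2) * width
    y1 = (cy - h / 2) * height
    x2 = (cx + w / 2) * width
    y2 = (cy + h / 2) * height
    return np.stack([x1, y1, x2, y2], axis=1)

# A box centered mid-image, covering half of each dimension of a 640x480 frame.
boxes = np.array([[0.5, 0.5, 0.5, 0.5]])
pixel_boxes = cxcywh_norm_to_xyxy_pixels(boxes, 640, 480)
print(pixel_boxes)  # [[160. 120. 480. 360.]]
```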
Quickstart
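The repository's exact entry points aren't shown here, so the following is a hypothetical sketch of how the two stages wire together. `segment` and `inpaint` are stand-in stubs for the GroundingDINO + SAM and Stable Diffusion Inpainting calls; their names and signatures are illustrative, not the project's API.

```python
import numpy as np

def segment(image: np.ndarray, region_prompt: str) -> np.ndarray:
    """Stub for GroundingDINO + SAM: returns a binary mask for the prompt.

    Fakes a fixed region here; the real stage grounds the text to a
    box and refines it into a pixel-accurate mask.
    """
    mask = np.zeros(image.shape[:2], dtype=np.float32)
    mask[8:24, 8:24] = 1.0  # placeholder region
    return mask

def inpaint(image: np.ndarray, mask: np.ndarray, edit_prompt: str) -> np.ndarray:
    """Stub for Stable Diffusion Inpainting: fills only the masked region."""
    edited = image.copy()
    edited[mask == 1.0] = 0.5  # placeholder "synthesis"
    return edited

image = np.random.rand(32, 32, 3).astype(np.float32)
mask = segment(image, "hands")
result = inpaint(image, mask, "replace hands with gloves")

# Unmasked pixels are untouched, matching the preservation guarantee.
assert np.array_equal(result[mask == 0], image[mask == 0])
```

Swapping either stub for a real model keeps the wiring identical, which is the modularity claim above: the mask is the only contract between the two stages.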
"The best edit is the one the image never knew happened."