Generative AI Basics
🔍 What Is Labeled vs. Unlabeled Data in Gen AI?
Feature | Labeled Data | Unlabeled Data |
---|---|---|
Definition | Data with tags or annotations that describe the input | Raw data without any tags or annotations |
Usage | Used in supervised learning and fine-tuning models | Used in self-supervised pretraining of large models |
Human Involvement | Requires manual or automated labeling | No human labeling; collected as-is |
Example (Text) | Review labeled as “positive” or “negative” | A review with no sentiment tag |
Example (Image) | Photo labeled as “cat”, “dog”, or “car” | Random image without description |
Application in Gen AI | Improves model accuracy, alignment, and safety | Enables large-scale model training |
Cost & Time | Expensive and time-consuming to produce | Comparatively cheap to gather from the web or data lakes, though filtering and deduplication still take effort |
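To make the distinction in the table concrete, here is a minimal Python sketch. The `Example` class and the sample reviews are hypothetical, invented only for illustration: a labeled example carries a tag, while an unlabeled one is just the raw input.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Example:
    """One data point; `label` stays None when nobody has annotated it."""
    text: str
    label: Optional[str] = None

    @property
    def is_labeled(self) -> bool:
        return self.label is not None


# Hypothetical product reviews: two are tagged, one is raw text.
dataset = [
    Example("The battery lasts all day.", label="positive"),
    Example("Stopped working after a week.", label="negative"),
    Example("Arrived on Tuesday in a brown box."),
]

# Split the pool the way a training pipeline might.
labeled = [ex for ex in dataset if ex.is_labeled]
unlabeled = [ex for ex in dataset if not ex.is_labeled]

print(len(labeled), "labeled,", len(unlabeled), "unlabeled")
```

In practice the unlabeled pool is usually far larger than the labeled one, which is why the two end up serving different training stages.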
📘 Examples
🔹 Labeled Data Examples:
- A dataset where each email is marked as “spam” or “not spam”
- Sentences labeled with the sentiment: positive, neutral, or negative
- Medical images annotated by doctors with diagnoses
- Customer chat transcripts tagged with resolution outcomes
Use in Gen AI:
- Fine-tuning models like Gemini or GPT for specific tasks
- Aligning models with human-provided labels and preference ratings so that outputs are safer and more helpful
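The key property of labeled data is that every input comes with the answer the model should produce. The toy word-count classifier below is only a sketch of that idea, not how Gemini or GPT are fine-tuned; the function names and the four example emails are made up for illustration.

```python
from collections import Counter, defaultdict


def train(examples):
    """Count how often each word appears under each label."""
    counts = defaultdict(Counter)
    for text, label in examples:
        for word in text.lower().split():
            counts[word][label] += 1
    return counts


def predict(counts, text, default="unknown"):
    """Let each known word vote for the labels it was seen with."""
    votes = Counter()
    for word in text.lower().split():
        votes.update(counts.get(word, {}))
    return votes.most_common(1)[0][0] if votes else default


# Hypothetical labeled dataset: (input, label) pairs.
labeled_emails = [
    ("win a free prize now", "spam"),
    ("free money click now", "spam"),
    ("meeting moved to friday", "not spam"),
    ("lunch on friday with the team", "not spam"),
]

model = train(labeled_emails)
prediction = predict(model, "claim your free prize")
print(prediction)
```

Without the second element of each pair there would be nothing for `train` to count against, which is the sense in which supervised learning depends on labels.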
🔹 Unlabeled Data Examples:
- Raw Wikipedia articles
- Open internet images without any metadata
- Video footage from cameras with no scene descriptions
- Audio recordings without transcription
Use in Gen AI:
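- Pretraining foundation models such as Gemma on raw text (image and video generators such as Imagen or Veo also depend on captioned media, which is a weak form of labeling)
- Letting models learn language, image structure, or video motion patterns at scale, using training signals derived from the data itself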
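Unlabeled text can still supply a training signal because the targets are cut from the text itself. The sketch below shows the next-token idea on a single sentence; whitespace splitting stands in for a real tokenizer, and `next_token_pairs` and `context_size` are hypothetical names chosen for this example.

```python
def next_token_pairs(text, context_size=3):
    """Turn raw text into (context, next_token) training pairs."""
    tokens = text.split()  # simplification: real models use subword tokenizers
    pairs = []
    for i in range(context_size, len(tokens)):
        context = tuple(tokens[i - context_size:i])
        target = tokens[i]  # the "label" is simply the next word
        pairs.append((context, target))
    return pairs


raw_text = "the cat sat on the mat"
pairs = next_token_pairs(raw_text)

for context, target in pairs:
    print(context, "->", target)
```

No human annotated anything here, yet each pair has an input and a target. This is why large volumes of raw text are usable for pretraining.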
🎯 Why It Matters in Generative AI
Aspect | Labeled Data | Unlabeled Data |
---|---|---|
Performance Tuning | Helps models respond more accurately in domain-specific tasks | Enables foundational capabilities like language understanding |
Bias Detection | Easier to measure and correct bias, because labels allow evaluation per class or group | Risk of hidden bias if source data is not curated |
Scalability | Harder to scale due to manual work | Scales easily with web scraping and automated pipelines |
Business Use | Custom AI tools (chatbots, document summarizers) | Training base models for future reuse |
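The bias-detection row can be illustrated with a small check that is only possible when true labels exist: comparing accuracy across groups. This is a sketch under assumed inputs; the group names, labels, and predictions below are invented, and `accuracy_by_group` is a hypothetical helper.

```python
from collections import defaultdict


def accuracy_by_group(records):
    """Compute accuracy separately for each group in (group, true, predicted) records."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for group, true_label, predicted in records:
        total[group] += 1
        correct[group] += int(true_label == predicted)
    return {group: correct[group] / total[group] for group in total}


# Hypothetical evaluation records for a sentiment model, grouped by review language.
records = [
    ("en", "positive", "positive"),
    ("en", "negative", "negative"),
    ("es", "positive", "negative"),
    ("es", "negative", "negative"),
]

scores = accuracy_by_group(records)
print(scores)
```

A large gap between groups is a prompt to inspect the data for that group. With unlabeled data there is no `true_label` column, so this kind of direct comparison is not available.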
🧠 Summary
- Labeled data is essential for fine-tuning and specialized use cases, such as sentiment analysis, medical diagnostics, and customer service bots.
- Unlabeled data is crucial for training general-purpose foundation models that power text generation, image creation, and video synthesis.
- Both types matter at different stages of the Gen AI development pipeline: unlabeled data dominates pretraining, while labeled data drives fine-tuning and alignment.