The Right Tool for the Job: A Guide to AI Models by Data Type
In the world of AI runtimes and platforms like Foundry, LM Studio, and the OpenAI API, we often talk about "models" as a single category. But just as you wouldn't use a calculator to write a novel, different AI models are built to handle specific types of data.
If you are building an application or setting up a local workflow, understanding these categories is the first step to success. Here is a breakdown of AI models categorized by the data they process.
A) Text & Language Models (The "Brains")
These are the most common models you’ll encounter in LM Studio and OpenAI. They process strings of text to understand intent or create content.
1) Chat/LLMs (Large Language Models)
Designed for conversation and reasoning.
- Examples: Llama 3 (Meta), phi-4 (Microsoft), GPT-4o (OpenAI).
- Use Case: Writing emails, coding, or summarizing a long report.
2) Embedding Models
These turn text into math (vectors) so computers can "search" by meaning rather than just keywords.
- Examples: nomic-embed-text, text-embedding-3-small.
- Use Case: Powering a search engine that understands what you mean, even if you don't use the exact words.
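The "search by meaning" idea boils down to comparing vectors. Here is a minimal sketch using cosine similarity on hand-made toy vectors; in a real system the vectors come from an embedding model and have hundreds of dimensions, and the example sentences in the comments are purely illustrative:

```python
import math

def cosine_similarity(a, b):
    """Measure how 'close' two vectors point in meaning (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy 3-dimensional vectors; a real embedding model returns hundreds of dimensions.
query = [0.9, 0.1, 0.2]     # "How do I reset my password?"
doc_a = [0.85, 0.15, 0.25]  # "Steps to recover your account login"
doc_b = [0.1, 0.9, 0.3]     # "Quarterly revenue report"

# The account-recovery document scores higher even though it shares no keywords.
print(cosine_similarity(query, doc_a) > cosine_similarity(query, doc_b))  # True
```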
B) Vision Models (The "Eyes")
Vision models allow AI to interpret the physical world through pixels. These are becoming standard in runtimes like Foundry.
1) Vision-Language Models (VLM)
These are "multimodal," meaning they can look at an image and then talk about it in a chat interface.
- Examples: Qwen2.5-VL, GPT-4o with Vision.
- Use Case: Uploading a screenshot of a complex chart and asking the AI to "Explain the trend in this data."
2) Image Generation
These turn text descriptions into brand-new visuals.
- Examples: DALL·E 3, Stable Diffusion.
C) Audio & Speech Models (The "Ears & Voice")
These models bridge the gap between sound waves and digital text.
1) Speech-to-Text (STT)
High-speed transcription.
- Example: Whisper (OpenAI).
- Use Case: Automatically creating subtitles for a video or transcribing a meeting in real-time.
2) Text-to-Speech (TTS)
Turning digital text into natural human voices.
- Use Case: Narrating an audiobook or providing a voice for a virtual assistant.
D) Reasoning & Structured Data Models (The "Logic")
Sometimes you don't need a chatty AI; you need a logical one that can handle math or massive tables of data.
1) Reasoning Models
These use "Chain of Thought" processing. They pause to "think" through a logical sequence before giving an answer.
- Examples: DeepSeek-R1, o1-preview.
- Use Case: Solving complex math problems or debugging deep architectural flaws in code.
2) Rerank Models
These specialize in ranking. Given a query and a list of candidate documents, they re-order the candidates by relevance to that query.
- Example: Cohere-rerank.
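To make the "re-order by relevance" idea concrete, here is a toy reranker that scores documents by word overlap with the query. This is only a stand-in: a real rerank model such as Cohere-rerank scores each query-document pair with a trained cross-encoder, not simple overlap:

```python
def rerank(query, documents):
    """Toy reranker: score each document by word overlap with the query,
    then return the documents sorted best-first. A real rerank model
    (e.g. Cohere-rerank) uses a cross-encoder instead of word overlap."""
    q_words = set(query.lower().split())
    scored = [(len(q_words & set(doc.lower().split())), doc) for doc in documents]
    return [doc for score, doc in sorted(scored, key=lambda p: p[0], reverse=True)]

docs = ["annual leave policy", "office parking rules", "maternity leave policy details"]
print(rerank("maternity leave policy", docs)[0])  # maternity leave policy details
```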
E) Summary: Which One Do You Need?
| If your data is... | Use this Model Type | Popular Choice |
|---|---|---|
| A conversation | Chat/LLM | Llama 3.1 |
| A library of PDFs | Embedding | nomic-embed |
| A photo or chart | Vision (VLM) | phi-4-multimodal |
| An audio recording | STT | Whisper |
| A complex math problem | Reasoning | DeepSeek-R1 |
F) The Future is Multimodal
The most exciting trend in runtimes like Foundry is the move toward Multimodal models. Instead of needing separate models for text, images, and audio, a single model like GPT-4o or Phi-4-multimodal can process all of them at once.
G) Example: Designing a Private Corporate Brain
Let’s design a Private Corporate Brain—a system that allows employees to upload internal documents (PDFs, Excel, Wiki pages) and ask questions about them securely. This is a classic RAG (Retrieval-Augmented Generation) system. Here is the high-level design broken down by the flow of data.
Phase 1: The Ingestion Pipeline (Data to Math)
This happens behind the scenes before you ever start chatting.
- The Product Action: You upload a 100-page HR Policy PDF.
- System Design:
- Parser: A script (like PyPDF) reads the text.
- Chunker: The system breaks the 100 pages into a few hundred small "snippets" of roughly 200 words each.
- Model Used: Embedding Model.
- Choice: nomic-embed-text (Local/Foundry) or text-embedding-3-small (OpenAI).
- Role: It turns each snippet into a vector (a string of numbers).
- Storage: These vectors are saved in a Vector Database (like ChromaDB or Pinecone).
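The chunking step is simple enough to sketch directly. This is a minimal fixed-size word chunker, assuming the PDF text has already been parsed to a plain string; production systems usually add overlap between chunks and split on sentence boundaries:

```python
def chunk_text(text, words_per_chunk=200):
    """Split a long parsed document into fixed-size word chunks.
    Each chunk is then sent to the embedding model, and the resulting
    vector is stored in a vector database such as ChromaDB."""
    words = text.split()
    return [" ".join(words[i:i + words_per_chunk])
            for i in range(0, len(words), words_per_chunk)]

document = "word " * 450  # stand-in for parsed PDF text
chunks = chunk_text(document)
print(len(chunks))  # 3 chunks: 200 + 200 + 50 words
```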
Phase 2: The Retrieval Step (Search)
This happens the moment you type a question.
- The Product Action: You type: "What is the maternity leave policy?"
- System Design:
- Model Used: Embedding Model (Same one as Phase 1).
- Role: It turns your question into a vector.
- Vector Search: The system compares your "question vector" against the "snippet vectors" in the database.
- Result: It pulls the top 3 most mathematically similar snippets from your HR PDF.
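The vector search step can be sketched as a brute-force nearest-neighbor lookup. The tiny 2-D vectors below are illustrative placeholders; a real vector database does this over hundreds of dimensions with approximate-nearest-neighbor indexes for speed:

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b)))

def top_k(query_vec, snippet_vecs, k=3):
    """Return the indices of the k snippets most similar to the question vector."""
    scores = [(cosine(query_vec, vec), i) for i, vec in enumerate(snippet_vecs)]
    return [i for score, i in sorted(scores, reverse=True)[:k]]

# Toy 2-D vectors; real embeddings have hundreds of dimensions.
question = [1.0, 0.1]
snippets = [[0.9, 0.2], [0.1, 1.0], [1.0, 0.0], [0.5, 0.5]]
print(top_k(question, snippets))  # [2, 0, 3]
```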
Phase 3: The Generation Step (The Answer)
This is where the AI actually "talks."
- The Product Action: The UI displays a natural language answer.
- System Design:
- The Prompt: The system sends a hidden instruction to the LLM: "Here is some context from our files: [Snippet 1, Snippet 2]. Based ONLY on this, answer the user's question: What is the maternity leave policy?"
- Model Used: Chat/LLM Model.
- Choice: phi-4 (Local/Foundry) for privacy, or gpt-4o (OpenAI) for complex logic.
- Role: It reads the context and writes a polite, accurate response.
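Assembling that hidden instruction is plain string work. Here is a minimal sketch of the prompt builder; the exact wording is an illustrative choice, and the snippets would come from the retrieval step:

```python
def build_rag_prompt(snippets, question):
    """Assemble the hidden instruction sent to the chat model.
    Grounding the model in retrieved context ("Based ONLY on this")
    discourages it from inventing answers."""
    context = "\n".join(f"- {s}" for s in snippets)
    return (
        "Here is some context from our files:\n"
        f"{context}\n"
        "Based ONLY on this context, answer the user's question: "
        f"{question}"
    )

prompt = build_rag_prompt(
    ["Maternity leave is 16 weeks at full pay.", "Leave requests go through HR."],
    "What is the maternity leave policy?",
)
print(prompt)
```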
Summary of Models Used
| Component | Model Category | Specific Example |
|---|---|---|
| Indexing/Search | Embedding Model | nomic-embed-text |
| Refining Search | Rerank Model (Optional) | bge-reranker |
| Generating Answer | Chat/LLM Model | gpt-4o or Llama 3.1 |
| Analyzing Images | Vision Model | phi-4-multimodal |
For a hands-on example of building a Retrieval-Augmented Generation (RAG) system, check out our Local RAG Studio created in Week 2 of the 24 Weeks project.
H) Example: Building an Automated Data Extraction Pipeline
This product is an Automated Data Extraction Pipeline (often called an "Intelligent Document Processor" or IDP). It is designed to take "messy" data like an email thread or a photo of a crumpled receipt and turn it into a clean, machine-readable JSON file that your accounting software can understand.
Phase 1: The Input & Recognition (The "Eyes")
Before the data can be turned into JSON, the system must "see" it properly.
- The Product Action: An employee uploads a photo of a taxi receipt or forwards a vendor email.
- System Design:
- OCR (Optical Character Recognition): If it’s an image, the system extracts the text.
- Vision Model (VLM): Modern systems use a vision model to understand the layout (e.g., recognizing that the number at the bottom right is the "Total").
- Model Used: Vision-Language Model.
- Choice: gpt-4o (OpenAI) or Qwen2-VL (Local/LM Studio).
- Role: It looks at the image and converts it into a structured text description.
Phase 2: Schema-Based Parsing (The "Logic")
This is where the unstructured text is forced into a specific JSON format (e.g., {"vendor": "Uber", "amount": 25.50}).
- The Product Action: The system maps the text to your database fields.
- System Design:
- Function Calling / Structured Output: You provide the model with a JSON Schema (a blueprint of the fields you need).
- Model Used: Reasoning or Small Language Model (SLM).
- Choice: gpt-4o-mini (OpenAI) for high accuracy/low cost, or phi-4 (Foundry) for local processing.
- Role: It acts as a "translator," ignoring the fluff (like "Regards, John") and extracting only the required keys: Date, Vendor, Total, and Currency.
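The schema and the check against it can be sketched with the standard library alone. The receipt fields below mirror the ones named above; a production system would hand this schema to the model's structured-output feature and validate the reply with a full JSON Schema library such as `jsonschema`:

```python
import json

# The "blueprint" handed to the model so it returns exactly these fields.
RECEIPT_SCHEMA = {
    "type": "object",
    "properties": {
        "date": {"type": "string"},
        "vendor": {"type": "string"},
        "total": {"type": "number"},
        "currency": {"type": "string"},
    },
    "required": ["date", "vendor", "total", "currency"],
}

def validate_extraction(raw_json):
    """Check the model's output against the schema's required keys.
    (A real system would use a full JSON Schema validator.)"""
    data = json.loads(raw_json)
    missing = [k for k in RECEIPT_SCHEMA["required"] if k not in data]
    if missing:
        raise ValueError(f"model output missing fields: {missing}")
    return data

# Stand-in for the model's structured output:
model_output = '{"date": "2024-05-01", "vendor": "Uber", "total": 25.50, "currency": "USD"}'
print(validate_extraction(model_output)["vendor"])  # Uber
```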
Phase 3: Validation & Error Handling (The "Checker")
AI can sometimes hallucinate, so the system must validate the JSON against your business rules.
- The Product Action: The system flags a receipt if the "Tax" + "Subtotal" does not equal the "Total."
- System Design:
- Code Interpreter / Logic Layer: A simple Python script validates the math.
- Model Used: Reasoning Model.
- Choice: o1-mini (OpenAI) or DeepSeek-R1 (Local).
- Role: If the validation fails, this model "thinks" through the error to see if it misread a number (e.g., mistaking an '8' for a 'B').
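The logic-layer check itself needs no AI at all. Here is a minimal sketch of the math validation, with a small tolerance for floating-point rounding; the field names match the extraction schema above:

```python
def validate_receipt(receipt, tolerance=0.01):
    """Flag a receipt when subtotal + tax does not add up to the total.
    Failures would be routed to a reasoning model for a second look."""
    expected = receipt["subtotal"] + receipt["tax"]
    return abs(expected - receipt["total"]) <= tolerance

good = {"subtotal": 23.00, "tax": 2.50, "total": 25.50}
bad = {"subtotal": 23.00, "tax": 2.50, "total": 28.50}  # likely misread digit
print(validate_receipt(good), validate_receipt(bad))  # True False
```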
Summary of Models Used for Data Extraction
| Component | Model Category | Specific Example |
|---|---|---|
| Reading the Image | Vision Model | gpt-4o or Qwen2-VL |
| JSON Extraction | Chat/LLM (Structured) | gpt-4o-mini or phi-4 |
| Math/Logic Check | Reasoning Model | o1-mini or DeepSeek-R1 |
Why use different models? You wouldn't use a massive model like GPT-4o just to check if 10 + 5 = 15. You use a Vision model to see, a Small model (like phi-4) to format the JSON cheaply, and a Reasoning model only when the logic gets complicated.
For a hands-on example of building an Automated Data Extraction Pipeline, check out our Week 3 project from the 24 Weeks series.
I) Example: Voice-to-Workflow Pipeline
Transform voice input into structured tasks such as tickets or notes.
This product is a Digital Chief of Staff. It’s designed for the "busy hands" professional—like a doctor finishing a surgery or a construction manager on-site—who needs to speak a stream of consciousness and have it turn into organized, actionable data (like a Jira ticket or a CRM entry).
Phase 1: The Transcription (The "Ears")
The system first needs to turn raw audio waves into highly accurate text, filtering out background noise and "umms/ahhs."
- The Product Action: You tap a button and say: "Hey, remind me to follow up with Sarah from Hooli about the API migration, set a high priority, and link it to the Q3 Project."
- System Design:
- STT (Speech-to-Text): The audio file is sent to a transcription engine.
- Diarization: If multiple people are talking, the system identifies "Who said what."
- Model Used: Audio/Speech Model.
- Choice: Whisper (OpenAI) or Deepgram.
- Role: It creates a perfect text transcript of your voice.
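The model call itself is an API request to Whisper or Deepgram, but the "umms/ahhs" cleanup is often a plain post-processing step on the returned transcript. Here is a minimal sketch of that filler-word filter; the filler list is an illustrative assumption, not an exhaustive one:

```python
import re

FILLERS = {"um", "umm", "uh", "ah", "ahh", "hmm"}

def clean_transcript(text):
    """Strip filler words a raw STT engine may leave in the transcript.
    Punctuation is ignored when matching, but kept on surviving words."""
    words = [w for w in text.split()
             if re.sub(r"\W", "", w).lower() not in FILLERS]
    return " ".join(words)

raw = "Um, remind me to, uh, follow up with Sarah from Hooli."
print(clean_transcript(raw))  # remind me to, follow up with Sarah from Hooli.
```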
Phase 2: Intent Extraction (The "Brain")
This is the most critical step. The AI must figure out that "Sarah from Hooli" is a Contact, "API migration" is the Subject, and "High" is the Priority.
- The Product Action: The system maps your "messy" speech into a structured format.
- System Design:
- NER (Named Entity Recognition): The model identifies people, companies, and dates.
- Instruction Following: The model is told: "Extract the task, the person involved, and the urgency from this text."
- Model Used: Chat/LLM Model (Optimized for Extraction).
- Choice: gpt-4o-mini (OpenAI), Claude 3.5 Haiku (Anthropic), or phi-4 (Foundry).
- Role: It outputs a JSON block like {"task": "Follow up", "contact": "Sarah", "urgency": "High"}.
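To show the shape of the output, here is a toy intent extractor that pattern-matches the fields an instruction-following LLM would pull out. The regex patterns are purely illustrative stand-ins for the model's language understanding; a real system sends the transcript and the field list to the LLM instead:

```python
import re

def extract_task(transcript):
    """Toy intent extractor: regex stand-in for what an LLM does via
    instruction following. Returns the same JSON-style dict the real
    model would produce."""
    result = {"task": None, "contact": None, "urgency": "Normal"}
    m = re.search(r"follow up with (\w+)", transcript, re.IGNORECASE)
    if m:
        result["task"] = "Follow up"
        result["contact"] = m.group(1)
    if re.search(r"high priority", transcript, re.IGNORECASE):
        result["urgency"] = "High"
    return result

text = "Remind me to follow up with Sarah from Hooli, set a high priority."
print(extract_task(text))  # {'task': 'Follow up', 'contact': 'Sarah', 'urgency': 'High'}
```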
Phase 3: The Execution (The "Hands")
The structured data is then "pushed" into the actual software tools you use.
- The Product Action: A new ticket appears in your project management tool (like Jira, Trello, or Notion) with all fields pre-filled.
- System Design:
- Function Calling / Tool Use: The AI model is given access to "Tools" (APIs). It doesn't just write the text; it triggers a command like create_jira_ticket(title, priority).
- Model Used: Agentic/Action Model.
- Choice: gpt-4o (OpenAI) or Claude 3.5 Sonnet.
- Role: It decides which tool to use (e.g., "This sounds like a calendar event, not a ticket") and executes the API call.
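The dispatch side of function calling is a small routing layer. Here is a minimal sketch: the `call` dict mimics the shape of a model's function-calling response, and the tool functions are hypothetical stubs standing in for real Jira or calendar API wrappers:

```python
def create_jira_ticket(title, priority):
    """Stub for a real Jira API wrapper."""
    return f"ticket created: {title} ({priority})"

def create_calendar_event(title, start):
    """Stub for a real calendar API wrapper."""
    return f"event created: {title} at {start}"

TOOLS = {
    "create_jira_ticket": create_jira_ticket,
    "create_calendar_event": create_calendar_event,
}

def execute_tool_call(call):
    """Dispatch the model's chosen tool call to the matching API wrapper.
    In production, `call` comes from the model's function-calling response;
    here it is hard-coded for illustration."""
    return TOOLS[call["name"]](**call["arguments"])

model_choice = {
    "name": "create_jira_ticket",
    "arguments": {"title": "Follow up with Sarah re: API migration",
                  "priority": "High"},
}
print(execute_tool_call(model_choice))
```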
Summary of Models Used
| Component | Model Category | Specific Example |
|---|---|---|
| Audio to Text | Speech-to-Text (STT) | Whisper v3 |
| Understanding Intent | Chat/LLM (Structured) | gpt-4o-mini |
| Taking Action | Agentic/Tool Use | Claude 3.5 Sonnet |
Available Products
- OpenAI: The Realtime API allows the creation of "Voice-to-Action" apps with minimal delay.
- Google: Vertex AI Agent Builder can connect Gemini to enterprise data and voice interfaces.
- Consumer Apps: Apps like Granola and AudioPen transform voice notes into structured summaries or tasks.
For a hands-on example of building a Voice-to-Workflow Pipeline, check out our Week 4 project from the 24 Weeks series.