
Learn Azure AI Document Intelligence with C# to extract data from PDFs and documents automatically. Step-by-step tutorial with code examples.
Azure AI Document Intelligence with C# — Extract Data from Documents Automatically
If you've ever needed to pull structured data out of invoices, receipts, contracts, or any scanned document, you know how painful manual data entry can be. Azure AI Document Intelligence (formerly Azure Form Recognizer) solves this problem by using pre-trained AI models to extract text, key-value pairs, tables, and structured fields from documents — with just a few lines of C# code.
In this Azure AI Document Intelligence C# tutorial, you'll learn how to set up the service, analyze documents using prebuilt and custom models, and handle real-world extraction scenarios. Whether you're automating invoice processing, digitizing paper forms, or building an intelligent document pipeline, this guide covers everything you need to get started.
What Is Azure AI Document Intelligence?
Azure AI Document Intelligence is a cloud-based AI service from Microsoft that applies machine learning to extract text, structure, and semantic meaning from documents. It supports PDFs, images (JPEG, PNG, TIFF, BMP), and Office file formats.
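It was previously known as Azure Form Recognizer. If you've searched for "Azure Form Recognizer C#" and ended up here, you're in the right place: Microsoft renamed the service to Azure AI Document Intelligence, and the core concepts carry over. The SDK did change, though. The Azure.AI.DocumentIntelligence package used in this tutorial has different client and type names from the older Azure.AI.FormRecognizer package.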
Key Capabilities
- Prebuilt models — Ready-to-use models for invoices, receipts, ID documents, W-2 forms, business cards, and more
- Layout analysis — Extract text, tables, selection marks, and document structure from any document
- Custom models — Train your own models on domain-specific documents
- Read (OCR) — High-accuracy optical character recognition for printed and handwritten text
- Add-on capabilities — Barcode extraction, formula recognition, font detection, and high-resolution analysis
Setting Up Azure AI Document Intelligence in Your C# Project
Step 1: Create the Azure Resource
Before writing any code, you need an Azure AI Document Intelligence resource:
- Go to the Azure Portal and search for "Document Intelligence"
- Click Create, choose your subscription, resource group, and region
- Select the Free (F0) tier to start — it gives you 500 free pages per month
- Once deployed, copy the Endpoint and Key from the "Keys and Endpoint" section
Step 2: Install the NuGet Package
The official SDK package is Azure.AI.DocumentIntelligence. Install it via the .NET CLI:
dotnet add package Azure.AI.DocumentIntelligence
Step 3: Configure Authentication
Store your endpoint and key securely. For development, user secrets or environment variables work well. Never hard-code credentials in source files.
using Azure;
using Azure.AI.DocumentIntelligence;

// Load from environment variables or configuration
string endpoint = Environment.GetEnvironmentVariable("AZURE_DOCUMENT_ENDPOINT")!;
string apiKey = Environment.GetEnvironmentVariable("AZURE_DOCUMENT_KEY")!;

var client = new DocumentIntelligenceClient(
    new Uri(endpoint),
    new AzureKeyCredential(apiKey)
);
For production applications, use Azure.Identity with DefaultAzureCredential instead of API keys. This supports Managed Identity and avoids storing secrets entirely.
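As a rough sketch of that keyless setup, the client can be built with a token credential instead of a key. This assumes the separate Azure.Identity NuGet package is installed, that the endpoint is the resource's custom-subdomain endpoint, and that the signed-in identity has been granted access to the resource (typically through a role such as Cognitive Services User). The environment variable name is reused from the example above.

```csharp
using System;
using Azure.AI.DocumentIntelligence;
using Azure.Identity; // separate NuGet package: Azure.Identity

// Assumption: AZURE_DOCUMENT_ENDPOINT holds the resource's custom-subdomain endpoint.
string endpoint = Environment.GetEnvironmentVariable("AZURE_DOCUMENT_ENDPOINT")
    ?? throw new InvalidOperationException("AZURE_DOCUMENT_ENDPOINT is not set.");

// DefaultAzureCredential tries several sources in turn (environment variables,
// managed identity, developer sign-in such as the Azure CLI), so the same code
// can run on a developer machine and in Azure without a stored key.
var client = new DocumentIntelligenceClient(new Uri(endpoint), new DefaultAzureCredential());
```

The rest of the tutorial works the same way with either credential type, since only the client construction differs.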
Extract Text from PDF with C# — Using the Read Model
The simplest use case is OCR — extracting all text from a document. The prebuilt-read model handles this for printed and handwritten text across 300+ languages.
using Azure;
using Azure.AI.DocumentIntelligence;

var client = new DocumentIntelligenceClient(
    new Uri(endpoint),
    new AzureKeyCredential(apiKey)
);

// Analyze a document from a URL.
// AnalyzeDocumentOptions is the request type in the 1.x SDK
// (earlier preview packages used AnalyzeDocumentContent).
var options = new AnalyzeDocumentOptions(
    "prebuilt-read",
    new Uri("https://example.com/sample-document.pdf")
);
var operation = await client.AnalyzeDocumentAsync(WaitUntil.Completed, options);
AnalyzeResult result = operation.Value;

// Extract all text content
Console.WriteLine("--- Extracted Text ---");
foreach (DocumentPage page in result.Pages)
{
    Console.WriteLine($"Page {page.PageNumber} ({page.Width}x{page.Height} {page.Unit})");
    foreach (DocumentLine line in page.Lines)
    {
        Console.WriteLine($"  {line.Content}");
    }
}

// Extract paragraphs with roles (title, header, footnote, etc.)
if (result.Paragraphs != null)
{
    foreach (DocumentParagraph paragraph in result.Paragraphs)
    {
        // Role is an optional value rather than a string; treat a missing role as body text
        string role = paragraph.Role?.ToString() ?? "body";
        Console.WriteLine($"[{role}] {paragraph.Content}");
    }
}
This approach is ideal when you need raw text from any document — PDFs, scanned images, or photos of printed material. The read model is fast and cost-effective for high-volume OCR in C#.
Extract Data from Invoices Using the Prebuilt Model
The real power of Azure AI Document Intelligence is in its prebuilt models. The invoice model extracts structured fields like vendor name, invoice number, total amount, line items, and due dates — without any training.
var options = new AnalyzeDocumentOptions(
    "prebuilt-invoice",
    new Uri("https://example.com/invoice.pdf")
);
var operation = await client.AnalyzeDocumentAsync(WaitUntil.Completed, options);
AnalyzeResult result = operation.Value;

foreach (AnalyzedDocument invoice in result.Documents)
{
    Console.WriteLine($"Document Type: {invoice.DocumentType}");
    Console.WriteLine($"Confidence: {invoice.Confidence}");

    if (invoice.Fields.TryGetValue("VendorName", out DocumentField? vendorField))
    {
        Console.WriteLine($"Vendor: {vendorField.Content} " +
            $"(confidence: {vendorField.Confidence})");
    }
    if (invoice.Fields.TryGetValue("InvoiceId", out DocumentField? idField))
    {
        Console.WriteLine($"Invoice ID: {idField.Content}");
    }
    if (invoice.Fields.TryGetValue("InvoiceTotal", out DocumentField? totalField))
    {
        Console.WriteLine($"Total: {totalField.Content}");
    }
    if (invoice.Fields.TryGetValue("InvoiceDate", out DocumentField? dateField))
    {
        Console.WriteLine($"Date: {dateField.Content}");
    }

    // Extract line items: "Items" is a list field whose entries are dictionaries of sub-fields
    if (invoice.Fields.TryGetValue("Items", out DocumentField? itemsField)
        && itemsField.ValueList != null)
    {
        Console.WriteLine("\nLine Items:");
        foreach (DocumentField item in itemsField.ValueList)
        {
            if (item.ValueDictionary != null)
            {
                var fields = item.ValueDictionary;
                string description = fields.TryGetValue("Description", out var desc)
                    ? desc.Content : "N/A";
                string amount = fields.TryGetValue("Amount", out var amt)
                    ? amt.Content : "N/A";
                string quantity = fields.TryGetValue("Quantity", out var qty)
                    ? qty.Content : "N/A";
                Console.WriteLine($"  - {description} | Qty: {quantity} | Amount: {amount}");
            }
        }
    }
}
This is how you extract data from PDF files in C# with zero manual parsing. The model returns field-level confidence scores, so you can flag low-confidence extractions for human review.
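The example above reads each field through Content, which is the raw text as it appears on the page. If you need a number for calculations, one option is to parse that text yourself. The sketch below assumes US-style amounts such as "$1,234.56"; ParseUsdAmount is a hypothetical helper, not part of the SDK. The SDK also exposes typed value properties on DocumentField, which are preferable when they are populated.

```csharp
using System;
using System.Globalization;

// Hypothetical helper: parse a US-style amount such as "$1,234.56" into a decimal.
// Returns null when the text is missing or is not a recognizable amount.
static decimal? ParseUsdAmount(string? content)
{
    // Invariant formatting uses "." for decimals and "," for grouping;
    // only the currency symbol is overridden here.
    var format = (NumberFormatInfo)NumberFormatInfo.InvariantInfo.Clone();
    format.CurrencySymbol = "$";

    return decimal.TryParse(content, NumberStyles.Currency, format, out decimal value)
        ? value
        : (decimal?)null;
}

decimal? total = ParseUsdAmount("$1,234.56");
decimal? missing = ParseUsdAmount("N/A");

Console.WriteLine(total.HasValue ? "Parsed a numeric total" : "Total could not be parsed");
Console.WriteLine(missing.HasValue ? "Parsed a numeric value" : "Value could not be parsed");
```

Invoices in other currencies or locales would need a different format provider, so treat the fixed "$" symbol as an assumption of this sketch.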
Analyzing Local Files (Not Just URLs)
You won't always have documents hosted at a URL. Here's how to analyze a local file from disk:
byte[] fileBytes = await File.ReadAllBytesAsync("invoice.pdf");

// Pass the file contents as bytes instead of a URL
var options = new AnalyzeDocumentOptions(
    "prebuilt-invoice",
    BinaryData.FromBytes(fileBytes)
);
var operation = await client.AnalyzeDocumentAsync(WaitUntil.Completed, options);
AnalyzeResult result = operation.Value;
// Process result same as above
Extract Tables from Documents
Documents often contain tabular data — financial statements, reports, schedules. The layout model excels at table extraction:
var options = new AnalyzeDocumentOptions(
    "prebuilt-layout",
    BinaryData.FromBytes(await File.ReadAllBytesAsync("report.pdf"))
);
var operation = await client.AnalyzeDocumentAsync(WaitUntil.Completed, options);
AnalyzeResult result = operation.Value;

if (result.Tables != null)
{
    foreach (DocumentTable table in result.Tables)
    {
        Console.WriteLine($"Table: {table.RowCount} rows x {table.ColumnCount} columns");
        // Cells arrive as a flat list; each one carries its own row and column index
        foreach (DocumentTableCell cell in table.Cells)
        {
            Console.WriteLine(
                $"  [{cell.RowIndex},{cell.ColumnIndex}] " +
                $"({cell.Kind}): {cell.Content}"
            );
        }
        Console.WriteLine();
    }
}
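Because the cells come back as a flat list, it is often convenient to rebuild them into a grid before exporting to CSV or a database. The following is a minimal sketch of that step. It uses plain tuples as stand-ins for DocumentTableCell so it runs without the SDK, and BuildGrid is a hypothetical helper name.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical helper: place each cell's text at [row, column] in a 2D array.
// The tuples stand in for DocumentTableCell (RowIndex, ColumnIndex, Content).
static string[,] BuildGrid(int rowCount, int columnCount,
    IEnumerable<(int Row, int Column, string Content)> cells)
{
    var grid = new string[rowCount, columnCount];
    foreach (var (row, column, content) in cells)
    {
        grid[row, column] = content;
    }
    return grid;
}

// Sample data shaped like a 2x2 table from the layout model.
var sampleCells = new List<(int Row, int Column, string Content)>
{
    (0, 0, "Item"), (0, 1, "Amount"),
    (1, 0, "Widget"), (1, 1, "19.99"),
};

string[,] table = BuildGrid(2, 2, sampleCells);

for (int r = 0; r < table.GetLength(0); r++)
{
    var rowValues = new List<string>();
    for (int c = 0; c < table.GetLength(1); c++)
    {
        rowValues.Add(table[r, c] ?? ""); // positions with no reported cell stay empty
    }
    Console.WriteLine(string.Join(" | ", rowValues));
}
```

With real results you would pass table.RowCount, table.ColumnCount, and a projection of table.Cells. Merged cells are reported once with a row or column span, so this simple version leaves the positions they cover empty.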
Building a Reusable Document Processing Service
In a real application, you'll want a clean service layer that wraps the SDK. Here's a pattern for invoices that you can extend with similar methods for other document types:
public class DocumentProcessingService
{
    private readonly DocumentIntelligenceClient _client;

    public DocumentProcessingService(DocumentIntelligenceClient client)
    {
        _client = client;
    }

    public async Task<InvoiceData> ExtractInvoiceAsync(Stream documentStream)
    {
        // Buffer the stream so the document can be sent as bytes
        byte[] bytes;
        using (var ms = new MemoryStream())
        {
            await documentStream.CopyToAsync(ms);
            bytes = ms.ToArray();
        }

        var options = new AnalyzeDocumentOptions(
            "prebuilt-invoice",
            BinaryData.FromBytes(bytes)
        );
        var operation = await _client.AnalyzeDocumentAsync(WaitUntil.Completed, options);
        AnalyzeResult result = operation.Value;

        var doc = result.Documents.FirstOrDefault();
        if (doc == null)
            throw new InvalidOperationException("No invoice detected in the document.");

        return new InvoiceData
        {
            VendorName = GetFieldValue(doc, "VendorName"),
            InvoiceId = GetFieldValue(doc, "InvoiceId"),
            InvoiceDate = GetFieldValue(doc, "InvoiceDate"),
            DueDate = GetFieldValue(doc, "DueDate"),
            Total = GetFieldValue(doc, "InvoiceTotal"),
            Confidence = doc.Confidence
        };
    }

    // Returns the field's raw text, or an empty string when the field is absent
    private static string GetFieldValue(AnalyzedDocument doc, string fieldName)
    {
        return doc.Fields.TryGetValue(fieldName, out DocumentField? field)
            ? field.Content ?? string.Empty
            : string.Empty;
    }
}

public record InvoiceData
{
    public string VendorName { get; init; } = "";
    public string InvoiceId { get; init; } = "";
    public string InvoiceDate { get; init; } = "";
    public string DueDate { get; init; } = "";
    public string Total { get; init; } = "";
    public double? Confidence { get; init; }
}
Register the service in your DI container in Program.cs:
builder.Services.AddSingleton(sp =>
    new DocumentIntelligenceClient(
        new Uri(builder.Configuration["Azure:DocumentIntelligence:Endpoint"]!),
        new AzureKeyCredential(builder.Configuration["Azure:DocumentIntelligence:Key"]!)
    )
);
builder.Services.AddScoped<DocumentProcessingService>();
Available Prebuilt Models
Azure AI Document Intelligence ships with several prebuilt models. Choose the one that matches your document type:
- prebuilt-read — OCR for any document; extracts raw text and detects language
- prebuilt-layout — Text, tables, selection marks, and document structure
- prebuilt-invoice — Invoices with vendor info, line items, totals
- prebuilt-receipt — Sales receipts with merchant, items, totals, tax
- prebuilt-idDocument — Passports, driver's licenses, national IDs
- prebuilt-tax.us.w2 — US W-2 tax forms
- prebuilt-healthInsuranceCard.us — US health insurance cards
- prebuilt-contract — Contracts with parties, terms, jurisdictions
- prebuilt-creditCard — Credit/debit card details
- prebuilt-bankStatement.us — US bank statements with transactions
Best Practices for Document Data Extraction in C#
1. Always Check Confidence Scores
Every extracted field includes a confidence score between 0 and 1. Set a threshold (typically 0.7–0.85) and route low-confidence results to human review rather than blindly trusting the output.
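A threshold check like this can be sketched in a few lines. The example below uses a plain dictionary of (content, confidence) pairs as a stand-in for the SDK's field collection so it runs on its own. FieldsNeedingReview and the 0.8 threshold are illustrative choices, not part of the SDK.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper: return the names of fields whose confidence is missing
// or below the threshold, so a person can check them.
static List<string> FieldsNeedingReview(
    IReadOnlyDictionary<string, (string Content, float? Confidence)> fields,
    float threshold)
{
    return fields
        .Where(f => f.Value.Confidence is null || f.Value.Confidence < threshold)
        .Select(f => f.Key)
        .OrderBy(name => name, StringComparer.Ordinal)
        .ToList();
}

// Sample values standing in for fields returned by the invoice model.
var extracted = new Dictionary<string, (string Content, float? Confidence)>
{
    ["VendorName"] = ("Contoso Ltd.", 0.97f),
    ["InvoiceId"] = ("INV-1001", 0.62f),
    ["InvoiceTotal"] = ("$1,234.56", null),
};

List<string> needsReview = FieldsNeedingReview(extracted, threshold: 0.8f);

Console.WriteLine(needsReview.Count == 0
    ? "Auto-approve"
    : $"Send to review: {string.Join(", ", needsReview)}");
```

Treating a missing confidence as "needs review" is a deliberately cautious choice here. The right threshold depends on how costly a wrong value is in your workflow, so it is worth tuning against a sample of your own documents.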
2. Use the Right Model for the Job
Don't use the generic prebuilt-layout model for invoices. The prebuilt-invoice model understands invoice-specific semantics and returns strongly-typed fields. Using the right model dramatically improves accuracy.
3. Optimize Document Quality
AI extraction quality depends on input quality. For scanned documents, ensure at least 300 DPI resolution. Avoid skewed or blurry images. The service handles some preprocessing, but clean inputs give better results.
4. Handle Long-Running Operations Properly
Document analysis is an asynchronous operation. Use WaitUntil.Completed for simple scenarios, but for production workloads, consider polling with WaitUntil.Started and implementing proper cancellation support:
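var options = new AnalyzeDocumentOptions(
    "prebuilt-invoice",
    new Uri("https://example.com/invoice.pdf")
);

// Start the analysis without waiting for it to finish
var operation = await client.AnalyzeDocumentAsync(WaitUntil.Started, options, cancellationToken);

// Poll until complete with cancellation support
await operation.WaitForCompletionAsync(cancellationToken);
AnalyzeResult result = operation.Value;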
5. Implement Retry Logic
The Azure SDK has built-in retry policies: the Azure.Core library automatically retries transient failures, including throttling (429) responses, with exponential backoff. For document processing pipelines, consider adding application-level retries on top, so a document is retried or re-queued when the SDK's own attempts are exhausted.
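One way to add that application-level layer is a small generic retry helper with exponential backoff. This is a sketch rather than a recommended policy: RetryAsync is a hypothetical helper, the fake operation stands in for a call to the service, and the delays are shortened so the demo finishes quickly.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical helper: retry an async operation with exponential backoff.
// isTransient decides which exceptions are worth retrying.
static async Task<T> RetryAsync<T>(
    Func<Task<T>> operation,
    Func<Exception, bool> isTransient,
    int maxAttempts = 3,
    TimeSpan? baseDelay = null,
    CancellationToken cancellationToken = default)
{
    TimeSpan delay = baseDelay ?? TimeSpan.FromSeconds(1);
    for (int attempt = 1; ; attempt++)
    {
        try
        {
            return await operation();
        }
        catch (Exception ex) when (attempt < maxAttempts && isTransient(ex))
        {
            await Task.Delay(delay, cancellationToken);
            delay = TimeSpan.FromTicks(delay.Ticks * 2); // double the wait each time
        }
    }
}

// Demo with a fake operation that fails twice before succeeding.
int calls = 0;
string outcome = await RetryAsync(
    () =>
    {
        calls++;
        if (calls < 3) throw new TimeoutException("simulated transient failure");
        return Task.FromResult("analysis complete");
    },
    isTransient: ex => ex is TimeoutException,
    maxAttempts: 5,
    baseDelay: TimeSpan.FromMilliseconds(10));

Console.WriteLine($"{outcome} after {calls} calls");
```

When wrapping real SDK calls, the isTransient check would typically look at the status code on the exception the SDK throws for failed requests (for example 408, 429, or 5xx). Keep the attempt count low, since the SDK has already retried before the exception reaches your code.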
Common Pitfalls to Avoid
- Exceeding page limits — The Free tier allows 500 pages/month. The Standard tier charges per page. Monitor your usage to avoid surprise bills.
- Ignoring multi-page documents — Always iterate through result.Pages. A single PDF can have dozens of pages, and data can appear on any of them.
- Hardcoding API keys — Use Azure Key Vault, environment variables, or Managed Identity. Never commit keys to source control.
- Not disposing streams — When reading files for upload, ensure proper stream disposal with using statements to avoid memory leaks in high-throughput scenarios.
- Assuming field existence — Not every document will contain every expected field. Always use TryGetValue to safely access fields instead of direct indexing.
Pricing Overview (2026)
Azure AI Document Intelligence pricing is per-page:
- Free tier (F0) — 500 pages/month, limited to 1 request per second
- Read model — Starting at $0.001 per page
- Prebuilt models — Starting at $0.01 per page
- Custom models — Starting at $0.03 per page (plus training costs)
For the latest pricing, always check the official Azure pricing page, as rates may have changed since this article was published.
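Since billing is per page, a rough monthly estimate is just pages multiplied by the per-page rate. The sketch below uses the "starting at" figures listed above as assumed rates; EstimateMonthlyCost is a hypothetical helper and the 20,000-page volume is only an example.

```csharp
using System;

// Per-page rates (USD) taken from the "starting at" figures in this section.
// Assumptions for illustration only; check the pricing page for current values.
const decimal ReadRatePerPage = 0.001m;
const decimal PrebuiltRatePerPage = 0.01m;
const decimal CustomRatePerPage = 0.03m;

// Hypothetical helper: estimated monthly cost for a given page volume.
static decimal EstimateMonthlyCost(int pagesPerMonth, decimal ratePerPage)
    => pagesPerMonth * ratePerPage;

int pages = 20_000;

Console.WriteLine($"Read model:      {EstimateMonthlyCost(pages, ReadRatePerPage):0.00} USD");
Console.WriteLine($"Prebuilt models: {EstimateMonthlyCost(pages, PrebuiltRatePerPage):0.00} USD");
Console.WriteLine($"Custom models:   {EstimateMonthlyCost(pages, CustomRatePerPage):0.00} USD");
```

At these assumed rates, 20,000 pages a month works out to about 20 USD for the read model, 200 USD for prebuilt models, and 600 USD for custom models, before any training costs or volume discounts.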
Conclusion
Azure AI Document Intelligence with C# makes it straightforward to automate document data extraction at scale. Whether you're processing invoices, receipts, IDs, or custom forms, the SDK provides a clean, async API that fits naturally into .NET applications.
Key takeaways:
- Use prebuilt models first — they cover the most common document types with zero training
- Always check confidence scores and route uncertain extractions to human review
- Wrap the SDK in a service class for clean separation and testability
- Use Managed Identity in production instead of API keys
- Start with the Free tier (500 pages/month) to evaluate before committing to a paid plan
The combination of Azure's pre-trained AI and C#'s strong typing makes document intelligence one of the most practical AI integrations you can add to a .NET application today. Set up the resource, install the NuGet package, and start extracting structured data in minutes.