Document Processing

This section covers all APIs related to document processing, including uploading documents, retrieving results, and managing processed documents.

Document Schema

Before diving into the APIs, here's the common schema for a document and its processing results:

{
  "document_id": "550e8400-e29b-41d4-a716-446655440000",
  "status": "string", // "success", "pending", "failure"
  "uploaded_at": "string", // ISO 8601 timestamp
  "metadata": "string | object", // Optional metadata attached during upload
  "original_document_name": "string", // Original filename or URL of the document
  "raw_document_url": "string", // URL to access the original document
  "verification_status": "string", // "success", "failed"
  "verification_stage": "string", // stage_id where document is flagged for verification
  "verification_message": "string", // Optional message explaining verification failure
  "assigned_reviewers": ["string"], // List of email addresses of assigned reviewers
  "pages": [
    {
      "page_id": "550e8400-e29b-41d4-a716-446655440001",
      "page_number": 1,
      "image_url": "string", // URL to access the page image
      "data": {
        "fields": {
          "invoice_number": [ // Field names must be alphanumeric with underscores only
            {
              "field_data_id": "f1a2b3c4-d5e6-4f7g-8h9i-j0k1l2m3n4o5", // Unique identifier for the field value
              "value": "string",
              "confidence": 0.98,
              "bbox": [100, 200, 300, 250], // [x1, y1, x2, y2]
              "verification_status": "string", // "success", "failed"
              "verification_message": "string", // Optional message explaining verification failure
              "is_moderated": false // Indicates if the field value was manually corrected by a reviewer
            }
          ]
        },
        "tables": [
          {
            "table_id": "string",
            "bbox": [100, 300, 800, 600],
            "cells": [
              {
                "cell_id": "string",
                "row": 0,
                "col": 0,
                "header": "item_description", // Table header names must be alphanumeric with underscores only
                "text": "string",
                "bbox": [100, 330, 300, 360],
                "verification_status": "string", // "success", "failed"
                "verification_message": "string", // Optional message explaining verification failure
                "is_moderated": false // Indicates if the cell value was manually corrected by a reviewer
              }
            ]
          }
        ]
      }
    }
  ]
}

Schema Components

  1. Document Information

    • document_id: Unique identifier for the document
    • status: Current processing status
    • uploaded_at: Timestamp of document upload
    • metadata: Optional data attached during upload
    • original_document_name: Original filename or URL of the document
    • raw_document_url: URL to access the original document
    • verification_status: Overall document verification status
    • verification_stage: stage_id where document is flagged for verification
    • verification_message: Optional message explaining verification failure
    • assigned_reviewers: List of email addresses of assigned reviewers
  2. Page Information

    • page_id: Unique identifier for the page
    • page_number: Sequential page number
    • image_url: URL to access the page image
  3. Extracted Data

    • fields: Key-value pairs of extracted information
      • Field names must be alphanumeric with underscores only
      • Field names must be unique within a workflow
      • Each field can have multiple values with confidence scores
      • Each field value has a unique field_data_id for tracking moderation
      • Includes bounding box coordinates for each value
      • Includes verification status and moderation flag
      • is_moderated: Indicates if the field value was manually corrected by a reviewer
      • Bounding box format: [x1, y1, x2, y2] where:
        • x1, y1: Top-left corner coordinates
        • x2, y2: Bottom-right corner coordinates
        • Coordinates are in pixels from the top-left of the page
        • Example: [100, 200, 300, 250] represents a box starting at (100,200) and ending at (300,250)
    • tables: Extracted tabular data
      • Each table has a unique ID and bounding box
      • Table header names must be alphanumeric with underscores only
      • Table header names must be unique within a workflow
      • Cells include row/column position and header information
      • Each cell has a unique cell_id for tracking moderation
      • Each cell has its own bounding box coordinates
      • Includes verification status and moderation flag
      • is_moderated: Indicates if the cell value was manually corrected by a reviewer
      • Bounding boxes follow the same format as field values
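
As an illustration of how the schema nests, the following is a minimal sketch that walks a document dictionary and reports each extracted field value with its bounding box size. The helper names `bbox_size` and `iter_field_values` and the `sample` document are hypothetical and not part of the API; the sketch assumes the response has already been parsed into a Python dict.

```python
def bbox_size(bbox):
    """Return (width, height) in pixels for a [x1, y1, x2, y2] box."""
    x1, y1, x2, y2 = bbox
    return x2 - x1, y2 - y1


def iter_field_values(document):
    """Yield (page_number, field_name, value_dict) for every extracted value."""
    for page in document.get("pages") or []:
        data = page.get("data") or {}
        # "fields" can be null, e.g. in list responses, so fall back to {}
        fields = data.get("fields") or {}
        for name, values in fields.items():
            for value in values:
                yield page["page_number"], name, value


# Hypothetical sample shaped like the schema above
sample = {
    "document_id": "550e8400-e29b-41d4-a716-446655440000",
    "pages": [
        {
            "page_number": 1,
            "data": {
                "fields": {
                    "invoice_number": [
                        {
                            "value": "INV-2024-001",
                            "confidence": 0.98,
                            "bbox": [100, 200, 300, 250],
                        }
                    ]
                },
                "tables": [],
            },
        }
    ],
}

for page_number, name, value in iter_field_values(sample):
    print(page_number, name, value["value"], bbox_size(value["bbox"]))
```

Because each field maps to a list, the inner loop visits every candidate value rather than assuming a single one per field.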

Upload Document for Processing

Upload a document for processing. The response structure depends on the workflow's processing_type setting and whether the upload is synchronous or asynchronous.

POST /api/v4/workflows/{workflow_id}/documents

Overview

  • Upload documents for processing in a workflow
  • Supports both file upload and URL-based processing
  • Can process documents synchronously or asynchronously
  • Returns immediate results for sync processing
  • Returns document ID for async processing
  • Supports metadata attachment for better document tracking
  • Results include extracted fields, tables, and page information

Supported File Types

  • Images: JPG, JPEG, PNG, TIFF
  • Documents: PDF
  • Spreadsheets: XLS, XLSX
  • Maximum file size: 20MB
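
A client can check these constraints locally before uploading. The sketch below is one way to do that; `check_upload` is a hypothetical helper, and it assumes that 20MB means 20 × 1024 × 1024 bytes and that the file type is judged by extension only.

```python
import os

# Extensions taken from the supported file types listed above
ALLOWED_EXTENSIONS = {".jpg", ".jpeg", ".png", ".tiff", ".pdf", ".xls", ".xlsx"}

# Assumption: the 20MB limit is interpreted as 20 MiB
MAX_FILE_SIZE_BYTES = 20 * 1024 * 1024


def check_upload(filename, size_bytes):
    """Return a list of problems; an empty list means the file looks acceptable."""
    problems = []
    extension = os.path.splitext(filename)[1].lower()
    if extension not in ALLOWED_EXTENSIONS:
        problems.append(f"unsupported file type: {extension or '(none)'}")
    if size_bytes > MAX_FILE_SIZE_BYTES:
        problems.append("file larger than 20MB")
    return problems


print(check_upload("invoice.pdf", 1_000_000))
print(check_upload("notes.docx", 10))
```

For a file on disk, `os.path.getsize(path)` supplies the size argument. The server remains the authority, so a request that passes this check can still be rejected.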

Processing Limits

Rate Limits

  • Maximum processing rate: 75 pages per minute
  • Applies to both synchronous and asynchronous processing
  • Exceeding the rate limit will result in a 429 (Too Many Requests) response
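
One common way to cope with 429 responses is to retry with exponential backoff. The sketch below assumes that approach; the helper name `send_with_backoff`, the delay schedule, and the attempt count are illustrative choices, not values documented by the API. The `send` argument is any zero-argument callable that returns an object with a `status_code` attribute, such as a `requests.Response`.

```python
import time


def send_with_backoff(send, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call send() until it returns a non-429 response or attempts run out.

    Returns the last response received, which may still be a 429.
    """
    response = None
    for attempt in range(max_attempts):
        response = send()
        if response.status_code != 429:
            return response
        # Wait 1s, 2s, 4s, ... before the next attempt (none after the last)
        if attempt < max_attempts - 1:
            sleep(base_delay * 2 ** attempt)
    return response
```

With the upload example from this page, `send` could be `lambda: requests.post(url, json=json_data, auth=HTTPBasicAuth(API_KEY, ''))`. For file uploads, reopen or rewind the file inside `send` so each attempt sends the full content.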

Document Size Limits

  • Maximum pages per document: 500 pages
  • Documents exceeding 500 pages will be rejected with a 400 (Bad Request) response

Processing Mode Rules

  • Documents with 3 or fewer pages: Can be processed synchronously or asynchronously
  • Documents with more than 3 pages: Will be automatically converted to asynchronous processing
  • When converted to async, the response will include status: "processing" and a document_id
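
The rules above can be expressed as a small client-side function that predicts how a document will be handled. This is only a sketch: `effective_mode` is a hypothetical helper that mirrors the documented limits, and the server makes the actual decision.

```python
MAX_PAGES = 500        # documents above this are rejected with 400
SYNC_PAGE_LIMIT = 3    # documents above this are always processed asynchronously


def effective_mode(page_count, async_requested=False):
    """Return "sync" or "async" for a document, or raise if it is too long."""
    if page_count > MAX_PAGES:
        raise ValueError(f"{page_count} pages exceeds the {MAX_PAGES}-page limit")
    if async_requested or page_count > SYNC_PAGE_LIMIT:
        return "async"
    return "sync"


print(effective_mode(3))        # sync
print(effective_mode(4))        # async, converted automatically
print(effective_mode(2, True))  # async, as requested
```

Knowing the mode in advance tells the caller whether to expect full results in the upload response or to fetch them later by document ID.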

Request

The request can be sent in two ways:

  1. Using multipart/form-data with the following fields:

    • file: The document file to process (required if document_url is not provided)
    • async: Boolean value indicating whether to process asynchronously (optional, default: false)
    • metadata: Any string or JSON that will be attached to the document (optional)
  2. Using application/json with the following fields:

    • document_url: URL of the document to process (required if file is not provided)
    • async: Boolean value indicating whether to process asynchronously (optional, default: false)
    • metadata: Any string or JSON that will be attached to the document (optional)

Example

import requests
from requests.auth import HTTPBasicAuth

API_KEY = 'YOUR_API_KEY'
WORKFLOW_ID = '550e8400-e29b-41d4-a716-446655440000'
url = f"https://app.nanonets.com/api/v4/workflows/{WORKFLOW_ID}/documents"

# Method 1: Upload a file as multipart/form-data, with metadata as a JSON string
data = {
    'async': 'false',
    'metadata': '{"customer_id": "12345", "document_type": "invoice", "department": "finance"}'
}

# Open the file in a with-block so the handle is closed after the request
with open('invoice.pdf', 'rb') as document:
    files = {
        'file': ('invoice.pdf', document, 'application/pdf')
    }
    response = requests.post(
        url,
        files=files,
        data=data,
        auth=HTTPBasicAuth(API_KEY, '')
    )
print(response.json())

# Method 2: Process a document from a URL, sent as an application/json body
json_data = {
    'document_url': 'https://example.com/invoice.pdf',
    'async': 'false',
    'metadata': '{"customer_id": "12345", "document_type": "invoice", "department": "finance"}'
}

response = requests.post(
    url,
    json=json_data,
    auth=HTTPBasicAuth(API_KEY, '')
)
print(response.json())

Synchronous Response (async: false)

  • Returns data organized by page
  • Each page contains its own fields and tables
  • Better for multi-page documents
  • Allows page-by-page processing and verification

{
  "document_id": "550e8400-e29b-41d4-a716-446655440000",
  "status": "success",
  "uploaded_at": "2024-03-14T12:00:00Z",
  "metadata": {
    "customer_id": "12345",
    "document_type": "invoice"
  },
  "original_document_name": "invoice_2024_001.pdf",
  "raw_document_url": "https://storage.nanonets.com/documents/550e8400-e29b-41d4-a716-446655440000.pdf",
  "verification_status": "success",
  "verification_stage": "stage_123",
  "verification_message": "",
  "assigned_reviewers": ["john.doe@example.com", "jane.smith@example.com"],
  "pages": [
    {
      "page_id": "550e8400-e29b-41d4-a716-446655440001",
      "page_number": 1,
      "image_url": "https://storage.nanonets.com/pages/550e8400-e29b-41d4-a716-446655440001.jpg",
      "data": {
        "fields": {
          "invoice_number": [
            {
              "field_data_id": "f1a2b3c4-d5e6-4f7g-8h9i-j0k1l2m3n4o5",
              "value": "INV-2024-001",
              "confidence": 0.98,
              "bbox": [100, 200, 300, 250],
              "verification_status": "success",
              "verification_message": "",
              "is_moderated": false
            }
          ]
        },
        "tables": [
          {
            "table_id": "d8e5c1d2-4e71-4d0e-babc-a845f2de4f1b",
            "bbox": [100, 300, 800, 600],
            "cells": [
              {
                "cell_id": "1b5d3df7-3df7-420a-a82b-29dbdfd3e1b1",
                "row": 0,
                "col": 0,
                "header": "item_description",
                "text": "Product A",
                "bbox": [100, 330, 300, 360],
                "verification_status": "success",
                "verification_message": "",
                "is_moderated": false
              }
            ]
          }
        ]
      }
    }
  ]
}

Asynchronous Response (async: true)

  • Returns immediately with document ID and status
  • Use the document ID to check processing status
  • Best for large documents or batch processing
  • Reduces timeout issues for complex documents

{
  "document_id": "550e8400-e29b-41d4-a716-446655440000",
  "status": "pending",
  "uploaded_at": "2024-03-14T12:00:00Z",
  "metadata": {
    "customer_id": "12345",
    "document_type": "invoice"
  },
  "original_document_name": "invoice_2024_001.pdf",
  "verification_status": "success",
  "verification_stage": "stage_123",
  "verification_message": "",
  "assigned_reviewers": ["john.doe@example.com", "jane.smith@example.com"]
}
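
After an asynchronous upload, the caller typically polls until processing finishes. The sketch below shows one way to structure that loop. `wait_for_document` is a hypothetical helper, the polling interval and timeout are arbitrary defaults, and it treats both "pending" and "processing" as in-progress because both values appear on this page. The `fetch` argument is any callable that takes a document ID and returns the parsed JSON document.

```python
import time

# Status values that this page uses for documents that are not finished yet
IN_PROGRESS = {"pending", "processing"}


def wait_for_document(fetch, document_id, interval=2.0, timeout=300.0,
                      sleep=time.sleep, clock=time.monotonic):
    """Poll fetch(document_id) until the document leaves an in-progress status."""
    deadline = clock() + timeout
    while True:
        document = fetch(document_id)
        if document.get("status") not in IN_PROGRESS:
            return document  # "success" or "failure"
        if clock() >= deadline:
            raise TimeoutError(f"document {document_id} still in progress")
        sleep(interval)
```

In practice `fetch` would wrap a GET request to /api/v4/workflows/{workflow_id}/documents/{document_id} and return `response.json()`. Check the returned status, since the loop also ends on "failure".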

List Documents

Retrieve a paginated list of all documents processed by a workflow. Results are always sorted by upload time in descending order (newest first).

GET /api/v4/workflows/{workflow_id}/documents

Overview

  • Lists all documents processed by a workflow
  • Supports pagination for large document sets
  • Includes document status and metadata
  • Returns basic page information
  • Useful for monitoring and management
  • Can be used to track processing status
  • Note: This endpoint returns only basic document information. To get the detailed extracted data (fields and tables), use the Get Document Data API

Query Parameters

  • page: Page number for pagination (default: 1)
  • limit: Number of documents per page (default: 10, max: 100)

Response

{
  "documents": [
    {
      "document_id": "db237bf3-f3c1-4441-905e-a6b7538db269",
      "status": "pending",
      "uploaded_at": "2025-05-23T09:38:46.583415Z",
      "metadata": "{\"customer_id\": \"12345\", \"document_type\": \"invoice\", \"department\": \"finance\"}",
      "original_document_name": "invoice.pdf",
      "raw_document_url": "uploadedfiles/37b9d483-d43f-4d0c-80ef-e0ac463055ba/RawPredictions/db237bf3-f3c1-4441-905e-a6b7538db269.pdf",
      "verification_status": "success",
      "verification_stage": "00000000-0000-0000-0000-000000000000",
      "assigned_reviewers": [],
      "pages": [
        {
          "page_id": "b94d9ca6-37b9-11f0-a23c-367dda7a627d",
          "page_number": 0,
          "image_url": "",
          "data": {
            "fields": null,
            "tables": null
          }
        }
      ]
    },
    {
      "document_id": "00000000-0000-0000-0000-000000000000",
      "status": "success",
      "uploaded_at": "2025-05-23T09:14:53.330939Z",
      "original_document_name": "invoice.pdf",
      "raw_document_url": "uploadedfiles/37b9d483-d43f-4d0c-80ef-e0ac463055ba/RawPredictions/00000000-0000-0000-0000-000000000000.pdf",
      "verification_status": "success",
      "verification_stage": "ffffffff-ffff-ffff-ffff-ffffffffffff",
      "assigned_reviewers": [],
      "pages": [
        {
          "page_id": "6304a3ce-37b6-11f0-a24d-367dda7a627d",
          "page_number": 0,
          "image_url": "",
          "data": {
            "fields": null,
            "tables": null
          }
        }
      ]
    }
  ],
  "total_count": 2,
  "page_no": 1,
  "page_size": 50
}

Example

import requests
from requests.auth import HTTPBasicAuth

API_KEY = 'YOUR_API_KEY'
WORKFLOW_ID = '550e8400-e29b-41d4-a716-446655440000'
url = f"https://app.nanonets.com/api/v4/workflows/{WORKFLOW_ID}/documents"

# Get first page with default limit (10)
response = requests.get(
    url,
    auth=HTTPBasicAuth(API_KEY, '')
)
print(response.json())

# Get specific page with custom limit
params = {
    'page': 2,
    'limit': 20
}
response = requests.get(
    url,
    params=params,
    auth=HTTPBasicAuth(API_KEY, '')
)
print(response.json())
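
To read every document rather than a single page, a client can keep requesting pages until it has seen `total_count` documents. The sketch below assumes that stopping rule; `iter_documents` is a hypothetical helper, and `fetch_page` is any callable that takes `(page, limit)` and returns the parsed JSON body shown in the Response section.

```python
def iter_documents(fetch_page, limit=100):
    """Yield every document in a workflow, one page request at a time."""
    page = 1
    seen = 0
    while True:
        body = fetch_page(page, limit)
        documents = body.get("documents") or []
        for document in documents:
            yield document
        seen += len(documents)
        # Stop on an empty page or once total_count documents were returned
        if not documents or seen >= body.get("total_count", 0):
            break
        page += 1
```

With the example above, `fetch_page` could be `lambda page, limit: requests.get(url, params={'page': page, 'limit': limit}, auth=HTTPBasicAuth(API_KEY, '')).json()`. The empty-page check guards against looping forever if documents are deleted while iterating.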

Get Document Data

Retrieve the processing results for a specific document.

GET /api/v4/workflows/{workflow_id}/documents/{document_id}

Overview

  • Retrieves complete processing results for a document
  • Returns the same structure as the synchronous upload response
  • Includes all extracted fields and tables
  • Contains page information and image URLs
  • Useful for retrieving results of async processing
  • Can be used to verify processing results

Response

Same structure as the synchronous response from the upload endpoint.

Example

import requests
from requests.auth import HTTPBasicAuth

API_KEY = 'YOUR_API_KEY'
WORKFLOW_ID = '550e8400-e29b-41d4-a716-446655440000'
DOCUMENT_ID = '550e8400-e29b-41d4-a716-446655440001'
url = f"https://app.nanonets.com/api/v4/workflows/{WORKFLOW_ID}/documents/{DOCUMENT_ID}"

response = requests.get(
    url,
    auth=HTTPBasicAuth(API_KEY, '')
)
print(response.json())

Get Page Data

Retrieve the processing results for a specific page of a document.

GET /api/v4/workflows/{workflow_id}/documents/{document_id}/pages/{page_id}

Overview

  • Retrieves data for a specific page
  • Useful for multi-page documents
  • Returns page-specific fields and tables
  • Includes page image URL and dimensions
  • Can be used for page-level verification
  • Helps in handling large documents efficiently

Response

{
  "page_id": "550e8400-e29b-41d4-a716-446655440001",
  "page_number": 1,
  "image_url": "https://storage.nanonets.com/pages/550e8400-e29b-41d4-a716-446655440001.jpg",
  "data": {
    "fields": {
      "invoice_number": [
        {
          "field_data_id": "f1a2b3c4-d5e6-4f7g-8h9i-j0k1l2m3n4o5",
          "value": "INV-2024-001",
          "confidence": 0.98,
          "bbox": [100, 200, 300, 250],
          "verification_status": "success",
          "verification_message": "",
          "is_moderated": false
        }
      ],
      "total_amount": [
        {
          "field_data_id": "f1a2b3c4-d5e6-4f7g-8h9i-j0k1l2m3n4o5",
          "value": "1500.00",
          "confidence": 0.95,
          "bbox": [400, 200, 500, 250],
          "verification_status": "success",
          "verification_message": "",
          "is_moderated": false
        }
      ]
    },
    "tables": [
      {
        "table_id": "d8e5c1d2-4e71-4d0e-babc-a845f2de4f1b",
        "bbox": [100, 300, 800, 600],
        "cells": [
          {
            "cell_id": "1b5d3df7-3df7-420a-a82b-29dbdfd3e1b1",
            "row": 0,
            "col": 0,
            "header": "item_description",
            "text": "Product A",
            "bbox": [100, 330, 300, 360],
            "verification_status": "success",
            "verification_message": "",
            "is_moderated": false
          },
          {
            "cell_id": "43bd4a61-0131-47b9-9015-4df4b62d4531",
            "row": 0,
            "col": 1,
            "header": "quantity",
            "text": "2",
            "bbox": [350, 330, 450, 360],
            "verification_status": "success",
            "verification_message": "",
            "is_moderated": false
          }
        ]
      }
    ]
  }
}
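
Tables arrive as a flat list of cells, so a client usually regroups them into rows. The following is a minimal sketch of that step; `table_to_rows` is a hypothetical helper, and it assumes each header appears at most once per row.

```python
def table_to_rows(table):
    """Return one {header: text} dict per table row, ordered by row index."""
    rows = {}
    for cell in table.get("cells") or []:
        # Group cells that share a row index, keyed by their column header
        rows.setdefault(cell["row"], {})[cell["header"]] = cell["text"]
    return [rows[index] for index in sorted(rows)]


# Cells reduced to the keys the helper reads, taken from the response above
table = {
    "table_id": "d8e5c1d2-4e71-4d0e-babc-a845f2de4f1b",
    "cells": [
        {"row": 0, "col": 0, "header": "item_description", "text": "Product A"},
        {"row": 0, "col": 1, "header": "quantity", "text": "2"},
    ],
}
print(table_to_rows(table))
```

The result is a list of dicts that can be passed directly to `csv.DictWriter` or a dataframe constructor.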

Example

import requests
from requests.auth import HTTPBasicAuth

API_KEY = 'YOUR_API_KEY'
WORKFLOW_ID = '550e8400-e29b-41d4-a716-446655440000'
DOCUMENT_ID = '550e8400-e29b-41d4-a716-446655440001'
PAGE_ID = '550e8400-e29b-41d4-a716-446655440002'
url = f"https://app.nanonets.com/api/v4/workflows/{WORKFLOW_ID}/documents/{DOCUMENT_ID}/pages/{PAGE_ID}"

response = requests.get(
    url,
    auth=HTTPBasicAuth(API_KEY, '')
)
print(response.json())

Delete Document

Delete a processed document and its associated data.

DELETE /api/v4/workflows/{workflow_id}/documents/{document_id}

Overview

  • Permanently removes a document and its data
  • Cannot be undone
  • Frees up storage space
  • Useful for data cleanup
  • Should be used with caution
  • Consider implementing a retention policy

Response

{
  "message": "Document deleted successfully"
}

Example

import requests
from requests.auth import HTTPBasicAuth

API_KEY = 'YOUR_API_KEY'
WORKFLOW_ID = '550e8400-e29b-41d4-a716-446655440000'
DOCUMENT_ID = '550e8400-e29b-41d4-a716-446655440001'
url = f"https://app.nanonets.com/api/v4/workflows/{WORKFLOW_ID}/documents/{DOCUMENT_ID}"

response = requests.delete(
    url,
    auth=HTTPBasicAuth(API_KEY, '')
)
print(response.json())

Error Handling

All document processing APIs return standard HTTP status codes:

  • 200 OK: Request successful
  • 201 Created: Document uploaded successfully
  • 400 Bad Request: Invalid request parameters, unsupported file type, or a document longer than 500 pages
  • 401 Unauthorized: Invalid or missing API key
  • 404 Not Found: Workflow or document not found
  • 413 Payload Too Large: File size exceeds limit
  • 429 Too Many Requests: Processing rate limit exceeded
  • 500 Internal Server Error: Server-side error
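
A client can turn these codes into a simple decision about what to do next. The sketch below is one possible mapping; `classify_status` is a hypothetical helper, 429 is included because the rate-limit rules on this page mention it, and treating 429 and 500 as the only retryable codes is an assumption rather than documented behaviour.

```python
# Status codes described on this page
STATUS_MEANINGS = {
    200: "Request successful",
    201: "Document uploaded successfully",
    400: "Invalid request parameters or unsupported file type",
    401: "Invalid or missing API key",
    404: "Workflow or document not found",
    413: "File size exceeds limit",
    429: "Rate limit exceeded",
    500: "Server-side error",
}

# Assumption: only these codes are worth retrying
RETRYABLE = {429, 500}


def classify_status(status_code):
    """Return (action, description), where action is "ok", "retry" or "fail"."""
    meaning = STATUS_MEANINGS.get(status_code, "Unexpected status code")
    if status_code in (200, 201):
        return "ok", meaning
    if status_code in RETRYABLE:
        return "retry", meaning
    return "fail", meaning


print(classify_status(201))
print(classify_status(429))
print(classify_status(404))
```

Keeping the mapping in one table makes it easy to adjust once the retry behaviour for a given deployment is known.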

Common Error Scenarios

  1. File Upload Issues

    • Unsupported file type
    • File too large (>20MB)
    • Corrupted file
  2. Processing Errors

    • Document processing timeout
    • Unreadable content
    • Processing failure
  3. Field and Table Header Issues

    • Invalid field or table header names (non-alphanumeric characters)
    • Duplicate field names within a workflow
    • Duplicate table header names within a workflow

For detailed error handling, refer to the Error Handling Guide.

Best Practices

  1. File Upload

    • Use async mode for large files or batch processing
    • Include relevant metadata for better tracking
    • Validate file types before upload
  2. Result Processing

    • Check confidence scores before using extracted data
    • Handle both sync and async responses appropriately
    • Implement retry logic for failed processing
  3. Resource Management

    • Delete processed documents when no longer needed
    • Monitor storage usage
    • Implement document retention policies
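
As an example of checking confidence scores before using extracted data, the sketch below keeps the highest-confidence value for each field and marks fields that fall below a threshold for review. `best_values` is a hypothetical helper and the 0.9 default threshold is an arbitrary illustration; choose a value that suits your documents.

```python
def best_values(page, min_confidence=0.9):
    """Return {field_name: value}, using None where confidence is too low."""
    result = {}
    fields = (page.get("data") or {}).get("fields") or {}
    for name, candidates in fields.items():
        if not candidates:
            continue
        # Each field can have several values; keep the most confident one
        top = max(candidates, key=lambda candidate: candidate["confidence"])
        result[name] = top["value"] if top["confidence"] >= min_confidence else None
    return result


# Hypothetical page shaped like the Get Page Data response
page = {
    "data": {
        "fields": {
            "invoice_number": [{"value": "INV-2024-001", "confidence": 0.98}],
            "total_amount": [{"value": "1500.00", "confidence": 0.95}],
        }
    }
}
print(best_values(page, min_confidence=0.97))
```

Fields mapped to None can be routed to a reviewer instead of being used automatically.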

For more best practices, refer to the Best Practices Guide.