Document Processing

The SDK methods in this section upload, retrieve, list, and delete documents in a workflow, and read back their extracted fields, tables, and original files. They are fully aligned with the Nanonets API.

Document Schema

The document object returned by the API contains the following fields:

{
    "document_id": "550e8400-e29b-41d4-a716-446655440000",
    "status": "string",  # "success", "pending", "failure"
    "uploaded_at": "string",  # ISO 8601 timestamp
    "metadata": "string | object",  # Optional metadata attached during upload
    "original_document_name": "string",  # Original filename or URL of the document
    "raw_document_url": "string",  # URL to access the original document
    "verification_status": "string",  # "success", "failed"
    "verification_stage": "string",  # stage_id where document is flagged for verification
    "verification_message": "string",  # Optional message explaining verification failure
    "assigned_reviewers": ["string"],  # List of email addresses of assigned reviewers
    "pages": [
        {
            "page_id": "550e8400-e29b-41d4-a716-446655440001",
            "page_number": 1,
            "image_url": "string",  # URL to access the page image
            "data": {
                "fields": {
                    "invoice_number": [
                        {
                            "field_data_id": "f1a2b3c4-d5e6-4f7g-8h9i-j0k1l2m3n4o5",
                            "value": "string",
                            "confidence": 0.98,
                            "bbox": [100, 200, 300, 250],
                            "verification_status": "string",
                            "verification_message": "string",
                            "is_moderated": False
                        }
                    ]
                },
                "tables": [
                    {
                        "table_id": "string",
                        "bbox": [100, 300, 800, 600],
                        "cells": [
                            {
                                "cell_id": "string",
                                "row": 0,
                                "col": 0,
                                "header": "item_description",
                                "text": "string",
                                "bbox": [100, 330, 300, 360],
                                "verification_status": "string",
                                "verification_message": "string",
                                "is_moderated": False
                            }
                        ]
                    }
                ]
            }
        }
    ]
}
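As an illustration of how the schema nests (pages, then data, then fields, then a list of predictions per field), the following sketch collects the field values whose confidence meets a threshold. It assumes the document is available as a plain Python dict shaped like the schema above; the helper name confident_fields, the 0.9 default threshold, and the sample values are hypothetical and not part of the SDK.

```python
from typing import Any, Dict, List


def confident_fields(
    document: Dict[str, Any], min_confidence: float = 0.9
) -> Dict[str, List[str]]:
    """Collect field values with confidence >= min_confidence, keyed by field name.

    Hypothetical helper: assumes `document` is a dict shaped like the schema above.
    """
    accepted: Dict[str, List[str]] = {}
    for page in document.get("pages", []):
        fields = page.get("data", {}).get("fields", {})
        for name, predictions in fields.items():
            for prediction in predictions:
                if prediction.get("confidence", 0.0) >= min_confidence:
                    accepted.setdefault(name, []).append(prediction["value"])
    return accepted


# Made-up sample data that follows the schema (only the keys the helper reads).
sample = {
    "status": "success",
    "pages": [
        {
            "page_number": 1,
            "data": {
                "fields": {
                    "invoice_number": [{"value": "INV-001", "confidence": 0.98}],
                    "total": [{"value": "100.00", "confidence": 0.42}],
                },
                "tables": [],
            },
        }
    ],
}

print(confident_fields(sample))  # {'invoice_number': ['INV-001']}
```

Low-confidence values such as the "total" prediction above are dropped rather than corrected; in practice they might be routed to manual review instead.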

Upload Document

Uploads a document to a specific workflow for processing. The document can come from a local file (file_path) or a URL (document_url), with options for asynchronous processing (async_mode) and attached metadata (metadata).

from nanonets import NanonetsClient

client = NanonetsClient(api_key='your_api_key')

# Upload document from file
result = client.workflows.upload_document(
    workflow_id="workflow_123",
    file_path="path/to/document.pdf",
    async_mode=False,  # Set to True for asynchronous processing
    metadata={
        "customer_id": "12345",
        "document_type": "invoice",
        "department": "finance"
    }
)

# Upload document from URL
result = client.workflows.upload_document(
    workflow_id="workflow_123",
    document_url="https://example.com/invoice.pdf",
    async_mode=False,
    metadata={
        "customer_id": "12345",
        "document_type": "invoice",
        "department": "finance"
    }
)

Get Document Status

Retrieves the current processing status and results of a specific document.

document = client.workflows.get_document(
    workflow_id="workflow_123",
    document_id="document_123"
)
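When a document is uploaded with async_mode=True, its status may still be "pending" when first fetched. A minimal polling sketch is shown below. It assumes the fetched document is dict-like with a "status" key as in the schema; the helper name wait_for_document, its fetch callable, and the interval and timeout defaults are hypothetical and not part of the SDK.

```python
import time
from typing import Any, Callable, Dict


def wait_for_document(
    fetch: Callable[[], Dict[str, Any]],
    interval: float = 2.0,
    timeout: float = 120.0,
) -> Dict[str, Any]:
    """Call `fetch` until the document's status is no longer "pending".

    Hypothetical helper: `fetch` is any zero-argument callable that returns
    the document as a dict. Raises TimeoutError if the document is still
    pending once `timeout` seconds have elapsed.
    """
    deadline = time.monotonic() + timeout
    while True:
        document = fetch()
        if document.get("status") != "pending":
            return document  # "success" or "failure"
        if time.monotonic() >= deadline:
            raise TimeoutError("document is still pending")
        time.sleep(interval)
```

With the client from the upload example, the callable could be lambda: client.workflows.get_document(workflow_id="workflow_123", document_id="document_123"), provided that call returns a dict-like object. Check the returned status afterwards, since "failure" also ends the wait.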

List Documents

Retrieves a paginated list of the documents in a specific workflow.

documents = client.workflows.list_documents(
    workflow_id="workflow_123",
    page=1,  # Page number for pagination
    limit=10  # Number of documents per page
)
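Because results are returned one page at a time, reading every document means requesting successive pages. The sketch below shows one way to do that, under two assumptions that the text above does not confirm: each call returns a list of documents, and a page shorter than limit marks the end. The helper name iter_documents and its list_page callable are hypothetical.

```python
from typing import Any, Callable, Iterator, List


def iter_documents(
    list_page: Callable[[int, int], List[Any]], limit: int = 10
) -> Iterator[Any]:
    """Yield documents page by page until a short (or empty) page is returned.

    Hypothetical helper: `list_page(page, limit)` must return one page of
    results as a list. Page numbers start at 1, as in the example above.
    """
    page = 1
    while True:
        batch = list_page(page, limit)
        yield from batch
        if len(batch) < limit:
            return  # Assumed end-of-results signal
        page += 1
```

With the client above, list_page could be lambda page, limit: client.workflows.list_documents(workflow_id="workflow_123", page=page, limit=limit), adapted if the actual response wraps the list in another object.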

Delete Document

Removes a document from the workflow.

client.workflows.delete_document(
    workflow_id="workflow_123",
    document_id="document_123"
)

Get Document Fields

Retrieves all extracted fields from a specific document.

fields = client.workflows.get_document_fields(
    workflow_id="workflow_123",
    document_id="document_123"
)

Get Document Tables

Retrieves all extracted tables from a specific document.

tables = client.workflows.get_document_tables(
    workflow_id="workflow_123",
    document_id="document_123"
)
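Tables in the document schema are stored as a flat list of cells, each carrying its row, col, header, and text. To work with a table row by row, the cells can be regrouped as in the sketch below. It assumes each table is a dict shaped like the "tables" entries in the document schema; the helper name table_to_rows and the sample values are hypothetical.

```python
from typing import Any, Dict, List


def table_to_rows(table: Dict[str, Any]) -> List[Dict[str, str]]:
    """Group a table's flat cell list into one {header: text} dict per row.

    Hypothetical helper: assumes `table` has a "cells" list whose items
    carry "row", "col", "header", and "text" keys, as in the schema.
    """
    rows: Dict[int, Dict[str, str]] = {}
    # Sort by (row, col) so each row dict is filled in column order.
    for cell in sorted(table.get("cells", []), key=lambda c: (c["row"], c["col"])):
        rows.setdefault(cell["row"], {})[cell["header"]] = cell["text"]
    return [rows[index] for index in sorted(rows)]


# Made-up sample table with two rows and two columns.
sample_table = {
    "table_id": "t1",
    "cells": [
        {"row": 1, "col": 1, "header": "quantity", "text": "5"},
        {"row": 0, "col": 0, "header": "item_description", "text": "Widget"},
        {"row": 0, "col": 1, "header": "quantity", "text": "2"},
        {"row": 1, "col": 0, "header": "item_description", "text": "Gadget"},
    ],
}

for row in table_to_rows(sample_table):
    print(row)
```

If two cells in the same row share a header, the later column overwrites the earlier one, so tables with duplicate headers would need a different key, such as the column index.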

Get Document Original File

Downloads the original document file.

original_file = client.workflows.get_document_original_file(
    workflow_id="workflow_123",
    document_id="document_123"
)

Error Handling & Common Scenarios

API status codes:

  • 200 OK: Request successful
  • 201 Created: Document uploaded successfully
  • 400 Bad Request: Invalid request parameters or unsupported file type
  • 401 Unauthorized: Invalid/missing API key
  • 404 Not Found: Workflow or document not found
  • 413 Payload Too Large: File size exceeds limit
  • 500 Internal Server Error: Server-side error

Common error scenarios:

  • File upload issues (unsupported type, too large, corrupted)
  • Processing errors (timeout, unreadable content, failure)
  • Field/table header issues (invalid/duplicate names)

from nanonets.exceptions import NanonetsError, AuthenticationError, ValidationError

try:
    result = client.workflows.upload_document(...)
except AuthenticationError as e:
    print(f"Authentication failed: {e}")
except ValidationError as e:
    print(f"Invalid input: {e}")
except NanonetsError as e:
    print(f"An error occurred: {e}")

Best Practices

  • Use async_mode for large files or batch processing
  • Include relevant metadata for better tracking
  • Validate file types before upload
  • Check confidence scores before using extracted data
  • Handle both sync and async responses appropriately
  • Implement retry logic for failed processing
  • Delete processed documents when no longer needed
  • Monitor storage usage and implement retention policies
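The retry recommendation above could be sketched as a small wrapper with exponential backoff. This is only an illustration: the helper name with_retries, its parameters, and the delay schedule are hypothetical and not part of the SDK, and which exceptions are worth retrying depends on the failure (a validation or authentication error will not succeed on a second attempt).

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def with_retries(
    operation: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 1.0,
    retry_on: Tuple[Type[BaseException], ...] = (Exception,),
) -> T:
    """Run `operation`, retrying on the given exceptions with exponential backoff.

    Hypothetical helper: waits base_delay, then 2 * base_delay, and so on
    between attempts, and re-raises the last error once attempts run out.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except retry_on:
            if attempt == attempts:
                raise
            time.sleep(base_delay * 2 ** (attempt - 1))
    raise AssertionError("unreachable")
```

A call such as with_retries(lambda: client.workflows.get_document(workflow_id="workflow_123", document_id="document_123"), retry_on=(NanonetsError,)) would retry only SDK errors. Narrowing retry_on further to transient failures avoids repeating requests that can never succeed.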

For more detailed information about specific features, see:

Setup

Minimum Python version required: 3.7

Install the Nanonets Python SDK using pip:

pip install nanonets