The library is a lightweight Python library you can use for parsing documents, classifying pages, extracting data, generating tables of contents, and splitting documents into sub-documents.
The library is automatically generated from our API specification, ensuring you have access to the latest endpoints and parameters.
To use the library, first generate an API key. Save the key to a .zshrc file or another secure location on your computer. Then export the key as an environment variable.
export VISION_AGENT_API_KEY=<your-api-key>
For more information about API keys and alternate methods for setting the API key, go to API Key.
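A script can fail early with a clear message when the environment variable is missing. The following is a minimal sketch that uses only the standard library; the helper name `require_api_key` is hypothetical and is not part of the library.

```python
import os


def require_api_key(env=None, name="VISION_AGENT_API_KEY"):
    """Return the API key from the environment, or raise a clear error."""
    # Hypothetical helper: `env` defaults to the real process environment.
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; export it before running this script.")
    return key
```

Calling `require_api_key()` at the top of a script surfaces a missing key before any document is uploaded.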
By default, the library uses the US endpoints. If your API key is from the EU endpoint, set the environment parameter to eu when initializing the client.
from pathlib import Path
from landingai_ade import LandingAIADE

client = LandingAIADE(
    environment="eu",
)

# ... rest of your code
The parse function converts documents into structured markdown with chunk and grounding metadata. Use these examples as guides to get started with parsing with the library.
The parse function accepts optional parameters to customize parsing behavior. To see all available parameters, go to ADE Parse API.
Pass these parameters directly to the parse() function.
The parse_jobs function enables you to asynchronously parse documents that are up to 1,000 pages or 1 GB.
For more information about parse jobs, go to Parse Large Files (Parse Jobs).
Here is the basic workflow for working with parse jobs:
Start a parse job.
Copy the job_id in the response.
Get the results from the parsing job with the job_id.
This script contains the full workflow:
import time
from pathlib import Path
from landingai_ade import LandingAIADE

client = LandingAIADE()

# Step 1: Create a parse job
job = client.parse_jobs.create(
    document=Path("/path/to/file/document"),
    model="dpt-2-latest"
)

job_id = job.job_id
print(f"Job {job_id} created.")

# Step 2: Get the parsing results
while True:
    response = client.parse_jobs.get(job_id)
    if response.status == "completed":
        print(f"Job {job_id} completed.")
        break
    print(f"Job {job_id}: {response.status} ({response.progress * 100:.0f}% complete)")
    time.sleep(5)

# Step 3: Access the parsed data
print("Global markdown:", response.data.markdown[:200] + "...")
print(f"Number of chunks: {len(response.data.chunks)}")

# Save Markdown output (useful if you plan to run extract on the Markdown)
with open("output.md", "w", encoding="utf-8") as f:
    f.write(response.data.markdown)
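The polling loop in the script runs until the job reports "completed", so a job that never completes would keep it waiting indefinitely. The sketch below shows one way to bound the wait with a timeout. The helper `wait_for_job` and its parameters are hypothetical. It assumes only what the script above shows: a getter that takes a job_id and returns an object with a status attribute. The names of any failure statuses are not stated here, so the sketch does not check for them.

```python
import time


def wait_for_job(get_job, job_id, poll_seconds=5.0, timeout_seconds=600.0,
                 done_status="completed", sleep=time.sleep, clock=time.monotonic):
    """Poll get_job(job_id) until its status equals done_status or time runs out."""
    deadline = clock() + timeout_seconds
    while True:
        job = get_job(job_id)
        if job.status == done_status:
            return job
        if clock() >= deadline:
            # Give up instead of looping forever on a stuck job.
            raise TimeoutError(
                f"Job {job_id} still '{job.status}' after {timeout_seconds}s"
            )
        sleep(poll_seconds)
```

With the client from the script above, the call would look like `response = wait_for_job(client.parse_jobs.get, job_id)`. The `sleep` and `clock` parameters exist so the helper can be exercised in tests without real delays.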
To list all async parse jobs associated with your API key, run this code:
from pathlib import Path
from landingai_ade import LandingAIADE

client = LandingAIADE()

# List all jobs
response = client.parse_jobs.list()
for job in response.jobs:
    print(f"Job {job.job_id}: {job.status}")

for chunk in response.chunks:
    if chunk.type == 'text':
        print(f"Chunk {chunk.id}: {chunk.markdown}")
Filter chunks by page:
page_0_chunks = [chunk for chunk in response.chunks if chunk.grounding.page == 0]
Get chunk locations:
for chunk in response.chunks:
    box = chunk.grounding.box
    print(f"Chunk at page {chunk.grounding.page}: ({box.left}, {box.top}, {box.right}, {box.bottom})")
Access detailed chunk types from the grounding dictionary:
for chunk_id, grounding in response.grounding.items():
    print(f"Chunk {chunk_id} has type: {grounding.type}")
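Building on the per-page filter above, chunks can also be grouped by page in a single pass. This is a minimal sketch: the helper `chunks_by_page` is hypothetical, and the sample objects are stand-ins that mimic only the attributes used in the snippets above (id, type, markdown, grounding.page), not real parse output.

```python
from collections import defaultdict
from types import SimpleNamespace


def chunks_by_page(chunks):
    """Group chunks into {page_number: [chunk, ...]} using chunk.grounding.page."""
    pages = defaultdict(list)
    for chunk in chunks:
        pages[chunk.grounding.page].append(chunk)
    return dict(pages)


# Stand-in chunks (hypothetical data) so the sketch runs without an API call.
sample_chunks = [
    SimpleNamespace(id="a", type="text", markdown="Intro", grounding=SimpleNamespace(page=0)),
    SimpleNamespace(id="b", type="table", markdown="| x |", grounding=SimpleNamespace(page=1)),
    SimpleNamespace(id="c", type="text", markdown="Outro", grounding=SimpleNamespace(page=1)),
]

grouped = chunks_by_page(sample_chunks)
for page, items in sorted(grouped.items()):
    print(f"Page {page}: {[chunk.id for chunk in items]}")
```

In a real script, `response.chunks` would take the place of `sample_chunks`.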
The extract function extracts structured data from Markdown content using extraction schemas. Use these examples as guides to get started with extracting with the library.
Pass Markdown Content
The library supports a few methods for passing the Markdown content for extraction:
If you already have a Markdown file (from a previous parsing operation), you can extract data directly from it. Use the markdown parameter for local Markdown files or markdown_url for remote Markdown files.
import json
from pathlib import Path
from landingai_ade import LandingAIADE

# Define your extraction schema
schema_dict = {
    "type": "object",
    "properties": {
        "employee_name": {
            "type": "string",
            "description": "The employee's full name"
        },
        "employee_ssn": {
            "type": "string",
            "description": "The employee's Social Security Number"
        },
        "gross_pay": {
            "type": "number",
            "description": "The gross pay amount"
        }
    }
}

client = LandingAIADE()
schema_json = json.dumps(schema_dict)

# Extract from a local markdown file
extract_response = client.extract(
    schema=schema_json,
    markdown=Path("/path/to/output.md"),
    model="extract-latest"
)

# Or extract from a remote markdown file
extract_response = client.extract(
    schema=schema_json,
    markdown_url="https://example.com/document.md",
    model="extract-latest"
)

# Access the extracted data
print(extract_response.extraction)
Use Pydantic models to define your extraction schema in a type-safe way. The library provides a helper function to convert Pydantic models to JSON schemas.
from pathlib import Path
from landingai_ade import LandingAIADE
from landingai_ade.lib import pydantic_to_json_schema
from pydantic import BaseModel, Field

# Define your extraction schema as a Pydantic model
class PayStubData(BaseModel):
    employee_name: str = Field(description="The employee's full name")
    employee_ssn: str = Field(description="The employee's Social Security Number")
    gross_pay: float = Field(description="The gross pay amount")

# Initialize the client
client = LandingAIADE()

# First, parse the document to get markdown
parse_response = client.parse(
    document=Path("/path/to/pay-stub.pdf"),
    model="dpt-2-latest"
)

# Convert Pydantic model to JSON schema
schema = pydantic_to_json_schema(PayStubData)

# Extract structured data using the schema
extract_response = client.extract(
    schema=schema,
    markdown=parse_response.markdown,
    model="extract-latest"
)

# Access the extracted data
print(extract_response.extraction)

# Access extraction metadata to see which chunks were referenced
print(extract_response.extraction_metadata)
Define your extraction schema directly in your script as a Python dictionary, then convert it to a JSON string with json.dumps().
import json
from pathlib import Path
from landingai_ade import LandingAIADE

# Define your extraction schema as a dictionary
schema_dict = {
    "type": "object",
    "properties": {
        "employee_name": {
            "type": "string",
            "description": "The employee's full name"
        },
        "employee_ssn": {
            "type": "string",
            "description": "The employee's Social Security Number"
        },
        "gross_pay": {
            "type": "number",
            "description": "The gross pay amount"
        }
    }
}

# Initialize the client
client = LandingAIADE()

# First, parse the document to get markdown
parse_response = client.parse(
    document=Path("/path/to/pay-stub.pdf"),
    model="dpt-2-latest"
)

# Convert schema dictionary to JSON string
schema_json = json.dumps(schema_dict)

# Extract structured data using the schema
extract_response = client.extract(
    schema=schema_json,
    markdown=parse_response.markdown,
    model="extract-latest"
)

# Access the extracted data
print(extract_response.extraction)

# Access extraction metadata to see which chunks were referenced
print(extract_response.extraction_metadata)
Load your extraction schema from a separate JSON file for better organization and reusability.
For example, here is the pay_stub_schema.json file:
{
  "type": "object",
  "properties": {
    "employee_name": {
      "type": "string",
      "description": "The employee's full name"
    },
    "employee_ssn": {
      "type": "string",
      "description": "The employee's Social Security Number"
    },
    "gross_pay": {
      "type": "number",
      "description": "The gross pay amount"
    }
  }
}
You can pass the JSON file defined above in the following script:
import json
from pathlib import Path
from landingai_ade import LandingAIADE

# Initialize the client
client = LandingAIADE()

# First, parse the document to get markdown
parse_response = client.parse(
    document=Path("/path/to/pay-stub.pdf"),
    model="dpt-2-latest"
)

# Load schema from JSON file
with open("pay_stub_schema.json", "r") as f:
    schema_json = f.read()

# Extract structured data using the schema
extract_response = client.extract(
    schema=schema_json,
    markdown=parse_response.markdown,
    model="extract-latest"
)

# Access the extracted data
print(extract_response.extraction)

# Access extraction metadata to see which chunks were referenced
print(extract_response.extraction_metadata)
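Because the script reads the schema file as raw text, it can help to check the file locally before sending it. The sketch below is one possible pre-check, not part of the library. The helper `load_schema` is hypothetical, and the checks assume schemas follow the shape shown in pay_stub_schema.json: a top-level "object" type with a "properties" mapping.

```python
import json


def load_schema(path):
    """Read a JSON schema file and return its text after basic sanity checks.

    Raises ValueError if the file is not valid JSON or does not look like
    the object schemas used in the examples above.
    """
    with open(path, "r", encoding="utf-8") as f:
        text = f.read()
    try:
        schema = json.loads(text)
    except json.JSONDecodeError as err:
        raise ValueError(f"{path} is not valid JSON: {err}") from err
    if not isinstance(schema, dict) or schema.get("type") != "object":
        raise ValueError(f'{path}: top-level "type" must be "object"')
    if not isinstance(schema.get("properties"), dict):
        raise ValueError(f'{path}: missing "properties" mapping')
    return text
```

The returned string can be passed as the schema argument in place of the raw f.read() result.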
Define nested Pydantic models to extract hierarchical data from documents. This approach organizes related information under meaningful section names.
Define nested models before the main extraction schema. Otherwise, the nested model classes will not be defined when referenced.
For example, to extract data from the Patient Details and Emergency Contact Information sections in this Medical Form, define separate models for each section, then combine them in a main model.
from pathlib import Path
from pydantic import BaseModel, Field
from landingai_ade import LandingAIADE
from landingai_ade.lib import pydantic_to_json_schema

# Define a nested model for patient-specific information
class PatientDetails(BaseModel):
    patient_name: str = Field(
        ...,
        description='Full name of the patient.',
        title='Patient Name'
    )
    date: str = Field(
        ...,
        description='Date the patient information form was filled out.',
        title='Date',
    )

# Define a nested model for emergency contact details
class EmergencyContactInformation(BaseModel):
    emergency_contact_name: str = Field(
        ...,
        description='Full name of the emergency contact person.',
        title='Emergency Contact Name',
    )
    relationship_to_patient: str = Field(
        ...,
        description='Relationship of the emergency contact to the patient.',
        title='Relationship to Patient',
    )
    primary_phone_number: str = Field(
        ...,
        description='Primary phone number of the emergency contact.',
        title='Primary Phone Number',
    )
    secondary_phone_number: str = Field(
        ...,
        description='Secondary phone number of the emergency contact.',
        title='Secondary Phone Number',
    )
    address: str = Field(
        ...,
        description='Full address of the emergency contact.',
        title='Address'
    )

# Define the main extraction schema that combines all the nested models
class PatientAndEmergencyContactInformationExtractionSchema(BaseModel):
    # Nested field containing patient details
    patient_details: PatientDetails = Field(
        ...,
        description='Information about the patient as provided in the form.',
        title='Patient Details',
    )
    # Nested field containing emergency contact information
    emergency_contact_information: EmergencyContactInformation = Field(
        ...,
        description='Details of the emergency contact person for the patient.',
        title='Emergency Contact Information',
    )

# Initialize the client
client = LandingAIADE()

# Parse the document to get markdown
parse_response = client.parse(
    document=Path("/path/to/medical-form.pdf"),
    model="dpt-2-latest"
)

# Convert Pydantic model to JSON schema
schema = pydantic_to_json_schema(PatientAndEmergencyContactInformationExtractionSchema)

# Extract structured data using the schema
extract_response = client.extract(
    schema=schema,
    markdown=parse_response.markdown,
    model="extract-latest"
)

# Display the extracted structured data
print(extract_response.extraction)
Use the Python List type inside a Pydantic BaseModel to extract repeatable data structures when you don’t know how many items will appear. Common examples include line items in invoices, transaction records, or contact information for multiple people.
For example, to extract variable-length wire instructions and line items from this Wire Transfer Form, use List[DescriptionItem] for line items and List[WireInstruction] for wire transfer details.
from typing import List
from pathlib import Path
from pydantic import BaseModel, Field
from landingai_ade import LandingAIADE
from landingai_ade.lib import pydantic_to_json_schema

# Nested models for list fields
class DescriptionItem(BaseModel):
    description: str = Field(description="Invoice or Bill Description")
    amount: float = Field(description="Invoice or Bill Amount")

class WireInstruction(BaseModel):
    bank_name: str = Field(description="Bank name")
    bank_address: str = Field(description="Bank address")
    bank_account_no: str = Field(description="Bank account number")
    swift_code: str = Field(description="SWIFT code")
    aba_routing: str = Field(description="ABA routing number")
    ach_routing: str = Field(description="ACH routing number")

# Invoice model containing list object fields
class Invoice(BaseModel):
    description_or_particular: List[DescriptionItem] = Field(
        description="List of invoice line items (description and amount)"
    )
    wire_instructions: List[WireInstruction] = Field(
        description="Wire transfer instructions"
    )

# Main extraction model
class ExtractedInvoiceFields(BaseModel):
    invoice: Invoice = Field(description="Invoice list-type fields")

# Initialize the client
client = LandingAIADE()

# Parse the document to get markdown
parse_response = client.parse(
    document=Path("/path/to/wire-transfer.pdf"),
    model="dpt-2-latest"
)

# Convert Pydantic model to JSON schema
schema = pydantic_to_json_schema(ExtractedInvoiceFields)

# Extract structured data using the schema
extract_response = client.extract(
    schema=schema,
    markdown=parse_response.markdown,
    model="extract-latest"
)

# Display the extracted data
print(extract_response.extraction)
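Once list fields are extracted, a typical next step is to iterate over them. The sketch below sums the line-item amounts. It assumes that extract_response.extraction behaves like a dictionary whose keys mirror the ExtractedInvoiceFields schema, which this page does not confirm. The helper `total_amount` and the sample values are hypothetical, for illustration only.

```python
def total_amount(extraction):
    """Sum the "amount" of each line item in a dict shaped like ExtractedInvoiceFields."""
    items = extraction.get("invoice", {}).get("description_or_particular", [])
    # Treat a missing or null amount as zero rather than failing.
    return sum(item.get("amount") or 0.0 for item in items)


# Hypothetical extraction result, not real output from the API.
sample_extraction = {
    "invoice": {
        "description_or_particular": [
            {"description": "Consulting", "amount": 1200.0},
            {"description": "Travel", "amount": 300.5},
        ],
        "wire_instructions": [],
    }
}

print(f"Total: {total_amount(sample_extraction):.2f}")
```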
The classify function classifies each page in a document by type. Provide your document and a list of classes, and the API assigns a class to each page. Use these examples as guides to get started with classifying with the library.
Use the document_url parameter to classify files from remote URLs (http, https, ftp, ftps).
import json
from landingai_ade import LandingAIADE

client = LandingAIADE()

classes = [
    {"class": "invoice", "description": "A commercial bill with line items, totals, and payment terms"},
    {"class": "bank_statement", "description": "A monthly summary of account transactions"}
]

response = client.classify(
    classes=json.dumps(classes),
    document_url="https://example.com/document.pdf",
    model="classify-latest"
)

for result in response.classification:
    print(f"Page {result.page}: {result.class_}")
The classify function returns a ClassifyResponse object with the following fields:
classification: List of Classification objects, one per page, each containing:
class_: The predicted class label, or 'unknown' if the page could not be classified. Note: class_ is used instead of class because class is a reserved keyword in Python.
page: The zero-indexed page number
reason: A brief explanation of the classification (for debugging)
suggested_class: A proposed class when the prediction is 'unknown'
for result in response.classification:
    print(f"Page {result.page}: {result.class_}")
Filter pages by class:
invoices = [r for r in response.classification if r.class_ == "invoice"]
print(f"Found {len(invoices)} invoice pages")
Handle pages that could not be classified:
unknown = [r for r in response.classification if r.class_ == "unknown"]
for r in unknown:
    print(f"Page {r.page}: suggested class is {r.suggested_class}")
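Since the response contains one result per page, consecutive pages with the same class can be collapsed into page ranges. This is a minimal sketch: the helper `page_ranges` is hypothetical, and the sample results are stand-ins that carry only the page and class_ fields described above.

```python
from itertools import groupby
from types import SimpleNamespace


def page_ranges(classification):
    """Collapse per-page results into (class, first_page, last_page) runs."""
    ordered = sorted(classification, key=lambda r: r.page)
    ranges = []
    # groupby only merges neighbours, so a class that reappears later
    # produces a separate run.
    for label, run in groupby(ordered, key=lambda r: r.class_):
        pages = [r.page for r in run]
        ranges.append((label, pages[0], pages[-1]))
    return ranges


# Stand-in results (hypothetical data) so the sketch runs without an API call.
sample_results = [
    SimpleNamespace(page=0, class_="invoice"),
    SimpleNamespace(page=1, class_="invoice"),
    SimpleNamespace(page=2, class_="bank_statement"),
    SimpleNamespace(page=3, class_="invoice"),
]

for label, first, last in page_ranges(sample_results):
    print(f"{label}: pages {first}-{last}")
```

In a real script, `response.classification` would take the place of `sample_results`.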
The section function analyzes a parsed document and generates a hierarchical table of contents. Use these examples as guides to get started with sectioning with the library.
Pass Markdown Content
The library supports a few methods for passing the Markdown content for sectioning:
If you already have a Markdown file (from a previous parsing operation), you can section it directly. Use the markdown parameter for local Markdown files or markdown_url for remote Markdown files.
from pathlib import Path
from landingai_ade import LandingAIADE

client = LandingAIADE()

# Section from a local Markdown file
section_response = client.section(
    markdown=Path("/path/to/parsed_output.md"),
    model="section-latest"
)

# Or section from a remote Markdown file
section_response = client.section(
    markdown_url="https://example.com/document.md",
    model="section-latest"
)

# Access the table of contents
for entry in section_response.table_of_contents:
    indent = " " * (entry.level - 1)
    print(f"{indent}{entry.section_number}. {entry.title}")
The split function classifies and separates a parsed document into multiple sub-documents based on Split Rules you define. Use these examples as guides to get started with splitting with the library.
Pass Markdown Content
The library supports a few methods for passing the Markdown content for splitting:
After parsing a document, you can pass the Markdown string directly from the ParseResponse to the split function without saving it to a file.
import json
from pathlib import Path
from landingai_ade import LandingAIADE

client = LandingAIADE()

# Parse the document
parse_response = client.parse(
    document=Path("/path/to/document.pdf"),
    model="dpt-2-latest"
)

# Define Split Rules
split_class = [
    {
        "name": "Bank Statement",
        "description": "Document from a bank that summarizes all account activity over a period of time."
    },
    {
        "name": "Pay Stub",
        "description": "Document that details an employee's earnings, deductions, and net pay for a specific pay period.",
        "identifier": "Pay Stub Date"
    }
]

# Split using the Markdown string from parse response
split_response = client.split(
    split_class=json.dumps(split_class),
    markdown=parse_response.markdown,  # Pass Markdown string directly
    model="split-latest"
)

# Access the splits
for split in split_response.splits:
    print(f"Classification: {split.classification}")
    print(f"Identifier: {split.identifier}")
    print(f"Pages: {split.pages}")
If you already have a Markdown file (from a previous parsing operation), you can split it directly. Use the markdown parameter for local Markdown files or markdown_url for remote Markdown files.
import json
from pathlib import Path
from landingai_ade import LandingAIADE

client = LandingAIADE()

# Define Split Rules
split_class = [
    {
        "name": "Invoice",
        "description": "A document requesting payment for goods or services.",
        "identifier": "Invoice Number"
    },
    {
        "name": "Receipt",
        "description": "A document acknowledging that payment has been received."
    }
]

# Split from a local Markdown file
split_response = client.split(
    split_class=json.dumps(split_class),
    markdown=Path("/path/to/parsed_output.md"),
    model="split-latest"
)

# Or split from a remote Markdown file
split_response = client.split(
    split_class=json.dumps(split_class),
    markdown_url="https://example.com/document.md",
    model="split-latest"
)

# Access the splits
for split in split_response.splits:
    print(f"Classification: {split.classification}")
    if split.identifier:
        print(f"Identifier: {split.identifier}")
    print(f"Number of pages: {len(split.pages)}")
    print(f"Markdown content: {split.markdowns[0][:100]}...")

for split in split_response.splits:
    print(f"Split Type: {split.classification}")
    print(f"Pages included: {split.pages}")
Filter splits by classification:
invoices = [split for split in split_response.splits if split.classification == "Invoice"]
print(f"Found {len(invoices)} invoices")
Access Markdown content for each split:
for split in split_response.splits:
    print(f"Classification: {split.classification}")
    for i, markdown in enumerate(split.markdowns):
        print(f" Page {split.pages[i]} Markdown: {markdown[:100]}...")
Group splits by identifier:
from collections import defaultdict

splits_by_id = defaultdict(list)
for split in split_response.splits:
    if split.identifier:
        splits_by_id[split.identifier].append(split)

for identifier, splits in splits_by_id.items():
    print(f"Identifier '{identifier}': {len(splits)} split(s)")
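Each split can also be written to its own Markdown file for downstream processing. The sketch below shows one way to do that with the standard library. The helper `save_splits` and its file-naming scheme are hypothetical and are not something the library provides. The sample split is a stand-in that carries only the classification and markdowns fields used in the snippets above.

```python
import re
import tempfile
from pathlib import Path
from types import SimpleNamespace


def save_splits(splits, out_dir):
    """Write each split's page Markdown to one file and return the file paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for index, split in enumerate(splits):
        # Build a filesystem-friendly name from the classification label.
        slug = re.sub(r"[^A-Za-z0-9]+", "_", split.classification).strip("_").lower()
        path = out_dir / f"{index:03d}_{slug or 'split'}.md"
        # Join the per-page Markdown strings with a blank line between pages.
        path.write_text("\n\n".join(split.markdowns), encoding="utf-8")
        paths.append(path)
    return paths


# Stand-in split (hypothetical data) so the sketch runs without an API call.
sample_splits = [
    SimpleNamespace(classification="Pay Stub", markdowns=["# Page one", "# Page two"]),
]

with tempfile.TemporaryDirectory() as tmp:
    for path in save_splits(sample_splits, tmp):
        print(path.name)
```

In a real script, `split_response.splits` would take the place of `sample_splits`, and `out_dir` would point to a permanent folder.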
Pass a directory path. The library names the file using the input document’s filename and the function called (for example, document_parse_output.json).
When passing Markdown content as a string (markdown=parse_response.markdown), the library cannot derive a filename from the content. In this situation, use Set the File Name instead.
The parse response includes a markdown field that you can pass directly to other functions in the same script. To save the Markdown for downstream tasks, write it to a file:
with open("output.md", "w", encoding="utf-8") as f:
    f.write(response.markdown)