Upload a document
Response:
id: the document id, also called upload_id. You can use it to retrieve the document with GET /documents/{upload_id}. It is in UUID format.
Please note: store id in your database, because there is currently no GET /documents/list API to list all documents.
status: once uploaded, the status is UN_PARSED. After a series of processing steps, the status changes over time and finally settles into one of the following two cases:
UN_PARSED = 1 file uploaded or collection has no document
# final statuses
ELEMENT_PARSED = 300 analysis of the document has succeeded
ERROR_STATUSES (< 0) error occurred during analysis
Until the document status is final, you can poll it with GET /documents/{upload_id} at an interval of 10 s. Processing generally takes 1-2 minutes, depending on the content length of the document. If an error occurs, the upload does not consume your quota.
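The polling loop described above can be sketched as follows. This is a minimal illustration, not an official client: `fetch_status` is a hypothetical callable that you supply to wrap GET /documents/{upload_id} and return the integer status, and the 600 s timeout is an assumed safety limit that does not come from the API.

```python
import time
from typing import Callable

ELEMENT_PARSED = 300  # success status listed in this section


def is_final(status: int) -> bool:
    """A status is final when it is ELEMENT_PARSED or any negative error code."""
    return status == ELEMENT_PARSED or status < 0


def wait_until_final(
    fetch_status: Callable[[str], int],  # hypothetical wrapper around GET /documents/{upload_id}
    upload_id: str,
    interval: float = 10.0,   # polling interval suggested in this section
    timeout: float = 600.0,   # assumed upper bound, not part of the API
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Poll the document status until it is final, or raise TimeoutError."""
    waited = 0.0
    while True:
        status = fetch_status(upload_id)
        if is_final(status):
            return status
        if waited >= timeout:
            raise TimeoutError(f"document {upload_id} still has status {status}")
        sleep(interval)
        waited += interval
```

Passing `sleep` as a parameter keeps the loop easy to test without real delays. The caller should still check whether the returned status is negative before using the document.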
Please note:
- Uploading files consumes your pages quota, and uploading the same file again still consumes quota.
- We'll keep your uploaded file for one year. If you do not make another payment for the API after one year, the file will be permanently deleted.
- Usage instructions for the OCR field:
  - The OCR Pages Package must be used in conjunction with the PDF Pages Package; both packages are deducted equally.
  - Three values are available: disable (the default), auto, and force.
  - When OCR is set to force, the OCR Pages Package can also be used for Word documents. For other document types such as ePub and Markdown, it has no effect.
For reference, other statuses are as follows:
UN_PARSED = 1 file uploaded or collection has no document
LINK_UN_PARSED = 10 file link submitted
PARSING = 12 parsing, mainly used for collection
LINK_DOWNLOADING = 15 file link downloading
PDF_CONVERTING = 20 docx to pdf converting
PDF_CONVERTED = 30 docx to pdf success
TEXT_PARSING = 40 text embedding (used when element parsing times out after 2 min)
ELEMENT_PARSING = 50 element embedding
INSIGHT_CALLBACK = 70 element parse success
TEXT_PARSED = 210 text embedding success
ELEMENT_PARSED = 300 element embedding success
TEXT_PARSE_ERROR = -1 text embedding failed
ELEMENT_PARED_ERROR = -2 element embedding failed
PDF_CONVERT_ERROR = -3 docx to pdf failed
LINK_DOWNLOAD_ERROR = -4 file link download failed
EXCEED_SIZE_ERROR = -5 file size exceed limit
EXCEED_TOKENS_ERROR = -6 exceed tokens limit
PAGE_PACKAGE_NOT_ENOUGH_ERROR = -9 page package not enough
PAGE_LIMIT_ERROR = -10 page limit error
TITLE_COMPLETE_ERROR = -11 complete title failed
READ_TMP_FILE_ERROR = -12 read tmp file error
OCR_PAGE_LIMIT_ERROR = -13 ocr page limit error
CONTENT_POLICY_ERROR = -14 content security check did not pass
CONTENT_DECODE_ERROR = -15 file content decode error
HTML_CONVERT_ERROR = -16 html convert error
HTML_EMPTY_BODY_ERROR = -17 content is empty
HTML_PARSE_ERROR = -18 html parse error
HTML_DOWNLOAD_ERROR = -19 html download error from website
PACKAGE_NOT_ENOUGH_ERROR = -25 package not enough
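
For client code, the status codes above can be mirrored in an enum so that responses are readable in logs. The sketch below copies the names and values verbatim from the list (including the spelling ELEMENT_PARED_ERROR); the class name `DocumentStatus` and the helper functions are illustrative assumptions, not part of the API.

```python
from enum import IntEnum


class DocumentStatus(IntEnum):
    """Status codes copied from the list above; the class itself is hypothetical."""
    UN_PARSED = 1
    LINK_UN_PARSED = 10
    PARSING = 12
    LINK_DOWNLOADING = 15
    PDF_CONVERTING = 20
    PDF_CONVERTED = 30
    TEXT_PARSING = 40
    ELEMENT_PARSING = 50
    INSIGHT_CALLBACK = 70
    TEXT_PARSED = 210
    ELEMENT_PARSED = 300
    TEXT_PARSE_ERROR = -1
    ELEMENT_PARED_ERROR = -2  # spelled as in the list above
    PDF_CONVERT_ERROR = -3
    LINK_DOWNLOAD_ERROR = -4
    EXCEED_SIZE_ERROR = -5
    EXCEED_TOKENS_ERROR = -6
    PAGE_PACKAGE_NOT_ENOUGH_ERROR = -9
    PAGE_LIMIT_ERROR = -10
    TITLE_COMPLETE_ERROR = -11
    READ_TMP_FILE_ERROR = -12
    OCR_PAGE_LIMIT_ERROR = -13
    CONTENT_POLICY_ERROR = -14
    CONTENT_DECODE_ERROR = -15
    HTML_CONVERT_ERROR = -16
    HTML_EMPTY_BODY_ERROR = -17
    HTML_PARSE_ERROR = -18
    HTML_DOWNLOAD_ERROR = -19
    PACKAGE_NOT_ENOUGH_ERROR = -25


def is_error(code: int) -> bool:
    """Every negative code is an error status."""
    return code < 0


def is_final(code: int) -> bool:
    """Final means ELEMENT_PARSED (300) or any error status."""
    return code == DocumentStatus.ELEMENT_PARSED or is_error(code)


def describe(code: int) -> str:
    """Return the status name, or a placeholder for codes not in the list."""
    try:
        return DocumentStatus(code).name
    except ValueError:
        return f"UNKNOWN({code})"
```

`describe` falls back to a placeholder instead of raising, so a code that is not in the list above does not break the client.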