til: aws_ocr-pdf-textract.md

path	topic	title	url	body	html	shot	created	created_utc	updated	updated_utc	shot_hash	slug
aws_ocr-pdf-textract.md	aws	Running OCR against a PDF file with AWS Textract	https://github.com/simonw/til/blob/main/aws/ocr-pdf-textract.md	[Textract](https://aws.amazon.com/textract/) is the AWS OCR API. It's very good - I've fed it hand-written notes from the 1890s and it read them better than I could. It can be run directly against JPEG or PNG images up to 5MB, but if you want to run OCR against a PDF file you have to first upload it to an S3 bucket. Update 30th June 2022: I used what I learned in this TIL [to build s3-ocr](https://simonwillison.net/2022/Jun/30/s3-ocr/), a command line utility for running OCR against PDFs in an S3 bucket. ## Try it out first You don't need to use the API at all to try Textract out against a document: they offer a demo tool in the AWS console: https://us-west-1.console.aws.amazon.com/textract/home?region=us-west-1#/demo <img alt="Screenshot of the demo interface showing uploaded image and resulting text" src="https://user-images.githubusercontent.com/9599/176274424-441aee18-8e8c-44bf-9748-f53e33e3fa76.png" width="600"> ## Limits Relevant [limits](https://docs.aws.amazon.com/textract/latest/dg/limits.html) for PDF files: > For asynchronous operations, JPEG and PNG files have a 10MB size limit. PDF and TIFF files have a 500MB limit. PDF and TIFF files have a limit of 3,000 pages. > > For PDFs: The maximum height and width is 40 inches and 2880 points. PDFs cannot be password protected. PDFs can contain JPEG 2000 formatted images. 
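
The size limit quoted above can be checked locally before uploading anything. This is a minimal sketch, not part of the original workflow: the helper name `pdf_within_size_limit` is hypothetical, and it assumes "500MB" means 500 * 1024 * 1024 bytes. It does not check the 3,000 page limit, which would need a PDF library.

```python
import os

# Limit quoted above for asynchronous operations on PDF and TIFF files.
# Assumption: "500MB" is interpreted here as 500 * 1024 * 1024 bytes.
MAX_PDF_BYTES = 500 * 1024 * 1024


def pdf_within_size_limit(path, max_bytes=MAX_PDF_BYTES):
    """Return True if the file at `path` is no larger than `max_bytes`."""
    return os.path.getsize(path) <= max_bytes
```
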
## Uploading to S3

I used my [s3-credentials](https://github.com/simonw/s3-credentials/) tool to create an S3 bucket with credentials for uploading files to it:

```
~ % s3-credentials create sfms-history -c
Created bucket: sfms-history
Created user: 's3.read-write.sfms-history' with permissions boundary: 'arn:aws:iam::aws:policy/AmazonS3FullAccess'
Attached policy s3.read-write.sfms-history to user s3.read-write.sfms-history
Created access key for user: s3.read-write.sfms-history
{
    "UserName": "s3.read-write.sfms-history",
    "AccessKeyId": "AKIAWXFXAIOZBOQM4XUH",
    "Status": "Active",
    "SecretAccessKey": "...",
    "CreateDate": "2022-06-28 17:55:10+00:00"
}
```

I stored the secret access key in 1Password, then used it in [Transmit](https://panic.com/transmit/) to upload the PDF files.

## Starting a text detection job

For PDFs you need to run in async mode, where you get back a job ID and then poll for completion.

You can ask it to send you notifications via an SNS topic too, but this is optional. You can ignore SNS entirely, which is what I did.

To start the job, provide it with the bucket and the name of the file to process:

```python
import boto3

textract = boto3.client("textract")
response = textract.start_document_text_detection(
    DocumentLocation={
        'S3Object': {
            'Bucket': "sfms-history",
            'Name': "Meetings and Minutes/Minutes/1946-1949/1946-10-04_SFMS_MeetingMinutes.pdf"
        }
    }
)
job_id = response["JobId"]
```

## Polling for completion

You can then use that `job_id` to poll for completion. The `textract.get_document_text_detection` call returns a `JobStatus` key of `IN_PROGRESS` if it is still processing.
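
Once a job leaves `IN_PROGRESS` it has not necessarily succeeded. The sketch below interprets the status of a response dictionary, assuming the other `JobStatus` values are `SUCCEEDED`, `PARTIAL_SUCCESS` and `FAILED`, and that a failed job may carry an optional `StatusMessage` key. The helper name `check_job_status` is hypothetical, and the function works on a plain dictionary, so it makes no AWS calls itself.

```python
# JobStatus values assumed to indicate that results are available.
FINISHED_OK = {"SUCCEEDED", "PARTIAL_SUCCESS"}


def check_job_status(response):
    """Return True if the job finished, False if it is still running.

    Raises RuntimeError if the job failed. `response` is the dictionary
    returned by get_document_text_detection().
    """
    status = response["JobStatus"]
    if status == "IN_PROGRESS":
        return False
    if status in FINISHED_OK:
        return True
    # FAILED (or anything unexpected): surface the optional StatusMessage.
    raise RuntimeError(
        "Textract job ended with status {}: {}".format(
            status, response.get("StatusMessage", "no message")
        )
    )
```
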
Here's a function I wrote to poll for completion:

```python
import time

def poll_until_done(job_id):
    while True:
        response = textract.get_document_text_detection(JobId=job_id)
        status = response["JobStatus"]
        if status != "IN_PROGRESS":
            return response
        print(".", end="")
        time.sleep(10)

# Usage, given a response from textract.start_document_text_detection:
completion_response = poll_until_done(response["JobId"])
```

This can take a surprisingly long time - it took seven minutes for a 6 page typewritten PDF file for me, and ten minutes for a 56 page handwritten one.

I was wondering how long you have to retrieve the results of a job. The [get_document_text_detection()](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/textract.html#Textract.Client.get_document_text_detection) documentation says:

> A `JobId` value is only valid for 7 days.

## Fetching the results

The response that you get back at the end is paginated. Here's a function to gather all of the "blocks" of text that it detected across multiple pages:

```python
def get_all_blocks(job_id):
    blocks = []
    next_token = None
    first = True
    while first or next_token:
        first = False
        kwargs = {"JobId": job_id}
        if next_token:
            kwargs["NextToken"] = next_token
        response = textract.get_document_text_detection(**kwargs)
        blocks.extend(response["Blocks"])
        next_token = response.get("NextToken")
    return blocks
```

(I could have used [this boto3 pagination trick](https://til.simonwillison.net/aws/helper-for-boto-aws-pagination) instead.)

Blocks come in three types: `LINE`, `WORD`, and `PAGE`. The `PAGE` blocks do not contain any text, just indications of which lines and words were on the page. The `LINE` and `WORD` blocks duplicate each other - you probably just want the `LINE` blocks.
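
Since each `LINE` block carries a `Page` number, the text can also be kept separate per page rather than joined into one string. This is a rough sketch under a few assumptions: the helper name `lines_by_page` is hypothetical, blocks are assumed to arrive in reading order, and a block with no `Page` key is treated as page 1.

```python
from collections import defaultdict


def lines_by_page(blocks):
    """Group the text of LINE blocks by page number.

    Returns a dict mapping page number to that page's lines joined
    with newlines. WORD and PAGE blocks are skipped.
    """
    pages = defaultdict(list)
    for block in blocks:
        if block["BlockType"] == "LINE":
            # Assumption: a missing "Page" key means a single-page document.
            pages[block.get("Page", 1)].append(block["Text"])
    return {page: "\n".join(lines) for page, lines in sorted(pages.items())}
```

Given the list returned by `get_all_blocks(job_id)`, `lines_by_page(blocks)[1]` would then hold the text of the first page.
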
Here's an example of a `LINE` block: ```json { "BlockType": "LINE", "Confidence": 90.4699478149414, "Text": "1", "Geometry": { "BoundingBox": { "Width": 0.00758015550673008, "Height": 0.011477531865239143, "Left": 0.9904273152351379, "Top": 0.00909337680786848 }, "Polygon": [ { "X": 0.9904273152351379, "Y": 0.00909337680786848 }, { "X": 0.9980074763298035, "Y": 0.00909337680786848 }, { "X": 0.9980074763298035, "Y": 0.0205709096044302 }, { "X": 0.9904273152351379, "Y": 0.0205709096044302 } ] }, "Id": "6b04b8df-bec1-42d3-bfff-29f0edd38976", "Relationships": [ { "Type": "CHILD", "Ids": [ "58890ca7-5ed5-4b14-ad60-475e5d0dd79e" ] } ], "Page": 1 } ``` I found that joining together those lines on a `\n` gave me the results I needed: ```python print("\n".join([block["Text"] for block in blocks if block["BlockType"] == "LINE"])) ``` Truncated output: ``` 1 ORGANIZATION MEETING of the SAN FRANCISCO MICROSCOPICAL SOCIETY October 4, 1946 The meeting ws.s held at 8:00 P.M. on October 4, 1946, in the Auditorium of the San Francisco Department of Health, 101 Grove Street, San Francisco. Chairman George Herbert Needham called the audience of sixty- five persons to order. He told of the high aims, ideals, and fine fellow- ship enjoyed by the original society which was organized in 1870 and incor- porated in 1872, but which was dissolved following the San Francisco fire of 1906. He related his efforts to find a surviving member which finally resulted in a telegram of greeting from Dr. Kaspar Pischell of Ross, Cali- fornia, which read as follows: "BEST WISHES AT THIS REUNION. I AM SORRY I CANNOT BE WITH YOU." ```	<p><a href="https://aws.amazon.com/textract/" rel="nofollow">Textract</a> is the AWS OCR API. 
It's very good - I've fed it hand-written notes from the 1890s and it read them better than I could.</p> <p>It can be run directly against JPEG or PNG images up to 5MB, but if you want to run OCR against a PDF file you have to first upload it to an S3 bucket.</p> <p><strong>Update 30th June 2022</strong>: I used what I learned in this TIL <a href="https://simonwillison.net/2022/Jun/30/s3-ocr/" rel="nofollow">to build s3-ocr</a>, a command line utility for running OCR against PDFs in an S3 bucket.</p> <h2> <a id="user-content-try-it-out-first" class="anchor" href="#try-it-out-first" aria-hidden="true"><span aria-hidden="true" class="octicon octicon-link"></span></a>Try it out first</h2> <p>You don't need to use the API at all to try Textract out against a document: they offer a demo tool in the AWS console:</p> <p><a href="https://us-west-1.console.aws.amazon.com/textract/home?region=us-west-1#/demo" rel="nofollow">https://us-west-1.console.aws.amazon.com/textract/home?region=us-west-1#/demo</a></p> <p><a href="https://user-images.githubusercontent.com/9599/176274424-441aee18-8e8c-44bf-9748-f53e33e3fa76.png" target="_blank" rel="nofollow"><img alt="Screenshot of the demo interface showing uploaded image and resulting text" src="https://user-images.githubusercontent.com/9599/176274424-441aee18-8e8c-44bf-9748-f53e33e3fa76.png" width="600" style="max-width:100%;"></a></p> <h2> <a id="user-content-limits" class="anchor" href="#limits" aria-hidden="true"><span aria-hidden="true" class="octicon octicon-link"></span></a>Limits</h2> <p>Relevant <a href="https://docs.aws.amazon.com/textract/latest/dg/limits.html" rel="nofollow">limits</a> for PDF files:</p> <blockquote> <p>For asynchronous operations, JPEG and PNG files have a 10MB size limit. PDF and TIFF files have a 500MB limit. PDF and TIFF files have a limit of 3,000 pages.</p> <p>For PDFs: The maximum height and width is 40 inches and 2880 points. PDFs cannot be password protected. 
PDFs can contain JPEG 2000 formatted images.</p> </blockquote> <h2> <a id="user-content-uploading-to-s3" class="anchor" href="#uploading-to-s3" aria-hidden="true"><span aria-hidden="true" class="octicon octicon-link"></span></a>Uploading to S3</h2> <p>I used my <a href="https://github.com/simonw/s3-credentials/">s3-credentials</a> tool to create an S3 bucket with credentials for uploading files to it:</p> <pre><code>~ % s3-credentials create sfms-history -c Created bucket: sfms-history Created user: 's3.read-write.sfms-history' with permissions boundary: 'arn:aws:iam::aws:policy/AmazonS3FullAccess' Attached policy s3.read-write.sfms-history to user s3.read-write.sfms-history Created access key for user: s3.read-write.sfms-history { "UserName": "s3.read-write.sfms-history", "AccessKeyId": "AKIAWXFXAIOZBOQM4XUH", "Status": "Active", "SecretAccessKey": "...", "CreateDate": "2022-06-28 17:55:10+00:00" } </code></pre> <p>I stored the secret access key in 1Password, then used it in <a href="https://panic.com/transmit/" rel="nofollow">Transmit</a> to upload the PDF files.</p> <h2> <a id="user-content-starting-a-text-detection-job" class="anchor" href="#starting-a-text-detection-job" aria-hidden="true"><span aria-hidden="true" class="octicon octicon-link"></span></a>Starting a text detection job</h2> <p>For PDFs you need to run in async mode, where you get back a job ID and then poll for completion.</p> <p>You can ask it to send you notifications via an SNS queue too, but this is optional. 
You can ignore SNS entirely, which is what I did.</p> <p>To start the job, provide it with the bucket and the name of the file to process:</p> <div class="highlight highlight-source-python"><pre><span class="pl-k">import</span> <span class="pl-s1">boto3</span> <span class="pl-s1">textract</span> <span class="pl-c1">=</span> <span class="pl-s1">boto3</span>.<span class="pl-en">client</span>(<span class="pl-s">"textract"</span>) <span class="pl-s1">response</span> <span class="pl-c1">=</span> <span class="pl-s1">textract</span>.<span class="pl-en">start_document_text_detection</span>( <span class="pl-v">DocumentLocation</span><span class="pl-c1">=</span>{ <span class="pl-s">'S3Object'</span>: { <span class="pl-s">'Bucket'</span>: <span class="pl-s">"sfms-history"</span>, <span class="pl-s">'Name'</span>: <span class="pl-s">"Meetings and Minutes/Minutes/1946-1949/1946-10-04_SFMS_MeetingMinutes.pdf"</span> } } ) <span class="pl-s1">job_id</span> <span class="pl-c1">=</span> <span class="pl-s1">response</span>[<span class="pl-s">"JobId"</span>]</pre></div> <h2> <a id="user-content-polling-for-completion" class="anchor" href="#polling-for-completion" aria-hidden="true"><span aria-hidden="true" class="octicon octicon-link"></span></a>Polling for completion</h2> <p>You can then use that <code>job_id</code> to poll for completion. 
The <code>textract.get_document_text_detection</code> call returns a <code>JobStatus</code> key of <code>IN_PROGRESS</code> if it is still processing.</p> <p>Here's a function I wrote to poll for completion:</p> <div class="highlight highlight-source-python"><pre><span class="pl-k">import</span> <span class="pl-s1">time</span> <span class="pl-k">def</span> <span class="pl-en">poll_until_done</span>(<span class="pl-s1">job_id</span>): <span class="pl-k">while</span> <span class="pl-c1">True</span>: <span class="pl-s1">response</span> <span class="pl-c1">=</span> <span class="pl-s1">textract</span>.<span class="pl-en">get_document_text_detection</span>(<span class="pl-v">JobId</span><span class="pl-c1">=</span><span class="pl-s1">job_id</span>) <span class="pl-s1">status</span> <span class="pl-c1">=</span> <span class="pl-s1">response</span>[<span class="pl-s">"JobStatus"</span>] <span class="pl-k">if</span> <span class="pl-s1">status</span> <span class="pl-c1">!=</span> <span class="pl-s">"IN_PROGRESS"</span>: <span class="pl-k">return</span> <span class="pl-s1">response</span> <span class="pl-en">print</span>(<span class="pl-s">"."</span>, <span class="pl-s1">end</span><span class="pl-c1">=</span><span class="pl-s">""</span>) <span class="pl-s1">time</span>.<span class="pl-en">sleep</span>(<span class="pl-c1">10</span>) <span class="pl-c"># Usage, given a response from textract.start_document_text_detection:</span> <span class="pl-s1">completion_response</span> <span class="pl-c1">=</span> <span class="pl-en">poll_until_done</span>(<span class="pl-s1">response</span>[<span class="pl-s">"JobId"</span>])</pre></div> <p>This can take a surprisingly long time - it took seven minutes for a 6 page typewritten PDF file for me, and ten minutes for a 56 page handwritten one.</p> <p>I was wondering how long you have to retrieve the results of a job. 
The <a href="https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/textract.html#Textract.Client.get_document_text_detection" rel="nofollow">get_document_text_detection()</a> documentation says:</p> <blockquote> <p>A <code>JobId</code> value is only valid for 7 days.</p> </blockquote> <h2> <a id="user-content-fetching-the-results" class="anchor" href="#fetching-the-results" aria-hidden="true"><span aria-hidden="true" class="octicon octicon-link"></span></a>Fetching the results</h2> <p>The response that you get back at the end is paginated. Here's a function to gather all of the "blocks" of text that it detected across multiple pages:</p> <div class="highlight highlight-source-python"><pre><span class="pl-k">def</span> <span class="pl-en">get_all_blocks</span>(<span class="pl-s1">job_id</span>): <span class="pl-s1">blocks</span> <span class="pl-c1">=</span> [] <span class="pl-s1">next_token</span> <span class="pl-c1">=</span> <span class="pl-c1">None</span> <span class="pl-s1">first</span> <span class="pl-c1">=</span> <span class="pl-c1">True</span> <span class="pl-k">while</span> <span class="pl-s1">first</span> <span class="pl-c1">or</span> <span class="pl-s1">next_token</span>: <span class="pl-s1">first</span> <span class="pl-c1">=</span> <span class="pl-c1">False</span> <span class="pl-s1">kwargs</span> <span class="pl-c1">=</span> {<span class="pl-s">"JobId"</span>: <span class="pl-s1">job_id</span>} <span class="pl-k">if</span> <span class="pl-s1">next_token</span>: <span class="pl-s1">kwargs</span>[<span class="pl-s">"NextToken"</span>] <span class="pl-c1">=</span> <span class="pl-s1">next_token</span> <span class="pl-s1">response</span> <span class="pl-c1">=</span> <span class="pl-s1">textract</span>.<span class="pl-en">get_document_text_detection</span>(<span class="pl-c1">**</span><span class="pl-s1">kwargs</span>) <span class="pl-s1">blocks</span>.<span class="pl-en">extend</span>(<span class="pl-s1">response</span>[<span 
class="pl-s">"Blocks"</span>]) <span class="pl-s1">next_token</span> <span class="pl-c1">=</span> <span class="pl-s1">response</span>.<span class="pl-en">get</span>(<span class="pl-s">"NextToken"</span>) <span class="pl-k">return</span> <span class="pl-s1">blocks</span></pre></div> <p>(I could have used <a href="https://til.simonwillison.net/aws/helper-for-boto-aws-pagination" rel="nofollow">this boto3 pagination trick</a> instead.)</p> <p>Blocks come in three types: <code>LINE</code>, <code>WORD</code>, and <code>PAGE</code>. The <code>PAGE</code> blocks do not contain any text, just indications of which lines and words were on the page. The <code>LINE</code> and <code>WORD</code> blocks duplicate each other - you probably just want the <code>LINE</code> blocks.</p> <p>Here's an example of a <code>LINE</code> block:</p> <div class="highlight highlight-source-json"><pre>{ <span class="pl-ent">"BlockType"</span>: <span class="pl-s"><span class="pl-pds">"</span>LINE<span class="pl-pds">"</span></span>, <span class="pl-ent">"Confidence"</span>: <span class="pl-c1">90.4699478149414</span>, <span class="pl-ent">"Text"</span>: <span class="pl-s"><span class="pl-pds">"</span>1<span class="pl-pds">"</span></span>, <span class="pl-ent">"Geometry"</span>: { <span class="pl-ent">"BoundingBox"</span>: { <span class="pl-ent">"Width"</span>: <span class="pl-c1">0.00758015550673008</span>, <span class="pl-ent">"Height"</span>: <span class="pl-c1">0.011477531865239143</span>, <span class="pl-ent">"Left"</span>: <span class="pl-c1">0.9904273152351379</span>, <span class="pl-ent">"Top"</span>: <span class="pl-c1">0.00909337680786848</span> }, <span class="pl-ent">"Polygon"</span>: [ { <span class="pl-ent">"X"</span>: <span class="pl-c1">0.9904273152351379</span>, <span class="pl-ent">"Y"</span>: <span class="pl-c1">0.00909337680786848</span> }, { <span class="pl-ent">"X"</span>: <span class="pl-c1">0.9980074763298035</span>, <span class="pl-ent">"Y"</span>: <span 
class="pl-c1">0.00909337680786848</span> }, { <span class="pl-ent">"X"</span>: <span class="pl-c1">0.9980074763298035</span>, <span class="pl-ent">"Y"</span>: <span class="pl-c1">0.0205709096044302</span> }, { <span class="pl-ent">"X"</span>: <span class="pl-c1">0.9904273152351379</span>, <span class="pl-ent">"Y"</span>: <span class="pl-c1">0.0205709096044302</span> } ] }, <span class="pl-ent">"Id"</span>: <span class="pl-s"><span class="pl-pds">"</span>6b04b8df-bec1-42d3-bfff-29f0edd38976<span class="pl-pds">"</span></span>, <span class="pl-ent">"Relationships"</span>: [ { <span class="pl-ent">"Type"</span>: <span class="pl-s"><span class="pl-pds">"</span>CHILD<span class="pl-pds">"</span></span>, <span class="pl-ent">"Ids"</span>: [ <span class="pl-s"><span class="pl-pds">"</span>58890ca7-5ed5-4b14-ad60-475e5d0dd79e<span class="pl-pds">"</span></span> ] } ], <span class="pl-ent">"Page"</span>: <span class="pl-c1">1</span> }</pre></div> <p>I found that joining together those lines on a <code>\n</code> gave me the results I needed:</p> <div class="highlight highlight-source-python"><pre><span class="pl-en">print</span>(<span class="pl-s">"<span class="pl-cce">\n</span>"</span>.<span class="pl-en">join</span>([<span class="pl-s1">block</span>[<span class="pl-s">"Text"</span>] <span class="pl-k">for</span> <span class="pl-s1">block</span> <span class="pl-c1">in</span> <span class="pl-s1">blocks</span> <span class="pl-k">if</span> <span class="pl-s1">block</span>[<span class="pl-s">"BlockType"</span>] <span class="pl-c1">==</span> <span class="pl-s">"LINE"</span>]))</pre></div> <p>Truncated output:</p> <pre><code>1 ORGANIZATION MEETING of the SAN FRANCISCO MICROSCOPICAL SOCIETY October 4, 1946 The meeting ws.s held at 8:00 P.M. on October 4, 1946, in the Auditorium of the San Francisco Department of Health, 101 Grove Street, San Francisco. Chairman George Herbert Needham called the audience of sixty- five persons to order. 
He told of the high aims, ideals, and fine fellow- ship enjoyed by the original society which was organized in 1870 and incor- porated in 1872, but which was dissolved following the San Francisco fire of 1906. He related his efforts to find a surviving member which finally resulted in a telegram of greeting from Dr. Kaspar Pischell of Ross, Cali- fornia, which read as follows: "BEST WISHES AT THIS REUNION. I AM SORRY I CANNOT BE WITH YOU." </code></pre>	<Binary: 70,060 bytes>	2022-06-28T12:32:43-07:00	2022-06-28T19:32:43+00:00	2022-06-30T15:48:16-07:00	2022-06-30T22:48:16+00:00	752af8acdf6d6838cef061e37bda9b59	ocr-pdf-textract