Google Vision: detect text in PDFs synchronously with PHP

26/03/2021
AI, Google, google vision, PHP, text detection, Uncategorized
Lorenzo

The Vision API now supports online (synchronous) small batch annotation of files (PDF, TIFF, GIF) for all features. The relevant documentation page is "Small batch file annotation online".

Let’s see how we can do this with PHP.

Context

With PHP >= 7.4, the Composer packages to require are:

google/cloud-vision
google/cloud-storage

Code

How to upload the file to the storage

Soon.
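
Until this section is filled in, here is a minimal sketch of the upload step, based on the snippet in the comments at the end of the article. It assumes google/cloud-storage is installed and that credentials are picked up from the environment; the helper names buildGcsUri and uploadPdf, the bucket name and the object name are all hypothetical.

```php
<?php
use Google\Cloud\Storage\StorageClient;

/**
 * Builds the gs:// URI that identifies an object in a bucket.
 * Hypothetical helper, not part of the Google packages.
 */
function buildGcsUri(string $bucketName, string $objectName): string
{
    return sprintf('gs://%s/%s', $bucketName, ltrim($objectName, '/'));
}

/**
 * Uploads a local PDF to a bucket and returns its gs:// URI.
 * Hypothetical helper; assumes credentials come from the environment.
 */
function uploadPdf(string $localPath, string $bucketName, string $objectName): string
{
    $storage = new StorageClient();
    $bucket = $storage->bucket($bucketName);

    // Pass a stream so the whole file does not have to be read into memory first.
    $handle = fopen($localPath, 'r');
    if ($handle === false) {
        throw new RuntimeException("Cannot open $localPath");
    }

    // 'name' is the object name (its full path) inside the bucket.
    $bucket->upload($handle, ['name' => ltrim($objectName, '/')]);

    return buildGcsUri($bucketName, $objectName);
}
```

For example, `$path = uploadPdf('/path/to/file.pdf', 'your-bucket-name', 'uploads/file.pdf');` would give a value that can be passed to `GcsSource::setUri()` in the next section.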

Text detection

Even with PDFs we are going to use ImageAnnotatorClient, the service that performs Google Cloud Vision API detection tasks over client images and returns detected entities from the images.

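/* Imports for the classes used below (namespaces as in google/cloud-vision 1.x) */
use Google\Cloud\Vision\V1\AnnotateFileRequest;
use Google\Cloud\Vision\V1\Feature;
use Google\Cloud\Vision\V1\Feature\Type;
use Google\Cloud\Vision\V1\GcsSource;
use Google\Cloud\Vision\V1\ImageAnnotatorClient;
use Google\Cloud\Vision\V1\ImageContext;
use Google\Cloud\Vision\V1\InputConfig;

$path = "gs://mystorage.com/path/to/my/file.pdf";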

/* If you know it, you can give a hint about the language of the document */
$context = new ImageContext();
$context->setLanguageHints(['it']);

/* Here's the annotator described before */
$imageAnnotator = new ImageAnnotatorClient();

/* We create an AnnotateFileRequest instance to annotate one single file */
$file_request = new AnnotateFileRequest();

/* We express our input file in terms of a GcsSource
instance that represents the Google Cloud Storage location */
$gcs_source = (new GcsSource())
    ->setUri($path);

/* Let's specify the feature we need. You can find the options below */
$feature = (new Feature())
    ->setType(Type::DOCUMENT_TEXT_DETECTION);

/* Let's specify the file info: a PDF in that location */
$input_config = (new InputConfig())
    ->setMimeType('application/pdf')
    ->setGcsSource($gcs_source);

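/* Some configurations: the language hint created above and the pages of the file to annotate.
Pages are 1-based, and a synchronous request accepts at most 5 pages. */
$file_request = $file_request->setInputConfig($input_config)
    ->setFeatures([$feature])
    ->setImageContext($context)
    ->setPages([1]);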

/* Annotate the files and get the responses making the synchronous batch request. */
$result = $imageAnnotator->batchAnnotateFiles([$file_request]);

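/* The batch response holds one entry per file, and each file response
holds one entry per page. We sent 1 file and asked for 1 page only. */
$file_responses = $result->getResponses();
$file_response = $file_responses->offsetGet(0);
$page_responses = $file_response->getResponses();
$res = $page_responses[0];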

/* Finally!!! The annotations! */
$annotations = $res->getFullTextAnnotation();

/* Clean up resources such as threads */
$imageAnnotator->close();
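
The object returned by getFullTextAnnotation() is a TextAnnotation, and the recognised text is available through its getText() method. The sketch below wraps that last step with two checks the snippet above skips: an error reported for the page, and a page with no text at all. The helper name extractFullText is hypothetical, and it deliberately has no class type hint, so it works with any object exposing the same getters.

```php
<?php
/**
 * Returns the full text of one page response, or '' when no text was found.
 * Hypothetical helper; $pageResponse is expected to behave like an
 * AnnotateImageResponse (getError() and getFullTextAnnotation()).
 */
function extractFullText(object $pageResponse): string
{
    // A failed page carries a status object instead of annotations.
    $error = $pageResponse->getError();
    if ($error !== null) {
        throw new RuntimeException('Vision API error: ' . $error->getMessage());
    }

    // The annotation is unset when the page contains no detectable text.
    $annotation = $pageResponse->getFullTextAnnotation();

    return $annotation === null ? '' : $annotation->getText();
}
```

With the variables from the snippet above, `echo extractFullText($res);` would print the text of the first page.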

Features

In your request you can set the type of annotation you want to perform on the file. You can check the reference or the features list documentation.

Some examples are:

Face detection
Landmark detection
Logo detection
Label detection
Text and document text detection
..
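
Since setFeatures() takes an array, more than one annotation type can be requested in the same call. A minimal sketch, assuming the Feature and Type classes live in the Google\Cloud\Vision\V1 namespace as in google/cloud-vision 1.x; the helper names buildFeatures and textAndLabelFeatures are hypothetical.

```php
<?php
use Google\Cloud\Vision\V1\Feature;
use Google\Cloud\Vision\V1\Feature\Type;

/**
 * Turns a list of Feature\Type constants into Feature objects,
 * dropping duplicates. Hypothetical helper.
 */
function buildFeatures(array $types): array
{
    return array_map(
        static function (int $type): Feature {
            return (new Feature())->setType($type);
        },
        array_values(array_unique($types))
    );
}

/**
 * Example combination: document text plus labels for the same pages.
 */
function textAndLabelFeatures(): array
{
    return buildFeatures([Type::DOCUMENT_TEXT_DETECTION, Type::LABEL_DETECTION]);
}
```

In the request from the Text detection section this would be used as `->setFeatures(textAndLabelFeatures())`; each page response then exposes the matching getters, such as getFullTextAnnotation() and getLabelAnnotations().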

Comments

sankar says:

05/12/2021 at 1:07 PM

If the PDF file is in a local path, how do you set the path in the InputConfig?

- Lorenzo says:
  
  07/12/2021 at 1:05 PM
  
  Hi Sankar, thanks for your reply.
  Last time I checked it was not possible to directly work on a local PDF.
  So the idea is to add an extra step in your code to upload the file to Google Storage, and use the storage file URL as shown in the article.
  
  I will add details into the specific section, but basically you can use the google/cloud-storage module to upload the file into your bucket.
  The setup steps are the same as the ones you already did for the Google Vision part, creating a service account and downloading the JSON key.
  
  After that you will have something like:
  $storage = new StorageClient(); // Here you can pass some info related to the key, if not automatically loaded from an env file or similar
  $bucket = $storage->bucket('your-bucket-name');
  $file_content = file_get_contents('/path/to/file.pdf'); // Or from the request
  $upload_result = $bucket->upload($file_content, ['name' => 'uploads/my-storage-directory/file.pdf']); // 'name' is the full object name, file name included
  
  And then you can proceed according to the article.
  Hopefully I will be able to complete the article soon.
  