Have a look at the hOCR format, which Tesseract can produce.

https://github.com/tesseract-ocr

There are also GUI programs that wrap Tesseract in a graphical interface, e.g. gscan2pdf.
It can also create hOCR, which is highly structured HTML.
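
To give an idea of what "structured" means here, below is a rough sketch of pulling words and their bounding boxes out of hOCR using only the Python standard library. The sample markup is hand-written for illustration, not real Tesseract output, and the names `SAMPLE_HOCR`, `parse_bbox`, `HocrWordParser` and `extract_words` are made up for this example. It assumes each word sits in a `span` with class `ocrx_word` and carries a `title` like `bbox x0 y0 x1 y1; x_wconf NN`, which is the usual hOCR convention.

```python
from html.parser import HTMLParser

# Hand-written sample in the hOCR style (an assumption for illustration,
# not captured from a real OCR run).
SAMPLE_HOCR = """
<div class='ocr_page' title='bbox 0 0 640 480'>
 <span class='ocr_line' title='bbox 10 20 200 50'>
  <span class='ocrx_word' title='bbox 10 20 90 50; x_wconf 96'>Hello</span>
  <span class='ocrx_word' title='bbox 100 20 200 50; x_wconf 91'>world</span>
 </span>
</div>
"""


def parse_bbox(title):
    """Return (x0, y0, x1, y1) from an hOCR title attribute, or None."""
    for field in title.split(";"):
        parts = field.split()
        if len(parts) == 5 and parts[0] == "bbox":
            return tuple(int(p) for p in parts[1:])
    return None


class HocrWordParser(HTMLParser):
    """Collects (text, bbox) pairs for every ocrx_word span."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._in_word = False
        self._bbox = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "span" and "ocrx_word" in classes:
            self._in_word = True
            self._bbox = parse_bbox(attrs.get("title") or "")
            self._text = []

    def handle_data(self, data):
        # Only keep text that appears inside a word span.
        if self._in_word:
            self._text.append(data)

    def handle_endtag(self, tag):
        # Word spans are assumed not to contain nested spans.
        if self._in_word and tag == "span":
            self.words.append(("".join(self._text).strip(), self._bbox))
            self._in_word = False


def extract_words(hocr):
    parser = HocrWordParser()
    parser.feed(hocr)
    return parser.words


if __name__ == "__main__":
    for text, bbox in extract_words(SAMPLE_HOCR):
        print(text, bbox)
```

For a real file you would read the hOCR output into a string and pass it to `extract_words` instead of the sample. A proper HTML/XML library would be more robust; this just shows that the layout data is sitting in ordinary attributes.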