I've been looking at the hOCR format.
https://github.com/tesseract-ocr
There are also GUI programs that wrap tesseract in a graphical interface, e.g. gscan2pdf.
It can also create hOCR.
hOCR is highly structured HTML.
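
To give an idea of what "structured" means here: in hOCR output, recognized words usually sit in elements with class `ocrx_word`, and the `title` attribute carries properties such as `bbox x0 y0 x1 y1`. Below is a rough sketch of pulling words and their boxes out with only the Python standard library. The `SAMPLE` markup is made up for illustration (not real tesseract output), and the names `HocrWordParser`, `parse_bbox` and `extract_words` are my own, not part of any library.

```python
from html.parser import HTMLParser

# Hand-written hOCR-like fragment, for illustration only.
SAMPLE = """<div class='ocr_page' title='bbox 0 0 640 480'>
 <span class='ocr_line' title='bbox 10 20 300 50'>
  <span class='ocrx_word' title='bbox 10 20 90 50; x_wconf 96'>Hello</span>
  <span class='ocrx_word' title='bbox 100 20 200 50; x_wconf 91'><strong>world</strong></span>
 </span>
</div>"""


def parse_bbox(title):
    """Return (x0, y0, x1, y1) from an hOCR title string, or None if absent."""
    for prop in title.split(";"):
        parts = prop.split()
        if len(parts) == 5 and parts[0] == "bbox":
            return tuple(int(p) for p in parts[1:])
    return None


class HocrWordParser(HTMLParser):
    """Collects (text, bbox) pairs for every element with class ocrx_word."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._depth = 0      # > 0 while we are inside an ocrx_word element
        self._bbox = None
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if self._depth:
            # Nested markup inside a word (e.g. <strong>): just track depth.
            self._depth += 1
            return
        attr = dict(attrs)
        if "ocrx_word" in (attr.get("class") or "").split():
            self._depth = 1
            self._bbox = parse_bbox(attr.get("title") or "")
            self._chunks = []

    def handle_endtag(self, tag):
        if self._depth:
            self._depth -= 1
            if self._depth == 0:
                text = "".join(self._chunks).strip()
                self.words.append((text, self._bbox))

    def handle_data(self, data):
        if self._depth:
            self._chunks.append(data)


def extract_words(hocr):
    """Parse an hOCR string and return a list of (text, bbox) tuples."""
    parser = HocrWordParser()
    parser.feed(hocr)
    parser.close()
    return parser.words


if __name__ == "__main__":
    for text, bbox in extract_words(SAMPLE):
        print(text, bbox)
```

For a real file you would read the output of something like `tesseract page.png out hocr` and pass its contents to `extract_words`. Real hOCR may carry more properties in `title` than this sketch handles.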