Have a look at the hOCR format, which Tesseract (https://github.com/tesseract-ocr) can produce. There are also GUI programs that wrap Tesseract in a graphical interface, e.g. gscan2pdf, which can also create hOCR. hOCR is highly structured HTML.
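To give an idea of what "highly structured" means in practice, here is a rough sketch that pulls the words and their bounding boxes out of an hOCR snippet using only the Python standard library. The sample markup is hand-written for illustration (not real Tesseract output), and the names `SAMPLE_HOCR`, `parse_bbox`, `HocrWordParser` and `extract_words` are made up for this example. It assumes words are marked with the `ocrx_word` class and carry a `bbox x0 y0 x1 y1` entry in their `title` attribute, which is how hOCR files usually look.

```python
from html.parser import HTMLParser

# Hand-written sample for illustration only; real files are much longer.
SAMPLE_HOCR = """
<div class='ocr_page' title='bbox 0 0 600 800'>
 <span class='ocr_line' title='bbox 10 10 300 40'>
  <span class='ocrx_word' title='bbox 10 10 90 40; x_wconf 96'>Hello</span>
  <span class='ocrx_word' title='bbox 100 10 200 40; x_wconf 91'>world</span>
 </span>
</div>
"""


def parse_bbox(title):
    """Return (x0, y0, x1, y1) from an hOCR title string, or None if absent."""
    for field in title.split(";"):
        parts = field.split()
        if len(parts) == 5 and parts[0] == "bbox":
            return tuple(int(p) for p in parts[1:])
    return None


class HocrWordParser(HTMLParser):
    """Collects (text, bbox) pairs for every element with class ocrx_word."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._in_word = False
        self._bbox = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "span" and "ocrx_word" in classes:
            self._in_word = True
            self._bbox = parse_bbox(attrs.get("title") or "")
            self._text = []

    def handle_data(self, data):
        # Only keep text that sits inside a word element.
        if self._in_word:
            self._text.append(data)

    def handle_endtag(self, tag):
        # Assumes word spans do not contain nested spans.
        if tag == "span" and self._in_word:
            self.words.append(("".join(self._text).strip(), self._bbox))
            self._in_word = False


def extract_words(hocr_text):
    parser = HocrWordParser()
    parser.feed(hocr_text)
    return parser.words


if __name__ == "__main__":
    for text, bbox in extract_words(SAMPLE_HOCR):
        print(text, bbox)
```

To try it on a real file, read the file into a string and pass it to `extract_words`. With the Tesseract command line, something like `tesseract scan.png out hocr` should write the hOCR output, though the output file extension may differ between versions.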