-> now live https://bbb.constantvzw.org/b/ale-jzg-d3x

unbinding_pdfs

pdf's as pages
https://archive.leftove.rs/
everything is ocr'd, with a full-text search option
visually extract things?

https://gitlab.constantvzw.org/osp/tools.pdfutils 
a lot about colour conversion
Bolwerk project: extracting the fonts from the pdfs
created merged fonts on a directory level

restyling from original scans?
recontextualize the writing in the scans?
ocr is not good enough
how can you bring in lines/sections?

leftover.puscii.nl
another leftover project - a radio

digitizing pamphlet archive from artist library https://www.anarchistischecamping.nl/archief/
set up interesting work flow, but need more endurance
scan books - upload on server - ocr - fix files in MD - releases of books
-> is this workflow sharable? (documentation is lagging (as always) but anyone interested contact anice, can invite to git/nextcloud folders with scripts etc)

geometry of pdf
lines/boxes
what can you take from documents, if not the entire ocr text?
how recombine fragments/pages?
objects that are outside the cropbox/page, thus invisible on screen


cfr 'the book as object' in the Reader
describes the form of the book as an invention: text as a transformation of speech, lines broken up into series of characters
in the OSP workflow: starts from an html page, of infinite length
the problems they encounter are about reintroducing the elements of the book (pages, page numbers...)

if you make ocr, you are transforming the laid-out page back into text; you lose the visual layer
the hierarchy of the document
'being able to navigate': a book allows this, you know where a piece of information is

describe the structure of pdf?
a stream of information is the essence
different types of data (image/text), each with a position on a page
not necessarily any linearity: an object on page 2 can be at the end of the document
a freezing of text in a lay-out: the separation of lines is hard-coded, hard to extract
relations between texts are lost (ex. a header: you only record its visual aspects)
https://brendanzagaeski.appspot.com/0005.html
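The linked page walks through a hand-written minimal PDF. Roughly (a sketch from memory, omitting the cross-reference table and any content streams), the skeleton looks like:

```
%PDF-1.1
1 0 obj                               % object number 1, generation 0
<< /Type /Catalog /Pages 2 0 R >>    % the root: points to the page tree
endobj
2 0 obj
<< /Type /Pages /Kids [3 0 R] /Count 1 >>
endobj
3 0 obj
<< /Type /Page /Parent 2 0 R /MediaBox [0 0 300 144] >>
endobj
trailer
<< /Root 1 0 R >>
%%EOF
```

Each `n 0 obj ... endobj` is a numbered object, and `2 0 R` is a reference to object 2: this indirection is why content can live anywhere in the file, regardless of page order.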

pdf -> line -> back out
what kind of information can you extract from ocr?
text - metadata on the page / visualise it in some way

noise: odd characters in the index results, different characters with only 1 match
https://archive.leftove.rs/static/index/The%20Commoner/organizations_noise.txt
extra things the ocr generates that are not necessarily in the text
https://archive.leftove.rs/static/index/
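One cheap way to surface that noise: tokens that occur only once in an index are often OCR artifacts. A minimal sketch with plain whitespace tokenization (the actual leftove.rs indexes are surely built differently):

```python
from collections import Counter

def noise_candidates(text, max_count=1):
    """Return tokens occurring at most `max_count` times: likely OCR noise."""
    counts = Counter(text.lower().split())
    return sorted(t for t, c in counts.items() if c <= max_count)

# toy input, not the real index
sample = "rent strike rent strike r3nt str1ke"
print(noise_candidates(sample))  # → ['r3nt', 'str1ke']
```

Anything that survives repeated use across the corpus drops out; the hapaxes are what is left to review by hand.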

looked for different tactics used in the material
ex. rent strike 
this is limited; the next stage is to think about which search terms might be useful


parsing, trying to understand text, locations, concepts, strategies.. different layers of info... language info vs real world info


somehow about [pause] text as encoded information... in PDF there is the text itself; with OCR you can extract it and try to analyze what info is in the text... the visual aspect of PDF comes with its own layer of info

with algolit (?) we do experiments with text and code: how do algorithms that work with machine learning extract information from text... with one foot in both worlds, as an algolit member and an OSP member dealing with visuals/layout... both perspectives can be interesting... noise, where this gray zone is

scrapbooks, recombination of different elements in the collection. is there a concrete reason for doing that?
the rationale behind the archive is that it is part of our collection that has been digitized, also bringing in other sources... the british library, anarchist archives, the french ultra left... aggregating all of that

should be public but isn't because of digital rights management... or used to embellish the catalogue but not really there to circulate
ways of cutting across the collection, not necessarily who made it (author) but what they were doing (rent strikes)

recombination is also about how you make political propaganda now based on historical material?

this was interesting to us, work in progress

thinking in streams: PDF as a tube where different media are combined, with output in a page-like format... but very different from a book of leaflets separated by pages. searching/OCRing is looking at the text and less at the use of image vis-à-vis text, or at the column-based lay-out of a page... how can you approach this tube/flux of information, with so much info ordered differently?
how can you use information about placement and lay-out to go further than text-based search?
it is very much communicating on a visual level: you recognize the style of the 50s
-> would be interesting to explore further; there is geometric information, maybe there is an export function for elements of the lay-out
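There is such an export in mutool: `mutool draw -F stext file.pdf` emits an XML 'structured text' format with bounding boxes per block, line and character. A hedged sketch of pulling out block geometry with the standard library (the sample imitates the stext shape; real output has more attributes and nesting):

```python
import xml.etree.ElementTree as ET

def block_boxes(stext_xml):
    """Collect (x0, y0, x1, y1) for every <block> on every <page>."""
    root = ET.fromstring(stext_xml)
    return [tuple(float(v) for v in block.get("bbox").split())
            for page in root.iter("page")
            for block in page.iter("block")]

# imitation of mutool's stext output, heavily simplified
sample = """<document>
  <page width="595" height="842">
    <block bbox="56.8 72.1 538.5 96.3"></block>
    <block bbox="56.8 110.0 300.2 700.9"></block>
  </page>
</document>"""

print(block_boxes(sample))
```

From boxes like these you could compare column layouts across the archive without ever touching the OCR text.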

cfr reader, Georges Perec
surface of a text

'digital native term'
a translation of an analog object put together in a pdf
everything else you have to add by ocr'ing and adding metadata
if you transform html to pdf, a lazy browser will put the material in a raw form
the pdf will be very large, not recompiled, but also quite rich: a lot of extra information is kept
when it is scanned this is gone (?), maybe it is recognizing lay-out elements... (?)

scanning: jpgs are recombined in pdf
there might be some hierarchy in it?
any links to these tools?
commandline based

* * *

Inspecting PDFs.

«I think one of the main problem that I am having is that most of the pdfs we have in the archive are just an image with the OCRed layer so most of the interesting info is perhaps just coming from the layout info from the ocr... not sure if this is correct though»

-- Rosemary

https://stackoverflow.com/questions/3549541/best-tool-for-inspecting-pdf-files
More specifically (command line tools): https://stackoverflow.com/a/29474423

The first step often seems to be to decompress the PDF, turning it into
uncompressed text to make it easier to read; the post gives recipes for
qpdf, mutool and podofouncompress.

There's also a reference to PeePDF, a tool that allows you to explore the
structure of a PDF (but as far as I know not to edit it): https://eternal-todo.com/tools/peepdf-pdf-analysis-tool (I think I've
used this once).

## Notes of the tour

# pdfinfo, which comes with xpdf, allows extracting metadata from a PDF
# creator, producer, creationdate, number of pages, pagesize, whether there is javascript inside
# the version of the pdf
pdfinfo in/POSA0004.pdf

# using perl-image-exiftool
# print all available meta information (-a), sorted by groups (-G1):
# Wasn't too interesting ?
exiftool -a -G1 in/POSA0004.pdf

# ImageMagick's identify gives per-page info
# ImageMagick is a suite of software for manipulating and creating image files
# interesting infos:
# - page size
# - page color profile
# - color depth
# - crop box (?)
#
#
identify in/POSA0004.pdf

# PDF is a continuation of the PostScript format, a file format aimed at describing pages; it allows
# describing both vector and bitmap images.
# The commands encoded in a ps file can also be interpreted by a printer.
# PostScript is also a programming language: you can create loops etc.

# One of the downsides of PostScript is that you have to interpret / parse / process the full file before
# you can draw it.
# In PDF every page is its own object and can be rendered separately.
# There is an extensive index within the file to keep track of all the objects and their exact
# location (measured in bytes). It's therefore hard to manipulate but easy / efficient to render:
# meant to be a read-only file.
# The identify command gives more information on the objects within the PDF file, as pages are objects
# themselves within the format. It therefore supports different page sizes within the same file.
# If you'd compare it to a book, a PDF is more like a container or a bundle of pages.

# Example output of the command:

in/POSA0004.pdf[0] PDF 598x841 598x841+0+0 16-bit sRGB 1.01174MiB 0.000u 0:00.004
in/POSA0004.pdf[1] PDF 598x841 598x841+0+0 16-bit sRGB 884253B 0.000u 0:00.005
in/POSA0004.pdf[2] PDF 598x841 598x841+0+0 16-bit sRGB 1111670B 0.000u 0:00.004
in/POSA0004.pdf[3] PDF 598x841 598x841+0+0 16-bit sRGB 871156B 0.000u 0:00.004
in/POSA0004.pdf[4] PDF 598x841 598x841+0+0 16-bit sRGB 973653B 0.000u 0:00.004
in/POSA0004.pdf[5] PDF 598x841 598x841+0+0 16-bit sRGB 943283B 0.000u 0:00.003
in/POSA0004.pdf[6] PDF 598x841 598x841+0+0 16-bit sRGB 481543B 0.000u 0:00.003
in/POSA0004.pdf[7] PDF 598x841 598x841+0+0 16-bit sRGB 595810B 0.000u 0:00.003
in/POSA0004.pdf[8] PDF 598x841 598x841+0+0 16-bit sRGB 547664B 0.000u 0:00.003
in/POSA0004.pdf[9] PDF 598x841 598x841+0+0 16-bit sRGB 511713B 0.000u 0:00.002
in/POSA0004.pdf[10] PDF 598x841 598x841+0+0 16-bit sRGB 582433B 0.000u 0:00.002
in/POSA0004.pdf[11] PDF 598x841 598x841+0+0 16-bit sRGB 488997B 0.000u 0:00.002
in/POSA0004.pdf[12] PDF 598x841 598x841+0+0 16-bit sRGB 541984B 0.000u 0:00.002
in/POSA0004.pdf[13] PDF 598x841 598x841+0+0 16-bit sRGB 502929B 0.000u 0:00.002
in/POSA0004.pdf[14] PDF 598x841 598x841+0+0 16-bit sRGB 815684B 0.000u 0:00.001
in/POSA0004.pdf[15] PDF 598x841 598x841+0+0 16-bit sRGB 1.00767MiB 0.000u 0:00.001

first column: filename (with the page index in brackets)
second column: file type
third column: page size
fourth column: canvas geometry plus offset (bounding box / crop box?)
fifth column: color depth
sixth column: colorspace
seventh column: file size of the page's data
eighth column: user (CPU) time
ninth column: elapsed time

PDF allows concatenating / bundling files from very different sources and/or pipelines

# with the verbose flag, much more info!
# some pixel statistics that might give a clue about the content
identify -verbose in/POSA0004.pdf

# A lot of formatting options are available
# full  list at <https://imagemagick.org/script/identify.php>
# Below is an example to get the printing size
identify -format "%[fx:w/72] by %[fx:h/72] inches\n" in/POSA0004.pdf
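The `%[fx:w/72]` above divides by 72 because PDF user-space units default to points, 72 to the inch. The same arithmetic in Python, using the 598x841 page size from the identify output earlier:

```python
def points_to_inches(w_pt, h_pt, units_per_inch=72):
    """PDF user space defaults to 72 units (points) per inch."""
    return w_pt / units_per_inch, h_pt / units_per_inch

w, h = points_to_inches(598, 841)
print(round(w, 2), round(h, 2))  # → 8.31 11.68, i.e. roughly A4
```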

# Get metadata with PDFTK
# Here we dump to stdout using "-" instead of a file
# You can edit and update infos `pdftk Example.pdf update_info Metadata-output.txt output Example-new.pdf`
pdftk in/POSA0004.pdf  dump_data output -

The pdftk command gives similar results but allows you to output it to a regular file.

Mutool:
Fantastic tool by Artifex (Ghostscript).
https://www.mankier.com/1/mutool

It is possible with mutool to write scripts and manipulate PDFs (like reordering pages, changing cropbox values etc.)

a few examples from their website


peepdf ( https://eternal-todo.com/tools/peepdf-pdf-analysis-tool )
install: pip install peepdf==0.3.2
Was surprised there weren't many tools to explore a PDF. Also none of them seemed to be complete.

A similar surprise about the lack of editors.

PeePDF is more of a forensic tool to find malicious code in a PDF. 

peepdf -i in/A\ System\ partly\ revealed\ _02\ _..pdf

result of the tree command:
https://dpaste.org/QX1Y


There is a font object in the PDF to allow encoding the text that was put in there by the OCR application

Initially interested in the several boxes supported by pdf: mediabox, cropbox. It's possible
that not all of these boxes are available in every PDF.

You can extract fonts, scripts, icc profiles from the PDF.

The pdfimages tool allows extracting the image files within a PDF.

With fontforge you can extract the fonts which are encoded in the PDF. If the pdf is optimized,
though, the fonts may be subset: only the glyphs (characters) used in the document might be present in those fonts.

Commands to uncompress a PDF and turn it into plain text so you can open it in a text editor (they all do the same thing):
qpdf --qdf --object-streams=disable orig.pdf uncompressed-qpdf.pdf
mutool clean -d orig.pdf uncompressed-mutool.pdf
podofouncompress orig.pdf uncompressed-podofo.pdf

Linearizing a PDF: as PDF allows the same object to be used on different pages of the same
file, an object shared with the last page might sit at the start of the file, so the last page
also depends on the first. A linearized PDF removes these cross-linked objects and makes
the individual pages independent, but also a bit bigger (qpdf can do this:
qpdf --linearize orig.pdf linearized.pdf).

What's interesting to consider as well is that text isn't encoded as text but as a series
of precisely placed glyphs (individual characters). This allows a PDF to look the same on any
screen but removes the notion of a flowing 'text'.
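Inside a page's content stream this looks roughly like the sketch below (not taken from one of the archive's files): the text operators place runs of glyphs at explicit coordinates, and nothing records that the two runs form one phrase:

```
BT                        % begin text
/F1 11 Tf                 % select font F1 at 11 pt
1 0 0 1 56.8 720.0 Tm     % set the text matrix: place at (56.8, 720.0)
(Rent) Tj                 % show the glyphs for "Rent"
1 0 0 1 84.0 720.0 Tm     % jump to a new position
(strike) Tj               % show "strike" — no encoded link to "Rent"
ET                        % end text
```

Reassembling "Rent strike" from this is exactly the kind of guesswork that text extractors and OCR layers have to do.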

* * *

Trying to redefine the pdf mediabox to see if it is possible to grasp invisible (because in the margins) objects. Seems like it doesn't work with scribus documents (only objects that are inside the bleeds can be made visible)

* * * 

pdfinfo allows extracting metadata from a pdf. It comes with xpdf
Color track

clipping on focus points
https://www.prepressure.com/pdf/basics/page-boxes

Mutool is very interesting. https://www.mankier.com/1/mutool

It allows executing javascript through the run command, making it possible to inspect and manipulate pdf files with javascript, for example to get page boxes

Below is a script to clip out random 100 x 100 portions of the pdf

var pdf = new PDFDocument(scriptArgs[0]);
var n = pdf.countPages();
for (var i = 0; i < n; ++i) {
  var page = pdf.findPage(i);
  // uncomment to inspect the page boxes:
  // print(page.MediaBox);
  // print(page.CropBox);
  // print(page.TrimBox);
  // print(page.BleedBox);
  // assumes the MediaBox origin is [0, 0]
  var currentWidth = page.MediaBox[2];
  var currentHeight = page.MediaBox[3];
  // pick a random 100 x 100 window that fits inside the page
  var wcrop = Math.floor(Math.random() * (currentWidth - 100));
  var hcrop = Math.floor(Math.random() * (currentHeight - 100));
  page.MediaBox = page.CropBox = [wcrop, hcrop, wcrop + 100, hcrop + 100];
}
pdf.save(scriptArgs[1]);

To run, save the script in a file then execute it like so: 
mutool run [name_of_script.js] [pdf_to_read.pdf] [pdf_to_write.pdf]

* * *

[[0, [0, 0], [3508, 0], [3508, 5100], [0, 5100]]
[1, [814, 1004], [2670, 1004], [2670, 3558], [814, 3558]]
[2, [414, 876], [2966, 876], [2966, 1684], [414, 1684]]
[2, [444, 3530], [1504, 3530], [1504, 4284], [444, 4284]]
[2, [1828, 3528], [2944, 3528], [2944, 4219], [1828, 4219]]
[3, [2073, 168], [3158, 168], [3158, 1910], [2073, 1910]]
[3, [356, 150], [3158, 150], [3158, 944], [356, 944]]
[3, [434, 1200], [1533, 1200], [1533, 2020], [434, 2020]]
[3, [430, 2177], [1533, 2177], [1533, 2916], [430, 2916]]
[3, [2076, 2152], [3182, 2152], [3182, 3008], [2076, 3008]]
[3, [2086, 3138], [3197, 3138], [3197, 4870], [2086, 4870]]
[3, [3102, 3793], [3197, 3793], [3197, 3884], [3102, 3884]]
[3, [428, 3112], [1492, 3112], [1492, 3916], [428, 3916]]
[3, [360, 4100], [3174, 4100], [3174, 4870], [360, 4870]]
[4, [894, 1818], [1276, 1818], [1276, 2082], [894, 2082]]
[4, [902, 3442], [1286, 3442], [1286, 3694], [902, 3694]]
[4, [1244, 3538], [1286, 3538], [1286, 3586], [1244, 3586]]
[4, [902, 3694], [1056, 3694], [1056, 3724], [902, 3724]]
[4, [142, 1782], [790, 1782], [790, 5076], [142, 5076]]]

* * *

(Alex) Investigating icc profiles.

A lot of the mechanics is explained in a simple form here (in French): www.guide-gestion-des-couleurs.com

Color profiles are meant to remap colors across their different physical instantiations.

Trying to figure out how to manipulate profiles "by hand", I found this:
http://www.colorwiki.com/wiki/Stunt_Profiles


Link profiles: a direct remap of colors without going through a "device-independent" colorspace
http://www.colorwiki.com/wiki/Device_Link_Profile
https://linux.die.net/man/1/icclink

In later versions of Ghostscript, there are a lot of controls to change colorspaces (e.g. rgb to cmyk conversion)
https://www.ghostscript.com/doc/9.22/GS9_Color_Management.pdf

https://stackoverflow.com/questions/31591554/embed-icc-color-profile-in-pdf

https://colorlibrary.ch/about/about-color-library/

http://www.color.org/specification/ICC1v43_2010-12.pdf

* * *

OCR Layout analysis
Presentation with beautiful slides on layout analysis in Tesseract: https://tesseract-ocr.github.io/docs/das_tutorial2016/5LayoutAnalysis.pdf (page 6 and on)
Dataset for layout analysis: https://www.primaresearch.org/dataset/

Extract layout analysis: https://github.com/mauvilsa/tesseract-recognize

Installation notes, requirements I missed:
libtesseract-dev
libopencv-dev
libleptonica-dev
libgs-dev

Sample output from tesseract-recognize. Unfortunately it only stores the text regions.
<?xml version="1.0" encoding="utf-8"?>
<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15">
  <Metadata>
    <Creator>tesseract-recognize_v2020.01.10 tesseract_v4.0.0-beta.1 (PageXML 2020.02.10)</Creator>
    <Created>2020-06-05T08:10:24Z</Created>
    <LastChange>2020-06-05T08:10:38Z</LastChange>
    <Process started="2020-06-05T08:10:24Z" time="13.9768" tool="tesseract-recognize_v2020.01.10 tesseract_v4.0.0-beta.1"/>
  </Metadata>
  <Page imageFilename="img-002.png" imageHeight="2550" imageWidth="1755">
    <Property key="readingDirection" value="left-to-right"/>
    <Property key="textLineOrder" value="top-to-bottom"/>
    <Property key="deskewAngle" value="0.00474983"/>
    <TextRegion id="b1">
      <Coords points="1016,49 1529,49 1529,68 1016,68"/>
    </TextRegion>
    <TextRegion id="b2">
      <Coords points="96,98 1575,98 1575,370 96,370"/>
    </TextRegion>
    <TextRegion id="b3">
      <Coords points="190,886 775,886 775,1750 190,1750"/>
    </TextRegion>
    <TextRegion id="b4">
      <Coords points="188,2156 771,2156 771,2511 188,2511"/>
    </TextRegion>
    <TextRegion id="b5">
      <Coords points="902,881 1488,881 1488,1751 902,1751"/>
    </TextRegion>
    <TextRegion id="b6">
      <Coords points="911,2151 1492,2151 1492,2508 911,2508"/>
    </TextRegion>
  </Page>
</PcGts>
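The `points` attribute is just space-separated `x,y` pairs, so region polygons can be pulled out with the standard library. A sketch, exercised on a stripped-down PAGE fragment rather than real tesseract-recognize output:

```python
import xml.etree.ElementTree as ET

# ElementTree needs the PAGE namespace spelled out in Clark notation
NS = "{http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15}"

def region_polygons(page_xml):
    """Map TextRegion id -> list of (x, y) corner points."""
    root = ET.fromstring(page_xml)
    return {
        region.get("id"): [tuple(int(v) for v in pt.split(","))
                           for pt in region.find(NS + "Coords").get("points").split()]
        for region in root.iter(NS + "TextRegion")
    }

# stripped-down PAGE fragment
sample = """<PcGts xmlns="http://schema.primaresearch.org/PAGE/gts/pagecontent/2013-07-15">
  <Page>
    <TextRegion id="b1"><Coords points="1016,49 1529,49 1529,68 1016,68"/></TextRegion>
  </Page>
</PcGts>"""

print(region_polygons(sample))
```

With the polygons in hand you can draw the detected regions back onto the scan, or search by layout (e.g. pages with two columns of text blocks).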

Internally tesseract seems very precise in the boxes it recognizes
0 PT_UNKNOWN,        // Type is not yet known. Keep as the first element.
1 PT_FLOWING_TEXT,   // Text that lives inside a column.
2 PT_HEADING_TEXT,   // Text that spans more than one column.
3 PT_PULLOUT_TEXT,   // Text that is in a cross-column pull-out region.
4 PT_EQUATION,       // Partition belonging to an equation region.
5 PT_INLINE_EQUATION,  // Partition has inline equation.
6 PT_TABLE,          // Partition belonging to a table region.
7 PT_VERTICAL_TEXT,  // Text-line runs vertically.
8 PT_CAPTION_TEXT,   // Text that belongs to an image.
9 PT_FLOWING_IMAGE,  // Image that lives inside a column.
10 PT_HEADING_IMAGE,  // Image that spans more than one column.
11 PT_PULLOUT_IMAGE,  // Image that is in a cross-column pull-out region.
12 PT_HORZ_LINE,      // Horizontal Line.
13 PT_VERT_LINE,      // Vertical Line.
14 PT_NOISE,          // Lies outside of any column.

Tools for visual document analysis: http://www.primaresearch.org/tools
On the PAGE (Page Analysis and Ground-truth Elements) format: https://www.primaresearch.org/www/assets/papers/ICPR2010_Pletschacher_PAGE.pdf

* * *

Outcome of recropping: https://cloud.constantvzw.org/s/7Xf3ArjGp2HJXCA?path=%2FRecropped%20PDF%27s

* * *

http://scantailor.org/

Presentation on what regions Tesseract should be able to detect
https://static.googleusercontent.com/media/research.google.com/en//pubs/archive/35522.pdf