Public Domain
http://www.constantvzw.org
An Mertens, Femke Snelting
Public Domain Day Brussels
http://www.constantvzw.org/site/Happy-Publiek-Domein-2014.html
7-2-2015
Public Domain Day Communia
http://publicdomainday.org/
Legislation: 'what is the public domain'
http://economie.fgov.be/nl/ondernemingen/Intellectuele_Eigendom/auteursrecht/Bescherming_door_auteursrecht/publiek_domein/#.VEfOtYXl2Bu
Death of the Authors: why the interest, and how it is used
http://publicdomainday.constantvzw.org/
Article on the Europeana launch in 2010: the big promise of 'public domain' material
http://www.vlaamse-erfgoedbibliotheek.be/nieuws/2010/10/1440-europeana-gebruikt-merkteken-creative-commons-publiek-domein
A reality check ...
Europeana: reality check for Paul Van Ostaijen, 'Krities Proza', from 1929 (An's book)
http://www.europeana.eu/portal/search.html?query=paul+van+ostaijen&rows=24
DBNL: all the Europeana links point to the same scans of 'Verzameld werk', Paul Van Ostaijen, 1996, published by Bert Bakker
http://www.dbnl.org/tekst/osta002verz02_01/index.php
Krities Proza is listed on the author page: http://www.dbnl.org/auteurs/auteur.php?id=osta002
DBNL example: Music Hall
http://www.dbnl.org/titels/titel.php?id=osta002musi01
primary texts:
- Verzameld werk 1996: scans
- Gedichten 1927: TEXT (phew, finally text!)
Belgica
http://belgica.kbr.be/nl/accueil_nl.html
digital library of the KBR: nothing
archive.org
https://archive.org/
our first stop for digital media
wayback machine!
nothing
Project Gutenberg
http://www.gutenberg.org/
what is it?
what won't you find? Small language areas (Gutenberg total 47,028 EN: FR + NL: 'more than 50'), lesser-known authors
- 1 work: Bezette stad, with the drawings by Oskar Jespers
- again only the scans
Google Books
https://www.google.be/search?q=paul+van+ostaijen+krities+proza&btnG=Boeken+zoeken&tbm=bks&tbo=1&hl=nl&gws_rd=ssl
yes!
ooh, only a thumbnail...
Scanning
What if I, stubborn artist, still want to work with 'Krities Proza'?
SCANNING: how?
video scan_livre
video abattre_livre (the owner gave permission to put the book under the axe, OK?)
cf. interview for Verbindingsprotocollen: http://constantvzw.org/verlag/spip.php?page=article&id_article=136&mot_filtre=2&id_lang=0
http://hackerspace.be/scanbot
Home-made solution
OCR
http://nl.wikipedia.org/wiki/Optical_character_recognition
- why do we want text?
- how do we get from scan to text?
Let's get started!
- example of a scanned page (An scans it in advance)
- OCR with gscan2pdf (default low-res OCR)
tesseract:
$ pdftk scannedbook.pdf burst output page%02d.pdf (prepared in advance)
$ convert -units pixelsperinch -density 300x300 -colorspace Gray -depth 8 page01.pdf page01.tif
$ convert page01.tif -normalize +dither -monochrome page01_bw.tif
$ sudo apt-get install tesseract-ocr-nld
$ tesseract page01_bw.tif pgocr -l nld
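To push a whole book through these steps rather than a single page, the commands can be generated per page. A minimal sketch in Python, assuming the pages have already been split into page01.pdf, page02.pdf, and so on. The function name `ocr_commands` and the `_bw` and `_ocr` suffixes are made up for illustration, contrast is stretched before thresholding (an assumption about the intended order), and the commands are only built here, not executed.

```python
from pathlib import Path


def ocr_commands(page_pdf, lang="nld"):
    """Return the commands (as argument lists) that turn one page PDF into text.

    Nothing is executed here; pass each list to subprocess.run() to run it.
    """
    stem = Path(page_pdf).stem      # "page01.pdf" -> "page01"
    gray_tif = f"{stem}.tif"
    bw_tif = f"{stem}_bw.tif"       # hypothetical name for the black-and-white copy
    return [
        # render the PDF page as a 300 dpi, 8-bit grayscale TIFF
        ["convert", "-units", "pixelsperinch", "-density", "300x300",
         "-colorspace", "Gray", "-depth", "8", page_pdf, gray_tif],
        # stretch the contrast, then reduce to pure black and white without dithering
        ["convert", gray_tif, "-normalize", "+dither", "-monochrome", bw_tif],
        # recognise the text with the chosen language data (Dutch by default)
        ["tesseract", bw_tif, f"{stem}_ocr", "-l", lang],
    ]


if __name__ == "__main__":
    for command in ocr_commands("page01.pdf"):
        print(" ".join(command))
```

Building the commands as lists keeps file names with spaces intact when they are later handed to `subprocess.run()`.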
what just happened? how does OCR work
An explains the principle of how it works: page analysis, letters, comparison, reassembling the word? facsimile [5 min]
Femke talks about the political reality of the databases behind the OCR programs [5 min]
Distributed proofreading: http://www.pgdp.net/c/
http://www.pgdp.net/c/faq/proofreading_guidelines_dutch.php
Exercise
http://pad.constantvzw.org/p/PULSE01
http://pad.constantvzw.org/p/PULSE02
http://pad.constantvzw.org/p/PULSE03
http://pad.constantvzw.org/p/PULSE04
- OCR output in 4 different pads (prepared in advance) [5 min]
- correct and bring back together into 1 document [10 min]
- make the text document downloadable somewhere -> archive.org 1060Brussel [5 min]
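Bringing the four corrected pads back into one document could be scripted along these lines. This is only a sketch: it assumes the pads run on Etherpad and offer its plain-text export route `/export/txt`, and the function names `export_url` and `merge_pads` are made up. Downloading is left out so the example runs without a network connection.

```python
PAD_BASE = "http://pad.constantvzw.org/p/"
PAD_NAMES = ["PULSE01", "PULSE02", "PULSE03", "PULSE04"]


def export_url(pad_name):
    """Plain-text export address of a pad (assumes Etherpad's /export/txt route)."""
    return f"{PAD_BASE}{pad_name}/export/txt"


def merge_pads(texts):
    """Join the corrected pad texts, in order, into one document.

    Each part is stripped of surrounding whitespace, empty parts are
    skipped, and the remaining parts are separated by one empty line.
    """
    parts = [text.strip() for text in texts]
    return "\n\n".join(part for part in parts if part) + "\n"


if __name__ == "__main__":
    for name in PAD_NAMES:
        print(export_url(name))
    # stand-in texts; in the workshop these would be the four corrected pads
    print(merge_pads(["page one\n", "page two\n\n", "", "page four"]))
```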
////////////////////////////////////////////////////////////////////////////////////////////
NOTES
Intro: Constant [5 min]
Public Domain Day [15 min]
Material in the Public Domain [30 min]
- How do you find that material (we concentrate on books, but also photos, films ...)
Michael Hart founded Project Gutenberg in 1971. His idea was: anything that can be entered into a computer can be reproduced indefinitely. This led to the concept of entering books into computers and sharing these books with the whole world. These Electronic Texts (E-texts) would be made available in the simplest, easiest to use form. This means "Plain Vanilla ASCII." Italics, underlines, and bolds would be converted to ASCII. In the same vein, the books selected would be those that appealed to the greatest number of people possible. Due to copyright laws, it is only legal to do this with older books (in general, copyrighted before 1923). As a result, Project Gutenberg is mostly comprised of the "Classics." [In 2004, we average 300-400 proofreaders participating each day from countries all over the world, and we finish 4000-7000 pages per day. That's about 4 pages every minute of every day!]
http://www.pgdp.net/c/
If you really catch Distributed Proofreading fever, you may want to become a Project Manager. Project managers mainly shepherd a project ("book") through the uploading, proofreading and post-processing processes on this website. Sometimes they do most of the tasks themselves; sometimes they coordinate others who are working on the tasks. You can also donate books (Public Domain) by shipping them to us for scanning (better if they do not need to be returned). You can also scan the books and send us the images (best if you want to keep the book)
- What won't you find
- Small language areas (Gutenberg total 47,028 EN: FR + NL: 'more than 50'), lesser-known authors
- What can you do about it
- For recent editions (digitally typeset): contact the publisher
- Scan it yourself!
- Why would you do that (discussion, questions)
Exercise [40 min]
- Show how to scan (we have the book) -> OCR (different scanners: industrial vs DIY) [5 min]
- why OCR? Digital books; images are not enough
- e.g. OCR with gscan2pdf (default low-res OCR)
- [An] the principle of how it works: page analysis, letters, comparison, reassembling the word? facsimile [5 min]
- There are two basic types of core OCR algorithm, which may produce a ranked list of candidate characters.[12]
Matrix matching involves comparing an image to a stored glyph on a pixel-by-pixel basis; it is also known as "pattern matching" or "pattern recognition".[9] This relies on the input glyph being correctly isolated from the rest of the image, and on the stored glyph being in a similar font and at the same scale. This technique works best with typewritten text and does not work well when new fonts are encountered. This is the technique the early physical photocell-based OCR implemented, rather directly.
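A toy version of matrix matching, to make the pixel-by-pixel comparison concrete. The 3x5 bitmaps and the names `TEMPLATES`, `pixel_agreement` and `rank_candidates` are made up for this sketch; real stored glyphs are far larger, and the input is assumed to be already isolated and scaled to the template size.

```python
# Tiny made-up "stored glyphs": '#' is ink, '.' is paper.
TEMPLATES = {
    "I": ["###",
          ".#.",
          ".#.",
          ".#.",
          "###"],
    "L": ["#..",
          "#..",
          "#..",
          "#..",
          "###"],
    "T": ["###",
          ".#.",
          ".#.",
          ".#.",
          ".#."],
}


def pixel_agreement(glyph, template):
    """Fraction of pixels that are identical in two same-sized bitmaps."""
    pairs = [(g, t) for g_row, t_row in zip(glyph, template)
             for g, t in zip(g_row, t_row)]
    return sum(g == t for g, t in pairs) / len(pairs)


def rank_candidates(glyph, templates=TEMPLATES):
    """Return (character, score) pairs, best match first."""
    scores = [(char, pixel_agreement(glyph, tpl)) for char, tpl in templates.items()]
    return sorted(scores, key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    scanned = ["###",
               ".#.",
               ".##",   # one speck of noise on an otherwise clean "T"
               ".#.",
               ".#."]
    print(rank_candidates(scanned))
```

The result is a ranked list of candidate characters. A glyph in another font or at another scale would score poorly against every template, which is the weakness described above.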
Feature extraction decomposes glyphs into "features" like lines, closed loops, line direction, and line intersections. These are compared with an abstract vector-like representation of a character, which might reduce to one or more glyph prototypes. General techniques of feature detection in computer vision are applicable to this type of OCR, which is commonly seen in "intelligent" handwriting recognition and indeed most modern OCR software.[8] Nearest neighbour classifiers such as the k-nearest neighbors algorithm are used to compare image features with stored glyph features and choose the nearest match.[13]
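A small sketch of the feature-based approach, using a nearest-neighbour classifier with k = 1. The four features chosen here (share of fully inked rows, share of fully inked columns, and the centre of the ink in both directions) are an assumption made for illustration, as are the prototype bitmaps; real engines use much richer features. Every glyph is assumed to contain at least one ink pixel and to be larger than one pixel in each direction.

```python
import math

# Made-up 3x5 prototype bitmaps, one per character: '#' is ink, '.' is paper.
PROTOTYPES = {
    "I": ["###", ".#.", ".#.", ".#.", "###"],
    "L": ["#..", "#..", "#..", "#..", "###"],
    "T": ["###", ".#.", ".#.", ".#.", ".#."],
}


def extract_features(glyph):
    """Describe a bitmap by four size-independent numbers between 0 and 1."""
    height, width = len(glyph), len(glyph[0])
    ink = [(r, c) for r, row in enumerate(glyph)
           for c, pixel in enumerate(row) if pixel == "#"]
    full_rows = sum(row.count("#") == width for row in glyph)
    full_cols = sum(all(row[c] == "#" for row in glyph) for c in range(width))
    centre_y = sum(r for r, _ in ink) / len(ink) / (height - 1)
    centre_x = sum(c for _, c in ink) / len(ink) / (width - 1)
    return (full_rows / height, full_cols / width, centre_y, centre_x)


def classify(glyph, prototypes=PROTOTYPES):
    """1-nearest-neighbour: the prototype whose features lie closest wins."""
    features = extract_features(glyph)
    return min(prototypes,
               key=lambda char: math.dist(features, extract_features(prototypes[char])))


if __name__ == "__main__":
    # a "T" scanned at twice the size of the prototypes
    big_t = ["######", "######"] + ["..##.."] * 8
    print(classify(big_t))
```

Because the features are proportions rather than raw pixels, the larger glyph can still be compared with the small prototypes, something a pixel-by-pixel comparison cannot do without rescaling first.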
- Software such as Cuneiform and Tesseract use a two-pass approach to character recognition. The second pass is known as "adaptive recognition" and uses the letter shapes recognized with high confidence on the first pass to better recognize the remaining letters on the second pass. This is advantageous for unusual fonts or low-quality scans where the font is distorted (e.g. blurred or faded).[11]
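The two-pass idea can be imitated in a few lines. This is a toy illustration, not how Cuneiform or Tesseract implement it: glyphs recognised with a high score on the first pass are added as extra templates, and only the uncertain glyphs are matched again. The bitmaps, the 0.9 threshold and the function names are assumptions.

```python
def agreement(a, b):
    """Fraction of identical pixels in two same-sized bitmaps."""
    pairs = [(x, y) for row_a, row_b in zip(a, b) for x, y in zip(row_a, row_b)]
    return sum(x == y for x, y in pairs) / len(pairs)


def best_match(glyph, templates):
    """templates is a list of (character, bitmap); return (character, score)."""
    return max(((char, agreement(glyph, bitmap)) for char, bitmap in templates),
               key=lambda item: item[1])


def two_pass(glyphs, templates, confident=0.9):
    """Recognise glyphs twice, learning letter shapes from the confident ones."""
    first = [best_match(glyph, templates) for glyph in glyphs]
    # shapes recognised with high confidence become additional templates
    learned = [(char, glyph) for glyph, (char, score) in zip(glyphs, first)
               if score >= confident]
    adapted = list(templates) + learned
    # confident results are kept, the rest are matched against the adapted set
    return [char if score >= confident else best_match(glyph, adapted)[0]
            for glyph, (char, score) in zip(glyphs, first)]


if __name__ == "__main__":
    generic = [("I", ["###", ".#.", ".#.", ".#.", "###"]),
               ("T", ["###", ".#.", ".#.", ".#.", ".#."])]
    page_t = ["###", ".#.", ".#.", "##.", ".#."]     # this book's "T" has a small spur
    blurred_t = ["###", ".#.", ".#.", "##.", "##."]  # same letter, smudged at the foot
    print(two_pass([page_t, blurred_t], generic))
```

On the first pass the smudged glyph is equally close to the generic "I" and "T"; after the clean "T" from the same page has been learned, it resolves to "T".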
- Post-processing
- OCR accuracy can be increased if the output is constrained by a lexicon – a list of words that are allowed to occur in a document.[7] This might be, for example, all the words in the English language, or a more technical lexicon for a specific field. This technique can be problematic if the document contains words not in the lexicon, like proper nouns. Tesseract uses its dictionary to influence the character segmentation step, for improved accuracy.[11]
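A minimal sketch of lexicon-constrained correction as a post-processing step. The six-word lexicon and the 0.75 similarity cutoff are assumptions, and Python's `difflib` stands in for whatever similarity measure a real engine uses; it is not how Tesseract applies its dictionary.

```python
import difflib

# A tiny stand-in lexicon; a real one would hold tens of thousands of words.
LEXICON = ["bezette", "stad", "gedichten", "music", "hall", "proza"]


def constrain_to_lexicon(word, lexicon=LEXICON, cutoff=0.75):
    """Replace an OCR'd word by its closest lexicon entry, if one is close enough.

    Words with no close match (proper nouns, for instance) come back unchanged.
    """
    if word.lower() in lexicon:
        return word
    matches = difflib.get_close_matches(word.lower(), lexicon, n=1, cutoff=cutoff)
    return matches[0] if matches else word


if __name__ == "__main__":
    for token in ["bezefte", "stad", "Ostaijen"]:
        print(token, "->", constrain_to_lexicon(token))
```

Leaving unmatched words alone is one way around the proper-noun problem mentioned above; the price is that misread words resembling nothing in the lexicon also pass through uncorrected.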
- The output stream may be a plain text stream or file of characters, but more sophisticated OCR systems can preserve the original layout of the page and produce, for example, an annotated PDF that includes both the original image of the page and a searchable textual representation.
"Near-neighbor analysis" can make use of co-occurrence frequencies to correct errors, by noting that certain words are often seen together.[9] For example, "Washington, D.C." is generally far more common in English than "Washington DOC".
- Knowledge of the grammar of the language being scanned can also help determine if a word is likely to be a verb or a noun, for example, allowing greater accuracy.
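The co-occurrence idea behind near-neighbor analysis can be sketched as follows. The bigram counts are invented for the example; in practice they would be gathered from a large body of text in the language being scanned. The function name `pick_by_cooccurrence` is likewise made up.

```python
from collections import Counter

# Made-up counts of word pairs, standing in for statistics from a large corpus.
BIGRAM_COUNTS = Counter({
    ("washington", "d.c."): 120,
    ("washington", "doc"): 1,
    ("music", "hall"): 40,
    ("music", "hail"): 2,
})


def pick_by_cooccurrence(previous_word, candidates, counts=BIGRAM_COUNTS):
    """Choose the candidate that most often follows previous_word in the counts.

    Ties, including the case where no candidate was ever seen, go to the
    first candidate, i.e. the OCR engine's own first choice.
    """
    return max(candidates,
               key=lambda word: counts[(previous_word.lower(), word.lower())])


if __name__ == "__main__":
    print(pick_by_cooccurrence("Washington", ["DOC", "D.C."]))
    print(pick_by_cooccurrence("Music", ["Hail", "Hall"]))
```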
- [FS] the political reality of the databases behind the OCR programs [5 min]
- explanation & example script commands [5 min]
- $ pdftk scannedbook.pdf burst output page%02d.pdf (prepared in advance)
- $ convert -units pixelsperinch -density 300x300 -colorspace Gray -depth 8 page01.pdf page01.tif
- $ convert page01.tif -normalize +dither -monochrome page01_bw.tif
- $ sudo apt-get install tesseract-ocr-nld
- $ tesseract page01_bw.tif pgocr -l nld
- OCR output in 4 different pads (prepared in advance) [5 min]
- correct and bring back together into 1 document [10 min]
- make the text document downloadable somewhere -> archive.org 1060Brussel [5 min]
Print the guidelines!! Or look in Project Gutenberg ---> find the Dutch ones and print them!
Distributed proofreading: http://www.pgdp.net/c/
http://www.pgdp.net/c/faq/proofreading_guidelines_dutch.php
During proofreading, volunteers are presented with a scanned page image and the corresponding OCR text on a single web page. This allows the text to be easily compared to the image, proofread, and sent back to the site. A second volunteer is then presented with the first volunteer's work and the same page image, verifies and corrects the work as necessary, and submits it back to the site. The book then similarly progresses through two formatting rounds using the same web interface.
Additions from the music sector:
- open source music notation software : www.musescore.org
- platform for exchanging sheet music : www.pianofiles.com
- digitisation of the Biblioteca Vaticana : www.alamirefoundation.org
more info via : Stef Coninx stef@muziekcentrum.be