writing_with_film

*****************************
********** README ***********
*****************************

writing-with-film
=================

NOTE: all scripts are written in Python 2.7, because the Pattern library (Pattern 2.6) does not support Python 3.

First install the dependencies:

>>> sudo apt-get install mongodb libsphinxbase1 swig

MongoDB: the database we use to store our words. https://www.mongodb.org/
PocketSphinx: a lightweight speech recognition engine that we use to create subtitle files through speech detection. http://cmusphinx.sourceforge.net/
Swig: the Simplified Wrapper and Interface Generator connects programs or libraries written in C or C++ with scripting languages. PocketSphinx needs it as a dependency. http://swig.org/

* virtual environment *

A virtual environment is a way to work in a 'closed' environment when working on a Python project.
It lets you install the Python packages a project needs without them conflicting with packages that are already installed on your system.

>>> virtualenv venv
>>> . venv/bin/activate

You can then install the requirements, which stay inside this virtual environment.
The following requirements are installed:

Pattern==2.6
argparse==1.2.1
decorator==4.0.9
imageio==1.5
moviepy==0.2.2.11
numpy==1.10.4
pocketsphinx==0.0.9
pymongo==3.2.1
srt==1.1.0
tqdm==3.8.0
wsgiref==0.1.2
youtube-dl==2016.02.22

>>> pip install -r requirements.txt

*fromsrt.py*

The subtitle files are parsed in fromsrt.py and added to the database in the following formats:

sentence = {
    'filename': videofile,
    'text': row[1],
    'start': row[0][0],
    'end': row[0][1],
    'duration': row[0][1] - row[0][0],
    'words': []
}

sentence['words'].append({
    'word': word,
    'start': sentence['start'] + word_start,
    'end': sentence['start'] + word_end,
    'duration': word_duration,
    'tag': tag,
})

Each sentence is then added to the database at the very bottom of the script.
Edit 'collectionname' in this line to store different sets of srt files in different collections:

db.collectionname.insert(sentence)
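
To make the record format above easier to follow, here is a minimal sketch of how one sentence record could be assembled. It is not the code from fromsrt.py. The function name build_sentence, the tags parameter, and the way word timings are estimated (each word gets a share of the sentence duration proportional to its length in characters) are all assumptions made for illustration.

```python
def build_sentence(videofile, row, tags=None):
    """Build one sentence record from a parsed subtitle row.

    Assumption: row looks like ((start_seconds, end_seconds), text).
    Assumption: tags, if given, holds one part-of-speech tag per word.
    """
    (start, end), text = row
    sentence = {
        'filename': videofile,
        'text': text,
        'start': start,
        'end': end,
        'duration': end - start,
        'words': [],
    }

    words = text.split()
    total_chars = sum(len(w) for w in words)
    offset = 0.0  # position of the current word, relative to the sentence start

    for i, word in enumerate(words):
        # Assumption: a word's share of the sentence duration is
        # proportional to its character count.
        word_duration = sentence['duration'] * len(word) / float(total_chars)
        word_start = offset
        word_end = offset + word_duration
        sentence['words'].append({
            'word': word,
            'start': sentence['start'] + word_start,
            'end': sentence['start'] + word_end,
            'duration': word_duration,
            'tag': tags[i] if tags else None,
        })
        offset = word_end

    return sentence


example = build_sentence('example.mp4', ((10.0, 12.0), 'ab cd'))
```

The returned dictionary has the same keys as the record shown above, so it could be passed to the insert call in the same way.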

* database *

The parsed .srt files are stored in a MongoDB database called 'algolit'. MongoDB gives us an interface to the data and lets us write specific queries later.
To show all the databases in your Mongo installation, open the mongo shell and run:
>>> show dbs

To enter the algolit database, run:
>>> mongo algolit

To show all the collections, run:
>>> show collections

To print all the items in a collection, run:
>>> db.collectionname.find()
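
As an illustration of the kind of specific query the stored records allow, the sketch below collects the time span of every occurrence of a word. The function name find_word_clips is hypothetical and not part of the repository. It works on any iterable of sentence records in the format written by fromsrt.py, so it can be tried on a plain list without a running database.

```python
def find_word_clips(sentences, wanted, tag=None):
    """Return (filename, start, end) for every occurrence of a word.

    sentences: iterable of sentence records, each with a 'words' list.
    wanted:    the word to look for (compared case-insensitively).
    tag:       optional part-of-speech tag the word must carry.
    """
    clips = []
    for sentence in sentences:
        for w in sentence.get('words', []):
            if w['word'].lower() != wanted.lower():
                continue
            if tag is not None and w.get('tag') != tag:
                continue
            clips.append((sentence['filename'], w['start'], w['end']))
    return clips


# In-memory sample in the record format described in this README.
sample = [
    {'filename': 'a.mp4', 'words': [
        {'word': 'Film', 'start': 1.0, 'end': 1.5, 'tag': 'NN'},
        {'word': 'is', 'start': 1.5, 'end': 1.7, 'tag': 'VBZ'},
    ]},
    {'filename': 'b.mp4', 'words': [
        {'word': 'film', 'start': 4.0, 'end': 4.4, 'tag': 'VB'},
    ]},
]
clips = find_word_clips(sample, 'film')

# With a running MongoDB, the same function could read from the database
# (assuming the collection is called 'collectionname'):
#   from pymongo import MongoClient
#   db = MongoClient().algolit
#   clips = find_word_clips(db.collectionname.find({'words.word': 'film'}), 'film')
```

Note that the MongoDB query in the comment matches 'film' exactly and is case-sensitive, so it only pre-filters the sentences. The function then picks out the individual words.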

* sources *

The video sources we used to build the vocabulary are listed here:
http://pad.constantvzw.org/p/video-sources-links