*****************************
********** README ***********
*****************************

writing-with-film
=================

NOTE: all scripts are written in Python 2.7, because the Pattern library they depend on requires Python 2.

First install the dependencies:

>>> sudo apt-get install mongodb libsphinxbase1 swig

MongoDB: database we use to store our words. https://www.mongodb.org/
PocketSphinx: a lightweight speech recognition engine that we use to create subtitle files using speech detection. http://cmusphinx.sourceforge.net/
Swig: the Simplified Wrapper and Interface Generator, a tool that connects programs or libraries written in C or C++ with scripting languages. We need it as a dependency for PocketSphinx. http://swig.org/

* virtual environment *

A virtual environment is a way to work in a 'closed' environment when working on a Python project.
It lets you install the Python packages a project needs without conflicting with packages that are already installed on your system.

>>> virtualenv venv
>>> . venv/bin/activate

You can then install the requirements, which stay within this virtual environment.
The pip command below the list installs the following packages:

Pattern==2.6
argparse==1.2.1
decorator==4.0.9
imageio==1.5
moviepy==0.2.2.11
numpy==1.10.4
pocketsphinx==0.0.9
pymongo==3.2.1
srt==1.1.0
tqdm==3.8.0
wsgiref==0.1.2
youtube-dl==2016.02.22

>>> pip install -r requirements.txt

* fromsrt.py *

The subtitle files are parsed in fromsrt.py and added to the database in the following formats:

sentence = {
    'filename': videofile,
    'text': row[1],
    'start': row[0][0],
    'end': row[0][1],
    'duration': row[0][1] - row[0][0],
    'words': []
}

sentence['words'].append({
    'word': word,
    'start': sentence['start'] + word_start,
    'end': sentence['start'] + word_end,
    'duration': word_duration,
    'tag': tag,
})

Each sentence is then added to the database at the very bottom of the script.
Edit the 'collectionname' in this line to make different collections of srt files:

db.collectionname.insert(sentence)
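
As a rough illustration of the document layout described above, here is a minimal sketch that turns the text of an .srt file into sentence documents using only the standard library. It is not the code in fromsrt.py: the helper names (parse_srt, build_sentence), the use of seconds for 'start' and 'end', the even spreading of word timings across the sentence, and the placeholder tag are all assumptions made for the example.

```python
import re

# Matches an SRT timing line such as "00:00:01,500 --> 00:00:04,000".
TIMING = re.compile(
    r'(\d+):(\d+):(\d+)[,.](\d+)\s*-->\s*(\d+):(\d+):(\d+)[,.](\d+)')


def to_seconds(hours, minutes, seconds, millis):
    """Convert the four timestamp fields to seconds as a float."""
    return (int(hours) * 3600 + int(minutes) * 60
            + int(seconds) + int(millis) / 1000.0)


def parse_srt(text):
    """Yield ((start, end), text) rows from the contents of an .srt file."""
    for block in re.split(r'\n\s*\n', text.strip()):
        lines = block.strip().splitlines()
        for i, line in enumerate(lines):
            match = TIMING.search(line)
            if match:
                fields = match.groups()
                start = to_seconds(*fields[:4])
                end = to_seconds(*fields[4:])
                # Everything after the timing line is the subtitle text.
                yield (start, end), ' '.join(lines[i + 1:]).strip()
                break


def build_sentence(videofile, row, tag='?'):
    """Build one sentence document in the layout shown above.

    Assumption: each word gets an equal share of the sentence duration,
    and every word receives the same placeholder tag.
    """
    (start, end), text = row
    sentence = {
        'filename': videofile,
        'text': text,
        'start': start,
        'end': end,
        'duration': end - start,
        'words': [],
    }
    words = text.split()
    word_duration = (end - start) / len(words) if words else 0.0
    for index, word in enumerate(words):
        word_start = index * word_duration
        word_end = word_start + word_duration
        sentence['words'].append({
            'word': word,
            'start': start + word_start,
            'end': start + word_end,
            'duration': word_duration,
            'tag': tag,
        })
    return sentence
```

Each dictionary returned by build_sentence has the same keys as the 'sentence' structure above, so it could be passed to the insert call in the same way.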

* database *

The parsed .srt files are placed into a MongoDB database called 'algolit'. The mongo shell gives an interface to the database and enables us to write specific queries later.
To show all the databases in your Mongo installation, open the mongo shell and run:
>>> mongo
>>> show dbs

To enter the algolit database directly, run:
>>> mongo algolit

To show all the collections, run:
>>> show collections

To print all the items in a collection, run:
>>> db.collectionname.find()
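
The same collection can also be queried from Python. The sketch below is only an illustration and is not part of the project scripts: the helper names are hypothetical, 'collectionname' stands in for whichever collection you created, and the 'NN' tag in the usage note assumes the words carry Penn Treebank style tags. It relies on MongoDB's dot notation ('words.tag') to match sentences that contain at least one word with a given tag.

```python
def tag_query(tag):
    """Query matching sentences that contain at least one word with this tag."""
    return {'words.tag': tag}


def words_with_tag(collection, tag):
    """Collect every word carrying `tag` from the matching sentences."""
    found = []
    for sentence in collection.find(tag_query(tag)):
        # The query returns whole sentences, so filter the words again here.
        for word in sentence['words']:
            if word['tag'] == tag:
                found.append(word['word'])
    return found


def open_collection(name='collectionname'):
    """Connect to the local 'algolit' database (needs a running MongoDB)."""
    # Imported here so the helpers above can be used without pymongo.
    from pymongo import MongoClient
    return MongoClient()['algolit'][name]
```

With MongoDB running, words_with_tag(open_collection(), 'NN') would return the words tagged 'NN' in that collection.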

* sources *

The video sources we used to build the vocabulary are listed here:
http://pad.constantvzw.org/p/video-sources-links