*****************************
********** README ***********
*****************************

writing-with-film
=================

NOTE: all scripts are written in Python 2.7, because the Pattern library we use requires Python 2.

First install the dependencies:

>>> sudo apt-get install mongodb libsphinxbase1 swig

*MongoDB: database we use to store our words. https://www.mongodb.org/
*PocketSphinx: a lightweight speech recognition engine that we use to create subtitle files using speech detection. http://cmusphinx.sourceforge.net/
*Swig: Simplified Wrapper and Interface Generator, a tool that connects programs or libraries written in C or C++ with scripting languages. PocketSphinx needs it as a dependency. http://swig.org/

* virtual environment *

*A virtual environment is a 'closed' environment for working on a Python project.
*It lets you install Python packages for this project without conflicting with packages that are already installed on your system.

>>> virtualenv venv
>>> . venv/bin/activate
You can then install the requirements, which stay inside this virtual environment.
The following requirements will be installed:

*Pattern==2.6
*argparse==1.2.1
*decorator==4.0.9
*imageio==1.5
*moviepy==0.2.2.11
*numpy==1.10.4
*pocketsphinx==0.0.9
*pymongo==3.2.1
*srt==1.1.0
*tqdm==3.8.0
*wsgiref==0.1.2
*youtube-dl==2016.02.22

>>> pip install -r requirements.txt

*fromsrt.py*

The subtitle files are parsed in fromsrt.py and added to the database in the following formats:

sentence = {
*'filename': videofile,
*'text': row[1],
*'start': row[0][0],
*'end': row[0][1],
*'duration': row[0][1] - row[0][0],
*'words': []
}

sentence['words'].append({
*'word': word,
*'start': sentence['start'] + word_start,
*'end': sentence['start'] + word_end,
*'duration': word_duration,
*'tag': tag,
})

Each sentence is then added to the database at the very bottom of the script.
Edit 'collectionname' in this line to store different sets of .srt files in different collections:

*db.collectionname.insert(sentence)
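
As a rough illustration of the format above, here is a minimal sketch that builds one sentence document. It is not the code from fromsrt.py: the function name build_sentence, the even split of the sentence duration over its words, and the optional tagger callable are all assumptions made for this example. Times are assumed to be numbers in seconds.

```python
def build_sentence(videofile, row, tagger=None):
    """Build one sentence document in the shape described above.

    row is assumed to look like ((start, end), text), with times in seconds.
    tagger is a hypothetical callable that maps a word to a tag.
    """
    (start, end), text = row
    sentence = {
        'filename': videofile,
        'text': text,
        'start': start,
        'end': end,
        'duration': end - start,
        'words': [],
    }

    words = text.split()
    if not words:
        return sentence

    # ASSUMPTION: the sentence duration is spread evenly over its words.
    # The real script may compute word timings differently.
    word_duration = sentence['duration'] / float(len(words))

    for index, word in enumerate(words):
        word_start = index * word_duration
        word_end = word_start + word_duration
        sentence['words'].append({
            'word': word,
            'start': sentence['start'] + word_start,
            'end': sentence['start'] + word_end,
            'duration': word_duration,
            # ASSUMPTION: no tag is stored when no tagger is given.
            'tag': tagger(word) if tagger else None,
        })
    return sentence


# A two-second subtitle with four words.
example = build_sentence('clip.mp4', ((10.0, 12.0), 'hello film world you'))
```

The resulting dictionary can be passed to the insert call shown above.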

* database *

The parsed .srt files are stored in a MongoDB database called 'algolit'. MongoDB gives us an interface to the data and lets us write specific queries later.
To show all the databases in your Mongo installation, run:

To enter the algolit database, run:
>>> mongo algolit

To show all the collections, run (inside the mongo shell):
>>> show collections

To print all the items in a collection, run:
>>> db.collectionname.find()
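
The same kind of query can also be run from Python with pymongo. The sketch below is only an illustration: the helper names (tag_query, words_with_tag, open_collection) are made up for this example, 'collectionname' is the placeholder used above, and we assume that 'tag' holds a part-of-speech tag such as 'NN'.

```python
def tag_query(tag):
    """Return a MongoDB filter for sentences that contain a word with this tag."""
    # Dot notation reaches into the documents embedded in the 'words' array.
    return {'words.tag': tag}


def words_with_tag(collection, tag):
    """Collect (filename, word, start, end) for every word carrying the tag."""
    matches = []
    for sentence in collection.find(tag_query(tag)):
        for word in sentence['words']:
            if word['tag'] == tag:
                matches.append((sentence['filename'], word['word'],
                                word['start'], word['end']))
    return matches


def open_collection(name='collectionname'):
    """Open a collection in the local 'algolit' database.

    Needs pymongo installed and a MongoDB server running on localhost.
    """
    # Imported here so the helpers above can be used without pymongo.
    from pymongo import MongoClient
    return MongoClient()['algolit'][name]
```

With a running database, words_with_tag(open_collection(), 'NN') would return the matching words together with their video file and timings.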

* sources *

The video sources we used to build the vocabulary are listed here:
http://pad.constantvzw.org/p/video-sources-links