*****************************
********** README ***********
*****************************

writing-with-film
=================

NOTE: all scripts are written in Python 2.7 (because the Pattern library we use requires Python 2).

First install the dependencies:

>>> sudo apt-get install mongodb libsphinxbase1 swig

* MongoDB: the database we use to store our words.
  https://www.mongodb.org/
* PocketSphinx: a lightweight speech recognition engine that we use to create subtitle files through speech detection.
  http://cmusphinx.sourceforge.net/
* Swig: the Simplified Wrapper and Interface Generator, a tool that connects programs or libraries written in C or C++ with scripting languages. PocketSphinx needs it as a dependency.
  http://swig.org/

* virtual environment *

A virtual environment is a way to work in a 'closed' environment on a Python project. It lets you install Python packages without them conflicting with packages that were installed earlier.

>>> virtualenv venv
>>> . venv/bin/activate

You can then install the requirements, which will stay within this virtual environment only. The following requirements are installed:

    Pattern==2.6
    argparse==1.2.1
    decorator==4.0.9
    imageio==1.5
    moviepy==0.2.2.11
    numpy==1.10.4
    pocketsphinx==0.0.9
    pymongo==3.2.1
    srt==1.1.0
    tqdm==3.8.0
    wsgiref==0.1.2
    youtube-dl==2016.02.22

>>> pip install -r requirements.txt

* fromsrt.py *

The subtitle files are parsed in fromsrt.py and added to the database in the following format. Each sentence is stored as one document:

    sentence = {
        'filename': videofile,
        'text': row[1],
        'start': row[0][0],
        'end': row[0][1],
        'duration': row[0][1] - row[0][0],
        'words': []
    }

Each word of the sentence is appended to its 'words' list:

    sentence['words'].append({
        'word': word,
        'start': sentence['start'] + word_start,
        'end': sentence['start'] + word_end,
        'duration': word_duration,
        'tag': tag,
    })

At the very bottom of the script, each sentence is then added to the database.
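
As a rough sketch of how these two structures fit together, the function below builds one sentence document from a parsed subtitle row. The name build_sentence, the timed_words argument and the example values are assumptions made for illustration and do not come from fromsrt.py. Times are assumed to be plain numbers in seconds, and the word duration is assumed to be word_end - word_start.

```python
def build_sentence(videofile, row, timed_words):
    """Return a sentence document in the shape described above.

    row         -- ((start, end), text), times in seconds (assumed)
    timed_words -- list of (word, word_start, word_end, tag) tuples, with
                   word times relative to the start of the sentence (assumed)
    """
    (start, end), text = row
    sentence = {
        'filename': videofile,
        'text': text,
        'start': start,
        'end': end,
        'duration': end - start,
        'words': [],
    }
    for word, word_start, word_end, tag in timed_words:
        sentence['words'].append({
            'word': word,
            # shift the word offsets to absolute positions in the video
            'start': start + word_start,
            'end': start + word_end,
            'duration': word_end - word_start,
            'tag': tag,
        })
    return sentence


# Hypothetical example values, not taken from the project's data.
example = build_sentence(
    'film.mp4',
    ((12.0, 14.5), 'hello world'),
    [('hello', 0.0, 0.5, 'UH'), ('world', 0.75, 1.5, 'NN')],
)
```

Storing absolute word times next to the sentence times means a single word can later be located in the video without looking up its sentence first.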

To make different collections of srt files, edit 'collectionname' in this line of the script:

    db.collectionname.insert(sentence)

* database *

The parsed .srt files are stored in a MongoDB database called 'algolit'. MongoDB gives us an interface to the data and lets us write specific queries later.

To show all the databases in your Mongo installation, run (inside the mongo shell):

>>> show dbs

To enter the algolit database, run:

>>> mongo algolit

To show all the collections, run:

>>> show collections

To print all the items in a collection, run:

>>> db.collectionname.find()

* sources *

The video sources we used to build the vocabulary are listed here:
http://pad.constantvzw.org/p/video-sources-links