*****************************
********** README ***********
*****************************
writing-with-film
=================
NOTE: all scripts are written in Python 2.7, because the Pattern library (version 2.6) requires Python 2.
First install the dependencies:
>>> sudo apt-get install mongodb libsphinxbase1 swig
- MongoDB: database we use to store our words. https://www.mongodb.org/
- PocketSphinx: a lightweight speech recognition engine that we use to create subtitle files using speech detection. http://cmusphinx.sourceforge.net/
- Swig: the Simplified Wrapper and Interface Generator, a tool that connects programs or libraries written in C or C++ with scripting languages. We need it as a dependency for PocketSphinx. http://swig.org/
* virtual environment *
- A virtual environment is an isolated ('closed') environment for working on a Python project.
- It lets you install Python packages for this project without conflicting with packages that are already installed on your system.
>>> virtualenv venv
>>> . venv/bin/activate
You can then install the requirements, which stay inside this virtual environment.
The following requirements are installed:
- Pattern==2.6
- argparse==1.2.1
- decorator==4.0.9
- imageio==1.5
- moviepy==0.2.2.11
- numpy==1.10.4
- pocketsphinx==0.0.9
- pymongo==3.2.1
- srt==1.1.0
- tqdm==3.8.0
- wsgiref==0.1.2
- youtube-dl==2016.02.22
>>> pip install -r requirements.txt
*fromsrt.py*
The subtitle files are parsed in fromsrt.py and added to the database in the following format, one document per sentence, each holding a list of timed words:
sentence = {
    'filename': videofile,
    'text': row[1],
    'start': row[0][0],
    'end': row[0][1],
    'duration': row[0][1] - row[0][0],
    'words': []
}
sentence['words'].append({
    'word': word,
    'start': sentence['start'] + word_start,
    'end': sentence['start'] + word_end,
    'duration': word_duration,
    'tag': tag,
})
Each sentence is then added to the database at the very bottom of the script.
Edit 'collectionname' in this line to store different sets of .srt files in different collections:
- db.collectionname.insert(sentence)
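The sentence format above can be sketched as a small helper. This is only an illustration, not the code in fromsrt.py: the function name build_sentence, the shape of word_timings and the example values are assumptions, and the word duration is assumed to be word_end minus word_start.

```python
def build_sentence(videofile, row, word_timings):
    """Build one sentence document in the format described above.

    Assumptions (not taken from fromsrt.py):
    - row is ((start, end), text), with times in seconds
    - word_timings is a list of (word, word_start, word_end, tag) tuples,
      with word times relative to the start of the sentence
    """
    (start, end), text = row
    sentence = {
        'filename': videofile,
        'text': text,
        'start': start,
        'end': end,
        'duration': end - start,
        'words': [],
    }
    for word, word_start, word_end, tag in word_timings:
        sentence['words'].append({
            'word': word,
            # word times are shifted so they are relative to the video
            'start': start + word_start,
            'end': start + word_end,
            'duration': word_end - word_start,
            'tag': tag,
        })
    return sentence


# Hypothetical example values, only to show the resulting structure
example = build_sentence(
    'clip.mp4',
    ((10.0, 12.0), 'hello world'),
    [('hello', 0.0, 0.5, 'UH'), ('world', 0.5, 1.0, 'NN')],
)
```

A dictionary like this is what would be passed to the insert call shown above.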
* database *
The parsed .srt files are stored in a MongoDB database called 'algolit'. MongoDB gives us an interface to the data and enables us to write specific queries later.
To show all the databases in your Mongo installation, run the following inside the mongo shell (started with 'mongo'):
>>> show dbs
To enter the algolit database from the command line, run:
>>> mongo algolit
To show all the collections, run the following inside the mongo shell:
>>> show collections
To print all the items in a collection, run:
>>> db.collectionname.find()
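The stored sentences can also be filtered from Python. The following is a minimal sketch, assuming documents in the sentence format described earlier; the helper name words_with_tag and the sample data are hypothetical. With pymongo, the sentences argument could be the cursor returned by MongoClient()['algolit']['collectionname'].find(), but a plain list is used here so the example runs without a database.

```python
def words_with_tag(sentences, tag):
    """Yield (filename, word, start, end) for every word with the given tag.

    sentences can be any iterable of sentence documents, for example
    a list of dictionaries or a pymongo cursor.
    """
    for sentence in sentences:
        for word in sentence.get('words', []):
            if word.get('tag') == tag:
                yield (sentence['filename'], word['word'],
                       word['start'], word['end'])


# Stand-in data in the sentence format (hypothetical values)
sample = [{
    'filename': 'clip.mp4',
    'text': 'hello world',
    'start': 10.0,
    'end': 12.0,
    'duration': 2.0,
    'words': [
        {'word': 'hello', 'start': 10.0, 'end': 10.5,
         'duration': 0.5, 'tag': 'UH'},
        {'word': 'world', 'start': 10.5, 'end': 11.0,
         'duration': 0.5, 'tag': 'NN'},
    ],
}]

nouns = list(words_with_tag(sample, 'NN'))
```

Each result says which video file a word comes from and where in that file it starts and ends.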
* sources *
The video sources we used to build the vocabulary are listed here:
http://pad.constantvzw.org/p/video-sources-links