Project for exploring Swedish social media data to find dialectal differences in the language.
- Install dependencies: pip install -r requirements.txt (TODO)
- Set up a MySQL database
  - If you're on OS X and like it easy, try brew install mysql in a terminal
    - If you don't have Homebrew, run ruby <(curl -fsSk https://raw.github.com/mxcl/homebrew/go) first in a terminal
- Edit config.example.py with your MySQL credentials and rename it to config.py
- Run setupDB.py to create all tables.
- Run the following in mysql:
  mysql> set character_set_client = 'utf8'; set character_set_connection = 'utf8'; set character_set_database = 'utf8'; set character_set_results = 'utf8'; set character_set_server = 'utf8';
- Get data:
  - Run the spider scripts in /Spiders
  - or bring your own data
- Import the data with the corresponding import script, <source>2mysql.py
- Add the fulltext indexing to make searches decently fast (on 58 GB of data this took 24 h, so keep that in mind)
  - Run setupFulltext.py
  - Note that for me the fulltext index temporarily needed 250 GB of free hard drive space while generating the index on 58 GB of data.
- Run metadata2coordinates.py to convert metadata to coordinates through the Google geocoding API. This is limited to 2000 requests per 24 h, so if not all of your data has been converted, keep the script running for some days until it's finished (if you're bored you can continue, but without all data available in the GUI).
- Run runGUI.py to start the web server.
- Point your browser to http://localhost:5000/sinus/ for the GUI to explore the data. See usage for more details.
- Run getWordlist.py to generate a wordlist.
- Run getEntropy.py to start collecting words with low entropy. Low entropy corresponds to words being used very locally. E.g. nypotatis will have low entropy because it's mainly used in southern Sweden. Words like och, att, på will have high entropy as they are used in the whole country. The findings of getEntropy.py can be found in the web GUI under /explore.
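The entropy idea behind getEntropy.py can be illustrated with a minimal sketch. This is not the script's actual implementation: it assumes every occurrence of a word has already been assigned to a region, and the function name word_entropy, the region labels, and the counts are all made up for illustration.

```python
import math
from collections import Counter


def word_entropy(regions):
    """Shannon entropy (in bits) of a word's distribution over regions.

    `regions` holds one region label per occurrence of the word.
    """
    counts = Counter(regions)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("need at least one occurrence")
    entropy = 0.0
    for n in counts.values():
        p = n / total
        entropy -= p * math.log2(p)
    return entropy


# Hypothetical data: a local word concentrated in one region ...
local_word = ["south"] * 97 + ["middle"] * 2 + ["north"] * 1
# ... and a common word spread almost evenly over the country.
common_word = ["south"] * 34 + ["middle"] * 33 + ["north"] * 33

print(round(word_entropy(local_word), 2))
print(round(word_entropy(common_word), 2))
```

The concentrated word gets a value close to 0, while the evenly spread word approaches the maximum for three regions (log2 of 3), which is the contrast described above for nypotatis versus och, att, på.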
To use the geotagger to tag data without metadata:
- Run fetchTweets.py for a couple of months. This will collect geotagged tweets to be used as a training set.
- Run compileGMMs.py to build the model.
- Run tagData.py to give your data coordinates (of reasonable accuracy) if it lacks metadata.
- Search in the GUI with the flag lowqualdata: 1. This includes the inferred data that tagData.py has produced.
The GUI is usually accessible through http://localhost:5000/sinus/. Email maxberggren@gmail.com if you want to try our setup.
Results will be shown as logarithmic frequency when only one search term is used.
This animation uses xbins: <number>, which specifies how many bins/pixels should be used on the x-axis. This corresponds to how fine-grained a resolution you want.
Here scatter: 1 will force it to produce a scatterplot instead. This can be better in cases where there are very few hits in the database.
binthreshold: <number> sets how many hits per bin are required for the bin to count. The default is 5 if not specified.
uselowqualdata: 1 will use data of low rank. That means that, e.g., data tagged with the geotagger will be used.
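How xbins and binthreshold might interact can be sketched as follows. This is only an illustration under stated assumptions, not the GUI's actual code: the function name bin_hits, the rough bounding box around Sweden, the square cells, and the "at least binthreshold hits" rule are all assumptions.

```python
from collections import Counter


def bin_hits(points, xbins, binthreshold=5,
             lon_range=(10.0, 25.0), lat_range=(55.0, 70.0)):
    """Count hits per grid cell and drop cells below the threshold.

    points: iterable of (lon, lat) tuples.
    The default ranges are a rough, assumed bounding box around Sweden.
    Cells are square in degrees, so xbins alone fixes the resolution.
    """
    lon_min, lon_max = lon_range
    lat_min, lat_max = lat_range
    cell = (lon_max - lon_min) / xbins  # cell width in degrees
    counts = Counter()
    for lon, lat in points:
        if not (lon_min <= lon < lon_max and lat_min <= lat < lat_max):
            continue  # ignore points outside the bounding box
        # Clamp the column to guard against floating point rounding.
        col = min(int((lon - lon_min) / cell), xbins - 1)
        row = int((lat - lat_min) / cell)
        counts[(col, row)] += 1
    # Keep only cells with enough hits to count.
    return {key: n for key, n in counts.items() if n >= binthreshold}


# Hypothetical hits: six near Stockholm, two near Malmö.
points = [(18.06, 59.33)] * 6 + [(13.0, 55.6)] * 2
dense = bin_hits(points, xbins=15)                      # default threshold of 5
everything = bin_hits(points, xbins=15, binthreshold=1)
print(dense)
print(everything)
```

With the default threshold only the Stockholm cell survives. Lowering binthreshold keeps the sparse Malmö cell as well, and a larger xbins would split the map into smaller cells.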
Search results will now be in percent. E.g. a search for tipspromenad vs tipsrunda vs poängpromenad (a common Swedish game) will show, in percent, how the hits for each term compare against the others.
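The percentage comparison can be sketched like this. It is a minimal illustration, not the GUI's implementation: the function name term_shares and the hit counts are hypothetical, and whether the shares are computed per bin or over all hits is an assumption.

```python
def term_shares(hits):
    """Convert hit counts per search term into percentages of the total."""
    total = sum(hits.values())
    if total == 0:
        # No hits at all: every term gets a share of zero.
        return {term: 0.0 for term in hits}
    return {term: 100.0 * n / total for term, n in hits.items()}


# Hypothetical hit counts for the three terms within a single bin.
hits_in_bin = {"tipspromenad": 30, "tipsrunda": 15, "poängpromenad": 5}
shares = term_shares(hits_in_bin)

for term, share in shares.items():
    print(f"{term}: {share:.0f} %")
```

The shares always sum to 100 when there is at least one hit, so the result says which term dominates in an area without depending on how much data that area has.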
And let's try searching for phrases rather than just words: flak öl vs platta öl vs karta öl (different phrases in Swedish describing 24 beer cans). Notice how it used a scatterplot since the last term, karta öl, had so few hits in the database.
The GUI can also access the geotagger that is used to tag data that have no metadata.