Data stream learning with change detection

Using Latent Dirichlet Allocation (LDA) and the ADWIN algorithm, this project performs topic modeling and concept drift detection on a corpus of tweets.

This project was coded by Lucas Maison and Antoine Moulin, students at Télécom Paris, under the supervision of Pierre-Alexandre Murena and Marie Al-Ghossein.

This project contains the following files:

gui.py, a Python program that runs the GUI
onlineldavb.py and onlineLDAWrapper.py, which implement our model (Latent Dirichlet Allocation)
twitter_stream.py, which retrieves tweets
text_preprocessing.py, which cleans tweets
corpus.py, which handles our data
The twitter folder, which contains our data set and our trained models
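
To give an idea of what cleaning a tweet can involve before it reaches the topic model, here is a minimal sketch. It is not the code in text_preprocessing.py: the function name clean_tweet and the regular expressions are assumptions made for illustration, using only the Python standard library.

```python
import re

# Hypothetical patterns; the project's own cleaning rules may differ.
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
NON_LETTER_RE = re.compile(r"[^a-z\s]")


def clean_tweet(text):
    """Lowercase a tweet, strip URLs, mentions and punctuation, return tokens."""
    text = text.lower()
    text = URL_RE.sub(" ", text)         # remove links
    text = MENTION_RE.sub(" ", text)     # remove @user mentions
    text = text.replace("#", " ")        # keep the hashtag word, drop the sign
    text = NON_LETTER_RE.sub(" ", text)  # drop digits and punctuation
    return text.split()


tokens = clean_tweet("Goal!! @FIFAWorldCup #France wins https://t.co/abc123")
print(tokens)
```

A real pipeline would usually also remove stop words (for example with nltk) before building the vocabulary.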

The code has not been fully cleaned up yet, and we may come back to it later. It does work, however.

WARNING: Please make sure the following Python libraries are available before you start the program:

PyQt5
pickle (part of the Python standard library)
numpy
scipy
nltk
preprocessor
tweepy
pylab (provided by matplotlib)
skmultiflow (installation can be difficult; as a last resort, you can copy the scikit-multiflow-master folder directly into the folder that contains your Python distribution, e.g. Anaconda3/)
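
A quick way to see which of these libraries are missing is to ask Python whether each module can be found. The sketch below is only a convenience check and is not part of the project; the helper name missing_modules is an assumption, and the module names are the ones listed above.

```python
import importlib.util

# Module names as listed in this README.
REQUIRED = [
    "PyQt5", "pickle", "numpy", "scipy", "nltk",
    "preprocessor", "tweepy", "pylab", "skmultiflow",
]


def missing_modules(names):
    """Return the names for which no importable top-level module is found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


missing = missing_modules(REQUIRED)
if missing:
    print("Missing libraries:", ", ".join(missing))
else:
    print("All required libraries were found.")
```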

The provided models were trained during the 2018 Football World Cup, so you may want to delete them all and start training from scratch. When starting the GUI, set the parameters (e.g. number of top words) and click on Begin Streaming.

Here is what the GUI looks like (see GUI.PNG):

Here is what each frame contains:

The first frame lets the user save or load a model, train a model from the data set, or retrieve tweets from Twitter. In the last case, streaming goes on until the user stops it, and the model is re-trained whenever our program detects a concept drift (using ADWIN).
The second frame shows five of the tweets that have been retrieved.
The third frame describes each topic by its top words.
The fourth frame shows how the training is going, how many tweets have been retrieved and how many drifts have been detected.
The fifth and final frame shows how the perplexity evolves as tweets arrive. A red line indicates that our program detected a drift. The red line may look misplaced because it does not appear exactly where the drift occurs. This is expected: a drift is not necessarily instantaneous, and because ADWIN relies on window means to detect it, detection is not immediate.
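
The delay between a drift and the red line can be illustrated with a simplified, ADWIN-style detector. This is a sketch under stated assumptions, not the skmultiflow implementation and not the project's code: the class SimpleAdwin is hypothetical, it checks every split of a bounded window instead of using ADWIN's compressed buckets, it uses a plain Hoeffding-style threshold, and it assumes the monitored value (e.g. a perplexity) has been scaled to [0, 1].

```python
import math
from collections import deque


class SimpleAdwin:
    """Naive ADWIN-style mean-change detector for values in [0, 1]."""

    def __init__(self, delta=0.002, max_window=200):
        self.delta = delta                      # confidence parameter
        self.window = deque(maxlen=max_window)  # most recent values

    def update(self, value):
        """Add a value; return True if the window mean changed significantly."""
        self.window.append(value)
        values = list(self.window)
        n = len(values)
        total = sum(values)
        head_sum = 0.0
        for split in range(1, n):
            head_sum += values[split - 1]
            n0, n1 = split, n - split
            mean0 = head_sum / n0               # mean of the older part
            mean1 = (total - head_sum) / n1     # mean of the recent part
            m = 1.0 / (1.0 / n0 + 1.0 / n1)
            eps = math.sqrt(math.log(4.0 * n / self.delta) / (2.0 * m))
            if abs(mean0 - mean1) > eps:
                # Forget the older part and keep only the recent values.
                self.window = deque(values[split:], maxlen=self.window.maxlen)
                return True
        return False


# Synthetic stream: the mean jumps from 0.2 to 0.8 at index 100.
stream = [0.2] * 100 + [0.8] * 100
detector = SimpleAdwin()
drifts = [i for i, x in enumerate(stream) if detector.update(x)]
print("Drift reported at indices:", drifts)
```

Because the threshold shrinks only as more post-drift values accumulate, the first report comes some steps after index 100, which is the same effect as the red line appearing after the actual drift.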

Repository: antoine-moulin/datastream-learning

Folders and files

Latest commit

History

Repository files navigation

README

Data stream learning with change detection

About

Topics

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages