BI-text-mining

An academic project that extracts and processes text from a large number of articles scraped from several Moroccan news websites.

The project is divided into five parts, each in its own directory:

[Part 1]: Scraping
[Part 2]: Text_processing
[Part 3]: Automating
[Part 4]: Reporting
[Part 5]: Mining

Further Details

1. Scraping :

Scraping articles (title, publication date, image, link, full text…) from Moroccan news websites (BeautifulSoup and requests).

Resources :

Data was retrieved from the following websites:

Le matin.ma
La vie eco
Challenge.ma

Data structure :

We scraped the Economy subcategory pages of each news website. For each article, we collected its :

title

publication date

image

link

full text.
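
As a minimal sketch of how one article page could be turned into these five fields with BeautifulSoup: the HTML snippet, tag names, and CSS classes below are hypothetical and do not reflect the actual markup of the websites listed above, and the `Article` class and `parse_article` function are illustrative names, not part of the project. In a real run, the HTML would come from something like `requests.get(url).text`.

```python
from dataclasses import dataclass

from bs4 import BeautifulSoup


@dataclass
class Article:
    title: str
    publication_date: str
    image: str
    link: str
    full_text: str


# Hypothetical markup; each real website needs its own selectors.
SAMPLE_HTML = """
<html><body>
  <article>
    <h1 class="title">Sample headline</h1>
    <time datetime="2021-01-15">15 January 2021</time>
    <img src="https://example.com/photo.jpg">
    <div class="content"><p>First paragraph.</p><p>Second paragraph.</p></div>
  </article>
</body></html>
"""


def parse_article(html: str, link: str) -> Article:
    """Extract the five fields from one article page (assumed markup)."""
    soup = BeautifulSoup(html, "html.parser")
    title = soup.find("h1", class_="title").get_text(strip=True)
    publication_date = soup.find("time")["datetime"]
    image = soup.find("img")["src"]
    paragraphs = soup.select("div.content p")
    full_text = " ".join(p.get_text(strip=True) for p in paragraphs)
    return Article(title, publication_date, image, link, full_text)


article = parse_article(SAMPLE_HTML, "https://example.com/economy/sample")
print(article)
```

Keeping the parsing separate from the HTTP request makes it possible to check the selectors against saved pages without hitting the websites.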

2. Text_processing :

Apply text mining methods and algorithms (TF-IDF, NMF, topic modeling).

Texts are preprocessed and cleaned using basic text processing techniques such as :
- removing stop words
- lemmatization
- stemming
- tokenization
- removing punctuation
- removing numbers
- removing special characters.
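
A rough sketch of some of the cleaning steps listed above, using only the Python standard library. The stop-word list is a tiny illustrative assumption, and lemmatization and stemming are left out because they require a language-specific library (for example NLTK or spaCy); the `preprocess` function is a hypothetical name, not the project's own code.

```python
import re

# Tiny illustrative stop-word list (assumption); a real run would use a
# complete list for the language of the articles.
STOP_WORDS = {"le", "la", "les", "de", "des", "du", "un", "une", "et", "en", "a"}


def preprocess(text: str) -> list[str]:
    """Lowercase, strip numbers/punctuation/special characters, tokenize,
    and drop stop words."""
    text = text.lower()
    text = re.sub(r"\d+", " ", text)       # remove numbers
    text = re.sub(r"[^\w\s]", " ", text)   # remove punctuation and special characters
    text = text.replace("_", " ")          # "_" counts as a word character in \w
    tokens = text.split()                  # whitespace tokenization
    return [t for t in tokens if t not in STOP_WORDS]


print(preprocess("Le PIB du Maroc a progressé de 3,2 % en 2019 !"))
# → ['pib', 'maroc', 'progressé']
```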
We then applied text mining algorithms such as :
- TF-IDF
- NMF
- Topic Modeling.

3. Automating :

Automate scraping, text processing, data warehousing, and loading the data into a PostgreSQL database (Airflow, Docker…). The data pipeline architecture is as follows:

4. Reporting :

Present results and key measures in a dashboard (web app built with Flask).

The results are reported in a simple dashboard, as follows:

5. Mining :

Extract association rules (R and Python).
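
As a sketch of what an association rule measures, the following plain-Python example computes support and confidence for rules between pairs of keywords. It is a brute-force illustration, not necessarily the algorithm or library used in this project; the transactions, thresholds, and function names are all hypothetical.

```python
from itertools import combinations

# Hypothetical transactions: each set holds keywords found in one article.
transactions = [
    {"bank", "credit", "rate"},
    {"bank", "credit"},
    {"bank", "rate"},
    {"export", "port"},
    {"bank", "credit", "export"},
]


def support(itemset):
    """Fraction of transactions that contain every item of the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)


def pair_rules(min_support=0.4, min_confidence=0.7):
    """Return (lhs, rhs, support, confidence) for single-item rules."""
    items = sorted(set().union(*transactions))
    rules = []
    for a, b in combinations(items, 2):
        pair_support = support({a, b})
        if pair_support < min_support:
            continue
        for lhs, rhs in ((a, b), (b, a)):
            # confidence(lhs -> rhs) = support(lhs and rhs) / support(lhs)
            confidence = pair_support / support({lhs})
            if confidence >= min_confidence:
                rules.append((lhs, rhs, pair_support, confidence))
    return rules


rules = pair_rules()
for lhs, rhs, s, c in rules:
    print(f"{lhs} -> {rhs} (support={s:.2f}, confidence={c:.2f})")
```

Checking every pair is fine for a handful of keywords; for larger vocabularies, an algorithm such as Apriori prunes infrequent itemsets before rules are generated.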
