Presid-Analyzes

Lexical analysis of the 2017 presidential campaigns



Presid-Analyzes is the name given to a study and research project (TER) carried out by a team of three during the first year of my Master's degree in Computer Science in 2017.

The objective of this project was to retrieve the lexical data of the 2017 presidential candidates for analysis purposes. For this, our work focused on the social network Twitter and more specifically on the tweets of the candidates. These were retrieved using an API, inserted into a database and processed to extract relevant information.

The analyses carried out on the tweets were performed with a Python library specialized in automatic natural language processing (NLP). The extracted information was as follows:

  • The "distance" that separates one candidate from another
  • The words most used by a candidate
  • Which candidate used which word?

A web interface was set up to consult this information:

Présid-Analyses-Accueil

Although we limited ourselves to the social network Twitter, the work could have been taken further by extending the data retrieval to other social networks (bearing in mind that candidates publish a lot, so the quantity of data quickly becomes very large).

How processing works


1) Retrieval of candidates' Twitter identifiers

To identify the candidates' Twitter accounts, we used their Twitter IDs. Each Twitter user has a unique identifier representing their account, and it is easy to find a Twitter ID from an account name using various tools on the web (here is an example). This identifier is necessary because it targets the Twitter account on which you want to perform operations, including retrieving tweets.

The first step we took was therefore to retrieve the Twitter IDs of all the 2017 presidential candidates and store them in the project database.
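As an illustration of this first step, here is a minimal sketch using SQLite. The actual project used its own database; the `id_twitter` column name is an assumption for this sketch, and the numeric IDs shown are placeholders, not the candidates' real identifiers.

```python
import sqlite3

# In-memory database for illustration only; the project used its own DB.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# The `candidat` table name matches the SQL queries used later in the
# project; the `id_twitter` column is assumed for this sketch.
cur.execute(
    "CREATE TABLE candidat (nom_candidat TEXT PRIMARY KEY, id_twitter INTEGER)"
)

# Placeholder account names and IDs
candidats = [
    ("candidat_a", 111111111),
    ("candidat_b", 222222222),
]
cur.executemany("INSERT INTO candidat VALUES (?, ?)", candidats)
conn.commit()

cur.execute("SELECT nom_candidat, id_twitter FROM candidat")
print(cur.fetchall())
```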

2) Creation of the tweets recovery script

We then created a Python script that retrieves the tweets of all the candidates previously saved in the database. This is possible thanks to an API provided by Twitter that allows remote operations to be carried out on different accounts. The operation that interests us here, the retrieval of tweets, is programmed as follows:

import twitter

# Connect to the API with the application's tokens
api = twitter.Api(
    consumer_key='*******',
    consumer_secret='*******',
    access_token_key='******',
    access_token_secret='*******'
)

# API call retrieving a user's tweets (here, the 200 most recent)
tweets = api.GetUserTimeline(user_id=ID_Twitter, count=200)


The GetUserTimeline method retrieves the tweets of a Twitter user account. The "ID_Twitter" parameter is set (by a more substantial part of the script) to the candidate's identifier, so one API call is made per candidate.

To avoid an excessive volume of data, we decided not to collect tweets posted before 2016. Likewise, so that candidates' new tweets are taken into account without retrieving everything again, we keep the identifier of the most recent tweet retrieved and start from it on the next execution of the script.
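The incremental-retrieval idea can be sketched independently of the Twitter API as a pure function. The `new_tweets_since` helper and the tweet tuples below are illustrative, not part of the project's code; note that the API itself also accepts a `since_id` parameter that performs this filtering server-side.

```python
def new_tweets_since(tweets, last_seen_id):
    """Keep only tweets newer than the most recent one already stored,
    and return them together with the new bookmark id."""
    # Tweet IDs increase over time, so "newer" means a larger id
    fresh = [t for t in tweets if t[0] > last_seen_id]
    new_last = max((t[0] for t in fresh), default=last_seen_id)
    return fresh, new_last

# A timeline of (tweet_id, text) pairs, most recent first
timeline = [(105, "tweet e"), (104, "tweet d"), (101, "tweet a")]

fresh, bookmark = new_tweets_since(timeline, last_seen_id=101)
print(fresh)     # only tweets 104 and 105 are new
print(bookmark)  # 105, stored for the next run of the script
```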

All these tweets were finally inserted into the project's database, in order to carry out the necessary analyses.

3) Creation of the tweet analysis script

This script, also written in Python, performs an in-depth analysis of the retrieved tweets in order to extract the words and their type (adjective, adverb, common noun, etc.).
This is possible thanks to a Python library (Syntactic Parser) specialized in automatic natural language processing (NLP):

from syntactic_parser.parser import Parser

# Parser configured for French
parser = Parser(language='fr')

# Parse a tweet encoded in UTF-8
result = parser.process_document(tweet.encode("utf8"))

# Collect the words (lemmas) and their types from the parse
rows_mot = []
rows_contenir = []
for sentence in result:
    for token in sentence:
        lemma = token['lemma']
        pos = token['pos']
        # Keep proper nouns, common nouns, adjectives, adverbs and verbs
        if pos in ['np', 'nc', 'adj', 'adv', 'v']:
            rows_mot.append((lemma, pos))
            rows_contenir.append((id_tweet, lemma))


Two characteristics of each word are retrieved here:

  • The lemma, which is the canonical form (the "root") of a word in a language. For example, the conjugated verb "eaten" has "eat" as its lemma. This form makes it possible to identify the generic form associated with a word.
  • The pos, which is the type of the word (verb, adverb, adjective...).

This information is then persisted in the database, along with the number of times each candidate uses each word. It is this count that makes it possible to calculate the distance separating one candidate from another.
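As a sketch of this counting step, Python's `collections.Counter` gives the number of uses per lemma. The `(lemma, pos)` pairs below are made up for illustration; they stand in for the `rows_mot` list built by the analysis loop.

```python
from collections import Counter

# Hypothetical (lemma, pos) pairs as produced by the analysis loop
mots_candidat = [
    ("france", "nc"), ("vouloir", "v"), ("france", "nc"),
    ("grand", "adj"), ("france", "nc"),
]

# Number of uses of each lemma by the candidate; this count is what
# gets persisted and later feeds into the distance calculation
compte = Counter(lemma for lemma, pos in mots_candidat)
print(compte["france"])  # 3
```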

4) Creation of the distance calculation script

Also written in Python, this script calculates, via an adapted mathematical formula, the difference between one candidate and another (the distance is represented by a numerical value).
The idea is first to retrieve all the words used by the candidates and their usage counts (see the SQL queries below), then to retrieve all the words extracted from the tweets (all candidates combined). Every word is then tested on every pair of candidates:

# `cur` is an open cursor on the project's database

# SQL queries to execute
selection_mots = u"SELECT valeur_mot FROM mot;"
selection_candidats = u"SELECT nom_candidat FROM candidat;"
selection_resultats_analyse = u"SELECT nom_candidat, valeur_mot, compte_resultat FROM resultat_mot;"

# Fetching all words
cur.execute(selection_mots)
result = cur.fetchall()
mots = []
for mot in result:
    mots.append(mot[0])

# Fetching the candidates' names
cur.execute(selection_candidats)
result = cur.fetchall()
nom_candidats = []
for nom in result:
    nom_candidats.append(nom[0])

# Fetching the analysis results
cur.execute(selection_resultats_analyse)
result = cur.fetchall()
candidats = {}
for res in result:
    nom = res[0]
    mot = res[1]
    nb_utilisation = res[2]
    candidats.setdefault(nom, {})[mot] = nb_utilisation

# Calculating the distance between each pair of candidates
for nom1 in nom_candidats:
    for nom2 in nom_candidats:
        if nom1 != nom2:
            dist = distance(candidats[nom1], candidats[nom2], mots)


The function computing the distance, called "distance", is as follows:

import math

def distance(mots_candidat_1, mots_candidat_2, mots):
    scal = 0
    cnorm1 = 0
    cnorm2 = 0
    for mot in mots:
        # .get(mot, 0) covers words that a candidate never used
        nb1 = mots_candidat_1.get(mot, 0)
        nb2 = mots_candidat_2.get(mot, 0)
        scal += nb1 * nb2
        cnorm1 += pow(nb1, 2)
        cnorm2 += pow(nb2, 2)

    cnorm1 = math.sqrt(cnorm1)
    cnorm2 = math.sqrt(cnorm2)
    dist = scal / (cnorm1 * cnorm2)
    return dist
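For a worked example: the value computed here is in fact the cosine similarity between the two word-count vectors, so 1 means identical usage profiles and 0 means no words in common. Below is a self-contained run of the same formula on toy counts (the candidate dictionaries are invented for illustration, and `.get(mot, 0)` guards against words a candidate never used).

```python
import math

def distance(mots_candidat_1, mots_candidat_2, mots):
    # Cosine similarity between the two usage-count vectors
    scal = cnorm1 = cnorm2 = 0
    for mot in mots:
        nb1 = mots_candidat_1.get(mot, 0)
        nb2 = mots_candidat_2.get(mot, 0)
        scal += nb1 * nb2
        cnorm1 += nb1 * nb1
        cnorm2 += nb2 * nb2
    return scal / (math.sqrt(cnorm1) * math.sqrt(cnorm2))

mots = ["france", "travail", "europe"]
c1 = {"france": 2, "travail": 1}
c2 = {"france": 2, "travail": 1}
c3 = {"europe": 3}

print(distance(c1, c2, mots))  # 1.0 -- identical vocabulary use
print(distance(c1, c3, mots))  # 0.0 -- no words in common
```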


The calculated distance is finally persisted in the database so that it can be displayed on the interface.

5) Transcription on the web interface

The distance is represented in the form of a radar chart:

Présid-Analyses-Radar

The words most used by a candidate are represented by a horizontal histogram:

Présid-Analyses-Histo

Finally, the words most used by a candidate are represented in the form of a word cloud:

Présid-Analyses-Nuage

Information about the project


  • Title: Presid-Analyses
  • Description: Lexical analysis of the 2017 presidential campaigns.
  • Languages used: Python, PHP, JS, CSS, HTML
  • Year of work: 2017
  • Access to the code: Link