Tweet Awareness

Can we model awareness of specific events in different countries around the globe? Can we predict the level of reaction on social media using cultural metrics, or simply distance metrics between countries?




In the past, people's main sources of information were the local newspaper, radio or television newscasts, or even ordinary conversation. Information could take days to spread across the world, and even then only a few topics were selected and broadcast by journalists. Because of that, people's awareness of world events was very limited, and it was virtually impossible to quantify or qualify the level of reaction to a specific event.

Since then, we have seen the rise of the Internet and social media. The world has become so connected that people can, in principle, be aware of everything happening on the planet at any moment. As there are no longer any limits on the availability of information, one could think that people's awareness would have expanded. But it seems that people still feel touched only by events that are physically close, or by events that affect people who are socially or culturally close to them. The level of awareness or reaction would, in some sense, be linked to the empathy one country has for another.

Can we predict awareness of specific events in different countries around the globe? Can we model this awareness and predict reactions on social media using cultural and proximity metrics between countries?



Estimating reactions and awareness through hashtags

Over the past few years, Twitter has become a platform where people all over the world express themselves. Hashtags are prevalent on Twitter and have become increasingly commonplace on other social networks. When a major event occurs, specific hashtags always pop up in a multitude of languages and with various spellings, showing that people of various origins are reacting to that event.

The idea was to use Twitter to observe global trends for events that, objectively, should generate similar levels of reaction worldwide. But how can we define similar events? How can we choose events so that they are objectively comparable?

Choosing events that are similar in nature is a good start. It is also important that the selected event has enough impact on the population to generate sufficient reactions at home, with the potential to generate reactions worldwide. We then need to ask ourselves: what has generated the biggest buzz in recent times?

More often than not, people react to events that have an impact on people's lives and safety, which is why terrorism-related events are interesting for our study. One would imagine that the more people are injured, the stronger the reaction. Therefore, to ensure that we have enough tweets, we selected events in different countries that had a large number of civilian casualties. For comparison's sake, we also selected a number of events of a different magnitude from a different time period.


To account for the fluctuation of activity on Twitter over the years, the events were selected from the same overall timeframe. The tweets were then acquired using relevant hashtags for an entire week following each event.



Geolocating Tweets And Measuring Reactions

Selecting events and scraping the tweets was a challenge, but one of the biggest challenges was actually determining where the tweets came from. Locations are not always provided, and when they are, they are entered manually. Anybody is free to put anything they want, and that often leads to locations that are completely absurd!

Examples of Absurd Locations Provided


That is why it was important to create a robust mapping which would be able to correctly map the largest number of locations possible. To do so we relied on three levels of mapping.

  1. A first level where we map country names and capitals, in different languages and with multiple spellings, to the ISO2 country code

  2. A second level where we map the names of the top 20,000 cities in the world to the ISO2 country code

  3. A final level where we use the GeoNames database to map all the cities in the world to their respective countries. The importance of this mapping lies in the fact that cities sharing the same name are disambiguated by population: given a city name, the output is the country of the most populous city with that name.
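The three levels above could be sketched roughly as follows. This is only an illustration under stated assumptions: the lookup tables are tiny hypothetical stand-ins for the real ones, the populations are rounded illustrative figures, and the way a free-text location is split into candidate fragments is a guess rather than the procedure actually used.

```python
import re

# Hypothetical miniature lookup tables standing in for the three levels.
# Level 1: country names and capitals (several languages/spellings) -> ISO2.
COUNTRY_NAMES = {
    "france": "FR", "frankreich": "FR", "paris": "FR",
    "switzerland": "CH", "suisse": "CH", "schweiz": "CH", "bern": "CH",
    "united states": "US", "usa": "US",
}
# Level 2: largest cities -> ISO2.
TOP_CITIES = {"lyon": "FR", "geneva": "CH", "new york": "US"}
# Level 3: GeoNames-style rows (name, ISO2, population); figures are illustrative.
GEONAMES_ROWS = [
    ("london", "GB", 8_900_000),
    ("london", "CA", 400_000),
    ("lausanne", "CH", 140_000),
]


def build_city_index(rows):
    """For each city name, keep the country of its most populous namesake."""
    best = {}
    for name, iso2, population in rows:
        if name not in best or population > best[name][1]:
            best[name] = (iso2, population)
    return {name: iso2 for name, (iso2, _) in best.items()}


ALL_CITIES = build_city_index(GEONAMES_ROWS)


def candidates(location):
    """Lowercase a free-text location and split it into candidate fragments."""
    text = location.lower().strip()
    parts = [p.strip() for p in re.split(r"[,/|\-]", text) if p.strip()]
    return [text] + parts


def geolocate(location):
    """Return an ISO2 code by trying the three levels in order, else None."""
    if not location:
        return None
    fragments = candidates(location)
    for table in (COUNTRY_NAMES, TOP_CITIES, ALL_CITIES):
        for fragment in fragments:
            if fragment in table:
                return table[fragment]
    return None
```

With these toy tables, `geolocate("Paris, France")` resolves at the first level, `geolocate("London")` falls through to the third level and picks the more populous London, and an absurd location simply returns `None`.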

So how does this mapping work? Once each city, with its different name variants, was linked to its country, the main steps for identifying a location are illustrated as follows.


In the end, how many tweets were we actually able to geolocate? Considering only those for which we initially had a location, and in spite of all the absurdities found, we were able to gather quite a few tweets per event. But as we will quickly see, reactions are unequal between the different events.


We see that in most cases, 50% of the locations are given by the country name or capital. The only exception is the event in the United States. This most likely stems from the fact that Americans relate more to their state or town than people in other countries do. Another interesting observation is that only 10% of tweets with locations could not be geolocated, mostly due to the absurd locations provided. A few examples of those are shown in the word cloud above.

Visualizing Reactions To Different Events


In the following map we can visualize the number of tweets containing hashtags related to specific events. The events are marked by red circles; clicking on one loads the corresponding map. The map can display either the raw number of tweets or the relative reaction, obtained by normalizing with the estimated baseline Twitter activity. This baseline is the average number of tweets acquired over one week per country.
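As a minimal sketch of this normalization, assuming the relative reaction is simply the event tweet count divided by the weekly baseline count for the same country (the function name and all numbers below are made up for illustration):

```python
def relative_reaction(event_counts, baseline_counts):
    """Divide per-country event tweet counts by the weekly baseline count."""
    relative = {}
    for country, count in event_counts.items():
        baseline = baseline_counts.get(country, 0)
        if baseline > 0:  # countries without a baseline estimate are skipped
            relative[country] = count / baseline
    return relative


# Made-up counts: tweets carrying the event hashtags, and baseline tweets per week.
event_counts = {"FR": 1200, "US": 3000, "CH": 90, "XX": 5}
baseline_counts = {"FR": 40000, "US": 600000, "CH": 3000}

for country, value in sorted(relative_reaction(event_counts, baseline_counts).items()):
    print(country, round(value, 4))
```

In this toy example the United States has the largest raw count but the smallest relative reaction, which is the kind of difference the two map views are meant to expose.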

Map views: Raw Number of Tweets | Normalized By Average Number of Tweets

Strong Imbalance in Reactions


Looking at the reactions themselves, for the events in Nigeria we can see that the reaction increases with the magnitude of the event. The massacre, which led to over 200 deaths, generated almost 10 times more tweets than the shooting which happened a year later. But this depends largely on the event. As we can see, Charlie Hebdo is the perfect example of an extreme reaction to an event with an objectively small number of casualties.

We can visualize this by using the same scale for all events. This highlights the imbalance of reactions between the different events.

Map views: Raw Number of Tweets | Normalized By Average Number of Tweets


Modelling the Reactions


How can we predict reactions? People often react to events that happen close by. Given the visualization of the reactions, it does not seem as though distance metrics on their own would suffice. That is why cultural metrics are also important to take into account. Therefore, can a combination of cultural and proximity metrics be used as predictors? Can we create a model which, given the country where an event occurs, would be able to predict the reactions worldwide with respect to the baseline Twitter activity?

General Metrics

Countries can be characterized by general attributes such as population, size, gross domestic product (GDP), poverty line, number of internet users and so forth. These metrics are interesting because they are directly linked to the number of tweets in a country and can be used as normalizing factors.

Distance Metrics

Distance tends to play an important role in determining people's interest in a specific event. But there are several types of distance which can be considered when speaking of countries. The first and most evident is the straight-line distance between the countries, as the crow flies. This can be computed from the position of the center of each country. But this is not enough. Take for example the US and Canada: the distance between the centers of these two countries is much larger than the distance between France and any of its neighboring countries.
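A center-to-center distance of this kind can be computed with the haversine formula. The sketch below assumes a spherical Earth and uses rough, illustrative center coordinates (the `CENTROIDS` table and function names are not from the project):

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, spherical approximation

# Rough country centers as (latitude, longitude) in degrees, for illustration only.
CENTROIDS = {
    "US": (39.8, -98.6),
    "CA": (56.1, -106.3),
    "FR": (46.2, 2.2),
    "BE": (50.5, 4.5),
}


def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlambda = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlambda / 2) ** 2)
    # min() guards against rounding pushing the value slightly above 1.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def country_distance(a, b):
    """Distance between the (approximate) centers of two countries."""
    return great_circle_km(*CENTROIDS[a], *CENTROIDS[b])


print(round(country_distance("US", "CA")), "km between US and Canada centers")
print(round(country_distance("FR", "BE")), "km between France and Belgium centers")
```

Even with rough coordinates, the US and Canada come out several times farther apart than France and Belgium although both pairs share a border, which is the limitation described above.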

So what else could we consider to obtain a more faithful representation of the distance between countries? We came up with three other distance metrics which, combined, give a more complete picture.

Relative Importance of Neighbors

To take into account the fact that two countries can be neighbors and still have a large distance between them, we decided to create a metric which gives importance to countries which are direct neighbors. We also wanted to give each neighbor the importance it is due. For example, France has multiple countries at its border, but that does not mean that each of these countries is of similar importance. That is why we weighted this metric by the size of each neighbor. A small neighboring country such as Luxembourg would therefore have a smaller weight than Germany, for example.
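One possible reading of this weighting is sketched below: each neighbor's weight is its size divided by the total size of all neighbors of the country considered. Both the use of land area as "size" and the normalization are assumptions, and the border list is deliberately reduced, with areas rounded.

```python
# Reduced, illustrative border list and rounded land areas in km^2.
BORDERS = {
    "FR": ["ES", "IT", "CH", "DE", "LU", "BE"],
    "IS": [],  # island country: no land neighbors
}
AREA_KM2 = {
    "ES": 506_000, "IT": 301_000, "CH": 41_000,
    "DE": 357_000, "LU": 2_600, "BE": 30_500,
}


def neighbor_weights(country, borders, size):
    """Weight each direct neighbor by its share of the neighbors' total size."""
    neighbors = borders.get(country, [])
    total = sum(size[n] for n in neighbors)
    if total == 0:
        return {}
    return {n: size[n] / total for n in neighbors}


weights = neighbor_weights("FR", BORDERS, AREA_KM2)
for neighbor, weight in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(neighbor, round(weight, 3))
```

Under these assumptions, Luxembourg receives a far smaller weight than Germany among France's neighbors, and the weights of one country's neighbors sum to one.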

Hop Distance

This second metric accounts not only for direct neighbors but also for the smallest number of borders which would need to be traversed to connect any two countries in the world.
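The smallest number of borders to cross is a shortest-path length in the graph of land borders, which a breadth-first search computes directly. The sketch below uses a small hand-written border graph; how countries without land borders were handled in the project is not stated, so here they are simply left out of the result.

```python
from collections import deque

# Small illustrative border graph (symmetric adjacency lists).
BORDER_GRAPH = {
    "PT": ["ES"],
    "ES": ["PT", "FR"],
    "FR": ["ES", "DE", "CH"],
    "CH": ["FR", "DE"],
    "DE": ["FR", "CH", "PL"],
    "PL": ["DE"],
    "IS": [],  # no land borders
}


def hop_distances(start, borders):
    """Breadth-first search: minimum number of borders crossed from start."""
    dist = {start: 0}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        for neighbor in borders.get(current, ()):
            if neighbor not in dist:
                dist[neighbor] = dist[current] + 1
                queue.append(neighbor)
    return dist  # unreachable countries are absent from the result


print(hop_distances("PT", BORDER_GRAPH))
```

Direct neighbors get a hop distance of 1, so this metric extends the neighbor relation to the whole border graph.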

Flight Routes

This last metric accounts for the movement of populations between countries. The assumption is that flight routes exist because people have a certain interest in the other country. The more often you visit a place, the more likely you are to be interested in what is going on there.
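A metric of this kind could, for instance, be expressed as the share of a country's outbound routes that lead to each destination country. The sketch below assumes the input is a list of (origin country, destination country) pairs, one per route, and that domestic routes are ignored; both choices are assumptions, and the routes are made up.

```python
from collections import Counter, defaultdict


def outbound_share(routes):
    """For each origin country, the fraction of its outbound routes per destination."""
    counts = defaultdict(Counter)
    for origin, destination in routes:
        if origin != destination:  # assumption: domestic routes are ignored
            counts[origin][destination] += 1
    shares = {}
    for origin, destinations in counts.items():
        total = sum(destinations.values())
        shares[origin] = {d: n / total for d, n in destinations.items()}
    return shares


# Made-up route list: one (origin, destination) pair per route.
routes = [("CH", "FR"), ("CH", "FR"), ("CH", "DE"), ("CH", "CH"), ("FR", "CH")]
print(outbound_share(routes))
```

Note that such a metric is not symmetric: the share of Swiss routes going to France generally differs from the share of French routes going to Switzerland.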


Cultural Metrics

People tend to feel close to people they can relate to and who share similar values. One would therefore assume that people are interested in events that occur in areas where the population is culturally close to them. But how can cultural proximity be quantified?

Languages

Languages are representative of culture and, most of the time, countries which speak the same language share a certain proximity. However, whether or not two countries speak the same language is not sufficient to quantify proximity. Many languages are similar enough that speakers of one can understand or easily learn the other. That is why we used phylogenetic trees to approximate linguistic distances between the different languages and compute a corresponding linguistic distance between the different countries.
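As a rough sketch of a tree-based linguistic distance, the code below counts the number of edges separating two languages in a tiny hand-written family tree, and takes the distance between two countries as the smallest distance over their language pairs. The tree, the edge-count distance (real phylogenetic trees usually carry branch lengths) and the minimum-over-pairs rule are all simplifying assumptions.

```python
# Tiny illustrative language tree, stored as child -> parent.
PARENT = {
    "french": "romance", "italian": "romance", "spanish": "romance",
    "german": "germanic", "english": "germanic",
    "romance": "indo-european", "germanic": "indo-european",
}

# Illustrative main languages per country.
COUNTRY_LANGUAGES = {
    "FR": ["french"],
    "IT": ["italian"],
    "DE": ["german"],
    "CH": ["german", "french", "italian"],
}


def path_to_root(node):
    """List of nodes from a language up to the root of the tree."""
    path = [node]
    while node in PARENT:
        node = PARENT[node]
        path.append(node)
    return path


def language_distance(a, b):
    """Number of tree edges between two languages via their common ancestor."""
    depth_from_a = {node: i for i, node in enumerate(path_to_root(a))}
    for j, node in enumerate(path_to_root(b)):
        if node in depth_from_a:
            return depth_from_a[node] + j
    raise ValueError("languages do not share a common ancestor in this tree")


def country_language_distance(c1, c2):
    """Assumption: the closest pair of languages defines the country distance."""
    return min(language_distance(a, b)
               for a in COUNTRY_LANGUAGES[c1]
               for b in COUNTRY_LANGUAGES[c2])


print(language_distance("french", "italian"), language_distance("french", "german"))
```

In this toy tree, French is closer to Italian than to German, and Switzerland is at distance zero from each of France, Italy and Germany because it shares a language with each.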

Religions

Religion relates to people's mentalities and affects the way people think. Religion can create bonds between different populations and countries, and can therefore have an impact on the awareness of events that happen in countries with similar religious backgrounds.

Visualizing the Metrics

For each of the metrics, we wanted to visualize how countries would react to an event which occurred somewhere around the globe. For example, if an event occurs in Switzerland, the relative importance of neighbors metric shows that Italy would react more than Germany or France would.

Map views: Language Distance | Real Distance | Hop Distance | Religion Distance | Percentage of Outbound Flights | Relative Importance of Neighbors

Creating the Graph With Latent Dirichlet Allocation and Using Diffusion to Predict Reactions

Given all of these metrics, we wanted to create a graph linking all countries. The more attributes countries have in common, the more likely they are to react to what is going on. But how do we quantify how much two countries have in common? Our original idea was to use techniques from recommender systems and topic detection to find, among all our features, the ones which best link the countries together. The results of this can be seen in the maps below.

Here we give the choice of observing the diffusion using the same scale for all countries or a different scale per country. The scale per country is useful to compare the predicted and estimated reaction levels, whereas the unique scale helps observe the differences in diffusion between events happening in different countries.
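To make the idea of diffusion on a country graph concrete, here is a minimal sketch of one common variant, a lazy random walk: at each step every country keeps part of its current "reaction" and passes the rest to its neighbors in proportion to the edge weights. This is not necessarily the diffusion used for the maps; the edge weights, the `retain` parameter and the number of steps are all hypothetical.

```python
def diffuse(weights, source, steps=3, retain=0.5):
    """Spread one unit of reaction from `source` over a weighted graph."""
    nodes = sorted(weights)
    state = {n: 0.0 for n in nodes}
    state[source] = 1.0
    for _ in range(steps):
        # Each node keeps a fraction `retain` of what it currently holds.
        nxt = {n: retain * state[n] for n in nodes}
        for n in nodes:
            total = sum(weights[n].values())
            if total == 0:
                nxt[n] += (1 - retain) * state[n]  # isolated node keeps everything
                continue
            # The rest is shared among neighbors in proportion to edge weight.
            for m, w in weights[n].items():
                nxt[m] += (1 - retain) * state[n] * w / total
        state = nxt
    return state


# Hypothetical similarity weights between three countries.
WEIGHTS = {
    "TR": {"GR": 3.0, "AR": 0.1},
    "GR": {"TR": 3.0, "AR": 0.1},
    "AR": {"TR": 0.1, "GR": 0.1},
}

print(diffuse(WEIGHTS, "TR"))
```

With these made-up weights, an event in Turkey diffuses mostly to Greece and hardly at all to Argentina, and the total amount of reaction stays equal to one at every step.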



Map views: Unique Scale | Adaptive Scale


Looking at the graph, we can see that there are many inconsistencies at the global scale. For example, when looking at what happens in Turkey with the adaptive scale, it is highly unlikely that there would be a reaction of similar magnitude in Argentina and nowhere else in the world. More concretely, this does not match the reaction levels of the tweets we have for the event in Istanbul.



Critical Assessment

The goal of this project was to determine whether a set of simple metrics could be used to create a model which would adequately represent the links between countries and predict the reactions to specific events on social media. As we have shown, the problem is far more complex than one would have hoped.

Twitter is inherently biased because it is a social media platform. People are not necessarily registered on the network, and if they are, they might not be very active. In general, social media is relatively new and tends to attract younger generations, which are generally less aware of what is going on, not only around the world but even in their immediate environment.

When people tweet, they do not necessarily provide their location, and this may also be cultural. Depending on the country, people are more or less aware of the dangers of providing their location on social media, and therefore more or less inclined to provide personal information in general. This was partially corrected by applying a normalizing factor to the tweets, but it remains an inherent bias in the original dataset.

The absence of reaction to an event on social media does not necessarily imply that people are not interested in what is going on. We assume that the proportion of people who react to an event is always the same with respect to the overall number of tweeters but there is no way to be sure.

Not having standardized locations required creating a mapping which is itself imperfect and implies the loss of a certain amount of data.

Our metrics are also imperfect and, on their own, can hardly model a complex system which is strongly linked to social, behavioural, cultural and historical aspects.

Furthermore, creating a model requires much more data than what we had at hand, both in terms of number of tweets and in terms of events analyzed.

For all of these reasons, it was evident that creating a metric would be challenging. In general, it would seem that we need much more complex parameters than the ones we currently have to describe our system and correctly predict what is going on, especially since awareness does not necessarily imply reaction. Considering the problem on a smaller scale, for example at the level of a country such as the United States or of a continent such as Europe, might have led to more pertinent results, as the complex inter-cultural factors would have been more similar between the countries considered.