How many languages are on Twitter?

According to our most recent experiment, the answer is 67.

These are the top languages:

Top Languages on Twitter

This experiment was based on a random sample of 22,632,977 Tweets collected using the Twitter Streaming API between 23 April and 7 May 2015.

To determine the language, we used the language attribute (“lang”), which is delivered as a data field of each Tweet.
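As a rough illustration of this counting step, here is a minimal Python sketch. It is not the code used in the experiment: it assumes the Tweets are available as one JSON object per line (as the Streaming API delivers them), and the function name `count_languages` is made up for this example.

```python
import json
from collections import Counter


def count_languages(lines):
    """Count Tweets per value of the "lang" field.

    `lines` is any iterable of strings, each holding one JSON object
    (assumption: one Tweet or stream message per line).
    """
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines, e.g. stream keep-alives
        message = json.loads(line)
        lang = message.get("lang")
        if lang is None:
            continue  # skip messages without a "lang" field, e.g. delete notices
        counts[lang] += 1
    return counts


if __name__ == "__main__":
    # Toy input, invented for illustration only.
    sample = [
        '{"id": 1, "text": "hello", "lang": "en"}',
        '{"id": 2, "text": "hallo", "lang": "de"}',
        '{"id": 3, "text": "hi again", "lang": "en"}',
        '{"delete": {"status": {"id": 1}}}',
    ]
    counts = count_languages(sample)
    print(len(counts), "languages;", counts.most_common())
```

The number of languages is then simply the number of distinct keys in the counter, and `most_common()` gives the ranking of the top languages.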

This experiment is still ongoing; the official results will be published in a research paper.

Sensing Labour Strikes in Indonesian Factories on Twitter

As part of my PhD project, I am currently working on a study to analyze how people tweet about labour strikes in factories. I decided to focus my study on Indonesian factories, as Indonesia combines a growing number of factories with high social media use.

Many international brands are supplied by factories in Indonesia, yet consumers often do not know much about the conditions under which products are made. In a recent study on factories in Indonesia, we found that many supplier factories are represented on Foursquare. This indicates that the manufacturing industry is already reflected, to some extent, on social media.

As a next step, I want to analyze whether problems with working conditions are apparent in online data. However, it is hard to define what a “problem” actually is. I therefore decided to look specifically at labour strikes, because these are events where workers themselves stand up in protest to raise public awareness of circumstances they find problematic.

This diagram shows the increase in Tweets mentioning the brand or the island during a labour strike in an Adidas factory on the island of Batam.
Example Strike on Twitter
I repeatedly observed peaks in the number of Tweets mentioning a factory name or city at the time of a labour strike.
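One simple way to look for such peaks is to count matching Tweets per day and flag days that stand far above the usual level. The Python sketch below illustrates that idea only; it is not the analysis behind the diagram. The input format (pairs of date and text), the function names and the threshold of three times the median are all assumptions made for this example.

```python
from collections import Counter
from statistics import median


def daily_counts(tweets, keywords):
    """Count Tweets per day whose text mentions any of the keywords.

    `tweets` is an iterable of (day, text) pairs; matching is a
    case-insensitive substring test.
    """
    lowered_keywords = [k.lower() for k in keywords]
    counts = Counter()
    for day, text in tweets:
        lowered = text.lower()
        if any(k in lowered for k in lowered_keywords):
            counts[day] += 1
    return counts


def peak_days(counts, factor=3.0):
    """Return the days whose count exceeds `factor` times the median count.

    Days without any matching Tweet are not part of `counts`, so the
    median is taken over days with at least one match (a simplification).
    """
    if not counts:
        return []
    baseline = median(counts.values())
    return sorted(day for day, n in counts.items() if n > factor * baseline)
```

Calling `peak_days(daily_counts(tweets, ["adidas", "batam"]))` would then list the days on which mentions of the brand or the island rose well above the median. The factor is arbitrary and would need tuning against known strike dates.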

Some of the questions I want to address next are:

  • Who are the people tweeting about strikes? (media, workers, workers’ unions?)
  • What are the topics discussed before | during | after a strike?
  • Are there certain phases that can be observed across several strike events?
  • Which language(s) are used? (dialects, formal, informal)

In mid-November I will have the chance to travel to Indonesia for the International Conference on Data and Software Engineering 2014 in Bandung, and I will stay in the country for about four weeks. I am looking forward to hearing the opinions of Indonesian researchers.

Are Twitter predictions a result of researchers’ expectations?

In recent years, several researchers have shown that Twitter data can be used to predict real-world events such as earthquakes [1], the development of stock-market indicators [2], the outcome of political elections [3], the spread of diseases [4] or movie box-office sales [5]. These studies provide promising results suggesting that Twitter data can be used successfully for predictions. Recently, however, several researchers have questioned both the predictive power of Twitter and the research methods applied [6, 7].

It seems there are several challenges which make it hard to verify whether and how well proposed methods actually work:

  • It is expensive to obtain historic Twitter data, so experiments cannot be repeated under the same conditions.
  • A multitude of decisions have to be made during data collection (Which API is used? Which keywords or filtering criteria are used? Which time period is captured?). Often these decisions are not sufficiently documented, which makes it hard to repeat experiments and to apply the method in different settings.
  • Many of the proposed methods require a predefined list of keywords to filter tweets (e.g. “flu”, “cough”, “H1N1” … if you want to track a disease). However, it is not clear how to compile these lists, so the methods rely on the researcher’s ability to define them, and they are difficult to apply in a different context, e.g. countries with a different language.
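To make the last point concrete, here is a small Python sketch of keyword filtering. It is an illustration and not taken from any of the surveyed studies; the function names, the whole-word matching rule and the toy texts are assumptions made for this example.

```python
import re


def matches(text, keywords):
    """Return True if any keyword occurs in `text` as a whole word.

    Matching is case-insensitive. An empty keyword list matches nothing.
    """
    if not keywords:
        return False
    pattern = r"\b(?:" + "|".join(re.escape(k) for k in keywords) + r")\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


def filter_tweets(texts, keywords):
    """Keep only the texts that mention at least one keyword."""
    return [t for t in texts if matches(t, keywords)]


if __name__ == "__main__":
    # Toy texts, invented for illustration only.
    texts = [
        "I have the flu",
        "Bad cough today",
        "Influenza season is here",
        "Nice weather",
    ]
    print(filter_tweets(texts, ["flu"]))
    print(filter_tweets(texts, ["flu", "cough"]))
```

The two calls return different subsets of the same four texts, and neither list catches “Influenza”. The size of the filtered sample, and everything computed from it, depends directly on a list the researcher has to choose.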

Given the multitude of decisions and prior knowledge required to conduct these experiments, combined with the difficulty other researchers have in repeating them, Twitter prediction research seems to be at risk of being influenced by the observer-expectancy effect, which means that the researcher subconsciously influences the research result.

Or, as David Hand put it:

“It is quite possible that the most interesting patterns we discover during a data mining exercise will have resulted from measurement inaccuracies, distorted samples or some other unsuspected difference between the reality of the data and our perception of it.” [8]

My colleague Amal Almansour from King’s College London and I were particularly interested in the decisions made during Twitter prediction research. We have just finished a literature survey in which we critically analyzed 24 existing Twitter prediction studies. In this study, we identified the different actors involved in the typical Twitter research process and their potential impact on the prediction method and, in turn, on the prediction result.

This study is currently under peer review; the results will be posted here soon.

Analysing Supplier Locations: A Case Study Based on Indonesian Factories – iKnow 2014

In September, I presented our latest study at the International Conference on Knowledge Technologies and Data-Driven Business (iKnow 2014) in Graz.

In this study, we explored how social and semantic data can be used to monitor risks around supplier factories. We focused our study on Indonesia, as it is both an important outsourcing country for several major brands and a country with high social media usage.

Data sample

We compiled a sample of 139 factories in Indonesia supplying four popular companies in the textile, sports and electronics industries. Each factory is described by its name and its address. All data was retrieved from the respective company website.

Main research questions

  1. Can user-generated data help to determine the physical location (GPS-coordinates) of supplier factories?
  2. How could we link semantic data to attain risk information about supplier factories?

The most interesting facts and results

1. Mapping services could map only a few factory addresses

Using Google Maps, Nokia Here Maps, Bing Maps and OpenStreetMap (Nominatim) to transform the address information into GPS coordinates, we could retrieve accurate GPS coordinates for only a few (20/139) factories. The services differed considerably in the number of addresses they could transform into GPS coordinates and in the precision of the results.
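For readers who want to try this kind of address lookup themselves, here is a minimal Python sketch against the public Nominatim search endpoint. It is not the tooling used in the study. The function names and the User-Agent string are placeholders invented for this example, and only the first hit is considered, which is a simplification.

```python
import json
import urllib.parse
import urllib.request

NOMINATIM_SEARCH = "https://nominatim.openstreetmap.org/search"


def build_search_url(address):
    """Build a Nominatim search URL asking for at most one JSON result."""
    query = urllib.parse.urlencode({"q": address, "format": "json", "limit": 1})
    return f"{NOMINATIM_SEARCH}?{query}"


def parse_first_hit(body):
    """Return (lat, lon) from a Nominatim JSON response, or None if empty."""
    hits = json.loads(body)
    if not hits:
        return None
    return float(hits[0]["lat"]), float(hits[0]["lon"])


def geocode(address, user_agent="factory-geocoding-sketch"):
    """Look up one address; performs a network request.

    The User-Agent value is a placeholder; the public service expects
    clients to identify themselves and to keep request rates low.
    """
    request = urllib.request.Request(
        build_search_url(address), headers={"User-Agent": user_agent}
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return parse_first_hit(response.read().decode("utf-8"))


def resolved_count(results):
    """Count how many lookups returned coordinates."""
    return sum(1 for r in results if r is not None)
```

Running `geocode` over a list of factory addresses and passing the results to `resolved_count` gives a figure comparable to the 20/139 above for a single service. A returned coordinate is not necessarily an accurate one, so the precision of each hit still has to be checked separately.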

Geocoding Results

2. Most of the factories in our sample have a Foursquare profile

For most of the factories (122/139) we could find a profile on the geo-social network “Foursquare”. Foursquare profiles are created by users, who in this case might be workers or people living around the production site.
Typically, users register a location with its name and purpose using mobile devices. In this way, maps are created collectively.

Example Foursquare Profile

3. Most of the factories were tagged on Wikimapia

Most of the factories (94/139) were tagged by users on the crowdsourced map “Wikimapia”. On Wikimapia, users can tag buildings on satellite pictures with their names or purpose, thereby creating maps.

Example Factory Tagged on Wikimapia
