Sentiment Analysis in the Portuguese Language

By Análise de Sentimentos (Sentiment Analysis)

Computer-aided sentiment detection has attracted considerable attention in recent years from both universities and companies. One reason for this interest is the growth of user-generated content on the Internet, especially content in which people express opinions. Supervised machine learning is among the most widely used techniques for sentiment detection. It relies on classifiers that learn patterns from previously labeled data and then predict the sentiment of new input data. Training these classifiers requires a large amount of data, so the availability of datasets is essential for research and application development in this field. However, datasets with examples in Portuguese are still scarce, which limits applications in this language.
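
To make the idea of learning from labeled data concrete, the sketch below trains a tiny multinomial Naive Bayes classifier with add-one smoothing on a handful of labeled sentences and predicts the sentiment of new text. It is only an illustration: the example sentences are invented for this sketch and do not come from the dataset, the names `tokenize` and `NaiveBayes` are hypothetical, and Naive Bayes is one possible classifier, not necessarily the one used in this project.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    # Lowercase and keep runs of letters, including accented Portuguese letters.
    return re.findall(r"[^\W\d_]+", text.lower())


class NaiveBayes:
    """Minimal multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.label_counts = Counter(labels)       # documents per class
        self.word_counts = defaultdict(Counter)   # word frequencies per class
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        total_docs = sum(self.label_counts.values())
        tokens = [w for w in tokenize(text) if w in self.vocab]  # skip unseen words
        best_label, best_score = None, -math.inf
        for label, n_docs in self.label_counts.items():
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(n_docs / total_docs)
            n_words = sum(self.word_counts[label].values())
            for w in tokens:
                count = self.word_counts[label][w]
                score += math.log((count + 1) / (n_words + len(self.vocab)))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Invented toy examples (NOT taken from the dataset), two per class.
train = [
    ("adorei o filme, muito bom", "positive"),
    ("que dia maravilhoso", "positive"),
    ("odiei o atendimento, péssimo", "negative"),
    ("que serviço horrível", "negative"),
    ("a reunião começa às dez", "neutral"),
    ("o ônibus passa na praça", "neutral"),
]
texts, labels = zip(*train)
clf = NaiveBayes().fit(texts, labels)

print(clf.predict("filme maravilhoso"))
print(clf.predict("serviço péssimo"))
```

With only six training sentences the classifier can do little more than match words it has already seen, which is exactly why a larger labeled dataset matters.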

To address this need, this work collects and labels tweets (messages shared on Twitter) to create a dataset for sentiment analysis in Portuguese. To this end, we implemented a message scraper using Twitter's API. We then developed a web application in which volunteers label the collected tweets according to their sentiment (positive, negative, or neutral). In total, 2,787 tweets were labeled: 888 positive, 881 negative, and 1,018 neutral. The dataset is available at: https://github.com/arialab/tash-pt.
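
The label counts reported above can be checked with a few lines of arithmetic. The sketch below computes the share of each class and the accuracy of a majority-class baseline (always predicting the most frequent label). The baseline is added here only as a common reference point for a three-class dataset; it is not a result reported by the project.

```python
# Label counts as reported for the dataset.
counts = {"positive": 888, "negative": 881, "neutral": 1018}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

# A classifier that always predicts the most frequent label
# is right exactly as often as that label occurs.
majority_label = max(counts, key=counts.get)
majority_baseline = counts[majority_label] / total

print(f"total labeled tweets: {total}")
for label, share in shares.items():
    print(f"{label:>8}: {counts[label]:5d} tweets ({share:.1%})")
print(f"majority baseline ({majority_label}): {majority_baseline:.1%}")
```

The three classes are close to balanced, with neutral tweets making up roughly 36.5% of the data, so a trained classifier should clearly exceed that figure to be considered useful.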

A paper related to this project (written in Portuguese), titled "Um Conjunto de Dados Extraído do Twitter para Análise de Sentimentos na Língua Portuguesa" (A Dataset Extracted from Twitter for Sentiment Analysis in the Portuguese Language), can be found at: http://comissoes.sbc.org.br/ce-pln/stil2019/proceedings-stil-2019-Final-Publicacao.pdf