Research Project Management, part 3: Data Collection and Analysis

This is the third article in a series of four articles about postgraduate research project management.

Published on 22 June 2022

Most sentiment analysis research methods tend to lean towards five simple steps: defining a problem to solve, collecting textual data, pre-processing collected data, analysing the data, and visualising and evaluating the results (Mandloi and Patel, 2020; Birjali, Kasri and Beni-Hssane, 2021).

This sequence of tasks can be treated as a procedure to follow when conducting sentiment analysis (Wongkar and Angdresey, 2019). However, the previous literature makes little or no mention of ethical or social issues. At the same time, a significant number of research papers apply web scraping techniques to collect data from places such as review websites, blogs, and social media platforms. The question is whether it is right to assume that the data collected for a research paper has been collected ethically and then kept secure, or shared only for purposes such as reproducing the study. Moreover, is it right to use such papers in a literature review, even if the results provide valuable knowledge? The goal of this article is therefore to identify and explore data collection and analysis methods that align with ethical, professional, and legal considerations, before concluding with final remarks on the social implications of such a research project.

Request Data, Not Scrape It

Social media platforms are increasingly regarded as a “data gold mine.” It is estimated that well over 550 million posts are published daily (Mandloi and Patel, 2020). For this reason, platforms like Twitter have attracted researchers from many different disciplines, not just social computing (Fiesler, Beard and Keegan, 2020). Twitter appears to have realised that this vast amount of data can be made available not only to developers wanting to build apps but also to the research community (Kaburuan et al., 2019). For this purpose, Twitter built an Application Programming Interface (API) that accepts HTTP requests and returns a filtered collection of data from the platform (Birjali, Kasri and Beni-Hssane, 2021). Accessing data through the API is more convenient because the retrieved data is to some extent organised (e.g., in a key-value pair format) and is therefore easier to use and manage (Fiesler, Beard and Keegan, 2020). Using the API is faster, easier and, more importantly, the right way to do it.
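To make the idea of requesting data concrete, here is a minimal sketch of how such a request could be put together in Python. It assumes the Twitter API v2 recent search endpoint and bearer-token authentication as Twitter documented them at the time of writing; the query string, the token, and the sample response body are placeholders made up for illustration, and the request is only built, not sent.

```python
import json
import urllib.parse
import urllib.request

# Twitter API v2 recent-search endpoint (host and access tiers may change over time).
SEARCH_URL = "https://api.twitter.com/2/tweets/search/recent"


def build_search_request(query, bearer_token, max_results=10):
    """Build (but do not send) a GET request for recent tweets matching `query`."""
    params = urllib.parse.urlencode({
        "query": query,
        "max_results": max_results,
        "tweet.fields": "created_at,lang",
    })
    request = urllib.request.Request(f"{SEARCH_URL}?{params}")
    request.add_header("Authorization", f"Bearer {bearer_token}")
    return request


def extract_texts(payload):
    """Return the tweet texts from a decoded JSON response body."""
    return [tweet["text"] for tweet in payload.get("data", [])]


# Hypothetical query and placeholder token, for illustration only.
request = build_search_request("ukraine aid lang:en -is:retweet", "YOUR_BEARER_TOKEN")

# Made-up response body in the key-value shape the API returns.
sample_body = '{"data": [{"id": "1", "text": "Example tweet"}], "meta": {"result_count": 1}}'
texts = extract_texts(json.loads(sample_body))
```

With a real token, the request could be sent with `urllib.request.urlopen(request)` and the body decoded in the same way. Because the response is already structured as key-value pairs, pulling out the text of each post takes a single line.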

However, before collecting user-generated content from Twitter, it is important to consider what data is relevant to the context of the study and what constitutes ‘public’ and ‘private’ information (Taylor and Pagliari, 2018), as well as whether the collected data contains personally identifiable information (Fisher, 2018). Generally, a principle of good scientific research is to make ethical choices when collecting data and to respect the legal constraints that protect the owner of such data (Birjali, Kasri and Beni-Hssane, 2021). According to Fisher (2018), it is important to have strong data governance, which in turn can lead to highly professional and ethical research. Several documents can serve as guidelines when creating a data governance framework. Firstly, by using the Twitter API, the researcher agrees to adhere to the Terms of Service, despite arguments that those terms are often too ambiguous (Fiesler, Beard and Keegan, 2020). Secondly, I believe that other documents, such as the UK’s Data Protection Act 2018 or the EU’s General Data Protection Regulation, could be a good place to look for clarification of those ambiguities. If no clarification can be found, then your reading of those documents, combined with your own or collective judgment, should be used to decide whether the data collection methods or the data itself contribute to ethical research.
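One practical way to act on these considerations is to minimise and pseudonymise records before they are stored. The sketch below is only an illustration of that idea, not a complete governance framework: the field names (`author_id`, `username`, `location`, `text`), the sample record, and the secret key are all assumptions made for the example.

```python
import hashlib
import hmac
import re

URL = re.compile(r"https?://\S+")
MENTION = re.compile(r"@\w+")


def pseudonymise_id(user_id, secret_key):
    """Replace a user ID with a keyed hash so records stay linkable without storing the raw ID."""
    digest = hmac.new(secret_key, user_id.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


def scrub_text(text):
    """Mask links and @mentions, which can identify people other than the author."""
    text = URL.sub("<url>", text)
    return MENTION.sub("@user", text)


def minimise(record, secret_key):
    """Keep only the fields the analysis needs; everything else is dropped."""
    return {
        "author": pseudonymise_id(record["author_id"], secret_key),
        "text": scrub_text(record["text"]),
    }


# Made-up record with hypothetical field names.
raw = {
    "author_id": "12345",
    "username": "jane_doe",
    "location": "Newcastle",
    "text": "Thanks @john_smith, see https://example.com/post",
}
clean = minimise(raw, secret_key=b"replace-with-a-private-key")
```

Note that pseudonymised data can still count as personal data under the GDPR, and the text of a public post may be searchable, so a step like this reduces risk rather than removing it.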

Data Analysis Methods

The purpose of this research is to understand public opinion about sending military, medical, and financial aid to Ukraine. The main analysis method for a project of this type will therefore be quantitative analysis, and a numerical count of positive, neutral, and negative sentiments will be the key measure. However, according to Greenfield and Greener (2016), other basic measurements can be applied in quantitative analysis, such as averages (mean, median, mode) and measures of dispersion such as the standard deviation. Moreover, the distribution of a sample could be useful for identifying outliers, and categorical data can be organised into nominal and ordinal variables (Greenfield and Greener, 2016). Since the project aims to explore supervised machine learning approaches to classify data, the research will involve both linear and probabilistic approaches, such as Support Vector Machine (SVM), Naive Bayes, and Maximum Entropy. According to Birjali, Kasri and Beni-Hssane (2021), SVM is a non-probabilistic classifier that is effective at separating data both linearly and non-linearly. Naive Bayes, on the other hand, uses a Bag of Words to extract features from the text and is the simplest classifier. Lastly, the Maximum Entropy classifier uses Part-of-Speech tags and is known to have high accuracy (Birjali, Kasri and Beni-Hssane, 2021).
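To give a feel for how the simplest of these classifiers works, the sketch below implements a small Naive Bayes classifier over a Bag of Words using only the Python standard library. It is an illustration under stated assumptions rather than the project's implementation: it uses the multinomial variant with add-one (Laplace) smoothing, a very basic tokeniser, and six made-up training sentences.

```python
import math
import re
from collections import Counter, defaultdict


def tokenise(text):
    """Lowercase the text and split it into word tokens (word order is ignored)."""
    return re.findall(r"[a-z']+", text.lower())


class NaiveBayes:
    """Multinomial Naive Bayes over a bag of words, with add-one smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenise(text))
        self.vocabulary = {w for c in self.word_counts.values() for w in c}
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        scores = {}
        for label, doc_count in self.class_counts.items():
            counts = self.word_counts[label]
            denominator = sum(counts.values()) + len(self.vocabulary)
            # Log prior plus the log likelihood of each known word.
            score = math.log(doc_count / total_docs)
            for word in tokenise(text):
                if word in self.vocabulary:
                    score += math.log((counts[word] + 1) / denominator)
            scores[label] = score
        return max(scores, key=scores.get)


# Made-up training examples, for illustration only.
train = [
    ("we must send more aid to ukraine", "positive"),
    ("proud that we support ukraine with medical aid", "positive"),
    ("sending weapons is a terrible mistake", "negative"),
    ("stop wasting money on military aid", "negative"),
    ("parliament will debate the aid package today", "neutral"),
    ("the aid package vote is scheduled for monday", "neutral"),
]
model = NaiveBayes().fit([t for t, _ in train], [label for _, label in train])
prediction = model.predict("proud to support ukraine")
```

A real project would train on far more data and would probably rely on an established library, but the core idea is the same: each class is scored by how often its training texts contained the words in the new post.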

From an ethical perspective, when analysing the data, it is important to keep in mind that the results will often inform and support decision-makers. If the results are misleading, for example because of poor data quality (Fisher, 2018) or an inappropriate choice of analysis methods, the choices made by decision-makers might have the opposite effect to the one intended. Admitting that the data is inaccurate or that there is a mistake in the analysis is therefore better than hiding it. To reduce the chance of this happening, it is always advisable to test and learn in order to avoid potential errors (Fisher, 2018).
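As one possible reading of “test and learn”, the sketch below shows two simple checks that could be run before any results are reported: a count of empty and duplicate posts in the collected data, and the accuracy of a classifier against hand-labelled examples. The function names and the sample values are made up for illustration.

```python
from collections import Counter


def quality_report(texts):
    """Count common data-quality problems in a list of collected texts."""
    normalised = [t.strip().lower() for t in texts]
    counts = Counter(normalised)
    return {
        "total": len(texts),
        "empty": sum(1 for t in normalised if not t),
        "duplicates": sum(n - 1 for t, n in counts.items() if t and n > 1),
    }


def accuracy(predicted, actual):
    """Share of predictions that match the hand-labelled answers."""
    if not actual or len(predicted) != len(actual):
        raise ValueError("predicted and actual must be non-empty and the same length")
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)


# Made-up sample data.
report = quality_report(["Send aid now", "send aid now ", "", "No more spending"])
score = accuracy(
    ["positive", "negative", "neutral", "positive"],
    ["positive", "negative", "positive", "positive"],
)
```

Publishing figures like these alongside the results is one way of being open about how reliable the data and the analysis are.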

Conclusion

In summary, this article provided a high-level overview of data collection techniques and analysis methods. The Twitter API was identified as the most appropriate way to collect data for this type of project. The article then introduced three supervised machine learning classifiers that will be applied in sentiment analysis, followed by ethical and professional considerations when conducting data analysis.

References

Birjali, M., Kasri, M. and Beni-Hssane, A. (2021). A comprehensive survey on sentiment analysis: Approaches, challenges and trends. Knowledge-Based Systems, [online] 226, p.107134. doi:10.1016/j.knosys.2021.107134.

Fiesler, C., Beard, N. and Keegan, B.C. (2020). No Robots, Spiders, or Scrapers: Legal and Ethical Regulation of Data Collection Methods in Social Media Terms of Service. Proceedings of the International AAAI Conference on Web and Social Media, [online] 14, pp.187–196. Available at: https://ojs.aaai.org/index.php/ICWSM/article/view/7290 [Accessed 7 Jun. 2022].

Fisher, L. (2018). Blast Analytics. [online] Blast Analytics. Available at: https://www.blastanalytics.com/blog/code-of-ethics-for-data-analysts-8-guidelines [Accessed 24 Jul. 2022].

Greenfield, T. and Greener, S. (eds.) (2016). Research Methods for Postgraduates. New York: John Wiley & Sons, Incorporated.

Kaburuan, E.R., Lin Lindawati, A.S., Surjandy, Siswantini, Putra, M.R. and Utama, D.N. (2019). A Model Configuration of Social Media Text Mining for Projecting the Online-Commerce Transaction (Case: Twitter Tweets Scraping). 2019 7th International Conference on Cyber and IT Service Management (CITSM). [online] doi:10.1109/citsm47753.2019.8965417.

Mandloi, L. and Patel, R. (2020). Twitter Sentiments Analysis Using Machine Learninig Methods. 2020 International Conference for Emerging Technology (INCET). [online] doi:10.1109/incet49848.2020.9154183.

Taylor, J. and Pagliari, C. (2017). Mining social media data: How are research sponsors and researchers addressing the ethical challenges? Research Ethics, 14(2), pp.1–39. doi:10.1177/1747016117738559.

Wongkar, M. and Angdresey, A. (2019). Sentiment Analysis Using Naive Bayes Algorithm Of The Data Crawler: Twitter. 2019 Fourth International Conference on Informatics and Computing (ICIC). [online] doi:10.1109/icic47613.2019.8985884.
